Getting to know SQL for beginners
Have you wondered how to go about studying SQL? I am pretty sure you may have heard about this term before but wondered how is it applicable to our daily work.
First of all, we have to understand why do we need SQL for our work. SQL is known as Structured Query Language, domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS) — taken from wikipedia.
As a starter, most of us arelooking at data at a very small scale. Normally, the files are in the form of Microsoft Excel. You may observe that it is in rows and columns format as shown below. You will normally have data attributes(headers) that tells you what kind of data you can expect in the rows.
The data view is understandable and easy to read. You can easily perform your analysis with Excel features like sum(), count() or pivot tables. In fact, the in-built pivot table feature is powerful enough to solve most of your business problems.
Data too large to handle
Even though Excel is very versatile and user friendly for many people, it is not able to handle over a million rows. If you have a data set that contains a few million rows, you will likely see the rows to be cut off at the one million mark. Also, you will have a hard looking for your data and experience performance lag(you may never get to open your excel file at times).
Because of these issues, we turn to use data warehouses which are able to store big amount of data. In order to perform data analysis on these data, we would have to use SQL to extract the data.
Ability to join different tables
Just imagine that an excel file has many different tabs and you want to combine the data into a single sheet. You may have to do quite a lot of data massaging by performing multiple vlookups ,indexes and match. There is also a possibility for human error as we have to manually drag the formula across the rows or columns. There is a also risk that you may overwrite your data.
In a data warehouse, there are many different tables that contains different types of information. By using the join conditions with SQL, you are able to combine these huge datasets together with minimal error. As you are only performing the extraction process, you will not be over writing any data if you were to change the logic of your extraction.
The main parts of SQL
I will not be discussing in details about the SQL coding but in general there are 4 main areas in SQL commands. I have listed as the 4 main parts of the commands as below.
Data Manipulation Language
Essentially, this is the most important part of SQL you will need to know as a data analyst. It contains the language that you need to query from the database. You will tend to perform your data mining process over here by joining and aggregating your results to a meaningful form. Usually the SQL query will be take a form as below.
SELECT * FROM CUSTOMER_TABLE
This statement just means to extract all data from the customer table in the database. Beware that it can affect your computer’s performance as you are extracting all the data from the table. In that case, you may want to limit your rows to probably 100 for your viewing purpose.
Data Definition Language
Before any data can be ingested, the structure of the data tables has to be created. This is very important because you need to know what kind of data fits into which columns. You need to define the data type like whether it is string, date or integer. The data will then be able to flow into these tables. Otherwise, the data will be rejected and you will have an empty table. An example of a data definition language looks like the code as below.
CREATE TABLE visits (
visit_id INT PRIMARY KEY ,
first_name VARCHAR (50) NOT NULL,
last_name VARCHAR (50) NOT NULL,
visited_at DATETIME,
phone VARCHAR(20)
);
Data Control Language
This language involves granting permission to certain users. As data can be sensitive and confidential, only some people with appropriate authority are allowed to view the data. The person involved with giving access is normally the Database Administrator in which he hold the rights to give users the permission to view the data.
In terms of language wise, it is not very difficult. It is like having an on or off switch to give the permission to the individual users.
Transaction Control Language
Creating data pipelines for an organisation is not an easy task. I have participated in some of the processes myself. What happens if the transactions were wrong or the logic was incorrect? Like how you do a back up for your computer, this language helps to provide additional security to the transaction environment.
When creating a flow or procedure, there may be a situation that you may over-subtract a balance amount, leading to a negative value. Based on your logic, this is not supposed to happen. Therefore, you can implement a checkpoint and rollback if such a scenario arises. This way, it will revert to its original form to the save point you had. However, as a data analyst, you need not to be too concern about this topic as it tends to be handled by the data engineering team.
How to get started?
Personally, I would recommend giving a try to practise SQL queries on the web browsers. One simple website is in this link as below. It has a simple interface which you can begin some hands-on for it. Sometimes, I used this website for my classes or workshops because it is less likely for students to have technical issues with it.
Once you are more comfortable with SQL queries, you can move on to installing RDBMS software onto your PC or laptop. This way, you are able to import better datasets and replicate how the actual world industry uses their SQL.
I hope this has been helpful as a start. There are many free online resources you can access to learn the technical aspects of SQL. Always keep learning!
For more questions about my journey, you can reach me at hello@goingmeraki.com or visit my another blog at https://goingmeraki.com