SQL for Data Analysis

Tirendaz AI
11 min readMar 27, 2022

SQL is a language that communicates with databases. Despite the emergence of new technologies, knowing SQL is still important for data analysis. You need to know SQL queries to pull data with cloud-based solutions like Amazon Redshift and Google BigQuery.

Photo by Mimi Thian on Unsplash

A database is a place to store data in an organized way. SQL is a language that communicates with databases. SQL stands for Structured Query Language. SQL is one of the oldest programming languages ​​and is still widely used today.

Let’s see what I’m going to talk about in this blog post:

  • What are SQL and database.?
  • What are structured and unstructured data?
  • Benefits of SQL
  • SQL vs Python and R
  • What is a data model?
  • SQL commands in action

Before getting started, we produce content about data science, artificial intelligence, and programming. Don’t forget to subscribe to our youtube channel.

Let’s dive into what SQL is.

What is SQL?

There is some debate as to whether SQL is or isn’t a programming language. I can say that SQL is not a general programming language like C or Python. For example, you cannot make a website with SQL. But SQL is a powerful language for working with data in databases. SQL is used to access and manipulate data in databases.

SQL databases are also called relational databases. Because these databases are made up of relationships. Databases consist of one or more schemas.

A database that consists of the schemas

A schema contains objects such as tables, views, and functions. Tables consist of columns containing data. Before adding any data to the table, you should decide on the structure of the table beforehand.

A table that includes information of students

As you see in the above table, the name of the first column is “ID” and this column is an integer type. The name of the second column is “Name” and this column is a string type. The name of the third column is “Age” and this column is an integer type

So far I briefly defined the SQL and the database. Now let’s look at data structures.

Structured Vs Unstructured Data

Data is generally divided into two as structured and unstructured. Structured data can be stored in tables. Most databases are designed for structured data. In these databases, each attribute is in columns and examples are in rows.

Unstructured data, on the other hand, unlike structured data, data models or data types are not predetermined. For example, images, videos, audio files are examples of unstructured data. Unstructured data can be stored in NoSQL databases. NoSQL means not only SQL. It became very popular after 2010 because you can store both structured and unstructured data using NoSQL.

An example for NoSQL databases

In NoSQL databases, data is stored with key-value pairs and the biggest feature of these databases is that they are dynamic. The most popular NoSQL database is MongoDB. In NoSQL, a table is called “collection”, a column is called “field”, and a row is called “document”. In this section, I talked about NoSQL databases. Let’s take a look at the benefits of SQL.

Benefits of SQL

There are many benefits to using SQL. SQL is flexible and you can easily connect to SQL database with this language. Most of the data in the world are tabular data. SQL is a standard language for working with data in databases. You can connect to databases and analyze your data with SQL using languages ​​such as Python or R.

SQL is easy to learn because its syntax is simple. You can quickly learn keywords and work with databases. I briefly mentioned the benefits of SQL. Let’s compare SQL with R and Python.

SQL vs R & Python

Although SQL is a popular language for data analysis, it is not the only option. You can use R and Python for data analysis. R is a statistical and graphical language, while Python is a general-purpose language used specifically for data analysis. Let’s compare these two languages ​​with SQL.

Working the databases server

First of all, SQL runs on a database server and takes full advantage of computational resources. R and Python usually run on local machines. Of course, there are exceptions here. For example, you can also work on a database laptop or you can run R and Python on the server. Using SQL to update reports periodically is a good choice, as new data is often added to databases.

Storing the data

The second difference is about storing data. Relational databases always store data in tables. So SQL assumes this structure for every query. R and Python use structures such as variables, lists, or dictionaries to store data.

Loop

Another difference between SQL and most other computer programming languages is the loop. As you know, loops continue until a certain condition is met. For R and Python, you can use loops with keywords like for and while. For example, SQL groupings loop over the dataset without writing any additional code.

You may ask that is there no downside to SQL? Let’s take a look at some disadvantages of SQL.

Drawbacks of SQL

One disadvantage of SQL is that the data must be in a database. However, you can import locally stored files with R or Python and then you can work with these files. You can also connect to databases with R and Python. For example, you can connect to the database and run SQL queries using dplyr for R and SQLAlchemy for Python.

Another disadvantage of SQL is that it is insufficient for advanced statistical analysis. For example, you can do complex analysis and machine learning projects with R and Python. But in SQL, you can do simple statistical analysis.

So far I compared SQL with R and Python languages. I’ve seen the advantages and disadvantages of each language. Now let’s look at database management systems.

Database Management Systems (DBMS)

Database management systems (DBMS) are used to work with a database. A database management system is a software used specifically for relational databases.

You can use database management software such as MySQL, PostgreSQL, and SQLite to manage data. The syntaxes of these softwares are based on SQL but have some differences. Let’s look at the commands to print the first 10 rows of data in 5 popular database management systems.

The commands are the same for MySQL, PostgreSQL, and SQLite.

SELECT * FROM age LIMIT 10;

For Microsoft SQL Server it is as follows:

SELECT TOP 10 * FROM age;

For Oracle Database it is as follows:

SELECT * FROM age WHERE ROWNUM <= 10;

As you can see, there are minor differences in database management systems software.

You can install and use software such as MySQL, PostgreSQL on your computer to write SQL codes. I’ll use SQLite in this blog post.

SQLite is free and doesn’t take up much space on your computer. SQLite is king for simple projects. Let’s open a command prompt window to use SQLite. To start sqlite3, you type sqlite3 and press enter.

sqlite3

But with this command, everything runs in memory. This means that changes are not saved when you close SQLite.

Connecting to SQLite

If you notice, here is a red warning in the form of a “transient in-memory database” as a temporary memory database. If you want to save changes, you must connect to a database. Let’s create a database. To do this, let’s exit this temporary database. Let’s use the .quit command to get out of here.

.quit

Now let’s write sqlite3 first and then the name of the database.

sqlite3 my_demo.db

So we created a new database named my_demo.DB. Let’s take a closer look at what an SQL query is.

In the SQL world, there is a general abbreviation called CRUD. This abbreviation stands for “Create, Read, Update, and Delete”. These 4 operators are used very often when working with a database.

Let’s create a demo table now. Let’s determine the names and data types of the columns that will take place in this table.

CREATE TABLE demo (id int, num int);

Let’s add two values ​​inside this table.

INSERT INTO demo VALUES (1, 100), (2, 200);

Let’s select the first row last.

SELECT * FROM demo LIMIT 1;# Output:
1|100

It is very easy to create a table in the database and add data to it. Let’s see the database we’re working with.

.database# Output:
my_demo.db

Now let’s see the table we are working with.

.table# Output:
demo

So what is a SQL query? You use SQL queries to see and manipulate data. Data analysts and data scientists use these queries frequently. For example, let’s write a SQL query to see all the data in a table.

SELECT * FROM demo;#Output:
1|100
2|200

You can see all the values ​​in the table. When working on the command line, don’t forget to put a semicolon at the end of the commands to indicate that the command is finished. Now let’s look at a data model.

A Data Model

A data model is a visual that shows the summary of all related tables in the database.

For example, let’s look at the data model that shows students’ grades.

A data model

In this data model, there are two tables named student and grades. As you know, tables are made up of columns. These columns are sometimes called attributes or fields. Each name like student_id inside these rectangles is the column name.

The student table has an abbreviation as PK. This PK refers to the primary key. The primary key identifies each row of data in a table as unique. For example, in the Student table, the primary key is the student_id column. This means “student_id is different for each row of data”.

There is also a foreign key. A foreign key in one table points to a primary key in another table. For example, student_id in the Grades table is a foreign key. This means that the values ​​in this column are matched to the student_id in the Student table.

By using these keys, we can associate the tables with each other. A student has multiple grades for each course, right? For example, a student can take multiple courses such as machine learning, deep learning, and SQL. If a value corresponds to more than one value, this relationship is called a one-to-many relationship.

In this section, I briefly explained what a data model is. Now let’s move on and do the practice.

Practice

First, let’s create a database.

The database that I am going to work

There are two tables here, EMP and DEPT. The EMP table consists of 14 lines with only numeric, text, and date values, and the DEPT table consists of 4 lines with numerical and text values.

Now let’s show the basic commands. First I select the column names and then I write the table name.

SELECT empno, ename
FROM emp;
# Output:
7369|SMITH
7499|ALLEN
7521|WARD
7566|JONES
7654|MARTIN
7698|BLAKE
7782|CLARK
7788|SCOTT
7839|KING
7844|TURNER
7876|ADAMS
7900|JAMES
7902|FORD
7934|MILLER

You may also want to select all columns. The asterisk symbol is used to select all columns.

SELECT *
FROM emp;
#Output:
7369|SMITH|CLERK|7902|1980-12-17|800||20
7499|ALLEN|SALESMAN|7698|1981-02-20|1600|300|30
7521|WARD|SALESMAN|7698|1981-02-22|1250|500|30
7566|JONES|MANAGER|7839|1981-04-02|2975||20
7654|MARTIN|SALESMAN|7698|1981-09-28|1250|1400|30
7698|BLAKE|MANAGER|7839|1981-05-01|2850||30
7782|CLARK|MANAGER|7839|1981-06-09|2450||10
7788|SCOTT|ANALYST|7566|1982-12-09|3000||20
7839|KING|PRESIDENT||1981-11-17|5000||10
7844|TURNER|SALESMAN|7698|1981-09-08|1500|0|30
7876|ADAMS|CLERK|7788|1983-01-12|1100||20
7900|JAMES|CLERK|7698|1981-12-03|950||30
7902|FORD|ANALYST|7566|1981-12-03|3000||20
7934|MILLER|CLERK|7782|1982-01-23|1300||10

WHERE is used to select specific rows. For example, let’s see all employees with department number 20.

SELECT *
FROM emp
WHERE deptno = 20;
#Output:
7369|SMITH|CLERK|7902|1980-12-17|800||20
7566|JONES|MANAGER|7839|1981-04-02|2975||20
7788|SCOTT|ANALYST|7566|1982-12-09|3000||20
7876|ADAMS|CLERK|7788|1983-01-12|1100||20
7902|FORD|ANALYST|7566|1981-12-03|3000||20

You can change the names of the columns if you want. For example, let’s write salary instead of sal. and let’s see the employees whose salary is less than 5000.

SELECT sal as salary
FROM emp
WHERE salary < 5000;
#Output:
800
1600
1250
2975
1250
2850
2450
3000
1500
1100
950
3000
1300

GROUP BY is used to group rows. Now let’s group by department numbers and see the average salaries of the departments.

SELECT deptno, AVG(sal)
FROM emp
GROUP BY deptno;
#Output:
10|2916.66666666667
20|2175.0
30|1566.66666666667

If you want, you can round the number with the ROUND command. For example, let’s round-up according to the 1st digit after the comma.

SELECT deptno, ROUND(AVG(sal))
FROM emp
GROUP BY deptno;
#Output:
10|2917.0
20|2175.0
30|1567.0

The HAVING clause allows the results to be filtered after the GROUP BY has been applied. To show this, let’s first see the job positions. I am going to use COUNT to see the number of groups.

SELECT job, COUNT(*)
FROM emp
GROUP BY job;
#Output:
ANALYST|2
CLERK|4
MANAGER|3
PRESIDENT|1
SALESMAN|4

Now let’s choose only 4 employees. To do this, I’m going to use HAVING clause.

SELECT job, COUNT(*)
FROM emp
GROUP BY job
HAVING COUNT(*) = 4;
#Output:
CLERK|4
SALESMAN|4

ORDER BY is used to sort the results. To show this clause, let’s select names and jobs and sort them by name.

SELECT ename, job
FROM emp
ORDER BY ename;
#Output:
ADAMS|CLERK
ALLEN|SALESMAN
BLAKE|MANAGER
CLARK|MANAGER
FORD|ANALYST
JAMES|CLERK
JONES|MANAGER
KING|PRESIDENT
MARTIN|SALESMAN
MILLER|CLERK
SCOTT|ANALYST
SMITH|CLERK
TURNER|SALESMAN
WARD|SALESMAN

If we want to sort in reverse order, you can use DESC keyword which is the abbreviation of DESCENDING.

SELECT ename, job
FROM emp
ORDER BY ename DESC;
#Output:
WARD|SALESMAN
TURNER|SALESMAN
SMITH|CLERK
SCOTT|ANALYST
MILLER|CLERK
MARTIN|SALESMAN
KING|PRESIDENT
JONES|MANAGER
JAMES|CLERK
FORD|ANALYST
CLARK|MANAGER
BLAKE|MANAGER
ALLEN|SALESMAN
ADAMS|CLERK

To see a table quickly, you may want to see a few rows instead of printing the whole table. To do this, you can use LIMIT clause. For example, let’s see the first five rows of the emp table.

SELECT *
FROM emp
LIMIT 5;
#Output:
7369|SMITH|CLERK|7902|1980-12-17|800||20
7499|ALLEN|SALESMAN|7698|1981-02-20|1600|300|30
7521|WARD|SALESMAN|7698|1981-02-22|1250|500|30
7566|JONES|MANAGER|7839|1981-04-02|2975||20
7654|MARTIN|SALESMAN|7698|1981-09-28|1250|1400|30

In this section, I have explained the basic SQL queries.

Conclusion

SQL is a language that communicates with databases. Data is usually stored in databases. Knowing SQL is an important skill for data analysts. In this blog post, I talked about SQL for data analysis. In summary,

  • First of all, I explained SQL and databases.
  • I covered structured and unstructured data.
  • I talked about the benefits of SQL and compared SQL with R and Python.
  • I discussed what a data model is.
  • Lastly, I showed SQL queries in a database.

That’s it. I hope you enjoy it.

Don’t forget to follow us on YouTube | GitHub | Twitter | Kaggle | LinkedIn 👍

Resources

  • Zhao, A. (2021). SQL Pocket Guide: A Guide to SQL Usage (4th ed.). O’Reilly Media.
  • Tanimura, C. (2021b). SQL for Data Analysis: Advanced Techniques for Transforming Data into Insights (1st ed.). O’Reilly Media.
  • Molinaro, A., & Graaf, R. D. (2020). SQL Cookbook: Query Solutions and Techniques for All SQL Users (2nd ed.). O’Reilly Media.

If this post was helpful, please click the clap 👏 button below a few times to show me your support 👇

--

--