4.5 Databases and SQL
4.5.1 History
Databases have been used for many years to organize, store, retrieve, and update data efficiently and systematically. In the past, statisticians and analysts usually dealt with small datasets stored in text or spreadsheet files and often did not interact with database systems, and students from traditional statistics programs often lack database knowledge. However, as data grow bigger, database knowledge becomes essential for statisticians, analysts, and data scientists working in an enterprise environment where data are stored in some form of database system. A database typically contains a collection of tables and the relationships among these tables (i.e. the schema). The table is the fundamental structure of a database; it contains rows and columns, similar to a data frame in R or Python. Database management systems (DBMS) ensure data integrity and security in real-time operations. There are many different DBMS, such as Oracle, SQL Server, MySQL, Teradata, Hive, Redshift and Hana. The majority of database operations are very similar across different DBMS, and Structured Query Language (SQL) is the standard language used to interact with these systems.
SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. The most recent version was published in December 2016. For typical users, the fundamental knowledge is nearly the same across all database systems. In addition to the standard features, each DBMS provider includes its own specific functions and features, so the same query may have slightly different implementations (i.e. SQL scripts) in different systems. In this section, we use Databricks' SQL implementation (i.e. all the SQL scripts can run in a Databricks SQL notebook).
More recently, data are often stored in distributed systems such as Hive (disk storage) or Hana (in-memory storage). Most relational databases are row-based (i.e. data for each row are stored closely together), whereas analytics workflows often favor column-based systems (i.e. data for each column are stored closely together). Fortunately, as database users, we only need to learn how to write SQL scripts to retrieve and manipulate data. Even though there are different implementations of SQL across DBMS, SQL is nearly universal across relational databases, including Hive and Spark, which means that once we know SQL, our knowledge transfers among different database systems. SQL is easy to learn even if you have no previous experience. In this section, we go over the key concepts of databases and SQL.
4.5.2 Database, Table and View
A database is a collection of tables that are related to each other. A database has its own name and each table has its name as well. We can think of a database as a "folder" and the tables within the database as "files" within that folder. A table has rows and columns exactly like an R or Python pandas data frame. Each row (also called a record) represents a unique instance of the subject, and each column (also called a field or attribute) represents a characteristic of the subject of the table. For each table, there is a special column called the primary key which uniquely identifies each of its records.
Tables within a specific database contain related information, and the schema of a database describes all fields in every table as well as how these tables and fields relate to each other (i.e. the structure of the database). Tables can be filtered, joined and aggregated to return specific information. A view is a virtual table composed of fields from one or more base tables. The view does not store data; it only stores the table structure, and it is therefore also referred to as a saved query. Views are typically used to protect the data stored in the underlying tables: users can query information from a view but cannot change or update its contents.
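As an illustration, a view is typically defined by saving a query. The sketch below uses placeholder names (`view_name`, `base_table`, `col_name1`, `col_name2`) in the same spirit as the query templates later in this section, not the example tables:

-- illustrative sketch: save a query as a view
CREATE VIEW view_name AS
SELECT col_name1, col_name2
FROM base_table
WHERE col_name1 IS NOT NULL;

Querying `view_name` then looks exactly like querying a regular table, but no data is duplicated and the underlying `base_table` cannot be modified through the view.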
We will use two simple tables to illustrate basic SQL operations. These two tables are from an R dataset which contains the 50 states' population and income (https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/state.html). The first table is called `divisions` and has two columns, `state` and `division`; the first few rows are shown in the following table:
state | division |
---|---|
Alabama | East South Central |
Alaska | Pacific |
Arizona | Mountain |
Arkansas | West South Central |
California | Pacific |
The second table is called `metrics` and contains three columns: `state`, `population` and `income`. The first few rows of the table are shown below:
state | population | income |
---|---|---|
Alabama | 3615 | 3624 |
Alaska | 365 | 6315 |
Arizona | 2212 | 4530 |
Arkansas | 2110 | 3378 |
California | 21198 | 5114 |
To illustrate missing information, three more rows are added at the end of the original `divisions` table, with states Alberta, Ontario, and Quebec and their corresponding division set to NULL. We first create these two tables and save them as csv files, and then upload the two files as Databricks tables.
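If you prefer to create the two tables directly with SQL instead of uploading csv files, a minimal sketch is shown below (only a few rows are inserted for brevity; the remaining states and the other two NULL-division rows would be added the same way):

-- illustrative sketch: create the two example tables and insert a few rows
create table divisions (state string, division string);
create table metrics (state string, population int, income int);

insert into divisions values
  ('Alabama', 'East South Central'),
  ('Alaska', 'Pacific'),
  ('Alberta', NULL);

insert into metrics values
  ('Alabama', 3615, 3624),
  ('Alaska', 365, 6315);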
4.5.3 Basic SQL Statement
After logging into Databricks and creating the two tables, we can create a notebook and choose SQL as the default language of the notebook. There are a few very simple SQL statements that help us understand the database and table structure (a short example follows the list):
- `show databases`: show current databases in the system
- `create database db_name`: create a new database with name `db_name`
- `drop database db_name`: delete database `db_name` (be careful when using it!)
- `use db_name`: set `db_name` as the current database
- `show tables`: show all the tables within the currently used database
- `describe tbl_name`: show the structure of the table with name `tbl_name` (i.e. the list of column names and data types)
- `drop table tbl_name`: delete the table with name `tbl_name` (be careful when using it!)
- `select * from metrics limit 10`: show the first 10 rows of a table
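For example, after the two tables are uploaded, the following commands (one per notebook cell) let us check what is available and peek at the data; the exact output depends on your workspace:

show tables
describe divisions
select * from metrics limit 10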
If you are familiar with a procedural programming language such as C or FORTRAN, or a scripting language such as R or Python, you may find SQL code a little strange at first. We should read SQL code chunk by chunk, where each chunk defines a specific task. SQL code describes a specific task, and the DBMS runs it and returns the result. SQL does not follow the typical rules of procedural programming; we can think of SQL as "descriptive" (i.e. we describe what we want using SQL and the DBMS figures out how to do it).
4.5.3.1 SELECT Statement
`SELECT` is the most used statement in SQL, especially for database users and business analysts. It is used to extract specific information (i.e. a column or columns) `FROM` one or multiple tables, and it can also be used to combine multiple tables. `WHERE` can be used in the `SELECT` statement to select rows that meet specific conditions (i.e. filters). `ORDER BY` can be used in the `SELECT` statement to order the results in descending or ascending order of one or multiple columns. We can use `*` after `SELECT` to represent all columns in the table, or specifically write the column names separated by commas. Below is the basic structure of a `SELECT` statement:
SELECT Col_Name1, Col_Name2
FROM Table_Name
WHERE Specific_Condition
ORDER BY Col_Name1, Col_Name2;
Here `Specific_Condition` is a typical logical condition, and only rows for which this condition is `TRUE` will be returned. For example, if we want to choose states and their total income where the population is larger than 10000 and individual income is less than 5000, with the result ordered by state name, we can use the following query:
select state, income*population as total_income
from metrics
where population > 10000 and income < 5000
order by state
The `SELECT` statement is used to slice and dice the dataset as well as to create new columns of interest (such as `total_income`) using basic computation functions.
4.5.3.2 Aggregation Functions and GROUP BY
We can also use aggregation functions in the `SELECT` statement to summarize the data. For example, the `count(col_name)` function returns the total number of non-`NULL` rows for a specific column. Other aggregation functions on numerical values include `min(col_name)`, `max(col_name)` and `avg(col_name)`. Let's use the `metrics` table again to illustrate aggregation functions. An aggregation function takes all the rows that match the `WHERE` condition (if any) and returns one number. The following statement calculates the sum, maximum, minimum, average and count of population for all states whose names start with the letters A through E.
select sum(population) as sum_pop, max(population) as max_pop,
  min(population) as min_pop, avg(population) as avg_pop,
  count(population) as count_pop
from metrics
where substring(state, 1, 1) in ('A', 'B', 'C', 'D', 'E')
The above query returns only one row, as expected. Sometimes we want to find aggregated values for groups defined by one or more columns. Instead of writing a separate SQL statement for each group, we can use GROUP BY in the `SELECT` statement to calculate the aggregated value for each group. For example, if we want to find how many states there are in each division, we can use the following:
select division, count(state) as number_of_states
from divisions
group by division
Another useful operation is to return the distinct values of one column or a combination of multiple columns. Simply use `SELECT DISTINCT col_name1, col_name2` in the first line of the `SELECT` statement.
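For instance, the following query returns each distinct division name that appears in the `divisions` table:

select distinct division
from divisions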
4.5.3.3 Join Multiple Tables
The database system is usually designed so that each table contains a specific piece of information, and oftentimes we need to JOIN multiple tables to achieve a specific task. There are a few typical types of JOINs: inner join (keep only the rows that match the join condition in both tables), left outer join (rows from the inner join + unmatched rows from the first table), right outer join (rows from the inner join + unmatched rows from the second table) and full outer join (rows from the inner join + unmatched rows from both tables). The typical structure of a `JOIN` statement is illustrated below, followed by an inner join example; a left outer join example appears at the end of this subsection:
SELECT a.col_name1 as var1, b.col_name2 as var2
FROM tbl_one as a
LEFT JOIN tbl_two as b
ON a.col_to_match = b.col_to_match
For example, let us join the `divisions` table and the `metrics` table to find the average population and income for each division, with the results ordered by division name:
select a.division, avg(b.population) as avg_pop,
avg(b.income) as avg_inc
from divisions as a
inner join metrics as b
on a.state = b.state
group by division
order by division
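To see the effect of unmatched rows, recall the three extra rows (Alberta, Ontario and Quebec) added to the `divisions` table earlier; they have no match in the `metrics` table. A left outer join keeps these rows and fills their `population` and `income` with NULL, as the following sketch shows:

select a.state, a.division, b.population, b.income
from divisions as a
left join metrics as b
on a.state = b.state
order by a.state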
4.5.4 Advanced Topics in Database
There are many advanced topics, such as how to query data efficiently using indexes, how to maintain data integrity when multiple users are using the same table, the algorithms behind data storage (i.e. column-wise or row-wise storage), and how to design a database schema. Users can learn these advanced topics gradually. We hope the basic knowledge covered in this section provides the initial momentum to learn SQL. As you can see, it is easy to write SQL statements to retrieve, join, slice, dice and aggregate data. The SQL notebook containing all the above operations is included on the book's website.