4.5 Databases and SQL

4.5.1 History

Databases have been around for many years to efficiently organize, store, retrieve, and update data systematically. In the past, statisticians and analysts usually dealt with small datasets stored in text or spreadsheet files and often did not interact with database systems. Students from the traditional statistics department usually lack the necessary database knowledge. However, as data grow bigger, database knowledge becomes essential and required for statisticians, analysts and data scientists in an enterprise environment where data are stored in some form of database systems. Databases often contain a collection of tables and the relationship among these tables (i.e. schema). The table is the fundamental structure for databases that contain rows and columns similar to data frames in R or Python. Database management systems (DBMS) ensure data integration and security in real time operations. There are many different DBMS such as Oracle, SQL Server, MySQL, Teradata, Hive, Redshift and Hana. The majority of database operations are very similar among different DBMS, and Structured Query Language (SQL) is the standard language to use these systems.

SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. The most recent version is published in December 2016. For typical users, fundamental knowledge is nearly the same across all database systems. In addition to the standard features, each DBMS providers include their own specific functions and features. So, for the same query, there may be slightly different implementations (i.e. SQL script) for different systems. In this section, we use the Databricks’ SQL implementation (i.e. all the SQL scripts can run in Databricks SQL notebook).

More recent data is stored in a distributed system such as Hive for disk storage or Hana for in-memory storage. Most relational databases are row-based (i.e. data for each row are stored closely), whereas analytics workflows often favor column-based systems (i.e. data for each column are stored closely). Fortunately, as a database user, we only need to learn how to write SQL scripts to retrieve and manipulate data. Even though there are different implantations of SQL across different DBMS, SQL is nearly universal across relational databases including Hive and Spark, which means once we know SQL, our knowledge can be transferred among different database systems. SQL is easy to learn even if you do not have previous experience. In this session, we will go over the key concepts in the database and SQL.

4.5.2 Database, Table and View

A database is a collection of tables that are related to each other. A database has its own database name and each table has its name as well. We can think a database is a “folder” where tables within a database are “files” within the folder. A table has rows and columns exactly as an R or Python pandas data frame. Each row (also called record) represents a unique instance of the subject and each column (also called field or attribute) represents a characteristic of the subject on the table. For each table, there is a special column called the primary key which uniquely identifies each of its records.

Tables within a specific database contain related information and the schema of a database illustrates all fields in every table as well as how these tables and fields relate to each other (i.e. the structure of a database). Tables can be filtered, joined and aggregated to return specific information. The view is a virtual table composed of fields from one or more base tables. The view does not store data and only store table structure. The view is also referred to as a saved query. The view is typically used to protect the data stored in the table and users can only query information from a view and cannot change or update its contents.

We will use two simple tables to illustrate basic SQL operations. These two tables are from an R dataset which contains the 50 states’ population and income (https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/state.html). The first table is called divisions which has two columns: state and division and the first few rows are shown in the following table:

state division
Alabama East South Central
Alaska Pacific
Arizona Mountain
Arkansas West South Central
California Pacific

The second table is called metrics which contains three columns: state, population and income and first few rows of the table are shown below:

state population income
Alabama 3615 3624
Alaska 365 6315
Arizona 2212 4530
Arkansas 2110 3378
California 21198 5114

To illustrate missing information, three more rows are added at the end of the original division table with state Alberta, Ontario, and Quebec with their corresponding division NULL. We first creat these two tables and save them as csv files, and then we upload these two files as Databricks tables.

4.5.3 Basic SQL Statement

After logging into Databricks and creating two tables, we can now create a notebook and choose the default language of the notebook to be SQL. There are a few very easy SQL statements to help us understand the database and table structure:

  • show database: show current databases in the system
  • create database db_name: create a new database with name db_name
  • drop database db_name: delete database db_name (be careful when using it!)
  • use db_name: set up the current database to be used
  • show tables: show all the tables within the currently used database
  • describe tbl_name: show the structure of table with name tbl_name (i.e. list of column name and data type)
  • drop tbl_name: delete a table with name tbl_name (be careful when using it!)
  • select * from metrics limit 10: show the first 10 rows of a table

If you are familiar with a procedural programming language such as C and FORTRAN or scripting languages such as R and Python, you may find SQL code a little bit strange. We should view SQL code by each specific chunk where it defines a specific task. SQL codes descript a specific task and DBMS will run and finish the task. SQL does not follow typical procedure program rules and we can think SQL is “descriptive” (i.e. we describe what we want using SQL and DBMS figures out how to do it).

4.5.3.1 SELECT Statement

SELECT is the most used statement in SQL, especially for database users and business analysts. It is used to extract specific information (i.e. column or columns) FROM one or multiple tables. It can be used to combine multiple tables. WHERE can be used in the SELECT statement to selected rows with specific conditions (i.e. filters). ORDER BY can be used in the SELECT statement to order the results in descending or ascending order of one or multiple columns. We can use * after SELECT to represent all columns in the table, or specifically write the column names separated by a comma. Below is the basic structure of a SELECT statement:

SELECT Col_Name1, Col_Name2
FROM Table_Name
WHERE Specific_Condition
ORDER BY Col_Name1, Col_Name2;

Here Specific_Condition is the typical logical conditions and only columns with TRUE for this condition will be chosen. For example, if we want to choose states and its total income where the population larger than 10000 and individual income less than 5000 with the result order by state name, we can use the following query:

select state, income*population as total_income
from metrics
where population > 10000 and income < 5000
order by state

The SELECT statement is used to slicing and dicing the dataset as well as create new columns of interest (such as total_income) using basic computation functions.

4.5.3.2 Aggregation Functions and GROUP BY

We can also use aggregation functions in the SELECT statement to summarize the data. For example, count(col_name) function will return the total number of not NULL rows for a specific column. Other aggregation function on numerical values include min(col_name), max(col_name), avg(col_name). Let’s use the metrics table again to illustrate aggregation functions. For aggregation function, it takes all the rows that match WHERE condition (if any) and return one number. The following statement will calculate the maximum, minimum, and average population for all states starts with letter A to E.

select sum(population) as sum_pop, max(population) as
max_pop, min(population) as min_pop, avg(population)
as avg_pop, count(population) as count_pop
from metrics
where substring(state, 1, 1) in ('A', 'B', 'C', 'D', 'E')

The results from the above query only return one row as expected. Sometimes we want to find the aggregated value based on groups that can be defined by one or more columns. Instead of writing multiple SQL to calculate the aggregated value for each group, we can easily use the GROUP BY to calculate the aggregated value for each group in the SELECT statement. For example, if we want to find how many states in each division, we can use the following:

select division, count(state) as number_of_states
from divisions
group by division

Another special aggregation function is to return distinct values for one column or a combination of multiple columns. Simple use SELECT DISTINCT col_name1, col_name2 in the first line of the SELECT statement.

4.5.3.3 Join Multiple Tables

The database system is usually designed such that each table contains a piece of specific information and oftentimes we need to JOIN multiple tables to achieve a specific task. There are few types typically JOINs: inner join (keep only rows that match the join condition from both tables), left outer join (rows from inner join + unmatched rows from the first table), right outer join (rows from inner join + unmatched rows from the second table) and full outer join (rows from inner join + unmatched rows from both tables). The typical JOIN statement is illustrated below:

SELECT a.col_name1 as var1, b.col_name2 as var2
FROM tbl_one as a
LEFT JOIN tabl_two as b
ON a.col_to_match = b.col_to_match

For example, let us join the division table and metrics table to find what is the average population and income for each division, and the results order by division names:

select a.division, avg(b.population) as avg_pop,
avg(b.income) as avg_inc
from divisions as a
inner join metrics as b
on a.state = b.state
group by division
order by division

4.5.3.4 Add More Content into a Table

We can use the INSERT statement to add additional rows into a particular table, for example, we can add one more row to the metrics table by using the following query:

insert into metrics
values ('Alberta', 4146, 7370)

4.5.4 Advanced Topics in Database

There are many advanced topics such as how to efficiently query data using index; how to take care of data integrity when multiple users are using the same table; algorithm behind data storage (i.e. column-wise or row-wise data storage); how to design the database schema. Users can learn these advanced topics gradually. We hope the basic knowledge covered in this section will kick off the initial momentum to learn SQL. As you can see, it is easy to write SQL statement to retrieve, join, slice, dice and aggregate data. The SQL notebook that contains all the above operations is included in the book’s website.