SQL for Data Science

4.1 Subqueries

Define subqueries
Discuss advantages and disadvantages of using subqueries
Explain how subqueries help us merge data from two or more tables
Write more efficient subqueries

Subqueries are queries embedded into other queries. Since data is stored in multiple tables, you may need to get information from multiple sources. Data scientists often use subqueries to:

Select specific records or columns and then use the criteria as filtering criteria for the next thing they want to select.
Add additional criteria like filtering criteria that are not in your current table from another table into your query.

Example: Need to know the region each customer is from who has had an order with freight over 100. The freight information is in table Orders and the customer information is in table Customers.

SELECT customerID, CompanyName, Region
FROM Customers
WHERE customerID in (SELECT customerID FROM Orders WHERE Freight > 100);

DBMS is performing two operations:

Get the customer ID for the orders selected
Pass that to the WHERE clause and processing the overall SELECT statement

Note that it always perform the innermost SELECT portion first.

Best Practices and Considerations
- There is no limit to the number of subqueries you can have
- Performance slows when you nest too deeply.
- Subquery selects can only retrieve a single column

Also, when you use subqueries, the code gets harder to understand. One of the best practices for writing subqueries is to be clear and consistent with indenting to make it easier to read. This website will pre-format SQL code for you: www.poorsql.com

Subqueries can also be used as calculations.

Example: Get the total number of orders by each customer

SELECT customer_name,
    customer_state (SELECT COUNT(*) AS orders FROM Orders WHERE 
        Orders.customer_id = Customer.customer_id) AS orders
FROM customers
ORDER BY Customer_name

As you can see, we’re selecting customer name, the region and then just as if we were going to select a different column name we have a whole subquery in there. We have (SELECT COUNT(*) AS orders FROM Orders, and count these based on the customer IDs. It’s aggregating the orders based on the customer IDs ( Orders.customer_id = Customer.customer_id). In the end, we order the table by customer name.