Here are some SQL puzzles to practice your skills within the sql-cli-playground environment. For each puzzle, first run the provided Python code snippets in the CLI to set up the necessary tables. Then, write and execute the SQL query to solve the puzzle.
---
## Puzzle 1: Basic SELECT (Easy)
**Task:** Select the name and species of all pets from the `pets` table.
**Setup Code (Run in CLI):**
```python
# Run these lines one by one in the CLI
/create pets
pets['id'] = [1, 2, 3, 4, 5]
pets['name'] = ['Buddy', 'Lucy', 'Max', 'Daisy', 'Charlie']
pets['species'] = ['Dog', 'Cat', 'Dog', 'Dog', 'Cat']
pets['age'] = [3, 5, 2, 7, 1]
```
**Expected Output:** A table showing only the `name` and `species` columns for all 5 pets.
---
## Puzzle 2: Filtering with WHERE (Easy)
**Task:** Find the names of all dogs older than 3 years from the `pets` table.
**Setup Code:** (Use the `pets` table created in Puzzle 1)
**Expected Output:** A table containing the single name 'Daisy'.
---
## Puzzle 3: Aggregation with GROUP BY (Easy)
**Task:** Calculate the average age for each pet species in the `pets` table.
**Setup Code:** (Use the `pets` table created in Puzzle 1)
**Expected Output:** A table showing each species and its average age (Dog: 4.0, Cat: 3.0).
---
## Puzzle 4: Joining Two Tables (Easy)
**Task:** List the names of pets alongside the names of their owners.
**Setup Code (Run in CLI):**
```python
# Assumes the 'pets' table exists from Puzzle 1
/create owners
owners['owner_id'] = [101, 102, 103, 104, 105]
owners['name'] = ['Alice', 'Bob', 'Carol', 'David', 'Eve']
owners['pet_id'] = [1, 2, 3, 4, 6]  # Pet 5 has no owner; pet 6 does not exist in pets
```
**Expected Output:** A table pairing pet names with owner names for pets 1 through 4: Buddy/Alice, Lucy/Bob, Max/Carol, Daisy/David.
---
## Puzzle 5: Highest Salary per Department (Medium)
**Task:** Find the name and salary of the employee with the highest salary in each department.
**Setup Code (Run in CLI):**
```python
/create employees
employees['emp_id'] = [1, 2, 3, 4, 5, 6, 7]
employees['name'] = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace']
employees['department'] = ['HR', 'Engineering', 'Engineering', 'HR', 'Sales', 'Sales', 'Engineering']
employees['salary'] = [60000, 80000, 85000, 65000, 70000, 75000, 90000]
```
**Hint:** You might need GROUP BY and an aggregate function like MAX(), possibly in a subquery, or a window function such as ROW_NUMBER() if your SQLite version supports it.
**Expected Output:** The highest earner in each department:
- HR: David, 65000
- Engineering: Grace, 90000
- Sales: Frank, 75000
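For illustration, a minimal sketch of the window-function route the hint mentions (SQLite 3.25+); a GROUP BY/MAX() subquery works just as well:
```sql
-- Sketch: rank employees within each department by salary,
-- then keep only the top-ranked row per department.
WITH ranked AS (
    SELECT name, department, salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT department, name, salary
FROM ranked
WHERE rn = 1;
```
Note that if two employees tied for a department's top salary, ROW_NUMBER() would keep only one of them; RANK() would keep both.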
---
## Puzzle 6: Salaries vs. Managers (Medium)
**Task:** Find the names of employees who earn more than their managers. Direct managers are listed in a separate table.
**Setup Code (Run in CLI):**
```python
# Assumes the 'employees' table exists from Puzzle 5
/create managers
managers['emp_id'] = [1, 2, 3, 5, 6, 7]       # Alice, Bob, Charlie, Eve, Frank, Grace
managers['manager_id'] = [4, 7, 7, -1, 5, 2]  # David, Grace, Grace, none (top), Eve, Bob
# Map: Alice->David, Bob->Grace, Charlie->Grace, Eve->(none), Frank->Eve, Grace->Bob
```
**Hint:** Join the employees table with itself (a self-join), or use a subquery, to compare each employee's salary with their manager's salary obtained via the managers table. Handle employees who have no manager (manager_id = -1).
**Expected Output:** A table of employees who earn more than their manager. Based on the data:
- Grace (90000) earns more than her manager Bob (80000).
- Frank (75000) earns more than his manager Eve (70000).
(Bob, by contrast, earns 80000 while his manager Grace earns 90000, so he does not qualify.)
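One possible shape for the self-join, as a sketch rather than the only answer:
```sql
-- Resolve each employee's manager through the managers table, then compare
-- salaries. manager_id = -1 matches no employee, so the inner join drops
-- top-level employees automatically.
SELECT e.name, e.salary, b.name AS manager_name, b.salary AS manager_salary
FROM employees e
JOIN managers m ON m.emp_id = e.emp_id
JOIN employees b ON b.emp_id = m.manager_id
WHERE e.salary > b.salary;
```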
---
## Puzzle 7: Employees and Their Managers (Easy)
**Task:** List all employees and their manager's name. Include employees who have no manager listed in the managers table.
**Setup Code:** (Use the employees and managers tables from Puzzles 5 and 6)
**Hint:** How do you keep rows from one table even when there is no matching row in the table you join it with?
**Expected Output:** A table showing each employee's name and their manager's name. Employees without managers still appear, with NULL for the manager's name: David has no row in managers at all, and Eve's manager_id of -1 matches no employee.
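A minimal sketch of the idea the hint points at:
```sql
-- LEFT JOINs keep every employee even when no manager row matches,
-- leaving NULL in the manager column.
SELECT e.name AS employee, b.name AS manager
FROM employees e
LEFT JOIN managers m ON m.emp_id = e.emp_id
LEFT JOIN employees b ON b.emp_id = m.manager_id;
```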
---
## Puzzle 8: Conditional Counting (Easy)
**Task:** Count the number of employees in the 'Engineering' department versus the total number of employees in all other departments combined.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** Think about how to combine COUNT or SUM with a condition (a CASE expression) inside the aggregation, as sketched below.
**Expected Output:** A table with two counts, one for Engineering and one for all other departments (Engineering: 3, Other: 4).
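A sketch of conditional aggregation, one common way to do this:
```sql
-- CASE turns each row into a 0 or 1, and SUM adds them up per bucket.
SELECT SUM(CASE WHEN department = 'Engineering' THEN 1 ELSE 0 END) AS engineering,
       SUM(CASE WHEN department <> 'Engineering' THEN 1 ELSE 0 END) AS other_departments
FROM employees;
```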
---
## Puzzle 9: Filtering by Date (Easy)
**Task:** Find employees hired in the year 2023.
**Setup Code (Run in CLI):**
```python
# Assumes the 'employees' table exists from Puzzle 5
employees['hire_date'] = ['2022-05-10', '2023-01-15', '2023-08-20', '2022-11-01', '2024-02-28', '2023-06-10', '2022-03-15']
```
**Hint:** Most SQL dialects provide functions to extract parts of a date, such as the year.
**Expected Output:** A table listing the names of employees hired in 2023 (Bob, Charlie, Frank).
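In SQLite specifically, one sketch uses strftime(); other dialects offer YEAR() or EXTRACT() instead:
```sql
-- strftime('%Y', ...) pulls the year out of an ISO-8601 date string.
SELECT name, hire_date
FROM employees
WHERE strftime('%Y', hire_date) = '2023';
```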
---
## Puzzle 10: Pattern Matching (Easy)
**Task:** Find all employees whose names start with the letter 'A'.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** Look for the SQL operator designed for pattern matching in strings.
**Expected Output:** A table listing the employee(s) whose names start with 'A' (Alice).
---
## Puzzle 11: Joining Three Tables (Medium)
**Task:** List the names of projects assigned to employees working in the 'Engineering' department.
**Setup Code (Run in CLI):**
```python
# Assumes the 'employees' table exists from Puzzle 5
/create projects
projects['proj_id'] = ['P1', 'P2', 'P3', 'P4']
projects['proj_name'] = ['Website Redesign', 'Database Migration', 'Mobile App', 'Security Audit']
/create assignments
assignments['assign_id'] = [101, 102, 103, 104, 105, 106]
assignments['emp_id'] = [2, 3, 7, 1, 3, 5]  # Bob, Charlie, Grace, Alice, Charlie, Eve
assignments['proj_id'] = ['P1', 'P1', 'P2', 'P3', 'P4', 'P3']
```
**Hint:** You'll need to link three tables together: employees to assignments, and assignments to projects.
**Expected Output:** The project names associated with the Engineering employees (Bob, Charlie, Grace): Website Redesign, Database Migration, Security Audit.
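A sketch of the three-table chain:
```sql
-- Chain the joins employees -> assignments -> projects; DISTINCT collapses
-- projects that several engineers share.
SELECT DISTINCT p.proj_name
FROM employees e
JOIN assignments a ON a.emp_id = e.emp_id
JOIN projects p ON p.proj_id = a.proj_id
WHERE e.department = 'Engineering';
```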
---
## Puzzle 12: Duplicate Values (Easy)
**Task:** Find all salary values that are paid to more than one employee.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** How can you group rows by a value, count the rows in each group, and then filter the groups themselves?
**Expected Output:** A table of salary values that appear more than once. In the sample data no salary is duplicated, so the result is empty, but the query should still work: if Alice also made 80000, the output would show 80000.
---
## Puzzle 13: Nth Highest Salary (Medium)
**Task:** Find the name and salary of the employee with the 3rd highest salary. (Assume no ties for simplicity, or define how to handle ties.)
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** Can you order the employees by salary and select only the row at a specific position? Think about subqueries or LIMIT/OFFSET.
**Expected Output:** The name and salary of the 3rd highest paid employee (Bob, 80000).
---
## Puzzle 14: Departments as Columns (Medium)
**Task:** Display the total salary budget for each department, with departments as columns rather than rows.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** As in Puzzle 8, use SUM with conditions (CASE expressions) to sum salaries only for specific departments, creating a separate column per department.
**Expected Output:** A single-row table with columns like HR_Total, Engineering_Total, Sales_Total (HR: 125000, Engineering: 255000, Sales: 145000).
---
## Puzzle 15: Combining Result Sets (Easy)
**Task:** List the names of all employees in either the 'HR' department or the 'Sales' department.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** How do you combine the results of two separate SELECT statements into a single list?
**Expected Output:** The names of employees in HR or Sales (Alice, David, Eve, Frank).
---
## Puzzle 16: Correlated Subquery (Medium)
**Task:** For each employee, find the average salary of their specific department.
**Setup Code:** (Use the employees table from Puzzle 5)
**Hint:** Can a subquery refer to values from the outer query that contains it? Use this to compute the average salary of the current employee's department inside the subquery.
**Expected Output:** A table showing each employee's name, department, salary, and the average salary of their department (e.g., Alice, HR, 60000, 62500; Bob, Engineering, 80000, 85000).
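A sketch of the correlated-subquery pattern the hint describes:
```sql
-- The inner AVG() is re-evaluated for each outer row, restricted to that
-- row's department.
SELECT e.name, e.department, e.salary,
       (SELECT AVG(x.salary)
        FROM employees x
        WHERE x.department = e.department) AS dept_avg_salary
FROM employees e;
```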
---
## Puzzle 17: Customer Spending Categories (Medium)
**Task:** Categorize customers into 'Low', 'Medium', or 'High' spenders based on their total purchase amount. Thresholds: Low < 100, Medium < 500, High >= 500.
**Setup Code (Run in CLI):**
```python
/create customers
customers['customer_id'] = [1, 2, 3, 4, 5, 6]
customers['name'] = ['Liam', 'Olivia', 'Noah', 'Emma', 'Oliver', 'Ava']
/create purchases
purchases['purchase_id'] = [101, 102, 103, 104, 105, 106, 107, 108]
purchases['customer_id'] = [1, 2, 1, 3, 4, 2, 5, 6]
purchases['amount'] = [50.0, 120.0, 30.0, 450.0, 600.0, 80.0, 25.0, 700.0]
```
**Hint:** Aggregate purchase amounts per customer first, then use a CASE expression to apply the category labels to the totals.
**Expected Output:** A table showing each customer's ID, name, total purchase amount, and spending category ('Low', 'Medium', or 'High').
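One way the aggregate-then-label approach could look, as a sketch:
```sql
-- Aggregate first in a CTE, then label the totals with CASE.
WITH totals AS (
    SELECT c.customer_id, c.name, SUM(p.amount) AS total_amount
    FROM customers c
    JOIN purchases p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
)
SELECT customer_id, name, total_amount,
       CASE WHEN total_amount < 100 THEN 'Low'
            WHEN total_amount < 500 THEN 'Medium'
            ELSE 'High'
       END AS spender_category
FROM totals;
```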
---
## Puzzle 18: Previous Day's Sales (Medium)
**Task:** For each product on each date, find the sales amount from the previous day. Assume dates are consecutive where data exists.
**Setup Code (Run in CLI):**
```python
/create daily_sales
daily_sales['sale_date'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03']
daily_sales['product_id'] = [10, 10, 10, 20, 20, 20]
daily_sales['sales_amount'] = [200, 210, 215, 70, 75, 80]
# ISO date strings compare correctly in the current sqlite3 environment, so no
# conversion is needed; other environments may require an explicit date type.
```
**Hint:** Window functions like LAG() are ideal for this. Partition by product and order by date so you look back within each product's series, and handle each product's first day, which has no previous day.
**Expected Output:** A table showing sale date, product ID, sales amount, and the previous day's sales amount (NULL or 0 for the first day).
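A minimal LAG() sketch:
```sql
-- The partition restarts the lookback for each product, so the first day
-- of each product yields NULL.
SELECT sale_date, product_id, sales_amount,
       LAG(sales_amount) OVER (PARTITION BY product_id
                               ORDER BY sale_date) AS prev_day_sales
FROM daily_sales;
```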
---
## Puzzle 19: Finding NULL Values (Easy)
**Task:** Find all rows in the sensor_readings table where the temperature reading is missing (NULL).
**Setup Code (Run in CLI):**
```python
/create sensor_readings
sensor_readings['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sensor_readings['sensor_id'] = ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'C', 'B', 'A']
sensor_readings['timestamp'] = ['2023-05-01 10:00', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:10', '2023-05-02 10:00', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-04 10:10']
sensor_readings['temperature'] = [25.5, 22.0, None, 30.1, None, 26.0, 28.0, 29.2, 27.3, None, 28.5, 27.8]  # Python's None becomes SQL NULL
sensor_readings['humidity'] = [60, 65, 62, 55, 68, 61, 63, 64, 66, 67, 69, 90]
```
**Hint:** SQL has a specific operator to check whether a value `IS NULL`.
**Expected Output:** A table showing the full rows (all columns) for readings where the temperature is NULL (reading_ids 3, 5, and 10).
---
## Puzzle 20: Data Sampling (Medium)
**Task:** Select approximately 50% of the user sessions randomly from the `user_sessions` table.
**Setup Code (Run in CLI):**
```python
/create user_sessions
user_sessions['session_id'] = range(1, 21)                   # 20 sessions
user_sessions['user_id'] = [(i % 5) + 1 for i in range(20)]  # Users 1-5
user_sessions['duration_seconds'] = [(i * 10) + 30 for i in range(20)]
```
**Hint:** Most SQL dialects have a function to generate random numbers (e.g., RANDOM() or RAND()). Use it in an ORDER BY clause together with LIMIT, or in a WHERE clause to filter rows probabilistically. The exact method varies by SQL flavor; SQLite provides RANDOM().
**Expected Output:** A table containing roughly half (around 10 rows) of the original user_sessions table, selected randomly. The exact rows will vary on each execution.
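Two SQLite sketches of the approaches the hint mentions (LIMIT 10 is half of the 20 rows here):
```sql
-- Shuffle with RANDOM() and keep exactly half the rows.
SELECT * FROM user_sessions ORDER BY RANDOM() LIMIT 10;

-- Probabilistic alternative: keeps each row with ~50% chance, so the row
-- count itself varies around 10.
SELECT * FROM user_sessions WHERE (RANDOM() % 2) = 0;
```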
---
## Puzzle 21: A/B Test Conversion Rates (Medium)
**Task:** Calculate the conversion rate (percentage of users who made a purchase) for users in the 'Control' group versus the 'Treatment' group.
**Setup Code (Run in CLI):**
```python
/create experiment_users
experiment_users['user_id'] = range(1, 11)  # 10 users
experiment_users['group_name'] = ['Control'] * 5 + ['Treatment'] * 5
/create user_actions
user_actions['action_id'] = range(101, 116)  # 15 actions
user_actions['user_id'] = [1, 2, 6, 3, 7, 8, 1, 4, 9, 6, 2, 10, 5, 7, 3]  # Some users act repeatedly
user_actions['action_type'] = ['view', 'view', 'view', 'click', 'view', 'purchase', 'click', 'view', 'purchase', 'click', 'purchase', 'view', 'view', 'purchase', 'purchase']
```
**Hint:** First identify which users made a purchase, then join that with the experiment group assignments. Finally, aggregate per group (total users, converted users) and compute converted / total * 100.0. Use COUNT(DISTINCT user_id) carefully.
**Expected Output:** Two rows, one per group, with their conversion rates: in Control, users 2 and 3 purchased (2/5 = 40%); in Treatment, users 7, 8, and 9 purchased (3/5 = 60%).
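One possible shape for the hint's steps, as a sketch:
```sql
-- A LEFT JOIN against distinct purchasers leaves NULLs for non-converters,
-- and COUNT(column) skips those NULLs.
WITH buyers AS (
    SELECT DISTINCT user_id FROM user_actions WHERE action_type = 'purchase'
)
SELECT e.group_name,
       COUNT(*) AS total_users,
       COUNT(b.user_id) AS converted_users,
       100.0 * COUNT(b.user_id) / COUNT(*) AS conversion_rate_pct
FROM experiment_users e
LEFT JOIN buyers b ON b.user_id = e.user_id
GROUP BY e.group_name;
```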
---
## Puzzle 22: 3-Day Moving Average (Medium)
**Task:** Calculate the 3-day moving average of website visits for the entire site.
**Setup Code (Run in CLI):**
```python
/create website_traffic
website_traffic['log_date'] = ['2023-03-01', '2023-03-02', '2023-03-03', '2023-03-04', '2023-03-05', '2023-03-06', '2023-03-07']
website_traffic['visits'] = [1000, 1100, 1050, 1200, 1150, 1300, 1250]
# ISO date strings sort correctly, so log_date can be ordered as-is
```
**Hint:** A classic use case for window functions: take AVG(visits) over a window ordered by log_date that covers the current row and the 2 preceding rows (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).
**Expected Output:** Each date with its 3-day moving average of visits. The first two days average over fewer than 3 days.
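A sketch of the window frame the hint spells out:
```sql
-- Each average covers the current day plus the two days before it
-- (fewer at the start of the series).
SELECT log_date, visits,
       AVG(visits) OVER (ORDER BY log_date
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3d
FROM website_traffic;
```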
---
## Puzzle 23: Frequency and Monetary Value (Medium)
**Task:** For each customer, calculate their total number of purchases (Frequency) and total amount spent (Monetary Value). (Recency would require transaction dates, so we skip it here.)
**Setup Code:** (Use the customers and purchases tables from Puzzle 17)
**Hint:** GROUP BY customer and use two different aggregate functions: COUNT() for frequency and SUM() for monetary value. Join with the customers table to get names.
**Expected Output:** A table showing customer_id, name, Frequency (count of purchases), and Monetary (sum of purchase amounts) for each customer.
---
## Puzzle 24: Anomalous Session Durations (Medium)
**Task:** Identify users whose average session duration is more than 1.5 times the overall average session duration across all users.
**Setup Code:** (Use the user_sessions table from Puzzle 20, then adjust the durations for users 1 and 5:)
```python
user_sessions.loc[user_sessions.user_id == 1, 'duration_seconds'] = 10
user_sessions.loc[user_sessions.user_id == 5, 'duration_seconds'] = 250
```
**Hint:** First calculate the overall average duration, then the average duration per user, and select users whose average exceeds 1.5 times the overall average. Subqueries or CTEs (Common Table Expressions) are helpful here.
**Expected Output:** The user_id and average session duration of each user identified as having anomalously long average sessions.
---
## Puzzle 25: Min and Max for Scaling (Easy)
**Task:** Find the minimum and maximum values of the feature1 column in the ml_data table. These are often needed for Min-Max scaling: (value - min) / (max - min).
**Setup Code (Run in CLI):**
```python
/create ml_data
ml_data['id'] = range(1, 8)
ml_data['feature1'] = [10.5, 12.1, 8.3, 15.0, 9.9, 11.2, 14.5]
ml_data['feature2'] = [100, 110, 95, 120, 105, 108, 115]
ml_data['label'] = [0, 1, 0, 1, 0, 1, 1]
```
**Hint:** Use the MIN() and MAX() aggregate functions. No GROUP BY is needed for an overall min/max across the whole table.
**Expected Output:** A single-row table with the minimum and maximum of feature1 (Min: 8.3, Max: 15.0).
**Follow-up:** To scale the feature1 column, apply (value - min) / (max - min); Puzzle 29 does exactly this.
---
## Puzzle 26: Confusion Matrix Counts (Medium)
**Task:** Given a table of actual labels and predicted labels, count the True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Assume '1' is the positive class and '0' is the negative class.
**Setup Code (Run in CLI):**
```python
/create predictions
predictions['id'] = range(1, 11)  # 10 predictions
predictions['actual_label'] = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions['predicted_label'] = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
```
**Hint:** This requires conditional aggregation: SUM() with a CASE expression for each of the four conditions:
- TP: actual = 1 AND predicted = 1
- FP: actual = 0 AND predicted = 1
- TN: actual = 0 AND predicted = 0
- FN: actual = 1 AND predicted = 0
**Expected Output:** A single-row table with four columns, TP, FP, TN, FN, showing the counts for each category based on the provided data.
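A sketch with one conditional SUM per confusion-matrix cell:
```sql
-- Each CASE marks rows that fall into one cell; SUM counts them.
SELECT SUM(CASE WHEN actual_label = 1 AND predicted_label = 1 THEN 1 ELSE 0 END) AS tp,
       SUM(CASE WHEN actual_label = 0 AND predicted_label = 1 THEN 1 ELSE 0 END) AS fp,
       SUM(CASE WHEN actual_label = 0 AND predicted_label = 0 THEN 1 ELSE 0 END) AS tn,
       SUM(CASE WHEN actual_label = 1 AND predicted_label = 0 THEN 1 ELSE 0 END) AS fn
FROM predictions;
```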
---
## Puzzle 27: Imputing Missing Values (Medium)
**Task:** Replace missing temperature values in the sensors2 table with the average temperature for the same sensor on the same day.
**Setup Code (Run in CLI):**
```python
/create sensors2
sensors2['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sensors2['sensor_id'] = ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'C', 'B', 'A']
sensors2['timestamp'] = ['2023-05-01 10:00', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:10', '2023-05-02 10:00', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-04 10:10']
sensors2['temperature'] = [25.5, 22.0, None, 30.1, None, 26.0, 28.0, 29.2, 27.3, None, 28.5, 27.8]  # Python's None becomes SQL NULL
sensors2['humidity'] = [60, 65, 62, 55, 68, 61, 63, 64, 66, 67, 69, 90]
```
**Hint:** Calculate the average temperature per sensor per day first, then use it in an `UPDATE` statement or in a `SELECT` with `COALESCE()` to replace the missing values.
**Expected Output:** The sensors2 table with missing `temperature` values replaced. (Note that sensor C's only reading on 2023-05-02 is NULL, so there is no same-day average to impute for it.)
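A sketch of the COALESCE() route from the hint:
```sql
-- Fall back to the same-sensor, same-day average; date() truncates
-- '2023-05-01 10:05' to '2023-05-01'.
SELECT s.reading_id, s.sensor_id, s.timestamp,
       COALESCE(s.temperature,
                (SELECT AVG(s2.temperature)
                 FROM sensors2 s2
                 WHERE s2.sensor_id = s.sensor_id
                   AND date(s2.timestamp) = date(s.timestamp))) AS temperature,
       s.humidity
FROM sensors2 s;
```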
---
## Puzzle 28: Data Quality Check - Duplicate Rows (Easy)
**Task:** Identify and count duplicate rows in the `user_sessions` table.
**Setup Code:** (Use the `user_sessions` table from Puzzle 20, then add a duplicate row; this matches session 1 after the Puzzle 24 adjustment:)
```python
user_sessions.loc[len(user_sessions) + 1] = [1, 1, 10]
```
**Hint:** Use COUNT(*) with GROUP BY on all columns, then filter for groups with more than one row.
**Expected Output:** A table showing the duplicate rows and their counts.
---
## Puzzle 29: Min-Max Scaling (Medium)
**Task:** Scale the feature1 and feature2 columns in the ml_data table using Min-Max scaling.
**Setup Code:** (Use the ml_data table from Puzzle 25)
**Hint:** Calculate the minimum and maximum of each feature, then apply (value - min) / (max - min).
**Expected Output:** The ml_data table with feature1 and feature2 scaled to the range 0 to 1.
---
## Puzzle 30: Histogram Bins (Medium)
**Task:** Prepare data for plotting the distribution of feature1 values in the ml_data table by aggregating them into bins.
**Setup Code:** (Use the ml_data table from Puzzle 25)
**Hint:** Decide on a bin size, use GROUP BY with a CASE expression to assign each value to a bin, and count the values per bin.
**Expected Output:** A table of bins and their counts, ready for plotting.
---
## Puzzle 31: Ranking Sales Reps (Medium)
**Task:** Using the sales table, rank each sales representative by total sales amount, showing rep_id, total_sales, and a sales_rank.
**Setup Code (Run in CLI):**
```python
/create sales
sales['sale_id'] = [1, 2, 3, 4, 5, 6]
sales['rep_id'] = [101, 102, 101, 103, 102, 101]
sales['amount'] = [500, 300, 200, 400, 600, 150]
```
**Hint:** Use SUM(amount) OVER (PARTITION BY rep_id) to compute totals and DENSE_RANK() OVER (ORDER BY total_sales DESC) to assign ranks.
**Expected Output:** A table with columns rep_id, total_sales, and sales_rank.
---
## Puzzle 32: Cumulative Revenue (Medium)
**Task:** Calculate cumulative revenue month over month using the sales_series table. Include month, revenue, and cumulative_revenue.
**Setup Code (Run in CLI):**
```python
/create sales_series
sales_series['month'] = ['2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12']
sales_series['amount'] = [1200, 1500, 1300, 0, 1600, 1400, 1700, 1800, 1550, 1650, 1750, 1850]
```
**Hint:** Use SUM(amount) OVER (ORDER BY month ROWS UNBOUNDED PRECEDING).
**Expected Output:** A table sorted by month showing cumulative_revenue.
---
## Puzzle 33: Top N per Group (Medium)
**Task:** Find the top 2 highest paid employees in each department using the employees2 table. Return department, employee name, and salary.
**Setup Code (Run in CLI):**
```python
/create employees2
employees2['emp_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
employees2['name'] = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack']
employees2['department'] = ['HR', 'Engineering', 'Engineering', 'HR', 'Sales', 'Sales', 'Engineering', 'HR', 'Engineering', 'Sales']
employees2['salary'] = [60000, 80000, 85000, 65000, 70000, 75000, 90000, 30000, 45000, 25000]
```
**Hint:** Use ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) and filter on row_num <= 2.
**Expected Output:** The top 2 earners per department.
---
## Puzzle 34: Resolving Manager Names (Medium)
**Task:** List each employee along with their manager's name, using the employees and managers tables from Puzzles 5 and 6.
**Setup Code:** (Use the employees and managers tables from Puzzles 5 and 6)
**Hint:** Join employees to managers, then self-join back to employees to match each manager_id to that manager's own row.
**Expected Output:** Columns emp_id, employee_name, and manager_name.
---
## Puzzle 35: Management Chain (Hard)
**Task:** Show the full management chain for each employee, using the employees2 table from Puzzle 33 and the managers2 table defined below.
**Setup Code (Run in CLI):**
```python
# Assumes the 'employees2' table exists from Puzzle 33
/create managers2
managers2['emp_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
managers2['manager_id'] = [2, 7, 6, 2, 6, 7, None, 1, 3, 5]
```
**Hint:** Use a WITH RECURSIVE CTE to traverse the hierarchy.
**Expected Output:** A table listing emp_id, employee name, and the full path of managers above them.
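A hedged sketch of the recursive traversal, building the chain as a string path:
```sql
-- Start from each employee, then repeatedly append the next manager up
-- until manager_id is NULL (the top of the chain).
WITH RECURSIVE chain(emp_id, name, path, next_mgr) AS (
    SELECT e.emp_id, e.name, e.name, m.manager_id
    FROM employees2 e
    JOIN managers2 m ON m.emp_id = e.emp_id
    UNION ALL
    SELECT c.emp_id, c.name, c.path || ' -> ' || e.name, m.manager_id
    FROM chain c
    JOIN employees2 e ON e.emp_id = c.next_mgr
    JOIN managers2 m ON m.emp_id = e.emp_id
)
SELECT emp_id, name, path
FROM chain
WHERE next_mgr IS NULL;
```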
---
## Puzzle 36: Customers with No Purchases (Easy)
**Task:** Identify customers who have never made a purchase, using the customers and purchases tables from Puzzle 17.
**Setup Code:** (Use the customers and purchases tables from Puzzle 17, then add customers who have made no purchases:)
```python
customers.loc[len(customers) + 1] = [7, 'Bob']
customers.loc[len(customers) + 1] = [8, 'Mary']
```
**Hint:** Use a LEFT JOIN or EXCEPT to find customers with no matching purchases.
**Expected Output:** A list of customer IDs or names with no purchases.
---
## Puzzle 37: Products in Both Tables (Easy)
**Task:** Find products that appear in both the product_sales table and the returns table.
**Setup Code (Run in CLI):**
```python
/create product_sales
product_sales['product_id'] = [101, 102, 103, 104, 102]
product_sales['quantity'] = [10, 20, 30, 40, 15]
/create returns
returns['return_id'] = [1, 2, 3]
returns['product_id'] = [102, 104, 106]
returns['quantity'] = [2, 5, 1]
```
**Hint:** Use INTERSECT to find the shared product_ids.
**Expected Output:** A list of product_ids present in both tables (102 and 104).
---
## Puzzle 38: Months as Columns (Medium)
**Task:** Pivot the product_sales2 table so that months become columns and each row shows a product_id with its total sales per month.
**Setup Code (Run in CLI):**
```python
/create product_sales2
product_sales2['month'] = ['2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06']
product_sales2['product_id'] = [101, 102, 103, 104, 102, 103]
product_sales2['quantity'] = [10, 20, 30, 40, 15, 20]
product_sales2['price'] = [1, 2, 3, 4, 2, 3]
```
**Hint:** Use CASE expressions with SUM() and GROUP BY, or a PIVOT clause if your dialect supports one.
**Expected Output:** A pivoted table with product_id and one sales column per month.
---
## Puzzle 39: Unpivoting Survey Answers (Medium)
**Task:** Unpivot a survey table with columns response_id, q1, q2, and q3 into a tall format.
**Setup Code (Run in CLI):**
```python
/create survey
survey['response_id'] = [101, 102, 103, 104]
survey['q1'] = ['High', 'Medium', 'Low', 'High']        # satisfaction level
survey['q2'] = ['Yes', 'No', 'Yes', 'No']               # would recommend
survey['q3'] = ['Daily', 'Weekly', 'Monthly', 'Daily']  # usage frequency
```
**Hint:** Use UNION ALL, or the UNPIVOT operator if your dialect supports it.
**Expected Output:** Columns response_id, question, and answer.
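A sketch of the UNION ALL route (SQLite has no UNPIVOT):
```sql
-- One SELECT per question column, with the question name as a literal.
SELECT response_id, 'q1' AS question, q1 AS answer FROM survey
UNION ALL
SELECT response_id, 'q2', q2 FROM survey
UNION ALL
SELECT response_id, 'q3', q3 FROM survey
ORDER BY response_id, question;
```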
---
## Puzzle 40: Extracting JSON Values (Medium)
**Task:** Extract nested values from the JSON column of the json_data table, e.g., get user.id and order.total.
**Setup Code (Run in CLI):**
```python
/create json_data
json_data['id'] = [1, 2, 3, 4]
json_data['data'] = [
    '{"user": {"id": 101, "name": "John"}, "order": {"total": 100}}',
    '{"user": {"id": 102, "name": "Jane"}, "order": {"total": 200}}',
    '{"user": {"id": 103, "name": "Bob"}, "order": {"total": 300}}',
    '{"user": {"id": 104, "name": "Alice"}, "order": {"total": 400}}'
]
```
**Hint:** Use JSON_EXTRACT(), the ->> operator, or your dialect's equivalent.
**Expected Output:** A table with user_id and order_total.
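A sketch assuming SQLite's JSON1 functions are available (they are built into recent SQLite versions):
```sql
-- json_extract() walks a JSON path; '$.user.id' means data.user.id.
SELECT id,
       json_extract(data, '$.user.id') AS user_id,
       json_extract(data, '$.order.total') AS order_total
FROM json_data;
```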
---
## Puzzle 41: Keyword Search (Easy)
**Task:** Find articles containing the keyword "database" in the articles table.
**Setup Code (Run in CLI):**
```python
/create articles
articles['article_id'] = [1, 2, 3, 4, 5]
articles['content'] = ['database management', 'design', 'database optimization', 'warehouse performance', 'internal database security']
```
**Hint:** Use LIKE '%database%', or full-text search functions if available.
**Expected Output:** The article_id and content of each matching article (articles 1, 3, and 5).
---
## Puzzle 42: Moving Average Excluding the Current Row (Medium)
**Task:** Calculate a 3-point moving average in the measurements table that excludes the current row.
**Setup Code (Run in CLI):**
```python
/create measurements
measurements['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
measurements['value'] = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```
**Hint:** Define the window frame as ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING.
**Expected Output:** Columns reading_id, value, and moving_avg.
---
## Puzzle 43: Top 10% by Salary (Medium)
**Task:** Identify employees in the top 10% salary bracket of the employees2 table from Puzzle 33.
**Setup Code:** (Use the employees2 table from Puzzle 33)
**Hint:** Use CUME_DIST() or PERCENT_RANK().
**Expected Output:** The emp_id and salary of the top 10% of earners.
---
## Puzzle 44: Counting by Status (Easy)
**Task:** Count the number of orders in each status category in the order_status table.
**Setup Code (Run in CLI):**
```python
/create order_status
order_status['order_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
order_status['status'] = ['Shipped', 'Pending', 'Shipped', 'Delivered', 'Pending', 'Shipped', 'Delivered', 'Shipped', 'Pending', 'Delivered']
```
**Hint:** Use COUNT(*) with GROUP BY status.
**Expected Output:** A count for each status value (Shipped: 4, Pending: 3, Delivered: 3).
---
## Puzzle 45: Most Recent Order per Customer (Medium)
**Task:** For each customer, show their most recent order, using the customers and purchases tables from Puzzle 17.
**Setup Code:** (Use the customers and purchases tables from Puzzle 17, then add an order_date column to purchases:)
```python
purchases['order_date'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08']
```
**Hint:** Use a lateral join, or a subquery with MAX(order_date).
**Expected Output:** Columns customer_id, customer_name, and the latest purchase_id.
---
## Puzzle 46: Points Within a Radius (Hard)
**Task:** Given a locations table with lat and lon columns, find the points within 10 km of a reference point.
**Setup Code (Run in CLI):**
```python
/create locations
locations['id'] = [1, 2, 3, 4, 5]
locations['lat'] = [40.7128, 34.0522, 37.7749, 47.6062, 33.7489]
locations['lon'] = [-74.006, -118.2437, -122.4194, -122.3321, -118.2437]
```
**Hint:** Use the Haversine formula, or spatial functions if your environment supports them.
**Expected Output:** The rows of locations within the radius.
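A sketch assuming a SQLite build with math functions (3.35+ compiled with SQLITE_ENABLE_MATH_FUNCTIONS); the reference point 40.7128, -74.0060 is illustrative, not given by the puzzle. If the math functions are missing, compute the distance in Python before loading.
```sql
-- Haversine formula: 6371 km is the Earth's mean radius.
SELECT *
FROM (
    SELECT id, lat, lon,
           2 * 6371 * asin(sqrt(
               pow(sin(radians(lat - 40.7128) / 2), 2)
               + cos(radians(40.7128)) * cos(radians(lat))
                 * pow(sin(radians(lon - (-74.0060)) / 2), 2)
           )) AS dist_km
    FROM locations
)
WHERE dist_km <= 10;
```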
---
## Puzzle 47: Inactive Customers (Easy)
**Task:** Find customers who have not logged in for over 30 days, using a last_login column added to customers.
**Setup Code:** (Use the customers table and add a last_login column:)
```python
# One value per customer row; extend the list if you added the extra
# customers in Puzzle 36
customers['last_login'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2025-04-29']
```
**Hint:** Use date functions, e.g., julianday(CURRENT_DATE) - julianday(last_login) > 30.
**Expected Output:** The customer_id and last_login of each customer whose last login is more than 30 days old.
---
## Puzzle 48: Masking PII (Medium)
**Task:** Mask the middle digits of the Social Security numbers in the users_with_PII table, showing only the first two and last two digits.
**Setup Code (Run in CLI):**
```python
/create users_with_PII
users_with_PII['user_id'] = range(1, 11)
# Generate some plausible SSN-like strings (not real SSNs)
import random
ssns = [f'{random.randint(100, 999):03}-{random.randint(10, 99):02}-{random.randint(1000, 9999):04}' for _ in range(10)]
users_with_PII['ssn'] = ssns
```
**Hint:** Use SUBSTR(), string concatenation (|| in SQLite), or equivalent string functions.
**Expected Output:** Columns user_id and masked_ssn.
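A SUBSTR sketch for the NNN-NN-NNNN format used here:
```sql
-- Keep the first two and last two characters and mask everything between.
SELECT user_id,
       substr(ssn, 1, 2) || '*-**-**' || substr(ssn, -2, 2) AS masked_ssn
FROM users_with_PII;
```
For '123-45-6789' this yields '12*-**-**89'.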
---
## Puzzle 49: Dense Ranking with Ties (Easy)
**Task:** Assign a dense rank to players in the scores table based on their score, so that ties share a rank without leaving gaps.
**Setup Code (Run in CLI):**
```python
/create scores
scores['player_id'] = range(1, 11)
scores['score'] = [850, 920, 780, 920, 850, 600, 780, 990, 700, 920]  # Includes ties
```
**Hint:** Use DENSE_RANK() OVER (ORDER BY score DESC).
**Expected Output:** Columns player_id, score, and dense_rank.
---
## Puzzle 50: Net Sales with CTEs (Medium)
**Task:** Calculate net sales by subtracting returns from total sales, using CTEs over the product_sales and returns tables.
**Setup Code:** (Use the product_sales and returns tables from Puzzle 37)
**Hint:** Use multiple WITH clauses to aggregate sales and returns separately before joining them.
**Expected Output:** Columns product_id and net_sales.
---
## Puzzle 51: Unique Values per Column (Easy)
**Task:** For a given table, return the count of unique values in each column.
**Setup Code:** (Use the sales table from Puzzle 31)
**Hint:** Use COUNT(DISTINCT column_name) for each column.
**Expected Output:** A table with columns column_name and unique_count.
---
## Puzzle 52: Non-NULL Values per Column (Easy)
**Task:** For a given table, return the count of non-null values in each column.
**Setup Code:** (Use the sensor_readings table from Puzzle 19)
**Hint:** Use COUNT(column_name) for each column; COUNT skips NULLs when given a column rather than *.
**Expected Output:** A table with columns column_name and non_null_count.
---
## Puzzle 53: Rows Unique to Each Table (Medium)
**Task:** Given two tables C and D, count the rows whose id_col occurs only in C and not in D, and vice versa (in D and not in C).
**Setup Code (Run in CLI):**
```python
# Run these lines one by one in the CLI
# Schema C: id_col (int), value (string)
/create C
C['id_col'] = [1, 2, 3, 4, 5, 9, 9, 8]  # 5 rows with an id only in C
C['value'] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
# Schema D: id_col (int), value (string)
/create D
D['id_col'] = [4, 5, 6, 7, 8, 11, 11, 12, 13, 14, 15]  # 8 rows with an id only in D
D['value'] = ['I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S']
```
**Hint:** Use EXCEPT or a LEFT JOIN to find the rows unique to each table.
**Expected Output:** Two counts: the number of rows whose id_col appears only in C (5) and only in D (8).
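The hint's EXCEPT route deduplicates rows, so for duplicate-preserving counts a NOT IN sketch may be simpler:
```sql
-- Unlike EXCEPT, NOT IN keeps duplicate rows in the counts
-- (both of C's id_col = 9 rows count). Safe here since id_col has no NULLs.
SELECT (SELECT COUNT(*) FROM C WHERE id_col NOT IN (SELECT id_col FROM D)) AS only_in_c,
       (SELECT COUNT(*) FROM D WHERE id_col NOT IN (SELECT id_col FROM C)) AS only_in_d;
```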
---
## Puzzle 54: Imputing Income by Group Average (Medium)
**Task:** Given a table with columns id, education, and income, impute the missing income values using the average income of people with the same education level.
**Setup Code (Run in CLI):**
```python
# Run these lines one by one in the CLI
/create imputation_table
imputation_table['id'] = [1, 2, 3, 4, 5]
imputation_table['education'] = [12, 16, 12, 13, 17]
imputation_table['income'] = [0.45, 0.65, None, 0.85, None]
```
**Hint:** Use AVG() with GROUP BY to calculate the average income per education level, then fill in the NULLs as in Puzzle 27. (Note that education level 17 has no non-NULL income, so that row has no group average to impute from.)
**Expected Output:** A table with columns id, education, and income with the missing values imputed.