Skip to content

Latest commit

 

History

History
907 lines (569 loc) · 34.6 KB

File metadata and controls

907 lines (569 loc) · 34.6 KB

SQL Puzzles for sql-cli-playground

Here are some SQL puzzles to practice your skills within the sql-cli-playground environment. For each puzzle, first run the provided Python code snippets in the CLI to set up the necessary tables. Then, write and execute the SQL query to solve the puzzle.

Give me some high-level approaches for how to solve this SQL puzzle (just the barest hint - DO NOT solve it for me).


Puzzle 1: Simple Selection (Easy)

Task: Select the name and species of all pets from the pets table.

Setup Code (Run in CLI):

# Run these lines one by one in the CLI
/create pets2
pets2['id'] = [1, 2, 3, 4, 5]
pets2['name'] = ['Buddy', 'Lucy', 'Max', 'Daisy', 'Charlie']
pets2['species'] = ['Dog', 'Cat', 'Dog', 'Dog', 'Cat']
pets2['age'] = [3, 5, 2, 7, 1]

Expected Output: A table showing only the name and species columns for all 5 pets.


Puzzle 2: Filtering Data (Easy)

Task: Find the names of all dogs older than 3 years from the pets table.

Setup Code: (Use the pets table created in Puzzle 1)

Expected Output: A table showing the names 'Daisy'.


Puzzle 3: Aggregation and Grouping (Medium)

Task: Calculate the average age for each pet species in the pets table.

Setup Code: (Use the pets table created in Puzzle 1)

Expected Output: A table showing each species and its corresponding average age (e.g., Dog: 4.0, Cat: 3.0).


Puzzle 4: Simple Join (Medium)

Task: List the names of pets and the names of their owners.

Setup Code (Run in CLI):

# Assumes 'pets' table exists from Puzzle 1
/create owners
owners['owner_id'] = [101, 102, 103, 104, 105]
owners['name'] = ['Alice', 'Bob', 'Carol', 'David', 'Eve']
owners['pet_id'] = [1, 2, 3, 4, 6] # Note: Pet ID 5 has no owner, Pet ID 6 doesn't exist in pets

Expected Output: A table showing pet names alongside their owner's name for pets 1 through 4. Buddy/Alice, Lucy/Bob, Max/Carol, Daisy/David.


Puzzle 5: Finding Max Value in Groups (Medium/Hard)

Task: Find the name and salary of the employee with the highest salary in each department.

Setup Code (Run in CLI):

/create employees
employees['emp_id'] = [1, 2, 3, 4, 5, 6, 7]
employees['name'] = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace']
employees['department'] = ['HR', 'Engineering', 'Engineering', 'HR', 'Sales', 'Sales', 'Engineering']
employees['salary'] = [60000, 80000, 85000, 65000, 70000, 75000, 90000]

Hint: You might need GROUP BY and an aggregate function like MAX(), potentially in a subquery or using window functions if your SQLite version supports them well (ROW_NUMBER()).

Expected Output: A table showing the highest earner(s) from each department:

  • HR: David, 65000
  • Engineering: Grace, 90000
  • Sales: Frank, 75000

Puzzle 6: Self Join or Subquery (Hard)

Task: Find the names of employees who earn more than their managers. Assume direct managers are listed in a separate table.

Setup Code (Run in CLI):

# Assumes 'employees' table exists from Puzzle 5
/create managers
managers['emp_id'] = [1, 2, 3, 5, 6, 7] # Alice, Bob, Charlie, Eve, Frank, Grace
managers['manager_id'] = [4, 7, 7, -1, 5, 2] # David, Grace, Grace, NULL (Top), Eve, Bob
# Map: Alice->David, Bob->Grace, Charlie->Grace, Eve->(None), Frank->Eve, Grace->Bob

Hint: You'll need to join the employees table with itself (self-join) or use a subquery to compare an employee's salary with their manager's salary obtained via the managers table. Handle the case where an employee might not have a manager (manager_id = -1 or similar).

Expected Output: A table showing employees who earn more than their manager. Based on the data:

  • Grace (90000) earns more than Bob (80000).
  • Bob (80000) does not earn more than Grace (90000).
  • Frank (75000) earns more than Eve (70000).

Puzzle 7: Outer Join (Medium)

Task: List all employees and their corresponding manager's name. Include employees who do not have a manager listed in the managers table.

Setup Code: (Use employees and managers tables from Puzzle 5 & 6)

Hint: How do you include rows from one table even when there's no matching row in the table you are joining it with?

Expected Output: A table showing each employee's name and their manager's name. Employees without managers (like Eve) should still appear, perhaps with a NULL value for the manager's name.


Puzzle 8: Conditional Aggregation (Medium)

Task: Count the number of employees in the 'Engineering' department versus the total number of employees in all other departments combined.

Setup Code: (Use employees table from Puzzle 5)

Hint: Think about how you can use functions like COUNT or SUM along with a condition (like an IF or CASE statement) inside the aggregation.

Expected Output: A table showing two counts, one for 'Engineering' employees and one for 'Other Departments'. (e.g., Engineering: 3, Other: 4)


Puzzle 9: Working with Dates (Medium)

Task: Find employees hired in the year 2023.

Setup Code (Run in CLI):

# Assumes 'employees' table exists from Puzzle 5
employees['hire_date'] = ['2022-05-10', '2023-01-15', '2023-08-20', '2022-11-01', '2024-02-28', '2023-06-10', '2022-03-15']

Hint: Most SQL dialects provide functions to extract parts of a date (like the year).

Expected Output: A table listing the names of employees hired in 2023 (Bob, Charlie, Frank).


Puzzle 10: String Matching (Easy)

Task: Find all employees whose names start with the letter 'A'.

Setup Code: (Use employees table from Puzzle 5)

Hint: Look for SQL operators specifically designed for pattern matching in strings.

Expected Output: A table listing the employee(s) whose name starts with 'A' (Alice).


Puzzle 11: Multiple Joins (Medium)

Task: List the names of projects assigned to employees working in the 'Engineering' department.

Setup Code (Run in CLI):

# Assumes 'employees' table exists from Puzzle 5
/create projects
projects['proj_id'] = ['P1', 'P2', 'P3', 'P4']
projects['proj_name'] = ['Website Redesign', 'Database Migration', 'Mobile App', 'Security Audit']

/create assignments
assignments['assign_id'] = [101, 102, 103, 104, 105, 106]
assignments['emp_id'] = [2, 3, 7, 1, 3, 5] # Bob, Charlie, Grace, Alice, Charlie, Eve
assignments['proj_id'] = ['P1', 'P1', 'P2', 'P3', 'P4', 'P3']

Hint: You'll need to link three tables together: employees to assignments, and assignments to projects.

Expected Output: A table listing the project names associated with 'Engineering' employees (Bob, Charlie, Grace). Expected projects: Website Redesign, Database Migration, Security Audit.


Puzzle 12: Finding Duplicates (Medium)

Task: Find all salary values that are paid to more than one employee.

Setup Code: (Use employees table from Puzzle 5)

Hint: How can you group rows based on a specific value and then count how many rows fall into each group? How do you filter these groups?

Expected Output: A table listing salary values that appear more than once. (In the sample data, no salary is duplicated, but the query should work if there were duplicates). If Alice also made 80000, the output should show 80000.


Puzzle 13: Nth Highest Value (Hard)

Task: Find the name and salary of the employee with the 3rd highest salary. (Assume no ties for simplicity, or define how to handle ties).

Setup Code: (Use employees table from Puzzle 5)

Hint: Can you order the employees by salary and then somehow select only the row at a specific position? Think about subqueries or LIMIT/OFFSET.

Expected Output: A table showing the name and salary of the 3rd highest paid employee (Bob, 80000).


Puzzle 14: Pivot Data (Hard)

Task: Display the total salary budget for each department, but with departments as columns rather than rows.

Setup Code: (Use employees table from Puzzle 5)

Hint: Similar to conditional aggregation, use SUM with conditions (e.g., CASE) to sum salaries only for specific departments, creating separate columns for each.

Expected Output: A single row table with columns like HR_Total, Engineering_Total, Sales_Total, showing the sum of salaries for each. (HR: 125000, Engineering: 255000, Sales: 145000)


Puzzle 15: Set Operations (Easy)

Task: List the names of all employees who are in the 'HR' department or the 'Sales' department.

Setup Code: (Use employees table from Puzzle 5)

Hint: How do you combine the results of two separate SELECT statements into a single list?

Expected Output: A table listing the names of employees in HR or Sales (Alice, David, Eve, Frank).


Puzzle 16: Correlated Subquery (Hard)

Task: For each employee, find the average salary of their specific department.

Setup Code: (Use employees table from Puzzle 5)

Hint: Can a subquery refer to values from the outer query that contains it? Use this to calculate the average salary only for the current employee's department within the subquery.

Expected Output: A table showing each employee's name, department, salary, and a column with the average salary for their department (e.g., Alice, HR, 60000, 62500; Bob, Engineering, 80000, 85000; etc.).


Puzzle 17: Feature Engineering - Binning (Medium)

Task: Categorize customers into 'Low', 'Medium', or 'High' spenders based on their total purchase amount. Thresholds: Low < 100, Medium < 500, High >= 500.

Setup Code (Run in CLI):

/create customers
customers['customer_id'] = [1, 2, 3, 4, 5, 6]
customers['name'] = ['Liam', 'Olivia', 'Noah', 'Emma', 'Oliver', 'Ava']

/create purchases
purchases['purchase_id'] = [101, 102, 103, 104, 105, 106, 107, 108]
purchases['customer_id'] = [1, 2, 1, 3, 4, 2, 5, 6]
purchases['amount'] = [50.0, 120.0, 30.0, 450.0, 600.0, 80.0, 25.0, 700.0]

Hint: You'll need to aggregate purchase amounts per customer first. Then, use a CASE statement to apply the category labels based on the total amount.

Expected Output: A table showing each customer's ID, name, total purchase amount, and their spending category ('Low', 'Medium', or 'High').


Puzzle 18: Feature Engineering - Lag Features (Hard)

Task: For each product on each date, find the sales amount from the previous day. Assume dates are consecutive where data exists.

Setup Code (Run in CLI):

/create daily_sales
daily_sales['sale_date'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03']
daily_sales['product_id'] = [10, 10, 10, 20, 20, 20]
daily_sales['sales_amount'] = [200, 210, 215, 70, 75, 80]
# Convert date string to actual date type if needed by the specific SQL environment
# In sqlite-web or similar, date strings might work directly for comparisons
# If not, you might need: daily_sales['sale_date'] = pd.to_datetime(daily_sales['sale_date']) # before loading - anyway this works fine for the current sqlite3 environment

Hint: Window functions like LAG() are ideal for this. You need to partition by the product and order by date to look back correctly. Handle the first day for each product where there's no previous day.

Expected Output: A table showing sale date, product ID, sales amount, and the previous day's sales amount (or NULL/0 for the first day).


Puzzle 19: Data Cleaning - Identifying Missing Values (Easy)

Task: Find all rows in the sensor_readings table where the temperature reading is missing (NULL).

Setup Code (Run in CLI):

/create sensor_readings
sensor_readings['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sensor_readings['sensor_id'] = ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'C', 'B', 'A']
sensor_readings['timestamp'] = ['2023-05-01 10:00', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:10', '2023-05-02 10:00', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-04 10:10']
sensor_readings['temperature'] = [25.5, 22.0, None, 30.1, None, 26.0, 28.0, 29.2, 27.3, None, 28.5, 27.8] # Using Python's None for NULL
sensor_readings['humidity'] = [60, 65, 62, 55, 68, 61, 63, 64, 66, 67, 69, 90]

**Hint:** SQL has a specific operator to check if a value `IS NULL`.

**Expected Output:** A table showing the full rows (all columns) for readings where the temperature is NULL (reading_ids 3 and 5).

---

## Puzzle 20: Data Sampling (Medium)

**Task:** Select approximately 50% of the user sessions randomly from the `user_sessions` table.

**Setup Code (Run in CLI):**

```python
/create user_sessions
user_sessions['session_id'] = range(1, 21) # 20 sessions
user_sessions['user_id'] = [ (i % 5) + 1 for i in range(20)] # Users 1-5
user_sessions['duration_seconds'] = [ (i * 10) + 30 for i in range(20)]

Hint: Most SQL dialects have a function to generate random numbers (e.g., RANDOM() or RAND()). You can use this in an ORDER BY clause and LIMIT the result, or use it in a WHERE clause to filter rows probabilistically. The exact method might vary slightly based on the SQL flavor (SQLite's RANDOM() is common).

Expected Output: A table containing roughly half (around 10 rows) of the original user_sessions table, selected randomly. The exact rows will vary on each execution.


Puzzle 21: A/B Testing Analysis - Basic Comparison (Medium)

Task: Calculate the conversion rate (percentage of users who made a purchase) for users in the 'Control' group versus the 'Treatment' group.

Setup Code (Run in CLI):

/create experiment_users
experiment_users['user_id'] = range(1, 11) # 10 users
experiment_users['group_name'] = ['Control'] * 5 + ['Treatment'] * 5

/create user_actions
user_actions['action_id'] = range(101, 116) # 15 actions
user_actions['user_id'] = [1, 2, 6, 3, 7, 8, 1, 4, 9, 6, 2, 10, 5, 7, 3] # Some users repeat actions
user_actions['action_type'] = ['view', 'view', 'view', 'click', 'view', 'purchase', 'click', 'view', 'purchase', 'click', 'purchase', 'view', 'view', 'purchase', 'purchase']

Hint: First, identify which users made a purchase. Then, join this information with the experiment group assignments. Finally, aggregate counts per group (total users, converted users) and calculate the rate (converted / total * 100.0). Use COUNT(DISTINCT user_id) carefully.

Expected Output: A table showing two rows, one for 'Control' and one for 'Treatment', with their respective conversion rates. (Control: User 3 & 5 purchased? 2 / 5 users. Treatment: User 7, 8, 9 purchased? 3 / 5 users).


Puzzle 22: Time Series Analysis - Simple Moving Average (Hard)

Task: Calculate the 3-day moving average of website visits for the entire site.

Setup Code (Run in CLI):

/create website_traffic
website_traffic['log_date'] = ['2023-03-01', '2023-03-02', '2023-03-03', '2023-03-04', '2023-03-05', '2023-03-06', '2023-03-07']
website_traffic['visits'] = [1000, 1100, 1050, 1200, 1150, 1300, 1250]
# Ensure log_date is treated as a date for ordering

Hint: This is a classic use case for window functions. You need AVG(visits) over a window defined by ORDER BY log_date that includes the current row and the 2 preceding rows (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).

Expected Output: A table showing each date and the corresponding 3-day moving average of visits. The first two days will have averages based on fewer than 3 days.


Puzzle 23: User Segmentation - RFM Prep (Medium/Hard)

Task: For each customer, calculate their total number of purchases (Frequency) and the total amount spent (Monetary Value). (Recency is harder without transaction dates, so we skip it here).

Setup Code: (Use customers and purchases tables from Puzzle 11)

Hint: You need to GROUP BY customer and use two different aggregate functions: COUNT() for frequency and SUM() for monetary value. Join with the customers table to get names.

Expected Output: A table showing customer_id, name, Frequency (count of purchases), and Monetary (sum of purchase amounts) for each customer.


Puzzle 24: Anomaly Detection - Simple Thresholding (Medium)

Task: Identify users whose average session duration is more than 1.5 times the overall average session duration across all users.

Setup Code: (Use user_sessions table from Puzzle 14) Adjust the avg duration for user 1 to be only 10 and user 5 to be 250:

user_sessions.loc[user_sessions.user_id == 1,'duration_seconds'] = 10
user_sessions.loc[user_sessions.user_id == 5,'duration_seconds'] = 250

Hint: First, calculate the overall average duration. Then, calculate the average duration per user. Finally, select users whose average duration exceeds the overall average multiplied by 1.5. Subqueries or CTEs (Common Table Expressions) are helpful here.

Expected Output: A table listing the user_id and their average session duration for users identified as having anomalously long average sessions.


Puzzle 25: Data Transformation - Finding Min/Max for Scaling (Easy)

Task: Find the minimum and maximum values for the feature1 column in the ml_data table. This is often needed for Min-Max scaling: (value - min) / (max - min).

Setup Code (Run in CLI):

/create ml_data
ml_data['id'] = range(1, 8)
ml_data['feature1'] = [10.5, 12.1, 8.3, 15.0, 9.9, 11.2, 14.5]
ml_data['feature2'] = [100, 110, 95, 120, 105, 108, 115]
ml_data['label'] = [0, 1, 0, 1, 0, 1, 1]

Hint: Use the MIN() and MAX() aggregate functions. You don't need a GROUP BY if you want the overall min/max across the whole table.

Expected Output: A single row table showing the minimum and maximum values found in feature1. (Min: 8.3, Max: 15.0).

Follow-up: If you want to scale the feature1 column, you can use the formula: (value - min) / (max - min).


Puzzle 26: Model Evaluation Prep - Confusion Matrix Counts (Hard)

Task: Given a table of actual labels and predicted labels, count the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Assume '1' is the positive class and '0' is the negative class.

Setup Code (Run in CLI):

/create predictions
predictions['id'] = range(1, 11) # 10 predictions
predictions['actual_label'] = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions['predicted_label'] = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]

Hint: This requires conditional aggregation. Use SUM() with CASE statements for each of the four conditions:

  • TP: actual = 1 AND predicted = 1
  • FP: actual = 0 AND predicted = 1
  • TN: actual = 0 AND predicted = 0
  • FN: actual = 1 AND predicted = 0

Expected Output: A single row table with four columns: TP, FP, TN, FN, showing the counts for each category based on the provided data.


Puzzle 27: Handling Missing Values - Imputation (Medium)

Task: Replace missing temperature values in the sensors2 table with the average temperature for the same sensor on the same day.

Setup Code:

/create sensors2
sensors2['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sensors2['sensor_id'] = ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'C', 'B', 'A']
sensors2['timestamp'] = ['2023-05-01 10:00', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:00', '2023-05-01 10:05', '2023-05-01 10:10', '2023-05-02 10:00', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-02 10:00', '2023-05-02 10:05', '2023-05-04 10:10']
sensors2['temperature'] = [25.5, 22.0, None, 30.1, None, 26.0, 28.0, 29.2, 27.3, None, 28.5, 27.8] # Using Python's None for NULL
sensors2['humidity'] = [60, 65, 62, 55, 68, 61, 63, 64, 66, 67, 69, 90]

**Hint:** You'll need to calculate the average temperature per sensor per day first. Then, use this in an `UPDATE` statement or a `SELECT` with `COALESCE()` to replace missing values.

**Expected Output:** The modified `sensors2` table with missing `temperature` values replaced.

---

## Puzzle 28: Data Quality Check - Duplicate Rows (Easy)

**Task:** Identify and count duplicate rows in the `user_sessions` table.

**Setup Code:** (Use `user_sessions` table from Puzzle 14) and add a duplicate row:
```python
user_sessions.loc[(len(user_sessions) + 1)] = [1, 1, 10]

Hint: Use COUNT(*) with GROUP BY on all columns. Then, filter for groups with more than one row.

Expected Output: A table showing the duplicate rows and their counts.


Puzzle 29: Data Normalization - Scaling (Medium)

Task: Scale the feature1 and feature2 columns in the ml_data table using Min-Max scaling.

Setup Code: (Use ml_data table from Puzzle 25)

Hint: Calculate the minimum and maximum for each feature. Then, apply the Min-Max scaling formula: (value - min) / (max - min).

Expected Output: The ml_data table with feature1 and feature2 scaled between 0 and 1.


Puzzle 30: Data Visualization Prep - Aggregating for Plotting (Medium)

Task: Prepare data for plotting the distribution of feature1 values in the ml_data table by aggregating into bins.

Setup Code: (Use ml_data table from Puzzle 25)

Hint: Decide on a bin size, then use GROUP BY with a CASE statement to assign each value to a bin. Count the number of values in each bin.

Expected Output: A table showing the bins and their respective counts, ready for plotting.


Puzzle 31: Sales Rep Ranking (Medium)

Task: Using the sales table, rank each sales representative by their total sales amount, showing the rep_id, total_sales, and assign a sales_rank.

Setup Code (Run in CLI):

/create sales
sales['sale_id'] = [1, 2, 3, 4, 5, 6]
sales['rep_id'] = [101, 102, 101, 103, 102, 101]
sales['amount'] = [500, 300, 200, 400, 600, 150]

Hint: Use SUM(amount) OVER (PARTITION BY rep_id) to compute totals and DENSE_RANK() OVER (ORDER BY total_sales DESC) to assign ranks.

Expected Output: A table with columns rep_id, total_sales, and sales_rank.


Puzzle 32: Cumulative Monthly Revenue (Medium)

Task: Calculate cumulative revenue month over month using the sales_series table. Include month, revenue, and cumulative_revenue.

Setup Code:

/create sales_series
sales_series['month'] = ['2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12']
sales_series['amount'] = [1200, 1500, 1300, 0, 1600, 1400, 1700, 1800, 1550, 1650, 1750, 1850]

Hint: Use SUM(amount) OVER (ORDER BY month ROWS UNBOUNDED PRECEDING).

Expected Output: A table sorted by month showing cumulative_revenue.


Puzzle 33: Top N Employees by Department (Medium)

Task: Find the top 2 highest paid employees in each department using the employees2 table. Return department, employee_name, and salary.

Setup Code:

/create employees2
employees2['emp_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
employees2['name'] = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack']
employees2['department'] = ['HR', 'Engineering', 'Engineering', 'HR', 'Sales', 'Sales', 'Engineering', 'HR', 'Engineering', 'Sales']
employees2['salary'] = [60000, 80000, 85000, 65000, 70000, 75000, 90000, 30000, 45000, 25000]

Hint: Use ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) and filter on row_num <= 2.

Expected Output: The top 2 earners per department.


Puzzle 34: Employee-Manager Self-Join (Easy)

Task: List each employee along with their manager's name using the employees table from Puzzle 5.

Setup Code: (Use employees table from Puzzle 5)

Hint: Perform a self-join on employees matching manager_id to employee_id.

Expected Output: Columns employee_id, employee_name, and manager_name.


Puzzle 35: Organizational Hierarchy (Hard)

Task: Show the full management chain for each employee using the employees2 table from Puzzle 33 and the managers2 table defined below.

Setup Code: (Use employees2 table from Puzzle 33 and the managers2 table defined below)

/create managers2
managers2['emp_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
managers2['manager_id'] = [2, 7, 6, 2, 6, 7, None, 1, 3, 5]

Hint: Use a WITH RECURSIVE CTE to traverse the hierarchy.

Expected Output: A table listing employee_id, employee_name, and the full path of managers above them.


Puzzle 36: Customers Without Purchases (Easy)

Task: Identify customers who have never made a purchase from the customers and purchases tables used in Puzzle 9.

Setup Code: (Use customers and purchases tables from Puzzle 9) Add customers who haven't made purchases: customers.loc[(len(customers) + 1)] = [7, 'Bob'] customers.loc[(len(customers) + 1)] = [8, 'Mary']

Hint: Use LEFT JOIN or EXCEPT to find customers with no matching purchases.

Expected Output: A list of customer IDs or names with no purchases.


Puzzle 37: Common Products (Easy)

Task: Find products that appear in both the product_sales table and a returns table.

Setup Code:

/create product_sales
product_sales['product_id'] = [101, 102, 103, 104, 102]
product_sales['quantity'] = [10, 20, 30, 40, 15]

/create returns
returns['return_id'] = [1, 2, 3]
returns['product_id'] = [102, 104, 106]
returns['quantity'] = [2, 5, 1]

Hint: Use INTERSECT to find shared product_ids.

Expected Output: A list of product_ids present in both tables.


Puzzle 38: Monthly Sales Pivot (Medium)

Task: Pivot the product_sales2 table so months become columns and rows show product_id with total sales per month.

Setup Code:

/create product_sales2
product_sales2['month'] = ['2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06']
product_sales2['product_id'] = [101, 102, 103, 104, 102, 103]
product_sales2['quantity'] = [10, 20, 30, 40, 15, 20]
product_sales2['price'] = [1, 2, 3, 4, 2, 3]

Hint: Use CASE expressions with SUM() and GROUP BY, or a PIVOT clause if supported.

Expected Output: A pivoted table with product_id and monthly sales columns.


Puzzle 39: Survey Data Unpivot (Medium)

Task: Unpivot a survey table with columns response_id, q1, q2, and q3 into a tall format.

Setup Code: (Create survey table with sample answers for q1, q2, and q3.)

/create survey
survey['response_id'] = [101, 102, 103, 104]
survey['q1'] = ['High', 'Medium', 'Low', 'High']      # satisfaction level
survey['q2'] = ['Yes', 'No', 'Yes', 'No']            # would recommend
survey['q3'] = ['Daily', 'Weekly', 'Monthly', 'Daily']  # usage frequency

Hint: Use UNION ALL or the UNPIVOT operator.

Expected Output: Columns response_id, question, and answer.


Puzzle 40: JSON Data Extraction (Medium)

Task: Extract nested values from a JSON column in a json_data table, e.g., get user.id and order.total.

Setup Code: (Create json_data table with a data JSON column.)

/create json_data
json_data['id'] = [1, 2, 3, 4]
json_data['data'] = [
    '{"user": {"id": 101, "name": "John"}, "order": {"total": 100}}',
    '{"user": {"id": 102, "name": "Jane"}, "order": {"total": 200}}',
    '{"user": {"id": 103, "name": "Bob"}, "order": {"total": 300}}',
    '{"user": {"id": 104, "name": "Alice"}, "order": {"total": 400}}'
]

Hint: Use JSON_EXTRACT(), ->>, or equivalent functions.

Expected Output: A table with user_id and order_total.


Puzzle 41: Full-Text Search (Medium)

Task: Find articles containing the keyword "database" in an articles table.

Setup Code:

/create articles
articles['article_id'] = [1, 2, 3, 4, 5]
articles['content'] = ['database management', 'design', 'database optimization', 'warehouse performance', 'internal database security']

Hint: Use LIKE '%database%' or full-text search functions if available.

Expected Output: Rows of article_id and a snippet of content.


Puzzle 42: Moving Average Excluding Current Row (Hard)

Task: Calculate a 3-point moving average in a measurements table, excluding the current row.

Setup Code:

/create measurements
measurements['reading_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
measurements['value'] = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

Hint: Define window frame with ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING.

Expected Output: Columns reading_id, value, and moving_avg.


Puzzle 43: Top Percentile Employees (Medium)

Task: Identify employees in the top 10% salary bracket from the employees2 table in Puzzle 33.

Setup Code: (Use employees2 table from Puzzle 33.)

Hint: Use CUME_DIST() or PERCENT_RANK().

Expected Output: A list of employee_id and salary for the top 10%.


Puzzle 44: Conditional Aggregation by Status (Easy)

Task: Count the number of orders in each status category from the order_status table.

Setup Code:

/create order_status
order_status['order_id'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
order_status['status'] = ['Shipped', 'Pending', 'Shipped', 'Delivered', 'Pending', 'Shipped', 'Delivered', 'Shipped', 'Pending', 'Delivered']

Hint: Use SUM(CASE WHEN status = '...' THEN 1 ELSE 0 END) grouped by status.

Expected Output: Counts for each status value.


Puzzle 45: Latest Order Per Customer (Hard)

Task: For each customer, show their most recent order using customers and purchases tables from Puzzle 17.

Setup Code: (Use customers and purchases tables.) Add an order_date column to purchases table

purchases['order_date'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08']

Hint: Use a lateral join or subquery with MAX(order_date).

Expected Output: Columns customer_id, customer_name, and latest order_id.


Puzzle 46: Geospatial Distance Filter (Hard) - maybe too difficult

Task: Given a locations table with lat and lon, find points within 10 km of a reference point.

Setup Code: (Create locations table with sample latitude/longitude.)

/create locations
locations['id'] = [1, 2, 3, 4, 5]
locations['lat'] = [40.7128, 34.0522, 37.7749, 47.6062, 33.7489]
locations['lon'] = [-74.006, -118.2437, -122.4194, -122.3321, -118.2437]

Hint: Use the Haversine formula or spatial functions if supported.

Expected Output: Rows of locations within the radius.


Puzzle 47: Inactive Customers (Easy)

Task: Find customers who have not logged in for over 30 days using a last_login column in customers.

Setup Code: (Use customers table with last_login date.)

# Add a last_login column to customers table
customers['last_login'] = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2025-04-29']

Hint: Use date functions like julianday(CURRENT_DATE) - julianday(last_login) > 30.

Expected Output: List of customer_id and last_login older than 30 days.


Puzzle 48: Data Masking (Medium)

Task: Mask the middle digits of a Social Security number in a users_with_PII table, showing only the first and last two digits.

Setup Code: (Create users_with_PII table with user_id and ssn.)

/create users_with_PII
users_with_PII['user_id'] = range(1, 11)
# Generate some plausible SSN-like strings (not real SSNs)
import random
ssns = [f'{random.randint(100, 999):03}-{random.randint(10, 99):02}-{random.randint(1000, 9999):04}' for _ in range(10)]
users_with_PII['ssn'] = ssns

Hint: Use SUBSTR(), CONCAT() or equivalent string functions.

Expected Output: Columns user_id and masked_ssn.


Puzzle 49: Dense Ranking Scores (Medium)

Task: Assign a dense rank to players in a scores table based on their score, showing ties without gaps.

Setup Code: (Create scores table with player_id and score.)

/create scores
scores['player_id'] = range(1, 11)
scores['score'] = [850, 920, 780, 920, 850, 600, 780, 990, 700, 920] # Include ties

Hint: Use DENSE_RANK() OVER (ORDER BY score DESC).

Expected Output: Columns player_id, score, and dense_rank.


Puzzle 50: Net Sales Calculation with CTEs (Hard)

Task: Calculate net sales by subtracting returns from total sales using CTEs on product_sales and returns tables.

Setup Code: (Use product_sales table from Puzzle 38 and returns table from Puzzle 37.)

Hint: Use multiple WITH clauses to aggregate sales and returns before joining.

Expected Output: Columns product_id and net_sales.


Puzzle 51: Get counts of unique values for each column (Easy)

Task: For a given table, return the count of unique values for each column.

Setup Code: (Create sales table from Puzzle 12.)

Hint: Use COUNT(DISTINCT column_name) for each column.

Expected Output: A table with columns column_name and unique_count.


Puzzle 52: Given a table with columns a, b, and c, find the number of non-null values in each column

Task: For a given table, return the count of non-null values for each column.

Setup Code: (Load sensor_readings table from above.)

Hint: Use COUNT(column_name) for each column.

Expected Output: A table with columns column_name and non_null_count.


Puzzle 53: Differences between tables (Easy)

Task: Given two tables C and D, count the number of rows whose id occurs only in C and not D, and vice versa (D and not C).

Setup Code: (Create tables C and D.)

# Run these lines one by one in the CLI
/create C
C['id_col'] = [1, 2, 3, 4, 5, 9, 9, 8] # 5 rows w/ id only in C
C['value'] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

/create D
D['id_col'] = [4, 5, 6, 7, 8, 11, 11, 12, 13, 14, 15] # 8 rows w/ id only in D
D['value'] = ['I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S']

Hint: Use EXCEPT or LEFT JOIN to find unique rows in each table.

Expected Output: A table with columns id_col and value showing the unique rows in each table.

Schema C: id_col (int), value (string)

Schema D: id_col (int), value (string)

Puzzle 54: Imputation

Task: Given a table with columns id, education, and income, impute the missing values of income using the average income of people with the same education level.

Setup Code: (Create imputation_table with columns id, education, and income.)

# Run these lines one by one in the CLI
/create imputation_table
imputation_table['id'] = [1, 2, 3, 4, 5]
imputation_table['education'] = [12, 16, 12, 13, 17]
imputation_table['income'] = [0.45, 0.65, None, 0.85, None]

Hint: Use AVG() with GROUP BY to calculate the average income for each education level.

Expected Output: A table with columns id, education, and income with imputed values.