Yellow Taxi Trip Records for January 2024 - June 2024 from the NYC TLC Trip Record Data.
Used the provided load_yellow_taxi_data.py script to download 6 parquet files and upload them to GCS bucket dezoomcamp-hw3-2026.
python3 load_yellow_taxi_data.pyCreate External Table:
CREATE OR REPLACE EXTERNAL TABLE `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_external`
OPTIONS (
format = "PARQUET",
uris = ["gs://dezoomcamp-hw3-2026/yellow_tripdata_2024-*.parquet"]
);Create Materialized Table (no partitioning or clustering):
CREATE OR REPLACE TABLE `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`
AS
SELECT * FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_external`;What is the count of records for the 2024 Yellow Taxi Data?
SELECT COUNT(*) as total_records
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;Result: 20,332,093
Answer: 20,332,093
What is the estimated amount of data read when counting distinct PULocationIDs on the External Table vs the Materialized Table?
-- External Table
SELECT COUNT(DISTINCT PULocationID)
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_external`;
-- Estimated: 0 MB
-- Materialized Table
SELECT COUNT(DISTINCT PULocationID)
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;
-- Estimated: ~155.12 MB (162,656,744 bytes)BigQuery cannot estimate scan size for external tables (shows 0 MB), but for native tables it knows the column size precisely.
Answer: 0 MB for the External Table and 155.12 MB for the Materialized Table
Why are the estimated bytes different when querying 1 column vs 2 columns from the materialized table?
-- Query 1 column
SELECT PULocationID
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;
-- Estimated: 162,656,744 bytes (~155 MB)
-- Query 2 columns
SELECT PULocationID, DOLocationID
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;
-- Estimated: 325,313,488 bytes (~310 MB)The 2-column query estimates exactly double the bytes of the 1-column query because each column is stored independently at ~155 MB.
Answer: BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.
How many records have a fare_amount of 0?
SELECT COUNT(*) as zero_fare_trips
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`
WHERE fare_amount = 0;Result: 8,333
Answer: 8,333
What is the best strategy to make an optimized table if your query will always filter on tpep_dropoff_datetime and order by VendorID?
CREATE OR REPLACE TABLE `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_partitioned_clustered`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID
AS
SELECT * FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;- Partition on the filter column (datetime) to prune entire data blocks
- Cluster on the sort/order column for efficient ordering within partitions
Answer: Partition by tpep_dropoff_datetime and Cluster on VendorID
Compare estimated bytes for distinct VendorIDs between 2024-03-01 and 2024-03-15 on both tables.
-- Non-partitioned table
SELECT DISTINCT VendorID
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`
WHERE tpep_dropoff_datetime BETWEEN "2024-03-01" AND "2024-03-15";
-- Estimated: 325,313,488 bytes (~310.24 MB)
-- Partitioned table
SELECT DISTINCT VendorID
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_partitioned_clustered`
WHERE tpep_dropoff_datetime BETWEEN "2024-03-01" AND "2024-03-15";
-- Estimated: 28,141,776 bytes (~26.84 MB)Partitioning reduced the scan by ~91% because BigQuery only reads the partitions covering March 1-15.
Answer: 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table
Where is the data stored in the External Table?
External tables don't store data in BigQuery — they reference files stored in Google Cloud Storage.
Answer: GCP Bucket
Is it best practice in BigQuery to always cluster your data?
Clustering adds overhead and isn't beneficial for small tables or when queries don't filter/sort on the clustered columns. It's best used for large tables with frequent filtering on specific columns.
Answer: False
How many bytes does a SELECT COUNT(*) query estimate on the materialized table? Why?
SELECT COUNT(*)
FROM `aiagentsintensive.dezoomcamp_hw3.yellow_taxi_materialized`;
-- Estimated: 0 bytesAnswer: 0 bytes. BigQuery maintains internal metadata for each table, including the total row count. Since SELECT COUNT(*) with no WHERE clause doesn't need to read any actual column data, BigQuery resolves it directly from its cached metadata, resulting in 0 bytes processed.