ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in a file, please include it directly in the README file of your repository.
If your answer does not match any option exactly, select the closest one.
For the homework, we'll be working with the green taxi dataset located here:
https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download
To get a wget-able link, use this prefix (note that the link itself gives 404):
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/
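
A minimal sketch of how a full download link could be assembled from that prefix. The base URL comes from the text above; the `.csv.gz` suffix, the equivalent `yellow/` prefix, and the helper name `build_download_url` are assumptions for illustration, so check the release page for the actual asset names.

```python
# Base of the wget-able prefix quoted above. The per-taxi sub-path for
# "yellow" and the ".csv.gz" suffix are assumptions, not confirmed here.
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def build_download_url(taxi: str, year: int, month: int) -> str:
    """Return the assumed download URL for one month of taxi trip data."""
    if taxi not in ("yellow", "green"):
        raise ValueError(f"unsupported taxi type: {taxi!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # File names follow the <taxi>_tripdata_<year>-<month> pattern used in the flows.
    file_name = f"{taxi}_tripdata_{year}-{month:02d}.csv.gz"
    return f"{BASE_URL}/{taxi}/{file_name}"


if __name__ == "__main__":
    print(build_download_url("green", 2021, 1))
```

The resulting string can be passed to `wget` or any HTTP client.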
So far in the course, we processed data for the years 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.
As a hint, Kestra makes that process really easy:
- You can leverage the backfill functionality in the scheduled flow to backfill the data for the year 2021. Just make sure to select the time period for which data exists, i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
- Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using a `ForEach` task which triggers the flow for each combination using a `Subflow` task.
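
To make the set of combinations concrete, here is a small Python sketch (not a Kestra flow) that enumerates every Year-Month and taxi-type pair the backfill or loop would need to cover. The input names `taxi`, `year`, and `month` mirror the flow inputs mentioned in the quiz; passing the values as strings and the helper name `backfill_inputs` are assumptions.

```python
from itertools import product

TAXI_TYPES = ["yellow", "green"]
YEAR = 2021
MONTHS = range(1, 8)  # January through July, the period for which data exists


def backfill_inputs() -> list[dict[str, str]]:
    """Build one input set per (month, taxi) pair, shaped like the flow's inputs."""
    return [
        {"taxi": taxi, "year": str(YEAR), "month": f"{month:02d}"}
        for month, taxi in product(MONTHS, TAXI_TYPES)
    ]


if __name__ == "__main__":
    for inputs in backfill_inputs():
        print(inputs)
```

Seven months times two taxi types gives 14 executions in total, which is the number of runs to expect whichever approach you choose.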
Complete the quiz shown below. It's a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra, and ETL pipelines.
- Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
  - 128.3 MiB
  - 134.5 MiB
  - 364.7 MiB
  - 692.6 MiB
- What is the rendered value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?
  - `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv`
  - `green_tripdata_2020-04.csv`
  - `green_tripdata_04_2020.csv`
  - `green_tripdata_2020.csv`
- How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?
  - 13,537,299
  - 24,648,499
  - 18,324,219
  - 29,430,127
- How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?
  - 5,327,301
  - 936,199
  - 1,734,051
  - 1,342,034
- How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?
  - 1,428,092
  - 706,911
  - 1,925,152
  - 2,561,031
- How would you configure the timezone to New York in a Schedule trigger?
  - Add a `timezone` property set to `EST` in the `Schedule` trigger configuration
  - Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration
  - Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration
  - Add a `location` property set to `New_York` in the `Schedule` trigger configuration
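
The file-size and row-count questions above can also be cross-checked locally once a CSV file has been downloaded and decompressed. The sketch below is one possible way to do that with the Python standard library; it assumes each file has a single header row, and the helper names `count_rows` and `size_mib` are made up for illustration. It does not replace the values reported by your Kestra executions or database queries.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, excluding the (assumed) header row."""
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if the file is not empty
        return sum(1 for _ in reader)


def size_mib(path: Path) -> float:
    """Return the file size in MiB (1 MiB = 1024 * 1024 bytes)."""
    return path.stat().st_size / (1024 * 1024)
```

For a yearly total, call `count_rows` on each monthly file and sum the results.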
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw2
- Check the link above to see the due date
The solution will be added after the due date.
