
ETL Job - Cloud-Dataflow-Batch-Processing

  1. Store the dataset (.csv) in a Google Cloud Storage bucket.
  2. Create a Dataflow batch job that reads and processes the csv file.
  3. In the Dataflow job, apply a "Group By" transform to count the listings by the "neighbourhood" field.
  4. Store both the original csv data and the transformed data in their own separate BigQuery tables.
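The "Group By" count in step 3 amounts to the following logic, sketched here in plain Python. The `neighbourhood` field name comes from the steps above; the sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter

# Stand-in for the real dataset; these rows are invented for illustration.
SAMPLE_CSV = """id,name,neighbourhood
1,Cozy loft,Brooklyn
2,Sunny studio,Manhattan
3,Shared room,Brooklyn
"""

def count_by_neighbourhood(csv_text):
    """Group rows by the 'neighbourhood' field and count listings per group."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["neighbourhood"] for row in reader)

counts = count_by_neighbourhood(SAMPLE_CSV)
print(counts)  # Counter({'Brooklyn': 2, 'Manhattan': 1})
```

In the actual Dataflow job this per-key counting is what a "Group By"/count transform performs at scale, with the results written to the transformed-data BigQuery table.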


gcloud commands - connect to Cloud Shell, set the project, and verify

  • gcloud auth list

  • gcloud config list project

  • export PROJECT=""

  • gcloud config set project $PROJECT

  • gsutil mb -c regional -l us-east4 gs://$PROJECT

  • gsutil cp ./datafilename.csv gs://$PROJECT/

  • bq mk <dataset_name>

  • export GOOGLE_APPLICATION_CREDENTIALS="/filename.json"

  • bq show -j --project_id=<project_id> <dataflow_job_id>
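Step 4 of the pipeline writes to two BigQuery tables. A table schema for the raw listings could look like the following JSON fragment; field names other than `neighbourhood` are assumptions, since only that column is named in this README.

```json
[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "neighbourhood", "type": "STRING", "mode": "NULLABLE"}
]
```

Saved as a schema file, this could be passed to `bq mk --table` when creating the destination table ahead of the Dataflow job.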

Additional setup and apache-beam install

  • python2.7 -m virtualenv env

  • source env/bin/activate

  • deactivate (after job)

  • pip install -r requirements.txt (environment setup)

  • pip install apache-beam
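The setup bullets above, in order, as a shell session; this assumes Python 2.7 and virtualenv are already installed, and `env` is the directory name implied by the activate command.

```shell
# Create and enter an isolated Python 2.7 environment for the pipeline.
python2.7 -m virtualenv env
source env/bin/activate

# Install dependencies: pinned requirements (if present) plus the Beam SDK.
pip install -r requirements.txt
pip install apache-beam

# ... run the pipeline ...

# Leave the environment once the job is done.
deactivate
```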

Execute pipeline (Local DirectRunner)

  • python local_directrunner_pipeline.py

Execute pipeline (DataFlow)

  • python dataflow_pipeline.py
    --project=$PROJECT
    --runner=DataflowRunner
    --staging_location=gs://$PROJECT/temp
    --temp_location=gs://$PROJECT/temp
    --input=gs://$PROJECT/datafilename.csv
    --save_main_session
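A minimal sketch of how `dataflow_pipeline.py` might consume the custom `--input` flag while leaving the remaining arguments (`--project`, `--runner`, etc.) for Beam's pipeline options. The script internals are an assumption, since the README only shows the invocation.

```python
import argparse

def parse_args(argv):
    """Split this pipeline's own flags from the ones Beam itself consumes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True,
                        help="GCS path of the csv file, e.g. gs://bucket/datafilename.csv")
    # Everything else (--project, --runner, --temp_location, ...) is left
    # untouched so it can be handed to the Beam PipelineOptions.
    known, beam_args = parser.parse_known_args(argv)
    return known, beam_args

known, beam_args = parse_args([
    "--input=gs://my-project/datafilename.csv",
    "--project=my-project",
    "--runner=DataflowRunner",
])
print(known.input)   # gs://my-project/datafilename.csv
print(beam_args)     # ['--project=my-project', '--runner=DataflowRunner']
```

Using `parse_known_args` (rather than `parse_args`) is what lets one command line serve both the script and the runner without either rejecting the other's flags.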