
ETL Job - Cloud-Dataflow-Batch-Processing

  1. Store the dataset (.csv) in a Google Cloud Storage bucket.
  2. Create a Dataflow batch job that reads and processes the csv file.
  3. In the Dataflow job, apply a "Group By" transform to count the listings by the "neighbourhood" field.
  4. Store both the original csv data and the transformed data in their own separate BigQuery tables.
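The "Group By" count in step 3 amounts to the following logic, sketched here in plain Python. The `neighbourhood` field name comes from the steps above; the sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter

# Stand-in for the real dataset; these rows are invented for illustration.
SAMPLE_CSV = """id,name,neighbourhood
1,Cozy loft,Brooklyn
2,Sunny studio,Manhattan
3,Shared room,Brooklyn
"""

def count_by_neighbourhood(csv_text):
    """Group rows by the 'neighbourhood' field and count listings per group."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["neighbourhood"] for row in reader)

counts = count_by_neighbourhood(SAMPLE_CSV)
print(counts)  # Counter({'Brooklyn': 2, 'Manhattan': 1})
```

In the actual Dataflow job this per-key counting is what a "Group By"/count transform performs at scale, with the results written to the transformed-data BigQuery table.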


gcloud commands - connect to Cloud Shell, set the project, and verify

  • gcloud auth list

  • gcloud config list project

  • export PROJECT=""

  • gcloud config set project $PROJECT

  • gsutil mb -c regional -l us-east4 gs://$PROJECT

  • gsutil cp ./datafilename.csv gs://$PROJECT/

  • bq mk <dataset_name>

  • export GOOGLE_APPLICATION_CREDENTIALS="/filename.json"

  • bq show -j --project_id=<project_id> <dataflow_job_id>
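Step 4 of the pipeline writes to two BigQuery tables. A table schema for the raw listings could look like the following JSON fragment; field names other than `neighbourhood` are assumptions, since only that column is named in this README.

```json
[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "neighbourhood", "type": "STRING", "mode": "NULLABLE"}
]
```

Saved as a schema file, this could be passed to `bq mk --table` when creating the destination table ahead of the Dataflow job.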

Additional setup and apache-beam install

  • python2.7 -m virtualenv env

  • source env/bin/activate

  • deactivate (after job)

  • pip install -r requirements.txt (environment setup)

  • pip install apache-beam
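The setup bullets above, in order, as a shell session; this assumes Python 2.7 and virtualenv are already installed, and `env` is the directory name implied by the activate command.

```shell
# Create and enter an isolated Python 2.7 environment for the pipeline.
python2.7 -m virtualenv env
source env/bin/activate

# Install dependencies: pinned requirements (if present) plus the Beam SDK.
pip install -r requirements.txt
pip install apache-beam

# ... run the pipeline ...

# Leave the environment once the job is done.
deactivate
```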

Execute pipeline (Local DirectRunner)

  • python local_directrunner_pipeline.py

Execute pipeline (DataFlow)

  • python dataflow_pipeline.py
    --project=$PROJECT
    --runner=DataflowRunner
    --staging_location=gs://$PROJECT/temp
    --temp_location=gs://$PROJECT/temp
    --input=gs://$PROJECT/datafilename.csv
    --save_main_session
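A minimal sketch of how `dataflow_pipeline.py` might consume the custom `--input` flag while leaving the remaining arguments (`--project`, `--runner`, etc.) for Beam's pipeline options. The script internals are an assumption, since the README only shows the invocation.

```python
import argparse

def parse_args(argv):
    """Split this pipeline's own flags from the ones Beam itself consumes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True,
                        help="GCS path of the csv file, e.g. gs://bucket/datafilename.csv")
    # Everything else (--project, --runner, --temp_location, ...) is left
    # untouched so it can be handed to the Beam PipelineOptions.
    known, beam_args = parser.parse_known_args(argv)
    return known, beam_args

known, beam_args = parse_args([
    "--input=gs://my-project/datafilename.csv",
    "--project=my-project",
    "--runner=DataflowRunner",
])
print(known.input)   # gs://my-project/datafilename.csv
print(beam_args)     # ['--project=my-project', '--runner=DataflowRunner']
```

Using `parse_known_args` (rather than `parse_args`) is what lets one command line serve both the script and the runner without either rejecting the other's flags.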