Beam and Dataflow Setup Guide
Follow this guide to set up your Apache Beam and Dataflow environments for this class.
- Enable the Dataflow API:
GCP Console -> Navigation Menu -> APIs & Services -> Library -> enter Dataflow in the search bar -> click Enable
- Create a Cloud Storage bucket:
GCP Console -> Navigation Menu -> Storage -> Browser -> Create Bucket
Bucket name: [group name]-[some unique suffix]
Location type: Region
Location: us-central1 (Iowa)
Storage Class: Standard
Create 3 folders in your bucket with these names: staging, temp, output.
Note: bucket names must be globally unique on GCP, which is why you may need to add a unique suffix to your group name. The 3 folders will be used by the WordCount example to store the outputs from the job.
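If you prefer to script the bucket setup, here is a minimal sketch using the google-cloud-storage client library (an assumption: it is not installed by the steps in this guide, so you would need `pip install google-cloud-storage` plus working GCP credentials). The project id and bucket name below are placeholders, and `create_course_bucket` is an illustrative helper name, not part of any API:

```python
FOLDERS = ["staging", "temp", "output"]

def placeholder_objects(folders):
    """GCS has no real folders: uploading an empty object whose name ends
    in '/' makes that prefix show up as a folder in the Storage browser."""
    return [f if f.endswith("/") else f + "/" for f in folders]

def create_course_bucket(project_id, bucket_name):
    # Assumes google-cloud-storage is installed and credentials are
    # available (e.g. the notebook instance's service account).
    from google.cloud import storage

    client = storage.Client(project=project_id)
    bucket = client.create_bucket(bucket_name, location="us-central1")
    for name in placeholder_objects(FOLDERS):
        bucket.blob(name).upload_from_string(b"")
    return bucket

# Example call with placeholder values:
# create_course_bucket("my-project-id", "my-group-xyz123")
```

Creating the folders up front is optional in practice, since GCS "folders" are just object-name prefixes, but it matches the console steps above.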
- Start up your Jupyter notebook instance and go to JupyterLab.
- Bring up a terminal window by going to File -> New -> Terminal.
- Create a virtual environment and install Apache Beam by entering these commands in the terminal:
$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install 'apache-beam[gcp]'
$ pip install 'apache-beam[interactive]'
- Install ipykernel for Beam by entering these commands in the terminal:
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_kernel --display-name "Python Beam"
$ jupyter kernelspec list
Expected output:
Available kernels:
beam_kernel /home/jupyter/.local/share/jupyter/kernels/beam_kernel
python3 /home/jupyter/venv/share/jupyter/kernels/python3
Note: To run a Beam pipeline from a Python notebook, choose the Python Beam kernel from the Kernel menu.
- Test your Apache Beam setup by running the WordCount example using the Direct Runner:
python -m apache_beam.examples.wordcount --output wordcount.out
If you see any errors in stdout, stop and debug. Otherwise, open wordcount.out-00000-of-00001 and examine the output.
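To sanity-check what you should see in that file: WordCount tokenizes its input and writes one `word: count` line per distinct word. A rough pure-Python sketch of that logic (not the actual Beam pipeline, and `count_words` is an illustrative name):

```python
import re
from collections import Counter

def count_words(lines):
    """Tokenize each line into words and count occurrences, mirroring
    the 'word: count' lines WordCount writes to its output shard."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[\w']+", line))
    return [f"{word}: {n}" for word, n in counts.items()]

# count_words(["to be or not to be"]) yields lines such as "to: 2" and "be: 2"
```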
- Test your Dataflow setup by running the WordCount example using the Dataflow Runner. Replace the PROJECT_ID and BUCKET values below with your own project id and bucket name:
export PROJECT_ID=cs327e-sp2021
export BUCKET=beam_cs327e_data
export REGION=us-central1
python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--region $REGION \
--runner DataflowRunner \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output
Go to the Dataflow console, find the running job, and examine the job details. Open the GCS console, go to your bucket, open the 3 folders and view the contents of the files.
Note: the output files will actually be placed in the root of your BUCKET instead of the output folder.
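This happens because --output is treated as a filename prefix, not a directory: Beam's default naming appends a shard suffix of the form -SSSSS-of-NNNNN directly to the prefix. A sketch of that naming (the helper name here is illustrative, not Beam's API):

```python
def shard_file_name(prefix, shard_index, num_shards):
    """Beam's default output naming: <prefix>-SSSSS-of-NNNNN.
    With prefix gs://BUCKET/output, the shards land at the bucket root
    because 'output' becomes part of the object name, not a folder."""
    return f"{prefix}-{shard_index:05d}-of-{num_shards:05d}"

# e.g. shard_file_name("gs://my-bucket/output", 0, 1)
#   -> "gs://my-bucket/output-00000-of-00001"
```

Passing a prefix such as gs://$BUCKET/output/part instead would place the shards inside the output folder.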
If the wordcount job completed without errors, your Dataflow setup is complete.