Beam and Dataflow Setup Guide

Shirley Cohen edited this page May 1, 2021 · 4 revisions

Follow this guide to set up your Apache Beam and Dataflow environments for this class.

  1. Enable the Dataflow API:

    GCP Console -> Navigation Menu -> APIs & Services -> Library -> enter Dataflow in the search bar -> click Enable

  2. Create a Cloud Storage bucket:

    GCP Console -> Navigation Menu -> Storage -> Browser -> Create Bucket

    Bucket name: [group name]-[some unique suffix]
    Location type: Region
    Location: us-central1 (Iowa)
    Storage Class: Standard
    Create 3 folders in your bucket with these names: staging, temp, output

    Note: bucket names must be globally unique across GCP, which is why you may need to add a unique suffix to your group name. The 3 folders will be used by the WordCount Dataflow job for staging files, temporary files, and output, respectively.

  3. Start up your Jupyter notebook instance and go to Jupyter Lab.

  4. Bring up a terminal window by going to File -> New -> Terminal.

  5. Create a virtual environment and install Apache Beam by entering these commands in the terminal:

$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install 'apache-beam[gcp]'
$ pip install 'apache-beam[interactive]'
  6. Install an IPython kernel for Beam by entering these commands in the terminal:
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_kernel --display-name "Python Beam"
$ jupyter kernelspec list

Expected output:

Available kernels:
  beam_kernel    /home/jupyter/.local/share/jupyter/kernels/beam_kernel
  python3        /home/jupyter/venv/share/jupyter/kernels/python3

Note: To run a Beam pipeline from a Python notebook, choose the Python Beam kernel from the Kernel menu.

  7. Test your Apache Beam setup by running the WordCount example using the Direct Runner:
python -m apache_beam.examples.wordcount --output wordcount.out

If you see any errors in stdout, stop and debug. Otherwise, open wordcount.out-00000-of-00001 and examine the output.
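Conceptually, the WordCount example tokenizes each line of the input text and counts word occurrences. A plain-Python sketch of the same logic (this is an illustration, not the actual Beam source):

```python
import re
from collections import Counter

def word_count(lines):
    """Tokenize each line and tally word occurrences,
    mirroring what apache_beam.examples.wordcount computes."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    return dict(counts)

# Example: two lines of input
print(word_count(["to be or not to be", "to beam"]))
```

Beam distributes this same tokenize-then-count computation across workers, which is why the output is sharded into files like wordcount.out-00000-of-00001.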

  8. Test your Dataflow setup by running the WordCount example using the Dataflow Runner. Replace the values of PROJECT_ID and BUCKET below with your own project id and bucket name:
export PROJECT_ID=cs327e-sp2021
export BUCKET=beam_cs327e_data
export REGION=us-central1
python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--region $REGION \
--runner DataflowRunner \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output/

Go to the Dataflow console, find the running job, and examine the job details. Open the GCS console, go to your bucket, open the 3 folders and view the contents of the files. If the wordcount job completed without errors, your Dataflow setup is complete.