
Beam and Dataflow Setup Guide

Shirley Cohen edited this page May 1, 2021 · 4 revisions

Follow this guide to set up your Apache Beam and Dataflow environments for this class.

  1. Enable the Dataflow API:

    GCP Console -> Navigation Menu -> APIs & Services -> Library -> enter Dataflow in the search bar -> click Enable
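The same step can be done from a terminal with the gcloud CLI (a sketch, assuming the Cloud SDK is installed and your class project is the active configuration):

```shell
# Enable the Dataflow API for the currently active project
gcloud services enable dataflow.googleapis.com

# Confirm it shows up in the list of enabled services
gcloud services list --enabled --filter="name:dataflow.googleapis.com"
```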

  2. Create a Cloud Storage bucket:

    GCP Console -> Navigation Menu -> Storage -> Browser -> Create Bucket

    Bucket name: [group name]-[some unique suffix]
    Location type: Region
    Location: us-central1 (Iowa)
    Storage Class: Standard
    Create 3 folders in your bucket with these names: staging, temp, output

    Note: bucket names must be globally unique across Cloud Storage, which is why you may need to add a unique suffix to your group name. The three folders will be used by the WordCount job for staging files, temporary files, and output.
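If you prefer the command line, the bucket and folders can also be created with gsutil (a sketch; `my-group-xyz123` is a placeholder for your own bucket name):

```shell
# Create a regional Standard-class bucket in us-central1
gsutil mb -c standard -l us-central1 gs://my-group-xyz123

# Cloud Storage has no true folders, so upload a zero-byte
# placeholder object to make each folder visible in the console
touch placeholder
for d in staging temp output; do
  gsutil cp placeholder "gs://my-group-xyz123/$d/placeholder"
done
rm placeholder
```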

  3. Start up your Jupyter notebook instance and go to Jupyter Lab.

  4. Bring up a terminal window by going to File -> New -> Terminal.

  5. Create a virtual environment and install Apache Beam by entering these commands in the terminal:

$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install apache-beam[gcp]
$ pip install apache-beam[interactive]
  6. Install an IPython kernel for Beam by entering these commands in the terminal:
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_kernel --display-name "Python Beam"
$ jupyter kernelspec list

Expected output:

Available kernels:
  beam_kernel    /home/jupyter/.local/share/jupyter/kernels/beam_kernel
  python3        /home/jupyter/venv/share/jupyter/kernels/python3

Note: To run a beam pipeline from a python notebook, choose the Python Beam kernel from the Kernel menu.

  7. Test your Apache Beam setup by running the WordCount example using the Direct Runner:
python -m apache_beam.examples.wordcount --output wordcount.out

If any errors appear in stdout, stop and debug before continuing. Then open wordcount.out-00000-of-00001 and examine the output.
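Each line of the output file has the form `word: count`. As a quick sanity check, you can sort the file by count to see the most frequent words first; the sketch below uses a hypothetical three-line sample file in place of the real output:

```shell
# Hypothetical sample in the same "word: count" format as the
# WordCount output shard
printf 'king: 311\nlord: 225\nthe: 5000\n' > sample.out

# Sort by the numeric count (field 2, colon-separated), highest first
sort -t: -k2,2 -rn sample.out | head -3
```

With the sample above, `the: 5000` is printed first. Run the same `sort` against your real wordcount.out-00000-of-00001 to spot-check the counts.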

  8. Test your Dataflow setup by running the WordCount example using the Dataflow Runner. The exports below are examples; set PROJECT_ID and BUCKET to your own project ID and bucket name.
export PROJECT_ID=cs327e-sp2021
export BUCKET=beam_cs327e_data
export REGION=us-central1
python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--region $REGION \
--runner DataflowRunner \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output

Go to the Dataflow console, find the running job, and examine the job details. Then open the GCS console, go to your bucket, and view the contents of the three folders.

Note: because the --output value is treated as a filename prefix, the output files will actually be placed at the root of your bucket (with an output prefix) instead of inside the output folder.
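This placement can be confirmed from the terminal with gsutil (assuming the BUCKET variable is still exported from the step above):

```shell
# The output shards sit at the bucket root with an "output" prefix,
# e.g. output-00000-of-00003
gsutil ls gs://$BUCKET/

# Print the beginning of the first output shard
gsutil cat "gs://$BUCKET/output-00000-of-"* | head
```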

If the wordcount job completed without errors, your Dataflow setup is complete.
