|
1 | | -# data-etl |
2 | | -ETL jobs managed by data engineering |
| 1 | +# Docker ETL |
| 2 | + |
| 3 | +This repo collects dockerized ETL jobs in one place so that the source code |
| 4 | +of scheduled ETL is easier to discover. |
| 5 | +It also contains tools that automate the common steps of creating and |
| 6 | +scheduling an ETL job: |
| 7 | +defining a Docker image, setting up CI, and generating language boilerplate. |
| 8 | +The primary use of this repo is to create Dockerized jobs that are pushed to GCR |
| 9 | +so they can be scheduled via the Airflow GKE pod operator. |
| 10 | + |
| 11 | +## Project Structure |
| 12 | + |
| 13 | +### Jobs |
| 14 | + |
| 15 | +Each job lives in its own directory under `jobs/`, |
| 16 | +e.g. the contents of a job named `my-job` go into `jobs/my-job`. |
| 17 | + |
| 18 | +Every job directory should contain a `Dockerfile`, a `ci_job.yaml`, |
| 19 | +a `ci_workflow.yaml`, and a `README.md` at its top level. |
| 20 | +`ci_job.yaml` and `ci_workflow.yaml` contain the YAML that will be placed |
| 21 | +in the `jobs:` and `workflows:` sections of the CircleCI `config.yml`, respectively. |
| 22 | + |
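The required-file rule above can be checked mechanically. The following is a minimal sketch, not part of this repo's tooling: the helper name `missing_job_files` is hypothetical, and the list of required files is taken from the paragraph above.

```python
from pathlib import Path
from typing import List

# Required files as listed in the Jobs section; the helper itself is hypothetical.
REQUIRED_JOB_FILES = ("Dockerfile", "ci_job.yaml", "ci_workflow.yaml", "README.md")


def missing_job_files(job_dir: Path) -> List[str]:
    """Return the required file names that are absent from job_dir."""
    return [name for name in REQUIRED_JOB_FILES if not (job_dir / name).is_file()]


if __name__ == "__main__":
    jobs_root = Path("jobs")
    if jobs_root.is_dir():
        # Report every job directory that lacks one of the required files.
        for job_dir in sorted(p for p in jobs_root.iterdir() if p.is_dir()):
            missing = missing_job_files(job_dir)
            if missing:
                print(f"{job_dir}: missing {', '.join(missing)}")
```

Run from the repository root, this prints one line per incomplete job directory and nothing when every job is complete.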
| 23 | +### Templates |
| 24 | + |
| 25 | +Templates for job creation and the CI config file are located in `templates/`. |
| 26 | + |
| 27 | +The CI config template is in `templates/config.template.yml`. |
| 28 | +Modify this file instead of the generated `.circleci/config.yml`. |
| 29 | + |
| 30 | +Each job template is located in a directory in `templates/` that is the name of the template, |
| 31 | +e.g. a `python` template is in `templates/python/`. |
| 32 | +Within the directory of a template is a directory named `job/` that contains |
| 33 | +all the contents that will be copied when the template is used. |
| 34 | +Other files in the directory of a particular template are used for |
| 35 | +job creation, e.g. `ci_job.template.yaml`. |
| 36 | + |
| 37 | +### Example Directory Structure: |
| 38 | + |
| 39 | +``` |
| 40 | ++--docker-etl/ |
| 41 | +|  +--jobs/ |
| 42 | +|     +--example-python-1/ |
| 43 | +|        +--ci_job.yaml |
| 44 | +|        +--ci_workflow.yaml |
| 45 | +|        +--Dockerfile |
| 46 | +|        +--README.md |
| 47 | +|        +--script |
| 48 | +|  +--templates/ |
| 49 | +|     +--python/ |
| 50 | +|        +--job/ |
| 51 | +|           +--module/ |
| 52 | +|           +--tests/ |
| 53 | +|           +--Dockerfile |
| 54 | +|           +--README.md |
| 55 | +|           +--requirements.txt |
| 56 | +|        +--ci_job.template.yaml |
| 57 | +|        +--ci_workflow.template.yaml |
| 58 | + |
| 59 | +``` |
| 60 | + |
| 61 | +## Development |
| 62 | + |
| 63 | +The tools in this repository are intended for Python 3.8+. |
| 64 | + |
| 65 | +To install dependencies: |
| 66 | + |
| 67 | +```sh |
| 68 | +pip install -r requirements.txt |
| 69 | +``` |
| 70 | + |
| 71 | +This project uses `pip-tools` to pin dependencies. New dependencies go in |
| 72 | +`requirements.in` and `pip-compile` is used to generate `requirements.txt`: |
| 73 | + |
| 74 | +```sh |
| 75 | +pip install pip-tools |
| 76 | +pip-compile --generate-hashes requirements.in |
| 77 | +``` |
| 78 | + |
| 79 | +To run tests: |
| 80 | + |
| 81 | +```sh |
| 82 | +pytest --flake8 --black tests/ |
| 83 | +``` |
| 84 | + |
| 85 | +### Adding a new job |
| 86 | + |
| 87 | +To add a new job: |
| 88 | + |
| 89 | +```sh |
| 90 | +./script/create_job --job-name example-job --template python |
| 91 | +``` |
| 92 | + |
| 93 | +`job-name` is the name of the directory that will be created in `jobs/`. |
| 94 | + |
| 95 | +`template` is an optional argument that will populate the created directory |
| 96 | +with the contents of a template. |
| 97 | +If no template is given, a directory with only the required files is created. |
| 98 | + |
| 99 | +#### Available Templates: |
| 100 | + |
| 101 | +| Template name | Description | |
| 102 | +| ------------- | ----------- | |
| 103 | +| default | Base directory with readme, Dockerfile, and CI config files | |
| 104 | +| python | Simple Python module with unit test and lint config | |
| 105 | + |
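To illustrate what using a template amounts to, here is a minimal sketch of the copy step described under "Adding a new job". It is not the actual `script/create_job` implementation: the function signature, the `root` parameter, and the use of `default` as the fallback template name are assumptions, and rendering of the `ci_*.template.yaml` files is left out.

```python
import shutil
from pathlib import Path


def create_job(job_name: str, template: str = "default", root: Path = Path(".")) -> Path:
    """Copy templates/<template>/job/ into jobs/<job_name>/ (hypothetical sketch)."""
    source = root / "templates" / template / "job"
    target = root / "jobs" / job_name
    if not source.is_dir():
        raise ValueError(f"unknown template: {template}")
    if target.exists():
        raise FileExistsError(f"job already exists: {target}")
    # copytree creates the target directory itself, so only its parent must exist.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target)
    return target
```

Refusing to overwrite an existing job directory keeps the sketch safe to re-run; whether the real script behaves the same way is not stated here.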
| 106 | +### Modifying the CI config |
| 107 | + |
| 108 | +This repo uses CircleCI, which only allows a single global config file. |
| 109 | +To simplify adding jobs to and removing jobs from CI, the config file is |
| 110 | +generated from templates. |
| 111 | +This means the `config.yml` in `.circleci/` should not be modified directly. |
| 112 | + |
| 113 | +To generate `.circleci/config.yml`: |
| 114 | + |
| 115 | +```sh |
| 116 | +./script/update_ci_config |
| 117 | +``` |
| 118 | + |
| 119 | +To make changes to the config that are not specific to an ETL job |
| 120 | +(e.g. adding a command), edit `templates/config.template.yml` |
| 121 | +and re-generate the output config. |
| 122 | + |
| 123 | +Each job has a `ci_job.yaml` and a `ci_workflow.yaml` which define the steps |
| 124 | +that will go into the jobs and workflow sections of the CircleCI config. |
| 125 | +Any changes to these files should be followed by updating the global config |
| 126 | +via `script/update_ci_config`. |
| 127 | +When a job is created, the CI files are created based on the |
| 128 | +`ci_*.template.yaml` files in the template directory. |
| 129 | + |
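The generation step described in this section can be pictured as collecting every job's fragments and splicing them into the global template. The sketch below is only an illustration under stated assumptions: the `{{ jobs }}` and `{{ workflows }}` markers, both function names, and the plain-text substitution are hypothetical, and the real `script/update_ci_config` may use a templating engine instead.

```python
import textwrap
from pathlib import Path


def collect_fragments(jobs_root: Path, filename: str, indent: int = 2) -> str:
    """Concatenate one fragment file from every job directory, indented
    so that it nests under a top-level key of the generated config."""
    parts = []
    for job_dir in sorted(p for p in jobs_root.iterdir() if p.is_dir()):
        fragment = job_dir / filename
        if fragment.is_file():
            text = fragment.read_text().rstrip("\n")
            parts.append(textwrap.indent(text, " " * indent))
    return "\n".join(parts)


def render_config(template_text: str, jobs_root: Path) -> str:
    """Fill the hypothetical markers with the collected per-job fragments."""
    return (
        template_text
        .replace("{{ jobs }}", collect_fragments(jobs_root, "ci_job.yaml"))
        .replace("{{ workflows }}", collect_fragments(jobs_root, "ci_workflow.yaml"))
    )
```

Plain string replacement is used here so that shell variables such as `$CIRCLE_BRANCH` inside the template pass through untouched.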
| 130 | +### Adding a template |
| 131 | + |
| 132 | +To add a new template, create a new directory in `templates/` with the name |
| 133 | +of the template. |
| 134 | +This directory must have a `ci_job.template.yaml`, a `ci_workflow.template.yaml`, |
| 135 | +and a `job/` directory which contains all the files that will be copied to |
| 136 | +any job that uses this template. |
| 137 | + |