PXF is an extensible framework that allows distributed databases like Greenplum and Apache Cloudberry to query external data files whose metadata is not managed by the database. PXF includes built-in connectors for accessing data stored in HDFS files, Hive tables, HBase tables, JDBC-accessible databases, and more. Users can also create their own connectors to other data storage or processing engines.
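As an illustration of how PXF is typically used from SQL, the sketch below queries a delimited text file in HDFS through the external-table protocol. The table name, HDFS path, and columns are hypothetical; the `pxf://` LOCATION and `PROFILE` syntax follow the PXF documentation, and the example assumes the `pxf` extension can be enabled in the target database.

```
# Hypothetical example: query a comma-delimited text file stored in HDFS through PXF.
# Table name, columns, and HDFS path are placeholders.
psql -d postgres <<'SQL'
CREATE EXTENSION IF NOT EXISTS pxf;

CREATE EXTERNAL TABLE pxf_example (id int, name text)
    LOCATION ('pxf://tmp/pxf_example/data.csv?PROFILE=hdfs:text')
    FORMAT 'TEXT' (DELIMITER ',');

SELECT * FROM pxf_example;
SQL
```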
This project is derived from greenplum/pxf and customized for Apache Cloudberry.
- `external-table/`: Contains the Cloudberry extension implementing an External Table protocol handler
- `fdw/`: Contains the Cloudberry extension implementing a Foreign Data Wrapper (FDW) for PXF
- `server/`: Contains the server-side code of PXF along with the PXF Service and all the plugins
- `cli/`: Contains the command-line interface code for PXF
- `automation/`: Contains the automation and integration tests for PXF against the various datasources
- `ci/`: Contains the CI/CD environment and scripts (including the singlecluster Hadoop environment)
- `regression/`: Contains the end-to-end (integration) tests for PXF against the various datasources, utilizing the PostgreSQL testing framework `pg_regress`
Below are the steps to build and install PXF along with its dependencies, including Cloudberry and Hadoop.
```
git clone https://github.com/apache/cloudberry-pxf.git
```

To build PXF, you must have the following (a quick sanity check is sketched after this list):

- GCC compiler, the `make` system, the `unzip` package, and `maven` for running integration tests

- Installed Cloudberry

  Either download and install the Cloudberry RPM or build Cloudberry from source by following the Cloudberry build instructions.

  Assuming you have installed Cloudberry into the `/usr/local/cloudberry-db` directory, run its environment script:

  ```
  source /usr/local/cloudberry-db/greenplum_path.sh   # For Cloudberry 2.0
  source /usr/local/cloudberry-db/cloudberry-env.sh   # For Cloudberry 2.1+
  ```

- JDK 1.8 or JDK 11 to compile/run

  Export your `JAVA_HOME`:

  ```
  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
  ```

- Go (1.9 or later)

  You can download and install Go via the Go downloads page.

  Make sure to export your `GOPATH` and add Go to your `PATH`. For example:

  ```
  export GOPATH=$HOME/go
  export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin
  ```

  Once you have installed Go, you will need the `ginkgo` tool, which runs the Go tests. Assuming `go` is on your `PATH`, you can run:

  ```
  go install github.com/onsi/ginkgo/ginkgo@latest
  ```
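As a quick, optional sanity check (exact package names and install paths vary by platform), you can verify that the prerequisites above are available on your `PATH` before building:

```
# Verify that the build prerequisites are available (version output varies)
gcc --version
make --version
unzip -v | head -1
mvn --version
java -version
go version
ginkgo version
```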
PXF uses Makefiles to build its components. The PXF server component uses Gradle, which is wrapped by the Makefile for convenience.
```
cd cloudberry-pxf/

# Compile PXF
make
```

To install PXF, first make sure that the user has sufficient permissions in the `$GPHOME` and `$PXF_HOME` directories to perform the installation. It's recommended to change ownership to match the installing user. For example, when installing PXF as user `gpadmin` under `/usr/local/cloudberry-db`:
```
mkdir -p /usr/local/cloudberry-pxf
export PXF_HOME=/usr/local/cloudberry-pxf
export PXF_BASE=${HOME}/pxf-base
chown -R gpadmin:gpadmin "${PXF_HOME}"

make install
```

NOTE: If `PXF_BASE` is not set, it will default to `PXF_HOME`, and server configurations, libraries, or other settings might get deleted after a PXF re-install.
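A quick way to confirm the install landed in `${PXF_HOME}` is to call the PXF CLI by its full path (a sketch; `PATH` has not been updated yet at this point):

```
# Print the installed PXF version using the full path to the CLI
"${PXF_HOME}/bin/pxf" version
```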
Ensure that PXF is in your `PATH`. This command can be added to your `.bashrc`:

```
export PATH=/usr/local/cloudberry-pxf/bin:$PATH
```

Then you can prepare and start up PXF by doing the following:

```
pxf prepare
pxf start
```

If `${HOME}/pxf-base` does not exist, `pxf prepare` will create the directory for you. This command should only need to be run once.
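To verify that the service came up, you can check its status with the CLI and probe the REST health endpoint. The port and endpoint below are assumptions based on PXF's default configuration and its Spring Boot service; adjust them if your setup differs.

```
# Check the PXF service status
pxf status

# Probe the REST health endpoint (assumes the default PXF port 5888)
curl -s http://localhost:5888/actuator/health
```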
Note: Local development with PXF requires a running Cloudberry cluster.
Once the desired changes have been made, there are two options to re-install PXF:

- Run `make -sj4 install` to re-install and run tests
- Run `make -sj4 install-server` to only re-install the PXF server without running unit tests
After PXF has been re-installed, you can restart the PXF instance using:
```
pxf restart
```

We provide a Docker-based development environment that includes Cloudberry, Hadoop, and PXF. See `automation/README.Docker.md` for detailed instructions.

Note: Since the Docker container will house the single-cluster Hadoop, Cloudberry, and PXF, we recommend that you have at least 4 CPUs and 6GB of memory allocated to Docker. These settings are available under Docker preferences.
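To confirm that your Docker daemon actually has the recommended resources available, a quick check (the `docker info` format fields are standard; the 4 CPU / 6GB thresholds are the recommendation above):

```
# Show the CPUs and memory available to the Docker daemon
docker info --format 'CPUs: {{.NCPU}}, Memory: {{.MemTotal}} bytes'
```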
- Start IntelliJ. Click "Open" and select the directory to which you cloned the `pxf` repo.
- Select `File > Project Structure`.
- Make sure you have a JDK (version 1.8) selected.
- In the `Project Settings > Modules` section, select `Import Module`, pick the `pxf/server` directory, and import it as a Gradle module. You may see an error saying that there's no JDK set for Gradle. Just cancel and retry; it goes away the second time.
- Import a second module, giving the `pxf/automation` directory, select "Import module from external model", pick `Maven`, then click Finish.
- Restart IntelliJ.
- Check that it worked by running a unit test (you cannot currently run automation tests from IntelliJ; a command-line alternative is sketched after this list) and making sure that imports, variables, and auto-completion function in the two modules.
- Optionally, you can replace `${PXF_TMP_DIR}` with `${GPHOME}/pxf/tmp` in `automation/pom.xml`.
- Select `Tools > Create Command-line Launcher...` to enable starting IntelliJ with the `idea` command, e.g. `cd ~/workspace/pxf && idea .`
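If you prefer to run the server unit tests outside IntelliJ, a sketch is shown below (assuming the Gradle wrapper checked into `server/`, which the Makefile also uses):

```
# Run the PXF server unit tests from the command line
cd server
./gradlew test
```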
- In IntelliJ, click `Edit Configuration` and add a new one of type `Remote`.
- Change the name to `PXF Service Boot`.
- Change the port number to `2020`.
- Save the configuration.
- Restart PXF in DEBUG mode:

  ```
  PXF_DEBUG=true pxf restart
  ```

- Debug the new configuration in IntelliJ.
- Run a query in Cloudberry that uses PXF to debug with IntelliJ (see the example after this list).
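For example, once the debugger is attached, any query that goes through PXF will hit your breakpoints; reusing the hypothetical `pxf_example` table from the overview section:

```
# Trigger a PXF read request so the attached debugger stops at your breakpoint
psql -d postgres -c "SELECT count(*) FROM pxf_example;"
```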
See the CONTRIBUTING file for how to make contributions to PXF for the Cloudberry Database.
Licensed under the Apache License, Version 2.0. See the LICENSE file for details.