Local Debugging of AWS Glue Jobs

Note: Amazon now offers a Docker image for local Glue debugging; see the alternative method from AWS below.

This guide shows how to debug AWS Glue scripts locally using PyCharm or Jupyter Notebook.

Before You Start 

You will need the following before you can complete this task: 

  • An AWS account (not needed for purely local work).
  • Python 3.6+.
  • A Java JDK (1.8.0 is used here).
  • A Linux computer. This has not been tested on macOS or Windows.
  • An understanding of environment variables and editing Bash scripts.
  • Familiarity with git.
  • Familiarity with Linux terminals.
  • For Jupyter Notebook:
    • An understanding of how notebooks work.
    • Installing Python packages with pip or another method.
  • For IDE debugging:
    • SSH and configuring keys.
    • PyCharm or another Python debugging platform.

Note: This involves many steps and requires correcting several things manually. It may not be worth the trouble, but if you have had as many issues with Glue as we have, it could save you time in the long run.

Alternative Method from AWS - NEW

Since we originally published this How-To, AWS has created its own documentation using Docker containers. We have not tested it, but it may be preferable if you like using containers.

Developing AWS Glue ETL jobs locally using a container


In the research programming group we have begun using AWS Glue for several projects. It is very useful for transforming data from CSV to Parquet for later use in AWS Athena, or loading large flat files into an RDBMS as part of other processes. 

The problem is that sometimes our data is not in the proper format, our Glue crawler or job fails, and we stare blankly at an insufficient error log in CloudWatch. The nature of Glue means that it can take a while to learn about a failure, so each attempt in the iterative process is delayed by about 20 minutes. We could use a Glue development endpoint, but they are prohibitively expensive and can be difficult to connect to in a VPC.

Our ideal is to be able to step through the debugger as a job runs, and examine it. Here are the steps needed to do that.

Download Maven

Apache Maven is required to download additional Java libraries for aws-glue-libs.

Download: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz

We will be extracting this and the following archives into the $HOME/bin/ directory but you may place them anywhere as long as they are referenced correctly in the later environment variables. Create this directory if it does not yet exist.

Extract the maven archive.

$ tar xvf apache-maven-3.6.0-bin.tar.gz -C $HOME/bin/

Update PATH and JAVA_HOME*

Maven needs to be in your $PATH.

$ export PATH="$HOME/bin/apache-maven-3.6.0/bin:$PATH"

Set the JAVA_HOME environment variable. It must point to your JDK installation. Here is one common location:

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

*These and remaining vars can be added to your user environment variables in .bashrc/.zshenv or handled per session.

See: aws-glue-libs

Download Spark

Glue adds AWS features on top of Apache Spark and uses the Spark libraries.

Download: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz

The glue-1.0 version is compatible with Python 3 and will be the one used in these examples.

Extract the Spark archive.

$ tar xvf spark-2.4.3-bin-hadoop2.8.tgz -C $HOME/bin/

Set the environment variables for SPARK_HOME.

$ export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/

See: aws-glue-libs

Checkout AWS Glue libraries

The aws-glue-libs repository contains the AWS libraries that sit on top of Apache Spark. Use git to check it out.

$ cd $HOME/bin
$ git clone https://github.com/awslabs/aws-glue-libs.git
Cloning into 'aws-glue-libs'...
remote: Enumerating objects: 151, done.
remote: Total 151 (delta 0), reused 0 (delta 0), pack-reused 151
Receiving objects: 100% (151/151), 60.60 KiB | 4.04 MiB/s, done.
Resolving deltas: 100% (91/91), done.

Change directory into the repository and check out the glue-1.0 branch.

$ cd aws-glue-libs
$ git checkout glue-1.0
Branch 'glue-1.0' set up to track remote branch 'glue-1.0' from 'origin'.
Switched to a new branch 'glue-1.0'

Run glue-setup.sh

The glue-setup.sh script needs to be run to create the PyGlue.zip library and to download the additional .jar files for AWS Glue using Maven.

$ cd $HOME/bin/aws-glue-libs
$ chmod +x ./bin/glue-setup.sh
$ ./bin/glue-setup.sh
  adding: awsglue/ (stored 0%)
  adding: awsglue/dynamicframe.py (deflated 81%)
[INFO] Finished at: 2019-11-18T11:02:34-05:00
[INFO] ------------------------------------------------------------------------
rm: cannot remove '$HOME/bin/aws-glue-libs/conf/spark-defaults.conf': No such file or directory

There will be some errors that can be ignored regarding "cannot remove 'PyGlue.zip'" and "cannot remove 'spark-defaults.conf'". These files are replaced with each run and do not exist on the first run.

Modify glue-setup.sh

Since we will be modifying the jars manually, we have to prevent Maven from overwriting our work. We need to comment out the Maven update line in case we run glue-setup.sh again.

Edit glue-setup.sh using nano or vi.

$ nano $HOME/bin/aws-glue-libs/bin/glue-setup.sh

or

$ vi $HOME/bin/aws-glue-libs/bin/glue-setup.sh

Find the line:

mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

and comment it out:

# mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

Save and exit.

Copy .jar file from Spark

There is a bug in the aws-glue-libs project that causes Glue to fail. See: aws-glue-libs Issues #25

To correct this, remove netty-all-4.0.23.Final.jar and replace it with netty-all-4.1.17.Final.jar from the Spark installation.

$ rm $HOME/bin/aws-glue-libs/jarsv1/netty-all-4.0.23.Final.jar
$ cp $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/netty-all-4.1.17.Final.jar $HOME/bin/aws-glue-libs/jarsv1/

Check that you have the correct jar:

$ ls -1 $HOME/bin/aws-glue-libs/jarsv1 | grep -i netty

Set environment variables

The AWS-provided scripts that launch Glue have limitations, but under the hood they essentially run Spark after setting up particular environment variables. We need to set those manually to run Spark like Glue in our own way.

$ export SPARK_CONF_DIR=$HOME/bin/aws-glue-libs/conf
$ export PYTHONPATH="${SPARK_HOME}python/:${SPARK_HOME}python/lib/py4j-0.10.7-src.zip:$HOME/bin/aws-glue-libs/PyGlue.zip:${PYTHONPATH}" 
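
The PYTHONPATH export above prepends three entries; here is a small Python sketch of what each one contributes (the paths assume the $HOME/bin layout used throughout this guide):

```python
import os

# Assumed install locations, matching the directories used earlier in this guide.
home = os.path.expanduser("~")
spark_home = os.path.join(home, "bin", "spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8") + "/"
glue_libs = os.path.join(home, "bin", "aws-glue-libs")

# The three PYTHONPATH entries Spark needs in order to behave like Glue:
pythonpath = ":".join([
    spark_home + "python/",                         # the pyspark Python sources
    spark_home + "python/lib/py4j-0.10.7-src.zip",  # the Java <-> Python bridge
    os.path.join(glue_libs, "PyGlue.zip"),          # the AWS Glue Python library
])
print(pythonpath)
```

If any one of the three entries is missing, the later imports of pyspark, py4j, or awsglue will fail.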

Notebook Method

Create Project Directory and Virtual environment

$ mkdir glue_notebook
$ cd glue_notebook
$ python3.6 -m venv venv
$ source venv/bin/activate
(venv) $ 

Install Notebook

(venv) $ pip install notebook
Collecting notebook
  Using cached https://files.pythonhosted.org/packages/b1/f1/0a67f09ef53a342403ffa66646ee39273e0ac79ffa5de5dbe2f3e28b5bdf/notebook-6.0.3-py3-none-any.whl
Collecting tornado>=5.0 (from notebook)
  Using cached https://files.pythonhosted.org/packages/95/84/119a46d494f008969bf0c775cb2c6b3579d3c4cc1bb1b41a022aa93ee242/tornado-6.0.4.tar.gz
Collecting jupyter-client>=5.3.4 (from notebook)
  Using cached https://files.pythonhosted.org/packages/34/0b/2ebddf775f558158ca8df23b35445fb15d4b1558a9e4a03bc7e75b13476e/jupyter_client-6.1.3-py3-none-any.whl
Collecting nbconvert (from notebook)
  Using cached https://files.pythonhosted.org/packages/79/6c/05a569e9f703d18aacb89b7ad6075b404e8a4afde2c26b73ca77bb644b14/nbconvert-5.6.1-py2.py3-none-any.whl
Collecting jupyter-core>=4.6.1 (from notebook)
  Using cached https://files.pythonhosted.org/packages/63/0d/df2d17cdf389cea83e2efa9a4d32f7d527ba78667e0153a8e676e957b2f7/jupyter_core-4.6.3-py2.py3-none-any.whl
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 attrs-19.3.0 backcall-0.2.0 bleach-3.1.5 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 importlib-metadata-1.7.0 ipykernel-5.3.0 ipython-7.16.1 ipython-genutils-0.2.0 jedi-0.17.1 jinja2-2.11.2 jsonschema-3.2.0 jupyter-client-6.1.3 jupyter-core-4.6.3 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.7 notebook-6.0.3 packaging-20.4 pandocfilters-1.4.2 parso-0.7.0 pexpect-4.8.0 pickleshare-0.7.5 prometheus-client-0.8.0 prompt-toolkit-3.0.5 ptyprocess-0.6.0 pygments-2.6.1 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pyzmq-19.0.1 six-1.15.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-4.3.3 wcwidth-0.2.5 webencodings-0.5.1 zipp-3.1.0

See: https://jupyter.readthedocs.io/en/latest/install.html for other options

Run Jupyter 

Make sure that your environment variables from earlier are still set; if they are, you can simply run the following.

(venv) $ jupyter notebook
[I 12:03:42.738 NotebookApp] Serving notebooks from local directory: {DIR}/glue_notebook
[I 12:03:42.738 NotebookApp] The Jupyter Notebook is running at:
[I 12:03:42.738 NotebookApp] http://localhost:8888/?token=abcdef1234567890abcdef

Open your browser to the URL given. You should see the Jupyter admin page.

Jupyter administration page showing the project directory and the venv subdirectory

Create two directories 'data_in' and 'data_out' by clicking New > Folder.

Create a new notebook using Python 3 or download the example notebook.

Jupyter 'New' dropdown with 'Python 3' highlighted.

Download the example data to the 'data_in' directory: Example data

Download and run example notebook blocks

Example Glue Notebook

Image of the example Glue script loaded in Jupyter.

IDE Method

Setup PyCharm Projects

We need to create a project in PyCharm. You can change these parameters as you like.

  1. Location: $HOME/Projects/local_glue
  2. Interpreter: a virtual environment with Python 3.6/3.7 $HOME/Projects/local_glue/venv
  3. Using the instructions in Remote Debugging with PyCharm, copy pydevd-pycharm.egg to your PROJECT ROOT.
  4. Under Settings > Project Structure add a new content root pointing to the AWS Glue PyGlue.zip
    1. $HOME/bin/aws-glue-libs/PyGlue.zip
  5. Add another content root for py4j-*.zip in the Spark directory and for pyspark.zip
    1. $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip
    2. $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/pyspark.zip
  6. Your IDE should now be able to interpret Spark and Glue functions.
  7. Create an input and output directory as a subdir in the project.
  8. Download this example script and save as glue_script.py in the PROJECT ROOT.
  9. Change the SOURCE_ROOT and OUTPUT_ROOT in glue_script.py to reflect your project directory.
  10. Save this example CSV of The Daily Show guests to the ./input directory.
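
The example script above follows the usual pattern for a locally runnable Glue job. As a rough sketch of that shape (not the exact example script; it assumes PyGlue.zip and the Spark Python libraries are on PYTHONPATH, which is why the imports live inside the function, and the input/output paths are placeholders):

```python
import sys

def run_job(argv, source_root, output_root):
    """Sketch of a minimal Glue job body: load a CSV, write Parquet."""
    # These imports resolve only when PyGlue.zip and pyspark are on PYTHONPATH.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    # Pull --JOB_NAME from the spark-submit arguments.
    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the CSV with Spark and write it back out as Parquet.
    spark = glue_context.spark_session
    df = spark.read.csv(source_root, header=True)
    df.write.mode("overwrite").parquet(output_root)

    job.commit()

if __name__ == "__main__":
    run_job(sys.argv, "./input", "./output")
```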

PyCharm Remote Debug Run Configuration

Because of the distributed nature of Spark we have to set up debugging using the PyCharm remote debug server.

  1. Create a new Run Config under the menu Run > Edit Configurations.
  2. Add New Configuration (Alt+Insert)
  3. Select "Python Remote Debug"
  4. Set the local host name (localhost when Spark runs on the same machine).
  5. Set the debug port: 12345
  6. Map the path to itself, e.g. $HOME/Projects/local_glue=$HOME/Projects/local_glue
  7. Click OK
  8. Click the bug symbol (Shift+F9) to wait for the debug client.
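
For the job to call back into this debug server, glue_script.py needs a pydevd_pycharm.settrace() call near the top. A sketch of such a hook (the host and port match the run configuration above; attach_debugger is a hypothetical helper name, not part of any library):

```python
def attach_debugger(host="localhost", port=12345):
    """Connect to a PyCharm debug server waiting on host:port.

    Returns True if attached, False if pydevd-pycharm is unavailable.
    Call this near the top of glue_script.py, before the job logic runs.
    """
    try:
        import pydevd_pycharm  # shipped in the pydevd-pycharm.egg copied earlier
    except ImportError:
        return False
    # Redirect the job's stdout/stderr into the PyCharm console as well.
    pydevd_pycharm.settrace(host, port=port,
                            stdoutToServer=True, stderrToServer=True)
    return True
```

Once settrace connects, execution pauses and PyCharm takes control, so any breakpoints set after this point will be hit.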

Run the Spark Job

We need to run our glue_script.py using Spark in local mode, with a single worker and specifying the JOB_NAME.

$ $SPARK_HOME/bin/spark-submit --master local\[1\] $HOME/Projects/local_glue/glue_script.py --JOB_NAME local_test
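
The --JOB_NAME argument is consumed by awsglue.utils.getResolvedOptions inside the script. As a rough illustration of what that resolution does (a simplified stand-in, not the real implementation, which supports more syntax and raises its own error type):

```python
def resolve_options(argv, options):
    """Simplified stand-in for awsglue.utils.getResolvedOptions:
    pull named --KEY value pairs out of an argument list."""
    resolved = {}
    for opt in options:
        flag = "--" + opt
        if flag not in argv:
            raise KeyError("missing required argument: " + flag)
        resolved[opt] = argv[argv.index(flag) + 1]
    return resolved

# Mirrors the spark-submit invocation above:
args = resolve_options(["glue_script.py", "--JOB_NAME", "local_test"], ["JOB_NAME"])
print(args["JOB_NAME"])  # -> local_test
```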

Return to PyCharm

If everything worked, PyCharm should receive the call from glue_script.py and be waiting for instructions. You should now be able to interactively debug Glue locally.


Troubleshooting

java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information

This is caused by conflicting signed libraries. Deleting them resolves the problem.

$ rm $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/javax.servlet*

com.amazonaws.SdkClientException: Unable to load region information from any provider in the chain

You will need to set a region, either via the AWS_REGION or AWS_DEFAULT_REGION environment variables or in your AWS config file. See https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html


There are many moving parts here, and the libraries this relies on could change in the future. Hopefully this will become an easier task with more visibility.



Contact: Douglas H. King Research Programming