Local Debugging of AWS Glue Jobs

Debug AWS Glue scripts locally using PyCharm. 

Before You Start 

You will need the following before you can complete this task: 

  • An AWS account (only needed if you later run the job in Glue itself; purely local work does not require one).
  • Python 3.6+.
  • A Java JDK (1.8.0 is used here).
  • PyCharm or another Python debugging platform (these instructions use PyCharm).
  • A Linux computer. This has not been tested on Mac or Windows.
  • An understanding of environment variables and editing Bash scripts.
  • Familiarity with git.
  • Familiarity with Linux terminals.
  • Familiarity with SSH and configuring keys.

Note: This involves many steps and several manual corrections. It may not be worth the trouble, but if you have had as many issues with Glue as we have, it could save you time in the long run.

Introduction

In the research programming group we have begun using AWS Glue for several projects. It is very useful for transforming data from CSV to Parquet for later use in AWS Athena, or for loading large flat files into an RDBMS as part of other processes.

The problem is that sometimes our data is not in the proper format, our Glue crawler or job fails, and we stare blankly at an insufficient error log in CloudWatch. The nature of Glue means that it can take a while to learn about a failure, so each attempt in the iterative process is delayed by about 20 minutes. We could use a Glue development endpoint, but they are prohibitively expensive and can be difficult to connect to in a VPC.

Our ideal is to be able to step through a job in the debugger as it runs and examine its state. Here are the steps needed to do that.

Download Maven

Apache Maven is required to download additional Java libraries for aws-glue-libs.

Download: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz

We will be extracting this and the following archives into the $HOME/bin/ directory, but you may place them anywhere as long as they are referenced correctly in the later environment variables. Create this directory if it does not yet exist.
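
For example:

$ mkdir -p $HOME/bin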

Extract the maven archive.

$ tar xvf apache-maven-3.6.0-bin.tar.gz -C $HOME/bin/
apache-maven-3.6.0/README.txt
apache-maven-3.6.0/LICENSE
apache-maven-3.6.0/NOTICE
...

Update PATH and JAVA_HOME*

Maven needs to be in your $PATH.

$ export PATH="$HOME/bin/apache-maven-3.6.0/bin:$PATH"
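
You can verify that Maven is on your PATH:

$ mvn -version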

Set the JAVA_HOME environment variable. This must point to your JDK installation. A common location on Debian/Ubuntu looks like this:

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
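
If you are unsure where your JDK is installed, one way to locate it (assuming javac is on your PATH) is to resolve the javac symlink:

$ readlink -f "$(which javac)" | sed 's:/bin/javac::'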

*These and the remaining variables can be added to your user environment in .bashrc/.zshenv or set per session.

See: aws-glue-libs

Download Spark

Glue adds AWS features on top of Apache Spark and uses the Spark libraries.

Download: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz

The glue-1.0 version is compatible with Python 3 and will be the one used in these examples.

Extract the Spark archive.

$ tar xvf spark-2.4.3-bin-hadoop2.8.tgz -C $HOME/bin/
spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/
spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/bin/
spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/conf/
...

Set the SPARK_HOME environment variable.

$ export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/
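
You can verify the Spark installation and SPARK_HOME:

$ $SPARK_HOME/bin/spark-submit --version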

See: aws-glue-libs

Check out AWS Glue libraries

The aws-glue-libs repository contains the AWS libraries that sit on top of Apache Spark. Use git to check it out.

$ cd $HOME/bin
$ git clone https://github.com/awslabs/aws-glue-libs.git
Cloning into 'aws-glue-libs'...
remote: Enumerating objects: 151, done.
remote: Total 151 (delta 0), reused 0 (delta 0), pack-reused 151
Receiving objects: 100% (151/151), 60.60 KiB | 4.04 MiB/s, done.
Resolving deltas: 100% (91/91), done.

Change directory into the repository and check out the glue-1.0 branch.

$ cd aws-glue-libs
$ git checkout glue-1.0
Branch 'glue-1.0' set up to track remote branch 'glue-1.0' from 'origin'.
Switched to a new branch 'glue-1.0'

Run glue-setup.sh

The glue-setup.sh script needs to be run to create the PyGlue.zip library and to download the additional .jar files for AWS Glue using Maven.

$ cd $HOME/bin/aws-glue-libs
$ chmod +x ./bin/glue-setup.sh
$ ./bin/glue-setup.sh
  adding: awsglue/ (stored 0%)
  adding: awsglue/dynamicframe.py (deflated 81%)
...
[INFO] Finished at: 2019-11-18T11:02:34-05:00
[INFO] ------------------------------------------------------------------------
rm: cannot remove '$HOME/bin/aws-glue-libs/conf/spark-defaults.conf': No such file or directory
$HOME/bin/aws-glue-libs

There will be some errors regarding "cannot remove 'PyGlue.zip'" and "cannot remove 'spark-defaults.conf'" that can be ignored. These files are replaced with each run and do not exist on the first run.
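
You can confirm that the library was built at the root of the repository:

$ ls $HOME/bin/aws-glue-libs/PyGlue.zip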

Modify glue-setup.sh

Since we will be modifying the jars manually, we have to prevent Maven from overwriting our work. Comment out the Maven update line in case glue-setup.sh is run again.

Edit glue-setup.sh using nano or vi.

$ nano $HOME/bin/aws-glue-libs/bin/glue-setup.sh 

or

$ vi $HOME/bin/aws-glue-libs/bin/glue-setup.sh 

Change:

mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

To:

# mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

Save and exit.
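
Alternatively, the edit can be made non-interactively. This sed one-liner comments out the mvn line; verify the file afterwards:

$ sed -i 's/^mvn /# mvn /' $HOME/bin/aws-glue-libs/bin/glue-setup.sh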

Copy .jar file from Spark

There is a bug in the aws-glue-libs project that causes Glue to fail. See: aws-glue-libs issue #25

To correct this we need to remove netty-all-4.0.23.Final.jar and replace it with netty-all-4.1.17.Final.jar from the Spark installation.

$ rm $HOME/bin/aws-glue-libs/jarsv1/netty-all-4.0.23.Final.jar
$ cp $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/netty-all-4.1.17.Final.jar $HOME/bin/aws-glue-libs/jarsv1/

Check that you have the correct jars:

$ ls -1 $HOME/bin/aws-glue-libs/jarsv1 | grep -i netty
netty-3.6.2.Final.jar
netty-all-4.1.17.Final.jar
netty-buffer-4.1.17.Final.jar
netty-codec-4.1.17.Final.jar
netty-codec-http-4.1.17.Final.jar
netty-common-4.1.17.Final.jar
netty-handler-4.1.17.Final.jar
netty-resolver-4.1.17.Final.jar
netty-transport-4.1.17.Final.jar

Set environment variables

The AWS-provided scripts that launch Glue have limitations, but under the hood they essentially set particular environment variables and then run Spark. We need to set those variables manually to run Spark like Glue in our own way.

$ export SPARK_CONF_DIR=$HOME/bin/aws-glue-libs/conf
$ export PYTHONPATH="${SPARK_HOME}python/:${SPARK_HOME}python/lib/py4j-0.10.7-src.zip:$HOME/bin/aws-glue-libs/PyGlue.zip:${PYTHONPATH}" 
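
For convenience, here is the full set of exports from this walkthrough, suitable for your .bashrc; adjust the paths if you installed anything elsewhere. Note that PYTHONPATH relies on the trailing slash in SPARK_HOME.

export PATH="$HOME/bin/apache-maven-3.6.0/bin:$PATH"
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/
export SPARK_CONF_DIR=$HOME/bin/aws-glue-libs/conf
export PYTHONPATH="${SPARK_HOME}python/:${SPARK_HOME}python/lib/py4j-0.10.7-src.zip:$HOME/bin/aws-glue-libs/PyGlue.zip:${PYTHONPATH}"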

Set up a PyCharm project

We need to create a project in PyCharm. You can change these parameters as you like.

  1. Location: $HOME/Projects/local_glue
  2. Interpreter: a virtual environment with Python 3.6/3.7 at $HOME/Projects/local_glue/venv
  3. Following the instructions in Remote Debugging with PyCharm, copy pydevd-pycharm.egg to your PROJECT ROOT.
  4. Under Settings > Project Structure, add a new content root pointing to the AWS Glue PyGlue.zip:
    1. $HOME/bin/aws-glue-libs/PyGlue.zip
  5. Add another content root for py4j-*.zip in the Spark directory, and another for pyspark.zip:
    1. $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip
    2. $HOME/bin/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/pyspark.zip
  6. Your IDE should now be able to interpret Spark and Glue functions.
  7. Create input and output directories as subdirectories of the project.
  8. Download this example script and save it as glue_script.py in the PROJECT ROOT (a minimal sketch is shown after this list).
  9. Change SOURCE_ROOT and OUTPUT_ROOT in glue_script.py to reflect your project directory.
  10. Save this example CSV of The Daily Show guests to the ./input directory.
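
In case the example script link is unavailable, here is a minimal sketch of what glue_script.py can look like for this setup. The settrace host and port match the run configuration in the next section; the placeholder paths and the CSV-to-Parquet round trip through a DynamicFrame are illustrative assumptions, not the exact contents of the original example script.

import os
import sys

# Make the pydevd-pycharm.egg copied to the PROJECT ROOT importable.
# Adjust the filename if your egg is versioned; pip install pydevd-pycharm
# also works.
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)),
                             'pydevd-pycharm.egg'))
import pydevd_pycharm

# Connect back to the PyCharm debug server configured in the next section.
# Execution pauses here until PyCharm accepts the connection.
pydevd_pycharm.settrace('127.0.0.1', port=12345,
                        stdoutToServer=True, stderrToServer=True)

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Placeholder paths -- change these to reflect your project directory.
SOURCE_ROOT = os.path.expanduser('~/Projects/local_glue/input')
OUTPUT_ROOT = os.path.expanduser('~/Projects/local_glue/output')

# JOB_NAME is passed on the spark-submit command line shown below.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the CSV with Spark, then wrap it in a Glue DynamicFrame so Glue
# transforms can be stepped through in the debugger.
df = spark.read.csv(SOURCE_ROOT, header=True)
dyf = DynamicFrame.fromDF(df, glue_context, 'dyf')

# Write the result out as Parquet.
dyf.toDF().write.mode('overwrite').parquet(OUTPUT_ROOT)

job.commit()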

PyCharm Remote Debug Run Configuration

Because of the distributed nature of Spark we have to set up debugging using the PyCharm remote debug server.

  1. Create a new run configuration under the menu Run > Edit Configurations.
  2. Add a new configuration (Alt+Insert).
  3. Select "Python Remote Debug".
  4. Set the local host name: 127.0.0.1
  5. Set the debug port: 12345
  6. Map the path to itself, e.g. $HOME/Projects/local_glue=$HOME/Projects/local_glue
  7. Click OK.
  8. Click the bug icon (Shift+F9) to start the debug server; it will wait for the debug client to connect.

Run the Spark Job

We need to run glue_script.py using Spark in local mode with a single worker, passing the JOB_NAME argument.

$ $SPARK_HOME/bin/spark-submit --master local\[1\] $HOME/Projects/local_glue/glue_script.py --JOB_NAME local_test

Return to PyCharm

If everything worked, PyCharm should receive the connection from glue_script.py and pause, waiting for instructions. You should now be able to interactively debug Glue locally.

Conclusion

There are many moving parts here, and the libraries this relies on could change in the future. Hopefully this will become an easier task as local Glue development gains more visibility.

Questions?

Contact: Douglas H. King, Research Programming