Try Luigi with Vagrant!

"Never bake your own ETL pipeline" seems to be a piece of advice, which most people working with data can whole-heartedly agree on. So, what are the options to build a well behaving, understandable, expandable and easy to maintain data pipeline? There are multiple choices, as usually. Spotify has open sourced their Python workflow engine called Luigi. Viewing talks and reading articles about it is all well and good, but to really get an impression if it is something you would be happy to work with, there is no way around actually getting your hands dirty.

As with many other tools and frameworks, it takes a while to work out what needs to be installed and configured for Luigi to run. Then you usually copy some example code, piece together the commands needed to start it from several different sources, and only afterwards get to start modifying it. It takes far too long to reach the interesting parts, with too many manual steps and too much experimenting and frustration spent on getting right the parts you don't really care about.

This article is part of a series. With minimal initial preparation you will be able to try multiple ways to build data pipelines without cluttering up your system or fighting basic setup issues that have nothing to do with your actual work. The setup works out of the box, and you will see a working example in action after only two simple commands. After that, nothing stops you from exploring and modifying the code in your favourite editor.

Preparations

You will need to install two applications on your machine to follow along: VirtualBox and Vagrant. They are straightforward to set up and very easy to get rid of again.

Now you only need to either download and unpack the zip archive containing the project, or clone/fork the git repository from GitHub. Open a terminal inside the new directory and run start.bat on Windows, or start.sh on Linux/macOS. As Vagrant states, this WILL take some time. I'd advise doing something else and checking back in about 10 minutes to avoid busy waiting. The setup is done when the terminal window either closes by itself or shows that the command has finished.

Logging In

You're only two more steps away from running your first Luigi job, seeing the web interface and examining the results. The configuration takes some time, but afterwards a new window should open with a Linux command-line environment.

Log into the new virtual machine with the user name "vagrant" and the password "vagrant", both without the quotes. Then start the Luigi job by typing the following two commands (without the dollar sign at the beginning):

$ luigid &
$ python code/hello_world.py

The first command starts the Luigi server in the background; the second executes a Luigi task by running a Python script. Because the server runs in the same terminal, it may print log lines from time to time. This should not be a major disturbance. Press the Enter key a few times to get a clean prompt if needed.

You can see the state of the operation by opening your browser and navigating to the address "localhost:8082". This is the Luigi dashboard. You should see that a task has recently been executed (or is still executing), which other tasks it depended on and whether it was successful.
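If the page does not load, a quick check from your own machine can tell whether anything answers on that address at all. The following is a minimal Python 3 sketch using only the standard library. The function name dashboard_is_up is made up for this example, and the default URL assumes the address mentioned above is reachable from where you run the script.

```python
import http.client
import urllib.error
import urllib.request


def dashboard_is_up(url="http://localhost:8082/", timeout=3.0):
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A server answered, just not with a success status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, timed out, or not speaking HTTP.
        return False


if __name__ == "__main__":
    if dashboard_is_up():
        print("Something answers on localhost:8082")
    else:
        print("No answer on localhost:8082; is luigid running?")
```

The check deliberately makes a full HTTP request instead of only opening a TCP connection, because a forwarded port may accept connections even when nothing is listening behind it.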

You can view and edit the source code of the Luigi job on your host operating system: in the project folder, navigate to the code subfolder. The file "hello_world.py" contains everything that has been executed. Not exciting. Let's look at something which actually crunches data and produces an output.

Processing Outputs

The file purchases.py contains two more active tasks. One produces a local file, the other a table in the virtual machine's PostgreSQL database. You can run them with:

$ python code/purchases.py testing.PurchasesTotal
$ python code/purchases.py testing.PurchasesToDatabase

The result of the first command is available in the data folder, which is shared between the virtual machine and your operating system: the Luigi task has created the file total.txt there. The second command populates a PostgreSQL database with the data from the original .csv file. You can take a look at it by connecting to the database and querying the new table (type the second line at the psql prompt that the first one opens):

$ psql vagrantdb
$ SELECT * FROM purchases;
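
The source of both tasks is in code/purchases.py. As a rough idea of the kind of work a task like testing.PurchasesTotal does, here is a minimal sketch in plain Python, without Luigi, that sums one column of a CSV file and writes the result to a text file. The column name "amount", the function name write_total and the file paths are assumptions made for this example and are not taken from the project.

```python
import csv
from decimal import Decimal


def write_total(csv_path, total_path, column="amount"):
    """Sum one numeric column of a CSV file and write the sum to total_path."""
    total = Decimal("0")
    with open(csv_path, newline="") as csv_file:
        # DictReader uses the first row as column names.
        for row in csv.DictReader(csv_file):
            # Decimal avoids floating-point rounding surprises with money.
            total += Decimal(row[column])
    with open(total_path, "w") as total_file:
        total_file.write("{}\n".format(total))
    return total
```

In a Luigi task, logic like this would live in the run() method, with the result file declared in output().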

Conclusion

Try modifying the pipeline, then run the command starting with python again to execute it. Congratulations, you are now using Luigi. Happy exploring!

For a non-testing scenario, this setup might be a good first prototype, but I'd advise against relying on it for long or for growing data sets. You will probably want to set up a dedicated server running Luigi, most likely alongside your company's other AWS machines.