Will Using Luigi Limit Your Pipeline Performance and Language Choice Flexibility?

You are looking around for the best foundation for your company's future data pipeline, and Luigi seems like a solid choice. But Luigi is written in Python, and Python, while a blast to develop in, is not the fastest language around. Will it hold you back eventually? Does using Luigi mean that all the data crunching needs to happen in Python as well?

No reason to worry. You are not limiting yourself to a single language. While Luigi provides the means to implement data processing code right inside of its Tasks, you don't have to. Think of it as a tool that saves you effort by gluing together the parts of your pipeline that do the actual heavy lifting. For one thing, Luigi has well-supported integrations for running Hadoop-ecosystem jobs and for interacting with Spark, both of which are more than suited to processing huge amounts of data.
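As a rough illustration of that glue role, here is a minimal sketch assuming Luigi's contrib modules for Spark and HDFS are available; the jar path, class name, and output location are all hypothetical:

```python
import luigi
from luigi.contrib.hdfs import HdfsTarget
from luigi.contrib.spark import SparkSubmitTask


class CleanLogs(SparkSubmitTask):
    """Delegates the heavy lifting to a Spark job written in Scala."""

    date = luigi.DateParameter()

    # Hypothetical pre-built Scala application handed to spark-submit.
    app = "hdfs:///apps/log-cleaner-assembly-1.0.jar"
    entry_class = "com.example.logs.CleanLogs"
    deploy_mode = "cluster"

    def app_options(self):
        # Arguments passed through to the Spark application itself.
        return ["--date", str(self.date)]

    def output(self):
        # Luigi tracks completion through targets, as with any other task.
        return HdfsTarget("/data/clean-logs/%s" % self.date)
```

The Python side only declares dependencies and submits the job; the actual number crunching happens on the cluster, in Scala.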

There are of course other ways. The simplest one would be to use subprocesses to execute other programs from the Python code: programs written in Scala, R, C, Julia, Lua, Ruby, JavaScript, Python3, or anything that can be wrapped in a bash script. Luigi still takes care of orchestration concerns such as dependency resolution and alerting.

A more advanced approach involves the awesome power of containers. Using Docker, you can delegate any heavy lifting to code running inside a containerized environment, which in turn can be set up to support any programming language you might need. Both approaches are sketched below.

For an example of a very advanced Luigi-powered setup, check out this excellent talk by Ville Tuulos from AdRoll. He gives a tour of AdRoll's data processing systems, which use Luigi tasks to orchestrate Docker containers in the AWS cloud. With this setup, AdRoll is able to handle data at petabyte scale while giving its developers a free choice of tools.
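To make the subprocess idea concrete, here is a minimal sketch of a task that shells out to a hypothetical R script; the file names and paths are made up:

```python
import subprocess

import luigi


class PrepareInput(luigi.ExternalTask):
    """Raw input data assumed to exist already (hypothetical path)."""

    def output(self):
        return luigi.LocalTarget("raw-data.csv")


class FitModel(luigi.Task):
    """Runs an R script as a subprocess; Luigi only orchestrates."""

    def requires(self):
        return PrepareInput()

    def output(self):
        return luigi.LocalTarget("model-coefficients.csv")

    def run(self):
        # Hand the input and output paths over to the external program.
        subprocess.check_call([
            "Rscript", "fit_model.R",
            self.input().path,
            self.output().path,
        ])
```

Because `check_call` raises on a non-zero exit code, a failing R script fails the Luigi task, so dependency resolution, retries, and alerting behave exactly as they would for pure-Python tasks.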
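And a similarly hedged sketch of the container approach, calling the Docker CLI directly; the image name, script, and paths are hypothetical (Luigi's contrib package also includes Docker helpers you may prefer):

```python
import os
import subprocess

import luigi


class CrunchInContainer(luigi.Task):
    """Delegates processing to code running inside a Docker container."""

    def output(self):
        return luigi.LocalTarget("output/aggregates.csv")

    def run(self):
        # The image can bundle any language or toolchain; Luigi only needs
        # the command to succeed and the output target to appear.
        subprocess.check_call([
            "docker", "run", "--rm",
            "-v", "%s/output:/output" % os.getcwd(),
            "example/julia-cruncher:latest",  # hypothetical image
            "julia", "/app/aggregate.jl", "/output/aggregates.csv",
        ])
```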