3 Tools for Combining Multiple Data Sources

How do you combine data from multiple sources into one place, so that it can be queried quickly and kept safe in the long run? Which tools are popular for company data plumbing right now? Is there one best solution that works for everybody?

Whatever happens to the data in between, your best bet is to have it arrive in a solid data warehouse such as Amazon Redshift. With that decided, what is the best way to get it there? The ecosystem of data-related tools is huge, ever growing and, depending on where you look, pretty confusing to navigate.

The best choice for you depends on what stage your data efforts are at, what infrastructure you have in place, how heavily it is used, what long-term plans you have for your data and how fast you want to process it. Let's look at three representative tools and when to use them.

Data Virtuality

If you have data in various third-party services like Google Analytics as well as several different databases, Data Virtuality lets you access all of it transparently from one place. You connect your databases and configure the application through a simple graphical interface; no coding necessary. Popular tools such as Tableau can then connect to it and query the connected data sources without breaking a sweat or struggling with the data volume. In the background, Data Virtuality is usually configured to load data into Amazon Redshift intelligently, which reduces the load on the source databases and speeds up queries significantly. An Amazon AMI is available, so you can run the Data Virtuality application on an AWS cloud instance and get started right away. The software is commercial and regular licensing fees apply.
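
To give a flavour of what "one place" means in practice, here is a minimal sketch of querying the virtual layer from Python. It assumes an ODBC data source named datavirtuality has been set up to point at your Data Virtuality server, and the view names (analytics.ga_sessions, shop.orders) are hypothetical placeholders for whatever sources you have connected.

    # Hypothetical sketch: query two connected sources through Data Virtuality's
    # single SQL interface. The DSN, credentials and view names are placeholders.
    import pyodbc

    connection = pyodbc.connect("DSN=datavirtuality;UID=user;PWD=secret")
    cursor = connection.cursor()

    # One SQL statement joining a Google Analytics view with a production database view.
    cursor.execute("""
        SELECT o.order_date, SUM(o.total) AS revenue, SUM(g.sessions) AS sessions
        FROM shop.orders o
        JOIN analytics.ga_sessions g ON g.date = o.order_date
        GROUP BY o.order_date
    """)

    for row in cursor.fetchall():
        print(row)

    connection.close()

The point is not the specific query, but that the join across a web service and a database looks like ordinary SQL against a single server.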

Luigi

Want to load data periodically from various databases and logs, while being able to perform long-running, complex chained operations on it? You also really like Python? Great! Luigi was developed by Spotify and is currently used by many successful companies such as Buffer and Stripe. Its purpose is to build complex pipelines of batch jobs, with scheduling and dependency resolution. It is widely used, has a strong community and is very developer friendly. In case something breaks, or just to see how everything is going, you can get a visual overview of your whole pipeline with no extra effort. This clean and readable open source project is very versatile and fits well into the Python and Hadoop data-processing ecosystem. You do need to write code to use it - with great power comes great responsibility. Luigi is meant for projects that crunch data at regular intervals, and is perfect for creating maintainable and elegant data processing pipelines.
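
To give an idea of what that code looks like, here is a minimal sketch of two chained Luigi tasks. The task names, file paths and the toy aggregation are made up for illustration; a real pipeline would extract from your databases and logs and most likely write to Redshift, S3 or HDFS instead of local files.

    # Minimal Luigi sketch: DumpEvents must finish before AggregateEvents runs.
    # Paths and logic are illustrative only.
    import luigi


    class DumpEvents(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(self.date.strftime("events_%Y-%m-%d.csv"))

        def run(self):
            # Stand-in for a real extract from a database or log file.
            with self.output().open("w") as out:
                out.write("user,action\n1,signup\n2,login\n")


    class AggregateEvents(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return DumpEvents(self.date)

        def output(self):
            return luigi.LocalTarget(self.date.strftime("event_counts_%Y-%m-%d.txt"))

        def run(self):
            # Count event rows, excluding the header line.
            with self.input().open("r") as events, self.output().open("w") as out:
                out.write(str(sum(1 for _ in events) - 1))


    if __name__ == "__main__":
        luigi.run()

Running something like "python pipeline.py AggregateEvents --date 2015-09-01 --local-scheduler" makes Luigi resolve the dependency, run DumpEvents first and skip any task whose output already exists.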

Snowplow

Described as an event analytics platform, Snowplow takes a different approach from the two tools above. Instead of transforming data from existing databases, services or logs, it focuses on generating and processing its own events. You include Snowplow tracking code in your mobile apps, servers and website. From there, events are passed into a central, configurable pipeline, where they are handled by custom code, and the granular event data ends up in the data warehouse of your choice. In addition to getting events into something like Amazon Redshift, Snowplow can also pass the data on for real-time processing, which is great for building data-driven applications. Like Luigi, Snowplow is widely used, developer friendly and open source, but there is also the option of buying a managed service plan from the company that created it.
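
As a rough sketch of the tracking side, here is how sending an event from a Python backend might look with Snowplow's Python tracker. The collector address and event fields are placeholders, and the exact constructor signatures differ between tracker versions, so treat this as indicative rather than copy-paste ready.

    # Indicative sketch using the snowplow-tracker package; the endpoint and
    # event details are placeholders, and signatures vary across versions.
    from snowplow_tracker import Emitter, Tracker

    emitter = Emitter("collector.example.com")      # your Snowplow collector
    tracker = Tracker(emitter, app_id="web-shop")   # one tracker per application

    # A simple structured event; richer self-describing events are also supported.
    tracker.track_struct_event(
        category="checkout",
        action="order-placed",
        label="order-1234",
        value=49.99,
    )

The same idea applies to the JavaScript and mobile trackers: a small snippet fires events, and everything downstream happens in the central pipeline.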

In Conclusion

Depending on your current situation, any of the above tools could be the right choice for your data plumbing needs. Have a question on your mind, or not quite sure whether one of the above is a good fit? Drop me an email and I'll be happy to point you in the right direction.