A Better Data Pipeline: From Hadoop to BigQuery at DeNA

The following text is based on the great lightning talk held at GDC'15 by Appie Hirve from DeNA. It is a very worthwhile presentation, densely packed with great insights. If you are interested in officially endorsed case studies focused on BigQuery, take a look here. Slides from a more recent version of the DeNA talk are available on SlideShare.

To rule out any misunderstandings: I am in no way affiliated with DeNA, have not worked with this great company and have thus sadly had no part in their transition towards a better way of handling data and their impressive results.

This article is an attempt to present their situation, challenges and achieved results at a more leisurely pace than a lightning talk affords. Theirs is an archetypical example of what growing companies struggle with data-wise, and of what they can expect to achieve by investing in their data pipeline.

Who is DeNA?

DeNA is a Japanese company with a "growing global presence in gaming". They publish and operate titles developed by third parties, in cooperation with other game studios, or by DeNA in-house. The company operates globally, with $2 billion worth of virtual in-game currency spent by gamers in Japan and $270 million around the rest of the world. As with many social/mobile games, many of their titles produce a steady stream of actionable data and, in theory, there are constant opportunities to increase revenue by interacting with players and improving the games with the help of new information.

The Situation

DeNA has the opportunity to keep track of many different data categories that relate to their games:

Game KPIs, such as the number of active users or retention rates.
Marketing data, to know what sells where.
Game logs for custom game insights, which can be used to tune games and improve their design.
Ad vendor data, to better monetize traffic.

As the company cooperates with other game studios, passing insights on and sharing data with third parties and internal teams in a controlled fashion is desirable for everybody involved. Like many other companies in the industry, DeNA runs real-time game events in its titles, where real-time data and fast iterations would enable them to experiment and perform better. With their old data pipeline, however, it took three hours for data to become available.

The amount of data that DeNA needed to handle on busy days was up to ~60MB of raw logs per minute, or ~50GB per day. A formidable amount that is large enough to warrant the comment:

That has given us some headaches in the past.
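
As a quick sanity check on those figures, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 GB = 1000 MB) and treats the ~60MB per minute as a peak rate, which is my reading of the numbers rather than something stated in the talk. The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the log volumes quoted above.
# Assumption: decimal units (1 GB = 1000 MB); names are illustrative only.

MINUTES_PER_DAY = 24 * 60  # 1440

peak_mb_per_minute = 60   # "~60MB of raw logs per minute" on busy days
daily_gb = 50             # "~50GB per day"

# Volume if the peak rate were sustained for a full day.
sustained_peak_gb_per_day = peak_mb_per_minute * MINUTES_PER_DAY / 1000

# Average per-minute rate implied by the daily total.
average_mb_per_minute = daily_gb * 1000 / MINUTES_PER_DAY

print(f"Peak rate sustained all day: {sustained_peak_gb_per_day:.1f} GB")
print(f"Average rate implied by {daily_gb} GB/day: {average_mb_per_minute:.1f} MB/min")
```

Under these assumptions the two figures fit together: the peak rate sustained for a whole day would come to about 86 GB, while 50 GB per day averages out to roughly 35 MB per minute.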

Their data infrastructure before BigQuery was a Hadoop cluster. It contained 42TB of data covering roughly four years, from May 2011 to January 2015. Data from over 100 different titles (and multiple studios) was stored there and could subsequently be accessed via HiveQL.

The Problems

The old data infrastructure had grown more and more complicated, with a lot of bottlenecks and failure points. Among the biggest issues negatively impacting the business were:

A very slow ETL process

The delay between a player triggering an action in a game and the moment people at DeNA could use the resulting data was a real constraint. It took three hours or longer for data to be ready for analysis. As mentioned previously, the negative impact was especially felt during live events, where it was the bottleneck preventing quick reactions and iterations.

Too many cooks

The number of people and systems (marketing, finance, users and the games themselves) that directly or indirectly needed productive access to the data had grown over time. The large number of users started to clog the system "a lot" and even brought it down from time to time. Just imagine the amount of working time lost due to delays and downtimes. A large part of this issue was due to difficulties with controlling permissions, in part because the company was using an older version of Hadoop.

Slow queries

This is a common issue, and one often taken for granted. It carries huge costs in the form of context switches, slow iterations, frustration and questions left unasked. Appie captures the user perspective on this problem perfectly during her talk:

Running some query, you kick it off, and you wait and wait, and it takes forever to get data back. You forget what you were looking at when you started.

Having to deal with queries that take minutes and hours to compute should not be the norm.

A Better Data Pipeline

DeNA decided to invest in upgrading the tech stack for games serving their western audience. They switched to Google App Engine as a platform for their titles, which made Google BigQuery a natural, synergetic choice for data storage and analysis.

For us it's really been a good solution. We want to be in control of our own data in-house, to do our own analysis and we don't want to have to worry about maintaining the infrastructure to make it happen. We are a games company after all.

The Impact

The investment in a better data pipeline has solved all of the problems previously described, while making it possible for DeNA to focus on their core business instead of worrying about infrastructure.

Data ingestion is near-instant

With the new data pipeline, data ingestion has become much quicker. The time for data to become available dropped from more than three hours to seconds.

Scaling is taken care of

DeNA is now able to scale their team and products with fewer worries. Publishing more titles or hiring additional analysts will not lead to technical issues with their data processing pipeline. The Google platform and services take care of issues that were previously a major concern. Whole systems are a lot less likely to go down in the course of day-to-day usage.

Sharing data while maintaining control is not a big challenge anymore, be it internally or with external partners. Permissions are very easy to manage, compared to the previous constraints of the legacy setup.

Queries are fast

The fine people at Google know how to make data talk. For example: a query processing 173 GB of data now runs in under 8 seconds. The same query over less data in the previous Hadoop setup took about 2 minutes to compute.
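
To put those numbers in perspective, here is a small Python sketch of the arithmetic. It takes the quoted figures at face value (8 seconds as an upper bound, 2 minutes as an approximation), so the results are rough bounds rather than measurements. The variable names are my own.

```python
# Rough bounds derived from the query timings quoted above.
# Names are illustrative; the inputs are the figures from the talk.

data_gb = 173            # data scanned by the BigQuery query
bigquery_seconds = 8     # "under 8 seconds", so treat as an upper bound
hadoop_seconds = 2 * 60  # "about 2 minutes" on the old Hadoop setup

# Lower bound on the effective scan throughput in BigQuery.
min_throughput_gb_per_s = data_gb / bigquery_seconds

# Approximate wall-clock speedup; the Hadoop run even covered less data.
approx_speedup = hadoop_seconds / bigquery_seconds

print(f"BigQuery scanned at least {min_throughput_gb_per_s:.1f} GB/s")
print(f"Wall-clock speedup of roughly {approx_speedup:.0f}x")
```

In other words, the new setup chews through more than 21 GB per second and answers roughly 15 times faster, and that comparison still flatters Hadoop because its query ran over less data.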

In addition, BigQuery makes connecting to the data with analytical tools such as Tableau very simple. Interactive data analysis and exploration without frequent busy-waiting, context switches or losing the train of thought is a clear win. I assume it has resulted in better decisions and more questions being asked, which are bound to have a big impact on the business in the long run.

Wrapping up

Upgrading their data pipeline has been a huge win for DeNA. They chose to rely on Google's services, and probably have not looked back to their old setup since the dust settled down.

"In summary: BigQuery has really helped us actually use data instead of exhausting ourselves, trying to get to the data itself. We are excited about what analysts can push themselves to, now that we upgraded the tools."

It's impressive how much fascinating information has been compressed into a 5-minute lightning talk. I hope that this writeup has done it justice. I would be very interested in getting to know more about the business impact for the western market and the ROI of their investment in the long run. If I get the chance, I will definitely make sure to be in the audience of DeNA talks in the future.

Data Pipeline Architect