fun with mysql and hadoop at SCALE 12x

It’s my pleasure to be presenting at SCaLE 12X on the subject of real-time data loading from MySQL to Hadoop. This is the first public talk on work at Continuent that enables Tungsten Replicator to move transactions from MySQL to HDFS (Hadoop Distributed File System). I will explain how replication to Hadoop works, how to set it up, and offer a few words on constructing views of MySQL data using tools like Hive.

As usual with replication everything we are doing on Hadoop replication is open source. Builds and documentation will be publicly available by the 21st of February, which is when the talk happens. Hadoop support is already in testing with Continuent customers, and we have confidence that we can handle basic loading cases already. That said, Hadoop is a complex beast with lots of use cases, and we need feedback from the community on how to make Tungsten loading support better. My colleagues and I plan to do a lot of talks about Hadoop to help community users get up to speed.

Here is a tiny taste of what MySQL to Hadoop loading looks like. Most MySQL users are familiar with sysbench. Have you ever wondered what sysbench tables would look like in Hadoop? Let’s use the following sysbench command to apply transactions to table db01.sbtest:
sysbench –test=oltp –db-driver=mysql –mysql-host=logos1 –mysql-db=db01
–mysql-user=tungsten –mysql-password=secret
–oltp-read-only=off –oltp-table-size=10000
–oltp-index-updates=4 –oltp-non-index-updates=2 –max-requests=200000
–max-time=900 –num-threads=5 run
This results in rows that look like the following in MySQL:
mysql select * from sbtest where id = 2841G
*************************** 1. row ***************************
id: 2841
k: 2
c: 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769
pad: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt
After replication into Hadoop with Tungsten, we can crunch the log records using a couple of HiveQL queries to generate a point-in-time snapshot of the sbtest table on HDFS. By a point-in-time snapshot, I mean that a table that contains not only inserted data but also shows the results of subsequent update and delete operations on each row up to a particular point in time. We can now run the same query to see the data:
hive select * from sbtest where id = 2841;
Total MapReduce jobs = 1
Launching Job 1 out of 1

Job 0: Map: 1 Cumulative CPU: 0.74 sec HDFS Read: 901196 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 740 msec
2841 2 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769 qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt
Tungsten does a lot more than just move transaction data, of course. It also provides tools to generate Hive schema, performs transformations on columns to make them match the limited HiveQL datatypes, and arranges data in a way that allows you generate materialized views for analytic usage (like the preceding example) with minimal difficulty.

If you want to learn more about how Tungsten does all of this magic, please attend the talk. I hope to see you in Los Angeles.

p.s., If you cannot attend SCaLE 12X, we will have a Continuent webinar on the same subject the following week. (Sign up here.)