Juicing Big Data: Startup Builds GPU Database to Visualize the World on Twitter
What may be the most audacious demo at this week’s Supercomputing 2013 show traces its beginnings to a juice bar in rural Syria.
Map-D, a startup based in Cambridge, Mass., has built a high-speed GPU in-memory database and geospatial visualization tool that can track more than a billion tweets worldwide – and provide real-time interactive visual analysis of an almost boundless number of socio-economic queries.
Located within NVIDIA’s show booth, beside the heavily trafficked open-plan lecture hall, the Map-D demo – the name is a reference to “massively parallel database” – is drawing a steady stream of curious visitors.
With a few mouse clicks, they can instantaneously track the movement of a flu bug across southern U.S. states, building a heat map showing when and where the word “flu” was tweeted over a period of weeks. Or they can track tweets commenting on the relentless sweep of Typhoon Haiyan across the South China Sea, even revealing embedded video links showing the storm’s force at different hours along the way. Political sentiment – say, the response to a President Obama speech – can be tracked in milliseconds.
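As a rough illustration of the heat-map idea described above, the sketch below counts keyword-matching tweets per latitude/longitude grid cell. It is a CPU-only toy and not Map-D's implementation; the sample records, the `heat_map` function and the one-degree cell size are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample records: (latitude, longitude, tweet text).
TWEETS = [
    (29.76, -95.37, "Home sick with the flu"),
    (29.95, -95.10, "flu season is brutal this year"),
    (33.75, -84.39, "Got my flu shot today"),
    (33.75, -84.39, "Great game last night"),
]

def heat_map(tweets, keyword, cell_deg=1.0):
    """Count keyword-matching tweets per latitude/longitude grid cell."""
    counts = Counter()
    for lat, lon, text in tweets:
        if keyword in text.lower():
            # Floor-divide each coordinate to snap the point to its grid cell.
            cell = (int(lat // cell_deg), int(lon // cell_deg))
            counts[cell] += 1
    return counts

flu_cells = heat_map(TWEETS, "flu")
```

Each cell's count would then set its color intensity on the map. Adding a time bucket to the cell key would, under the same assumptions, show how the pattern shifts over the weeks.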
This big data application – which draws on the parallel processing power of GPUs to run analyses 70 to 1,000 times faster than a CPU – was created by a pair of self-trained technologists with social-science backgrounds and a taste for distant corners of the globe.
Their wanderlust caused them to cross paths in the least likely of settings – a far-flung refreshment stand in Syria where they bonded over pomegranate juice – before finding themselves in the same Middle Eastern Law class at Harvard, where they were both graduate students.
Todd Mostak, now 30, came to the work while looking for relationships among 40 million tweets sent during the Egyptian Spring uprising for his master’s in Middle Eastern Studies. He started the project while auditing a database class next door at MIT. Australian co-founder Tom Graham, now 29, a capital markets lawyer who previously practiced in Hong Kong, had lived right beside China’s border with North Korea studying civil disobedience before returning to Harvard Law School to research big data and internet-related law reform issues.
As Mostak dug further into his analysis, he built a GPU-powered database to instantly crunch complex spatial and census data. He was first exposed to GPUs when he taught himself OpenGL to make iPhone apps as a hobby. He then learned CUDA on standard gaming GPUs, which crunched data dramatically faster than CPU-based systems.
The Map-D system now uses NVIDIA’s freshly launched Tesla K40 GPUs, whose massive 12GB of memory unleashes extraordinary speed. The Map-D in-memory column store database is fully integrated into the memory on multiple GPUs and clusters and is powerful enough to query billions of data points and interactively visualize the results in just milliseconds.
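To make the column store idea concrete, here is a minimal sketch that assumes a tiny hypothetical table held as one typed array per column. It runs on the CPU and is not Map-D's code; the column names, sample values and the `count_in_window` helper are invented for illustration.

```python
from array import array

# Hypothetical column store: one typed array per column rather than one record per row.
columns = {
    "time": array("q", [100, 105, 110, 120, 130]),  # seconds since an arbitrary start
    "lat": array("d", [29.7, 33.7, 29.9, 40.7, 30.2]),
    "lon": array("d", [-95.3, -84.3, -95.1, -74.0, -97.7]),
}

def count_in_window(cols, start, end, min_lat, max_lat):
    """Count rows with start <= time < end and min_lat <= lat <= max_lat."""
    # Only the two columns named in the filter are scanned; "lon" is never read.
    return sum(
        1
        for t, lat in zip(cols["time"], cols["lat"])
        if start <= t < end and min_lat <= lat <= max_lat
    )

matches = count_in_window(columns, 100, 125, 25.0, 35.0)
```

Because each column sits in contiguous memory, a filter touches only the columns it names. On a GPU the same per-row test could, in principle, be spread across many threads, which is the kind of parallelism the article credits for the speedup.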
Mostak and Graham’s demo at SC13 is focusing on tweets because of the easy availability of the data. But Map-D’s SQL database works equally well with any kind of massive data set.
Still comprising just its two hard-running co-founders, but with more engineering chops on the way, Map-D is beginning to engage in a range of commercial efforts that complement its usability for social researchers.
Among them are business intelligence efforts involving U.S. government contracts, work for the Saudi government seeking to prevent mishaps among the many millions who make the annual Haj pilgrimage to Mecca, and helping social media companies track marketing opportunities.
Another project involves Major League Baseball, which is looking to provide more vivid graphics for millions of fans. One challenge involves creating a heat map that shows, precisely, where the many thousands of pitches thrown in a major league hurler’s career have tended to land in the strike zone.
Looks like Mostak and Graham have already thrown a few strikes of their own.