Wednesday, February 29, 2012

Big Data vs. Big Compute, Two Sides of the Same Coin?

With Big Data designated as the next frontier for innovation, competition, and productivity by McKinsey, and Gartner selecting it as one of the top 10 strategic technologies for 2012, Big Data is everywhere in the media and at the top of people’s minds. But are Big Data technologies sufficient for addressing “big data” problems? What about compute-intensive applications that involve processing significant amounts of data? The Silicon Valley HPC/GPU Supercomputing Meetup Group decided to bring the two topics together for a discussion.

“It wasn't what I expected, which turned out to be a good thing,” commented one attendee of the recent panel discussion and debate on the similarities and differences between Big Data and Big Compute. When multiple reviews of the event run over 150 words each, it is clear that the event struck a chord and, perhaps, stirred some controversy.

Big Data and Big Compute, specifically using GPUs in this case, are not new concepts. Big Data usually refers to datasets too large for conventional database management tools, and GPU computing leverages the parallel nature of the graphics processing unit to perform computation in applications traditionally handled by the CPU.
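For readers new to GPU computing, here is a minimal sketch (my own illustration, not code from the panel) of the pattern: the host copies data to the device, launches many lightweight threads that each work on a small piece of the data in parallel, then copies the results back.

    // Minimal CUDA sketch of offloading a data-parallel computation to the GPU.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread scales one element of the array.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;                // one million elements
        size_t bytes = n * sizeof(float);

        float *host = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) host[i] = 1.0f;

        float *dev;
        cudaMalloc(&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // move data to the GPU

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(dev, 2.0f, n);                // launch the parallel kernel

        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);    // move results back
        printf("host[0] = %f\n", host[0]);                       // expect 2.0

        cudaFree(dev);
        free(host);
        return 0;
    }

Note that half of this toy program is data movement between CPU and GPU memory, which is exactly where the Big Data vs. Big Compute tension shows up.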

The Meetup organizer and panel moderator, Dr. Jike Chong, offered several interpretations of Big Data and Big Compute drawn from speaking with the four panelists and their diverse backgrounds: constraint-based, relative, scalability-based, and historical. The discussion quickly heated up:

Is Big Data vs. Big Compute a wallet problem in an economic context? It depends: do we know which data is valuable before the computation takes place?

What are the differences in priority? For big data, the priority tends to be exploring things rapidly, whereas for big compute, there is more focus on optimization and fine-tuning of the hardware.

While the focuses of the two approaches appear different, both are concerned with throughput. Some may say these are two different ways of solving the same problem, while others (myself included) advocate that the two work together to solve compute-intensive big data problems.

Steve Scott, NVIDIA Tesla CTO and one of the panelists, put it well with a summary on GPUs for Big Data:

If it is all data movement, there’s no need for a GPU or a CPU.

If there’s some serious computing that needs to be done on that data and the problem can be distributed, GPUs can enable more complex analysis.

If the problem has no locality, such as in big graph analytics, GPUs may work well in the future.

So where is the convergence, and what are the implications? Many big compute problems are also big data problems: reverse time migration in oil and gas, and visual search, for example. It also turns out that a critical concern for both is power: how do we use less power per unit of productivity? Both Big Data and Big Compute have to optimize for it, and will be driving that effort over the next five years.

Big Data and Big Compute are not opposing concepts, and the discussion revealed more than a difference of perspective, application, or priority; there is an underlying cultural difference between the two camps using these approaches. Going forward, Big Data, as a “marriage of ‘database’ with compute”, and Big Compute need to take each other into consideration, as technologies for each can shine where their priorities and interests align.

More information can be found in the linked slides.

The four panelists come from diverse backgrounds, and both Aaron Kimball and Tim Child had presented to the group previously:

Aaron Kimball, co-founder of WibiData
Steve Scott, Tesla CTO, NVIDIA
Tim Kaldewey, IBM research
Tim Child, Chief Entrepreneur Officer at 3DMashup

A description of Aaron Kimball's talk, Large Scale Machine Learning on Hadoop and HBase, is here, and Tim Child's talk on Postgres is here; it was also mentioned here.

Join us next time on March 26 for another exciting discussion in HPC and GPU Supercomputing!