4 Hot Open Source Big Data Projects
Updated · Apr 30, 2013
It's difficult to talk about Big Data processing without mentioning Apache Hadoop, the open source Big Data software platform. But Hadoop is only part of the Big Data software ecosystem, and many other open source projects are emerging to help you get more from Big Data.
Here are a few interesting ones that are worth keeping an eye on.
Spark
Spark bills itself as providing “lightning-fast cluster computing” that makes data analytics fast to run and fast to write. It's being developed at UC Berkeley's AMPLab and is free to download and use under the open source BSD license.
So what does it do? Essentially, it's an extremely fast cluster computing system that can keep working data in memory. It was designed for two kinds of applications where holding data in memory is an advantage: running iterative machine learning algorithms, and interactive data mining.
It's claimed that Spark can run up to 100 times faster than Hadoop MapReduce for these workloads. Spark can access any data source that Hadoop can access, so you can run it against any datasets you have already set up in a Hadoop environment.
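To make the interactive case concrete, here is a minimal sketch using Spark's Python API; the HDFS path and search terms are placeholders. Because the filtered dataset is cached in memory after the first query, the follow-up query never goes back to disk:

    from pyspark import SparkContext

    # Connect to a local Spark instance (you'd use a cluster URL in production).
    sc = SparkContext("local", "LogMiner")

    # Load a text file and keep the interesting subset cached in memory.
    # The HDFS path is a placeholder.
    logs = sc.textFile("hdfs:///logs/app.log")
    errors = logs.filter(lambda line: "ERROR" in line).cache()

    # The first action computes the dataset and pins it in memory...
    print(errors.count())

    # ...so this second, interactive query runs from memory, not disk.
    print(errors.filter(lambda line: "timeout" in line).count())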
Download Spark
Drill
Apache Drill is “a distributed system for interactive analysis of large-scale datasets.”
MapReduce is often used to perform batch analysis of Big Data in Hadoop, but what if batch processing isn't suited to the task at hand? What if you want fast answers to ad-hoc queries so you can carry out interactive data analysis and exploration?
Google developed its own solution to this problem for internal use with Dremel, and you can access Dremel as a service using Google's BigQuery.
However, if you don't want to use Google's Dremel on a software-as-a-service basis, Apache is backing Drill as an Incubator project. Its design is based on Dremel, and its goal is to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.
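Drill was still incubating at the time of writing, so there was no stable client API; the sketch below is purely illustrative, assuming a hypothetical DB-API-style Drill driver. The point is the shape of the workload: an exploratory query that should come back in seconds rather than requiring a batch MapReduce job:

    # Purely illustrative: the connection object is assumed to come from a
    # hypothetical Drill driver; the table and columns are invented.
    def top_regions(connection):
        cursor = connection.cursor()
        # An ad-hoc, exploratory query of the kind Drill is designed to
        # answer interactively.
        cursor.execute("""
            SELECT region, COUNT(*) AS events
            FROM clickstream
            WHERE event_date >= '2013-04-01'
            GROUP BY region
            ORDER BY events DESC
            LIMIT 10
        """)
        return cursor.fetchall()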
Download Drill source code
D3.js
D3 stands for Data-Driven Documents, and D3.js is an open source JavaScript library that lets you manipulate documents based on data. It was developed by New York Times graphics editor Michael Bostock.
Using D3.js you can create dynamic graphics using Web standards like HTML5, SVG and CSS. For example, you can generate a plain old HTML table from an array of numbers; more impressively, you can make an interactive bar chart, rendered as scalable vector graphics, from the same data.
That barely scratches the surface of what D3 can do, however. There are dozens of visualization methods (chord diagrams, bubble charts, node-link trees, dendrograms and more), and thanks to D3's open source nature, new ones are being contributed all the time.
D3 has been designed to be extremely fast, it supports large datasets, and it works across platforms and browsers. As a result, it has become an increasingly popular tool for presenting graphical visualizations of the results of Big Data analysis. Expect to see more of it in the coming months.
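D3 itself runs in the browser, so the quickest way to try the bar chart example above is a self-contained HTML page; the short Python sketch below writes one out. The embedded script is the classic D3 data join that turns an array of numbers into a row of div bars (the script URL assumes the v3 builds hosted at d3js.org at the time):

    # Writes a self-contained HTML page rendering a simple D3 bar chart.
    # The script URL assumes the v3-era builds hosted at d3js.org.
    PAGE = """<!DOCTYPE html>
    <html>
    <head><script src="http://d3js.org/d3.v3.min.js"></script></head>
    <body>
    <script>
    // Join an array of numbers to div elements: one bar per datum.
    d3.select("body").selectAll("div")
        .data([4, 8, 15, 16, 23, 42])
      .enter().append("div")
        .style("width", function(d) { return d * 10 + "px"; })
        .style("background", "steelblue")
        .style("color", "white")
        .style("margin", "1px")
        .text(function(d) { return d; });
    </script>
    </body>
    </html>
    """

    with open("bars.html", "w") as f:
        f.write(PAGE)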
Download D3.js
HCatalog
HCatalog is an open source metadata and table management framework that works with Hadoop HDFS data, and which is distributed under the Apache license. It's being developed by engineers at Hortonworks, the commercial Hadoop company that is also a sponsor of the Apache Software Foundation.
The idea of HCatalog is to liberate Big Data by allowing different tools to share the Hive metastore. That means Hadoop users working with Pig or MapReduce or Hive have immediate access to data created with another tool, with no loading or transfer steps. Essentially, it makes the Hive metastore available to users of other Hadoop tools by providing connectors for MapReduce and Pig, so users of those tools can read data from and write data to Hive's warehouse.
It also has a command line tool, so that users who do not use Hive can operate on the metastore with Hive Data Definition Language (DDL) statements.
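For example, a Pig or MapReduce user could define a table in the shared metastore without ever opening Hive. A minimal sketch, assuming HCatalog's hcat command line client is installed and on your PATH (the table definition is invented for the example):

    import subprocess

    # Run a Hive DDL statement through HCatalog's command line tool so the
    # resulting table is visible to Pig, MapReduce and Hive alike.
    ddl = "CREATE TABLE web_logs (ip STRING, url STRING, hits INT);"
    subprocess.check_call(["hcat", "-e", ddl])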
Download HCatalog
And More
Other open source Big Data projects to watch:
Storm. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
Kafka. Kafka is a messaging system originally developed at LinkedIn to serve as the foundation for its activity stream and operational data processing pipeline. It is now used at a variety of companies for data pipeline and messaging purposes.
Julia. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments.
Impala. Cloudera Impala is a distributed query execution engine that runs against data stored natively in Apache HDFS and Apache HBase.
Paul Rubens has been covering IT security for over 20 years. In that time he has written for leading UK and international publications including The Economist, The Times, Financial Times, the BBC, Computing and ServerWatch.
Paul Ferrill has been writing about computers and network technology for over 15 years. He holds BS and MS degrees in Electrical Engineering, specializes in complex data analysis and storage, and is a regular contributor to the computer trade press. Over the years he has written hundreds of articles and two books; his work has appeared in Enterprise Apps Today, InfoWorld, Network World, PC Magazine, Forbes, and many other publications.