Apache Drills into Hadoop
Updated · Dec 02, 2014
It's a big week for Big Data and the open source Hadoop ecosystem. Apache Hadoop 2.6.0 was released on Nov. 30, and today the Apache Drill project announced that it has become a top-level project at the Apache Software Foundation.
The Apache Hadoop 2.6.0 release, the fourth incremental Hadoop release this year from Apache, fixed 900 issues across the Hadoop Common, HDFS, YARN and MapReduce projects. Among the improvements in Hadoop 2.6.0 is support for heterogeneous storage tiers for HDFS.
“Admins can define storage tiers across disks in a data node, and applications can utilize APIs to store data to these different storage tiers,” Arun Murthy, founder of Hortonworks, wrote in a blog post.
Apache Drill
Apache Drill is one of many projects that build on Hadoop. Drill, which describes itself as a schema-free SQL query engine, had been an Apache incubator project until this week. Tomer Shiran, Apache Drill founder, explained to Enterprise Apps Today that as a top-level project, Drill now has autonomy at the Apache Software Foundation, has its own board (known as a Project Management Committee, or PMC) and makes its own decisions.
“A TLP is more visible to users,” Shiran said. “For example, you'll notice that as of this morning, Drill's website is at drill.apache.org, as opposed to incubator.apache.org/drill.”
Shiran noted that the graduation is an indicator that Drill has established a strong community of users and developers. From an adoption standpoint, Drill had its first public release in August and in just a few months has attracted dozens of companies, including some members of the Global 2000.
A number of open source efforts are trying to solve the Big Data query challenge. One such effort is Cloudera's Impala project, which became generally available in 2013. Shiran said Drill is a different animal than Impala, as Drill is a schema-free SQL engine with a JSON data model.
Shiran explained that the data in Hadoop normally already has structure, whether it sits in Parquet files, JSON files or HBase tables.
Self Service Data Exploration
“With Drill the user can just query that data, and Drill is able to pick up that structure on the fly during query execution,” Shiran said. “With Impala, the user must wait for IT to define and maintain redundant schemas on all that data.”
The promise of Drill is to enable self-service data exploration, which increases agility and makes users more productive. Shiran added that Drill is the only SQL engine that can handle evolving data automatically, for example when fields change or new data sources are added. Another difference is that Drill supports many data sources, so it's not limited to a single HDFS source.
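To make the idea of picking up structure during query execution more concrete, here is a minimal Python sketch of schema-on-read over JSON records whose fields change from one record to the next. It is not Drill's implementation or API. The sample data and the function names (scan, select, discovered_columns) are hypothetical and exist only for illustration.

```python
import json

# Hypothetical sample data: newline-delimited JSON in which later records
# gain or lose fields. No schema is declared anywhere.
RAW = """\
{"name": "alice", "clicks": 3}
{"name": "bob", "clicks": 5, "country": "CA"}
{"name": "carol", "country": "US"}
"""


def scan(raw):
    """Yield one dict per non-empty JSON line."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def discovered_columns(records):
    """Collect field names in first-seen order while scanning the records."""
    seen = []
    for rec in records:
        for key in rec:
            if key not in seen:
                seen.append(key)
    return seen


def select(records, columns, where=None):
    """Project the requested columns, using None where a record lacks a field."""
    rows = []
    for rec in records:
        if where is None or where(rec):
            rows.append(tuple(rec.get(col) for col in columns))
    return rows


if __name__ == "__main__":
    print(discovered_columns(scan(RAW)))
    print(select(scan(RAW), ["name", "country"]))
```

In this sketch the set of columns is only known after the data has been read, and a record that lacks a requested field simply produces an empty value instead of an error. That is the general behavior a schema-free engine needs when fields come and go; how Drill itself achieves it is not described in this article.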
Shiran said an administrator could run Drill on top of the latest version of Hadoop.
“This would allow users to submit ad-hoc queries on the data in HDFS, either by providing the SQL statement or through a BI tool like Tableau,” Shiran said. “Unlike any other SQL engine on Hadoop, any end-user could explore their data without requiring IT to define the schemas on the data. That's the self-service nature of Drill.”
Looking forward, Shiran said that Drill 1.0 will come out early next year. The project will continue to improve Drill's performance and expand its advanced SQL capabilities.
“We're already seeing some very interesting applications that simply weren't possible before, such as ad-hoc analysis of very sparse data,” he said.
Sean Michael Kerner is a senior editor at Enterprise Apps Today and InternetNews.com. Follow him on Twitter @TechJournalist.