Big data is … well … big in size! Exactly how much data counts as big data is not very clear cut, so let's not get bogged down in that debate. For a small company used to dealing with data in gigabytes, 10 TB of data would be BIG. However, for companies like Facebook and Yahoo, petabytes are big.
The sheer size of big data makes it impossible (or at least cost prohibitive) to store in traditional storage like databases or conventional filers. With traditional storage filers we are talking about real money per gigabyte, and at big data scale the storage bill alone becomes enormous.
Here we'll take a look at big data, its challenges, and how Hadoop can help solve them. First up, big data's biggest challenges.
Big Data Is Unstructured or Semi-Structured
A lot of big data is unstructured. For example, click stream log data might look like:
timestamp, user_id, page, referrer_page
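A few rows in that format (the values here are purely made up for illustration) might be:

1418523001, user_1234, /products/widget-42, /home
1418523007, user_5678, /checkout, /products/widget-42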
This lack of structure makes relational databases a poor fit for storing big data. Plus, not many databases can cope with storing billions of rows.
There's No Point in Storing Big Data If We Can't Process It
Storing big data is only part of the game. We have to process it to mine intelligence out of it. Traditional storage systems are pretty "dumb" in the sense that they just store bits; they don't offer any processing power.
The traditional data processing model has data stored in a storage cluster, which is copied over to a compute cluster for processing. The results are written back to the storage cluster.
This model, however, doesn't quite work for big data because copying so much data out to a compute cluster might be too time consuming or impossible. So what is the answer?
One solution is to process big data in place, with the storage cluster doubling as a compute cluster.
As we have seen above, big data defies both traditional storage and traditional processing. So how do we handle it?
How Hadoop Solves the Big Data Problem
Hadoop is built to run on a cluster of machines
Let's start with an example. Say that we need to store lots of photos. We start with a single disk. When we outgrow that disk, we use a few disks stacked in a machine. When we max out all the disks on a single machine, we need a bunch of machines, each with a bunch of disks.
This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get-go.
Hadoop clusters scale horizontally
More storage and compute power can be achieved simply by adding more nodes to a Hadoop cluster. This eliminates the need to buy ever more powerful and expensive hardware.
Hadoop can handle unstructured/semi-structured data
Hadoop doesn't enforce a schema on the data it stores. It can handle arbitrary text and binary data. So Hadoop can digest any unstructured data easily.
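To make that concrete, here is a minimal sketch (not from the book; the path and sample line are made up) of writing arbitrary bytes into HDFS with Hadoop's standard FileSystem API. HDFS simply stores whatever bytes it is given; any interpretation of those bytes happens later, at processing time.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreAnything {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system (HDFS)
        Path out = new Path("/data/raw/clicks-2014-12-14.log");   // hypothetical path

        // HDFS doesn't care what these bytes mean -- text, JSON, images, anything goes.
        try (FSDataOutputStream stream = fs.create(out)) {
            stream.write("1418523001, user_1234, /products/widget-42, /home\n".getBytes("UTF-8"));
        }
    }
}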
Hadoop clusters provide storage and computing
We saw how having separate storage and processing clusters is not the best fit for big data. Hadoop clusters, however, provide storage and distributed computing all in one.
The Business Case for Hadoop
Hadoop provides storage for big data at reasonable cost
Storing big data using traditional storage can be expensive. Hadoop is built around commodity hardware, so it can provide fairly large storage for a reasonable cost. Hadoop has been used in the field at petabyte scale.
One study by Cloudera suggested that enterprises usually spend around $25,000 to $50,000 per terabyte per year. With Hadoop, this cost drops to a few thousand dollars per terabyte per year. As hardware gets cheaper and cheaper, this cost continues to drop.
Hadoop allows for the capture of new or more data
Sometimes organizations don't capture a type of data because it is too cost prohibitive to store. Since Hadoop provides storage at reasonable cost, this type of data can now be captured and stored.
One example is website click logs. Because the volume of these logs can be very high, not many organizations captured them. Now, with Hadoop, it is possible to capture and store those logs.
With Hadoop, you can store data longer
To manage the volume of data stored, companies periodically purge older data. For example, only the last three months of logs might be kept, while older logs are deleted. With Hadoop it is possible to store historical data longer, which allows new analytics to be run on older data.
For example, take click logs from a website. A few years ago, these logs were stored only briefly, to calculate statistics like popular pages. Now, with Hadoop, it is viable to store these click logs for a longer period of time.
Hadoop provides scalable analytics
There is no point in storing all this data if we can't analyze it. Hadoop provides not only distributed storage but also distributed processing, which means we can crunch a large volume of data in parallel. Hadoop's compute framework is called MapReduce, and it has been proven at petabyte scale.
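To give a feel for what a MapReduce job looks like, here is a minimal sketch in Java (the class names, table of fields, and input/output paths are assumptions for illustration, not code from the book). It counts page views from click-log lines in the "timestamp, user_id, page, referrer_page" format shown earlier: the map step emits (page, 1) for each line, and the reduce step sums the counts for each page.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCount {

    // Map: each log line -> (page, 1)
    public static class ViewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                page.set(fields[2].trim());   // third field is the page
                context.write(page, ONE);
            }
        }
    }

    // Reduce: (page, [1, 1, 1, ...]) -> (page, total views)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page view count");
        job.setJarByClass(PageViewCount.class);
        job.setMapperClass(ViewMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/raw/clicks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/reports/page-views
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would be packaged into a jar and submitted to the cluster, where Hadoop schedules map tasks on the nodes that already hold the input data blocks, so the data does not have to be copied to a separate compute cluster.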
Hadoop provides rich analytics
Native MapReduce supports Java as its primary programming language. Other languages like Ruby, Python, and R can be used as well.
Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop. Higher-level abstractions over MapReduce are available. For example, a tool named Pig takes an English-like data flow language and translates it into MapReduce jobs. Another tool, Hive, takes SQL queries and runs them using MapReduce.
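For instance, here is a rough sketch of submitting a SQL query to Hive over JDBC from Java (the host name and the click_logs table are hypothetical, and it assumes a running HiveServer2 instance). Hive compiles the query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // Hive JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server.example.com:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Hypothetical Hive table of click logs: (ts, user_id, page, referrer)
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS views FROM click_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}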
Business intelligence (BI) tools can provide an even higher level of analysis, and such tools are available for Hadoop as well.
This content is excerpted from "Hadoop Illuminated" by Mark Kerzner and Sujee Maniyam. It has been made available via Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.