pctechguide.com


Guidelines on Processing Big Data with Hadoop

A growing number of social and industrial applications produce data at an exponential rate, generating databases too large for traditional storage media. In a single day, Google must process more than 20,000 terabytes of information, while NASA generates more than 2 gigabytes every 5 minutes. Storing, processing and analyzing this information is not a simple task, because a single computing instance cannot handle such volumes given its limits on storage, working memory and processing power. One solution is to use distributed applications, that is, applications that run on several physical instances linked through a network.

Several technologies assist with big data processing, and Apache Hadoop is one of them. However, many people get stuck when they try to use Hadoop for data mining and analysis.

Overview of Hadoop as a Big Data Mining and Analysis Toolkit

Apache Hadoop is a framework for distributed applications, released under a free license and inspired by Google's papers on MapReduce and the Google File System. It is used by Yahoo, Facebook, LinkedIn and eBay, among others, because it makes it possible to quickly search for words in large bodies of text, sort lists and multiply large matrices (among many other applications).

How do you use Hadoop?

A problem can be addressed with the Hadoop framework only if it can be broken down into the two tasks that Hadoop understands: mapping and reducing. You need to know how to use them before you can get the most out of the framework.

To understand this concept, consider a population census. The census is divided by city: counters are sent to each city to count its population, and they send their results to a central office, where the results are reduced to a single total. This scheme of sending people to the cities in parallel (mapping) and then collecting and combining their counts (reducing) is what Google generalized and called MapReduce.
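The census analogy can be sketched in a few lines of Python. This is only an illustration on one machine, not Hadoop code: the city names, the household sizes and the function names `count_city` and `combine` are all invented for the example.

```python
from functools import reduce

# Hypothetical census data: each city maps to a list of household sizes.
census_by_city = {
    "City A": [3, 2, 4],
    "City B": [1, 5],
    "City C": [2, 2, 2, 1],
}

def count_city(households):
    """Map step: one counter tallies the population of a single city."""
    return sum(households)

def combine(total, city_count):
    """Reduce step: the central office adds one city's count to the total."""
    return total + city_count

# Every city is counted independently, so the map step could run in parallel.
city_counts = list(map(count_city, census_by_city.values()))

# The per-city counts are then reduced to a single number.
total_population = reduce(combine, city_counts, 0)
print(total_population)
```

Because `count_city` needs nothing but its own city's data, each call could run on a different machine, which is the property that makes a problem a good fit for this model.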


A specific problem can be solved with Hadoop if the roles of mapping and reducing can be clearly defined. The Hadoop framework then takes charge of allocating memory to, and handling communication between, the instances of the mapping and reducing functions that it creates in parallel.
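To make this division of labour concrete, here is a minimal single-process sketch of the part a framework takes care of. It is not how Hadoop is implemented: `run_mapreduce`, `parse_count` and `sum_counts` are hypothetical names, the records are made up, and everything runs sequentially in memory. The point is that the user supplies only a mapper and a reducer, while the surrounding code routes every value to the reducer responsible for its key.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        # The mapper may emit any number of (key, value) pairs per record.
        for key, value in mapper(record):
            grouped[key].append(value)
    # Each key and its collected values are handed to the reducer once.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}

def parse_count(record):
    """User-supplied mapper: turn 'city,people' into a (city, people) pair."""
    city, people = record.split(",")
    yield city, int(people)

def sum_counts(city, counts):
    """User-supplied reducer: add up all counts reported for one city."""
    return sum(counts)

# Hypothetical input records, one partial count per line.
records = ["north,3", "south,4", "north,2"]
totals = run_mapreduce(records, parse_count, sum_counts)
print(totals)
```

In a real cluster the grouping step also moves data between machines, but the contract seen by the mapper and the reducer stays the same.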

This article presents a basic example: counting the words in a long document using Hadoop. A document hundreds of gigabytes in size is normally impossible to process on a single computing instance, and that is where tools such as Hadoop become necessary. To use Hadoop, we must identify how the problem should be mapped and how it should be reduced. In general terms, the text is divided into lines. The Mapper function receives some of these lines and reports the words in each line. The Reducer function gathers the words reported by the Mapper and counts them. Here is how to implement such a mechanism. The example was developed on Linux, although Hadoop is cross-platform and is supported by the best-known operating systems.
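The Mapper and Reducer described above can be sketched in Python as follows. The article does not say which language or Hadoop API the original example used, so this is an assumption: the functions follow the line-oriented convention of Hadoop Streaming, in which a mapper emits tab-separated key and value lines and a reducer receives those lines sorted by key. The sample text and the local `sorted` call, which stands in for the cluster's sort between the two phases, are illustrative only.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of each word; the input must be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the cluster: map, sort (the shuffle), then reduce.
text = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = list(reducer(sorted(mapper(text))))
for line in word_counts:
    print(line)
```

The reducer relies on all lines for the same word arriving next to each other, which is why the sort between the two phases matters. To run on a cluster, each function would typically be wrapped in its own script that reads standard input and writes standard output.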

Math Decision provides consultancy for Big Data and machine learning projects. We are a quantitative team made up of researchers of the highest level (Georgia Tech, Univ. of Toronto, Univ. of Sao Paulo) with experience in implementing high-performance algorithms on parallel and distributed systems. For more information, contact us at info@mathdecision.com, visit our website or visit us at rutaN, the innovation center of Medellín.
