pctechguide.com

Guidelines on Processing Big Data with Hadoop

A growing number of social and industrial applications generate data at an exponential rate, producing databases too large for traditional storage media. In a single day, Google must process more than 20,000 terabytes of information, while NASA generates more than 2 gigabytes every 5 minutes. Storing, processing and analyzing this information is not a simple task, since it is not feasible to process such large amounts of data on a single computing instance (because of storage, memory and processing restrictions). One solution to this problem is to use distributed applications: applications that run on different physical instances linked through a network.

A number of technologies assist with big data processing, and Apache Hadoop is one of them. However, many people get stuck when it comes to using Hadoop for data mining and analysis.

Overview of Hadoop as a Big Data Mining and Analysis Toolkit

Apache Hadoop is a framework, distributed under a free license, that supports distributed applications. It was inspired by Google's papers on MapReduce and the Google File System. It is used by Yahoo, Facebook, LinkedIn and eBay, among others, because it makes it possible to quickly search for words in large bodies of text, sort lists and multiply large matrices (among many other applications).

How do you use Hadoop?

Problems that can be addressed with the Hadoop framework must break down into two tasks that Hadoop understands: mapping and reducing. You need to know how to use them before you can get the most out of this framework.

To understand this concept, consider a population census. The census is divided by city: people are sent to each city to count its population, and they send their results to a central place where the results are reduced to a total count. This scheme of sending people to the cities in parallel (mapping) and then centralizing and combining their results (reducing) is what Google generalized and called MapReduce.
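The census analogy can be sketched in a few lines of Python. This is only an illustration on a single machine, not Hadoop itself: the city names, the household sizes and the function names `count_city` and `total_population` are all made up for the example, and a thread pool stands in for the counters working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: household sizes per city (illustrative numbers only).
CITIES = {
    "Medellin": [4, 2, 3],
    "Bogota": [5, 1],
    "Cali": [2, 2, 2, 1],
}

def count_city(households):
    """Map step: one counter tallies the people living in a single city."""
    return sum(households)

def total_population(cities):
    """Count every city in parallel, then reduce the results to one total."""
    with ThreadPoolExecutor() as pool:
        per_city = list(pool.map(count_city, cities.values()))
    # Reduce step: the central office adds up the per-city counts.
    return sum(per_city)

print(total_population(CITIES))
```

Each call to `count_city` is independent of the others, which is what allows the counting to be spread across workers.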

A specific problem can be worked on with Hadoop if the roles of mapping and reducing can be clearly defined. The Hadoop framework takes care of distributing memory and handling communication between the parallel instances it creates of the mapping and reducing functions.
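To make the division of labor concrete, the following sketch separates the two user-defined roles from the part the framework handles. It is a single-process stand-in written for this article, not Hadoop's API: `run_mapreduce`, the report format and the sample data are assumptions chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # collect all values emitted for a key
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: one report per census taker, as "city,people_counted".
reports = ["Medellin,120", "Cali,80", "Medellin,95", "Cali,40"]

def mapper(report):
    """User-defined mapping role: turn one report into a (city, count) pair."""
    city, counted = report.split(",")
    yield city, int(counted)

def reducer(city, counts):
    """User-defined reducing role: combine all counts reported for one city."""
    return sum(counts)

totals = run_mapreduce(reports, mapper, reducer)
print(totals)
```

Only `mapper` and `reducer` describe the problem. Everything inside `run_mapreduce` corresponds to work that a framework such as Hadoop would perform across many machines.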

This article presents a basic example: counting words in a long document using Hadoop. A document hundreds of gigabytes in size is normally impossible to process on a single computing instance, and that is where tools such as Hadoop become necessary. To use Hadoop, we must identify how the problem should be mapped and how it should be reduced. In general terms, the text is divided into lines. The Mapper function receives some of these lines and reports the words in each line. The Reducer function gathers the words reported by the Mapper and counts them. Here is how to implement such a mechanism. The example was developed on Linux, although Hadoop is cross-platform and is supported by the best-known operating systems.
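A minimal sketch of the word count described above, written in plain Python so it can run on one machine without a Hadoop installation. The function names, the sample lines and the tokenization rule (lowercase letters and apostrophes) are assumptions made for this illustration. The in-memory sort stands in for the sorting by key that takes place between the map and reduce phases.

```python
import re
from itertools import groupby

def map_line(line):
    """Mapper: report every word found in one line as a (word, 1) pair."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def reduce_pairs(sorted_pairs):
    """Reducer: sum the counts of consecutive pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def word_count(lines):
    """Run the mapper over all lines, sort by word, then run the reducer."""
    pairs = [pair for line in lines for pair in map_line(line)]
    pairs.sort()  # stands in for the sort performed between the two phases
    return dict(reduce_pairs(pairs))

# Hypothetical sample input standing in for a very large document.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(sample)
print(counts["the"], counts["fox"])
```

On a real cluster the mapper and reducer would run as separate processes on different machines. With Hadoop Streaming, for instance, they would typically be two scripts that read lines from standard input and write tab-separated key and value pairs to standard output.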

Math Decision provides consultancy in the execution of Big Data and machine learning projects. We are a quantitative team composed of researchers of the highest level (Georgia Tech, Univ. of Toronto, Univ. of Sao Paulo) with experience in the implementation of high performance algorithms in parallel and distributed systems. For more information contact us at info@mathdecision.com, visit our website or visit us at rutaN, the innovation center of Medellín.

Filed Under: Articles

Copyright © 2019