pctechguide.com

Guidelines on Processing Big Data with Hadoop

We are observing a growing number of social and industrial applications in which the flow of data grows at an exponential rate, producing databases whose size exceeds the capacity of traditional storage media. In a single day, Google must process more than 20,000 terabytes of information, while NASA generates more than 2 gigabytes of information every 5 minutes. Storing, processing and analyzing this information is not a simple task, since it is physically impossible to process such large amounts of data on a single computing instance (because of storage, memory and processing limits). One solution to this problem is to use distributed applications, that is, applications that run on several physical instances linked through a network.

A number of technologies assist with big data processing, and Apache Hadoop is one of them. However, many people get stuck when it comes to using Hadoop for data mining and analysis.

Overview of Hadoop as a Big Data Mining and Analysis Toolkit

Apache Hadoop is a framework for distributed applications. It is released under a free license and is inspired by Google's papers on MapReduce and the Google File System. It is used by Yahoo, Facebook, LinkedIn and eBay, among others, because it makes it possible to quickly search for words in large bodies of text, sort lists and multiply large matrices (among many other applications).

How do you use Hadoop?

Problems that can be addressed with the Hadoop framework must be decomposable into the two tasks that Hadoop understands: mapping and reducing. You need to know how to use them before you can get the most out of this program.

To understand this concept, consider a population census. The census is divided by city: in each city, people count the local population and send their results to a central place, where the results are reduced to a total count. This scheme of sending people out to the cities in parallel (mapping) and then centralizing and combining their results (reducing) is what Google generalized and called MapReduce.
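The census analogy can be sketched in a few lines of Python. This is only an illustration that runs on one machine, not Hadoop code: the city names, the household sizes and the function names `count_city`, `combine` and `census` are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical census data: for each city, the number of people in each household.
CITIES = {
    "city_a": [3, 2, 5],
    "city_b": [1, 4],
    "city_c": [2, 2, 2, 1],
}


def count_city(households):
    """Map step: one counter tallies the population of a single city."""
    return sum(households)


def combine(total, city_count):
    """Reduce step: the central office adds one city's count to the running total."""
    return total + city_count


def census(cities):
    # Each city is counted independently, so the map step can run in parallel.
    with ThreadPoolExecutor() as pool:
        per_city = list(pool.map(count_city, cities.values()))
    # The partial counts are then reduced to a single total.
    return reduce(combine, per_city, 0)


if __name__ == "__main__":
    print(census(CITIES))
```

The point of the sketch is that `count_city` never needs to know about other cities, which is what allows the counting to be spread over many workers.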

A specific problem can be solved with Hadoop if the roles of mapping and reducing can be clearly defined. The Hadoop framework takes care of distributing memory and communication among the parallel instances of the mapping and reducing functions that it creates.
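To make this division of labor concrete, here is a toy, single-process stand-in for what the framework does between the two user-supplied functions. It is a sketch under simplifying assumptions and does not reflect Hadoop's actual API: `run_mapreduce`, the sample records and the region names are hypothetical, and a real cluster would perform the grouping step across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver: call the mapper, group values by key, then call the reducer."""
    groups = defaultdict(list)
    for record in records:
        # The mapper may emit any number of (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # Every key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def mapper(record):
    """User-defined map role: emit (region, household_size) for one record."""
    region, household_size = record
    yield region, household_size


def reducer(region, sizes):
    """User-defined reduce role: total population of one region."""
    return sum(sizes)


# Hypothetical input: (region, people in one household).
RECORDS = [("north", 3), ("south", 4), ("north", 2), ("south", 3), ("east", 6)]

result = run_mapreduce(RECORDS, mapper, reducer)
print(result)
```

Only `mapper` and `reducer` are specific to the problem. The grouping in `run_mapreduce` is the part a framework such as Hadoop handles on the user's behalf.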

This article presents a basic example: counting the words in a long document using Hadoop. A document hundreds of gigabytes in size is normally impossible to process on a single computing instance, and that is where tools such as Hadoop become necessary. To use Hadoop, we must identify how the problem should be mapped and how it should be reduced. In general terms, we take the text and split it into lines. The Mapper function receives some of these lines and reports the words in each line. The Reducer function gathers the words reported by the Mapper and counts them. Here is how to implement such a mechanism. The example was developed on Linux, although Hadoop is cross-platform and runs on the best-known operating systems.
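A minimal sketch of the Mapper and Reducer described above, assuming the job is run with Hadoop Streaming, which lets any program that reads standard input and writes standard output act as a mapper or reducer. The source does not say which API the original example used, so the file name `wordcount.py`, the `map`/`reduce` command-line convention and all function names are assumptions made for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def parse_pairs(lines):
    """Turn tab-separated 'word<TAB>count' lines back into (word, int) pairs."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        yield word, int(count)


def reduce_pairs(pairs):
    """Reducer: sum the counts of consecutive pairs that share a word.

    Assumes the pairs arrive sorted by word, which is how the reducer
    receives them after the framework's sort step.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def main(argv):
    # Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "map":
        results = map_lines(sys.stdin)
    elif mode == "reduce":
        results = reduce_pairs(parse_pairs(sys.stdin))
    else:
        print("usage: wordcount.py map|reduce", file=sys.stderr)
        return
    for word, count in results:
        print(f"{word}\t{count}")


if __name__ == "__main__":
    main(sys.argv)
```

The same script can be checked locally without a cluster by piping a small file through the map mode, the system `sort` command and the reduce mode in that order, since `sort` plays the role of the framework's grouping step.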

Math Decision provides consultancy in the execution of Big Data and machine learning projects. We are a quantitative team composed of researchers of the highest level (Georgia Tech, Univ. of Toronto, Univ. of Sao Paulo) with experience in implementing high-performance algorithms on parallel and distributed systems. For more information, contact us at info@mathdecision.com, visit our website or visit us at rutaN, the innovation center of Medellín.

Filed Under: Articles
