pctechguide.com

Guidelines on Processing Big Data with Hadoop

A growing number of social and industrial applications produce data at an exponential rate, yielding databases whose size exceeds the capacity of traditional storage media. In a single day, Google must process more than 20,000 terabytes of information, while NASA generates more than 2 gigabytes every 5 minutes. Storing, processing and analyzing this information is not a simple task: a single computing instance cannot handle such volumes because of its storage, memory and processing limits. One solution is to use distributed applications, that is, applications that run on several physical instances linked through a network.

A number of technologies assist with big data processing, and Apache Hadoop is one of them. However, many people get stuck when it comes to using Hadoop for data mining and analysis.

Overview of Hadoop as a Big Data Mining and Analysis Toolkit

Apache Hadoop is a framework for distributed applications. It is released under a free license and is inspired by Google's papers on MapReduce and the Google File System. It is used by Yahoo, Facebook, LinkedIn and eBay, among others, because it makes it possible to search quickly for words in large bodies of text, sort lists and multiply large matrices (among many other applications).

How do you use Hadoop?

A problem can be addressed with the Hadoop framework only if it can be broken down into the two tasks that Hadoop understands: mapping and reducing. You need to know how to use them before you can get the most out of the framework.

To understand this concept, consider a population census. The census is divided by city: counters are sent to each city to count its population, and they send their results to a central place where the results are reduced to a total count. This scheme of sending people to the cities in parallel (mapping) and then centralizing and combining their results (reducing) is what Google generalized and called MapReduce.
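The census analogy can be written down in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: the city names, the household figures and the function names `count_city` and `total_population` are all made up for illustration.

```python
from functools import reduce

# Hypothetical census data: each city holds its own list of households,
# and each household is recorded as a number of residents.
CITIES = {
    "city_a": [3, 4, 2],
    "city_b": [5, 1],
    "city_c": [2, 2, 2, 1],
}


def count_city(households):
    """Map step: one counter tallies the residents of a single city."""
    return sum(households)


def total_population(cities):
    """Reduce step: combine the per-city counts into one overall total."""
    # Each call to count_city is independent of the others, which is
    # what would allow the cities to be counted in parallel.
    per_city = map(count_city, cities.values())
    return reduce(lambda a, b: a + b, per_city, 0)


if __name__ == "__main__":
    print(total_population(CITIES))
```

The point to notice is that `count_city` never needs to know about any other city. That independence is what makes the mapping step easy to spread over many machines.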

A specific problem can be solved with Hadoop if the roles of mapping and reducing can be clearly defined. The Hadoop framework then takes charge of distributing memory and communication among the parallel instances of the mapping and reducing functions.
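To make this division of labor concrete, the toy driver below plays the part of the framework inside a single Python process. It is an illustration under stated assumptions, not Hadoop's API: `run_mapreduce`, `sales_mapper`, `sum_reducer` and the sample sales records are hypothetical names and data. The user supplies only the mapping and reducing functions, and the driver takes care of moving and grouping the intermediate results.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for the framework's role."""
    # Map phase: every record is handled independently of the others.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Grouping phase: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def sales_mapper(record):
    """Hypothetical mapper: emit one (city, amount) pair per sale."""
    city, amount = record
    return [(city, amount)]


def sum_reducer(key, values):
    """Hypothetical reducer: add up all amounts recorded for one city."""
    return sum(values)


if __name__ == "__main__":
    sales = [("medellin", 1200), ("bogota", 800), ("medellin", 300)]
    print(run_mapreduce(sales, sales_mapper, sum_reducer))
```

In a real cluster the three phases run on different machines and the grouping step involves network transfer, but the contract seen by the programmer stays the same: write a mapper and a reducer, and leave the rest to the framework.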

This article presents a basic example: counting words in a long document using Hadoop. A document of hundreds of gigabytes is normally impractical to process on a single computing instance, and that is where tools such as Hadoop become necessary. To use it, we must identify how the problem should be mapped and how it should be reduced. In general terms, the text is split into lines. The Mapper function receives some of these lines and reports the words in each line. The Reducer function gathers the words reported by the Mapper and counts them. Here is how to implement such a mechanism. The example was developed on Linux, although Hadoop is cross-platform and is supported by the best-known operating systems.
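A minimal sketch of the Mapper and Reducer described above follows, written in Python. It assumes a text-based interface of the kind used by Hadoop Streaming, where the mapper emits tab-separated `word<TAB>1` lines and the reducer receives those lines sorted by word. The tokenizing rule in `WORD_RE` and the sample sentences are illustrative choices, not part of the original example.

```python
import re
from itertools import groupby

# Assumed tokenizer: lowercase letters, digits and apostrophes form a word.
WORD_RE = re.compile(r"[a-z0-9']+")


def mapper(lines):
    """Report every word found in the given lines as 'word<TAB>1'."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word.

    The input must already be sorted by word, so that all records for
    one word arrive next to each other.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog and the fox"]
    # sorted() stands in for the sort the framework performs between phases.
    for line in reducer(sorted(mapper(text))):
        print(line)
```

In an actual job, the two functions would typically live in separate scripts that read from standard input and write to standard output, and the sort between them would be done by the framework rather than by `sorted()`. Running them locally on a small sample, as here, is a convenient way to check the logic before submitting anything to a cluster.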

Math Decision provides consultancy in the execution of Big Data and machine learning projects. We are a quantitative team composed of top-level researchers (Georgia Tech, Univ. of Toronto, Univ. of Sao Paulo) with experience implementing high-performance algorithms on parallel and distributed systems. For more information, contact us at info@mathdecision.com, visit our website or visit us at rutaN, the innovation center of Medellín.
