pctechguide.com

Guidelines on Processing Big Data with Hadoop

We are observing a growing number of social and industrial applications in which the flow of data and information grows at an exponential rate, producing databases whose size exceeds what traditional storage media can hold. In a single day, Google must process more than 20,000 terabytes of information, while NASA generates more than 2 gigabytes of information every 5 minutes. Storing, processing and analyzing this information is not a simple task, since a single computing instance cannot handle such volumes because of its storage, memory and processing limits. One solution to this problem is to use distributed applications, that is, applications that run on different physical instances linked through a network.

A number of technologies assist with big data processing, and Apache Hadoop is one of them. However, many people get stuck when it comes to using Hadoop for data mining and analysis.

Overview of Hadoop as a Big Data Mining and Analysis Toolkit

Apache Hadoop is a framework for distributed applications, released under a free license and inspired by Google's papers on MapReduce and the Google File System. It is used by Yahoo, Facebook, LinkedIn and eBay, among others, because it makes it possible to quickly search for words in large bodies of text, sort lists and multiply large matrices (among many other applications).

How do you use Hadoop?

Problems that can be addressed with the Hadoop framework must be decomposable into two tasks that Hadoop understands: mapping and reducing. You need to know how to use them before you can get the most out of this framework.

To understand this concept, consider the problem of a population census. The census is divided among cities, where people count the local population and send their results to a central place, where the results are finally reduced to a total count. This scheme of sending people out to the cities in parallel (mapping) and then centralizing and combining their results (reducing) is what Google generalized and called MapReduce.
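The census analogy can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop code: the city names, the household sizes and the helper names `count_city` and `combine` are all made up for the example, and everything runs in a single process rather than in parallel.

```python
from functools import reduce

# Hypothetical census data: each city holds a list of household sizes.
households_by_city = {
    "City A": [3, 2, 5],
    "City B": [1, 4],
    "City C": [2, 2, 2, 6],
}

def count_city(household_sizes):
    """Map step: one enumerator counts the population of a single city."""
    return sum(household_sizes)

def combine(total_so_far, city_count):
    """Reduce step: the central office adds one city's count to the running total."""
    return total_so_far + city_count

# Each city is counted independently, so these calls could run in parallel.
city_counts = list(map(count_city, households_by_city.values()))

# The per-city results are then reduced to a single national total.
total_population = reduce(combine, city_counts, 0)

print(city_counts)       # [10, 5, 12]
print(total_population)  # 27
```

The point to notice is that `count_city` never needs to see another city's data, which is what makes the map step easy to spread across many machines.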

A specific problem can be handled with Hadoop if the roles of mapping and reducing can be clearly defined. The Hadoop framework takes care of distributing memory and handling communication between the parallel instances of the mapping and reducing functions.

This article presents a basic example: counting the words in a long document using Hadoop. A document hundreds of gigabytes in size is normally impractical to process on a single computing instance, and that is where tools such as Hadoop become necessary. To use it, we must identify how the problem should be mapped and how it should be reduced. In general terms, the text is divided into lines. The Mapper function receives some of these lines and reports the words in each line. The Reducer function gathers the words reported by the Mapper and counts them. Here is how to implement such a mechanism. The example was developed on Linux, although Hadoop is cross-platform and is supported by the best-known operating systems.
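The Mapper and Reducer roles described above could look roughly like the following Python sketch. It is a local, single-process simulation written for illustration, not the article's original implementation: the function names, the tokenizing rule and the sample lines are assumptions, and the call to `sorted` merely stands in for the sorting and grouping that Hadoop performs between the two phases.

```python
import re
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: report every word found in each line as a (word, 1) pair."""
    for line in lines:
        # Assumed tokenizing rule: lowercase runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            yield word, 1

def reducer(sorted_pairs):
    """Reduce step: add up the counts reported for each word.

    Expects the pairs sorted by word, so equal words are adjacent.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def word_count(lines):
    """Run both steps locally; sorted() plays the role of Hadoop's shuffle and sort."""
    pairs = sorted(mapper(lines))
    return dict(reducer(pairs))

# Hypothetical sample input, standing in for lines of a much larger document.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(sample)
print(counts)
```

On a real cluster the two functions would run as separate tasks on different nodes, for instance as scripts that read standard input and write standard output under Hadoop Streaming, and the framework would route all pairs for a given word to the same reducer.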

Math Decision provides consultancy on the execution of Big Data and machine learning projects. We are a quantitative team composed of researchers of the highest level (Georgia Tech, Univ. of Toronto, Univ. of Sao Paulo) with experience in implementing high-performance algorithms on parallel and distributed systems. For more information, contact us at info@mathdecision.com, visit our website or visit us at rutaN, the innovation center of Medellín.
