Questions tagged [hadoop]

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

23 questions
7
votes
4 answers

Can someone explain the technicalities of MapReduce in layman's terms?

When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, but I still have some questions. Does MapReduce…
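As a rough illustration of the idea the question asks about, here is a minimal single-process sketch of the map, shuffle, and reduce steps, with no Hadoop involved. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example; a real framework would run the map and reduce steps on many machines and perform the shuffle itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into one result."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, group by key, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["hello world", "hello MapReduce"])
```

Each map call sees only its own document and each reduce call sees only one key, which is what lets a framework spread the work across machines.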
6
votes
2 answers

Optimal way to store 18 billion key, value pairs

I have around 200 million new objects coming in, and a 90 day retention policy, so that leaves me with 18 billion records to be stored in the form of key-value pairs. Both the key and the value will be strings. It is basically a mapping between a unique…
Chaos
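The 18 billion figure in this excerpt can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 200 million objects arrive per day (the excerpt does not say so explicitly), and the average key and value sizes are hypothetical placeholders, not numbers from the question.

```python
# Assumption: 200 million new objects arrive per day (not stated in the excerpt).
NEW_OBJECTS_PER_DAY = 200_000_000
RETENTION_DAYS = 90

total_records = NEW_OBJECTS_PER_DAY * RETENTION_DAYS

# Hypothetical average sizes, for illustration only; the question gives none.
AVG_KEY_BYTES = 32
AVG_VALUE_BYTES = 64

# Raw payload only: no replication, indexes, or per-record overhead.
raw_bytes = total_records * (AVG_KEY_BYTES + AVG_VALUE_BYTES)
raw_terabytes = raw_bytes / 1e12

print(f"{total_records:,} records, roughly {raw_terabytes:.1f} TB of raw payload")
```

Changing the two size constants shows how strongly the storage estimate depends on the actual string lengths.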
5
votes
2 answers

Can map-reduce say "Hello World"?

Gathering that map-reduce is being used to process huge amounts of data, I set out to understand it. My queries were: What class of problems does it aim to solve? How does it help break down complex problems? Can I write a sample app,…
Amol
4
votes
2 answers

How to convince others we should move to Hadoop?

Everything I've read about Hadoop seems like exactly the technology we need to make our enterprise more scalable. We have terabytes of raw data that is in non-relational form (text files of some kind). We're quickly approaching the upper limits of…
Ramy
4
votes
1 answer

Best practices for dashboard of near real-time analytics

I’m currently building a dashboard to view some analytics about the data generated by my company's product. We use MySQL as our database. The SQL queries to generate the analytics from the raw live data can be a bit complicated and take a long time to…
Julien
4
votes
1 answer

Hadoop and Object Reuse, Why?

In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original tracker for this "feature" doesn't offer any evidence that this change actually improved…
Andrew White
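The surprise described in this excerpt can be reproduced without Hadoop. The sketch below is a plain Python simulation, not Hadoop's API: `ReusedValue` and `reused_values` are hypothetical stand-ins for a framework that hands the reducer the same mutable object on every iteration and overwrites its contents in place.

```python
class ReusedValue:
    """Hypothetical stand-in for a mutable, framework-owned value object."""

    def __init__(self):
        self.data = None

    def set(self, data):
        self.data = data
        return self

def reused_values(raw_values):
    """Yield the SAME object each time, overwriting its contents in place."""
    holder = ReusedValue()
    for raw in raw_values:
        yield holder.set(raw)

# Buggy pattern: keeping references means every entry aliases one object.
kept_refs = [v for v in reused_values(["a", "b", "c"])]
aliased = [v.data for v in kept_refs]

# Safe pattern: copy the payload out before the iterator advances.
copied = [v.data for v in reused_values(["a", "b", "c"])]
```

After this runs, `aliased` holds the last value three times while `copied` holds the three distinct values, which is why code that buffers reducer inputs needs to copy them first.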
4
votes
3 answers

Why do HDFS clusters have only a single NameNode?

I'm trying to understand better how Hadoop works, and I'm reading: The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is…
grautur
4
votes
2 answers

Asynchronous Java

I'm wondering, if I wanted to implement a web service based on Java that does web analytics, what sort of architecture I should use. The actual processing of the Big Data would be done by Hadoop. However, I am not sure what I would need to do to…
3
votes
4 answers

Why is the whole Hadoop ecosystem written in Java?

When developing Big Data processing pipelines and storage, you will probably come across software that is more or less a part of the Hadoop ecosystem, be it Hadoop itself, Spark/Flink, HBase, Kafka, Accumulo, etc. Now all of these have been very well…
flowit
3
votes
2 answers

Text search - big data problem

I have a problem I was hoping I could get some advice on! I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured. I have a 'category list'. I want to process the text, and cross-reference…
Duncan
3
votes
3 answers

Is Cloudera Hadoop certification worth the investment?

I am considering investing time to learn Hadoop and its related technologies. The problem is that my current day job will not be using Hadoop any time soon, and even if I learn from books, blogs, and personal projects, I will not have much to back it up when…
geoaxis
2
votes
2 answers

Microservice architecture pattern for Batch based system

I have been exploring the microservice architecture for a batch-based system. Here is our current setup: Code: We have 5 systems that are internally connected, and they pass data from one system to another. Currently the entire logic is sitting in…
SMaZ
2
votes
2 answers

SRP in the "big data" setting

We have a codebase at work that: Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”. These “micro-items” are then clustered together to find “macro-items”. The “macro-items” become the input to a…
Ivan
2
votes
1 answer

How best to implement a Dashboard from data in HDFS/Hadoop

We have a bunch of data (several TB) in Hadoop HDFS and it's growing. We want to create a dashboard that reports on the contents in there, e.g. counts of different types of objects, trends over time, etc. Our first thought was to use something like…
kellyfj
2
votes
1 answer

How best to merge/sort/page through tons of JSON arrays?

Here's the scenario: Say you have millions of JSON documents stored as text files. Each JSON document is an array of "activity" objects, each of which contains a "created_datetime" attribute. What is the best way to merge/sort/filter/page through…
Infin8Loop
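One possible way to picture the merge/sort/page part of this scenario is a k-way merge over per-document sorted arrays. This is a small single-machine sketch, not an answer from the thread: it assumes every "created_datetime" is an ISO 8601 string in the same time zone (so string order matches time order), and the helper names `sorted_activities` and `merged_page` are hypothetical. At the scale of millions of files, the same idea would need external storage or a distributed sort.

```python
import heapq
import itertools
import json

def sorted_activities(json_text):
    """Parse one document (a JSON array) and sort its activities by timestamp."""
    return sorted(json.loads(json_text), key=lambda a: a["created_datetime"])

def merged_page(json_texts, page, page_size):
    """Merge the sorted arrays in timestamp order and return one page.

    Every document is parsed up front; only the merge itself is lazy.
    """
    streams = [sorted_activities(text) for text in json_texts]
    merged = heapq.merge(*streams, key=lambda a: a["created_datetime"])
    start = page * page_size
    return list(itertools.islice(merged, start, start + page_size))
```

For example, `merged_page(docs, 0, 50)` would return the 50 earliest activities across all documents in `docs`, where `docs` is a list of JSON strings.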