Questions tagged [big-data]
73 questions
24
votes
4 answers
What is the definition of "Big Data"?
Is there one?
All the definitions I can find describe the size, complexity / variety or velocity of the data.
Wikipedia's definition is the only one I've found with an actual number:
Big data sizes are a constantly moving target, as of 2012…

Ben
15
votes
4 answers
How to learn Cloud Computing and Big Data at home?
I want to learn Cloud Computing and Big Data at home.
Is it possible to learn these technologies on a home PC?
Which technologies should I learn in Cloud Computing?
Which technologies should I learn for Big Data (Hadoop)?

RPK
11
votes
3 answers
How to store large amounts of _structured_ data?
The application will continuously (approximately every second) collect the locations of users and store them.
This data is structured. In a relational database, it would be stored as:
| user | timestamp | latitude | longitude |
However, there is too…

Utku
11
votes
4 answers
Why Big Data Needs To Be Functional?
I recently started working on a new Big Data project for my internship.
My managers recommended that I start learning functional programming (they highly recommended Scala).
I have had limited experience using F#, but I couldn't see the…

user3047512
11
votes
3 answers
Choose C++ or Java for applications requiring huge amounts of RAM?
I'm thinking of scientific applications that are mostly processor-bound and heavy on heap usage (at least several gigabytes). Any other time of the year I would happily go with C++, but in this case I wonder if the fragmentation natural to the C++…

dsign
10
votes
1 answer
Partial name matching in millions of records
We have developed a web-based application for name matching. It breaks names into parts and stores the Soundex value of each part in a database. The Levenshtein distance metric is used to apply percentage matching of sound as well…

bjan
8
votes
1 answer
quantitatively comparing AST shapes
How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC...)?
I guess that plagiarism detection on source code would use such techniques, but I have no idea how that would…

Basile Starynkevitch
7
votes
4 answers
Can someone explain the technicalities of MapReduce in layman's terms?
When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, but I still have some questions.
Does MapReduce…

Eddie Bravo
7
votes
3 answers
Python in Big Data?
Can Python be used efficiently in the big data field? To be precise, I am building a web app that analyses very large data sets in the medical health care field, consisting of medical history and huge amounts of personal information. I need some advice on how to…

Akshay
6
votes
1 answer
How can I reduce the amount of storage needed for a gravitational n-body simulation?
I am currently attempting to create a gravitational n-body simulation using a Barnes-Hut algorithm, modified to be more amenable to GPU computation. This is primarily a learning project. My goal is to simulate a number of stars comparable to that…

john01dav
6
votes
4 answers
Design of high performance file processing web application
I'm trying to design a web app that can scale, but I can't wrap my head around a few concepts. I want to design it right, but I'm not an experienced programmer; I have more of a systems engineering background.
The basic architecture looks like this:
Web…

Coolface
5
votes
1 answer
How do you perform accumulation on large data sets and pass the results as a response to REST API?
I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: yy_mm_dd_hh_min.json.gz
Each file contains subscription…

Namah
5
votes
1 answer
Querying large amount of data for parallel processing
I have a dataset containing a list of users (around 50M).
Each user has an email address, name, and some more data columns.
I want to send a weekly email to those users, and the content of the email will be based on the user's data.
Each user should…

ItayMaoz
5
votes
4 answers
NoSQL and BIG DATA
I am doing an internship on Big Data technologies, so I am new to this area. My question is about the use of NoSQL in a Big Data architecture. Do we always need to use distributed storage (like HDFS in the case of Hadoop) and then put on top a…

soufiane.989
4
votes
2 answers
How to track change of JSON data over time for large number of entities?
I have a system that checks the status of a large number of entities on a schedule, every minute. For each entity, there is a JSON file with fields indicating the statuses of different attributes. The system dumps these JSON files on a…

softwarematter