Questions tagged [big-data]

73 questions
24
votes
4 answers

What is the definition of "Big Data"?

Is there one? All the definitions I can find describe the size, complexity / variety or velocity of the data. Wikipedia's definition is the only one I've found with an actual number: Big data sizes are a constantly moving target, as of 2012…
Ben
  • 728
  • 2
  • 7
  • 17
15
votes
4 answers

How to learn Cloud Computing and Big Data at home?

I want to learn Cloud Computing and Big Data at home. Is it possible to learn these technologies on a home PC? Which technologies should I learn for Cloud Computing? Which technologies should I learn for Big Data (Hadoop)?
RPK
  • 4,378
  • 11
  • 41
  • 65
11
votes
3 answers

How to store large amounts of _structured_ data?

The application will continuously (approximately every second) collect users' locations and store them. This data is structured. In a relational database, it would be stored as: | user | timestamp | latitude | longitude | However, there is too…
Utku
  • 1,922
  • 4
  • 17
  • 19
11
votes
4 answers

Why does Big Data need to be functional?

I recently started working on a new Big Data project for my internship. My managers recommended that I start learning functional programming (they highly recommended Scala). I had a humbling experience using F#, but I couldn't see the…
user3047512
  • 189
  • 1
  • 1
  • 6
11
votes
3 answers

Choose C++ or Java for applications requiring huge amounts of RAM?

I'm thinking of scientific applications that are mostly processor-bound and heavy on heap usage (at least several gigabytes). Any other time of the year I would happily go with C++, but in this case I wonder if the fragmentation natural to the C++…
dsign
  • 277
  • 2
  • 15
10
votes
1 answer

Partial name matching in millions of records

We have developed a web-based application for name matching. It operates by breaking names into parts, and the Soundex value of each part is stored in a database. The Levenshtein distance metric is used to apply percentage matching of sound as well…
bjan
  • 229
  • 1
  • 8
8
votes
1 answer

Quantitatively comparing AST shapes

How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC...)? I guess that plagiarism detection on source code would use such techniques, but I have no idea of how would that…
Basile Starynkevitch
  • 32,434
  • 6
  • 84
  • 125
7
votes
4 answers

Can someone explain the technicalities of MapReduce in layman's terms?

When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, but I still have some questions. Does MapReduce…
7
votes
3 answers

Python in Big Data?

Can Python be used efficiently in the big data field? To be precise, I am building a web app that analyses really big data in the medical health care field, consisting of medical history and huge amounts of personal information. I need some advice on how to…
Akshay
  • 161
  • 1
  • 6
6
votes
1 answer

How can I reduce the amount of storage needed for a gravitational n-body simulation?

I am currently attempting to create a gravitational n-body simulation using a Barnes-Hut algorithm, modified to be more amenable to GPU computation. This is primarily a learning project. My goal is to simulate a number of stars comparable to that…
john01dav
  • 879
  • 1
  • 7
  • 14
6
votes
4 answers

Design of high performance file processing web application

I'm trying to design a web app with the ability to scale, but I can't wrap my head around a few concepts. I want to design it right, but I'm not an experienced programmer; I have more of a systems engineering background. The basic architecture looks like this: Web…
Coolface
  • 157
  • 1
  • 2
5
votes
1 answer

How do you perform accumulation on large data sets and pass the results as a response to REST API?

I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: yy_mm_dd_hh_min.json.gz Each file contains subscription…
Namah
  • 61
  • 4
5
votes
1 answer

Querying large amount of data for parallel processing

I have a dataset containing a list of users (around 50M). Each user has an email address, a name, and some more data columns. I want to send a weekly email to those users, and the content of the email will be based on the user's data. Each user should…
ItayMaoz
  • 159
  • 3
5
votes
4 answers

NoSQL and BIG DATA

I am doing an internship on Big Data technologies, so I am new to this area. My question is about the use of NoSQL in a Big Data architecture. Do we always need to use distributed storage (like HDFS in the case of Hadoop) and then put on top a…
4
votes
2 answers

How to track change of JSON data over time for large number of entities?

I have a system that checks the status of a large number of entities on a schedule, every minute. For each entity, there is a JSON file with fields indicating the statuses of different attributes. The system dumps these JSON files on a…
softwarematter
  • 245
  • 3
  • 6