Questions tagged [spark]

19 questions
7
votes
4 answers

Can someone explain the technicalities of MapReduce in layman's terms?

When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, but I still have some questions. Does MapReduce…
5
votes
1 answer

How do you perform accumulation on large data sets and pass the results as a response to REST API?

I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: yy_mm_dd_hh_min.json.gz. Each file contains subscription…
Namah
  • 61
  • 4
4
votes
1 answer

Spark output back to web page

We are running jobs whose parameters come from a web page and are executed on large files on a Spark cluster. After processing, we want to display the data, which is written to text files using rdd.saveAsTextFile(path). We have a session id that is a…
tgkprog
  • 595
  • 6
  • 18
3
votes
0 answers

Designing clickstream analysis?

I have an application where users purchase or click certain products. I need to design clickstream analysis for it: which product was clicked how many times, and the user and geographical details of those clicks. Here is the design I am…
user3198603
  • 1,896
  • 2
  • 16
  • 21
2
votes
3 answers

Method naming conventions "setX" vs "withX"

While learning about Fluent Interfaces, I came across this post which states that using set hints that one is mutating the object, whereas with is returning a new object. I have seen this pattern first hand while using PySpark (Python API for Apache…
2
votes
1 answer

Data Ingest Architecture Advice

I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this. FINAL REQUIREMENTS The end goal of the system is to store the raw events as they appear…
Sriram R
  • 29
  • 3
2
votes
1 answer

How (whether to?) include Apache Spark in my Architecture

Brief overview of general data flow The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic analysis to make it searchable and to be able to…
foxtrotuniform6969
  • 799
  • 1
  • 7
  • 9
2
votes
0 answers

How to manage scheduled ETL jobs that are time sensitive?

We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter operations to get activity for that particular day…
Igneous01
  • 2,343
  • 2
  • 15
  • 18
2
votes
0 answers

How to design a report processing model using Spark in the most efficient way

I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data). I need to generate several reports based on different combinations of the incoming raw_data, e.g. report1 =…
2
votes
1 answer

Exploiting Apache Spark Data

Sorry if the question seems a bit naive. I'm currently reading tutorials about Kafka & Spark and there's something I can't figure out: how to exploit / expose the data Spark received. Here's what I'm trying to understand: A lot of events <=>…
Javier92
  • 123
  • 3
1
vote
3 answers

Python: Is returning self in method chaining a violation of Demeter's law?

In Python it is very common to see code that uses method chaining. The main difference from code elsewhere is that this is also combined with returning an object of the same type but modified. This approach usually assumes that objects are…
1
vote
0 answers

What's the maximum number of simultaneous Java socket connections in the cluster?

We work within a cluster with 1 gb/s of bandwidth. We use Java sockets to perform data transfers between the cluster's nodes, such as broadcast and shuffle (nodes of the cluster exchange data). At a given instant t we may have multiple…
0
votes
1 answer

Processing only once the same message produced by two producers

If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed? Is the only way to have an input topic, then a consumer that dedupes and saves to…
Syed Jafri
  • 33
  • 1
  • 3
0
votes
1 answer

Could Apache Spark be an option?

Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for the views, there is too long a delay. I have no experience with Spark, so the question is: can we input the data from the source tables, create the "data…
Mr Zach
  • 269
  • 2
  • 8
-1
votes
1 answer

Where do you put tests that are not unit tests in a Maven project?

I'm building a Spark-based text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together to perform various analyses. Each transform and…
kingledion
  • 109
  • 6