Highest Voted 'spark' Questions - Software Engineering Stack Exchange

7

votes

4 answers

Can someone explain the technicalities of MapReduce in layman's terms?

When people talk about MapReduce you think about Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, but I still have some questions. Does MapReduce…

asked Jan 07 '17 at 00:28

Eddie Bravo

81
3
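
For the MapReduce question above, a minimal single-process sketch of the map, shuffle, and reduce steps may help, with no Hadoop involved. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions, not taken from the question or from any framework.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Map: each document is processed independently (parallelisable in a real cluster).
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    # Shuffle: bring all values for the same key together.
    grouped = shuffle_phase(pairs)
    # Reduce: each key is reduced independently (also parallelisable).
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
```

In a distributed setting the map and reduce calls would run on different machines; here they run in one process purely to show the data flow.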

5

votes

1 answer

How do you perform accumulation on large data sets and pass the results as a response to REST API?

I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: yy_mm_dd_hh_min.json.gz. Each file contains subscription…

python rest big-data spark

asked Feb 19 '21 at 22:08

Namah

61
4

4

votes

1 answer

Spark output back to web page

We are running jobs whose parameters come from a web page and are executed on large files on a Spark cluster. After processing, we want to display the data back, written to text files using rdd.saveAsTextFile(path). We have a session id that is a…

web-applications spark paging

asked Nov 14 '16 at 15:19

tgkprog

595
6
18

3

votes

0 answers

Designing clickstream analysis?

I have an application where users purchase/click certain products. I need to design a clickstream analysis: which product got clicked how many times, and the user/geographical details of those clicks. Here is the design I am…

java design apache-kafka spark

asked Jun 10 '17 at 17:28

user3198603

1,896
2
16
21

2

votes

3 answers

Method naming conventions "setX" vs "withX"

While learning about Fluent Interfaces, I came across this post which states that using set hints that one is mutating the object, whereas with returns a new object. I have seen this pattern first hand while using PySpark (Python API for Apache…

python setters spark

asked Oct 30 '22 at 23:57

Ezequiel Castaño

151
5
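
The distinction raised in the "setX" vs "withX" question above can be sketched as follows. `MutableConfig` and `ImmutableConfig` are hypothetical classes invented for illustration; they are not PySpark classes, and the naming convention shown is the one the question describes rather than an established rule.

```python
from dataclasses import dataclass, replace


class MutableConfig:
    """Hypothetical class: set_* changes this object in place."""

    def __init__(self, app_name=""):
        self.app_name = app_name

    def set_app_name(self, app_name):
        self.app_name = app_name
        return self  # same object, returned only to allow chaining


@dataclass(frozen=True)
class ImmutableConfig:
    """Hypothetical class: with_* leaves this object alone and returns a copy."""

    app_name: str = ""

    def with_app_name(self, app_name):
        return replace(self, app_name=app_name)


mutable = MutableConfig("a")
same = mutable.set_app_name("b")      # mutable itself now holds "b"

frozen = ImmutableConfig("a")
changed = frozen.with_app_name("b")   # frozen still holds "a"
```

Both styles chain the same way at the call site, so the method prefix is the only hint a reader gets about whether the original object was modified.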

2

votes

1 answer

Data Ingest Architecture Advice

I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this. FINAL REQUIREMENTS The end goal of the system is to store the raw events as they appear…

design architecture apache-kafka spark

asked Feb 26 '22 at 04:43

Sriram R

29
3

2

votes

1 answer

How (whether to?) include Apache Spark in my Architecture

Brief overview of general data flow The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic analysis to make it searchable and to be able to…

architecture message-queue dataflow spark

asked Feb 02 '21 at 21:52

foxtrotuniform6969

799
1
7
9

2

votes

0 answers

How to manage scheduled ETL jobs that are time sensitive?

We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter operations to get activity for that particular day…

etl spark

asked Dec 12 '19 at 14:41

Igneous01

2,343
2
15
18

2

votes

0 answers

How to design a report processing model using Spark in the most efficient way

I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data). I need to generate several reports based on different combinations of the incoming raw_data, e.g.: report1 =…

big-data spark

asked May 02 '19 at 07:02

Remis Haroon - رامز

121
3

2

votes

1 answer

Exploiting Apache Spark Data

Sorry if the question seems a bit naive. I'm currently reading tutorials about Kafka & Spark and there's something I can't figure out: how to exploit / expose the data Spark received. Here's what I'm trying to understand: A lot of events <=>…

persistence redis cassandra apache-kafka spark

asked Feb 16 '17 at 19:30

Javier92

123
3

1

vote

3 answers

Python: Is returning self in method chaining a violation of Demeter's law?

In Python it is very common to see code that uses method chaining; the main difference from code elsewhere is that this is also combined with returning an object of the same type but modified. This approach usually assumes that objects are…

python spark law-of-demeter

asked Oct 29 '22 at 04:16

Ezequiel Castaño

151
5

1

vote

0 answers

What's the maximum number of simultaneous Java socket connections in the cluster?

We work within a cluster with 1 gb/s of bandwidth. We use Java sockets to perform data transfers between the cluster's nodes, such as broadcast and shuffle (nodes of the cluster exchange data). At an instant t we may have multiple…

java sockets tcp cluster spark

asked Apr 29 '18 at 15:00

Soulimane Kamni

19
5

0

votes

1 answer

Processing only once the same message produced by two producers

If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed? Is the only way to have an input topic, then a consumer that dedupes and saves to…

message-queue apache-kafka pubsub spark

asked Sep 06 '18 at 09:57

Syed Jafri

33
1
3
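
The "consumer that dedupes" idea from the question above can be sketched without any Kafka client. This assumes every message carries a stable identifier in an `id` field (a hypothetical field name) and keeps the set of seen identifiers in memory; a real consumer would need durable, shared storage for that set, which this sketch does not attempt.

```python
def dedupe(messages, seen=None):
    """Yield each message once, keyed on its (assumed) 'id' field."""
    seen = set() if seen is None else seen
    for message in messages:
        message_id = message["id"]
        if message_id in seen:
            continue  # a copy of this message was already passed on
        seen.add(message_id)
        yield message


# Two producers (A and B) emit the same logical message "order-1".
incoming = [
    {"id": "order-1", "producer": "A"},
    {"id": "order-1", "producer": "B"},
    {"id": "order-2", "producer": "B"},
]

processed = list(dedupe(incoming))
```

Only the first occurrence of each identifier reaches the downstream step, regardless of which producer sent it.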

0

votes

1 answer

Could Apache Spark be an option?

Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for the view, there is too long a delay. I have no experience with Spark, so the question is: Can we input the data from the source tables, create the "data…

spark

asked Aug 08 '18 at 11:16

Mr Zach

269
2
8

-1

votes

1 answer

Where do you put tests that are not unit tests in a Maven project?

I'm building a Spark-based, text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together to perform various analyses. Each transform and…

unit-testing testing maven spark

asked Dec 28 '18 at 19:57

kingledion

109
6
