
We run jobs whose parameters come from a web page and that process large files on a Spark cluster. After processing, we want to display the results, which are written to text files using

rdd.saveAsTextFile(path)  

We have a session ID that acts as the common root for the output folders: the root folder name is random, but it is tied to the user's session ID.

What is a good way to keep track of pointers to the different output files and to send pages back to the front end?

That is, we want a list of the files so we can send the results back to a monitoring (summary) page and to a detail page showing the contents of each file.
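Roughly, the write step in the Spark job looks like this (a minimal Scala sketch; the sessionId/jobId path segments and the filter step are only illustrative, not our real layout or processing):

    import org.apache.spark.{SparkConf, SparkContext}

    object JobOutputWriter {
      def main(args: Array[String]): Unit = {
        // sessionId comes from the web session, jobId identifies this run (illustrative)
        val Array(sessionId, jobId, inputPath) = args

        val sc = new SparkContext(new SparkConf().setAppName(s"job-$jobId"))

        // Every job writes under a root folder tied to the user's session ID
        val outputPath = s"hdfs:///results/$sessionId/$jobId"

        sc.textFile(inputPath)
          .filter(_.nonEmpty)              // stand-in for the real processing
          .saveAsTextFile(outputPath)      // produces part-00000, part-00001, ...

        sc.stop()
      }
    }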

tgkprog

1 Answer


Without getting into premature optimization, consider the following design principles:

  1. Convention. It seems like you already made the choice to have predictable path names in HDFS (based on a user session ID). You can extend this to have predictable paths for each job. If the jobs are initiated by a web application, then that web app can generate whatever name or ID is associated with the job, and create the HDFS path for the Spark job output in a consistent and predictable fashion.
  2. Authority. Every data element should have exactly one authoritative home, no matter how many copies of its values are scattered around the architecture. In your example, it seems proper for the web app to be authoritative on user session IDs and job IDs, and for HDFS to be authoritative on which files are present in a directory and what their contents are. Your web app must therefore maintain the job IDs associated with each user session somewhere, and query HDFS (following the predictable-path convention) to get the list of output files and their contents, as in the sketch after this list.
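To make both principles concrete, here is a minimal sketch of the web-app side, assuming the path convention hdfs:///results/<sessionId>/<jobId> from point 1. The class and method names are illustrative; it uses the standard Hadoop FileSystem API to list and read the part files that saveAsTextFile produces:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    class JobResultService(resultsRoot: String = "hdfs:///results") {
      private val fs = FileSystem.get(URI.create(resultsRoot), new Configuration())

      // List the part files written by saveAsTextFile for one job (summary page)
      def listOutputFiles(sessionId: String, jobId: String): Seq[Path] = {
        val jobDir = new Path(s"$resultsRoot/$sessionId/$jobId")
        fs.listStatus(jobDir)
          .filter(status => status.isFile && status.getPath.getName.startsWith("part-"))
          .map(_.getPath)
          .toSeq
      }

      // Read one output file's lines for the detail page
      def readOutputFile(path: Path): Seq[String] = {
        val in = fs.open(path)
        try Source.fromInputStream(in, "UTF-8").getLines().toList
        finally in.close()
      }
    }

The web app stays authoritative for which job IDs belong to a session (for example, in its own database), while HDFS stays authoritative for which files exist and what they contain; the service above only derives paths from the convention and asks HDFS, so nothing needs to be duplicated.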
Tajh Taylor