
I'm trying to design a web app with the ability to scale, but I can't wrap my head around a few concepts. I want to design it right, but I'm not an experienced programmer; I have more of a systems engineering background.

The basic architecture looks like this:

Web Server -> File processing server -> NoSQL DB -> Search server

The main scenario is as follows:

  • User uploads a file via the site
  • The file is sent for processing to a server (Python script)
  • The results of processing are sent to a NoSQL DB
  • The results are processed by the search server and returned to the user

We can scale the web frontends via load balancing, something like nginx + Apache. Database scaling is taken care of by Cassandra or MongoDB. Search scaling is taken care of by Elasticsearch or Sphinx clustering.

Now I want to be able to add multiple file processing servers in case an uploaded file is too big. So I need to somehow split the file into chunks and process them simultaneously on multiple nodes; if a node goes down mid-job, it shouldn't affect anything, and no data may be lost. So I need something else that will allocate tasks to my file processing servers, balance the load, and control the execution of tasks.
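For illustration, the splitting step I have in mind is something like this minimal sketch (the chunk size and task fields are placeholders I made up):

    import os
    import uuid

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk, an arbitrary choice

    def split_into_tasks(path):
        """Yield one task description per fixed-size chunk of the file."""
        file_id = str(uuid.uuid4())
        size = os.path.getsize(path)
        offset = 0
        while offset < size:
            yield {
                "file_id": file_id,   # lets the results be merged later
                "path": path,         # assumes shared storage between nodes
                "offset": offset,
                "length": min(CHUNK_SIZE, size - offset),
            }
            offset += CHUNK_SIZE

Each of these tasks would then be handed to whatever component allocates work to the processing servers, which is exactly the part I don't know how to design.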

How do you design custom applications for this kind of thing? Should I use message queuing?

tatigo
Coolface
  • This type of problem is often solved with MapReduce programming, using tools such as Hadoop. –  Mar 27 '14 at 17:24
  • I know about Hadoop, but I've heard it's too slow for this kind of thing. Anyway, I want to start small and without Hadoop initially, so I'm more interested in the design of custom applications. – Coolface Mar 28 '14 at 04:04
  • What are your timing requirements? Could you accept 10 minutes of latency for better scaling? – Esben Skov Pedersen Nov 03 '14 at 13:16
  • Are you trying to scale to be able to process a bigger file or more files at once, or one file faster? They have very different options. – Sign Nov 03 '14 at 15:19

4 Answers


Computing power is cheap nowadays. Moreover, you don't yet know where the bottleneck will be.

To me, this smells like premature optimization, where you worry about performance before you even have the load. Perhaps you should just start by making it work, then worry about scaling it. My 2 cents.

The question is also whether you want quick processing time or high throughput. If the processing is really resource- or time-intensive, it makes sense to split the file, distribute it, and merge the results. However, this of course comes at a cost: splitting, sending, scheduling, merging outputs, and handling partial failures. These tasks consume resources too and add a lot of complexity. Distributed computation is only suited for appropriate tasks; computing a single task per server is sometimes more efficient than doing all of this extra work.

dagnelies

I think message queuing is your answer, along with discarding the idea that the user should wait for the processing to complete. When the user uploads the file, queue it to a message "processor" which will do the preliminary analysis on it (basically just decide if it's too big and either divert it to a "splitter" queue or simply put it on the regular processing queue). At this point, you should return a token or URL to your user that they can use to access the result when the processing is complete. This way you don't have to worry about timeouts or keeping a session active, and the user has a reference in case they want/need to view the results again without re-processing the file.
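As a rough sketch of that flow, assuming RabbitMQ through the pika library (the queue names, size threshold, and message format here are made up for illustration):

    import json
    import uuid

    import pika

    SIZE_THRESHOLD = 100 * 1024 * 1024  # bigger files go to the "splitter" queue

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    for name in ("incoming", "splitter", "processing"):
        channel.queue_declare(queue=name, durable=True)

    def handle_upload(file_path, file_size):
        """Called by the web tier after an upload; returns the user's token."""
        token = str(uuid.uuid4())
        body = json.dumps({"token": token, "path": file_path, "size": file_size})
        channel.basic_publish(
            exchange="",
            routing_key="incoming",
            body=body,
            properties=pika.BasicProperties(delivery_mode=2),  # persist message
        )
        return token  # the user polls a status URL built from this token

    def preliminary_processor(ch, method, properties, body):
        """The 'processor': divert big files to the splitter queue."""
        msg = json.loads(body)
        target = "splitter" if msg["size"] > SIZE_THRESHOLD else "processing"
        ch.basic_publish(
            exchange="",
            routing_key=target,
            body=body,
            properties=pika.BasicProperties(delivery_mode=2),
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="incoming", on_message_callback=preliminary_processor)
    channel.start_consuming()

In a real deployment the upload handler and the consumer would be separate processes sharing only the broker; they are shown together here just to keep the sketch self-contained.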

TMN

I'm not sure I quite follow what you're doing (for example, what does "file is sent for 'processing'" mean -- what kind of processing?). Maybe that's not all that important for the context.

I guess my first instinct would be to ask why your "processing" can't happen on your app server (as part of the upload process). Then, whatever scaling solution you use for your app server, your "processing" scales with it.

However, I also get the hint that perhaps you're trying to design a massively scalable solution, with independently tunable subsystems, so you probably aren't wanting to link your app server to your "processing" event. In that case, there are a few different ways to go.

I've used message queuing (as you suggested) for this type of problem before. Then, you just fire up additional listeners as your load increases. You have to be careful about what triggering mechanism you use though, and you have to deal with transaction processing appropriately (syncpoints / rollback and commits), otherwise queuing can become a mess to debug. It seems trivial, but in my experience, it's not. When everything's working, it's great, but you really have to test the heck out of the cloudy-day scenarios to make sure you don't get dropped messages or messages processed more than once. "Guaranteed delivery" isn't all it's cracked up to be.
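To make "test the heck out of it" concrete, here is one way to guard against messages being processed more than once, sketched with pika and Redis; the key names, expiry, and ordering of steps are illustrative, not from any particular product:

    import json

    import pika
    import redis

    r = redis.Redis()
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="processing", durable=True)

    def process_file(path):
        ...  # placeholder for the real work; it should be safe to re-run

    def on_message(ch, method, properties, body):
        msg = json.loads(body)
        # SET ... NX succeeds only once per token, so a redelivered copy of
        # an already-processed message is acknowledged and skipped.
        first_time = r.set("done:" + msg["token"], 1, nx=True, ex=86400)
        if first_time:
            process_file(msg["path"])
        # Ack only after the work: if this process dies mid-job, the broker
        # redelivers -- which is why the dedup check above has to exist.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="processing", on_message_callback=on_message)
    channel.start_consuming()

Note the subtle failure mode even here: crashing between setting the key and finishing process_file loses that piece of work. Fixing it means writing the dedup marker in the same transaction as the results, which is exactly the syncpoint/commit bookkeeping I mentioned above.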

I also recently solved a problem like this using a shared memory data grid (Hazelcast). The API actually implements a queue for you, so you don't have all the overhead of setting up a queue manager as you would with the solution above. In my case I needed to run a transformation process against hundreds of billions of rows of data from a table. I wrote a multithreaded app that spun up table readers at different points in the table and threw the records into the data grid, and I wrote a separate process that I could instantiate as many times as necessary to transform the data -- so it scaled as far as I needed it to. I suppose it's just a variation on the solution above. I took a process that took about 3 days to run with the original approach down to about 15 minutes, but it was spread across a 50 node cluster for those 15 minutes.
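A stripped-down version of that pattern, assuming the Hazelcast Python client (the cluster details and names are invented; my actual implementation was different and much larger):

    import json

    import hazelcast

    client = hazelcast.HazelcastClient()           # joins the cluster on localhost
    work = client.get_queue("records").blocking()  # distributed queue shared by all nodes

    def reader(rows):
        """A table reader: throw records into the data grid."""
        for row in rows:
            work.put(json.dumps(row))

    def transform(record):
        ...  # placeholder for the actual transformation step

    def worker():
        """Start as many of these processes as you need; they all consume
        from the same distributed queue, so scaling out is just launching
        more instances."""
        while True:
            transform(json.loads(work.take()))  # take() blocks until data arrives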

Calphool

You will have to use Ajax (or some async callback or handler) to hit the server only when you have the whole file; that way you won't need to split the file into chunks. Just tell the user that his file is going to be processed and send him a message when it's finished.

From there, you can have an async worker method (on the server) per uploaded file that sends it to the Python script for processing.
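Something like this, sketched with Flask and a thread pool (the endpoints, file locations, and the process.py call are illustrative, not prescriptive):

    import subprocess
    import uuid
    from concurrent.futures import ThreadPoolExecutor

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    pool = ThreadPoolExecutor(max_workers=4)  # the async workers
    status = {}  # token -> "working"/"done"; use a real store in production

    def run_processing(token, path):
        subprocess.run(["python", "process.py", path])  # hand off to the script
        status[token] = "done"  # this is where you'd message the user

    @app.route("/upload", methods=["POST"])
    def upload():
        f = request.files["file"]  # the ajax call sends the whole file at once
        token = str(uuid.uuid4())
        path = "/tmp/" + token
        f.save(path)
        status[token] = "working"
        pool.submit(run_processing, token, path)  # don't make the user wait
        return jsonify({"token": token})  # "your file is going to be processed"

    @app.route("/status/<token>")
    def check(token):
        return jsonify({"status": status.get(token, "unknown")})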

tatigo