
Background

My web application lives on a centralised server in the product's "network", and provides the means to manage/configure various distributed devices. The server also logs various statistics that arrive from each device, storing them on disk in /var/log/. The web GUI allows users to download those logs.

It also has a facility to download them in various formats, which requires an on-the-fly translation/conversion. This conversion takes some time (say, on the order of thirty seconds) and results in files in the region of 300MB.

All of this is fine and a user can accept that downloading such a converted file is going to take some time. But I've architected myself into a bit of a corner in terms of how effectively I can actually deliver these files.

For the purposes of this question, I shall not be exploring AJAX/JavaScript/Java/Flash/multi-step/multi-page solutions. Assume that, from the user agent's perspective, the download shall be a straightforward HTTP GET request to a CGI script from clicking on an <a> element, and nothing more.


Problem

My web application is loosely MVC-architected in such a way that the controller chosen to satisfy the requested action (say, in this case: controller "devices" action "getConvertedLog") performs its business logic and sets various flags that describe how the HTTP response should be composed. Only after the controller has finished its work will the HTTP response be composed, with response headers generated and the body streamed from, in this case, a temporary file on disk.

The first problem with this is that the controller itself performs (or, at least, invokes) the file conversion, which takes some time. The HTTP headers are consequently not generated (let alone transferred) for thirty seconds or so. Not only does this result in thirty seconds of literally nothing happening in the browser (at least in my experience with Chrome), but it also puts the entire request at high risk of an HTTP 504 Gateway Timeout from intervening proxies.

I could shuffle my code around a little so that some HTTP response headers can be transferred to the browser before the conversion begins, to at least give an indication that something is happening (and, hopefully, stave off the Gateway Timeout). But before the conversion completes I have no way of knowing how many bytes will comprise the result. Therefore, I cannot send a meaningful Content-Length header, so the user-agent cannot display progress to the user. And for a 300MB file I do not consider this to be acceptable.

The second problem with this is that, if there is an error during conversion, the HTTP response code should be meaningful. So I cannot have sent a Status in these hypothetical pre-conversion headers.
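
To make that concrete, the reshuffle I have in mind looks roughly like this (a minimal sketch, assuming a Python CGI script; the convert-log command is a stand-in for my real converter), and it runs into both snags:

    #!/usr/bin/env python3
    # Hypothetical sketch of the reshuffle: commit a minimal header set
    # immediately, then convert, then stream. "convert-log" is a stand-in
    # for whatever really performs the conversion.
    import subprocess, sys

    sys.stdout.write("Status: 200 OK\r\n")
    sys.stdout.write("Content-Type: application/zip\r\n")
    sys.stdout.write('Content-Disposition: attachment; filename="thefile.zip"\r\n')
    sys.stdout.write("\r\n")   # no Content-Length: its value isn't known yet
    sys.stdout.flush()         # headers go out now; with no length, Apache
                               # falls back to Transfer-Encoding: chunked
                               # for HTTP/1.1 clients

    # ~30 seconds elapse in here, but the 200 above is already committed,
    # so a conversion failure can no longer be reported as a 5xx.
    proc = subprocess.Popen(["convert-log", "/var/log/device.log"],
                            stdout=subprocess.PIPE)
    for block in iter(lambda: proc.stdout.read(65536), b""):
        sys.stdout.buffer.write(block)
        sys.stdout.buffer.flush()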


Question

What would you do here? What's the least I need to do to indicate to user-agents and proxies that the request has been accepted and a response is coming (albeit slowly), before the success or failure and size of the response have been determined?

I guess it would be ideal if it were legal and functional to send a very small set of headers, say:

 Status: 200 OK
 Content-Type: application/zip
 Content-Disposition: attachment; filename="thefile.zip"

… then follow it up with the remaining headers (Set-Cookie, Cache-Control, Content-Length and so forth), some potentially replacing earlier ones (like a change in Status) and, finally, the response body.

I'm hoping the fact that Apache's CGI module translates and re-orders some headers (e.g. Status: 200 ends up in the first line of the response as HTTP/1.1 200 OK) can help here. Might HTTP 102 Processing help, too?

(Update: "At least one CGI-Header must be supplied, but no CGI header can be repeated with the same field-name." [CGI 1.1, §9.2]. Rats.)

Lightness Races in Orbit
  • Similar to http://stackoverflow.com/q/3571139/560648, I guess, except he doesn't seem to be worried about proxies. Sounds like one answerer thinks he's cracked the problem from the internal httpd side of things, though. So that's a start. – Lightness Races in Orbit Jul 08 '15 at 17:05
  • Here's a stupid idea: could you use an HTTP redirect after you've figured out the size of the file to redirect the browser to a custom and newly generated page that contains the right Content-Length header? – sga001 Jul 08 '15 at 19:09
  • @sga001: Hmm yeah but I don't understand how that helps. If I've figured out the size of the file then the file has been generated and I can just begin streaming it. And a second HTTP request requires that the file exist outside of the scope of a single request, opening up problems if e.g. the second request is never made. Then I have a file that's going to sit there forever taking up 300MB. – Lightness Races in Orbit Jul 08 '15 at 19:18
  • Can you return a 202 Accepted and a task id? Later you can keep track of the request with the task id. – gogasca Dec 14 '15 at 16:12
  • I think the answer to your question lies in the very popular "your download will start in a few seconds" pattern used on many web sites. I am not sure how they do it (never had the need to look into it), but I suppose they may be doing something as simple as serving a complete new page which contains an IFrame which actually points to the download. – Mike Nakis Dec 14 '15 at 16:15
  • @spicyramen: I think that falls into the category of things I rejected in my list of constraints! It would add complexity in the backend through needing to "keep the data around for X time and then delete it" if the user agent never gets around to "keeping track of the request" and pulling down the response. Same for Mike's suggestion? – Lightness Races in Orbit Dec 14 '15 at 16:21
  • Well, having to "keep the data around for X time and then delete it" is not such a hard problem. I am curious to see if there is any better solution, but if not, then that does not represent an insurmountable problem. – Mike Nakis Dec 14 '15 at 16:52
  • _"having to 'keep the data around for X time and then delete it' is not such a hard problem."_ Actually, here, it is. As such, I have explicitly ruled it out. (I will fall back on it if I _really_ need to, but that is outside of the scope of this question, which is to find a better alternative.) – Lightness Races in Orbit Dec 14 '15 at 17:14
  • I think `Transfer-Encoding: chunked` would help. It allows you to send data for which you do not know the full length yet. – Sjoerd Job Postmus Dec 14 '15 at 17:57
  • @SjoerdJobPostmus: Ooooooooooooooooooooooooooooh. That sort of gets me halfway there. Still no way to signal failure halfway through AFAICT but it's a start – Lightness Races in Orbit Dec 14 '15 at 18:03
  • @LightnessRacesInOrbit: not sure if the following helps, but I found the following: http://stackoverflow.com/questions/5707291/error-code-redirect-when-returning-a-response-with-chunked-encoding . Not that helpful, I suspect, but maybe, just maybe it helps. – Sjoerd Job Postmus Dec 14 '15 at 18:16
  • @SjoerdJobPostmus: :) I think there are the beginnings of a decent answer here. – Lightness Races in Orbit Dec 14 '15 at 18:32

1 Answer


This is a common problem. The first example that springs to mind is in credit card authentication.

As has been mentioned, you need to conceptually fork the process, so that one thing responds to the client and another does the actual work.

This is actually quite straightforward. Because HTTP is a stateless protocol, you can render a page and complete the client transaction while carrying on working on the server (as long as the server environment allows you to; PHP will normally time you out if you take too long).

So here's a plan. When you get your request, create a temporary file with a name like /tmp/<id>.part and redirect the client to a different page, which I'll come to shortly. Send the <id> either as a parameter within the redirect link, or as a cookie. Then conclude your connection with the client and process the file, storing data in /tmp/<id>.part as you go. Once you're done, rename /tmp/<id>.part to /tmp/<id>.
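
Something like this, for instance (a rough sketch in Python CGI; run_conversion, the poll URL and the paths are stand-ins for your own pieces):

    #!/usr/bin/env python3
    # Hypothetical sketch of the initiating CGI script: claim an id,
    # fork off the conversion, and redirect the client straight away.
    import os, sys, uuid

    def run_conversion(dest):
        # Stand-in for the real converter, which writes ~300MB into dest.
        with open(dest, "wb") as f:
            f.write(b"converted data\n")

    job_id = uuid.uuid4().hex
    part = "/tmp/%s.part" % job_id
    open(part, "wb").close()                 # claim the id before redirecting

    if os.fork() == 0:
        # Child: detach fully (Apache waits while the script's stdout is
        # still open, so the child must close its inherited descriptors).
        os.setsid()
        os.close(0); os.close(1); os.close(2)
        run_conversion(part)
        os.rename(part, "/tmp/%s" % job_id)  # atomic "done" marker
        os._exit(0)

    # Parent: conclude the HTTP transaction with a redirect carrying the id.
    sys.stdout.write("Status: 303 See Other\r\n")
    sys.stdout.write("Location: /cgi-bin/poll?id=%s\r\n\r\n" % job_id)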

On the different page, take the <id> (either from the cookie or the parameter) and check for the existence of /tmp/<id> or /tmp/<id>.part. If the .part file exists, it's still processing: either generate an HTML page with a meta tag telling the user agent to reload in a few seconds, or sleep for a few seconds and give the client a redirect of the same form as the original.

If /tmp/<id> exists, you're done. Present the user with the file for download.
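
The polling page might then look like this (same assumptions; the refresh interval, filenames and messages are illustrative):

    #!/usr/bin/env python3
    # Hypothetical sketch of the polling page: /tmp/<id> means done,
    # /tmp/<id>.part means still working, neither means unknown/expired.
    import cgi, os, shutil, sys

    job_id = cgi.FieldStorage().getfirst("id", "")
    # Real code must validate job_id: a raw value here is a path-traversal hole.
    final, part = "/tmp/" + job_id, "/tmp/" + job_id + ".part"

    if job_id and os.path.exists(final):
        sys.stdout.write("Status: 200 OK\r\n"
                         "Content-Type: application/zip\r\n"
                         'Content-Disposition: attachment; filename="thefile.zip"\r\n'
                         "Content-Length: %d\r\n\r\n" % os.path.getsize(final))
        sys.stdout.flush()
        with open(final, "rb") as f:
            shutil.copyfileobj(f, sys.stdout.buffer)  # a real length at last
    elif job_id and os.path.exists(part):
        # Still converting: ask the user agent to come back shortly.
        sys.stdout.write("Status: 200 OK\r\nContent-Type: text/html\r\n\r\n"
                         '<html><head><meta http-equiv="refresh" content="5">'
                         "</head><body>Preparing your download...</body></html>")
    else:
        sys.stdout.write("Status: 404 Not Found\r\n"
                         "Content-Type: text/plain\r\n\r\nUnknown or expired job.\n")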

You might also want to add some garbage collection somewhere to check for .part files that haven't changed for more than an hour, or finished files that are more than two or three hours old, and delete them.
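
For instance, a cleanup pass you could run from cron (sketch only: it assumes the job files live in a dedicated directory, since globbing bare /tmp would sweep up unrelated files, and the thresholds are illustrative):

    #!/usr/bin/env python3
    # Hypothetical cleanup pass, run periodically from cron.
    import glob, os, time

    now = time.time()
    for path in glob.glob("/tmp/logjobs/*"):
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            continue                          # raced with a rename/unlink
        if path.endswith(".part") and age > 1 * 3600:
            os.unlink(path)                   # stalled for over an hour
        elif not path.endswith(".part") and age > 3 * 3600:
            os.unlink(path)                   # finished but never collected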

If you have a server that is timing you out, then you have two options: restart during refresh, or cron.

If you opt for restart, code the .part file with enough information to pick the process up midway through, and have the refresh stage check for a running process; if it doesn't find one, restart the process. If you opt for cron, you're ultimately changing the above into a job management system: instead of the cgi-bin initiating the work, have it create the request, and have cron periodically check for requests (the existence of a .part file) and execute any it finds.
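
The cron variant might look something like this (again a hypothetical sketch; the .lock file is one simple way to stop two runs picking up the same job):

    #!/usr/bin/env python3
    # Hypothetical cron-driven worker: the CGI script only creates the
    # .part request file; this job performs any conversion not yet owned.
    import glob, os

    def run_conversion(path):
        # Stand-in: the real converter rewrites the file in place.
        pass

    for part in glob.glob("/tmp/logjobs/*.part"):
        lock = part + ".lock"
        try:
            os.close(os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        except FileExistsError:
            continue                          # another run already owns it
        try:
            run_conversion(part)
            os.rename(part, part[:-len(".part")])
        finally:
            os.unlink(lock)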

Regarding error handling, this could feed into your ability to restart a process. If the .part file contains some status indicator to tell the reload screen where it's at, you can pass error conditions back to the user: if the executing script encounters a problem, it updates the .part file, and when the client next reloads the page with the <id> as its context, it looks at the .part file and reports the status.

It's worth bearing in mind that you have to balance the complication of putting into <id>.part more than simply the incomplete version of what will become <id>, against creating a third file such as <id>.status which the reload script might use.

sibaz
  • I was trying to avoid all of this (the garbage collection, the external process, all of it) which was really the crux of my question. But it seems I may have to accept that the answer is "you can't" :P – Lightness Races in Orbit Jan 18 '16 at 16:09
  • I suspect, when attempting to re-invent the wheel, 9 times out of 10 a good engineer will conclude that what was done before was probably done like that for a good reason :-) Although in answer to the objective behind the reason, you could use web 2.0 to have a Javascript app, running on the client, that would know to ask a cgi-bin for a named file, in a named format, offset to a particular chunk, such that the cgi-bin needn't create the tmp file at all, and could just stream the data back to the client. The JS could then reassemble the file, and drive the whole thing, including errors. – sibaz Jan 18 '16 at 16:28
  • Let me caveat that by saying I don't think IE handles file uploading like that, but I did something similar to reassemble a base64-encoded blob stored in a database. I basically had a link containing an encoded file (within the page), and I edited the contents of the link within my JS until I'd downloaded and reassembled the whole file, then allowed the user to download it. – sibaz Jan 18 '16 at 16:30
  • If the file weren't necessarily created on the fly by the server then there wouldn't be a problem in the first place ;) Chunking it is not outside of the realm of possibility but I was hoping not to - it's basically complexity everywhere I turn with this which, as you say, is probably a fairly strong sign – Lightness Races in Orbit Jan 18 '16 at 16:31
  • (but you said no ajax :-) ) – sibaz Jan 18 '16 at 16:31
  • Alright, a background process and garbage collection it is – Lightness Races in Orbit Jan 18 '16 at 16:32