
Here is the design step by step:

  • User opens a webpage
  • Inputs a few details in the form
  • Clicks submit
  • Request goes to API server
  • API server creates a pod in Kubernetes
  • Pod executes a script and stores the output in shared storage
  • Another pod keeps running and attached to the shared storage
  • API server waits for pod execution to complete
  • API server copies the file back through the Kubernetes API using the always-running pod
  • Parses the file and returns the result to the UI
  • User sees a loading screen until all the above steps complete

The main challenge with this pattern is autoscaling. When a pod goes into the Pending state because no capacity is available, the user has to wait 2-5 minutes for autoscaling to kick in before the pod can execute.
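For concreteness, the "API server creates a pod" step from the design above might look roughly like the sketch below, assuming the official `kubernetes` Python client. The manifest is built as a plain dict so it can be inspected without a cluster; the image, script, and PVC names are placeholders, not anything from the original post.

```python
# Sketch of the pod-per-request step. All names/images are illustrative.
import uuid

def build_job_pod_manifest(request_id: str) -> dict:
    """Build a one-shot pod that runs the script and writes to shared storage."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"job-{request_id}"},
        "spec": {
            "restartPolicy": "Never",  # one-shot execution
            "containers": [{
                "name": "job-script",
                "image": "registry.example.com/job-runner:latest",  # placeholder image
                "command": ["/bin/sh", "-c", "run-script > /data/out.json"],
                "volumeMounts": [{"name": "shared", "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": "shared",
                "persistentVolumeClaim": {"claimName": "shared-storage"},  # placeholder PVC
            }],
        },
    }

manifest = build_job_pod_manifest(uuid.uuid4().hex[:8])
# With a cluster available, you would submit it roughly like:
#   from kubernetes import client, config
#   config.load_kube_config()
#   client.CoreV1Api().create_namespaced_pod(namespace="default", body=manifest)
```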

karthikeayan
  • 135
  • 3
  • You would probably get better results if you give some comments about why it's needed to create a new pod for every request. On the surface it feels like the approach is to re-invent AWS Lambda but why? Also, if users are waiting for 2-5 minutes they will need something to stop their browser or their brains from timing out and moving on to something else. Just showing the loading screen with no info would make many users believe the service is broken. – joshp Mar 26 '23 at 16:17

1 Answer


On the face of it this is a terrible design.

  • User opens a webpage

  • Inputs a few details in the form

  • Clicks submit

  • Request goes to API server

    fine so far

  • API server creates a pod in Kubernetes

    This seems pointless. Have a worker process continually running and listening to a queue.

  • Pod executes a script and stores the output in shared storage

    Instead of shared storage, post the result back to another queue or database.

  • Another pod keeps running and attached to the shared storage

    You don't need a separate pod for everything.

  • API server waits for pod execution to complete

    If the API server is waiting the whole time, just get it to do the work itself. The API should return immediately after creating the offline job.

  • API server copies the file back through the Kubernetes API using the always-running pod

    Why pass this data around so much? Why have a pod for everything? Why files?

  • Parses the file and returns the result to the UI

    Why parse the file only to reserialise it again?

  • User sees a loading screen until all the above steps complete

    What if they kill the browser, click refresh, or hit back?

However, you don't really say anything about why you are using this design over:

  1. Just the API doing all the work.

    As you wait for the result anyway, there's no benefit to all this passing of data around.

  2. A queue + worker processes

    Have the API write to a queue and return an immediate "Processing your message" response.

    Have a constantly running worker app pick up from the queue and do the work. Post back to another queue when done.

    Have the API listen to the "done" queue and push the messages back to the correct user via a websocket.
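The queue + worker pattern described above can be sketched in-process with the standard library, which keeps it runnable anywhere. In production the two queues would be something like SQS, RabbitMQ, or Redis, and the final "push to user" step would be a websocket send; the job payload and the doubling "work" here are stand-ins for the real script.

```python
# Minimal in-process sketch of the queue + worker pattern.
import queue
import threading

work_q = queue.Queue()  # jobs submitted by the API
done_q = queue.Queue()  # results posted back by the worker

def api_submit(payload: dict) -> str:
    """API handler: enqueue the job and return immediately."""
    work_q.put(payload)
    return "Processing your message"  # immediate response to the browser

def worker() -> None:
    """Constantly running worker: pick up jobs, do the work, post results."""
    while True:
        job = work_q.get()
        if job is None:  # shutdown sentinel
            break
        result = {"id": job["id"], "output": job["value"] * 2}  # the "script"
        done_q.put(result)

threading.Thread(target=worker, daemon=True).start()

ack = api_submit({"id": "req-1", "value": 21})
result = done_q.get(timeout=5)  # the API would push this over a websocket
work_q.put(None)  # stop the worker
```

The key property is that `api_submit` returns before the work is done, which is exactly what removes the 2-5 minute browser wait from the original design.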

Also, there is a scenario where your design might be good, or at least the only thing that would work: that's where you have to run some third-party application which doesn't like running multiple instances at the same time, or starting and stopping cleanly (I'm looking at you, Microsoft Excel).

In that kind of scenario you need to effectively spin up a new clean machine with no leftover state, write and read files because that's the only thing the application understands, and then clean up everything afterwards. But even then your unit is a container, not a pod?
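For that "awkward third-party app" case, Kubernetes' usual unit is a Job: a fresh container per run, `restartPolicy: Never`, and `ttlSecondsAfterFinished` so the leftovers are cleaned up automatically. A hedged sketch of such a manifest, again as a plain dict with placeholder names, image, and file paths:

```python
# Sketch of a one-shot Job for a fussy legacy application. Placeholders only.
def build_cleanup_job_manifest(run_id: str) -> dict:
    """Build a Job that runs once on a clean container and self-deletes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"legacy-app-{run_id}"},
        "spec": {
            "ttlSecondsAfterFinished": 300,  # auto-delete 5 min after completion
            "backoffLimit": 0,               # don't retry a flaky legacy app
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "legacy-app",
                        "image": "registry.example.com/legacy-app:latest",  # placeholder
                        "command": ["run-legacy-app",        # hypothetical entrypoint
                                    "--in", "/work/in.xlsx",
                                    "--out", "/work/out.csv"],
                    }],
                },
            },
        },
    }
```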

Ewan
  • 70,664
  • 5
  • 76
  • 161
  • This is great, appreciate your suggestions. Definitely will try to make use of these ideas. However, as an immediate solution, we just went ahead with the overprovisioning approach by deploying a low-priority pod. Basically the cluster will have n+1 nodes all the time. – karthikeayan Jun 20 '23 at 08:33
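The overprovisioning trick mentioned in the comment is usually implemented as a negative-priority PriorityClass plus placeholder "pause" pods that hold a spare node warm; when a real pod arrives, the scheduler preempts the placeholders, so the user never waits for a node to boot. A sketch of the two manifests as plain dicts, with the resource sizes and names being placeholders:

```python
# Sketch of cluster overprovisioning via preemptible placeholder pods.
def overprovisioning_manifests(replicas: int = 1) -> tuple:
    """Return (PriorityClass, Deployment) manifests for warm spare capacity."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        "value": -1,  # lower than any real workload, so it gets preempted
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning-placeholder"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "overprovisioning"}},
            "template": {
                "metadata": {"labels": {"app": "overprovisioning"}},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",
                        # Request roughly one node's worth of resources so a
                        # whole spare node stays warm (sizes are placeholders).
                        "resources": {"requests": {"cpu": "3", "memory": "12Gi"}},
                    }],
                },
            },
        },
    }
    return priority_class, deployment
```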