
I have an SQL database in Azure that is ~4.5 TB / 150 million rows worth of data. Users will need to query this database for large sets of data. Right now, I am having users submit their requests through a web application hosted on an Azure VM.

How can I serve the rows pulled from the Azure SQL database directly to the user? It seems slow for the server to have to stream the data from Azure SQL, and then for the client to have to stream that from the VM.

I wish the user could directly query the Azure SQL db from the client side, but the problem is that the user's request has to be preprocessed and manipulated against a reference set of over a gigabyte of data; the request to the SQL db needs to happen server-side. Any help appreciated!

  • *"It seems slow"* - does it only "seem to be slow" because you are guessing, or is there a real, measurable performance issue? – Doc Brown Mar 08 '21 at 12:13
  • Depending on what your requirements are, if you are returning large numbers of rows (>100k) it may be worth having the web server convert the results to a compressed columnar format like Parquet or Apache Arrow (a rough sketch follows this comment list). This could potentially save slow and expensive internet bandwidth. – user1937198 Mar 09 '21 at 09:57
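
As a rough illustration of user1937198's suggestion, here is a minimal sketch assuming a Python backend with pandas and pyarrow; the DataFrame and column names are placeholders standing in for whatever the SQL query actually returns:

```python
# Sketch: serialize a query result as compressed Parquet before sending it to
# the client, trading some server CPU for much less bandwidth on the wire.
# The DataFrame below is a stand-in for the real result set.
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def to_parquet_bytes(df: pd.DataFrame) -> bytes:
    table = pa.Table.from_pandas(df)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="zstd")  # columnar + compressed
    return buf.getvalue()


# Example with a stand-in result set
df = pd.DataFrame({"id": range(5), "value": [1.0, 2.5, 3.5, 4.0, 5.5]})
payload = to_parquet_bytes(df)
print(len(payload), "bytes")  # typically far smaller than the equivalent JSON/CSV
```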

1 Answer


No, you should NOT expose the database publicly. This is very difficult to do safely, and it is difficult to integrate directly with an HTTP-based API.

So what you're doing – having a web app backend that sits between the user and the database – is the typical architecture. This can deliver perfectly acceptable performance, especially if the backend server can scale horizontally. But how performant this will be is more a function of the networking hardware and network architecture than of the software architecture. And in most cases, the user's internet connection is the limiting factor, not your server.
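
To make that concrete, here is a minimal sketch of a backend endpoint that streams rows to the client instead of buffering the whole result set; it assumes a Python backend with Flask and pyodbc, and the connection string, table, and column names are placeholders rather than anything from the question:

```python
# Sketch: stream query results to the client chunk by chunk, so the server
# never holds the full result set in memory. Flask + pyodbc are assumed;
# connection string, table, and column names are placeholders.
import json

import pyodbc
from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)
CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;"  # placeholder


@app.route("/query")
def query():
    # Server-side preprocessing against the reference data would happen here,
    # turning the user's request into a parameterized SQL query.
    customer_id = request.args.get("customer_id")

    def generate():
        with pyodbc.connect(CONN_STR) as conn:
            cursor = conn.cursor()
            cursor.execute(
                "SELECT col_a, col_b FROM big_table WHERE customer_id = ?",
                customer_id,
            )
            columns = [d[0] for d in cursor.description]
            while True:
                rows = cursor.fetchmany(5000)  # fetch and forward in chunks
                if not rows:
                    break
                for row in rows:
                    yield json.dumps(dict(zip(columns, row))) + "\n"

    # Newline-delimited JSON keeps the response streamable end to end.
    return Response(stream_with_context(generate()), mimetype="application/x-ndjson")
```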

What you can do is consider whether a SQL database has to store all of this data. If you have columns that contain large blobs (multiple MB), then you might want to consider storing that data externally. Blob storage / object storage such as S3-like storage could be more appropriate. Since every object has a URL, the user can fetch it directly without having to go through your backend. Access controls can be implemented with signed URLs that are only valid for a finite time. Your backend would then get a list of object IDs from the SQL DB and generate signed URLs for the user.
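
As a rough sketch of the signed-URL idea on Azure, using the azure-storage-blob package (v12-style API; the account, container, and blob names are placeholders, and details may differ between SDK versions):

```python
# Sketch: generate a short-lived, read-only SAS URL for a blob, so the client
# can download the object directly from storage instead of through the backend.
# Account, key, container, and blob names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas


def make_download_url(account: str, key: str, container: str, blob: str) -> str:
    sas = generate_blob_sas(
        account_name=account,
        container_name=container,
        blob_name=blob,
        account_key=key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(minutes=15),  # short-lived
    )
    return f"https://{account}.blob.core.windows.net/{container}/{blob}?{sas}"


# The backend would look up the relevant blob names in the SQL DB, return a
# list of these URLs, and the client fetches the blobs directly from storage.
```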

Note that cloud egress charges can make such a design much more expensive than doing extensive processing and filtering within your backend. Sometimes the user will truly need access to the full data, but often they just need a summary or report of the data.

In this scenario it seems that you merely have a lot of data, not large blobs: 4.5 TB over 150 million rows works out to an average row size of only about 30 kB. So moving parts of the DB to external storage is likely not feasible, though it could be possible for the reference data set.

amon
  • @DocBrown Good point, I added a paragraph discussing the average row size. I get 30kB though, not 30 bytes. Whether this average value is meaningful would depend on the table structure. – amon Mar 08 '21 at 14:27
  • Yes, you are right, 30 kB, not 30 bytes - I should double-check my math ;-) Looks large for a record without Blobs, but small for a Blob record. Hard to say without further info from the OP. – Doc Brown Mar 08 '21 at 14:38
  • Thanks y'all! To clarify on the "scaling horizontally" and the "networking hardware" points: does "scaling horizontally" mean "able to handle many requests at once"? And as for the networking hardware, this is what the server uses to talk to the Azure SQL database, right? – Alex Gao Mar 08 '21 at 20:29
  • I suppose I should make another Stack Overflow post about the scaling part; right now I'm just running an NGINX server in an Azure VM, which is sure not to scale haha. – Alex Gao Mar 08 '21 at 20:41