
I have multiple files (one per CountryCode), each of which gets ~5000 entries added to it per day.

Each entry in the file looks like this (256 chars max):

{countryCode_customerId:{"ownerId": "PDXService","notificationId": "0123456789-abcdef","requestDate": "1970-01-01T00:00:00Z","retentionDate": "2020-08-13T14:02:35Z"}}

My API gets around 4K TPS, and each request carries a CountryCode and a CustomerID. For each request, I must query this file: if the countryCode_customerId is found in the file, I must reject the request. (I have to use a local file to avoid latency overhead.)

My concern is that this file is unbounded and can grow quite large. I want to know which compression algorithm would fit such a file best while still allowing fast lookups.


I have considered a trie and a DAWG. If you can suggest better options, that would be greatly appreciated!
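
For illustration, here is a minimal trie sketch of the kind of structure I'm considering (Python; the keys shown are made up). Shared key prefixes such as the country code are stored only once, and a membership check walks one node per character:

    class TrieNode:
        def __init__(self):
            self.children = {}      # char -> TrieNode
            self.terminal = False   # True if a complete key ends at this node


    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, key):
            node = self.root
            for ch in key:
                node = node.children.setdefault(ch, TrieNode())
            node.terminal = True

        def contains(self, key):
            node = self.root
            for ch in key:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.terminal


    # Keys are "<countryCode>_<customerId>" (example values only)
    blacklist = Trie()
    blacklist.add("US_0123456789")
    print(blacklist.contains("US_0123456789"))  # True  -> reject the request
    print(blacklist.contains("DE_0123456789"))  # False -> allow the request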

sync101
  • TIL that a bloom filter may be a good speedup mechanism (no false negatives, relatively few false positives.) You only need to consult the file/database when the bloom filter registers a match. – Hans-Martin Mosner Jul 30 '20 at 11:34
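
A minimal sketch of the Bloom filter idea from that comment (pure Python; the sizing parameters are illustrative assumptions). The filter gives no false negatives, so a request only needs to fall through to the file when the filter reports a possible match:

    import hashlib
    import math


    class BloomFilter:
        def __init__(self, expected_items, false_positive_rate=0.01):
            # Standard sizing formulas: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes
            self.size = max(1, int(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
            self.hash_count = max(1, int(self.size / expected_items * math.log(2)))
            self.bits = bytearray((self.size + 7) // 8)

        def _positions(self, key):
            # Derive k bit positions from two 64-bit halves of a SHA-256 digest
            digest = hashlib.sha256(key.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.hash_count):
                yield (h1 + i * h2) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False means "definitely not present"; True means "check the file"
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


    # Example usage with illustrative numbers
    bf = BloomFilter(expected_items=5_000_000, false_positive_rate=0.001)
    bf.add("US_0123456789")
    if bf.might_contain("US_0123456789"):
        pass  # only now consult the file / database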

1 Answer


The easiest "compression" would be:

Apparently, you're storing at least twice as much information in that file as you need (you don't mention what anything other than the customerId is used for). Since this seems to be some sort of blacklist, store only the relevant data (the list of customer IDs) in a file you access directly, and regenerate that file frequently (e.g., daily) from the full file.

Any compression beyond that will certainly reduce the file size further, but it will also hurt lookup performance.
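
As a rough sketch of that approach (Python; file names are hypothetical, and the key extraction assumes each line starts with the countryCode_customerId key, as in the entry format shown in the question): a daily job strips each line down to its key, and the API process loads those keys into an in-memory set for O(1) rejection checks:

    def build_key_file(full_path, keys_path):
        """Daily job: keep only the countryCode_customerId keys from the full file."""
        with open(full_path) as src, open(keys_path, "w") as dst:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                # The key is everything between the opening '{' and the first ':'
                key = line[1:line.index(":")].strip().strip('"')
                dst.write(key + "\n")


    def load_keys(keys_path):
        """API startup / periodic refresh: load the trimmed keys into a set."""
        with open(keys_path) as f:
            return {line.strip() for line in f if line.strip()}


    # blocked = load_keys("us_keys.txt")                 # hypothetical file name
    # if f"{country_code}_{customer_id}" in blocked:     # O(1) membership test
    #     ...  # reject the request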

tofro