
Wikipedia lists 11 cache replacement algorithms. Assuming I know almost nothing about the application I'm going to develop, what should I use as a "default" cache replacement algorithm?

If I recall correctly from my OS course, LRU is the best general cache replacement algorithm. But maybe I'm mistaken.

Also, this is a bit of an academic question, since, generally, main memory is cheap and abundant and I don't really need to worry about cache size too much.

ashes999
  • Is pre-fetching relevant to your application? If so, the pre-fetching and retaining strategy must be considered together when choosing algorithms. – rwong Apr 23 '11 at 05:20
  • You will need to obtain sample traces (the list of data access patterns) that are representative of your intended application domain. You may be able to find publicly-available test sets from academic research. Then you can implement each algorithm, do simulation, and report your findings. Failing that, use LRU with sparingly random replacement. – rwong Apr 23 '11 at 06:52
  • If you "know almost nothing about the application" then it's *far* too early to think about "efficient" cache replacement algorithms. – Anon Apr 23 '11 at 11:41
  • Main memory may be cheap, but if performance is an important issue, access efficiency will matter. I don't think you get to choose your cache replacement strategy, unless you are the chief architect of a new computer. The rest of us get whatever the market offers. If you need to go fast, you need to organize your computation and data structures to make efficient use of the memory hierarchy. – Omega Centauri Apr 23 '11 at 23:08
  • @Omega Centauri You are thinking of the CPU caches only, but there's much more. The OS caches used files and directories, databases cache their data, and nearly every application does a lot of caching (e.g. of already computed results). – maaartinus Apr 24 '11 at 11:27
  • *"Also, this is a bit of an academic question, since, generally, main memory is cheap and abundant and I don't really need to worry about cache size too much."* - Wrong, unless you're working on a small app. Databases, web servers, and many others use a lot of RAM for caching and would profit from more RAM or a better strategy. – maaartinus Apr 24 '11 at 11:30

5 Answers


I guess the best answer is that it depends. In my experience, there are a lot of factors that go into choosing a caching algorithm.

Factors to consider

  1. Read/Write Balance. (What percentage of accesses are reads vs writes)
  2. Amount of cache.
  3. Type of media behind the cache. (Are they slow SATA drives or fast SSD drives?)
  4. Hits vs Misses. (How often are things rewritten or reread?)
  5. Average access size. (This goes into choosing the page size.)
  6. How expensive are reads and writes?

Once you consider all the different factors, you then need to find a cache algorithm that handles them best. For example, say you have an application with a lot of writes, some rewrites, reads of recently written data, and some sort of spinning media. In this case you would want a hybrid caching algorithm: something like Wise Ordering for Writes (WOW) to handle the written data, and an LRU algorithm for data that has been read from disk. The reason is that disk accesses are very expensive, so the WOW algorithm makes writing out data more efficient, while the LRU keeps recently accessed data in the cache.

If instead you have SSDs, which have very fast access times, you might gear your choice toward a plain LRU algorithm, since disk accesses are relatively inexpensive.
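For reference, the LRU policy mentioned above takes only a few lines of Python. This is a minimal sketch, not the WOW/LRU hybrid described in the answer, and the capacity of 4 is an arbitrary placeholder:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        if key not in self._entries:
            return None                     # cache miss
        self._entries.move_to_end(key)      # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
```

Swapping out the eviction line is essentially all it takes to experiment with a different policy against the same workload.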

So really, what I want to say is that there is no single "best" answer. The best answer is to know the factors that apply to you and choose an algorithm that handles them best.

How to find the algorithm for you

Profile your system. This usually involves adding code to keep statistics for memory accesses. By profiling you can see which factors are most important to you.

In the past I have added code to track all memory accesses over a period of time. Then I look for patterns: re-reads, re-writes, sequential access, random access, etc.

Once you have identified the things that matter, you need to look at the different types of caching algorithms to see which one handles those access patterns best.
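As a rough illustration of that kind of instrumentation, here is one way to record such a trace in Python; the wrapper class, the event format, and the cache_trace.log filename are all made up for the example:

```python
import time

class AccessRecorder:
    """Wraps a dict-like store and logs every access, so the trace can be
    analysed later for re-reads, re-writes, sequential vs. random access, etc."""

    def __init__(self, backing_store, trace_path="cache_trace.log"):
        self.store = backing_store
        self.trace = open(trace_path, "a")

    def _log(self, op, key):
        # timestamp, operation and key are enough to reconstruct the pattern
        self.trace.write("%f\t%s\t%s\n" % (time.time(), op, key))

    def read(self, key):
        self._log("READ", key)
        return self.store.get(key)

    def write(self, key, value):
        self._log("WRITE", key)
        self.store[key] = value
```

Replaying the recorded trace against each candidate policy and comparing hit rates is then a small simulation loop, which is usually far cheaper than benchmarking the real system under every algorithm.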

barrem23
  • Great break-down of factors. But I'm not sure how to apply those, given that I know the app domain and the factors. – ashes999 Apr 23 '11 at 01:41
  • @ashes: There's the old engineering technique: build a few in different ways and measure which works best. – Donal Fellows Apr 23 '11 at 15:47
  • When I hear "cache" I think of the storage between memory and the CPU registers. Here you are talking about a disk cache, which is a layer in between memory and one or more I/O devices. – Omega Centauri Apr 23 '11 at 23:10
  • @barrem23 If you're doing distributed programming, there's also the "distance between the cache and the back-end storage being cached" to consider. It doesn't matter much whether you have an SSD or spinning rust as your large, stable storage: if the storage is 15 ms away, you will always incur a minimum 30 ms round-trip anyway. – Vatine Dec 28 '13 at 14:46

Assuming you know almost nothing about the application you're going to develop, you should know more about it before actually choosing and implementing a cache system. In other words, there are no default implementations: some are good for some purposes, and are totally bad for others.

For example, take just two implementations: Least Recently Used and Least Frequently Used. How do you decide which one to use over the other?

  • LRU is good when you are pretty sure that the user will more often access the most recent items, and never or rarely return to the old ones. An example: typical usage of an e-mail client. In most cases, users are constantly accessing the most recent mails. They read them, postpone them, come back to them a few minutes, hours or days later, etc. They may find themselves searching for a mail they received two years ago, but that happens less frequently than accessing mails they received in the last two hours.

  • On the other hand, LRU makes no sense in a context where the user accesses some items much more frequently than others. An example: I frequently listen to the music I like, and out of 400 songs, I might listen to the same five at least once per week, while listening at most once per year to the 100 songs I don't like much. In this case, LFU is much more appropriate.

Taking just these two implementations, you can see that there is no "default" algorithm to fall back on when you don't want to think about which one is better, or don't have enough information about the application. It's like asking whether, by default, you should add, subtract, multiply or divide two numbers to get a result when you know nothing about the calculation.
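To make the contrast concrete, here is a minimal LFU sketch in Python to set against the usual LRU. It is an illustration only: ties are broken arbitrarily, and real implementations usually age the counts so old favourites can eventually be evicted.

```python
from collections import Counter

class LFUCache:
    """Minimal LFU cache: evicts the least frequently used entry when full."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._values = {}
        self._counts = Counter()   # number of accesses per cached key

    def get(self, key):
        if key not in self._values:
            return None            # cache miss
        self._counts[key] += 1
        return self._values[key]

    def put(self, key, value):
        if key not in self._values and len(self._values) >= self.capacity:
            victim = min(self._counts, key=self._counts.get)  # least used key
            del self._values[victim]
            del self._counts[victim]
        self._values[key] = value
        self._counts[key] += 1
```

On the music-library workload above, the five favourite songs quickly accumulate high counts and stay resident, whereas an LRU cache would keep evicting them whenever a run of rarely played songs goes by.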

Arseni Mourzenko
  • Ok, so how do I go about picking an algorithm? Run through Wikipedia's list and see what makes the best fit? – ashes999 Apr 23 '11 at 01:41
  • @ashes999: exactly! First, you learn more about the requirements of the application, then you analyze the pros and cons of the different cache algorithms, and finally you choose the most appropriate one. – Arseni Mourzenko Apr 23 '11 at 02:16

Why limit your choices to Wikipedia? If you have access to a research database like the ACM Digital Library, you'll find even more algorithms. Also, be careful about patents. For example, ARC is a good algorithm, but unfortunately it is patented.

sakisk

You could spend lots of time agonizing over the 'best' algorithm, or you could just implement a simple algorithm and GET ON WITH THE REST OF THE SYSTEM. When you have something testable, then worry about the algorithm.

Premature optimization ...
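In that spirit, the simplest testable thing is often whatever the language already ships with; for example, in Python a memoised function is one decorator (the maxsize of 1024 and the function name are placeholders):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_lookup(key):
    # stand-in for whatever slow computation or I/O is being cached
    ...

# expensive_lookup.cache_info() reports hits, misses and current size,
# which is enough data to decide later whether a fancier policy is worth it.
```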

Ross

There is no perfect cache algorithm - you can always find an access pattern for which it behaves very badly.

Therefore it is important to know the problem being cached in order to determine which algorithm will behave least badly.

Also, you should consider how long you need to cache things and how long you can cache things...
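A rough way to express that "how long" constraint is a time-to-live per entry; this is only a sketch, and the 300-second TTL is an arbitrary example:

```python
import time

class TTLCache:
    """Dict-like cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._entries = {}                    # key -> (value, expiry timestamp)

    def put(self, key, value):
        self._entries[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None                       # never cached
        value, expires_at = entry
        if time.time() > expires_at:
            del self._entries[key]            # too stale to serve
            return None
        return value
```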