
Please note that this is about a thread-safe LRU cache (not simply an LRU cache as specified in https://leetcode.com/problems/lru-cache/description/). It isn't a duplicate of the LRU cache design question, because there are tricky aspects of locking the hash table and linked list (LL) that aren't addressed in other multithreaded LRU design questions.

Advice on how to make an LRU cache thread-safe in C++ seems to be all over the place. Some links mention that locking both the hash table and the linked list is fine, while others don't address the issue in C++ directly and only vaguely discuss various ways to do it.

  • Can someone describe a functional and efficient approach, using locks, to implementing a multithreaded version of the LRU cache in C++ (https://leetcode.com/problems/lru-cache/description/), spelling out what is locked and when, so that all get/set operations stay O(1)?

  • How do caches such as memcached (used heavily at Facebook) implement multithreading support?

Joe Black
  • Well, first of all, locks won't change the complexity in any way, while adding significant constant overhead. Second, you are limiting yourself by not considering other synchronization methods, or no synchronization at all if we are talking about single-threaded applications; Redis, for example, performs quite well despite being single-threaded. – Andrew Selivanov Aug 14 '18 at 08:53
  • @andrew Any effective synchronization method to make it concurrent will do. Can you post code? Memcached has a multithreaded LRU, so they probably use concurrent locks. – Joe Black Aug 14 '18 at 19:36
  • What would the code be to make it concurrent using locks? – Joe Black Aug 14 '18 at 19:37

2 Answers


I have implemented this using a multiple-reader/single-writer lock. This is a type of lock that allows multiple threads to read from the cache at the same time, but only a single thread to write to it. The basic gist of it is something like this (pseudocode):

void put(ItemType item, KeyType key)
{
    cacheLock.lockForWriting();

    // If it is already in the list and map, remove it
    list.remove(key);
    map.remove(key);

    // Now put it back into both
    map[key] = item;
    list.append(key);

    // If we're past the size limit of the cache, remove the least recently used item
    if (list.size() > MAX_CACHE_SIZE)
    {
        ListNode *oldKey = list.head;
        list.head = list.head->next;
        map.remove(oldKey->key);
        delete oldKey;
    }

    cacheLock.unlock();
}

ItemType* get(KeyType key)
{
    cacheLock.lockForReading();
    ItemType* item = map[key];
    cacheLock.unlock();

    // NOTE: This assumes that the cache is not the owner of this object.
    // If it is, then you need to either make a copy of it inside the lock
    // above or use smart pointers instead of raw pointers,
    // because when it gets kicked out of the cache it will be deleted.
    return item;
}
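
For concreteness, a compilable C++17 version of the same approach might look roughly like the sketch below, using std::shared_mutex as the reader/writer lock, std::list for recency order, and std::unordered_map for O(1) lookup (the class and member names are made up for this illustration):

#include <cstddef>
#include <list>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <unordered_map>
#include <utility>

// Illustrative sketch only: a reader/writer-locked LRU cache in C++17.
template <typename Key, typename Value>
class ThreadSafeLruCache {
public:
    explicit ThreadSafeLruCache(std::size_t capacity) : capacity_(capacity) {}

    void put(const Key& key, Value value) {
        std::unique_lock lock(mutex_);            // exclusive ("writer") lock
        auto it = map_.find(key);
        if (it != map_.end()) {                   // already present: drop the old entry
            order_.erase(it->second.second);
            map_.erase(it);
        }
        order_.push_back(key);                    // back of the list = most recently used
        map_.emplace(key, std::make_pair(std::move(value), std::prev(order_.end())));
        if (map_.size() > capacity_) {            // evict the least recently used key
            map_.erase(order_.front());
            order_.pop_front();
        }
    }

    // Shared ("reader") lock: many threads may call get() concurrently.
    std::optional<Value> get(const Key& key) const {
        std::shared_lock lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;
        return it->second.first;                  // return a copy, not a pointer into the cache
    }

private:
    std::size_t capacity_;
    mutable std::shared_mutex mutex_;
    std::list<Key> order_;                        // recency order, front = oldest
    std::unordered_map<Key, std::pair<Value, typename std::list<Key>::iterator>> map_;
};

As with the pseudocode above, get() takes only the shared lock and returns a copy, so concurrent reads never block each other; it also means get() does not refresh the key's recency.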
user1118321
  • It is not a Least Recently Used cache, because the accessor does not modify the cache. The element removed is the oldest created, which may not be the oldest used. – sandwood Mar 13 '21 at 19:53
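
A get() that preserves true LRU order has to take the exclusive lock so it can move the accessed key to the most-recently-used position. In terms of the hypothetical ThreadSafeLruCache sketch above, that would look roughly like this:

// Variant of get() for the ThreadSafeLruCache sketch above: the exclusive
// lock lets us splice the key to the back of the recency list, so eviction
// really removes the least recently *used* entry.
std::optional<Value> get(const Key& key) {
    std::unique_lock lock(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) return std::nullopt;
    order_.splice(order_.end(), order_, it->second.second);  // move key to MRU position
    return it->second.first;
}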

I wrote a graphics card memory cacher in RAM: https://github.com/tugrul512bit/VirtualMultiArray

It uses n parallel "clock-second-chance-with-2-pointers" caches instead of multithreading a single cache. This way, each element access of the virtual array needs only one lock, which lowers single-threaded lock-free access throughput from 50 MB/s down to "only" around 25 MB/s. The more locks added per cache, the worse single-threaded access gets, e.g. around 10 MB/s. But when all caches are used concurrently, running 64 threads on 8 logical cores, throughput goes up to 215 MB/s (these numbers are from a Gaussian blur benchmark with a 4-byte data size per access; when fetching "tiles of pixels" instead, bandwidth reaches 1.2 GB/s).

Every independent cache has its own lock and maps onto a strided slice of the virtual array. Combining the strides gives the whole array, in which any n neighboring cache lines are completely independent to lock.
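
As a rough illustration of that layout (hypothetical code, not the repository's), the striding amounts to index % numCaches selecting which independently locked cache serves an element:

#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch (not the repository's code): N independent caches, each
// guarding its own strided slice of the virtual array with its own lock.
// CacheType stands in for the per-stride cache (e.g. a clock/second-chance
// cache) and is assumed to be default-constructible.
template <typename CacheType>
class StridedCaches {
public:
    explicit StridedCaches(std::size_t numCaches) : shards_(numCaches) {}

    // Element i always maps to shard i % N, so accesses to different strides
    // never contend on the same lock; one lock is taken per element access.
    template <typename Fn>
    auto access(std::size_t index, Fn&& fn) {
        Shard& shard = shards_[index % shards_.size()];
        std::lock_guard<std::mutex> guard(shard.lock);
        return fn(shard.cache, index);
    }

private:
    struct Shard {
        std::mutex lock;
        CacheType  cache;
    };
    std::vector<Shard> shards_;
};

Each shard can then run its own single-threaded replacement policy behind its lock; threads contend only when they happen to touch the same stride.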

Having an unordered map to find clock elements in O(1) already adds non-negligible latency, and adding multithreading inside each cache would slow it down further for single-threaded usage. I think losing some latency to improve the hit ratio is better than losing scalability. You can also have independent caches for arbitrary regions of the array, not just a strided pattern. For example, a 2D cache for image processing, where each cache is active only for one tile of the image: then many threads can work concurrently and fully efficiently, even without locks or atomics, provided threads are guaranteed not to go outside their tile boundaries.
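
A minimal sketch of that tile-based arrangement (again hypothetical, not the repository's code): each thread owns one tile and one private cache, so no lock or atomic is required as long as it stays inside its tile:

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: one private cache per image tile, no locks or atomics.
// Correctness relies on the guarantee that each thread only touches the tile
// it owns. TileCache (assumed default-constructible) and Work are placeholders.
template <typename TileCache, typename Work>
void processTiles(std::size_t numTiles, Work work) {
    std::vector<TileCache> caches(numTiles);     // one cache per tile, never shared
    std::vector<std::thread> workers;
    workers.reserve(numTiles);
    for (std::size_t t = 0; t < numTiles; ++t) {
        // Each worker gets its own copy of `work` and exclusive use of caches[t].
        workers.emplace_back([&caches, t, work]() mutable {
            work(caches[t], t);                  // process only the pixels of tile t
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}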