
So I have an engine in progress that's structured like this:

  • entities are simple ids (unsigned short)
  • components are stored in pools which are static members
  • systems are stored by the manager, and all inherit from a single base

Here's what this looks like in code:

struct transformComponent {
    const unsigned short id;
    vec2 pos, dim;

    transformComponent(vec2 pos, vec2 dim, unsigned short id): pos(pos), dim(dim), id(id) {}

    static componentPool<transformComponent> pool; // allows for scalability
};

struct physicsComponent {
    const unsigned short id;
    vec2 v, a, f;
    unsigned short mass;
    float invmass;

    physicsComponent(vec2 v, vec2 a, vec2 f, unsigned short mass, unsigned short id): v(v), a(a), f(f), mass(mass), invmass(1.0f / mass), id(id) {}

    static componentPool<physicsComponent> pool;
};

// same for other components

struct system {
    virtual void update() = 0; // don't care about virtual call, there will be only one per system
    virtual ~system() = default; // needed if a system is ever deleted through a system*
};

struct physicsSystem: public system {
    virtual void update() override; // the problem
};

struct room { // the manager
    std::vector<system*> systems;
    std::unordered_set<unsigned short> activeIds;

    void update() {for(auto* sys: systems) {sys->update();}}
};

So far I've done everything I could to make this as cache-friendly as possible. The thing is: if the physics loop reads the physics components and writes to the transforms, doesn't that defeat the whole point? I'll let some code explain what I mean:

/* either:
  physicsSystem keeps a hashmap from the ids it has to loop over to component pointers, updated only when necessary (std::unordered_map<unsigned short, std::tuple<transformComponent*, physicsComponent*>> comps;), but that breaks purity (systems should have no state)
or
  comps is rebuilt every frame (for every system), which is terribly inefficient
*/

// either:
void physicsSystem::update() {
    for(auto& pair: comps) { // by reference, to avoid copying the tuple each iteration
        physicsComponent* phy = std::get<physicsComponent*>(pair.second);
        transformComponent* tra = std::get<transformComponent*>(pair.second);

        phy->a = phy->f * phy->invmass;
        phy->v += phy->a;
        tra->pos += phy->v;

        // cache misses to load both transform and physics twice, for every entity, for every system, for every frame
    }
}

// or:
// comps is actually an std::unordered_map<unsigned short, std::tuple<transformComponent*, physicsComponent*, vec2>>
void physicsSystem::update() {
    for(auto& pair: comps) { // must be by reference, or the write to schedule is lost on a copy
        physicsComponent* phy = std::get<physicsComponent*>(pair.second);
        vec2& schedule = std::get<vec2>(pair.second);

        phy->a = phy->f * phy->invmass;
        phy->v += phy->a;
        schedule = phy->v;
    }
    for(const auto& pair: comps) {
        std::get<transformComponent*>(pair.second)->pos += std::get<vec2>(pair.second);
    }
    // smooth first loop, but still cache misses in the second (I think)
}

So, my question is: how would one go about making this loop cache-friendly? By that I mean: no cache misses per entity, only when the array size exceeds the cache and when switching between component types. I'd be happy to hear different approaches as well. TIA.

  • If performance is such a priority you may also wish to look at your memory management. Try using [Quiescent States](https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/?) for garbage collection. – Kain0_0 May 31 '20 at 22:56

1 Answer


So, it turns out I didn't know how the cache actually works, and I'm writing this in the hope that someone else won't have to go through the same headache.
My first approach is actually fine. Because yes, cache lines are (usually) 64 bytes (it depends on the architecture), but the caches themselves are much bigger (on my 2016 laptop, L3 is 3 MB; to check on Windows, open Task Manager -> More details -> Performance).

So, what happens when you ask for the byte at address 0x000001 is that the whole (64-byte-aligned) cache line containing it gets brought in: addresses 0x000000 through 0x00003F. If you then ask for any of those bytes, they will already be cached and at your disposal.
But if your loop asks for address 0x000041, the CPU first looks in L1 (smallest and fastest); if it finds your byte, it hands it over, otherwise it checks L2 (bigger and slower), and if it doesn't find it there it looks in L3 (biggest and slowest). Only then, if it still hasn't found it, does it go to RAM. And what happens now? The line at addresses 0x000040 through 0x00007F is loaded, but it doesn't evict the previous one. Which means that if you now ask for address 0x00000f, for example, it will still be cached and ready to go.