Data Structures/Allocators With Holes to Keep Stable Indices/Pointers
Typically a high-performance ECS where element removal is very common [*] will use a random-access sequence with holes in it to make removal of an element a constant-time operation. Vacant spaces get reclaimed upon subsequent insertions, similar to a free list. So it's similar to the dummy solution you mentioned, although in a language like C++ you can actually invoke destructors on the elements being removed (placement new and manual destructor invocation: this is how std::vector avoids the need for default constructors in push_back in spite of often allocating a memory capacity that exceeds the number of elements inserted so far), and you can associate data in parallel to indicate which elements are removed.
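To make that concrete, here's a minimal sketch (my own illustration, not anyone's production code) of a fixed-capacity pool that placement-news elements into raw storage, invokes destructors manually on removal, and keeps a parallel flag array plus a free list of reclaimable holes:

```
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Sketch only: a pool that constructs elements with placement new into raw
// storage and destroys them explicitly on removal, so no default constructor
// is needed and removed slots become reclaimable "holes".
// (Capacity checks omitted for brevity.)
template <class T>
class SlotPool {
public:
    explicit SlotPool(std::size_t capacity)
        : storage_(::operator new(sizeof(T) * capacity)),
          occupied_(capacity, false), capacity_(capacity) {}

    ~SlotPool() {
        for (std::size_t i = 0; i < capacity_; ++i)
            if (occupied_[i]) data()[i].~T();            // destroy live elements only
        ::operator delete(storage_);
    }

    // Reuses a free slot if one exists; returns the stable index of the element.
    template <class... Args>
    std::size_t insert(Args&&... args) {
        std::size_t i = free_.empty() ? next_++ : pop_free();
        ::new (static_cast<void*>(data() + i)) T(std::forward<Args>(args)...);
        occupied_[i] = true;
        return i;
    }

    void remove(std::size_t i) {
        data()[i].~T();                                  // manual destructor invocation
        occupied_[i] = false;
        free_.push_back(i);                              // hole to reclaim later
    }

    T& operator[](std::size_t i) { return data()[i]; }

private:
    T* data() { return static_cast<T*>(storage_); }
    std::size_t pop_free() { std::size_t i = free_.back(); free_.pop_back(); return i; }

    void* storage_;
    std::vector<bool> occupied_;        // parallel "is this slot live?" flags
    std::vector<std::size_t> free_;     // free list of vacant indices
    std::size_t capacity_, next_ = 0;
};
```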
- I should note that there are even ones that will completely deep copy and rearrange components on the fly to perfectly suit the contiguous access patterns (no holes) of a particular system based on the component types it accesses. Those can still end up using a stable index or pointer into some main component list that's accessible to all systems. I don't think the production ones will generally use associative data structures for this purpose; I suspect in those cases that entity removal is relatively infrequent and that components are relatively cheap to deep copy around.
There are a lot of ways to go about what I'm talking about with "holes" in your sequence. There's the straightforward solution of associating a bitset or array of booleans in parallel to (or possibly even interleaved with) your components, plus a list of free indices to reclaim for subsequent insertions. The former tends to outperform the latter if you use very efficient bitwise operations like FFS/FFZ, but performs worse if not. Surprisingly, bitsets can perform worse than a boolean array in spite of requiring 1/8th the memory, even for sequential access, if you have to test bits individually, at least with std::bitset and operator[], since operator[] conceptually has to do an extra bitwise AND (in place of a modulo) and right shift (in place of a division) to figure out which byte and relative bit to access, given its random-access nature.
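For illustration, here's a rough sketch of what I mean by a raw occupancy bitset over 64-bit words (as opposed to std::bitset with operator[]), where the shift and mask are the division and modulo mentioned above:

```
#include <cstdint>
#include <vector>

// Rough sketch: an occupancy bitset stored as plain 64-bit words, so whole
// words can later be scanned with FFS/FFZ instead of testing one bit (and
// taking one branch) per element.
struct OccupancyBits {
    std::vector<std::uint64_t> words;

    explicit OccupancyBits(std::size_t count) : words((count + 63) / 64, 0) {}

    // i >> 6 and i & 63 are the "division" and "modulo" that per-bit access
    // has to perform to find the word and the bit within it.
    void set(std::size_t i)   { words[i >> 6] |=  (std::uint64_t{1} << (i & 63)); }
    void clear(std::size_t i) { words[i >> 6] &= ~(std::uint64_t{1} << (i & 63)); }
    bool test(std::size_t i) const { return (words[i >> 6] >> (i & 63)) & 1u; }
};
```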
Another way I've seen used in game engines is to separate some data structure of valid indices from the invalid ones, or even to use erase-swap-back methods (swap the element to remove with the element at the back and pop the back in constant time, then update indices or pointers to elements through a level of indirection) to keep element removal constant-time (although the update of the indices can degrade to linear time with such solutions, but at least it's only updating integers and not hefty data)*.
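Here's a hedged sketch of one common variant of that idea (the names and layout are mine, not any particular engine's): components stay densely packed via swap-and-pop, while stable handles go through a small indirection table whose integers get patched when an element moves:

```
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of erase-swap-back with one level of indirection. Handle recycling is
// omitted for brevity, so handle_to_dense only ever grows here.
template <class T>
struct DenseSet {
    std::vector<T> dense;                      // tightly packed components (no holes)
    std::vector<std::size_t> dense_to_handle;  // dense slot -> owning handle
    std::vector<std::size_t> handle_to_dense;  // stable handle -> current dense slot

    std::size_t insert(T value) {
        std::size_t handle = handle_to_dense.size();
        handle_to_dense.push_back(dense.size());
        dense_to_handle.push_back(handle);
        dense.push_back(std::move(value));
        return handle;                         // stays valid across removals
    }

    void remove(std::size_t handle) {
        std::size_t slot = handle_to_dense[handle];
        std::size_t last = dense.size() - 1;
        if (slot != last) {
            dense[slot] = std::move(dense[last]);    // swap with the back...
            std::size_t moved_handle = dense_to_handle[last];
            dense_to_handle[slot] = moved_handle;
            handle_to_dense[moved_handle] = slot;    // only integers get patched
        }
        dense.pop_back();                            // ...and pop in constant time
        dense_to_handle.pop_back();
    }

    T& get(std::size_t handle) { return dense[handle_to_dense[handle]]; }
};
```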
One trick I rely on a lot, if you're iterating over indices to elements, is to store ranges of them as pairs of indices of the form [first, last) rather than every single index. That allows us to represent an integer range like [0, 1,000,000) with just two integers instead of a million integers we have to load into memory. That works best when your containers have very few holes in them most of the time, which end up being reclaimed on subsequent insertions. Efficient bitsets can be a nice alternative otherwise, including atomic bitsets where 64-bit words can be manipulated atomically across threads (required for thread safety with parallel loops over bitsets given the overlap in data without formal locks). Ideally you still distribute the work among your threads such that they aren't writing to the same 64-byte-aligned cache line in parallel, to avoid false sharing in your most common cases -- assuming a 64-bit architecture with a cache line size and alignment of 64 bytes (512 bits).
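As a minimal sketch of the atomic bitset idea (assuming C++11 atomics; the word layout is my own), each 64-bit word is updated with fetch_or/fetch_and so parallel loops can flip bits without formal locks:

```
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch of an atomic bitset: each 64-bit word is manipulated atomically.
// Work should still be partitioned so threads mostly touch different cache
// lines (e.g. different 64-byte / 8-word chunks) to limit false sharing.
struct AtomicBits {
    std::vector<std::atomic<std::uint64_t>> words;

    explicit AtomicBits(std::size_t count) : words((count + 63) / 64) {
        for (auto& w : words) w.store(0, std::memory_order_relaxed);
    }

    void set(std::size_t i) {
        words[i >> 6].fetch_or(std::uint64_t{1} << (i & 63),
                               std::memory_order_relaxed);
    }
    void clear(std::size_t i) {
        words[i >> 6].fetch_and(~(std::uint64_t{1} << (i & 63)),
                                std::memory_order_relaxed);
    }
};
```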
There are also intrusive solutions that aren't very generalized, relying on the specific component type's representation. As a simple example, a component type of int32_t (admittedly a contrived example few will find use for) might only store non-negative integers in the normal case; then you might use -1 to indicate that it's removed. That's usually more effort than it's worth IMO, having to x-ray your component type's representation in your ECS, unless you have a lot of extra time to specialize your class templates for specific component types and don't mind the extra coupling -- especially since real-world use cases will generally have component types more complex than a scalar number, and will also often require destructors to be called to properly free their memory.
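For what it's worth, a contrived sketch of that intrusive sentinel approach (the component type here is made up purely for illustration):

```
#include <cstdint>
#include <vector>

// Contrived sketch: the component's own representation encodes "removed"
// (-1 as a sentinel), so no parallel bitset/bool array is needed, at the cost
// of x-raying the component type.
struct ScoreComponent {
    std::int32_t value = -1;                 // -1 means "this slot is vacant"
    bool occupied() const { return value >= 0; }
};

void sum_scores(const std::vector<ScoreComponent>& scores, std::int64_t& total) {
    for (const ScoreComponent& s : scores)
        if (s.occupied())                    // per-element branch to skip holes
            total += s.value;
}
```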
Iteration Overhead
All of these solutions incur either a branching overhead when iterating over elements to check for vacant spaces, or a memory access overhead (more cache misses) when iterating through parallel arrays storing valid indices to elements, but it tends to be quite trivial in exchange for avoiding the need for an associative structure with key searches to access elements (hash table, red-black tree, trie, etc.). You end up with a trivial constant-time way to reference and access elements in your scene with stable indices.
Bitwise FFS For Component Iteration
Bitwise FFS is my favored solution, and the best-performing in our particular use case for skipping over vacant spots in arrays after exhaustively trying every solution out there I could find: it avoids the need to test for vacant spaces at a per-element level (it can check many at once, like 64+ on 64-bit architectures) and only requires ~1 bit of extra memory per entity.
Peculiarly, it's a solution I don't see used much even among really serious gamedevs, even though I've had -- by far -- the best (measured) performance with it. Maybe that's because there was no standard C or C++ FFS/FFZ intrinsic or function the last time I checked (we have to get a bit compiler- and/or platform-specific to use such instructions). That's a really glaring omission from the C and C++ standard library IMO for those of us who have to obsess over real-time applications and consistently smooth frame rates. But it's not too difficult to write a cross-platform version of FFS/FFZ functions yourself: GCC even offers it as a GCC-specific intrinsic (ffs*), you can find versions in <intrin.h> on MSVC and ICC, and likely similar on other compilers besides those three (ffz is just ffs(~bitset)). Still, I really wish we had a std::ffs, as I hate putting platform- or compiler-specific code into our codebase; it requires an approval process from my team where I have to write a proposal arguing for the merits of allowing such code in our codebase.
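Here's a rough cross-platform sketch of the sort of thing I mean (the intrinsic choices are assumptions about your toolchain; on C++20 you could reach for std::countr_zero in <bit> instead), along with a loop that skips vacant slots 64 at a time:

```
#include <cstdint>
#include <vector>

#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hedged sketch of a portable "find first set" over a 64-bit word, returning
// the zero-based index of the lowest set bit (caller guarantees word != 0).
// FFZ is just find_first_set(~word). On C++20, std::countr_zero from <bit>
// does the same job without compiler-specific code.
inline int find_first_set(std::uint64_t word) {
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long index;
    _BitScanForward64(&index, word);
    return static_cast<int>(index);
#else
    return __builtin_ctzll(word);        // GCC/Clang builtin
#endif
}

// Visit every occupied index by scanning 64 slots per word: vacant spots are
// skipped without a per-element branch. 'visit' is any callable taking an index.
template <class Visit>
void for_each_set(const std::vector<std::uint64_t>& words, Visit visit) {
    for (std::size_t w = 0; w < words.size(); ++w) {
        std::uint64_t bits = words[w];
        while (bits) {
            int bit = find_first_set(bits);
            visit(w * 64 + static_cast<std::size_t>(bit));
            bits &= bits - 1;            // clear the lowest set bit
        }
    }
}
```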
Stable Keys
In any case, referencing other entities/components from a scene wants some sort of stable key. Typically the fastest is going to be an index into a random-access structure, not a key that requires searching (not even the most optimized hash table will beat that). So in summary I suggest, if you actually need it, to stop thinking in terms like "IDs" or "keys" and more in terms of "indices" or "pointers", which offer not only constant-time access to another element in your scene, but the cheapest form of constant-time access possible. The second question then becomes how to avoid invalidating your indices/pointers. I've seen a lot of AAA devs solve this with a layer of indirection (I think there are more efficient ways, but it's a valid one, and it's used widely in serious productions, at least from what I can tell). In any case, it's always going to require a structure of some sort that has holes in it for vacant spaces at some fundamental level in memory (this includes the lowest-level memory allocators, which are required to have holes if they allocate multiple elements together in a single block to avoid invalidating existing pointers).
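To illustrate the indirection idea in code (a sketch with made-up names, not any particular engine's scheme), a handle can pack a slot index plus a generation counter, with the slot table redirecting to wherever the element currently lives:

```
#include <cstdint>
#include <vector>

// Sketch of "index plus a layer of indirection": a stable handle stores a slot
// index and a generation, and the slot table maps it to the element's current
// location. Stale handles are detected when their generation no longer matches.
// Insertion/removal and slot recycling are omitted for brevity.
struct Handle {
    std::uint32_t slot;
    std::uint32_t generation;
};

template <class T>
struct HandleTable {
    struct Slot { std::uint32_t dense_index; std::uint32_t generation; };
    std::vector<Slot> slots;       // stable: entries are reused, never shuffled
    std::vector<T> elements;       // may be reorganized freely behind the slots

    bool valid(Handle h) const {
        return h.slot < slots.size() && slots[h.slot].generation == h.generation;
    }
    T& get(Handle h) { return elements[slots[h.slot].dense_index]; }
};
```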
Getting to Gross Details
One thing to note is that when your scene data starts to resemble more of a graph than an array, with lots of links from one element to another (be they indices or pointers, as with a lot of motion parenting for FK/IK), you're typically going to incur a lot of cache misses from the sporadic access patterns that follow the graph links. It can be really handy in those cases to have a super optimized, parallelized sorting algorithm to sort things based on their links (if you can afford it somewhere in between your frame outputs) so that the memory access patterns become more orthodox and cache-friendly, with improved spatial and temporal locality. Having a super fast sorting algorithm available at your fingertips can help a lot to keep you from rolling a fancier solution which ends up performing worse. I spent a few days coming up with a multithreaded SIMD radix sort drawing from the best papers and I haven't looked back since -- it allowed me to delete boatloads of code I had in my ECS from trying to get way fancier about reducing cache misses, by just sorting everything periodically with an algorithm so fast that it never causes frame hiccups. It can sort over 100 million elements on a dinky i3 in less than 60ms, and at least in my use cases I'm dealing with way fewer than 100 million elements, so the actual time to sort is often less than a millisecond and affordable to do periodically without any noticeable frame rate drops, keeping us below that desirable ~16.67ms for 60+ FPS.
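As a purely illustrative sketch of that periodic reordering (std::sort standing in here for the kind of parallel radix sort described above, and the data layout is hypothetical), you can sort transforms by their parent link and patch the stored indices afterward:

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch only: periodically reorder transforms so that elements
// linked to the same parent sit near each other in memory, turning
// pointer-chasing FK/IK traversals into mostly forward, cache-friendly sweeps.
struct Transform {
    std::uint32_t parent;                 // index of parent transform, or ~0u for roots
    float local[16];                      // local matrix payload (placeholder)
};

void reorder_by_parent(std::vector<Transform>& transforms,
                       std::vector<std::uint32_t>& remap /* old index -> new index */) {
    std::vector<std::uint32_t> order(transforms.size());
    for (std::uint32_t i = 0; i < order.size(); ++i) order[i] = i;

    // Sort indices by parent so linked elements end up near each other.
    std::sort(order.begin(), order.end(),
              [&](std::uint32_t a, std::uint32_t b) {
                  return transforms[a].parent < transforms[b].parent;
              });

    // Rebuild the array in the new order and record where each element went,
    // so any stored indices elsewhere can be patched through 'remap'.
    std::vector<Transform> sorted(transforms.size());
    remap.assign(transforms.size(), 0);
    for (std::uint32_t new_i = 0; new_i < order.size(); ++new_i) {
        sorted[new_i] = transforms[order[new_i]];
        remap[order[new_i]] = new_i;
    }
    // Patch parent links to the new layout.
    for (Transform& t : sorted)
        if (t.parent != ~0u) t.parent = remap[t.parent];
    transforms.swap(sorted);
}
```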