Well, of course the first loop is slower. Without even getting that deep into page faults and how the OS pages in memory on access, the loop stride is 8200 bytes (assuming 32 bits of padding) just to get from one n field to the next.
The machine typically fetches all the surrounding field data in ind_vec when you touch that one field: it pages in memory (say 4-kilobyte pages), fills cache lines from it (say 64 bytes each), and finally moves a value into a register (say 64 bits), only for your loop to read the n field (say 4 bytes). Of all that work, just 4 of the 64 bytes in the cache line and 4 of the 4096 bytes in the page actually get used. You're making the machine waste enormous effort moving memory that is irrelevant to your loop (only 4 of the 8200 bytes of your ind_vec struct are accessed each iteration) down the hierarchy. That obliterates spatial locality.
Meanwhile, with your second loop, you generally use all of the data loaded into a page and into a cache line before it gets evicted. The second loop walks a densely packed array of integers, so the hardware and OS aren't forced to move memory down the hierarchy only for a minuscule fraction of it to be processed before eviction.
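To make the two cases concrete, here is a minimal sketch of the access patterns being compared. The struct layout and array sizes are my assumptions for illustration (sized so the stride lands in the ~8 KB range discussed above), not your exact code:

#include <vector>

// Assumed layout for illustration: one small hot field (n) buried next to
// large cold arrays, so consecutive n fields end up roughly 8 KB apart.
struct ind_vec {
    int n;
    float ind_1[1024];
    float ind_2[1024];
};

int main() {
    std::vector<ind_vec> structs(100); // first case: array of big structs
    std::vector<int> ints(100);        // second case: densely packed ints

    // First loop: each iteration jumps sizeof(ind_vec) bytes forward, so every
    // n it touches sits on a different cache line (and usually a different page).
    for (int i = 0; i < 100; i++)
        structs[i].n = 1;

    // Second loop: consecutive 4-byte ints, so every byte pulled into a cache
    // line gets used before the line is evicted.
    for (int i = 0; i < 100; i++)
        ints[i] = 1;
}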
Memory Hierarchy
The hardware has a memory hierarchy ranging from the smallest and fastest storage (registers) to the largest and slowest (DRAM, and disk beyond it): the smaller the level, the faster it is.
To avoid touching the big but slow memory too often, the machine loads data from the slow regions in contiguous chunks (e.g., 4 kilobytes for a page, 64 bytes for a cache line); it grabs memory from the slow levels by the handful, so to speak. It does this whenever you access a specific address whose surroundings aren't already paged in or sitting in a cache line. To get the maximum benefit from that process, write your code so that when the hardware and OS grab a big contiguous chunk from slow memory and move it down the hierarchy, you don't waste that expensive work by touching only a few bytes of the chunk before jumping off to a distant, unrelated address (which is what your first loop does).
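To put rough numbers on that "by the handful" behavior, here is a small sketch that just counts how many distinct cache lines and pages each loop touches. It reuses the assumed struct layout from the sketch above, and the 64-byte line and 4 KB page sizes are typical values, not queried from the hardware:

#include <cstddef>
#include <cstdio>

// Same assumed layout as the sketch above: sizeof(ind_vec) is roughly 8 KB.
struct ind_vec {
    int n;
    float ind_1[1024];
    float ind_2[1024];
};

int main() {
    const std::size_t iterations = 100;
    const std::size_t line_size  = 64;   // typical cache line size
    const std::size_t page_size  = 4096; // typical page size
    const std::size_t bytes_used = iterations * sizeof(int);

    std::printf("stride between n fields: %zu bytes\n", sizeof(ind_vec));

    // First loop: the n fields are further apart than both a cache line and a
    // page, so every iteration pulls in a fresh line and a fresh page just to
    // read 4 bytes of each.
    std::printf("struct loop: ~%zu cache lines, ~%zu pages touched to use %zu bytes\n",
                iterations, iterations, bytes_used);

    // Second loop: 100 consecutive ints are only 400 bytes, so they share a
    // handful of cache lines and fit within a single page.
    std::printf("int loop:    ~%zu cache lines, ~%zu pages touched to use %zu bytes\n",
                (bytes_used + line_size - 1) / line_size,
                (bytes_used + page_size - 1) / page_size,
                bytes_used);
}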
M&M Analogy

Your original code with the first loop makes the hardware jump through hoops only to waste most of the effort. It's like digging into a bowl of M&Ms with a giant spoon, picking out and eating only the green M&Ms, and tossing the rest of the spoonful aside. That's very inefficient compared to eating every color in the spoonful at once, or to having a bowl of only green M&Ms, the ones you are actually going to eat, so that each spoonful is entirely useful.
Meanwhile, the second loop is like eating from a bowl of only green M&Ms: you can eat entire spoonfuls at once, since you're accessing all the data in the array rather than skipping massive sections with an epic 8 kilobyte stride.
Hot/Cold Field Splitting
Q2) Is there any workaround here?
The most direct answer to the conceptual problem in your example case is hot/cold field splitting. Don't put a boatload of data you don't access in critical paths in the same structure or class as data you access frequently. Store that on the side in parallel, like so:
#include <vector>
using std::vector;

struct ind_vec_cold {
    // Cold fields not accessed frequently.
    float ind_1[1024];
    float ind_2[1024];
};
struct ind_vec_hot {
    // Hot fields accessed frequently.
    int n;
};
struct ind_vec {
    ind_vec(int n): hot(n), cold(n) {}
    vector<ind_vec_hot> hot;
    vector<ind_vec_cold> cold;
};

ind_vec data(NT);
for (int i = 0; i < 100; i++) { // Not slow anymore!
    data.hot[i].n = 1;
}
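When the cold fields do need to be read somewhere else, they are still available at the same index in the parallel vector (the field names here just follow the assumed layout from the sketches above):

// Cold path, outside the hot loop: same index, different vector.
float f = data.cold[0].ind_1[0];

The one thing to keep in mind is that hot and cold have to stay the same length and be reordered together so the indices remain in sync.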

Using the M&M analogy above, this hot/cold field splitting effectively gives us a bowl of only green M&Ms, which we can now consume quickly by the spoonful; the other colors are stored in a separate bowl. Previously you had a mixed bowl and were making the machine grab M&Ms by the spoonful only to pick out the few green ones and toss the rest aside.