34

I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it? If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

Sorry for the rookie question, I am not a software engineer and this is pure out of my curiosity

Tom Newton
  • 27
  • 4
PoJam
  • 475
  • 4
  • 6
  • 19
    In modern OSs, applications are isolated. It is usually not allowed for an application to touch other app's memory space. And I don't see why GC would want to touch other app's memory space. – Euphoric Aug 15 '22 at 19:24
  • 3
    This presentation is old and Java-specific but I recommend it if you are serious about understanding GC: https://www.infoq.com/presentations/Understanding-Java-Garbage-Collection/ – JimmyJames Aug 15 '22 at 20:31
  • 4
    This might be implementation specific and e.g. differ in Java and C#. Also, "entire memory" is a vague term, because it's not clear whether you understood the difference between physical RAM and virtual memory. Also, virtual memory will still not be scanned entirely: the C# GC will not scan the "native" memory like C++ allocated memory. – Thomas Weller Aug 16 '22 at 17:20
  • What heap? If you mean the GCs own internal heap then yes, it scans the whole heap where it allocated objects in, eventually. If you mean what the C runtime calls heap then no. The GC scans only those parts of memory that it manages. – Goswin von Brederlow Aug 17 '22 at 13:48

7 Answers7

60

I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it?

That depends on the garbage collector. There are many different kinds of garbage collectors.

For example, Reference Counting Garbage Collectors don't "scan" anything at all! In a Reference Counting Garbage Collector, the system counts references to objects, something like this:

SomeObject foo = new SomeObject();

Let's say, this new object was allocated at memory address 42. The GC records "there is 1 reference to the object at address 42".

SomeObject bar = foo;

Now, the GC records "there are 2 references to the object at address 42".

foo = null;

Now, the GC records "there is 1 reference to the object at address 42".

bar = null;

Now, the GC says "there are 0 references to the object at address 42, therefore, I can collect it".

At no point did the GC "scan" anything.

What you are probably thinking about is an extremely simplistic implementation of a so-called "Tracing Garbage Collector", namely the Mark-and-Sweep GC.

Any Tracing GC starts off with a set of objects that they know are always reachable. This is called the root set. The root set typically includes all global variables, the local variables, CPU registers, the stack, and some other stuff. For all of these objects, the GC looks at the instance variables and checks the objects that the instance variables point to. Then it checks those objects' instance variables, and so on and so forth.

This way, the GC "sees" all "live" objects, i.e. the objects that are reachable from the root set. What the GC does with those "live" objects depends on the kind of GC.

As I mentioned above, what you are thinking of is the most simplistic kind of Tracing GC, which is the Mark-and-Sweep GC. During the tracing phase I described above, the GC will "mark" all live objects by either setting a flag directly in the object header itself, or in a separate data structure.

Then, the GC will indeed "scan" the entire memory and find all objects and do one of two things:

  • If the object is marked, remove the mark.
  • If the object is unmarked, de-allocate the object.

After this "sweep" phase, you end up with all unreachable objects destroyed and all live objects unmarked, ready for the next "mark" phase.

But, as I mentioned, this is only one of many different kinds of Tracing GCs, and is a very simple one with many disadvantages. The two major disadvantages are that scanning the entire memory is expensive and leaving the live objects where they are and only collecting the dead objects in between leads to memory fragmentation.

Another very simple but much faster Tracing GC is Henry Baker's Semi-Space GC. The Semi-Space GC "wastes" half of the available memory, but gains a lot of performance for it. The way the Semi-Space GC works is that it divides the available memory into two halves, let's call them A and B. Only one of the two halves is active at any one time, meaning new objects only get allocated in one of the two halves.

We start out with half A: The GC "traces" the "live" objects just as described above, but instead of "marking" them, it copies them to half B. That way, once the "tracing" phase is done, there is no need to scan the entire memory anymore. We know that all live objects are in half B, so we can simply forget about half A completely. From now on, all new objects are allocated in half B. Until the next garbage collection cycle, when all live objects are copied to half A, and we forget about half B.

These are just two examples of Tracing GCs, and only one of those two scans the entire memory.

If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

This is simply impossible. No modern Operating System allows a process to read another process's memory. (And when I say "modern", I mean "since the 1960s or so".)

But even if it were possible, it would not make sense. If the memory belongs to another process, then the GC has no idea what the objects that are in this memory even look like. But it needs to know what the objects look like in order to find all the instance variables and to know how to interpret those references. If it uses an internal marker flag inside the object itself, it also needs to know how to find that marker flag and how to set it. And that is assuming that the marker flag is even there at all! What happens if the application that owns that memory doesn't use marker flags?

Or, worse: what happens if the application that owns that memory does use a GC which uses marker flags. Now, the one GC is overwriting the other GC's markers!

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

There are two popular approaches.

The first approach is that there is flag in the object header of each object reserved for marking. During the "mark" phase, the GC sets this flag. The major advantage of this approach is that there is no separate bookkeeping involved and it is thus very simple: the mark is right there on the object itself. The major disadvantage is that objects are scattered all through the memory, and thus during the marking phase, the GC writes all over the entire memory. This means that there are "dirty" pages all over memory, in a multiprocessor system (which almost all systems are nowadays) this means that we have to notify the other CPU cores that we have modified some memory, we have polluted the cache with tons of writes that we will never need again, and so on.

The alternative is to keep a separate data structure where we keep a table of all marked objects. This has the disadvantage of more bookkeeping (we need to keep a relationship between the mark table and the objects) but it has the major advantage that we are only writing to one place in memory, which means we can keep this one piece of data in the cache all the time.

But again, not all GCs even have a concept of "marking" at all.

Jörg W Mittag
  • 101,921
  • 24
  • 218
  • 318
  • 6
    "Then, the GC will indeed "scan" the entire memory and find all objects." This is not really true or necessary. There's nothing to be done about 'dead' objects unless there was some sort of hook to notify about it's 'death'. Also, I consider reference counting to be much more simplistic than mark-sweep. I grant that it can be more performant but it's unable to guarantee all 'garbage' is freed. – JimmyJames Aug 15 '22 at 20:22
  • 5
    The "Mark-and-Sweep GC" also has one advantage over simple reference counting: It removes objects with circular references, that are otherwise not referenced anywhere. Imagine a parent object having a reference to a child object, and vice versa. If both parent and child lose all *external* references, a simple ref counter GC would still mark them alive, since they reference each other. A Mark-and-Sweep GC would find no path from the root to the objects, and would delete them. (Note that weakrefs would solve this) – MechMK1 Aug 16 '22 at 11:24
  • 5
    Of course operating systems allow reading the memory of other processes. Not as a matter of course, but for instance the win32 debugging functions let you do just that. I think what you mean is there's no way for this to happen by accident. – Cubic Aug 16 '22 at 12:08
  • Theres a varient of the semi-space GC that is significantly more efficient in systems with MMUs: Rather than dividing the physical memory in half, you allocate one block of memory which you map at two separate blocks of address space. This wastes address space, but not physical memory. And address space is a lot cheaper than memory. Moving between the two spaces is then performed by updating the pointer and setting a flag on the object that the object is in the other space. – user1937198 Aug 16 '22 at 13:29
  • 2
    @MechMK1 PHP (or, more accurately, the Zend Engine) gets around that by _supplementing_ reference counting with a variant mark-and-sweep: it keeps a list of everything that has had its reference count _reduced_, and periodically sweeps each tree in that list to see if all its inbound references are circular. – IMSoP Aug 16 '22 at 15:23
  • A reference-counting system is not a garbage collector. Or at the very least, I'm not aware of any garbage collector that uses reference counting. A garbage collector's defining characteristic is being imprecise and uncoupled in the temporal dimension: it will collect things (or not) on its own schedule regardless of when they become eligible for collection. A reference counter, on the other hand, is temporally precise and coupled; it is guaranteed to free memory immediately as a part of the process of the reference falling to zero. – Mason Wheeler Aug 16 '22 at 16:17
  • There have been experimental operating systems which ran all processes in the same address space in a managed language. In these operating systems, presumably, the garbage collector was system-wide. – user253751 Aug 16 '22 at 19:11
  • 2
    @user253751: In Microsoft's Singularity OS, each SIP (Software-Isolated Process) had its own instance of the GC (provided by the OS). But there is no reason that has to be the case. E.g. in the original Smalltalk system (which is arguably an OS, or at least plays the role of what we would normally consider an OS – even though one of the design principles behind Smalltalk is "An operating system is a collection of things that don't fit into a language. There shouldn't be one.") there is only one GC for the entire system. – Jörg W Mittag Aug 16 '22 at 19:30
  • 3
    @MasonWheeler My understanding is that the standard CPython runtime uses reference counting primarily with a generational collector as a backup to catch cycles. The combination is pretty pragmatic and performant. Cycles are relatively rare in most applications. – JimmyJames Aug 16 '22 at 20:22
  • Do you know of any variants of semi-space that would use about 1/4 of the memory as dead space, rather than 1/2? If e.g. the address space ranges from 0x0000 to 0xFFFF, and is fills downward from 0xFFFF to 0x4000, before triggering a GC, start by copying all live objects in the 0x4000-0x7FFF range to the bottom of memory; if that fills 0x3000 bytes, then copy all live objects from 0x8000-0xDFFF to the range 0x3000-0x7FFF. Then copy live objects from 0xD000-0xFFFF to the top of the last batch of objects, and then bulk move everything up to the top of RAM. – supercat Aug 16 '22 at 22:16
  • Such an approach would require more steps than a semi-space GC, but if e.g. the live objects in a program total 40% of the memory space, a semi-space GC would need to run every time once every time allocations total 10% of the memory; using an approach like I described would reduce that by a factor of 3.5, and could also be readily adaptable to support generational collections. – supercat Aug 16 '22 at 22:19
  • 1
    @JimmyJames "There's nothing to be done about 'dead' objects unless there was some sort of hook to notify about it's 'death'." – you still have to know where you can allocate new objects (and where not to). This can be done by having a list of free memory blocks (like the allocators in C are doing), or by somehow knowing which memory areas are completely free, but both needs some way of updating this knowledge during/after a sweep. – Paŭlo Ebermann Aug 17 '22 at 00:46
  • @supercat the issue with moving objects is that you somehow have to update all the references to these objects. This will require additional scans through the objects you are not moving currently (or a mapping table, which will also grow large). I think semi-space somehow manages to avoid this by just leaving the old object there as forwarders until the next collection. – Paŭlo Ebermann Aug 17 '22 at 00:52
  • Perhaps reverse the order of the two items in the list (if marked then unmark; if unmarked then deallocate). – Ray Butterworth Aug 17 '22 at 03:43
  • 3
    @MasonWheeler it may be down to the specific definition you use for GC, but [Python](https://devguide.python.org/internals/garbage-collector/index.html) has this to say: _"The main garbage collection algorithm used by CPython is reference counting."_ so they consider it a GC. And you can disable the automated cycle counter and run it on demand, which will make it completely temporally deterministic. – Davidmh Aug 17 '22 at 11:11
  • @RayButterworth It's not a list but two branches of a single conditional: `if marked then unmark else deallocate` – TooTea Aug 17 '22 at 11:43
  • @PaŭloEbermann You only need to know where the live objects are. Everywhere else in an allocation space is available for allocation. Which dead objects were in that space and where is irrelevant to whether it's free. – JimmyJames Aug 17 '22 at 15:05
  • @PaŭloEbermann: If one reserves a couple of bits in the first byte/word of an object, and objects are always allocated enough space for that plus an address, then code which examines a reference will be able to either (1) see that the object hasn't been relocated yet, but needs relocation, in which case the reference used to find it can be corrected to the new location; (2) see that the object has been relocated, and where it is now, and update the address or (3) see that the object doesn't need relocation, mark it as tentatively having been explored, and then explore child objects. If... – supercat Aug 17 '22 at 15:06
  • ...one has enough backtracking storage to finish an object, set a second bit indicating it has been fully explored. If backtracking storage overflows, set a flag indicating that it will be necessary to revisit all objects which haven't been fully explored and clear the "tentatively explored" flags, after which objects will need to be rescanned. – supercat Aug 17 '22 at 15:11
  • @JimmyJames then your allocator needs a list of all live objects. I guess your "marks" might count for this, if you have them somewhere easily accessible. – Paŭlo Ebermann Aug 17 '22 at 15:20
  • @PaŭloEbermann I'm not sure if you are asking a question. My answer talks about how mark-sweep-compact works and touches on copy collectors as well. – JimmyJames Aug 17 '22 at 16:13
  • @PaŭloEbermann: If one has a means, given a reference to an object, of identifying all of the objects referenced thereby, then one only needs a list of objects--called "rooted objects", which might not be referenced by any other live objects. Otherwise, the existence of a reference to an object within some other live object will make the referenced object live, whether or not it appears in any other object or list of objects. – supercat Aug 17 '22 at 22:24
  • @supercat this is how I identify live objects during a traversal phase, not during allocation (otherwise you'd have a very slow allocator). As I understood the original "mark and sweep", no object is moved, just space of dead objects is made available again for allocation, and my comments were mentioning that in this case, we still need to "scan the entire heap" (or a section thereof) after the live objects are marked, to give the space for dead objects back to the allocator. Of course it gets easier when live objects are moved away. – Paŭlo Ebermann Aug 17 '22 at 22:42
  • @PaŭloEbermann: The garbage collectors in frameworks like .NET and JVM move objects to consolidate free space, so the allocator won't need to care about the whereabouts of individual live objects. Having a list of all the objects in each 4K block of storage may enhance the usefulness of a "card table" to minimize the time spent scanning older objects, but the optimal approach will vary depending upon the host OS. – supercat Aug 17 '22 at 22:47
  • @supercat I'm aware that this is what is happening in current GC algorithms, but this was not the "super-simplistic mark-and-sweep" described in this answer – there nothing is moved, just the memory previously owned by dead objects is released back to the allocator (which causes fragmentation). And my comment was meant to correct JimmyJames' first comment about this. I guess we are all misunderstanding each other here, but it's not that important, so I'll stop discussing here. – Paŭlo Ebermann Aug 17 '22 at 22:54
  • @PaŭloEbermann: My first exposure to the concept of garbage collection was a compactifying collector that was even simpler than the "super-simplistic mark and sweep". Every reference to a string held the length and the address of its first byte, and strings were stored in memory with no other overhead, and the total amount of storage required for string management was probably about a dozen bytes. Garbage collection performance was, well, garbage, but allocations required nothing more than a pointer subtraction and a pointer comparison. – supercat Aug 17 '22 at 23:02
  • 2
    @JörgWMittag "No modern Operating System allows a process to read another process's memory. (And when I say "modern", I mean "since the 1960s or so".)" - you're discounting practically every 8-bit microcomputer Kernal OS. – Eight-Bit Guru Aug 18 '22 at 08:31
  • @Eight-BitGuru: Indeed, a better statement would be "No modern *multi-user* operating system". In the first multi-user time-share systems, different processes wouldn't be in memory at once, but parts of the operating system would be in memory with a user program. A user program that wanted to tamper with another task's storage could alter the operating system's "switch program" routine so that between loading a task and resuming its execution, it would check whether a particular task was running and--if so--modify some of its storage. – supercat Aug 18 '22 at 15:56
  • @Eight-BitGuru: I would guess that process isolation predated virtual memory, since an implementation could guard against the trickery described above by simply adding a latch to select whether the portion of memory used by the OS was writable, allowing the OS to set that latch, and clearing it when a certain piece of OS code is executed. Depending upon system design, it might be possible to implement such functionality with somewhere between 10 and 100 discrete transistors. – supercat Aug 18 '22 at 16:01
14

There are a lot of details in garbage collection. Lots lots lots lots lots. Each one of them has different behaviors, so it's hard to cover everything.

Garbage collectors almost universally scan the memory within a single process. This is for two natural reasons. One is that garbage collection is always closely tied to how the memory is laid out, and thus language-specific. The second is that the OS won't let you read someone else's memory anyway (without special steps).

The most naïve solutions for garbage collection do indeed look at everything in the process's memory, but that's obviously slow. Faster solutions do exist. The most common is a "generational" garbage collector. It has been observed that most objects have a very short lifespan, so putting young objects in a "nursery" lets you quickly reclaim memory as efficiently as possible. With a generational garbage collector, one only scans all of memory when one has to (when doing the shorter cheaper scan isn't sufficient).

There are also lots of clever tricks to avoid scanning extra memory. One of them is "card marking." Whenever a reference is updated, the language sets a bit on a "card," which is just an array in memory. During a garbage collection, only cards that have been marked could possibly have changed what they were referring to, so you can avoid scanning all of the data managed by the unmarked cards.

But don't think those are the only approaches. We've come up with countless different ways to do garbage collection, each with their own pros and cons.

Toby Speight
  • 550
  • 3
  • 14
Cort Ammon
  • 10,840
  • 3
  • 23
  • 32
10

No garbage collector scans the entire memory. Because, when reading some random memory content, it is not possible to guess if it's used or not. So garbage collection goes through objects or list of objects that were previously allocated and created to see if they are still used.

Most of the GC algorithms belong to the more general family of marking algorithms: very naively, objects referenced in variables that are still alive are marked as used. The marking is then propagated to all the related objects, and so on until there is no new object to be marked. The unmarked objects can then be discarded. The marking usually takes the form of a flag or a counter in some technical data that describes each object. The marking and cleaning can be done at once, or incrementally, or at every referencing/dereferencing.

But this is very simplified, and a book would not be sufficient to describe all the existing object management strategies and the garbage collection algorithms (not to speak of the more than 700 patents on the topic, e.g. this one). But here already a compact overview.

Christophe
  • 74,672
  • 10
  • 115
  • 187
9

Top-level, GC is specific to the application. It doesn't really make much sense for a Python application to be scanning a Java application's memory space, even if it was allowed to do so. Also, not all memory in a system is 'heap'.

Broadly, while there are many different approaches and strategies to garbage collection, we can generalize. Without getting into the weeds of GC algorithms, the idea that it's scanning even the entire memory space that it's allocated is a little off. In a mark-sweep-compact, the approach is to start from 'roots' that are inherent to the application and look at everything that they point to and then everything those point to, and everything those point to, etc. until all 'live' objects are found. That's the 'marking' phase. Everything that's not marked is considered garbage. The live objects are then moved around in a way similar to defragmenting a disk drive and the remaining heap is then considered free space.

In a way, I think the term garbage-collection is a bit of a misnomer. The point of GC is really about keeping track of what is still in use, not really finding garbage, or 'collecting' it. A good example of this is the copy-collector approach. In that design, you keep 2 (or more) spaces where new objects (structures, whatever) are allocated. When on of the spaces fills up, the live objects are identified (see paragraph above) and copied to the other 'side' which becomes the new area for allocation. Nothing specific is done with the 'garbage'; it is simply forgotten and the memory that it occupied is used for new allocations. There are some minor caveats to that but at a high level, there's really no 'collecting' of unused memory, it's just being tracked as available for use.

JimmyJames
  • 24,682
  • 2
  • 50
  • 92
0

Your program, including the garbage collector, will not be able to access memory belonging to other processes unless you have a totally ancient OS. Plus the garbage collector has no reason at all to look at the memory of other processes.

In a primitive garbage collector, every chunk of memory is identified as “free”, “examined” or “used”. The garbage collector marks everything that isn’t “free” as “examined”, then systematically turns memory that can be accessed from “examined” to “used”, and when there is nothing left to mark “used”, everything “examined” is marked “free”.

Now a garbage collector is used a lot for example by any Java program, so it is a prime target for optimisation, and to use any trick in the book to make it run faster. Not visiting all memory if avoidable would help, so a good garbage collector will try to avoid it.

gnasher729
  • 42,090
  • 4
  • 59
  • 119
0

The Garbage collector needs to know which part of memory is a pointer (and how it is encoded), and what spans of the address space are effectively allocated. Those informations are known only to the application or its virtual machine.

Therefore, a GC can't scan the whole memory. If the GC is VM-based, it could effectively span multiple applications, provided they share a common address space. For security reason, it is rarely the case.

As for marking, there are multiple ways to achieve that (bitmap, position-size table, etc). It depends on the GC implementation and the memory-manager of the language or VM.

0

A garbage collector only need to scan the memory which have been actively allocated by the program. It does not need to scan unused memory or memory used by other processes on the system.

When heap memory is allocated, e.g by malloc in C or new Foo() in Java, the language runtime keeps track of what memory areas has been allocated. In its simplest form, it just keeps a list of the memory regions with start and end addresses. (Ideally there is only a single region which grows when new memory is allocated, but since memory can be freed in the middle of a region, you end up with multiple regions over time)

A mark-and-sweep garbage collector scans the stack for pointers and then recursively follows these pointers and mark all objects it meets. Then it scans all objects in the allocated memory, and if some allocated objects have not been marked, the destructor is invoked and the memory is freed.

So even a naive mark-and-sweep garbage collector only need to scan the memory which have been actively allocated by the program at some point.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
  • "Then it scans the allocated memory, and if some allocated objects have not been marked, the memory is freed." What is the purpose of 'scanning' for dead objects here? There are two scenarios for free memory: it was never used or it isn't being used anymore. In either case, it's free. Once you know where the live allocations are, the free space is simply the complement of that. – JimmyJames Aug 17 '22 at 18:34
  • @JimmyJames: Depending on the language, the GC might need to call destructors on the objects before freeing the memory. – JacquesB Aug 17 '22 at 18:41
  • 1
    Sure but that's a extremely naive and inefficient approach unless you have a very small allocation space. If you have one object with a destructor (or some other hook on it's collection), looking through the entire allocation space for them is like the GC equivalent of bubblesort. It's far more efficient to keep list of objects or just objects with these 'hooks' and again, once you have the set of live object, the dead ones are simply the complement of that. – JimmyJames Aug 17 '22 at 18:51
  • @JimmyJames Yes, what I describe is a naïve mark-and-sweep implementation. GC is a huge area with numerous different algorithms and optimizations available. – JacquesB Aug 18 '22 at 05:59