I think the only realistic way to solve this without exotic (and expensive) components and subsystems is with cameras and machine vision. Still, that won't be easy: the image processing is not trivial, it will require careful calibration, and 10 mm is still a bit too aggressive.
There are systems that work on this principle. One that comes to mind is motion capture of live actors. The actors usually wear special suits with reflective, and sometimes emissive, patches at key points like the joints. Each camera captures the 2D angle to every patch it can see in each frame, and some pretty fancy software on powerful processors pieces all these angles together to find the 3D locations of the patches.
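To make the "piecing angles together" step concrete, here is a minimal sketch of the geometric core. It is not how any particular motion capture product works. It assumes calibration has already turned each camera's 2D angle into a ray (camera position plus direction) in a shared world frame, and it finds the point closest to all the rays in the least-squares sense. The names `triangulate` and `solve3` are made up for this illustration.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are parallel; no unique intersection")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def triangulate(rays):
    """Least-squares intersection of rays given as (origin, direction) pairs.

    Assumption: origins and directions are already expressed in one
    common world frame, which is what camera calibration provides.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for origin, direction in rays:
        norm = math.sqrt(sum(c * c for c in direction))
        d = [c / norm for c in direction]
        # Each ray contributes the projector (I - d d^T), which measures
        # the offset of a point perpendicular to that ray.
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * origin[j]
    return solve3(A, b)

# Two hypothetical cameras 10 m apart, both looking at a marker at (5, 5, 2).
marker = triangulate([
    ((0.0, 0.0, 0.0), (5.0, 5.0, 2.0)),
    ((10.0, 0.0, 0.0), (-5.0, 5.0, 2.0)),
])
```

With perfect rays the result is the marker position itself. With real, noisy angles the rays no longer meet, and the answer is only as good as the angular measurements and the calibration behind them.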
Again, I think 10 mm accuracy within a 100 m cube is out of reach, but perhaps the accuracy you can actually get will still be good enough.
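A rough back-of-the-envelope check shows why 10 mm is so aggressive. The sketch below assumes a 100 m working distance, a 60° field of view (a number picked only for illustration), and that a camera resolves about one pixel of angle. It ignores sub-pixel marker detection, lens distortion, and calibration error, so treat it as an order-of-magnitude estimate only. The function name `required_pixels` is made up for this example.

```python
import math

def required_pixels(accuracy_m, range_m, fov_deg):
    """Pixels needed across the field of view so that one pixel
    subtends no more than accuracy_m at a distance of range_m."""
    # Angle subtended by the allowed position error at the given range.
    angle_rad = math.atan2(accuracy_m, range_m)
    return math.radians(fov_deg) / angle_rad

# 10 mm at 100 m is about 0.1 milliradian.
pixels = required_pixels(accuracy_m=0.010, range_m=100.0, fov_deg=60.0)
```

Under those assumptions the sensor needs on the order of 10,000 pixels across a single axis, and that is before any real-world error sources are counted.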
This method is complicated, expensive, and probably not accurate enough, but everything else will be even more complicated and most likely prohibitively expensive unless you're NASA or the military.