Compiling to some bytecode is an old tradition. UCSD P-code existed in 1978 and had many precursors. Today, LLVM can be seen as a bytecode targeted by the Clang/LLVM ahead-of-time compiler suite, and GCCJIT can be viewed as a JIT related to GCC (with GIMPLE being sort of an internal bytecode).
(Hence "bytecode", "JIT", and similar terms have quite fuzzy meanings today; in its broadest sense, JIT means compilation inside the process that runs the compiled code.)
The JVM bytecode was initially implemented with an interpreter. But Java became popular enough to get JIT-based JVMs (and Sun invested a lot in JIT technology, which helped Java become successful).
And JIT compilation existed a long time ago (in the early 1980s, e.g. in Lisp machines, and even in 1960 on the CAB 500 computer and others), before the name was even used. Many Common Lisp or Smalltalk implementations had JIT compilers (and today, SBCL is fully JIT-ing).
In my understanding, Microsoft designed the CLR bytecode to be JIT compiled (hence it makes different tradeoffs in its bytecode than the JVM does). Microsoft has also recently published its implementation as open-source software and ported it to Linux (before that, Mono existed on Linux).
A bytecode is often more compact than native binary executables. It can be made portable to several architectures (e.g. 32-bit x86, x86-64, 32-bit ARM, ARM/AArch64, ...) and might be designed to avoid (or at least soften) dependency hell.
A big advantage of JIT compilation is that the VM can recompile some parts of the bytecode based upon dynamic contextual information (e.g. profiling, call stack introspection, ...). Some JIT-ing infrastructures like libjit, asmjit, LLVM, GCCJIT, ... don't do that by themselves (however, an implementation using them could do it by repeated use of the JIT-ing infrastructure), but most industrial JVM or CLR implementations do it (and some people call JIT only that lazy on-demand dynamic compilation; for me JIT is just a buzzword for dynamic compilation at runtime). Such recompilation is difficult or impossible with AOT compilation (at the least, it requires LTO), and it is impossible if you want profile-guided optimization to happen dynamically at runtime (as most JVM or CLR JIT implementations are rumored to do). Also, a bytecode VM doesn't need to JIT-compile all the bytecode, but only the most used parts (as HotSpot does), and can keep interpreting the rarely used cold code.
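To make the "interpret the cold code, compile only the hot parts" idea concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and invented for illustration: the four-opcode stack bytecode, the names `interpret`, `jit_compile` and `TieredVM`, and the threshold of 3 calls. It is certainly not how HotSpot or the CLR work internally. Note also that the "compiled" form here is Python source handed to the built-in `compile` and `exec`, so it ends up as CPython code objects rather than machine code; it only illustrates compilation happening inside the running process, driven by a call counter.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy stack bytecode: each instruction is (opcode, operand).
Instr = Tuple[str, int]


def interpret(code: List[Instr], arg: int) -> int:
    """Slow path: decode and dispatch every instruction on every call."""
    stack: List[int] = []
    for op, operand in code:
        if op == "ARG":
            stack.append(arg)
        elif op == "PUSH":
            stack.append(operand)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()


def jit_compile(code: List[Instr]) -> Callable[[int], int]:
    """Translate the bytecode once into Python source, then compile it at runtime."""
    exprs: List[str] = []  # symbolic stack holding expression strings
    for op, operand in code:
        if op == "ARG":
            exprs.append("arg")
        elif op == "PUSH":
            exprs.append(repr(operand))
        elif op in ("ADD", "MUL"):
            b, a = exprs.pop(), exprs.pop()
            sign = "+" if op == "ADD" else "*"
            exprs.append(f"({a} {sign} {b})")
        else:
            raise ValueError(f"unknown opcode {op!r}")
    source = f"def compiled(arg):\n    return {exprs.pop()}\n"
    namespace: Dict[str, object] = {}
    exec(compile(source, "<jit>", "exec"), namespace)
    return namespace["compiled"]  # type: ignore[return-value]


class TieredVM:
    """Interprets functions until they become hot, then switches to compiled code."""

    HOT_THRESHOLD = 3  # arbitrary value chosen for this sketch

    def __init__(self) -> None:
        self.calls: Dict[str, int] = {}
        self.compiled: Dict[str, Callable[[int], int]] = {}

    def run(self, name: str, code: List[Instr], arg: int) -> int:
        fast = self.compiled.get(name)
        if fast is not None:
            return fast(arg)  # hot path: no more bytecode dispatch
        self.calls[name] = self.calls.get(name, 0) + 1
        if self.calls[name] >= self.HOT_THRESHOLD:
            self.compiled[name] = jit_compile(code)  # used from the next call on
        return interpret(code, arg)


# Bytecode for f(x) = x * 3 + 1
program = [("ARG", 0), ("PUSH", 3), ("MUL", 0), ("PUSH", 1), ("ADD", 0)]
vm = TieredVM()
results = [vm.run("f", program, x) for x in range(5)]
```

In this sketch the first three calls of `f` go through the interpreter and the later ones use the compiled function. A real VM would in addition use the profile it gathered (observed types, branch frequencies, ...) to specialize the generated code, and would be able to throw it away and recompile when its assumptions stop holding.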
Also, JIT implementations can cooperate much more closely (and better) with sophisticated garbage collectors.
PS. I know nothing about Windows; I have never used it. I have been using Linux since 1994 and Unix since 1987.