
I am working in the field of real-time simulation for power electronics. The simulator is based on the most recent Intel and AMD processors. The simulation consists of a loop of code executed as fast as possible, together with some I/O access to connect to real-world devices. In our current scheme, we use a custom real-time Linux OS and we shield some CPU cores to obtain maximum speed. With this OS-based approach, however, we cannot get execution cycles below 1-2 microseconds, because of some variation in the OS tasks (we believe). Access to I/O also limits the performance.

We can actually achieve sub-microsecond simulation using FPGAs, but FPGA programming is difficult, and our FPGA computing structure slowly tends to mimic a CPU ALU, so I am saying to myself: why not use the Intel ALU and benefit from 50 years of optimization!

So I am looking to get past this 1-2 µs limit by using some kind of bare-metal approach on our Intel processors. I read that this is really difficult (and not recommended) for recent Intel processors.

But I wish to insist a little, just to get started with a proof-of-concept case: for example, toggling one bit in a forever loop, with the result output to any I/O.

Could someone point to the best starting direction?

Christian
    Get an assembler, code it, and see what happens. What with cache, scoreboarding etc, you'll get speed, but no predictability or consistency. Bite the FPGA bullet. Or maybe look at the I/O programming available in the Raspberry Pi Pico, it sounds built for your job! – Neil_UK Dec 01 '22 at 16:02
  • The processor itself does not have GPIOs. What kind of I/O device are you using? – CL. Dec 01 '22 at 16:30
    PCIe packet routing takes hundreds of nanoseconds (each way), and your options for faster buses are quite limited, so if your software is getting with a factor of perhaps 2-4 of the intrinsic latency of the buses, you are doing quite well. If you need faster, you need more specialized hardware. – user1850479 Dec 01 '22 at 17:01
  • If you really want to stay within a PC platform whilst minimizing latency, you could put an FPGA on a DIMM and use that as a memory mapped GPIO driver, but that's likely more work than moving the compute to the FPGA. – user1937198 Dec 01 '22 at 18:47
    A common mistake in HDL design for FPGAs is to see the HDL as a programming language and to try and write a computer program for the FPGA. Designs evolve with state machines trying to be CPUs and lots of work getting done one thing at a time. Has your existing design gone into any of those traps? Normally, HDL design would try to increase throughput and performance by doing tasks in parallel or by pipelining it and it depends if the work lends itself to that. It also does require more design planning to get the fullest performance out of the FPGA. – TonyM Dec 01 '22 at 20:41
    Have you considered an FPGA with a hard processor core so you can offload the complex dynamics to the CPU and the parallel calculations and simple but repetitive fast IO access to the FPGA? – DKNguyen Dec 07 '22 at 16:41
  • @DKNguyen: this is actually what we do in practice here but the communication time between the FPGA card and the CPU, done on PCIe, becomes the bottleneck then. But this could be another way to solve my issue: optimizing PCIe for short and fast packets of data (like 128 Bytes with 500 ns) – Christian Dec 14 '22 at 22:32
    @DKNguyen is making a good suggestion. For example, a Xilinx Zynq is a free-standing single/dual-core 800-odd MHz ARM microcontroller and a Xilinx FPGA on the same silicon. The MCU works completely independently of the FPGA but there's a set of very fast and wide AXI buses between the two for squirting data at each other. This part might not be the right one for you but take a look into these MCU+FPGA single-chip parts, if you hadn't already. – TonyM Dec 14 '22 at 22:43

2 Answers


There shouldn't be a problem getting the CPU to work fast enough; it might not be necessary to use assembly language, as a simple C program will probably do the job.

The problem is getting the data from the CPU to the I/O; if the interface is on a board such as PCI, there will be an OS driver, and bypassing that driver with bare-metal code won't be easy.

One option might be to go with a Raspberry Pi equivalent that uses an Intel processor, since you may be able to direct-drive the I/O pins without an OS device-driver. Alternatively you could offload the I/O to a suitable device (FPGA, fast microcontroller, etc.) with a fast link (USB, Ethernet, etc.) to the PC that formulates the I/O commands.

jayben
    Bypassing a driver on Linux can be easier than you think. The problem is then you have to do what the driver normally does. – user253751 Dec 01 '22 at 17:59

Thank you everyone for your inputs. All the comments make sense to me and I prefer to answer them all together here.

TonyM's comment can help better explain my motivation. It is very true that FPGAs are not really designed for complex algorithms. The reason my company uses FPGAs is that I/O access is extremely fast (because we can put the I/O directly on the FPGA board!). On the negative side, however, we have to implement complex algorithms for dynamic-system simulation on the FPGA (e.g., the differential-algebraic equations, or DAE, of an induction motor).

In C code on CPUs, DAE are really easy to simulate, but I/O latency is the problem. In the domain of real-time simulation, the I/O speed issues are not the typical ones. In most applications, one wants the maximum data-transfer speed for a large chunk of data. In our applications, we send and receive very small chunks of data from the I/Os, but with very low latency at each simulation time step (e.g., every 1 µs we wish to send and receive 100 bytes of data, which is only about 12 double-precision numbers).

The idea of user1937198 to connect an FPGA on the DIMM is interesting, but I believe it could face problems with cache coherency. We would need to disable the cache to make sure the processor does not assume the required data is in cache but always reads it from the DIMM. Or else play with the QuickPath Interconnect protocol… (comments welcome here).

I wish to say also that I am aware of the ‘naiveté’ of my question. My undergraduate studies coincided with the first IBM PC, and in our laboratory we were able to launch such ‘bare-metal’ (DOS-free) programs at the time. I remember something about Interrupt 7, but I am not sure. We could then wire some hardware directly on an ISA prototyping board. The ISA bus frequency was something like 8 MHz, which is 8 times faster than the 1 µs period I wish to have for my I/Os 35 years later.

Christian
    Christian - Hi, This answer isn't clear to me either. Either (a) You are writing it as an answer, because it's the final answer to your question. If so, please "accept" an answer, yours or another one (green "tick") to close the whole question. Or (b) You are mistakenly writing this as an answer, when in fact you still want replies / help. In that case, this is an update, not an answer. In this case I suggest replying to individual comments below each one. Then edit the question to explain what further help / explanation you want and is still missing. || Which applies here, (a) or (b)? Thanks – SamGibson Dec 07 '22 at 16:35
  • Hi SamGibson, I just felt a common reply could help the discussion grow on this very broad topic. Basically, I wish to redo my undergraduate experiment on the newest Intel processors: that is, execute a short program directly from boot. Any I/O would do, the simplest one, just to make sure that it can be done. – Christian Dec 09 '22 at 16:26