I am considering writing a small program for the ARM Cortex-A9 in the Xilinx Zynq-7000 FPGA. The program will be small enough to fit into the 32 KB L1 instruction cache, and the data will also be less than 32 KB, so it can fit into the L1 data cache.
I expect it to be a very linear program, so there should be no penalty for branch misprediction and the like. I also expect that there will be no data dependencies within the length of the pipeline, or that bypassing (forwarding) will be able to provide the data in time.
Is the L1 instruction cache fast enough that I can get an execution speed of at least 1 instruction per cycle (IPC)?
Is the L1 data cache fast enough that data loads and stores will not have any effect on sustaining an execution speed of at least 1 instruction per cycle (IPC)?
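
To make the target concrete, here is a minimal sketch in C of the arithmetic I have in mind. It assumes the fixed 4-byte ARM (A32) encoding; Thumb-2 code mixes 2-byte and 4-byte instructions, so the count would differ. The helper names (`max_arm_instructions`, `ipc_milli`, `meets_ipc_target`) are made up for illustration, and the instruction and cycle counts would have to come from a real measurement, for example the Cortex-A9 performance counters.

```c
#include <stdint.h>

/* Assumption: ARM (A32) encoding, 4 bytes per instruction. */
#define ARM_INSN_BYTES 4u

/* Upper bound on fixed-width ARM instructions that fit in a cache
 * of cache_bytes bytes (ignores alignment and literal pools). */
static uint32_t max_arm_instructions(uint32_t cache_bytes)
{
    return cache_bytes / ARM_INSN_BYTES;
}

/* IPC scaled by 1000 (milli-IPC), computed in integers so no FPU
 * is required. Returns 0 when cycles is 0. */
static uint32_t ipc_milli(uint64_t instructions, uint64_t cycles)
{
    if (cycles == 0) {
        return 0;
    }
    return (uint32_t)((instructions * 1000u) / cycles);
}

/* Nonzero when the measured run reached at least 1 IPC, i.e. retired
 * at least as many instructions as it consumed cycles. */
static int meets_ipc_target(uint64_t instructions, uint64_t cycles)
{
    return cycles != 0 && instructions >= cycles;
}
```

With these helpers, `max_arm_instructions(32 * 1024)` gives 8192, so the program could be at most about 8192 ARM instructions long, and a measured run would meet the goal whenever `meets_ipc_target(instructions, cycles)` is nonzero.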