Thankfully, PC is no longer a GPR in ARM64. Making PC a GPR seems elegant at fir...

crest · on Feb 10, 2023

It's neat when writing assembler, e.g. you can add a scaled byte value to the PC to implement a jump table, or perform a scaled and indexed load to the PC. In ARM it also produced a neat, short and fast function prologue/epilogue. In my opinion the worst problem it causes is the 1001 special cases it adds in an optimised out-of-order implementation. The Thumb interworking makes it worse, but is useful to increase code density in ARMv6-M and can even increase performance (per clock) of ARMv7-M cores. I don't expect it causes too many problems in single-issue in-order implementations like the Cortex-M3 and M4. I would like to know how much design time and core area is spent on this in the M7 and M85 cores.

sweetjuly · on Feb 11, 2023

Even for regular in-order cores, it makes branch prediction a massive pain because now your fast frontend predictors need to essentially fully decode the instruction in order to determine if it can be considered a branch. Most other ISAs make this simple because there are only a few opcodes that change control flow and so you can very easily just stuff that in your early frontend decoder.
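To make the decode burden concrete, here is a rough Python sketch of what "is this a branch?" looks like for 32-bit ARM (A32) once the PC is a GPR. It is deliberately simplified and is an illustration only: the function name `may_write_pc` and the returned labels are made up for this example, and it ignores condition codes, Thumb, and many corner encodings (multiplies, compares, media instructions). The point is just that the answer depends on register fields and a register list, not on one opcode.

```python
PC = 15  # r15 is the program counter in A32

def may_write_pc(inst):
    """Hypothetical helper: say why a 32-bit A32 word may redirect fetch, else None.

    Simplified sketch; real decoders must handle many more encodings.
    """
    op = (inst >> 25) & 0b111   # major opcode field, bits 27:25
    rd = (inst >> 12) & 0xF     # Rd/Rt field, bits 15:12
    load = (inst >> 20) & 1     # L bit in the load/store encodings
    if op == 0b101:
        return "branch"                    # B / BL: the easy case
    if op in (0b000, 0b001) and rd == PC:
        return "data-processing to pc"     # e.g. mov pc, lr (BX also lands here)
    if op in (0b010, 0b011) and load and rd == PC:
        return "load to pc"                # e.g. ldr pc, [sp], #4
    if op == 0b100 and load and (inst >> 15) & 1:
        return "load-multiple with pc"     # e.g. pop {r4, pc}
    return None

print(may_write_pc(0xE8BD8010))  # pop {r4, pc}
```

An ISA with a handful of dedicated branch opcodes only needs the equivalent of the first `if`.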

RISC-V unfortunately didn't quite do this well, since it uses the same opcode (JALR) for call, return, and indirect branch, so you have to fully decode the instruction in order to determine whether you should use the RAS or your other predictors. This isn't a problem that can't be overcome (next-line predictors help a lot for these early predictions), but it makes something very performance-critical just that much harder.
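As an illustration of the JALR case, here is a minimal Python sketch of the return-address-stack hint rules as I understand them from the RISC-V unprivileged spec, where x1 and x5 count as link registers. The function name `ras_action` and the string labels are invented for this example, and it only looks at uncompressed 32-bit encodings.

```python
JALR_OPCODE = 0b1100111  # major opcode shared by indirect call, return, and jump

def is_link(reg):
    # The spec's hint convention treats x1 (ra) and x5 (t0) as link registers.
    return reg in (1, 5)

def ras_action(inst):
    """Hypothetical helper: RAS hint for a 32-bit instruction, None if not JALR."""
    if inst & 0x7F != JALR_OPCODE:
        return None
    rd = (inst >> 7) & 0x1F    # destination register, bits 11:7
    rs1 = (inst >> 15) & 0x1F  # base register, bits 19:15
    rd_link, rs1_link = is_link(rd), is_link(rs1)
    if not rd_link and not rs1_link:
        return "none"            # plain indirect jump
    if not rd_link:
        return "pop"             # return
    if not rs1_link:
        return "push"            # indirect call
    # Both are link registers: same register pushes, different ones swap.
    return "push" if rd == rs1 else "pop_then_push"

print(ras_action(0x00008067))  # ret, i.e. jalr x0, 0(x1)
```

The opcode alone says nothing here; the predictor has to pull out both register fields before it knows whether to touch the RAS.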