Thankfully, PC is no longer a GPR in ARM64. Making PC a GPR seems elegant at first glance, but once you dive into how it affects processor implementations and the code you write, it turns out to be extremely messy and inconvenient. Good riddance, PC-as-GPR; don't let the door hit you on the way out.
It's neat when writing assembler, e.g. adding a scaled byte value to the PC to implement a jump table, or performing a scaled, indexed load into the PC. In 32-bit ARM it also made for short, fast function prologues and epilogues: the epilogue can restore registers and return in a single instruction (pop {r4, pc}). In my opinion, the worst problems come from the thousand-and-one special cases it adds to an optimised out-of-order implementation. Thumb interworking makes it even worse, but it is useful for increasing code density on ARMv6-M and can even improve per-clock performance on ARMv7-M cores. I don't expect it causes many problems in single-issue, in-order implementations like the Cortex-M3 and M4. I would like to know how much design time and core area is spent on this in the Cortex-M7 and M85.
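To make the special-case point concrete, here is a minimal C sketch of the question a 32-bit ARM (A32) front end has to answer for every instruction: could this write the PC? The function name a32_may_write_pc is hypothetical, and the check is deliberately conservative and incomplete. It covers B/BL, data-processing ops and single loads whose destination field is the PC, and LDM with the PC in its register list, and it skips the unconditional (cond == 0b1111) space entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an A32 (ARM-state) instruction
   *may* write the PC. Conservative and incomplete sketch: it can flag
   some non-branches and ignores the cond == 0b1111 encoding space. */
bool a32_may_write_pc(uint32_t insn)
{
    uint32_t cond = insn >> 28;          /* bits [31:28]: condition */
    uint32_t op3  = (insn >> 25) & 0x7;  /* bits [27:25] */
    uint32_t op2  = (insn >> 26) & 0x3;  /* bits [27:26] */
    bool     load = (insn >> 20) & 0x1;  /* bit 20: L (load) bit */
    uint32_t rd   = (insn >> 12) & 0xF;  /* bits [15:12]: Rd / Rt field */

    if (cond == 0xF)
        return false;                    /* unconditional space: not handled here */
    if (op3 == 0x5)
        return true;                     /* B / BL */
    if (op3 == 0x4)                      /* LDM/STM: LDM with PC in register list */
        return load && (insn & (1u << 15)) != 0;
    if (op2 == 0x1)                      /* single loads with Rt == PC */
        return load && rd == 15;         /* (also sweeps in some media encodings) */
    if (op2 == 0x0)                      /* data processing with Rd == PC; also */
        return rd == 15;                 /* catches BX/BLX (register), and may flag
                                            other encodings with bits [15:12] == 1111 */
    return false;                        /* coprocessor / SVC space etc. */
}
```

For example, it flags 0xE79FF100 (ldr pc, [pc, r0, lsl #2], the jump-table load mentioned above), 0xE8BD8010 (pop {r4, pc}) and 0xE12FFF1E (bx lr), but not 0xE92D4010 (push {r4, lr}). Even this incomplete version has to inspect several unrelated encoding classes just to answer one yes/no question.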
Even for regular in-order cores, it makes branch prediction a massive pain, because your fast frontend predictors essentially need to fully decode an instruction to determine whether it can be considered a branch (in 32-bit ARM, a MOV, ADD, LDR or LDM can all write the PC). Most other ISAs make this simple: only a few opcodes change control flow, so you can easily stuff that check into your early frontend decoder.
RISC-V unfortunately didn't quite get this right: call, return, and indirect branch all use the same opcode (JALR), so you have to look past the opcode and decode the register fields to determine whether you should use the RAS or your other predictors. This isn't insurmountable (next-line predictors help a lot with these early predictions), but it makes something very performance-critical just that much harder.
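Here is a minimal C sketch of the RISC-V case described above. The opcode field (bits [6:0]) alone tells you whether an instruction is JAL or JALR, but choosing between the RAS and the other predictors also needs rd and rs1. The mapping follows the return-address-stack hint table for JAL/JALR in the RISC-V unprivileged ISA specification, where x1 and x5 are the link registers; the enum and function names are hypothetical, and compressed (RVC) instructions are ignored.

```c
#include <stdint.h>

/* Hypothetical predecode result; names are illustrative only. */
enum ras_action { RAS_NONE, RAS_PUSH, RAS_POP, RAS_POP_THEN_PUSH };

/* x1 (ra) and x5 (t0) are the link registers for RAS hints. */
static int is_link_reg(uint32_t r) { return r == 1 || r == 5; }

/* Classify a 32-bit RISC-V instruction for the return-address stack,
   following the JAL/JALR hint rules; 16-bit encodings are not handled. */
enum ras_action rv_ras_action(uint32_t insn)
{
    uint32_t opcode = insn & 0x7F;          /* bits [6:0] */
    uint32_t rd     = (insn >> 7) & 0x1F;   /* bits [11:7] */
    uint32_t rs1    = (insn >> 15) & 0x1F;  /* bits [19:15] */

    if (opcode == 0x6F)                     /* JAL: only rd matters */
        return is_link_reg(rd) ? RAS_PUSH : RAS_NONE;
    if (opcode != 0x67)                     /* not JALR: no RAS action */
        return RAS_NONE;

    int rd_link  = is_link_reg(rd);
    int rs1_link = is_link_reg(rs1);
    if (rd_link && rs1_link)                /* both link regs: coroutine-style */
        return rd == rs1 ? RAS_PUSH : RAS_POP_THEN_PUSH;
    if (rd_link)
        return RAS_PUSH;                    /* call */
    if (rs1_link)
        return RAS_POP;                     /* return */
    return RAS_NONE;                        /* plain indirect jump */
}
```

For example, ret (0x00008067, i.e. jalr x0, 0(x1)) maps to RAS_POP, jalr x1, 0(x6) (0x000300E7) maps to RAS_PUSH, and jr t1 (0x00030067) maps to RAS_NONE, even though all three share opcode 0x67.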