Sounds like a really interesting technical challenge.
Couple of issues I can see: (1) most devices out there would probably be mobile, so no NVIDIA/CUDA for you; (2) even with binding to, say, Apple Silicon, you might still be memory limited, i.e. can you fit the entire net on a single mobile GPU; (3) network latency?
Couple of issues I can see: (1) most devices out there would probably be mobile, so no NVIDIA/CUDA for you; (2) even with binding to, say, Apple Silicon, you might still be memory limited, i.e. can you fit the entire net on a single mobile GPU; (3) network latency?