A quick benchmark using float32 copies using torch cuda->cuda copies, comparing some random machines:
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is a "eh, it works" benchmarks, but should give you a feel for the relative performance of the different systems.In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
Nice! Thanks for that.
55 t/s is much better than I could expect.