Key TechniquesSSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups.
Trust the OS — No custom expert cache.
Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead.
Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration.
The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
1 час назад @ github.com
infomate
