Going faster than memcpy
While profiling Shadesmar a couple of weeks ago, I noticed that for large binary unserialized messages (>512 kB), most of the execution time is spent copying the message (using memcpy) between process memory and shared memory, and back.
To make it easy to integrate custom memory copying logic into the library, I introduced the concept of a Copier in this commit. The original reason for introducing this construct was to allow cross-device usage, where a custom copier would be implemented to transfer between CPU and GPU. A sketch of what such an interface might look like is shown below.

For small to medium sizes, Unrolled AVX absolutely dominates, but for larger messages it is slower than the streaming alternatives.
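Here is a minimal sketch of a Copier-style interface, assuming the library lets you override how bytes move between user memory and shared memory. The method names (alloc, dealloc, user_to_shm, shm_to_user) and the DefaultCopier class are illustrative assumptions, not necessarily Shadesmar's exact API.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical interface: the transport calls these hooks instead of
// hard-coding memcpy, so a GPU copier could allocate device memory and
// use a device copy routine in the two transfer methods.
class Copier {
 public:
  virtual ~Copier() = default;
  virtual void *alloc(size_t size) = 0;    // allocate a user-side buffer
  virtual void dealloc(void *ptr) = 0;     // free the user-side buffer
  virtual void user_to_shm(void *shm, const void *user, size_t size) = 0;
  virtual void shm_to_user(void *user, const void *shm, size_t size) = 0;
};

// Default behaviour: plain memcpy in both directions.
class DefaultCopier : public Copier {
 public:
  void *alloc(size_t size) override { return new char[size]; }
  void dealloc(void *ptr) override { delete[] static_cast<char *>(ptr); }
  void user_to_shm(void *shm, const void *user, size_t size) override {
    std::memcpy(shm, user, size);
  }
  void shm_to_user(void *user, const void *shm, size_t size) override {
    std::memcpy(user, shm, size);
  }
};
```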
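To make the "unrolled AVX vs. streaming" distinction concrete, here is a hedged sketch of the two approaches, not the benchmark code itself. It assumes both pointers are 32-byte aligned and the size is a multiple of 128 bytes; the cached loads/stores tend to win while the message fits in cache, while the non-temporal (streaming) stores avoid cache pollution on very large messages.

```cpp
#include <immintrin.h>
#include <cstddef>

// Unrolled AVX copy: 4 x 32-byte vectors per iteration, stores go through
// the cache hierarchy.
void copy_avx_unrolled(void *dst, const void *src, size_t size) {
  auto *d = static_cast<__m256i *>(dst);
  auto *s = static_cast<const __m256i *>(src);
  for (size_t i = 0; i < size / 32; i += 4) {
    _mm256_store_si256(d + i + 0, _mm256_load_si256(s + i + 0));
    _mm256_store_si256(d + i + 1, _mm256_load_si256(s + i + 1));
    _mm256_store_si256(d + i + 2, _mm256_load_si256(s + i + 2));
    _mm256_store_si256(d + i + 3, _mm256_load_si256(s + i + 3));
  }
}

// Streaming variant: non-temporal stores bypass the cache.
void copy_avx_stream(void *dst, const void *src, size_t size) {
  auto *d = static_cast<__m256i *>(dst);
  auto *s = static_cast<const __m256i *>(src);
  for (size_t i = 0; i < size / 32; ++i) {
    _mm256_stream_si256(d + i, _mm256_load_si256(s + i));
  }
  _mm_sfence();  // make the non-temporal stores globally visible
}
```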