Stephen Fischer, chief architect of Intel's Penryn, revealed that one of the 47 new SSE4 instructions included in Intel's next architecture will be able to increase the bandwidth between the CPU and the GPU eightfold.
Dedicated buffers for higher bandwidth
Thanks to this instruction, MOVNTDQA, the CPU can load data from memory in 64-byte blocks instead of 16-byte ones. Moreover, these blocks are not stored in the usual level 1, 2 or 3 CPU caches; they are placed in dedicated buffers. This saves the system transactions and FSB bandwidth, and makes transfers more efficient.
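For readers who want to see what this looks like in code, MOVNTDQA is exposed to C compilers as the `_mm_stream_load_si128` intrinsic. The sketch below is only an illustration, not Intel's code: the helper names `read_line` and `read_line_sse41` are hypothetical, the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang, and on non-x86 machines it simply falls back to `memcpy`. Each intrinsic call still returns 16 bytes; the 64-byte benefit comes from the hardware fetching the whole line once, and it only applies to write-combining memory such as a mapped frame buffer. On ordinary cached memory the instruction behaves like a normal load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h> /* SSE4.1 intrinsics, including _mm_stream_load_si128 */

/* Hypothetical helper: read one 64-byte line as four 16-byte MOVNTDQA loads.
   Compiled for SSE4.1 through a GCC/Clang function attribute (assumption). */
__attribute__((target("sse4.1")))
static void read_line_sse41(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < 4; i++) {
        /* The cast drops const because some headers declare a non-const pointer. */
        __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)(src + 16 * i));
        _mm_store_si128((__m128i *)(dst + 16 * i), v);
    }
}
#endif

/* Copy one 64-byte line; both pointers must be 16-byte aligned. */
static void read_line(uint8_t *dst, const uint8_t *src)
{
    assert(((uintptr_t)src % 16) == 0 && ((uintptr_t)dst % 16) == 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.1")) {
        read_line_sse41(dst, src);
        return;
    }
#endif
    memcpy(dst, src, 64); /* portable fallback when SSE4.1 is unavailable */
}
```

In a real driver the source pointer would refer to write-combining memory set up by the operating system; this sketch does not create such a mapping.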
Seeing the future in SSE
A similar instruction had already been introduced with SSE2 for transfers in the opposite direction, when data, for instance coming from a peripheral, has to be written to main memory. But what does this have to do with a graphics card, you may wonder? Well, this kind of writing and reading of long runs of consecutive data is typical of exchanges between the CPU and the GPU, when each of them accesses the frame buffer, i.e. the data that makes up the image.
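The SSE2 counterpart is presumably MOVNTDQ, the non-temporal store, which C compilers expose as `_mm_stream_si128`. As a rough sketch of the write direction, the hypothetical helper `fill_frame` below fills a buffer of 32-bit pixels with one color using streaming stores; the buffer here is ordinary memory standing in for a frame buffer, and a plain loop is used when the compiler is not targeting SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics, including _mm_stream_si128 */
#endif

/* Hypothetical helper: fill a buffer of 32-bit pixels with one color.
   Requires a 16-byte aligned buffer and a pixel count divisible by 4. */
static void fill_frame(uint32_t *fb, size_t pixels, uint32_t color)
{
    assert(((uintptr_t)fb % 16) == 0 && pixels % 4 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32((int)color); /* four copies of the pixel */
    for (size_t i = 0; i < pixels; i += 4) {
        /* MOVNTDQ: write 16 bytes with a non-temporal (cache-bypassing) hint. */
        _mm_stream_si128((__m128i *)(fb + i), v);
    }
    _mm_sfence(); /* order the streaming stores before any later stores */
#else
    for (size_t i = 0; i < pixels; i++) {
        fb[i] = color; /* portable fallback */
    }
#endif
}
```

The fence at the end matters because streaming stores are weakly ordered, so another reader of the buffer should not be signaled until it has executed.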
This kind of optimization will not bring notable results for many months, but in the current context (Larrabee, AMD Fusion) it shows Intel's effort toward merging the CPU and the GPU.