SSE4 instruction to improve CPU-GPU collaboration

We already know that Intel's upcoming 45nm processors, code-named Penryn, will introduce a new instruction set extension called SSE4. The folks at ExtremeTech have now learned some details about one SSE4 instruction that may pave the way for integration between microprocessors and graphics processors. As the site explains, this "streaming load" instruction allows graphics data to bypass the processor's cache:
"The streaming load instruction is a 16-byte aligned load instruction. But interestingly, the results are held in a temporary stream buffer that bypasses the normal cache hierarchy, a high-priority expressway that other data types haven't received. Intel identified the streaming-load instruction as ideal for GPU-CPU sharing, as well as imaging."
According to the lead architect on Penryn, Stephen Fischer, "This is an interesting instruction, as it opens the door to new areas of collaboration between CPU and the GPU." Fischer adds that the instruction "improves the read buffer from the GPU to the CPU by a factor of eight." ExtremeTech says that, when asked at a lunch panel whether the instruction was a response to AMD's "Fusion" integrated CPU-GPU, Fischer replied, "I could see where people would say that."
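For readers curious what a streaming load looks like in code, here is a minimal sketch in C. It assumes the instruction in question is the one exposed as the SSE4.1 mnemonic MOVNTDQA and the compiler intrinsic _mm_stream_load_si128; the helper name stream_copy is hypothetical and purely illustrative, and the target attribute is GCC/Clang-specific. Note that the cache-bypassing behavior is only a hint: it is meant for write-combining memory such as a mapped graphics buffer, and on ordinary cacheable memory the processor may treat it as a regular aligned load.

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_stream_load_si128 */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the article): copy `bytes` bytes from `src`
 * to `dst` using 16-byte streaming loads. `src` must be 16-byte aligned and
 * `bytes` must be a multiple of 16, matching the instruction's alignment rule.
 * The target attribute (GCC/Clang) enables SSE4.1 for this function only.
 */
__attribute__((target("sse4.1")))
static void stream_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)src % 16) == 0);  /* streaming loads need aligned input */
    assert(bytes % 16 == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / 16; ++i) {
        /* Non-temporal load: a hint that the data should bypass the caches.
         * It only takes effect on write-combining memory; elsewhere the
         * hint may be ignored and this acts as a plain aligned load. */
        __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);

        /* Unaligned store, so `dst` has no alignment requirement. */
        _mm_storeu_si128(&d[i], v);
    }
}
```

In a real CPU-GPU scenario the source pointer would come from a driver or graphics API mapping rather than from ordinary heap memory, which is where the stream buffer described above would come into play.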
