lilbuddhaman wrote: If the PS4 has a real-time OS, with libGCM-style low-level access to the GPU, then PS4 first-party games will be years ahead of the PC, simply because it opens up what is possible on the GPU.
Anything that would be doable on the PS4 would be doable on the PC ... libGCM (from what I've read) is merely a low-level, OpenGL-style implementation. As Valve's Linux/OpenGL ports of the Source engine show, the benefits are reachable and there; they just need to be capitalized on. Also, low-level coding is harder, so don't expect most third parties to jump on it.
I'm excited for PS4, not that I'm going to buy one, but for what it means for possible PC ports.
Not everything... check what game developers have to say.
The #1 issue with PC gaming is API overhead: batching has proven to be far worse on PC than on consoles.
"On consoles, you can draw maybe 10,000 or 20,000 chunks of geometry in a frame, and you can do that at 30-60fps. On a PC, you can't typically draw more than 2-3,000 without getting into trouble with performance, and that's quite surprising - the PC can actually show you only a tenth of the performance if you need a separate batch for each draw call. "
From http://www.bit-tech.net/hardware/graphi ... -directx/1
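As a rough editorial illustration of the quote above (every number here is made up, not from the article), the draw-call gap is just arithmetic on per-draw CPU cost:

```python
def max_draws(frame_budget_ms, per_draw_overhead_ms):
    """How many individual draw calls fit in the CPU frame budget."""
    return round(frame_budget_ms / per_draw_overhead_ms)

# Hypothetical per-draw costs: a thick PC driver/API stack vs. a thin
# console submission path writing command buffers almost directly.
pc_draws = max_draws(16.6, 0.006)        # assume ~6 us of API work per draw
console_draws = max_draws(16.6, 0.0008)  # assume ~0.8 us per draw

print(pc_draws, console_draws)
```

With those assumed costs the PC lands in the low thousands of draws per 60fps frame while the console clears 20,000, which is the shape of the disparity the quote describes.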
Timothy Lottes' blog (http://timothylottes.blogspot.com/) shows some of the stuff that AMD put in their GPU that's console-only:
"Now lets dive into what isn't provided on PC but what can be found in AMD's GCN ISA docs,
Dual Asynchronous Compute Engines (ACE) :: Specifically "parallel operation with graphics and fast switching between task submissions" and "support of OCL 1.2 device partitioning". Sounds like at a minimum a developer can statically partition the device such that graphics and compute can run in parallel. For a PC, static partitioning would be horrible because of the different GPU configurations to support, but for a dedicated console, this is all you need. This opens up a much easier way to hide small compute jobs in a sea of GPU-filling graphics work like post processing or shading. The way I do this on PC now is to abuse vertex shaders for full screen passes (the first triangle is full screen, and the rest are degenerates; use an uber-shader for the vertex shading, looking at gl_VertexID and branching into "compute" work, being careful to space out the jobs by the SIMD width to avoid stalling the first triangle, or loading up one SIMD unit on the machine, ... like I said, complicated). In any case, this Dual ACE system likely makes it practical to port over a large amount of the Killzone SPU jobs to the GPU even if they don't completely fill the GPU (which would be a problem without complex uber-kernels on something like CUDA on the PC).
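The vertex-shader "compute" trick described above hinges on a mapping from gl_VertexID to work. A minimal sketch of one plausible reading of it (editorial illustration, not Lottes' actual shader; the 64-lane GCN wavefront width and the one-job-per-wavefront spacing are my assumptions):

```python
SIMD_WIDTH = 64  # GCN wavefront size

def classify_vertex(vertex_id):
    """Mimics the uber-vertex-shader branch on gl_VertexID: the first
    3 vertices form the fullscreen triangle, and later vertices are
    degenerates repurposed as 'compute' lanes, with real jobs spaced
    one per SIMD-width slot so no single SIMD unit is loaded up and
    the fullscreen triangle isn't stalled."""
    if vertex_id < 3:
        return ("fullscreen", vertex_id)
    slot = vertex_id - 3
    if slot % SIMD_WIDTH == 0:
        return ("compute", slot // SIMD_WIDTH)  # one job per wavefront
    return ("idle", None)                       # degenerate filler vertex
```

On real asynchronous compute hardware none of this contortion is needed; the point of the sketch is how awkward the PC workaround is.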
Dual High Performance DMA Engines :: Developers would get access to do async CPU->GPU or GPU->CPU memory transfers without stalling the graphics pipeline, and specifically the ability to control semaphores in the push buffer(s) to ensure no stalls and low-latency scheduling. This is something the PC APIs get horribly wrong, as all memory copies are implicit without really giving control to the developer. This translates to much better resource streaming on a console.
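A toy model of that copy-engine-plus-semaphore pattern, using a thread to stand in for the DMA engine (editorial sketch; names and structure are invented for illustration):

```python
import threading

class DmaEngine:
    """Toy async copy engine: the copy runs off the 'graphics' thread
    and releases a semaphore when the data is ready, so rendering
    never blocks inside an implicit driver-managed copy."""
    def __init__(self):
        self.done = threading.Semaphore(0)
        self.dst = None

    def async_upload(self, src):
        def copy():
            self.dst = bytes(src)   # the "DMA" transfer
            self.done.release()     # semaphore signalled in the push buffer
        threading.Thread(target=copy).start()

    def wait(self):
        self.done.acquire()         # consumed only when the data is needed
        return self.dst

dma = DmaEngine()
dma.async_upload(b"texture mip 0")
# ... graphics work would continue here instead of stalling ...
data = dma.wait()
```

The console win is that the developer, not the driver, decides where the release and acquire points sit.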
Support for up to 6 Audio Streams :: HDMI supports audio, so the GPU actually outputs audio, but no PC driver gives you access. The GPU shader is in fact the ideal tool for audio processing, but on the PC you need to deal with the GPU->CPU latency wall (which can be worked around with pinned memory); to add insult to injury, the PC driver simply copies that data back to the GPU for output, adding more latency. In theory, on something like a PS4 one could just mix audio on the GPU directly into the buffer being sent out over HDMI.
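The mixing itself is trivial; what the console changes is only where the output buffer lives. A minimal sketch of mixing up to 6 streams into a 16-bit PCM buffer (editorial illustration):

```python
def mix_streams(streams, length):
    """Sum up to 6 audio streams into one output buffer with clamping,
    as a GPU kernel might write directly into the block scanned out
    over HDMI instead of bouncing through the CPU."""
    assert len(streams) <= 6
    out = []
    for i in range(length):
        s = sum(st[i] for st in streams)
        out.append(max(-32768, min(32767, s)))  # clamp to signed 16-bit PCM
    return out

mixed = mix_streams([[1000, -2000], [500, 40000]], 2)
```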
Global Data Store :: AMD has no way of exposing this in DX, and in OpenGL they only expose this in the ultra-limited form of counters which can only increment or decrement by one. The chip has 64KB of this memory, effectively with the same access as shared memory (atomics and everything) and lower latency than global atomics. This GDS unit can be used for all sorts of things, like workgroup to workgroup communication, global locks, or like doing an append or consume to an array of arrays where each thread can choose a different array, etc. To the metal access to GDS removes the overhead associated with managing huge data sets on the GPU. It is much easier to build GPU based hierarchical occlusion culling and scene management with access to these kind of low level features.
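The "append to an array of arrays" use of GDS can be sketched with per-array atomic counters; here a lock stands in for the hardware atomic (editorial toy model, not real GDS semantics):

```python
import threading

class AppendBuffers:
    """Toy GDS-style append: a small pool of counters with atomic
    increment, one per destination array, so each 'thread' can append
    to whichever array it chooses - e.g. visible vs. occluded lists
    in GPU-driven occlusion culling."""
    def __init__(self, n_arrays, capacity):
        self.lock = threading.Lock()           # stands in for a GDS atomic
        self.counts = [0] * n_arrays
        self.arrays = [[None] * capacity for _ in range(n_arrays)]

    def append(self, array_index, value):
        with self.lock:                        # atomic_add on a GDS counter
            slot = self.counts[array_index]
            self.counts[array_index] += 1
        self.arrays[array_index][slot] = value

bufs = AppendBuffers(n_arrays=2, capacity=4)
bufs.append(0, "visible mesh 7")
bufs.append(1, "occluded mesh 3")
bufs.append(0, "visible mesh 9")
```

The DX/GL limitation the quote complains about is that only the increment/decrement-by-one counter form of this is exposed.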
Re-used GPU State :: On a console with low level hardware access (like the PS3) one can pre-build and re-use command buffer chunks. On a modern GPU, one could even write or modify pre-built command buffer chunks from a shader. This removes the cost associated with drawing, pushing up the number of unique objects which can be drawn with different materials.
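Record-once, replay-and-patch is the whole idea; a minimal sketch with an invented command-list format (editorial illustration only):

```python
def build_draw_commands(mesh_id, material_id):
    """Record a reusable chunk of 'command buffer' once, up front."""
    return [("bind_material", material_id),
            ("bind_mesh", mesh_id),
            ("draw", 0)]

chunk = build_draw_commands(mesh_id=7, material_id=2)

def patch_material(chunk, material_id):
    """Patch the prebuilt chunk in place instead of re-recording it,
    the way a CPU - or, on GCN-class hardware, even a shader - could
    rewrite a stored command buffer between submissions."""
    chunk[0] = ("bind_material", material_id)

patch_material(chunk, 5)  # same geometry, new material, no re-record cost
```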
FP_DENORM Control Bit :: On the console one can turn off the forced flush-to-zero handling of denormals that both DX and GL mandate for 32-bit floating point in graphics. This enables easier ways to optimize shaders, because integer-limited shaders can use the floating point pipes via denormals.
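To make concrete what flush-to-zero throws away, here are the float32 bit patterns involved (plain IEEE 754 facts, decoded with the standard library):

```python
import struct

def f32_from_bits(bits):
    """Reinterpret a 32-bit integer bit pattern as an IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

smallest_denormal = f32_from_bits(0x00000001)  # exponent 0, mantissa 1
smallest_normal   = f32_from_bits(0x00800000)  # exponent 1, mantissa 0

# Flush-to-zero, as forced by DX/GL, turns every value below
# smallest_normal into 0.0; with denormals enabled they survive,
# so the mantissa bits become usable for integer-style tricks.
```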
128-bit to 256-bit Resource Descriptors :: With GCN, all that is needed to define a buffer's GPU state is to set 4 scalar registers to a resource descriptor; similar with textures (up to 8 scalar registers, plus another 4 for the sampler). The scalar ALU on GCN supports block fetch of up to 16 scalars with a single instruction from either memory or from a buffer. It looks to be trivially easy on GCN to do bindless buffers or textures for shader load/stores. Note this scalar unit has its own data cache also. Changing textures or surfaces from inside the pixel shader looks to be easily possible. Note shaders still index resources using an instruction immediate, but the descriptor referenced by this immediate can be changed. This could help remove the traditional draw-call-based material limit.
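The bindless idea in miniature: the shader's immediate index never changes, but what the slot describes can (editorial sketch; the heap/slot names are invented):

```python
class DescriptorHeap:
    """Toy bindless model: a shader indexes descriptors by a fixed
    immediate, but the descriptor stored at that slot (4-8 scalar
    registers on real GCN) can be rewritten between or during draws."""
    def __init__(self):
        self.slots = {}

    def write(self, index, resource):
        self.slots[index] = resource   # CPU or shader writes the descriptor

    def fetch(self, index):
        return self.slots[index]       # the scalar block fetch

heap = DescriptorHeap()
heap.write(0, {"kind": "texture", "address": 0x1000})
first = heap.fetch(0)["address"]
heap.write(0, {"kind": "texture", "address": 0x2000})  # swap mid-frame
second = heap.fetch(0)["address"]
```

Since materials no longer require a state change per draw, the per-material draw-call limit the quote mentions goes away.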
S_SLEEP, S_SETPRIO, and GDS :: These provide all the tools necessary to do lock and lock-free retry loops on the GPU efficiently. DX11 specifically does not allow locks, due to fear that some developer might TDR the system. With low level access, S_SLEEP enables putting a wavefront to sleep without busy-spinning on the ALUs, and S_SETPRIO enables reducing priority when checking for unlock between S_SLEEPs.
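The retry-loop shape is the familiar try-lock-with-backoff pattern; here is a CPU-threading stand-in (editorial sketch, with time.sleep playing the role of S_SLEEP):

```python
import threading
import time

gpu_lock = threading.Lock()

def locked_update(shared, retries=1000):
    """Retry loop in the spirit of S_SLEEP/S_SETPRIO: instead of
    busy-spinning on the ALUs, back off briefly between attempts."""
    for _ in range(retries):
        if gpu_lock.acquire(blocking=False):
            try:
                shared.append(len(shared))  # the critical section
            finally:
                gpu_lock.release()
            return True
        time.sleep(0.0001)  # S_SLEEP stand-in: yield rather than spin
    return False            # give up; a real shader might retry forever

shared = []
threads = [threading.Thread(target=locked_update, args=(shared,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```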
S_SENDMSG :: This enables a shader to force a CPU interrupt. In theory this can be used to signal completion of some GPU operation to a real-time OS, starting up CPU-based tasks without needing the CPU to poll for completion. The other option would be maybe an interrupt signaled from a push buffer, but that wouldn't be able to signal from some intermediate point during a shader's execution. On the PS4 this might enable tighter GPU and CPU task dependencies in a frame (or maybe even in a shader), compared to the latency wall which exists on a non-real-time OS like Windows, which usually forces CPU and GPU task dependencies to be a few frames apart.
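Interrupt-versus-polling in miniature, with a threading.Event standing in for the interrupt (editorial sketch; the names are invented):

```python
import threading

frame_done = threading.Event()
result = {}

def gpu_shader():
    """Stands in for a shader that raises a CPU interrupt (S_SENDMSG)
    at some intermediate point in its execution, rather than the CPU
    polling a fence at frame granularity."""
    result["partial"] = "shadow maps rendered"
    frame_done.set()            # the 'interrupt'

def cpu_task():
    frame_done.wait()           # sleeps until signalled; no polling loop
    return result["partial"]

t = threading.Thread(target=gpu_shader)
t.start()
msg = cpu_task()
t.join()
```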
Full Cache Flush Control :: DX has only implicit, driver-controlled cache flushes: the driver needs to be conservative, track all dependencies (high overhead), then assume conflict and always flush caches. On a console, the developer can easily skip cache flushes when they are not needed, leading to more parallel jobs and higher performance (overlapping execution of things which in DX would be separated by a wait for the machine to go idle).
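The difference can be counted with a toy dependency model: a conservative driver flushes between every pair of jobs with dirty caches, while explicit control flushes only on an actual read-after-write conflict (editorial illustration; job and resource names invented):

```python
def flushes_needed(jobs, conservative):
    """Count cache flushes for a job list of (reads, writes) sets.
    Conservative mode flushes whenever anything is dirty; explicit
    mode flushes only when a later job reads what an earlier wrote."""
    flushes = 0
    dirty = set()
    for reads, writes in jobs:
        if conservative and dirty:
            flushes += 1
            dirty = set()
        elif dirty & reads:          # real read-after-write hazard
            flushes += 1
            dirty = set()
        dirty |= writes
    return flushes

# Shadow and SSAO passes are independent; only lighting reads them.
jobs = [(set(), {"shadow"}),
        (set(), {"ssao"}),
        ({"shadow", "ssao"}, {"lit"})]
```

The independent shadow and SSAO passes overlap under explicit control, which is exactly the extra parallelism the quote describes.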
GPU Assembly :: Maybe? I don't know if GCN has some hidden very complex rules for code generation and compiler scheduling. The ISA docs seem trivial to manage (manual insertion of barriers for texture fetch, etc). If Sony opens up GPU assembly, unlike the PS3, developers might easily crank out 30% extra from hand tuning shaders. The alternative is iterating on Cg, which is possible with real-time profiling tools. My experience on PC is micro-optimization of shaders yields some massive wins. For those like myself who love assembly of any arch, a fixed hardware spec is a dream.
...
"