Technical Background
AGP has several advantages over PCI. It offers a modest data-transfer-rate advantage when moving the geometry stream from the CPU to the graphics card. For managing large texture databases, AGP's GART table allows the OS to manage textures in off-screen memory as well as in system memory, and allows the graphics card to access them directly in either location. Prior to AGP, game developers had three methods for managing textures:
1. Limit the texture database to whatever fits in off-screen memory only. This usually delivers outstanding frame rates, but the memory-size constraint can limit artistic creativity. Depending on the graphics card, texture space could be as little as one megabyte or as much as five or six megabytes.
2. Use the OS to manage textures in main memory, and require the CPU to copy textures from main memory to graphics memory as needed. This is PAINFULLY slow. In order to make AGP look as good as possible, Intel likes to compare AGP to this mode. This is what DirectX Retained Mode does, but game developers do not generally use this method.
3. Place the most frequently used textures in off-screen memory (as in #1 above), then lock down a few additional megabytes of system memory for the remainder of the texture database. If the graphics chip needs a texture that is not in graphics memory, the accelerator uses PCI bus-master mode (DMA) to copy the needed texture, on demand, into a texture swap space in graphics memory. Performance is very good. This has been the preferred approach for game developers, and can be programmed under DirectX Immediate Mode.
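This third approach can be sketched in a few lines of Python. This is a minimal illustration, not any real graphics API: the class, budgets, and texture names are all hypothetical. Hot textures are preloaded permanently into off-screen graphics memory; everything else is bus-master (DMA) copied on demand into a fixed swap area.

```python
# Hypothetical sketch of the resident-set-plus-swap-space scheme.
# Nothing here is a real driver or DirectX call; names are illustrative.

class TextureManager:
    def __init__(self, resident_budget, swap_budget):
        self.resident_budget = resident_budget  # bytes of off-screen memory
        self.swap_budget = swap_budget          # bytes reserved as texture swap space
        self.resident = {}                      # texture_id -> size, always in graphics memory
        self.swap = {}                          # texture_id -> size, DMA'd in on demand
        self.swap_used = 0
        self.dma_copies = 0                     # count of on-demand bus-master copies

    def preload(self, texture_id, size):
        """Place a frequently used texture in off-screen graphics memory."""
        if sum(self.resident.values()) + size > self.resident_budget:
            raise MemoryError("resident set over budget")
        self.resident[texture_id] = size

    def bind(self, texture_id, size):
        """Make a texture available to the rasterizer, DMA-copying if needed."""
        if texture_id in self.resident or texture_id in self.swap:
            return  # already in graphics memory: no bus traffic
        # Evict the oldest swapped-in textures until the new one fits.
        while self.swap_used + size > self.swap_budget and self.swap:
            victim, vsize = next(iter(self.swap.items()))
            del self.swap[victim]
            self.swap_used -= vsize
        self.swap[texture_id] = size
        self.swap_used += size
        self.dma_copies += 1  # one bus-master (DMA) copy over PCI or AGP
```

Binding a resident texture costs nothing on the bus; only misses trigger a DMA copy, which is why keeping the hot set resident matters.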
AGP is a modified approach to option 3. Memory management is a little more flexible, and the data transfer rate is better because of AGP's faster clock speed and bus pipelining.
AGP offers two ways to deal with textures. One is DMA mode, which operates almost exactly like option 3 above, except that transfers occur over AGP rather than PCI. The other is Execute mode, which allows the graphics chip to access texture data in main memory without first copying it to graphics memory. The effective bus throughput of Execute mode and DMA mode is the same; if anything, DMA mode could be faster because of better concurrency and deeper pipelining.
Intel has gone to great lengths to convince game developers that DMA mode stinks. They have even gone so far as to refer to method #2 above as DMA mode in order to confuse everybody. DMA stands for "Direct Memory Access". There is nothing direct about using the CPU to copy data. This is pure deception. True DMA uses a hardware bus master, like a PCI or AGP graphics accelerator.
Game developers and users should prefer AGP's DMA mode because it offers excellent performance while still being architecturally compatible with the installed base of PCI accelerators. Intel prefers Execute Mode because it only runs with AGP. As we all know, Intel is always trying to stir up more ways to persuade users to dump their PCI Pentium 233 systems ASAP, and go buy a more costly P2 AGP system. Intel's Mission is not to "Accelerate 3D Graphics", but rather to "Accelerate Obsolescence".
AGP's DMA mode offers excellent concurrency because the game can detect if it must swap textures before the texture is actually needed, while the CPU is still calculating the geometry (in the geometry setup stage). This way, the graphics card can begin fetching the texture before it is needed to paint pixels on the screen. Concurrency is one of the keys to performance.
In AGP Execute mode, texture accesses are driven by the rasterizer at the final stage of the 3D pipeline. At this point, the accelerator is dead if it cannot have immediate access to textures. In this way, Execute mode creates a nasty performance bottleneck: instead of having immediate access to textures in high-bandwidth local memory, the accelerator must stall while it arbitrates for access to slower system memory via AGP.
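The contrast between the two modes comes down to overlap, and can be captured in a toy timing model. The millisecond figures and function names below are illustrative assumptions, not measurements: in DMA mode the texture fetch runs concurrently with geometry setup, so only the non-overlapped remainder stalls the pipeline, while in Execute mode the demand-driven fetch adds its full cost at the rasterization stage.

```python
# Toy per-frame timing model (all numbers illustrative, not measured).

def frame_time_dma(geometry_ms, raster_ms, fetch_ms):
    # DMA mode: the texture fetch overlaps the geometry-setup stage;
    # only the part of the fetch that outlasts geometry stalls the pipe.
    stall = max(0.0, fetch_ms - geometry_ms)
    return geometry_ms + stall + raster_ms

def frame_time_execute(geometry_ms, raster_ms, fetch_ms):
    # Execute mode: the rasterizer issues the fetch at the final stage,
    # so there is no overlap and the full fetch cost is paid.
    return geometry_ms + fetch_ms + raster_ms
```

With an assumed 5 ms of geometry setup, 8 ms of rasterization, and a 4 ms texture fetch, DMA mode hides the fetch entirely (13 ms per frame) while Execute mode pays for it in full (17 ms per frame).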
30% Reduction In Accelerator Performance
Intel has developed a software tool, called IBASES, for comparing the performance of AGP versus local texturing modes. As a matter of principle, though, I do not recommend this tool to anyone. IBASES does not support AGP DMA mode texturing. In its place, it substitutes the extremely useless CPU copy mode (method #2 above), and oddly enough, the software still refers to this as "DMA Mode". This is clearly NOT a mistake, but rather a blatant attempt to deceive. Anyone using this tool should disregard the results of the "DMA" test completely, and instead assume that AGP DMA mode results would be about the same as AGP Execute mode results.
I evaluated local vs AGP texture performance of the following accelerators:
ATI Rage Pro
3D Labs Permedia 2
nVidia Riva 128
The program allows the user to change the number of times the textures are accessed per frame. This has the effect of gradually increasing the total texture bandwidth demand. As seen over AGP, the texture bandwidth demand created by IBASES in the chart below ranges from 256K per frame at the left extreme, up to 4 megabytes per frame at the right extreme of the chart.
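Those per-frame figures translate into sustained bus bandwidth once a target frame rate is applied. The 30 frames per second used below is an assumption for illustration, not a figure from the test:

```python
# Convert IBASES-style per-frame texture traffic into sustained MB/s.

def texture_bandwidth_mb_s(bytes_per_frame, fps):
    return bytes_per_frame * fps / 2**20

# At the chart's extremes, assuming 30 frames per second:
low  = texture_bandwidth_mb_s(256 * 2**10, 30)  # 256K per frame -> 7.5 MB/s
high = texture_bandwidth_mb_s(4 * 2**20, 30)    # 4 MB per frame -> 120 MB/s
```

At the high end of the chart, texture traffic alone approaches the full usable bandwidth of the memory bus.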
The chart shows the average difference in rendering performance for all of the accelerators using AGP Execute mode texturing, compared to local graphics memory. This data demonstrates that overall, AGP execute mode is about 30% slower than local texture mode.
10% Reduction In CPU Performance
The other side of the performance equation is the CPU. What happens to the CPU when AGP texturing is activated? When AGP texturing is turned on, the graphics card takes control of the main memory bus in order to access texture data, and while it does, the CPU is locked out of main memory. If the CPU is not very busy, this may not be a problem. But if the CPU is engaged in a computationally demanding task (such as a game), it is likely to stall while waiting for its turn to access main memory.
Right now there is no good way to physically test this scenario. It must be modeled. For this and other reasons, I have built a rather complex software model of the entire PC architecture. Using this model, I am able to estimate the CPU performance impact of main memory arbitration conflicts resulting from AGP texturing.
Intel has publicly released figures that show that a 300MHz P2 requires about 100MB/s of external bandwidth while running 3D Winbench. Third party testing has demonstrated that games can create a CPU bandwidth demand of 50 to 120MB/s from main memory. In the face of this load, AGP texturing could also place a concurrent main memory bandwidth demand of about 50 megabytes per second (or more).
Using this data, my system performance model shows that main memory conflicts between the CPU and the graphics controller will result in a CPU performance reduction of more than 10%. This assumes a Pentium II with its backside cache intact; the cacheless Covington would be brought to its knees under these circumstances.
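The flavor of this arbitration effect can be sketched with a much-simplified calculation (it is not the full system model described above). All numbers below, including the effective bus bandwidth and the fraction of CPU time spent waiting on main memory, are illustrative assumptions:

```python
# Much-simplified memory-arbitration model. The AGP bus master is assumed
# to win arbitration, so the CPU sees only the leftover bus bandwidth.

def cpu_slowdown(bus_mb_s, cpu_mb_s, agp_mb_s, mem_fraction):
    """Fraction of CPU performance lost to memory-bus contention.

    mem_fraction: share of CPU time spent on main-memory accesses
    when there is no contention (an assumed workload parameter).
    """
    available = bus_mb_s - agp_mb_s          # bandwidth left for the CPU
    if available >= cpu_mb_s:
        return 0.0                           # no conflict, no slowdown
    inflation = cpu_mb_s / available         # memory phases stretch by this factor
    new_time = (1 - mem_fraction) + mem_fraction * inflation
    return 1 - 1 / new_time
```

With an assumed 130 MB/s of effective bus bandwidth, a 100 MB/s CPU demand, a 50 MB/s AGP texture demand, and half the CPU's time spent in memory accesses, this toy model yields a slowdown of roughly 11%, consistent with the more-than-10% figure above.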
The assumption that AGP is a little faster now, but will be a lot faster when it is "really turned on," is completely false. In fact, the opposite is true. In most cases, systems which actually use AGP for textures could be as much as 40% slower than systems which use local graphics memory for textures instead.
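The arithmetic behind the 40% figure, under the simplifying assumption that the ~30% accelerator loss and the ~10% CPU loss compound multiplicatively:

```python
# Combine the two measured/modeled losses multiplicatively
# (a simplifying assumption about how the bottlenecks interact).
combined = (1 - 0.30) * (1 - 0.10)  # remaining fraction of local-texture performance
loss = 1 - combined                 # total loss: about 37%, i.e. roughly 40%
```

If the two losses instead overlapped (e.g. the CPU stalls hiding behind accelerator stalls), the total would be somewhat smaller, which is why 40% is a "potentially" rather than a guarantee.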
Is this enough motivation to make sure you get a graphics card with enough memory to do the job?