Traditional software tuning focuses on finding and tuning hot spots, the 10% of the code in which a program spends 90% of its time. Most graphics hardware accelerators are arranged as a pipeline, where one stage may perform vertex transformation and lighting while another draws the actual pixels into the framebuffer. Because these stages operate in parallel, it is appropriate to use a different approach: look for bottlenecks, overloaded stages that hold up the other stages.
At any time, one stage of the pipeline is the bottleneck. Reducing the time spent in that bottleneck is the best way to improve performance. Conversely, doing work that further narrows the bottleneck, or that creates a new bottleneck somewhere else, can further degrade performance.
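A rough numeric sketch (the stage costs are invented purely for illustration) shows why: when the stages overlap, the frame time is set by the slowest stage, so speeding up any other stage buys nothing.

```c
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    /* Hypothetical per-frame costs in milliseconds; not measurements
     * from any particular piece of hardware.                         */
    double transform_ms = 4.0;   /* vertex transform and lighting */
    double raster_ms    = 9.0;   /* rasterization / pixel fill    */

    /* With the stages running in parallel, the frame time is governed
     * by the slowest stage -- here the rasterizer is the bottleneck.  */
    printf("pipelined frame time: %.1f ms\n",
           max2(transform_ms, raster_ms));            /* prints 9.0 */

    /* Halving the transform cost changes nothing, because the
     * rasterizer still limits the frame.                              */
    transform_ms = 2.0;
    printf("after speeding up transform: %.1f ms\n",
           max2(transform_ms, raster_ms));            /* still 9.0 */
    return 0;
}
```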
Because different parts of the hardware are responsible for different stages of the pipeline, the workload in a non-bottleneck stage can instead be increased without degrading performance, as long as that stage does not itself become the new bottleneck. In this way, an application can sometimes be altered to draw, for example, a higher-quality image with no performance degradation.
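For instance, assuming profiling has shown the application to be fill-limited (the rasterizer is the bottleneck), a sketch like the following spends the transform stage's spare capacity on finer tessellation and an extra light. The fill_limited flag, and gluSphere standing in for the application's real geometry, are assumptions of this example, not part of any particular program.

```c
#include <GL/gl.h>
#include <GL/glu.h>

/* Draw a sphere, adding per-vertex work only when the rasterizer has
 * already been identified as the bottleneck; the extra vertices and
 * lighting then cost little or nothing, because the transform stage
 * would otherwise sit idle waiting for the rasterizer.              */
void draw_sphere(int fill_limited, GLUquadric *quad)
{
    int slices = fill_limited ? 64 : 16;   /* more vertices when transform is idle */
    int stacks = fill_limited ? 64 : 16;

    if (fill_limited)
        glEnable(GL_LIGHT1);               /* extra per-vertex lighting work */

    gluSphere(quad, 1.0, slices, stacks);
}
```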
Different programs (or portions of programs) stress different parts of the pipeline, so it is important to understand which elements in the graphics pipeline are the bottlenecks for your program.
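One common way to locate the bottleneck is to vary the load on a single stage and see whether the frame rate responds. For example, redrawing the same scene into a much smaller viewport reduces only the rasterization work: if the frame rate rises sharply, the program was fill-limited; if it barely changes, look at the transform stage or the CPU instead. A sketch of the idea follows; draw_scene() and the timing code are placeholders for your application's own.

```c
#include <GL/gl.h>

/* Render the identical scene at full size and at one-quarter size,
 * timing each pass.  Only the pixel-fill workload changes between
 * the two passes, so the difference isolates the rasterization stage. */
void measure_fill_limit(int width, int height)
{
    glViewport(0, 0, width, height);           /* full-size pass  */
    /* ... time several frames of draw_scene() ... */

    glViewport(0, 0, width / 4, height / 4);   /* tiny viewport   */
    /* ... time the same frames again and compare ... */
}
```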
Note that in a software implementation, the CPU does all the work for every stage. As a result, it never pays to add work to one stage simply because another stage dominates: the stages run serially on the same processor, so any extra work increases the total amount of work for the CPU and decreases performance.
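Revisiting the earlier invented numbers makes the contrast concrete: in the software case the stage costs add rather than overlap, so "free" extra work no longer exists.

```c
#include <stdio.h>

int main(void)
{
    /* Same illustrative per-frame costs as before, in milliseconds. */
    double transform_ms = 4.0;
    double raster_ms    = 9.0;

    /* A pure software renderer executes the stages serially, so the
     * frame time is the SUM of the stage costs, not the maximum.     */
    printf("software frame time: %.1f ms\n",
           transform_ms + raster_ms);                  /* prints 13.0 */

    /* Extra vertex work now adds directly to the frame time.         */
    transform_ms += 2.0;
    printf("after extra vertex work: %.1f ms\n",
           transform_ms + raster_ms);                  /* prints 15.0 */
    return 0;
}
```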