Performance
This is a subject that most programmers believe they understand once they get past the O(n) vs. O(n^2) stage, but very few really feel in the bones. Yes, it is true that for a simple chunk of code that is doing a bunch of work and is O(n^2), changing the algorithm to be O(n) or O(2n) is going to have a much bigger effect than tweaking a few lines within the code and leaving the overall complexity alone. Right? Isn't it?
Usually.
Locality Is King
| Location | Latency | Data Size |
|---|---|---|
| CPU (Registers) | None | Teeny |
| L1 Cache | Essentially None | Tiny |
| L2 Cache | Little | Some |
| RAM | Some | Some - Lots |
| Page File (HD) | Lots | Lots - Tons |
| LAN | Tons | Lots - Boatloads |
| Broadband | Ouch | Yes |
Think about latency in the terms of CPU cycles; what is the processor not doing while it is trying to access that memory. In other words, if the memory you want to access is already in a register, there's no wait! The CPU can just use it to do work. If it is in the current cache line, that's just as good (and why serious hard-core down-n-dirty performance tuning can be all about tweaking cache line contents). Once you start hitting L2 cache you're talking about giving up a bunch of CPU cycles, and you need to begin asking a very fundamental performance question:
Could I have recalculated this data during the latency it takes to fetch it?
Code Size
Locality is important for object code, too. Imagine you have an exe with three small functions but they are each more than a virtual memory page away from the others. In the worst case, these functions are very interdependent and called all the time, which would force all three pages to be in memory all the time. Now change the code so they're either all in the same page, they aren't so interdependent, or they aren't called all the time. Any of those solutions could allow two more pages to be swapped out. The smaller your compiled code, the fewer pages it occupies, the smaller the chance of a page fault.
Related to this: Don't load a library unless you need to (investigate delay-load for DLLs your code may use but not every time your application runs), unload it as soon as you can.
Memory Usage
The worst case is having code on a page that is rarely but regularly called that would cause a page fault on a regular basis, or code that causes a lot of different places in memory to be touched and faulted in. Some data structures (such as linked lists) can lead to situations where in order to traverse the structure you are roaming all around memory, something as simple as a pointer dereference can be deceptively expensive if it was pointing to memory that is paged out.
Reducing your program's working set reduces the number of memory pages your application occupies, which reduces the number of potential page faults you can have. It doesn't solve performance problems itself but has a high correlation with things that do.
CPU Cycles
As you think about the best way to code your application, keep in mind that CPUs are going to get faster and faster. Latency is unlikely to get enormously better anytime soon. Code to guard against the bottleneck, meaning take advantage of the processor power and stop caching values that are quickly and locally recalculatable. Let's say you calculate a relatively simple piece of data (it takes 2,000 instructions) and cache it in a global. If at any time the memory that the global is in is paged out and you go to access it you will incur a much larger cost going to fetch it than you would to recalculate it from scratch. Furthermore, as CPUs get faster the relative cost for your decision will increase. If you had recalculated it your users would have seen better performance for their money, not the case if the processor is sitting idle waiting to be fed memory.
Responsiveness
Most GUI programs are not CPU-bound, so the feeling of a 'slow' program often result from operations that access hard drives, CD ROMs, and the network. Any user-initiated operation that can possibly take more than half a second needs to be written in such a way that if it does the application provides:
- In-progress UI. The hourglass mouse cursor isn't enough, but I've seen applications that don't even use it.
- ETA whenever possible, both specific information on how long the operation will take and some active UI so the user knows progress is being made.
- A way to cancel the operation.
A really good application will allow the user to continue to interact with other parts of the UI while waiting for the operation to finish.
Here's the trick: If the operation can take more than half a second but won't necessarily do so you don't want to flash in-progress type UI up and make it go away immediately when things go fast. You may have three levels (the times used are just examples):
- Up until .5 of a second don't do anything. If the operation completes in this time it will feel instant without flashing any UI.
- At .5 of a second if the operation isn't done, switch in hourglass cursor.
- At 2 second if the operation isn't done, switch to in-progress UI which will stay up at least another 2 seconds even if the operation completes before then.
Once you've committed to your cursor change or in-progress UI you need to stick with it long enough for the user to not be annoyed by flashing or unreadable UI. Of course if you can determine ahead of time how long the operation will take do the right thing from the start.
Threading
All of this assumes that you can have interactive UI while your program is doing other things. For this, you need multiple threads. This is a major danger area of computer programming, only a small percentage of programmers seem to really wrap their heads around the concepts and are able to extract themselves from the pitfalls (I don't know of any who have never fallen in them).
Here are some ways to simplify the problem.
Single UI Thread
Only have one thread that shows, modifies, manipulates, and interacts with the UI. This thread should be totally event-driven and never do heavy lifting itself. No file I/O, no network I/O, no I/O of any sort whatsoever. If you're making API calls from it they must either be asynchronous to begin with or have no chance of blocking.
Thread Pool
To do any heavy lifting or any I/O at all you want to have a pool of pre-created threads, any of which can do any piece of work. They should never have to interact with each other, the only communication they ever do is with the UI thread in very specific and well understood ways.
They are pre-created for two main reasons because of the cost of creating and destroying threads, which is significant. One aspect of the cost that isn't always well understood goes back to the locality stuff I talked about earlier: Every time you create or destroy a thread it calls into every DLL you have loaded to tell it. This is so the DLLs can do any per-thread processing or cleanup they need to. These DLLs may have been paged out forever, you're almost guaranteeing that you'll begin a page-fault fest.
By creating a pool of threads all at once you solve a set of problems nicely:
- You're allocating all this data all at once in a tight section of memory (locality!).
- You will be accessing the same code in all these DLLs for each thread in a brief span of time, so at worst all the pages will only be faulted in once per DLL and you get the extra threads for no additional cost.
- If any of the DLLs do allocate information per thread, it will also have locality relative to related information about the other threads, making your worker threads more interchangeable.
Now your thread synchronization load is simplified to a smaller number of pitfalls:
- Assignment of work items (probably in a queue) to threads in the pool. The UI thread should be the only writer, and either one thread can be designated the reader and assign out the tasks to the others in the pool or you can write some interesting sync logic so they can wake themselves and the first one to take an item wins.
- Signaling job completion and status to the UI thread. Once each worker finishes their task, they need to store relevant information and final status/error information in a well known place and signal the UI thread. When the UI thread receives the signal it knows where to find the information and assumes ownership of all the memory involved (and is responsible for cleaning it up).
These aren't trivial problems, but they are well understood ones. By reducing the possible surface area of your code that deals with threading issues (by avoiding having worker threads be able to add work items to the queue, for instance) you minimize the amount of code and the amount of debugging that will need to be done to get it right.
How many threads should be in your pool? That depends heavily on what your application does, but I recommend starting with very few (maybe one or two) and doing testing to see if you need more. You probably won't. Marginal utility will drop quickly, so for each thread you add you will get less of a performance boost for the same incremental cost.
COM
All I can really say here is that if you are using COM heavily and/or using COM-based APIs, you had better thoroughly understand how the different threading models work, how they interact, and all the fun pitfalls you're going to encounter. Generally I would lump COM calls in with I/O for the purposes of this discussion. Don't do them in the UI thread unless you know it won't block and understand why (and furthermore understand why you are probably wrong and have a very good reason to think you aren't for this specific case).
UI Performance
- Debug UI painting/flicker/etc. type problems using Remote Desktop across a slow connection.
- Spend the CPU cycles to calculate the minimum amount of redraw you need.
- Use common controls whenever possible, let Windows do the work for you.
Network Performance
- Debug using a very slow or purposefully slowed connection.
- Capture the traffic associated with your connection and go over it.
- Batch requests and responses to mitigate latency.
- Byte count isn't as important as latency in most cases.