Anyone who has done much programming, especially web programming, has had to dig into a performance issue from time to time. I've had to do this numerous times on numerous projects, and the causes can be incredibly varied. Sometimes the culprit is a non-indexed table, other times large objects are loaded multiple times instead of being shared throughout a page's lifecycle, or maybe someone decided a cursor was a good thing (a cursor is never a good thing, by the way).
This brings me to this week. As some of you know, the application I'm currently working on is a large (> 2M lines of code), redistributed, web-based management application. There has always been talk of it being slow in places, but nothing quantitative until recently. So this week the testing department ran timing tests comparing the current production version of the application against the version currently in Beta and found that a few parts of the application are performing 3x slower than before. This isn't good.
Enter dotTrace. I'd heard about this profiler from multiple sources, most recently on Twitter, and it sounded like a good way to see what's going on. After a quick look it still seemed like a good fit, so I downloaded it and started the 10-day trial. It didn't take long for me to completely fall in love with dotTrace. I immediately recommended it to the other developer doing performance evaluation, and soon my manager was looking to buy a few licenses, but I'm jumping ahead.
This brings us to what I found. On initial testing nothing stood out as a bad offender; yes, the pages in question weren't great, but nothing jumped out as a real trouble spot. Then I was finally able to load the real customer database that the timing tests had been run against and re-ran the test.
This is what I first saw:
Wow. 95% of processing and > 17s all in Page_Load. That's not entirely unexpected, since most method calls happen within that phase of the page lifecycle, but still, the numbers jumped out like a red flag attached to a 2,000 lb bull being chased by a locomotive. So I dug a little deeper and found this:
Hmmmm, 10,802 calls to GetGroupSummaryChild, each doing auditing that results in 10 calls to GetCustomAttributes. That's 108,020 reflection calls, which, even at only 0.1 milliseconds each, add up to almost 11 seconds. Totally unacceptable.
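The pattern behind those numbers looks something like this sketch. GetCustomAttributes is the real .NET reflection call and GroupSummaryChild is the real class name from the trace, but the properties and the PrepareAuditMetadata method are stand-ins I've invented for illustration:

```csharp
using System;
using System.Reflection;

// Illustrative sketch only: per-object audit setup that reflects over every
// property of every object it loads. With 10,802 objects and ~10 properties
// each, this is how one page ends up making 108,020 GetCustomAttributes calls.
public class GroupSummaryChild
{
    public int Id { get; set; }
    public string Name { get; set; }
    // ... another eight or so properties in the real class ...

    internal void PrepareAuditMetadata()
    {
        foreach (PropertyInfo prop in GetType().GetProperties())
        {
            // Each call walks the attribute metadata for one property. At
            // roughly 0.1 ms per call, doing this per property, per object,
            // per request turns into seconds of pure reflection time.
            object[] auditAttributes = prop.GetCustomAttributes(inherit: true);
            // ... decide which properties need audit records ...
        }
    }
}
```

Since custom attributes can't change at runtime, the usual mitigation for this pattern is to cache the lookups per type rather than repeating them for every instance.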
So I looked a little deeper and found that there are 10,802 groups in the system. Not all of them are needed for this page, so the first thought was to create a call that returned only the groups being used. That's still a good solution, but when I dug even deeper into why pulling those groups was slow, I noticed the auditing calls that in turn make the reflection calls, which is where much of the time is actually being spent. Which brought up the question: "Why is this being audited? It's a read-only collection fetch."
As it turns out, a boolean was hard-coded to true for all loads of GroupSummaryChild, and there was never any update code in place. The result was reflection calls to load attributes that would only be needed if auditing ever ran, which it never did. Flipping that hard-coded boolean from true to false took the total execution time of the page from nearly 18 seconds to under 2.5 seconds, with everything else held constant. There is still room to improve the page's performance, but it now falls into the acceptable range. Here is the trace file after just flipping that boolean to disable the loading of unused audit data:
Vast improvement there. We could have stopped at that point, but we kept going and decided to also address the issue of loading > 10,000 groups when we only needed about 25. Seems like a good idea, right? In our system this means not just adding an overload for the GetCollection method: the new signature also has to be passed up through interfaces and factories (we use reflection a lot), and of course the stored procedure needs to be modified as well. Overall that would mean changes in around half a dozen places, none of them major, but files touched all the same.
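To give a feel for the plumbing involved, here is a rough, hypothetical sketch of what threading a filtered overload through those layers looks like, reusing the GroupSummaryChild stand-in from the earlier sketch. Only GetCollection and GroupSummaryChild are real names from the codebase; the interface, provider class, stored procedure, and parentId parameter are my guesses:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hypothetical layering; the real interfaces and factories differ. The point
// is how many files one new overload has to touch.
public interface IGroupSummaryChildProvider
{
    IList<GroupSummaryChild> GetCollection();             // existing: every group
    IList<GroupSummaryChild> GetCollection(int parentId); // new: only what the page needs
}

public class SqlGroupSummaryChildProvider : IGroupSummaryChildProvider
{
    private readonly string _connectionString;

    public SqlGroupSummaryChildProvider(string connectionString)
    {
        _connectionString = connectionString;
    }

    public IList<GroupSummaryChild> GetCollection()
    {
        return GetCollection(parentId: -1); // sentinel meaning "all groups"
    }

    public IList<GroupSummaryChild> GetCollection(int parentId)
    {
        var results = new List<GroupSummaryChild>();
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand("dbo.GetGroupSummaryChildren", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            // The stored procedure needs the matching new parameter as well.
            cmd.Parameters.AddWithValue("@ParentId", parentId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // ... map the row to a GroupSummaryChild ...
                }
            }
        }
        return results;
    }
}
```

On top of that, the reflection-based factory layer has to know about the new signature, which is why even a "small" change like this ends up touching half a dozen files.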
Well, unbeknownst to me, the exact overload I needed had been added about three weeks earlier (the page calling the old version was built about two years ago), so all I had to do was modify the page to use the new version of GetCollection, passing in the correct parameters, and I was done. Here's the trace after that fix was also tested:
That's right, under 1 second, with the exact same functionality as before any changes were made. For everything this page is doing, that's not bad at all.
Best of all, everything from running the initial trace against real data to testing the first fix against real data took maybe 30 minutes. Well worth the price of a dotTrace license; finding and fixing that bottleneck without a good tracing tool would have been far more difficult and time-consuming.