Collecting performance profiles in Azure WebApps on demand

Agenda: all of a sudden your application starts responding slowly. What to do?

To figure out why it is slow, you need to understand what happens internally: what is each thread doing? How is the wall-clock time spent while processing your request? The source of that data is dynamic profiling.

Azure WebApps can collect performance profiles on demand via Kudu Advanced Tools:

Advanced Tools (aka Kudu) provides a collection of developer oriented tools and extensibility points for your App Service Apps.

The profiler should be switched on for about half a minute from the Process Explorer view:

It just takes a few steps to perform the collection:

  • Once Kudu is opened, switch to the Process Explorer tab
  • Select w3wp.exe in the list of processes (not the one labeled scm, as that one is the Kudu UI itself)
  • Make sure Collect IIS Events is checked
  • Click Start Profiling
  • Wait for 20-30 seconds
  • Click Stop Profiling

Wait a minute for instance-name-process.diagsession to be produced and downloaded. That is much, much simpler than collecting performance profiles in Docker.

Once the performance profile is downloaded, we can jump to analysis with the help of PerfView.

In extreme cases (the application turns completely unresponsive) the collection will time out; instead, you'll need to collect a few sequential memory snapshots and inspect what each thread is doing.
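
ProcDump can take such a series in one go; a hedged sketch (-s sets the pause between dumps in seconds, -n the number of dumps, <pid> is your process ID):

procdump.exe -accepteula -ma -s 10 -n 3 <pid>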

Memory snapshots in Azure Web Apps

A memory snapshot can explain why an app misbehaves. Collecting one locally is a straightforward process:

  • Download ProcDump
  • Find the process ID in Task Manager
  • Trigger on-demand collection: procdump.exe -accepteula -ma 123 (where 123 is the process ID)

Why doesn't that work in the cloud?

Although Kudu can generate memory snapshots on demand, it frequently fails in real life:

Kudu's native way to collect a snapshot

The operation attempts to capture a file a few GB in size and download it over plain HTTPS; a connection timeout or a spell of bad network weather brings the whole thing to a full halt.

How to do it better?

Azure WebApps already ship with Sysinternals tools, including ProcDump, so command-line collection is possible with the help of the Kudu console.

The resulting file lands on the file system and can be downloaded at any later time.

Collection time

Let's copy ProcDump into the local\temp folder, as snapshots will be saved there:

D:\devtools\sysinternals> copy-item .\procdump.exe -destination D:\local\Temp

ProcDump was copied successfully:

D:\local\Temp> ls

The next step is to find the ID of the process we'd like to snapshot (12628 in our case):
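
If you prefer the console over the Process Explorer view, a hedged alternative (Kudu's Debug Console also speaks PowerShell; the Kudu process itself can be recognized by the app pool name in its command line):

D:\local\Temp> get-ciminstance Win32_Process -filter "name='w3wp.exe'" | select ProcessId, CommandLine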

Time to trigger collection with accepteula flag 😉
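
The command mirrors the local one from the previous section, pointed at our PID:

D:\local\Temp> .\procdump.exe -accepteula -ma 12628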

The resulting file is close to 2 GB in size. Since 7-Zip is included in the OOB WebApp, we can optionally archive the snapshot:
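
Something along these lines (a sketch: it assumes 7za.exe is reachable on the PATH, and the dump name is shortened, as ProcDump timestamps the real one):

D:\local\Temp> 7za.exe a snapshot.7z .\w3wp.dmp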

The compressed file is ~310 MB. That still might be too much for HTTPS.

FTP to transfer large files

D:\home is already exposed for WebApp deployments via FTP, so let's copy the file there:
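
Mirroring the earlier copy-item call (with the archive produced in the previous step):

D:\local\Temp> copy-item .\snapshot.7z -destination D:\home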

The publishing profile contains the FTP credentials needed to reach the file system:

Upon connecting, we see the snapshot waiting to be downloaded at high speed:

Summary

We've triggered memory dump collection on the WebApp side using software already installed there. The resulting file can optionally be archived. The fastest way to download the snapshot is FTP.

Performance crime: Concat

Quiz time

A brainstorm to begin with: which line is faster, and why?

var language = new object();
var version = new object();

var a = string.Concat(itemId, language, "¤", version);
var b = string.Concat(itemId, language.ToString(), "¤", version.ToString());

Answer

Although both lines look nearly identical, a is 30% slower because the overload with params gets picked:

As you might have guessed (from the previous post), params forces an array allocation (which I treat as a crime):

Calling ToString explicitly cuts out one third of the wall-clock time (1076 vs 767):

Does the code look familiar?

The sample code looks like a suitable candidate for an item cache key (in fact, it is). Not only does it produce a new string on each go, it also forces brute-force string comparison instead of direct by-field comparison. We can easily fix those flaws and cook up a faster version:

struct Key : IEquatable<Key>
{
    private readonly ID _id;
    private readonly Language _language;
    private readonly Version _version;

    public Key(ID id, Language language, Version version)
    {
        _id = id;
        _language = language;
        _version = version;
    }

    ...

    public static bool operator ==(Key a, Key b) => Equals(a._id, b._id) && Equals(a._version, b._version) && Equals(a._language, b._language);
}

The struct-based key uses by-field comparison and needs no allocations, resulting in 4x faster execution:
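
A minimal usage sketch (the cache dictionary is illustrative, and it assumes the elided Equals/GetHashCode members follow the same by-field pattern):

    // No string is ever built to index the cache.
    var cache = new Dictionary<Key, Item>();
    var key = new Key(item.ID, item.Language, item.Version);
    cache[key] = item;                  // hashed and compared field by field
    var found = cache.TryGetValue(key, out var cached);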

Summary

Joining strings to build a cache key might be four times slower than doing it right with structs. You'll tax system users with ~50 ms on each page request (real-life numbers).

On the one hand, 1/20 of a second sounds very little.

On the other hand, the delay is brought by just a few lines of code, whereas enterprise-level frameworks have huge code bases with non-zero chances of hiding more than one piece of non-optimal code.

Make unit tests public

Public unit tests for a framework are a win-win for both the vendor and the framework's consumers. Here are my top 5 reasons.

Led by example: Unicorn

Unicorn is a popular serialization mechanism with >1.3M NuGet downloads!

All you need to get started is to download the package and read the readme. Custom requirements (e.g. storing only a specific part of the tree) are simple to implement, as the test configuration already provides all the samples you could ever need, for example:

	<include name="Basic" database="master" path="/sitecore/layout/Simulators">
		<exclude path="/sitecore/layout/Simulators/Android Phone" />
		<exclude path="iPhone" /> <!-- relative path exclusion based on parent include path -->
		<exclude path="iPhone Apps/1.0" /> <!-- exclude where name starts with a different include -->
	</include>

Wondering about processing order and dependent configs? The answers are already waiting for you in the configuration dependency tests.

Led by example: Sitecore.NSubstituteUtils

Sitecore.NSubstituteUtils allows mocking core Sitecore objects (Item/Database) and much more; see Unit testing in Sitecore. The learning curve is simply looking at the tests:

        [Theory, InlineAutoData("test field", "test value")]
        public void ItemIndexerByFieldId_AfterAdd_ReturnsAddedValue(string fieldName, string addedFieldValue, ID fieldId)
        {
            // Arrange: build a fake item carrying the field.
            var fakeItem = new FakeItem();
            fakeItem.Add(fieldId, fieldName, addedFieldValue);
            Item item = fakeItem;

            // Act: read the field back through the indexer.
            var actualFieldValue = item[fieldId];

            // Assert: the added value is returned.
            actualFieldValue.Should().Be(addedFieldValue);
        }

All features can be discovered by looking through the test names. What could be better than reading neat test code with the option to execute it in your environment on demand?

Never misleads you

What could be worse than non-existent documentation? Wrong or outdated documentation.

Since unit tests are tied to the build process, they never go stale. Even better, the maintenance effort to keep them in a healthy state is already included. Welcome, Living Documentation!

Brings light where you need it

Have you ever wondered how Sitecore.Pipelines.HttpRequest.ItemResolver works?

Does it honor the item display name over the name? How does it deal with encode replacements or security access?

The logic must be guarded by the vendor against behavior changes, so the code should already be covered with tests. And yes, were those tests public, you could debug the logic yourself!

Shortcut for bug reports

A pull request with a failing unit test can reach the Product Department the same day you find the bug! You no longer need to spend ages hand-picking reproduction steps to make Customer Service happy enough to raise a bug on your behalf.

Increases product code quality

Code must be in good shape (e.g. without hidden dependencies) to be a subject for unit testing.

Summary

A framework consumer gets the following benefits:

  • Simplified learning curve, led by example
  • Always accurate documentation
  • Ability to verify the behavior of system parts against the data fed in

A framework vendor gets in return:

  • Accurate code-level bug reporting
  • Automated documentation maintenance
  • Increased code quality thanks to a growing safety harness


Myth busting: Memory snapshots

Photo by Chris Peeters on Pexels.com

Find out the slowness cause from a single snapshot

Much like the image on top: can you identify why the car is ahead? No, unless something very obvious is going on (a burning engine). For that purpose you rely on performance profiles.

Are there any controls/combos to navigate through the snapshot?

Yes, I’ve handpicked the useful commands here.

Check cache statistics

Yes, in-memory caches can be located by type and inspected:
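
For instance, with WinDbg and SOS loaded against the snapshot, something along these lines (the type name is illustrative):

0:000> !dumpheap -type Sitecore.Caching.Cache
0:000> !do <object-address>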

Inspect what operations are being performed

Yes, you can see what each thread was doing at the time of snapshot generation. If the application was hanging, you'll see why.
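
A hedged sketch with the same WinDbg and SOS tooling; this one-liner prints the managed call stack of every thread:

0:000> ~*e !clrstack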

Is the Docker file system slow?

Let's measure how fast file operations are performed in Docker:

Microsoft's diskspd.exe tool will generate the load:

  • Operate on a 2 GB file (-c2G)
  • A few seconds of warm-up and cool-down (-W2, -C2)
  • Block size of 128K (-b128K)
  • 4 threads (-t4) with 16 outstanding I/Os each (-o16)
  • Random access (-r) with 60% writes / 40% reads (-w60)
  • Run the test for 2 minutes (-d120)

diskspd.exe -b128K -d120 -o16 -t4 -r -c2G -w60 -W2 -C2 filepath

No other resource-consuming activities ran on the system during the test.

Test specs

Docker Desktop v.2.2:

Tested docker version

The host Windows version is 1909:

Even though the same hardware is used inside and outside the container, for the record: the benchmark ran on a year-old SSD and an 8-core CPU.

Results

Surprisingly, the local file system is 10 times faster than the one inside the container:

Results, summarized

Disclaimer: at the moment of the article's initial publication, I do not have a solid technical explanation for why that happens. Detailed results can be found below.

Baseline: local file system

Outside container, local results

Inside container: exposed volume

Exposing volume outside

Inside container: do not expose volume

Do not expose file system outside container

Performance crime: hidden culture-specific parsing

Respecting the user's culture while consuming user input is totally correct and legal. Hiding the fact that a value is parsed with a specific user culture, however, may cost a dime.

How expensive could that be?

Make a few predictions to begin with:

  • What is the most time-consuming Sitecore hook?
  • How much time might it take to initialize?
  • What is the most time-consuming operation in that hook?

Did you predict ~3/4 of a second spent just to transform 00.00:05:00 into a TimeSpan?

Flame graph scoped to HookManager callees

Sitecore.Configuration.Settings.GetTimeSpanSetting(string name, string defaultValue) would secretly parse defaultValue using the context user culture, which is stored in the user profile and is quite expensive to fetch on first run:

3/4 of a second is spent fetching the UserProfile culture

The neighboring API uses CultureInfo.InvariantCulture for reading config-based values, which makes perfect sense for the task:

Only a few consumers rely on the performance-criminal ParseTimeSpan API in 9.3:

Summary

Should you need to introduce a parse method that respects the user culture, please make the dependency explicit. Otherwise your intent might be misused, with a price tag attached.

In our example, 3/4 of a second of startup time could be saved by:

  • Having the default value set as TimeSpan.FromMinutes(5), as sketched below
  • Documenting that DateUtil.ParseTimeSpan uses the context user culture
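
A minimal sketch of the first option; the helper below is hypothetical rather than the actual Sitecore API:

    using System;
    using System.Globalization;

    static TimeSpan GetTimeSpanSetting(string configuredValue, TimeSpan defaultValue)
    {
        // Config values are machine-written, so parse them with the invariant
        // culture; no user-profile culture lookup is ever triggered.
        return TimeSpan.TryParse(configuredValue, CultureInfo.InvariantCulture, out var parsed)
            ? parsed
            : defaultValue; // e.g. TimeSpan.FromMinutes(5) instead of "00.00:05:00"
    }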

Reviewing docker performance profile

We profiled the application start in the previous post and will carry on with the wall-clock time investigation; let's start by looking at the Thread Time report:

Selecting the Thread Time report

The first thing to notice is a set of processes running next to w3wp:

  • csc is the C# compiler at work; every entry takes at least ~4.6 seconds (and delays the application start as well)
  • Robocopy is a utility dealing with file sync

Processes inside container

The flame graph for w3wp looks quite sparse so far, with little of use:

Initial view of the w3wp

That happens because all threads are displayed without filtering. Let's include only Application_Start-related entries by extending the IncPats filter:

Showing only threads that have Application_Start in call stack

Let's hide/fold the tiny bits that take <4% of the time to get a high-level understanding:

See nodes with at least 4% of overall time taken

Folding method wrappers to get a cleaner output:

  • dynamicclass – a runtime wrapper around Sitecore config processors
  • CorePipeline – a runtime wrapper around Sitecore pipelines
  • Reflection – .NET reflection used to invoke methods/properties

Cleaner output with lower flames

The top-right method belongs to Sitecore.CES; let's focus on it by changing IncPats:

Three threads mention Sitecore.CES

There are three threads in the picture (the second frame from the bottom is divided into three):

  • 2796: sleeps all the time and adds no value; can be skipped
  • 2780: the Application_Start thread, blocked until initialization is done
  • 1612: performs the initialization

Removing 2796 from view by adding it to ExcPats:

How CES blocks Sitecore startup

It seems that Sitecore attempts to load the device detection database to the file system whenever a container is created, which takes ~23 seconds of startup time.

The solution is simple: persist the device detection folder via docker-compose:

  - .\data\cm\DeviceDetection:C:\inetpub\wwwroot\App_Data\DeviceDetection

Secondly, the DeviceDetection initialization wait timeout can be reduced to a single second via a setting change in a local/dev sandbox:

Setting to control how long to wait for device detection database to be loaded

Peeling off already resolved

After the Sitecore.CES impact on startup was eliminated, the Application_Start investigation can be carried on (excluding Sitecore.CES from view via ExcPats):

Remaining culprits

Further steps

Over 87% of the time spent in await likely indicates some room for improvement:

Almost 90% of time is spent awaiting

Every remaining flame bar is to be inspected (in the same manner as we just did) to understand whether there is potential to do better.

Performance in Docker

Have you ever wondered how to make a system run faster in a container?

PerfView is the tool for measuring performance, right? But a UI-based application running on top of Windows Event Tracing sounds irrelevant for containers. Or does it?

How could it work?

The operating system has to virtualize the events and forward them to the ETW session in the appropriate container. Since containers and the underlying host share a single kernel, the container's base image OS version must match that of the host. If the versions differ, the container may start, but full functionality isn't guaranteed:

Local/host Windows version
Windows version running inside container

Step one: Copy PerfView into container

Copy the PerfView files via the volume mechanism; I'll reuse the logs folder for the sake of speed:

Already existing volume to output log files
The content of the PerfView folder on host machine

Step two: Connect to container with cmd

List running containers via docker container ls:

Launching command line inside container

Then connect to the Sitecore instance to be profiled (CM in our case) via the docker exec -i -t [container_id] cmd command.

Switch to the exposed folder and verify PerfView is inside the container:

Inspecting the content inside container

Step three: Trigger collection

Prepare the scenario you want to profile, and trigger profiling via the command:

PerfView.exe /AcceptEULA /threadTime /logFile=log.txt /LowPriority /maxCollectSec=80 collect -Providers:AFF081FE-0247-4275-9C4E-021F3DC1DA35,DD5EF90A-6398-47A4-AD34-4DCECDEF795F,7B6BC78C-898B-4170-BBF8-1A469EA43FC5

The ETW providers installed inside the container can be listed via logman query providers:

Listing existing providers

Step four: Is collection running?

Even though it might look like nothing happened upon invoking PerfView, it is busy collecting, as confirmed by log.txt:

PerfView logging started at 9/2/2020 3:55:31 PM
[EXECUTING: PerfView /AcceptEULA /threadTime /logFile=log.txt /LowPriority /maxCollectSec=80 collect -Providers:AFF081FE-0247-4275-9C4E-021F3DC1DA35,DD5EF90A-6398-47A4-AD34-4DCECDEF795F,7B6BC78C-898B-4170-BBF8-1A469EA43FC5]
Circular buffer size = 0, setting to 500.
[Kernel Log: C:\inetpub\wwwroot\App_Data\logs\PerfView\PerfViewData.kernel.etl]
Kernel keywords enabled: ThreadTime
[User mode Log: C:\inetpub\wwwroot\App_Data\logs\PerfView\PerfViewData.etl]
Enabling Provider:PerfViewLogger Level:Verbose Keywords:0xffffffffffffffff Stacks:0 Values: Guid:a8a71ac1-040f-54a2-07ca-00a89b5ab761
Turning on more CLR GC, JScript and ASP.NET Events.

What it actually did was launch the GUI, which you don't see, and detach from the current console. The profile becomes visible in the PerfView folder once collection is done:

The result of the collection

Result confirmation

A quick look at the flame graph shows some potential slow points to address:

Sample flame graph

  • The leftmost bar is DeviceDetection, costing ~10% (or ~5.5 seconds) of the time
  • The middle bar is the Nexus licensing check, taking ~30% (or ~16 seconds) of the time
  • The first bar is the Analytics subsystem start, taking ~57% (or ~30 seconds) of the time

Definitely, the profile needs a deeper look in a separate article.

Summary

PerfView collection inside a container runs the same way as in the good old days, except for the lack of a user interface; the only difference is driving it via command-line arguments.

All you need to do is copy PerfView inside the container and launch it at the right point in time (the slow operation).

Performance crime: Params

You might have heard that params is evil. Have you ever seen the performance bill for it?

How to find out where params are used?

The memory allocation report shows all allocations; we'll start from there:

A method with params creates an array behind the scenes, so we hunt for object[]:

The IL code proves the sad truth that an array is always created:

Even though there is a faster Concat overload with four string arguments, it is not picked due to the type mismatch. A similar issue occurs in Item.GetUniqueId():

How faster could it be without params?

2.7 times faster by simply calling .ToString before combining the values. Impressive, huh?

        [Benchmark]
        public void Stock()
        {
            var data = _item.InnerData;
            for (int i = 0; i < N; i++)
            {
                var clone = new Item(_id, data, database);
                // GetUniqueId concatenates via Concat(params object[]):
                // every argument is boxed and an array is allocated.
                var uniqueId = clone.GetUniqueId();
            }
        }

        [Benchmark]
        public void NoBoxing()
        {
            var data = _item.InnerData;
            for (int i = 0; i < N; i++)
            {
                var clone = new Item(_id, data, database);
                // Explicit ToString lets the compiler bind to
                // Concat(string, string, string, string): no array, no boxing.
                var uniqueId = string.Concat(clone.Database.Name, clone.ID.ToString(), clone.Language.Name, clone.Version.ToString());
            }
        }

2.4 times less memory is used (omitting justCopyItem from both runs: 27.5 vs 11.47 MB).

Conclusion

Params can make your program slower out of nothing. Luckily, there are automated solutions (like the Roslyn Clr Heap Allocation Analyzer) that help detect this kind of issue in the early development stages.

Although modern compilers produce smarter code that avoids array creation in this case, new language features (string interpolation) bring new challenges.