Performance crime: Non-unique strings

Problem statement

The application experienced a gradual increase in memory usage over the years, becoming notably problematic when it reached 20 GB of RAM.

Although memory snapshots revealed that more than 90 million custom objects were in memory, developers asserted all these objects were being utilized, owing to intricate business rules and calculations within the domain.

Introduction

Even if the development team is willing to re-architect the solution, a complete rebuild would take at least a year. Meanwhile, the business is currently grappling with performance issues.

In other words, if the situation continues, there may not be a business left to rebuild the solution for. The objects used by the application appear to be standard models, consisting of strings, numbers, and enums; the development team hasn’t engaged in any practices that would be considered “illegal” in software development terms.

Data distribution

A few things are notable about the largest RAM consumers:

Free ranges suggest that large objects are being instantiated, likely due to data arriving in massive batches.

While forcing garbage collection (GC) to compact the Large Object Heap (LOH) can reclaim space, this approach treats the symptom rather than the underlying cause, constituting a performance anti-pattern.

Despite this, we’ve managed to reduce LOH fragmentation and reclaim approximately 4 GB of RAM, which is a good start.
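For reference, .NET exposes a supported way to request a one-off LOH compaction on the next blocking collection; this is exactly the symptom-level mitigation discussed above, not a fix for the root cause:

```csharp
using System;
using System.Runtime;

// Ask the next full blocking GC to compact the Large Object Heap once.
// The setting automatically resets to Default after that collection completes.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
```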

Why so many strings?

There are nearly one million instances of “true” or “false” strings in memory, and they are not getting wiped by the GC:

These are ‘rooted’ objects that are genuinely in use, yet each holds its own copy of the same string. We identified the owners of these strings as application models and modified their setters to use string interning. We opted for a custom interning solution over the built-in .NET mechanism for specific reasons. Upon further investigation, we discovered a similar pattern in a number of other models.

Although data is appropriately stored in a relational database, the materialized view used to populate the models joins the definition table with the actual table, abstracting the storage details from the application. The data reader instantiates a new string each time a row is materialized.
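A minimal sketch of the setter-level interning described above (StringPool, ProductModel, and the length cap are illustrative names and choices, not the actual implementation):

```csharp
using System.Collections.Concurrent;

// Hypothetical custom intern pool. Unlike string.Intern, it can be cleared
// at will and can refuse to intern oversized values.
public static class StringPool
{
    private const int MaxLength = 128; // skip big values; they are rarely repeated

    private static readonly ConcurrentDictionary<string, string> Pool =
        new ConcurrentDictionary<string, string>();

    public static string Intern(string value)
    {
        if (string.IsNullOrEmpty(value) || value.Length > MaxLength)
            return value;

        // Returns the first instance seen for this value; duplicates become
        // eligible for collection.
        return Pool.GetOrAdd(value, value);
    }
}

// A model setter funnels repeated values ("true"/"false") into shared instances.
public class ProductModel
{
    private string _isActive;

    public string IsActive
    {
        get { return _isActive; }
        set { _isActive = StringPool.Intern(value); }
    }
}
```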

Will it be beneficial?

On one hand, string interning will be effective for model fields with repeatable values. On the other hand, models that contain unique strings could potentially lead to memory leaks.

Given that the memory snapshot contains all the data the application operates on, we can identify fields with a high or low number of unique values. This allows us to implement string interning selectively, based on empirical data:

Despite having over 633,000 class instances in memory, there are 12.5 times fewer unique values (50K):

By enabling string interning for that specific field alone, the expected memory consumption is reduced by a factor of 12. The savings are even greater as we’ve chosen to skip rare/big values.

With over a dozen such places located, significant memory pressure relief has already been achieved, but is there further room for optimization?

Spoiler alert: yes. Since the object count is huge, a few bytes saved on each object translate into a massive overall win. Stay tuned for the second part.

Performance crime: Reflection

Problem statement

A commerce customer with phantom slowness recently asked for help: random API duration spikes.

Introduction – not as easy

Enabling pipeline tracing was not of much use due to:

  • A huge number of records per operation
  • Many concurrent blocks executed in parallel

A single commerce flow API call produced over 5.6K log lines in the test environment, which must have carried an extra cost due to the observer effect.

ETW was not enabled (no event source provider in the compiled code), and 99% of memory snapshot collection attempts caught the GC running, leaving the heap in a non-traversable state.

Load test to find the bending point

A series of ‘lightweight’ performance tests was launched to find the bending point, and memory snapshots were collected along the way. RAM turned out to be overcrowded with reflection trails:

These leftovers remain from ReflectionPipelineBlockRunner, the stock mechanism in XC to execute processors/blocks; according to pipeline tracing, we saw 5.6K block executions in ~15 seconds for just one API call:

Faster block runner with expression trees

Instead of finding the method on each go, we can compile it once and use it all the way down:
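As an illustrative sketch (SampleBlock and the delegate shape are assumptions, not actual XC code), an expression tree can compile the reflection lookup into a reusable delegate:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical block standing in for a pipeline block with a Run method.
public class SampleBlock
{
    public string Run(string arg) => arg ?? "fallback";
}

public static class CompiledRunner
{
    // Compile "block.Run(arg)" into a delegate once; every later call skips
    // the MethodInfo lookup and MethodInfo.Invoke boxing entirely.
    public static Func<SampleBlock, string, string> Compile()
    {
        MethodInfo run = typeof(SampleBlock).GetMethod("Run");
        ParameterExpression block = Expression.Parameter(typeof(SampleBlock), "block");
        ParameterExpression arg = Expression.Parameter(typeof(string), "arg");
        MethodCallExpression call = Expression.Call(block, run, arg);

        return Expression.Lambda<Func<SampleBlock, string, string>>(call, block, arg).Compile();
    }
}
```

Usage: compile once at startup (`var run = CompiledRunner.Compile();`), then call `run(block, arg)` per invocation.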

Invocations become 10x faster with a quarter of the allocated objects:

Why does this make perfect sense?

A large portion of blocks are designed with quick-exit checks, e.g.:

if (arg.Entity != null) return arg;

The idea is visible in log-driven profiling since 92% of block executions take <0.1 sec:

It is far more costly to create the block, arrange its execution, and find the ‘Run’ method via reflection, only to return 92% of the time right after the initial check.

What could we observe more from the snapshot?

  • Pipeline profiling always runs, even when the config says the results should not be flushed.
  • Stateless blocks are always created via DI even though they could have been reused.
  • The is null check produces fresh objects on each go:
  • Why on Earth would you need ~15K ReaderWriterLockSlim primitives for just a few API calls?

Summary

The memory snapshot has shown the performance speedup potential. Some gains were made without even enabling ETW, which would have required touching customer code.

Collecting performance profiles in Azure WebApps on demand

Agenda: all of a sudden your application starts responding slowly. What to do?

In order to figure out why it is slow, you’ll need to understand what happens internally: what is each thread doing? How is the wall-clock time spent while processing your request? The source of that data is dynamic profiling.

Azure WebApps can collect performance profiles on demand in KUDU advanced tools:

Advanced Tools (aka Kudu) provides a collection of developer oriented tools and extensibility points for your App Service Apps.

The profiler should be switched on for half a minute from the Process Explorer view:

It just takes a few steps to perform the collection:

  • Once Kudu is opened, switch to the Process Explorer tab
  • Select w3wp.exe in the list of processes (not the one labeled scm, which is the Kudu UI itself)
  • Ensure Collect IIS Events is checked
  • Click Start Profiling
  • Wait 20-30 seconds
  • Click Stop Profiling

Wait a minute for instance-name-process.diagsession to be produced and downloaded. That is much, much simpler than collecting performance profiles in Docker.

Once the performance profile is downloaded, we can jump into analysis with the help of PerfView.

In extreme cases (the application becomes completely unresponsive) the collection will time out; you’ll need to collect a few sequential memory snapshots and inspect what each thread is doing.

Memory snapshots in Azure Web Apps

A memory snapshot can explain why an app does not behave well. Collecting one locally is a straightforward process:

  • Download ProcDump
  • Find the process ID in Task Manager
  • Trigger on-demand collection: procdump.exe -accepteula -ma 123 (where 123 is the process ID)

Why doesn’t that work in the cloud?

Although Kudu can generate memory snapshots on demand, it frequently fails in real life:

KUDU native way to collect snapshot

The operation attempts to capture a file a few GB in size and then download it over plain HTTPS; a connection timeout or bad network weather brings it to a full halt.

How to do it better?

Azure WebApps already ship with Sysinternals tools, including ProcDump; therefore command-line collection is possible with the help of the Kudu console.

The resulting file lands on the file system and can be downloaded at any later time.

Collection time

Let’s copy ProcDump into the local\temp folder, as snapshots will be saved there:

D:\devtools\sysinternals> copy-item .\procdump.exe -destination D:\local\Temp

ProcDump was copied successfully:

D:\local\Temp> ls

The next step is to find the process ID we’d like to make a snapshot of (12628 in our case):

Time to trigger the collection with the accepteula flag 😉

The resulting file is close to 2 GB in size. Since 7-Zip is included in the OOB WebApp, we can optionally archive the snapshot:

The compressed file is ~310MB. That still might be too much for HTTPS.

FTP to transfer large files

D:\home is already exposed for WebApp deployments via FTP, so copy the files there:

The publishing profile contains the FTP credentials to use in order to reach the file system:

Upon connecting, we see the snapshot waiting to be downloaded at high speed:

Summary

We’ve triggered memory dump collection on the WebApp side using software already installed in the WebApp. The resulting file can optionally be archived. The fastest way to download the snapshot is FTP.

Performance crime: Concat

Quiz time

A brainstorm to begin with: which code is faster, and why?

var language = new object();
var version = new object();

var a = string.Concat(itemId, language, "¤", version);
var b = string.Concat(itemId, language.ToString(), "¤", version.ToString());

Answer

Despite both lines looking almost identical, a is 30% slower due to the overload with params being picked:

As you might have guessed (from the previous post), params forces an array allocation (which I treat as a crime):

Calling ToString explicitly cuts one third of the wall-clock time (1076 vs 767):
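As a self-contained sketch of the overload resolution at play (variable values are illustrative):

```csharp
using System;

object language = new object();
object version = new object();
string itemId = "id";

// No Concat(object, object, object, object) overload exists, so four mixed
// arguments bind to Concat(params object[]): an array is allocated per call
// and each element is ToString()'d inside.
string a = string.Concat(itemId, language, "¤", version);

// With explicit ToString() all four arguments are strings, binding to the
// array-free Concat(string, string, string, string) overload.
string b = string.Concat(itemId, language.ToString(), "¤", version.ToString());

Console.WriteLine(a == b); // same text either way, different cost
```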

Does the code look familiar?

The sample code looks like a suitable candidate for an item cache key (in fact, it is). Not only does it produce a new string on each go, it also forces brute-force string comparison instead of direct by-field comparison. We can easily fix those flaws and cook up a faster version:

struct Key : IEquatable<Key>
{
    private readonly ID _id;
    private readonly Language _language;
    private readonly Version _version;

    public Key(ID id, Language language, Version version)
    {
        _id = id;
        _language = language;
        _version = version;
    }

    public bool Equals(Key other) =>
        Equals(_id, other._id) && Equals(_version, other._version) && Equals(_language, other._language);

    public override bool Equals(object obj) => obj is Key other && Equals(other);

    public override int GetHashCode() =>
        (_id?.GetHashCode() ?? 0) ^ (_language?.GetHashCode() ?? 0) ^ (_version?.GetHashCode() ?? 0);

    public static bool operator ==(Key a, Key b) => a.Equals(b);
    public static bool operator !=(Key a, Key b) => !a.Equals(b);
}

The struct-based key uses by-field comparison and needs no allocations, resulting in 4 times faster execution:

Summary

Joining strings to build a cache key can be four times as slow as doing it right with structs. You would tax system users with ~50ms on each page request (real-life numbers).

On the one hand, 1/20 of a second sounds very little.

On the other hand, the delay is brought by just a few lines of code, whereas enterprise-level frameworks have huge code bases with a non-zero chance of containing more than one piece of non-optimal code.

Make unit tests public

Public unit tests for a framework are a win-win for both the vendor and the framework consumers. Here are my top 5 reasons why.

Led by example: Unicorn

Unicorn is a popular serialization mechanism with more than 1.3M NuGet downloads!

All you need to get started is to download the package and read the readme. Custom requirements (e.g. store only a specific part of the tree) are simple to implement, as the test configuration already provides all the samples you may ever need. For example:

	<include name="Basic" database="master" path="/sitecore/layout/Simulators">
		<exclude path="/sitecore/layout/Simulators/Android Phone" />
		<exclude path="iPhone" />
		<!-- relative path exclusion based on parent include path -->
		<exclude path="iPhone Apps/1.0" />
		<!-- exclude where name starts with a different include -->
	</include>

Wondering about processing order and dependent configs? Answers are already in configuration dependency tests for you.

Led by example: Sitecore.NSubstituteUtils

Sitecore.NSubstituteUtils allows mocking central Sitecore objects (Item/Database) and so much more; see Unit testing in Sitecore. The learning curve is just looking at the tests:

        [Theory, InlineAutoData("test field", "test value")]
        public void ItemIndexerByFieldId_AfterAdd_ReturnsAddedValue(string fieldName, string addedFieldValue, ID fieldId)
        {
            var fakeItem = new FakeItem();
            fakeItem.Add(fieldId, fieldName, addedFieldValue);
            Item item = fakeItem;

            var actualFieldValue = item[fieldId];

            actualFieldValue.Should().Be(addedFieldValue);
        }

All features can be discovered by looking through the test names. What could be better than reading neat test code with the option to execute it in your environment on demand?

Never misleads you

What could be worse than non-existing documentation? The wrong/outdated one.

Since unit tests are tied to the build process, they never become outdated. Even better, they include the maintenance effort to stay in a healthy state. Welcome, Living Documentation!

Brings light where you need it

Have you ever wondered how Sitecore.Pipelines.HttpRequest.ItemResolver works?

Would it honor the item display name over the name? How does it deal with encode replacements or security access?

The logic must be guarded by the vendor to avoid behavior changes, therefore the code should already be covered with tests. And yes, if the tests were public, you could debug them!

Shortcut for bug reports

A pull request with a failing unit test can reach the Product Department the same day you find the issue! You no longer need to spend ages hand-picking reproduction steps to make Customer Service happy and raise a bug on your behalf.

Increases product code quality

Code must be in good shape (e.g. without hidden dependencies) to become a subject for unit testing.

Summary

A framework consumer gets the following benefits:

  • Simplified learning curve, led by example
  • Always accurate documentation
  • Ability to verify system behavior against the data you feed in

A framework vendor gets in return:

  • Accurate code-level bug reporting
  • Automated documentation maintenance
  • Increased code quality thanks to a growing safety harness


Myth busting: Memory snapshots


Find out the slowness cause from a single snapshot

Much like the photo at the top: can you identify why the car is ahead? No, unless something very obvious is going on (a burning engine). You rely on performance profiles for that purpose.


Are there any controls/combos to navigate through the snapshot?

Yes, I’ve handpicked the useful commands here.


Check cache statistics

Yes, in-memory caches can be located by type and inspected:


Inspect what operations are being performed

Yes, you can see what each thread was doing at the time of snapshot generation. If the application was hanging, you’ll see why.

Is the Docker file system slow?

Let’s measure how fast file operations are performed in docker:

The Microsoft diskspd.exe tool will generate the load:

  • Operate on a 2 GB file (-c2G)
  • A few seconds of warm-up and cool-down (-W2, -C2)
  • Block size of 128K (-b128K)
  • 4 threads (-t4) with 16 outstanding I/Os each (-o16)
  • Random access pattern (-r) with a 60% write ratio (-w60)
  • Run the test for 2 minutes (-d120)

diskspd.exe -b128K -d120 -o16 -t4 -r -c2G -w60 -W2 -C2 filepath

No other resource-consuming activities were performed in the system during test.

Test specs

Docker Desktop v.2.2:

Tested docker version

The host Windows version is 1909:

Even though the same hardware is used inside and outside the container, I’ll mention that the benchmark used a year-old SSD and an 8-core CPU.

Results

Surprisingly, the local file system is 10 times faster than the one sitting inside the container:

Results, summarized

Disclaimer: at the moment of the article’s initial publication, I do not have a solid technical explanation for why that happens. Detailed results can be found below.

Baseline: local file system

Outside container, local results

Inside container: exposed volume

Exposing volume outside

Inside container: do not expose volume

Do not expose file system outside container

Performance crime: hidden culture-specific parsing

Paying respect to the user culture while consuming user input is totally correct and legal. Hiding the fact that a value is parsed with a specific user culture, however, may cost a dime.

How expensive could that be?

Make a few predictions to begin with:

  • What is the most time-consuming Sitecore hook?
  • How much time to get initialized might it take?
  • What is the most time-consuming operation in that hook?

Did you predict ~3/4 of a second spent just to transform 00.00:05:00 into a TimeSpan?

Flame graph scoped to HookManager callees

Sitecore.Configuration.Settings.GetTimeSpanSetting(string name, string defaultValue) would secretly parse defaultValue using the context user culture, which is stored in the user profile and is quite expensive to get on the first run:

3/4 second is spent on fetching UserProfile culture

The neighboring API uses CultureInfo.InvariantCulture for reading the config-based value, which makes perfect sense for the task:

Only a few consumers rely on the performance-criminal ParseTimeSpan API in 9.3:

Summary

Should you need to introduce a parse method that pays respect to the user culture, please make the dependency explicit. Otherwise, your intent might be misused, with a price tag attached.

In our example, 3/4 of a second of startup time could be saved by:

  • Having the default value set as TimeSpan.FromMinutes(5)
  • Documenting that DateUtil.ParseTimeSpan uses the context user culture
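The first fix can be sketched as follows (GetTimeSpanSetting here is a simplified stand-in, not the actual Sitecore signature): with a typed default, nothing is parsed and no user culture is fetched when the setting is absent, and an explicit InvariantCulture makes the parse deterministic when it is present.

```csharp
using System;
using System.Globalization;

// Simplified stand-in for a config setting reader: the default is a typed
// TimeSpan rather than a string that must be culture-parsed on every startup.
static TimeSpan GetTimeSpanSetting(string configuredValue, TimeSpan defaultValue)
{
    if (string.IsNullOrEmpty(configuredValue))
        return defaultValue;

    // Deterministic and cheap: no user-profile culture lookup involved.
    return TimeSpan.Parse(configuredValue, CultureInfo.InvariantCulture);
}

// Usage: the default is expressed as a value, not a culture-parsed string.
var timeout = GetTimeSpanSetting(null, TimeSpan.FromMinutes(5));
Console.WriteLine(timeout); // 00:05:00
```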

Reviewing docker performance profile

We profiled the application start in the previous post and will carry on the wall-clock time investigation; let’s start by looking at the Thread Time report:

Selecting the Thread Time report

The first thing to notice is a set of processes running next to w3wp:

  • CSC is compilation work; every entry takes at least ~4.6 seconds (and delays application start as well)
  • Robocopy is a utility that deals with file sync
Processes inside container

The flame graph selected for w3wp seems quite sparse so far, with little of use:

Initial view of the w3wp

That happens because all threads are displayed without filtering. Let’s include only Application_Start-related entries by expanding the IncPats filter:

Showing only threads that have Application_Start in call stack

Let’s hide/fold the tiny bits that take <4% of the time to get a high-level understanding:

See nodes with at least 4% of overall time taken

Folding method wrappers to get a cleaner output:

  • dynamicclass – a runtime wrapper around Sitecore config processor
  • CorePipeline – a runtime wrapper around Sitecore pipelines
  • Reflection – .NET reflection to invoke methods/props
Cleaner output with lower flames

The top-right method belongs to Sitecore.CES, let’s focus on it by changing IncPats:

Three threads mention Sitecore.CES

There are 3 threads in the picture (the second frame from the bottom is divided into three):

  • 2796: Sleeps all the time, does not add any value; can be skipped
  • 2780: Application_Start thread is blocked until initialization is done
  • 1612: Performs initialization

Removing 2796 from view by adding it to ExcPats:

How CES blocks Sitecore startup

It seems that Sitecore attempts to load the device database to the file system whenever a container is created, which takes ~23 seconds of startup time.

The solution is simple: persist the device detection folder via docker-compose:

  - .\data\cm\DeviceDetection:C:\inetpub\wwwroot\App_Data\DeviceDetection

Secondly, the DeviceDetection initialization wait timeout can be reduced to a single second via a setting change in a local/dev sandbox:

Setting to control how long to wait for device detection database to be loaded

Peeling off already resolved

After the Sitecore.CES impact on startup was eliminated, further Application_Start investigation (excluding Sitecore.CES from the view via ExcPats) could be carried on:

Remaining culprits

Further steps

Over 87% of the time spent in await likely indicates some room for improvement:

Almost 90% of time is spent awaiting

Every remaining flame bar should be inspected (in a similar manner as we did just now) to understand whether there is potential to do better.