The amount of memory the cache can use is defined in config:
<!-- ACCESS RESULT CACHE SIZE
Determines the size of the access result cache.
Specify the value in bytes or append the value with KB, MB or GB
A value of 0 (zero) disables the cache.
-->
<setting name="Caching.AccessResultCacheSize" value="10MB" />
That limit is needed to protect against disk thrashing – running out of physical RAM so that disk-backed virtual memory kicks in (terribly slow). That is a big hazard in Azure WebApps, which have much less RAM than old-school big boxes.
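The value format (plain bytes or a KB/MB/GB suffix, with 0 disabling the cache) could be parsed along these lines – a minimal sketch, not Sitecore's actual parser:

```csharp
using System;

// Hedged sketch: a minimal parser for size strings such as "10MB".
// The suffix handling mirrors what the setting description implies
// (bytes, KB, MB, GB); this is NOT Sitecore's implementation.
public static class CacheSizeParser
{
    public static long Parse(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Value is required.", nameof(value));

        value = value.Trim().ToUpperInvariant();
        long multiplier = 1;

        if (value.EndsWith("KB")) { multiplier = 1024L; value = value.Substring(0, value.Length - 2); }
        else if (value.EndsWith("MB")) { multiplier = 1024L * 1024; value = value.Substring(0, value.Length - 2); }
        else if (value.EndsWith("GB")) { multiplier = 1024L * 1024 * 1024; value = value.Substring(0, value.Length - 2); }

        // A value of 0 disables the cache.
        return long.Parse(value.Trim()) * multiplier;
    }
}
```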
Sitecore keeps track of all the space dedicated to caches:
The last ‘Cache created’ message in logs for 9.3
The last entry highlights that 3.4 GB is reserved for the Sitecore 9.3 out-of-the-box caching layer. Neither the native ASP.NET cache, nor ongoing activities, nor the assemblies loaded into memory are taken into account (and that is a lot – the bin folder is massive):
Over 170 MB solely for compiled assemblies
3.4 GB + 170 MB + ASP.NET caching + ongoing activities + (~2 MB/thread * 50 threads) + buffers + … = feels like more than the 3.5 GB RAM that the S2 tier gives. Nevertheless, S2 is claimed to be supported by Sitecore. How come the stock cache configuration can consume all RAM and still be supported?
The first explanation that comes to mind – cache size detection overestimates objects to be on the safe side, whereas actual sizes are smaller. Let's compare the actual size with the estimated one.
How does Sitecore know the size of the data it adds to the cache?
There are generally two approaches:
FAST: let the developer decide
SLOW: use reflection to walk the full object graph -> sum up primitive type sizes
Making a class implement either Sitecore.Caching.Interfaces.ISizeTrackable (immutable) or Sitecore.Caching.ICacheable (mutable) lets the developer report the object size by hand – but how accurate can that be?
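The "FAST" approach can be sketched as follows; ISizeAware below is a stand-in for Sitecore's contracts (names and members are illustrative, not the real API), and the byte constants are rough 64-bit approximations:

```csharp
using System;

// Stand-in for Sitecore's ICacheable/ISizeTrackable contracts:
// the developer reports the size estimate instead of reflection walking the graph.
public interface ISizeAware
{
    long GetDataLength(); // developer-estimated size in bytes
}

public sealed class CachedAccessEntry : ISizeAware
{
    public string AccountName { get; }
    public string EntityId { get; }
    public bool Allowed { get; }

    public CachedAccessEntry(string accountName, string entityId, bool allowed)
    {
        AccountName = accountName;
        EntityId = entityId;
        Allowed = allowed;
    }

    // A hand-written estimate: fixed object overhead plus two strings plus the bool field.
    // Constants are approximations of 64-bit CLR layout, not exact numbers.
    public long GetDataLength() =>
        24 + StringSize(AccountName) + StringSize(EntityId) + 1;

    // ~26 bytes of string overhead + 2 bytes per UTF-16 character.
    private static long StringSize(string s) => s == null ? 0 : 26 + 2L * s.Length;
}
```

The trade-off is obvious: the estimate is instant, but its accuracy depends entirely on the developer – exactly the problem measured below.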
Action plan
Restore a cacheable object from a memory snapshot
Let the OOB logic estimate its size
Measure the size by hand (from the memory snapshot)
Compare the results
The restored object from the snapshot is evaluated to be 313 bytes:
While calculating by hand shows different results:
The actual size is at least two times larger (632 = 120 + 184 + 248 + 80), even without:
DatabaseName – belongs to the database, which is cached inside the factory;
Account name – one account (string) can touch hundreds of items in one go;
AccessRight – a dictionary with fixed, well-known rights (inside Sitecore.Security.AccessControl.AccessRightProvider);
In this case the measurement error is over 100% (313 vs 632 bytes). Should the same accuracy persist for other caches, the system could easily spend double the configured 3.4 GB -> ~6.8 GB.
A fast:// query is translated into SQL that fetches results directly from the database. Unfortunately, the mechanism is no longer widely used these days. Is there any way to bring it back to life?
Drawback of its main power: direct SQL queries
Relational databases are not good at scaling, so only a limited number of concurrent queries is possible. The more queries are executed at a time, the slower it gets. Adding an additional physical database server for an extra publishing target complicates the solution and increases the cost.
Drawback: Fast Query SQL may be huge
The resulting SQL statement can be expensive for the database engine to execute. The translator usually produces a query with tons of joins. Should item axes (e.g. the .// descendants syntax) be included, the price tag rises even further.
Fast Query performance drops as the volume of data in the database grows. Being ok-ish with little content during the development phase can turn into big trouble in a year.
Alternative: Content Search
Content Search is, in theory, the first candidate to replace fast queries; there are limitations, though:
An index is metadata built from the actual data = not 100% accurate all the time
Alternative: Sitecore Queries and Item API
Both Sitecore Queries and the Item API are executed inside the application layer = slow.
Firstly, inspecting items is time-consuming, and the needed one might not even be in the results.
Secondly, caches get polluted with data flowing through during query processing; a useful bit might be evicted from the cache when maxSize is reached and scavenging starts.
Making Fast Queries great again!
Fast queries can become a first-class API once the performance problems are resolved. Could the volume of queries sent to the database be reduced?
Why no cache for fast: query results?
Since data in a publishing target (e.g. web) can be modified only by publishing, queries must return the same results in between publishes. The results can be cached in memory and scavenged whenever publish:end occurs.
public sealed class ReuseFastQueryResultsDatabase : Database
{
private readonly Database _database;
private readonly ConcurrentDictionary<string, IReadOnlyCollection<Item>> _multipleItems = new ConcurrentDictionary<string, IReadOnlyCollection<Item>>(StringComparer.OrdinalIgnoreCase);
private readonly LockSet _multipleItemsLock = new LockSet();
public ReuseFastQueryResultsDatabase(Database database)
{
Assert.ArgumentNotNull(database, nameof(database));
_database = database;
}
public bool CacheFastQueryResults { get; private set; }
#region Useful code
public override Item[] SelectItems(string query)
{
if (!CacheFastQueryResults || !IsFast(query))
{
return _database.SelectItems(query);
}
if (!_multipleItems.TryGetValue(query, out var cached))
{
lock (_multipleItemsLock.GetLock(query))
{
if (!_multipleItems.TryGetValue(query, out cached))
{
using (new SecurityDisabler())
{
cached = _database.SelectItems(query);
}
_multipleItems.TryAdd(query, cached);
}
}
}
var results = from item in cached ?? Array.Empty<Item>()
where item.Access.CanRead()
select new Item(item.ID, item.InnerData, this);
return results.ToArray();
}
private static bool IsFast(string query) => query?.StartsWith("fast:/") == true;
// Note: CacheFastQueryResults is expected to be populated by the configuration factory.
protected override void OnConstructed(XmlNode configuration)
{
if (!CacheFastQueryResults)
{
return;
}
Event.Subscribe("publish:end", PublishEnd);
Event.Subscribe("publish:end:remote", PublishEnd);
}
private void PublishEnd(object sender, EventArgs e)
{
// Scavenge all cached query results whenever new data is published.
_multipleItems.Clear();
}
#endregion
}
The decorated database executes the query only when there is no data in the cache yet. A copy of the cached data is returned to the caller to avoid cache data corruption.
The decorator is placed into its own fastQueryDatabases configuration node and references the default database as a constructor argument:
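The wiring could look roughly like this – node and attribute names are illustrative, inferred from the text, not a verified Sitecore patch:

```xml
<!-- Illustrative shape only: a fastQueryDatabases node hosting the decorator,
     with the default database passed as the constructor argument. -->
<sitecore>
  <fastQueryDatabases>
    <database id="web" type="ReuseFastQueryResultsDatabase, MyAssembly">
      <param ref="databases/database[@id='web']" />
      <CacheFastQueryResults>true</CacheFastQueryResults>
    </database>
  </fastQueryDatabases>
</sitecore>
```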
The action plan is to create N objects (so that allocations are close to the PerfView stats) and measure time/memory for the stock version vs the hand-written one.
The null check still remains, via extension methods:
public static Guid Required(this Guid guid)
{
if (guid == Guid.Empty) throw new ArgumentException("Value is required.", nameof(guid));
return guid;
}
public static T Required<T>(this T value) where T : class
{
if (value is null) throw new ArgumentNullException(nameof(value));
return value;
}
No other ongoing activities – single-threaded synthetic test
Physical machine – the whole CPU is available
How fast would that be in an Azure WebApp?
At least 2 times slower (405 vs 890 ms):
The stock version takes almost 900 ms.
The optimized version remains at least 2 times faster:
The optimized version takes 440 ms.
What about a heavily loaded WebApp?
It will be slower on a loaded WebApp. Our test does not take into account the extra GC effort spent on collecting garbage, so the real-life impact will be even greater.
Summary
A tiny WebApp with optimized code produces the same results as a twice-as-powerful machine.
Each software block contributes to overall system performance. Slowness can be solved either by scaling hardware or by writing programs that use the existing hardware optimally.
How much faster can real-life software be without technology changes?
Looking at the CPU report's top methods – what to optimize
Looking at PerfView profiles to find CPU-intensive methods in a real-life production system:
The CPU Stacks report highlights that aggregated memory allocations turn out to be expensive
It turns out Hashtable lookup is the most popular operation, with 8% of CPU recorded.
No wonder, as the whole of ASP.NET is built around it – mostly HttpContext.Items. Sitecore takes it to the next level, as all Sitecore.Context-specific code is bound to that collection (context site, user, database…).
Second place is devoted to memory allocations – the new operator. It stems from the developer myth that memory allocations are cheap, encouraging their overuse here and there. The sum of many small numbers turns into a big number wearing the second-top CPU consumer badge.
Concurrent dictionary lookups take third place, at half the price of the allocations. Just think about it for a moment – all the cache lookups combined cost half as much CPU as the volume of allocations performed.
What objects are allocated often?
AccessResultCache is very visible in the view. It is responsible for quickly answering the question – can the user read/write/… an item or not:
Many objects are related to access processing
AccessResultCache is responsible for at least 30% of lookups
Among the 150+ different caches in Sitecore, a single one is responsible for 30% of lookups. That could be due either to very frequent access, or to heavy logic in GetHashCode and Equals needed to locate the value. [Spoiler: in our case, both]
One third of all lookups originate from AccessResultCache
AccessResultCache causes many allocations
Top allocations are related to AccessResultCache
Why does Equals method allocate memory?
It is quite weird to see memory allocations in the cache key's Equals, which ought to be designed for a fast equality check. That is the result of a design omission causing boxing:
Since AccountType and PropagationType are enums, they are value types; yet the code compares them via the object.Equals(object, object) signature, leading to unnecessary boxing (memory allocations). This could have been eliminated by the a == b syntax.
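The difference can be demonstrated with a self-contained snippet (AccountType here is a stand-in enum, not the Sitecore type): object.Equals boxes both enum operands on every call, while == compiles down to a plain integer comparison:

```csharp
using System;

// Stand-in enum for illustration; not the Sitecore AccountType.
public enum AccountType { User, Role }

public static class EnumCompare
{
    // Allocates: both enum values are boxed to object before the static Equals call.
    public static bool SlowEquals(AccountType a, AccountType b) => object.Equals(a, b);

    // Allocation-free: direct value comparison of the underlying integers.
    public static bool FastEquals(AccountType a, AccountType b) => a == b;
}
```

Both methods return the same answers; only the allocation profile differs, which is exactly what PerfView surfaced.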
Why is GetHashCode CPU-heavy?
It is computed on each call, without persisting the result.
What could be improved?
Avoid boxing in AccessResultCacheKey.Equals
Cache AccessResultCacheKey.GetHashCode
Cache AccessRight.GetHashCode
The main question is – how to measure the performance win from the improvement?
How to test on close-to-live data (not dummy data)?
How to accurately measure CPU and memory usage?
Getting live data from a memory snapshot
Since a memory snapshot has all the data the application operates with, it can be fetched via ClrMD code into a file and used for a benchmark:
private const string snapshot = @"E:\PerfTemp\w3wp_CD_DUMP_2.DMP";
private const string cacheKeyType = @"Sitecore.Caching.AccessResultCacheKey";
public const string storeTo = @"E:\PerfTemp\accessResult";
private static void SaveAccessResultToCache()
{
using (DataTarget dataTarget = DataTarget.LoadCrashDump(snapshot))
{
ClrInfo runtimeInfo = dataTarget.ClrVersions[0];
ClrRuntime runtime = runtimeInfo.CreateRuntime();
var cacheType = runtime.Heap.GetTypeByName(cacheKeyType);
var caches = from o in runtime.Heap.EnumerateObjects()
where o.Type?.MetadataToken == cacheType.MetadataToken
let accessRight = GetAccessRight(o)
where accessRight != null
let accountName = o.GetStringField("accountName")
let cacheable = o.GetField<bool>("cacheable")
let databaseName = o.GetStringField("databaseName")
let entityId = o.GetStringField("entityId")
let longId = o.GetStringField("longId")
let accountType = o.GetField<int>("<AccountType>k__BackingField")
let propagationType = o.GetField<int>("<PropagationType>k__BackingField")
select new Tuple<AccessResultCacheKey, string>(
new AccessResultCacheKey(accessRight, accountName, databaseName, entityId, longId)
{
Cacheable = cacheable,
AccountType = (AccountType)accountType,
PropagationType = (PropagationType)propagationType
}
, longId);
var allKeys = caches.ToArray();
var content = JsonConvert.SerializeObject(allKeys);
File.WriteAllText(storeTo, content);
}
}
private static AccessRight GetAccessRight(ClrObject source)
{
var o = source.GetObjectField("accessRight");
if (o.IsNull)
{
return null;
}
var name = o.GetStringField("_name");
return new AccessRight(name);
}
The next step is to bake a FasterAccessResultCacheKey that caches the hash code and avoids boxing:
public bool Equals(FasterAccessResultCacheKey obj)
{
if (obj is null) return false;
if (ReferenceEquals(this, obj)) return true;
if (string.Equals(obj.EntityId, EntityId) && string.Equals(obj.AccountName, AccountName)
&& obj.AccountType == AccountType && obj.AccessRight == AccessRight)
{
return obj.PropagationType == PropagationType;
}
return false;
}
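The Equals above can be paired with a cached hash code. A hedged sketch of the idea (the field set and hash combination here are simplified assumptions, not the actual FasterAccessResultCacheKey internals):

```csharp
using System;

// Sketch: compute the hash once and memoize it in a field.
// Simplified to two string fields for illustration.
public sealed class CachedHashKey
{
    private readonly string _entityId;
    private readonly string _accountName;
    private int _hashCode; // 0 = not computed yet

    public CachedHashKey(string entityId, string accountName)
    {
        _entityId = entityId;
        _accountName = accountName;
    }

    public override int GetHashCode()
    {
        if (_hashCode == 0)
        {
            // Benign race: concurrent threads compute the same value.
            var hash = StringComparer.OrdinalIgnoreCase.GetHashCode(_entityId ?? string.Empty)
                     ^ StringComparer.OrdinalIgnoreCase.GetHashCode(_accountName ?? string.Empty);
            _hashCode = hash == 0 ? 1 : hash; // never store 0, or we'd recompute forever
        }
        return _hashCode;
    }

    public override bool Equals(object obj) =>
        obj is CachedHashKey other
        && string.Equals(_entityId, other._entityId, StringComparison.OrdinalIgnoreCase)
        && string.Equals(_accountName, other._accountName, StringComparison.OrdinalIgnoreCase);
}
```

The memoization trades four bytes per key for skipping the string hashing on every dictionary probe – exactly the GetHashCode cost flagged in the profile.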
[MemoryDiagnoser]
public class AccessResultCacheKeyTests
{
private const int N = 100 * 1000;
private readonly FasterAccessResultCacheKey[] OptimizedArray;
private readonly ConcurrentDictionary<FasterAccessResultCacheKey, int> OptimizedDictionary = new ConcurrentDictionary<FasterAccessResultCacheKey, int>();
private readonly Sitecore.Caching.AccessResultCacheKey[] StockArray;
private readonly ConcurrentDictionary<Sitecore.Caching.AccessResultCacheKey, int> StockDictionary = new ConcurrentDictionary<Sitecore.Caching.AccessResultCacheKey, int>();
public AccessResultCacheKeyTests()
{
var fileData = Program.ReadKeys();
StockArray = fileData.Select(e => e.Item1).ToArray();
OptimizedArray = (from pair in fileData
let a = pair.Item1
let longId = pair.Item2
select new FasterAccessResultCacheKey(a.AccessRight.Name, a.AccountName, a.DatabaseName, a.EntityId, longId, a.Cacheable, a.AccountType, a.PropagationType))
.ToArray();
for (int i = 0; i < StockArray.Length; i++)
{
var elem1 = StockArray[i];
StockDictionary[elem1] = i;
var elem2 = OptimizedArray[i];
OptimizedDictionary[elem2] = i;
}
}
[Benchmark]
public void StockAccessResultCacheKey()
{
for (int i = 0; i < N; i++)
{
var safe = i % StockArray.Length;
var key = StockArray[safe];
var restored = StockDictionary[key];
MainUtil.Nop(restored);
}
}
[Benchmark]
public void OptimizedCacheKey()
{
for (int i = 0; i < N; i++)
{
var safe = i % OptimizedArray.Length;
var key = OptimizedArray[safe];
var restored = OptimizedDictionary[key];
MainUtil.Nop(restored);
}
}
}
Twice as fast, three times less memory
Summary
It took a few hours to double the AccessResultCache performance.
Even though I hope my readers are capable of downloading files from a link without help, I'll drop an image just in case:
Ensure you download both PerfView and PerfView64 so that processes of either bitness can be profiled.
Running the collection
Launch PerfView64 as an admin user on the server
Top menu: Collect -> Collect
Check all the flags (Zip & Merge & Thread Time)
Click Start collection
Click Stop collection in ~20 seconds
How to collect PerfView profiles
Collect 3-4 profiles to cover different application activity periods.
The outcome is a flame graph showing the time distribution:
Summary
The only way to figure out how wall-clock time is distributed is to analyze the "video" showing the operation flow. Everything else is a veiled guessing game.
Only the Sitecore configuration node is stored there, while the native config parts can be read via !wconfig:
In-memory config parts
Alternative approaches
Showconfig tool
One could use the SIM Tool's showconfig to re-build configs from a copy of web.config + App_Config. But are there any guarantees it uses exactly the same algorithm as the Sitecore version you are running?
Showconfig admin page
Requesting sitecore/admin/showconfig.aspx from a live instance is the most trustworthy approach. However, CDs have it disabled for security reasons, so it is not an option.
Support package admin page
The Support Package admin page asks all online servers to generate their configs:
The general assumption is – not using a feature/functionality means no performance price is paid. How far do you agree?
Real life example
Content Testing allows verifying which content option works better. Technically, it enables multiple versions of the same item to be published and injects a processor into the getItem pipeline to figure out which version to show:
As a result, every GetItem API call gets extra logic to execute. What is the price to pay?
Analysis
The processor seems to roughly do the following:
Check if the result is already cached
yes -> quit
no -> fetch the item via the fallback provider
If the item was found – check whether the context is suitable for Content Testing to execute:
Are we outside the shell site?
Are we viewing the site in Normal mode (neither editing nor preview)?
Are we requesting a specific version, or just the latest?
Should all be true, a check for the cached version is triggered:
String concatenation to build a unique item ID
The sc_VersionRedirectionRequestCache_ prefix is appended to mark the domain
As a result, each getItem call produces strings (strings example)
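For a rough sense of that per-call cost, a 64-bit string's size can be estimated as object overhead plus two bytes per UTF-16 character – a sketch in the spirit of TypeUtil.SizeOfString (the constants are approximations, not exact CLR numbers):

```csharp
using System;

// Rough 64-bit estimate of string memory cost:
// ~26 bytes of object header/length overhead + 2 bytes per character.
public static class StringCost
{
    public static int SizeOfString(string s) => s == null ? 0 : 26 + 2 * s.Length;

    // The cache key described above: the item's unique key plus the
    // sc_VersionRedirectionRequestCache_ prefix (prefix name taken from the text).
    public static int KeyCost(string itemUniqueKey) =>
        SizeOfString("sc_VersionRedirectionRequestCache_" + itemUniqueKey);
}
```

Multiply that by every item touched during a request and the per-request megabytes reported below stop being surprising.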
Question: how much excess memory does it cost per request?
Coding time
IContentTestingFactory.VersionRedirectionRequestCache is responsible for the caching and defaults to the VersionRedirectionRequestCache implementation.
Thanks to Sitecore being config-driven, the stock implementation can be replaced with our own:
using Sitecore;
using Sitecore.ContentTesting.Caching;
using Sitecore.ContentTesting.Models;
using Sitecore.Data.Items;
using Sitecore.Reflection;
namespace ContentTesting.Demo
{
public sealed class TrackingCache : VersionRedirectionRequestCache
{
private const string MemorySpent = "mem_spent";
public TrackingCache() => Increment(TypeUtil.SizeOfObjectInstance); // self
public static int? StringsAllocatedFor
=> Context.Items[MemorySpent] is int total ? (int?) total : null;
private void Increment(int size)
{
if (System.Web.HttpContext.Current is null)
{
return;
}
if (!(Context.Items[MemorySpent] is int total))
{
total = 0;
}
total += size;
Context.Items[MemorySpent] = total;
}
protected override string GenerateItemUniqueKey(Item item)
{
var value = base.GenerateItemUniqueKey(item);
Increment(TypeUtil.SizeOfString(value));
return value;
}
public override string GenerateKey(Item item)
{
var value = base.GenerateKey(item);
Increment(TypeUtil.SizeOfString(value));
return value;
}
public override void Put(string key, VersionRedirect versionRedirect)
{
Increment(TypeUtil.SizeOfObjectInstance); // VersionRedirect is an object
base.Put(key, versionRedirect);
}
}
}
The factory implementation:
using Sitecore.ContentTesting;
using Sitecore.ContentTesting.Caching;
using Sitecore.ContentTesting.Models;
using Sitecore.Data.Items;
namespace ContentTesting.Demo
{
public sealed class DemoContentTestingFactory: ContentTestingFactory
{
public override IRequestCache<Item, VersionRedirect> VersionRedirectionRequestCache => new TrackingCache();
}
}
And the EndRequest processor:
using Sitecore;
using Sitecore.Abstractions;
using Sitecore.Pipelines.HttpRequest;
namespace ContentTesting.Demo
{
public sealed class HowMuchDidItHurt : HttpRequestProcessor
{
private readonly BaseLog _log;
public HowMuchDidItHurt(BaseLog log)
=> _log = log;
private int MinimumToComplain { get; set; } = 10 * 1024; // 10 KB
public override void Process(HttpRequestArgs args)
{
var itHurts = TrackingCache.StringsAllocatedFor;
if (itHurts is null) return;
if (itHurts.Value > MinimumToComplain)
{
_log.Warn($"[CT] {MainUtil.FormatSize(itHurts.Value, translate: false)} spent during {args.RequestUrl.AbsoluteUri} processing.", this);
}
}
}
}
Log how much memory is spent on request processing by CT (sitecore.config):
The local results are: 16.9 MB with empty caches vs 3.2 MB with warm HTML caches.
50 requests processed at a time would give an extra 150 MB of memory pressure – and that is only a single feature out of many. Each relatively low performance price, multiplied by the number of features, results in a big overall price to pay.
A system with 20 equally priced unused components would waste over 3 GB of memory. That does not sound so cheap anymore, right?
Conclusion
Even though you might not be using Content Testing (or any other feature), the performance price is still paid. Would you prefer not to pay for unused stuff?
RESOURCE_SEMAPHORE_QUERY_COMPILE is at the top of the wait list, highlighting that query compilations are the bottleneck. The good thing is we can inspect the already-compiled plans to get a glimpse of the queries:
Ironically, a query plan cannot fit into the default XML output size.
SQL Server needs over 30 MB solely to describe how a query is to be executed. Many such huge queries processed at a time cause the compilation gate to hit its limit.
Inspecting memory snapshot
The !sqlcmd -s open command gives us a list of all SQL requests:
No wonder MSSQL Server is dying – the first SQL command alone is 1.7M characters long:
The command is run by the 58th thread:
It turns out the huge commands are produced by the Analytics SQL bulk storage provider. Instead of making many lightweight SQL calls, it collects them into a batch to avoid paying the network latency cost for each command.
Solution: stored procedures would reduce the length of the SQL query and allow reusing compiled query plans => the bottleneck is resolved.
Intermediate summary
The behavior was reported to Sitecore:
Performance issue 370868 has been registered
Ad-hoc queries were converted into stored procedures by Support
A hotfix to increase the aggregation speed has been provided
As a result, aggregation speed at least doubled.
Even though further improvements are on the deck, I’ll save them for future posts.
!dumpheap -stat reports which objects live in heap:
The first column is the type descriptor (method table), i.e. the type's unique identifier.
The second column shows how many objects of that type are in the heap.
The third column shows the objects' own size (without following nested references).
Use case: comparing live stats between memory snapshots collected a few minutes apart flags growing types – a potential cause of a memory leak.
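That snapshot-diffing use case can be sketched outside the debugger too. Assuming the two !dumpheap -stat histograms have already been parsed into type-name -> count maps (the parsing is not shown), flagging the growing types is a one-liner pipeline:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Compare two heap histograms (type name -> live object count) taken a few
// minutes apart, and return the types whose count grew - leak candidates.
public static class HeapDiff
{
    public static IReadOnlyList<(string Type, long Growth)> GrowingTypes(
        IReadOnlyDictionary<string, long> before,
        IReadOnlyDictionary<string, long> after)
    {
        return after
            .Select(kv => (Type: kv.Key, Growth: kv.Value - before.GetValueOrDefault(kv.Key)))
            .Where(x => x.Growth > 0)
            .OrderByDescending(x => x.Growth)
            .ToList();
    }
}
```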
-name includes only types that contain the given string in their name
-live (slow) leaves only objects that are alive
-dead shows only objects that are eligible for garbage collection
-short shows only object addresses
-strings prints strings only
-mt prints only objects with the given method table ID
GC heap stats
!gcheapstat (or !gcheapinfo from MEX) shows the running size of each heap segment, as well as its fill ratio:
Heap statistics
Use case:
Large objects end up in the LOH, which is not compacted by default. As a result, large objects can provoke additional memory allocations when the existing free blocks are insufficient = looks like a memory leak.
What are the dead objects in ‘older’ generations:
Dump dead objects from the generation start to the segment's generation end (? 0x0000022fadb4e6c0+0n1191488 = 0000022fadc71500):
The ? command sums hexadecimal (0x) and decimal (0n) formats
Performing the same operation for GEN2 from heap 5:
Strings are in the top 3; get the content of the strings by adding the -strings option:
A huge work queue flags that operations are added faster than they are processed
Notes
The command is overloaded by MEX, so unload the extension prior to the call
The CPU number is system-wide, meaning other processes could steal the show
Ongoing operations performed by managed threads
!eestack -ee from SOSEX prints each managed thread's call stack:
Use case: Figure out what activities are performed in the snapshot.
Notes
FileWatcher/LayoutWatcher (threads dedicated to a certain activity) wake up once in a while (e.g. every half second) to do a status check; they are not usually interesting.
Sitecore.Threading.ManagedThreadPool threads persist for the application lifetime, so they are visible in the output. Should no operations be queued for processing, the threads idle and consume no CPU.
How much CPU did each thread consume?
!runaway shows how much CPU time each thread has consumed since app start:
Use case: comparing the numbers between two sequential snapshots highlights the top CPU-consuming threads in between. Stats from a single snapshot are rarely useful, since the data is aggregated per thread from application start.
Non-managed stack traces
~* kb 20 outputs the top 20 frames from every thread in the application:
Use case: Figuring out what unmanaged operations are running (GC/finalization, attempts to enter critical sections).
Notes: 81% CPU usage is a magic number that likely indicates garbage collection is running; the top frame would show the phase (e.g. mark/compact). The MSDN article mentions dedicated threads for GC – you can spot them on your own (the ones with gc_heap) in the unmanaged call stacks.
~[ThreadNumber]s sets [ThreadNumber] as the context thread; the picture sets the last ThreadPool worker thread as the context one.
Use cases: inspect objects that live on the thread stack, and map them to the executed frames.
Dump stack objects
!mdso from SOSEX prints objects that live on the thread stack:
Use case: Get the data thread operates with (method args would live here).
Restoring thread call stack
Call stacks may make no sense at first sight, as the CLR can optimize code execution. !dumpstack and !clrstack use different semantics to restore call stacks: