Analytics reports have suspicious statistics with lower conversion rates compared to other systems. Can we find out why?
It seem that healthy data is diluted with junk/empty interactions with no value. We assume robot/crawlers activity gets recorded. Is there any OOB protection in Sitecore?
Filter out robots by user agents
Sitecore blacklists robots via a list of user agents defined in the config:
Theoretically, zero interactions with these user agents should be recorded, right? Well, I do not blog about straightforward tasks. We could check the actual number of robots via huge SQL composed by replacing line break character with
But still, 20% of all contacts (that have interactions) report to have robot user agents. Could it be due to the fact Sitecore uses case-sensitive match for robots, while SQL column has case-insensitive
SQL_Latin1_General_CP1_CI_AS collation? While that can easily happen in your case, a cause is different in our case study:
Let’s leave this question unanswered (for now) and focus on what
How to identify robot by behavior?
Page crawler requests pages one by one without persisting any cookies – it will not have interactions with more than one page. We could try to find user agents that do not have more than one page recorded:
WITH PagesInInteractions AS( SELECT Pages = ( SELECT COUNT(1) FROM OPENJSON([Events]) WITH ([Event] NVARCHAR(100) '$."@odata.type"') WHERE [Event] = '#Sitecore.XConnect.Collection.Model.PageViewEvent'), Created, LastModified, UserAgent, InteractionId, ContactId FROM [xdb_collection].[Interactions]) SELECT COUNT(1) AS Hits, [UserAgent], DATEDIFF(DAY, MIN(Created), MAX(LastModified)) AS [DaysBetween], MIN(Created) AS [Earliest], MAX(LastModified) AS [Latest] FROM PagesInInteractions GROUP BY [UserAgent] HAVING MAX(Pages) <=1 AND COUNT(1) > 500 ORDER BY COUNT(1) DESC
This query finds unique user agents that have viewed single-page only and have over 500 interactions:
20% of total interactions recorded system-wide belong to user agents that do not have over 1 page in visit across 500 visits. These user agents are most likely to be added into the blacklist to stop them from being tracked.
Could contacts without interactions exist?
Although that should not happen in theory… you got it:
SELECT COUNT(DISTINCT(c.ContactId)) FROM [xdb_collection].Contacts c LEFT JOIN [xdb_collection].Interactions i ON i.ContactID = c.ContactId WHERE i.ContactId IS NULL
Our case study has 7.5 % of contacts without interactions that was caused by bug.
The query we developed to locate
suspicious user agents allows us to identify robots with better accuracy in future. Unfortunately, the previously collected robot sessions would still remain in system and pollute analytics reports. Needless to say that you pay for storing useless data to your hosting vendor.
In next articles we’ll try to remove the useless data from system to recover the reports.