Case study: polluted reports

Agenda

Analytics reports have suspicious statistics with lower conversion rates compared to other systems. Can we find out why?

It seem that healthy data is diluted with junk/empty interactions with no value. We assume robot/crawlers activity gets recorded. Is there any OOB protection in Sitecore?

Filter out robots by user agents

Sitecore blacklists robots via a list of user agents defined in the config:

App_Config\Sitecore\Marketing.Tracking\Sitecore.Analytics.ExcludeRobots.config

User agents that should be excluded rain or shine

Theoretically, zero interactions with these user agents should be recorded, right? Well, I do not blog about straightforward tasks. We could check the actual number of robots via huge SQL composed by replacing line break character with ',':

But still, 20% of all contacts (that have interactions) report to have robot user agents. Could it be due to the fact Sitecore uses case-sensitive match for robots, while SQL column has case-insensitive SQL_Latin1_General_CP1_CI_AS collation? While that can easily happen in your case, a cause is different in our case study:

Let’s leave this question unanswered (for now) and focus on what robot is?

How to identify robot by behavior?

Page crawler requests pages one by one without persisting any cookies – it will not have interactions with more than one page. We could try to find user agents that do not have more than one page recorded:

WITH PagesInInteractions AS(
	SELECT Pages = (
		SELECT COUNT(1) 
			FROM OPENJSON([Events])
				WITH ([Event] NVARCHAR(100) '$."@odata.type"') 
			WHERE [Event] = '#Sitecore.XConnect.Collection.Model.PageViewEvent'),
	Created,
	LastModified,
	UserAgent,
	InteractionId,
	ContactId
FROM [xdb_collection].[Interactions])
  
SELECT 
	COUNT(1) AS Hits,
	[UserAgent], 
	DATEDIFF(DAY, MIN(Created), MAX(LastModified)) AS [DaysBetween],  
	MIN(Created) AS [Earliest],
	MAX(LastModified) AS [Latest]
FROM PagesInInteractions 
GROUP BY [UserAgent] 
HAVING 
	MAX(Pages) <=1 
	AND COUNT(1) > 500
ORDER BY COUNT(1) DESC

This query finds unique user agents that have viewed single-page only and have over 500 interactions:

20% of total interactions recorded system-wide belong to user agents that do not have over 1 page in visit across 500 visits. These user agents are most likely to be added into the blacklist to stop them from being tracked.

Could contacts without interactions exist?

Although that should not happen in theory… you got it:

SELECT COUNT(DISTINCT(c.ContactId)) FROM [xdb_collection].Contacts c
LEFT JOIN  [xdb_collection].Interactions i
ON i.ContactID = c.ContactId
WHERE i.ContactId IS NULL

Our case study has 7.5 % of contacts without interactions that was caused by bug.

Summary

The query we developed to locate suspicious user agents allows us to identify robots with better accuracy in future. Unfortunately, the previously collected robot sessions would still remain in system and pollute analytics reports. Needless to say that you pay for storing useless data to your hosting vendor.

In next articles we’ll try to remove the useless data from system to recover the reports.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: