Agenda
Analytics reports have suspicious statistics with lower conversion rates compared to other systems. Can we find out why?
It seem that healthy data is diluted with junk/empty interactions with no value. We assume robot/crawlers activity gets recorded. Is there any OOB protection in Sitecore?
Filter out robots by user agents
Sitecore blacklists robots via a list of user agents defined in the config:
App_Config\Sitecore\Marketing.Tracking\Sitecore.Analytics.ExcludeRobots.config

Theoretically, zero interactions with these user agents should be recorded, right? Well, I do not blog about straightforward tasks. We could check the actual number of robots via huge SQL composed by replacing line break character with ','
:

But still, 20% of all contacts (that have interactions) report to have robot user agents. Could it be due to the fact Sitecore uses case-sensitive match for robots, while SQL column has case-insensitive SQL_Latin1_General_CP1_CI_AS
collation? While that can easily happen in your case, a cause is different in our case study:

Let’s leave this question unanswered (for now) and focus on what robot
is?
How to identify robot by behavior?
Page crawler requests pages one by one without persisting any cookies – it will not have interactions with more than one page. We could try to find user agents that do not have more than one page recorded:
WITH PagesInInteractions AS(
SELECT Pages = (
SELECT COUNT(1)
FROM OPENJSON([Events])
WITH ([Event] NVARCHAR(100) '$."@odata.type"')
WHERE [Event] = '#Sitecore.XConnect.Collection.Model.PageViewEvent'),
Created,
LastModified,
UserAgent,
InteractionId,
ContactId
FROM [xdb_collection].[Interactions])
SELECT
COUNT(1) AS Hits,
[UserAgent],
DATEDIFF(DAY, MIN(Created), MAX(LastModified)) AS [DaysBetween],
MIN(Created) AS [Earliest],
MAX(LastModified) AS [Latest]
FROM PagesInInteractions
GROUP BY [UserAgent]
HAVING
MAX(Pages) <=1
AND COUNT(1) > 500
ORDER BY COUNT(1) DESC
This query finds unique user agents that have viewed single-page only and have over 500 interactions:

20% of total interactions recorded system-wide belong to user agents that do not have over 1 page in visit across 500 visits. These user agents are most likely to be added into the blacklist to stop them from being tracked.
Could contacts without interactions exist?
Although that should not happen in theory… you got it:
SELECT COUNT(DISTINCT(c.ContactId)) FROM [xdb_collection].Contacts c
LEFT JOIN [xdb_collection].Interactions i
ON i.ContactID = c.ContactId
WHERE i.ContactId IS NULL
Our case study has 7.5 % of contacts without interactions that was caused by bug.
Summary
The query we developed to locate suspicious
user agents allows us to identify robots with better accuracy in future. Unfortunately, the previously collected robot sessions would still remain in system and pollute analytics reports. Needless to say that you pay for storing useless data to your hosting vendor.
In next articles we’ll try to remove the useless data from system to recover the reports.