It started with a simple mistake. A member of our team forgot to tag one line item of media with URL parameters, meaning we couldn’t find the resulting sessions in Google Analytics.
That’s ok. It’s been a while, but it’s happened before. We can estimate the missing Google Analytics traffic, based on metrics from the ad platform, making a note in the report for full transparency with the client.
Only this time it just didn’t make sense. Based on past performance, the math told us to multiply the number of clicks in the ad platform by 243 to get the number of sessions in Google Analytics. That multiplier historically had been between 0.5 and 1.5 (less than 1 if the ad platform was overcounting clicks or analytics was underreporting sessions, more than 1 if the opposite was true). 243 was just plain nonsensical.
We checked. And checked again. We pulled the data correctly. The math was right. The data itself was bad.
We started digging for patterns in Google Analytics that might explain the anomaly and noticed that a disproportionate volume of the traffic came from the same few cities – cities known for large data centers, not large populations of people: Columbus, OH; Ashburn, VA; Hampton, VA.
The bounce rates were high. Time on site was low. This was clearly bot traffic.
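The pattern above can be sketched in code. This is a minimal illustration, not our actual detection logic: the city list and the five-second dwell-time threshold are assumptions chosen for the example, applied to hypothetical session records exported from analytics.

```python
# Illustrative city list: places with large data centers but small populations.
DATA_CENTER_CITIES = {"Columbus", "Ashburn", "Hampton"}

def looks_like_bot(session: dict) -> bool:
    """Flag sessions from known data-center cities that bounce almost instantly.

    A real detector would use many more signals; this sketch combines just
    the three symptoms described above: odd city, bounce, near-zero dwell time.
    """
    return (
        session["city"] in DATA_CENTER_CITIES
        and session["bounced"]                  # single-page session
        and session["time_on_site_sec"] < 5     # near-zero time on site
    )

# Hypothetical exported session rows
sessions = [
    {"city": "Ashburn", "bounced": True, "time_on_site_sec": 0},
    {"city": "Denver", "bounced": False, "time_on_site_sec": 95},
]
flagged = [s for s in sessions if looks_like_bot(s)]  # only the Ashburn session
```

In practice a rule this crude would miss most bots and misflag some humans, which is why we went on to build the four detection methods described below.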
Bots Are Nothing New
Bots have been an issue in Google Analytics for years. In the 2010s, bots usually just caused ugly spikes in the data – days when traffic to the site would be 10X or more the usual volume.
Best practice back then was to monitor the network domains of the traffic coming to the site and add filters to remove the bot traffic. That ended in February of 2020, when Google removed the ability to see or filter on network domain.
We were left to fall back on a feature that had been in Google Analytics since 2014, but that many analysts had little faith in – Automatic Bot and Spider Filtering. This feature supposedly removed both bots and search engine spiders from the reports before we saw them. What worried experts in the know was that Google’s list of “known” bots was rather disappointing.
We had no choice but to move on. Our clients weren’t concerned and most marketers we talked to weren’t concerned. We needed the data and had to trust Google had the bot thing covered.
Still, we knew our data was off by a few percentage points, and we saw signs some bots were getting through. The impact was worst (high bounce rates, traffic clustered in odd cities, etc.) when we first launched new media or made creative changes, but we assumed the distortion was relatively consistent and the conclusions we drew from the data were still sound.
Our little data recreation exercise, however, brought reality clearly into focus. Bots are no longer a small distortion in Google Analytics data – they are a big problem that is impairing our ability to make sound, data-driven decisions.
So our team set out to identify the bot traffic. We studied the networks the traffic came from, characteristics of the browsers the bots used, data sent with each network request, and more to come up with four distinct methods of identifying the bot traffic.
We picked eight clients (one of which was ourselves) and added scripts implementing these four detection techniques to Google Tag Manager. We then configured custom dimensions in Google Analytics 4 and pushed the detection results from Google Tag Manager into Google Analytics.
Then we waited.
After three weeks, we crunched the numbers. They were ugly.
Percentage of Traffic Identified as a Bot by Client
Only two of the clients analyzed came out with less than 25% bot traffic. One had a whopping 68% of all traffic in Google Analytics coming through as suspected bot traffic!
We were stunned. The data we had been reporting to our clients was wrong. It was horribly distorted, and based on the variation among clients, not in a consistent way.
But it only got worse as we dug deeper into the data…
Radically Uneven Distortions
Percentage of Traffic Identified as a Bot by Source
We found that the distribution of bot traffic is very uneven, depending on the source of that traffic. For example, 72% of the traffic generated by programmatic paid media was bots, whereas paid search was only 2%.
This means that when comparing the efficiency of programmatic display vs. paid search, programmatic display appears much more efficient than paid search at driving traffic, but in reality, they are close to the same.
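The arithmetic behind that claim is worth spelling out. The session counts below are hypothetical; only the bot rates (72% and 2%) come from our study.

```python
# Hypothetical raw session counts as reported in Google Analytics
programmatic_sessions = 10_000
paid_search_sessions = 3_000

# Bot rates observed in our study
programmatic_bot_rate = 0.72
paid_search_bot_rate = 0.02

# Human traffic after removing bots
programmatic_humans = programmatic_sessions * (1 - programmatic_bot_rate)  # 2,800
paid_search_humans = paid_search_sessions * (1 - paid_search_bot_rate)     # 2,940
```

The raw numbers make programmatic look more than three times as productive as paid search, but once bots are removed, the two channels drive nearly identical human traffic.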
The story gets much more complicated, however, when we look at traffic quality. For that analysis, we’ll use engagement rate, or the number of engaged sessions divided by the number of total sessions.
When we take bots out of the equation, our engagement rates are much more consistent.
Still, at first glance, the changes can be somewhat enigmatic. Email and Programmatic see boosts in engagement rates when we remove bots.
This is likely because the bots visiting from email and programmatic media (antivirus and quality-assurance bots, respectively) don’t do anything after loading and scanning the initial page.
However, Facebook engagement rate actually drops when we remove bots:
A deeper analysis reveals that this is because the majority of Facebook bot traffic is Facebook’s own quality assurance bot. That bot will generally stay on the landing page for 14 seconds and often scroll, presumably in an attempt to trigger popups or overlays that might negatively affect the user experience.
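Both directions of distortion follow directly from the engagement rate formula defined above (engaged sessions divided by total sessions). The session counts here are hypothetical, chosen only to show the mechanics: bots that never engage inflate the denominator, while bots counted as engaged (like a QA bot that lingers and scrolls) inflate the numerator.

```python
def engagement_rate(engaged: int, total: int) -> float:
    """Engaged sessions divided by total sessions."""
    return engaged / total

# Email-style channel: 1,000 sessions, 300 bots that never engage.
email_with_bots = engagement_rate(400, 1000)      # 0.40
email_without_bots = engagement_rate(400, 700)    # ~0.57 -> rate rises

# Facebook-style channel: 1,000 sessions, 300 QA bots that linger
# long enough to count as engaged sessions.
fb_with_bots = engagement_rate(600, 1000)         # 0.60
fb_without_bots = engagement_rate(300, 700)       # ~0.43 -> rate falls
```

The same cleanup step moves the metric in opposite directions depending on how the channel’s dominant bot behaves, which is exactly the enigmatic pattern we saw in our data.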
Bottom line, these bots are having significant impacts in the measurement of both quantity and quality of website engagement. Furthermore, these impacts are not evenly distributed across traffic sources, causing major distortions when comparing one traffic source to another.
We talk about bots as if they are all cut from the same digital cloth. However, there are actually many types of bots out there, some good, some neutral, and some downright evil.
Let’s break down these different types of bots and where we saw their impacts on data in Google Analytics:
QA Bots (good)
This category represents the largest percentage of the bots we discovered. We know this because we looked at the networks sending the bots and spotted patterns aggregating around a few known players: Oath, Moat, Facebook, and a variety of demand-side platforms (DSPs) used by marketers to place media. These bots are out to make sure we aren’t sending people to 404 pages, violating regulations around fair housing, health care, or alcohol, or unwittingly sending traffic to pages infected with viruses.
Because programmatic media passes through several parties (publisher side, multiple exchanges, marketer side), one ad can undergo several checks, many of them repeated, to ensure quality and compliance.
It is vitally important to note that these bots show up as traffic in Google Analytics, mess up your on-page A/B tests, and wreak havoc on screen recordings, but they do not result in inflated impressions or clicks because they get the URLs to inspect from the network, not by viewing an ad.
Prevalent on: Paid media channels, especially programmatic
Antivirus & Firewall Bots (good)
Corporate firewalls will scan the links in incoming emails to make sure they are safe and won’t cause employees to give up sensitive information through a phishing scam or infect their computers with a virus. These firewalls send a bot to load each linked web page before passing the email along to an employee’s inbox.
Prevalent on: email
Search Engine Bots & Spiders (neutral)
These bots are used by search engines to catalog the internet, visiting and pulling content for each page. These are largely filtered by Google Analytics, but can appear in other reporting platforms.
Prevalent on: referral and direct traffic sources
Competitive Intelligence Bots (neutral)
These bots inspect advertising, organic search and organic social media posts to catalog and sell intelligence on what their customers’ competitors are doing online. They are operated by companies like Semrush, WhatRunsWhere, AdPlexity and Rival IQ.
Prevalent on: paid (including search), referral, organic social, organic search and direct traffic sources
Scraping Bots (bad)
These bots are used by researchers, price aggregators, large language models, data brokers and more to not only catalog web pages, but also extract key data to insert into databases.
If this data is later published with attribution, it can improve SEO and drive traffic back to the site. But more often than not, this data is repurposed without attribution, resulting in extra load on the web server, copyright and other intellectual property issues, and out-of-date information floating around on the internet.
Prevalent on: referral, organic social and direct traffic sources
DDoS and Malware Bots (bad)
These bots are operated by hackers with the intent to either harm a targeted company, or infect website visitors with viruses or malware. DDoS (distributed denial of service) bots will repeatedly load pages in an effort to cripple web servers. Malware bots will scan the site code looking for vulnerabilities allowing them to infect the website with their own code.
Prevalent on: referral, organic social and direct traffic sources
Ad Fraud Bots (bad)
Fraudsters in the programmatic advertising industry will set up fake publishers to sell ad space on a website that no human would ever visit because the content is scraped or made up. To generate impressions, they use large bot farms or botnets to impersonate human traffic, reading fake articles but viewing and occasionally clicking on real ads.
These are some of the most sophisticated bots on the internet because it is a constant cat and mouse game between the ad networks and exchanges and the fraudsters. For that reason, we probably only caught a sliver of these bots using our bot detection methods.
Prevalent on: Paid traffic sources
Form Spam Bots (bad)
These bots scan the internet for forms and submit them in the hopes of reaching decision-makers with offers or propagating links on the internet to manipulate search engine rankings.
Prevalent on: referral, organic social and direct traffic sources
Are You Paying for Bot Traffic?
Since some paid channels are heavily hit by bots, it’s tempting to conclude that marketers like you are paying for a big chunk of this bot traffic.
However, when we dig in further, we see two points that lead us to believe that this is not the case:
- Most of the bot traffic coming in as paid media traffic originates from legitimate networks used for quality assurance and ad intelligence. These networks get the links from the exchange or network before they are displayed to a user and before you pay for a single impression.
- For the majority of the bot traffic, we don’t see corresponding clicks in the ad platforms themselves, clicks that would be recorded regardless of whether it was a human or bot clicking on the ad.
How This Ruins Our Decision-Making
Modern marketers take every opportunity to make data-backed decisions on where to put our limited resources.
Since those decisions are generally influenced by measures of quantity and quality of traffic, when those numbers are distorted, so are our decisions.
One area where bots cause issues is when working to optimize content or experiences. We’ll often set up A/B tests to determine the best messaging or creative execution for an offer or content.
Bots don’t care whether they’re looking at version A or version B. Our word choices or creative treatments make zero difference to their actions. Mathematically, this dilutes the results and kills statistical confidence, causing more and more of our A/B tests to come out as inconclusive.
Similarly, bots dilute the signals in screen recordings from tools like CrazyEgg and HotJar. Our team has found these recordings increasingly bizarre and has struggled to draw conclusions or experiment ideas from watching them.
The final area where bots are distorting decision-making is in content analysis. The mix of traffic sources for any given piece of content will vary greatly depending on how that piece of content was promoted and how well it performed on each channel (organic search, organic social, paid, etc). Since the volume of bots varies based on channel, the impact of bots on a given piece varies greatly as well.
Since most bots don’t engage or convert, content with high bot traffic will appear to fail at engaging or converting website visitors. This skews any content analysis in favor of the content that doesn’t attract bot attention.
Bad Data = Bad Decisions
We all want to be data-driven marketers. We work hard to use metrics when making decisions on where we put resources, how we optimize visitor experience, and which content needs our attention.
What our study shows, however, is that we have been making these decisions based on flawed data, and in many cases, drawing flawed conclusions.
Impact on Confidence
As data-driven marketers, our reputations are staked on the data we report and the insights we garner. When bots interfere with that data, we risk our reputations by reporting bad data.
Furthermore, if we know the data is dirty and continue to report it, we run the risk that some day clean data arrives and we have to eat crow with our stakeholders – the very people to whom we reported months, maybe years, of dirty data.
Finally, bots eat at our confidence. When asked, “Why did this number change?” and seeing no clear cause, we are left either saying “we don’t know” or making something up. Either way, we damage others’ confidence in our abilities as marketers.
What You Can Do Now
So now that you know about the havoc bots wreak in your data, what can you do about them?
One way to minimize the damage is to rely on a blend of click data from your traffic sources (Google Search Console, Facebook, Google Ads, etc) and data from Google Analytics. You can take the lesser of two numbers (clicks vs. sessions) to minimize the impact of bots filtered by one platform, but not the other. Still, this is not foolproof.
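The blending rule above can be expressed in a few lines. This is a sketch of the lesser-of-two-numbers approach, not a full attribution model; the example figures are hypothetical.

```python
def blended_traffic(platform_clicks: int, ga_sessions: int) -> int:
    """Take the lesser of ad-platform clicks and analytics sessions.

    If bots inflate one count but are filtered by the other platform,
    using the minimum limits the damage. It is not foolproof: when both
    platforms miss the same bots, the estimate is still inflated.
    """
    return min(platform_clicks, ga_sessions)

# Hypothetical: 500 clicks in the ad platform, but 1,700 GA sessions
# (bot-inflated) -> use 500.
estimate = blended_traffic(500, 1700)
```

The same function also handles the opposite case, where the ad platform overcounts clicks and Google Analytics reports fewer sessions.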
For this reason, we’ve developed a tool called Bot Badger that you can install in front of your Google Tag Manager (GTM). Bot Badger will either pass a flag telling GTM that a visitor is a bot or, if you’d rather, not load GTM at all for bots.
This keeps not only your Google Analytics data clean, but will also keep bot data out of your A/B testing tool, marketing automation, screen recordings, ad and social media pixels, and any other tags you have installed.
We hope you take us up on a free trial so you can see for yourself the impact bots have had on your data. Let us know what you learn. We’d love more data points as we work to let the world know the extent of damage bots do to the decision-making of marketers like you.