Unveiling Google’s Secret Influence on SEO and Data Collection: The Second Phase of Mobile-First Indexing and JavaScript – Article 2 of 3

This is the second article in a three-part series about how Google's launch, and now completion, of the transition to Mobile-First Indexing may impact all of us in and outside of the digital marketing industry. As news from the DOJ keeps rolling out, and announcements of a possible breakup of Google properties keep coming, a lot of the potential abuses have become more real and concrete. It raises questions about what might be shared if Google is forced to disclose information that was collected from Chrome without explicit consent, and this article series strives to delve into that – starting with the concept of rendering as a part of Google's crawling and indexing process. The first article in the series outlined the background understanding that SEOs have of how Google's crawling, rendering, indexing and ranking work, and the final article in the series will review the implications of this new potential understanding of Google's data collection behaviors.

Historically, SEOs have understood crawlers to be software that follows links on the web to find and index new pages, caching pages as it goes so that they can be evaluated for relevance in ranking algorithms. While much of a page can be understood from raw HTML, the heavily pressured transition of websites to Responsive Design made Google's ability to render JavaScript more important. While page formatting in Responsive Design can generally be accomplished with style sheets, many developers were using JavaScript to accomplish the changes instead. Beyond this, JavaScript was simply becoming more common and useful on the web – not just for adapting sites for mobile rendering, but for core functionality. This is why rendering JavaScript became so important to Google.
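To make the rendering problem concrete, here is a minimal, hypothetical sketch (the endpoint and element names are invented for illustration) of a page whose visible content only exists after the browser executes JavaScript. A crawler that reads only the raw HTML of a page like this would see an empty container and nothing else.

```typescript
// Hypothetical client-side rendered page: the HTML a crawler downloads contains
// only <ul id="product-list"></ul>; everything a user actually sees is injected
// by this script at runtime, so a crawler that never executes JavaScript would
// index an essentially empty page.
interface Product {
  name: string;
  price: string;
}

async function renderProducts(): Promise<void> {
  const list = document.getElementById('product-list');
  if (!list) return;

  // Illustrative API endpoint – the content only arrives via this client-side fetch.
  const response = await fetch('/api/products.json');
  const products: Product[] = await response.json();

  list.innerHTML = products
    .map((p) => `<li>${p.name}: ${p.price}</li>`)
    .join('');
}

renderProducts();
```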


https://www.seobility.net/en/wiki/Search_Engine_Crawlers

The problem was that Google had historically avoided rendering JavaScript when it crawled. They considered it a security risk, because if the crawler executed JavaScript that contained malicious code, it could compromise the crawl and the capability of the crawler – or cause other harm that would be essentially unknown until the code had already executed. Additionally, JavaScript can be slow and inefficient to render. Google had gotten very fast at crawling poorly coded HTML, but even well-coded JavaScript would be much more costly and inefficient for the crawlers. Google did start crawling and executing some JavaScript after the release of the Mobile-Friendly Update of 2015 (Mobilegeddon), but rendering remained the costly exception rather than the rule.
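To illustrate that cost gap, here is a rough sketch (using the open-source Puppeteer library, not anything Google-specific) comparing a plain HTML fetch with a full headless-browser render of the same URL. The render path has to boot an entire browser, execute every script and wait for the network to settle, which is why it is dramatically more expensive at crawl scale.

```typescript
// Rough comparison of the two crawl strategies: a cheap raw-HTML fetch vs. a
// full render in a headless browser (Puppeteer). Exact numbers will vary, but
// the rendered path is always far more expensive in time and memory.
import puppeteer from 'puppeteer';

async function fetchRawHtml(url: string): Promise<string> {
  const started = Date.now();
  const res = await fetch(url);          // plain HTTP fetch, no JS execution
  const html = await res.text();
  console.log(`raw fetch took ${Date.now() - started}ms`);
  return html;
}

async function fetchRenderedHtml(url: string): Promise<string> {
  const started = Date.now();
  const browser = await puppeteer.launch();             // boots a whole browser
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });  // runs all JavaScript
  const html = await page.content();                    // serialized post-render DOM
  await browser.close();
  console.log(`full render took ${Date.now() - started}ms`);
  return html;
}

// Example: compare both strategies against the same page.
fetchRawHtml('https://example.com').then(() => fetchRenderedHtml('https://example.com'));
```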

It was with all this in mind, years later at a dinner in the spring of 2024, that my friend Tom Anthony explained that in his testing and/or review of server logs, GoogleBot Mobile was only executing JavaScript 2% of the time. While I assumed that Google would not be executing JavaScript every time it crawled, I never expected their JavaScript execution to be that low. I considered how this could be true for months, until I read an article co-authored by Malte Ubl, who previously worked on the Performance Team at Google and now works at a company called Vercel. In this article, they claimed to have tested thousands of pages and found that Google was executing JavaScript 100% of the time, even when the JavaScript was complex. This created a conundrum – how could these two very smart guys have such different results?

The stark difference between their findings raised critical questions in my mind about what was really happening behind the scenes. I had already begun to suspect that there was something confounding Tom's testing – but I had no idea what it might be, since identifying Google's crawlers is generally not that tough. But then Malte's results also seemed impossible – the time, cost and energy that would be needed for Google to execute JavaScript every single time it crawled a page would likely be enormous, especially when you think about how large the web is, and that it has been growing exponentially for years. To me, if Google was executing and rendering JavaScript on all of the pages that it crawled, that seemed like an irresponsible and wasteful use of compute power – one that would likely even be problematic for large publishing websites that are aggressively and deeply crawled on a regular basis, each time Google's bots look for new content.

As I questioned the results of both, it struck me that there was at least one plausible explanation that would allow both to be right, and would keep the Google index as accurate as it needed to be without wasting the resources of Google or the websites that it crawls: Google owns Chrome, and people surfing the web use Chrome to render pages all the time. Why wouldn't Google use the results of pages that it has already rendered in Chrome to supplement Phase 1 of the crawling process? This would allow them to render pages with no additional effort at all! The biggest hurdle for Google would have been coping with the reality that more web and search traffic was happening on mobile devices, while Google's index was still based primarily on desktop content – hence the name of this change, Mobile-First Indexing.

All of a sudden, it seemed possible, if not likely, that Google was somehow using its vast network of Chrome users as decentralized crawling and processing nodes: rather than burdening its own servers, Google waits for users to render a page and then captures that data for indexing. Essentially, Google was outsourcing its JavaScript rendering to Chrome users, turning everyday web browsing into part of its algorithmic machinery. While this may sound crazy, distributed computing is an established model for other large processing jobs, such as medical protein folding, monitoring space and bitcoin mining. The difference is that those distributed processing models are all knowingly opted into by the people who volunteer as nodes.
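For contrast, here is a deliberately generic sketch of the opt-in, volunteer-computing pattern those projects use – not a description of anything Google actually ships. A client explicitly enrolls, pulls a unit of work from a coordinator, processes it locally, and returns the result; every URL and field name below is invented for illustration.

```typescript
// Generic volunteer-computing client sketch: the defining property of these
// systems is that the user knowingly opts in before any local compute is
// contributed. All endpoints and fields here are hypothetical.
interface WorkUnit {
  id: string;
  payload: string;
}

const COORDINATOR = 'https://coordinator.example.org'; // hypothetical server

async function volunteerLoop(userHasOptedIn: boolean): Promise<void> {
  if (!userHasOptedIn) return; // nothing happens without explicit consent

  // 1. Ask the coordinator for a unit of work.
  const unit: WorkUnit = await (await fetch(`${COORDINATOR}/work`)).json();

  // 2. Do the expensive processing locally (placeholder computation).
  const result = unit.payload.split('').reverse().join('');

  // 3. Send the result back to be aggregated with other volunteers' output.
  await fetch(`${COORDINATOR}/result/${unit.id}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ result }),
  });
}
```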


https://youtu.be/txNT1S28U3M?si=uuysDXT95qRip1xc&t=755

This model is well-known to Google, because they actually have offered it as part of their Google Cloud services and their hybrid-cloud solution Anthos since 2021:
https://cloud.google.com/distributed-cloud 

Under this hypothesis, Tom's evaluation was looking only at Phase 1 of Mobile-First Indexing, and Malte's was focusing on Phase 2. Since Malte had worked for Google, he would have known that 99% of Chrome installations execute JavaScript, and thus could be used for rendering. This approach would also mean that the second phase of Mobile-First Indexing would not be easily detectable by developers who were checking server logs for crawling information, or who were detecting user agents and redirecting in order to cloak content. It would also ensure that the crawling would happen from a variety of locations not associated with known Google IP addresses, so spammers could not use IP detection to their advantage either; this would be a significant advantage for Google's ability to detect and prevent spam from getting into the index.
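This matters because the standard way site owners verify "real" GoogleBot traffic in server logs is the check Google itself documents: reverse-resolve the requesting IP, confirm the hostname ends in googlebot.com or google.com, then forward-resolve it back to the same IP. Any rendering that arrived via ordinary Chrome users would never show up under this check. A minimal Node/TypeScript sketch of that verification (the example IP is just a placeholder from a published Googlebot range):

```typescript
// Verify whether an IP from a server log belongs to Googlebot, using the
// reverse-then-forward DNS check that Google documents for webmasters.
// Requests coming from ordinary Chrome users would fail this check entirely.
import { promises as dns } from 'node:dns';

async function isVerifiedGooglebot(ip: string): Promise<boolean> {
  try {
    // 1. Reverse lookup: the hostname should end in googlebot.com or google.com.
    const hostnames = await dns.reverse(ip);
    const googleHost = hostnames.find(
      (h) => h.endsWith('.googlebot.com') || h.endsWith('.google.com'),
    );
    if (!googleHost) return false;

    // 2. Forward lookup: the hostname must resolve back to the original IP.
    const { address } = await dns.lookup(googleHost);
    return address === ip;
  } catch {
    return false; // lookup failures are treated as "not Googlebot"
  }
}

// Example usage against an address pulled from an access log:
isVerifiedGooglebot('66.249.66.1').then((ok) => console.log(ok));
```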

This understanding of the indexing system might help explain how Google accidentally indexed and ranked a number of Google Groups, Google Community Notes, WhatsApp Groups and private Google Documents that were not meant for public access or consumption. Put simply, these things could have been rendered in a local browser, and though the systems were protected behind a login, the individual pages did not have a NOINDEX meta robots instruction, so they were accidentally indexed and ranked. Some have noted that these pages could also have gotten indexed because they were linked to from other pages, and this is also a possibility – since they didn't have on-page robots blocking – but this is not dispositive; both possibilities can simultaneously be true. Since Google never addressed these concerns publicly, it seems reasonable to assume that there was something to hide.

Core Web Vitals and Real User Metrics

After that, everything that I found seemed to support this conclusion – which likely implies some level of confirmation bias on my part, but I believed, and still believe, that the rest of the industry was also suffering from confirmation bias in assuming that Google's crawlers still worked basically exactly as they always had before. So many things that I found, when reviewed in a different light, seemed to be obvious indicators of Google using Chrome. Possibly the biggest signal was Google's introduction of the first Real User Metrics, or what they called 'field data', in the launch of a new type of page performance measurement utility that they called Core Web Vitals.

Core Web Vitals was designed to help webmasters know exactly how their pages were being experienced by users as they loaded – since most developers were working on high-powered computers while most users were viewing sites on lower-powered, sometimes old and clunky mobile phones. Google went to great pains to explain the difference between synthetic data, which was based on simulations and was what all of Google's previous tools had relied on, and the new use of 'field data' – a suspicious divergence from the standard industry descriptor for this type of data that was already widely in use in other contexts, Real User Metrics or RUM data. While these seemed like just another set of SEO best practices, they also served as a subtle way for Google to gather even more user data directly from Chrome.

With this new understanding of what might be going on, it became clear that the real user data wasn't limited to page load times – it also included user interactions like clicks and scroll behavior. This became even more apparent when Google launched the newest Core Web Vitals metric, Interaction to Next Paint (INP), which explicitly measures the lag between a user's interaction with a page and the next visual update that follows it. Google's admission that they were warehousing real Chrome loading data and user interaction timing data in a huge database called CrUX (the Chrome User Experience Report) seemed like evidence hiding in plain sight.
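The mechanics of "field data" collection are public and easy to see for yourself: Google's open-source web-vitals library measures the same LCP, CLS and INP metrics in real visitors' browsers and lets any site beacon them to a collector. A minimal sketch, with the analytics endpoint as a placeholder:

```typescript
// Minimal field-data (RUM) collection sketch using the open-source web-vitals
// library: the same LCP / CLS / INP metrics that Chrome reports into CrUX can
// be captured in the page and beaconed to any endpoint the site owner chooses.
import { onLCP, onCLS, onINP, type Metric } from 'web-vitals';

function sendToAnalytics(metric: Metric): void {
  // Placeholder endpoint – replace with your own collector.
  const body = JSON.stringify({
    name: metric.name,   // e.g. 'INP'
    value: metric.value, // e.g. interaction latency in milliseconds
    id: metric.id,       // unique ID for this page load
  });
  navigator.sendBeacon('/analytics', body);
}

onLCP(sendToAnalytics); // Largest Contentful Paint
onCLS(sendToAnalytics); // Cumulative Layout Shift
onINP(sendToAnalytics); // Interaction to Next Paint
```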


https://web.dev/articles/inp#no-inp-value

Other, less obvious signals were changes in the urgency with which Google warned webmasters against a practice called cloaking, which is essentially showing one version of a page to a bot and a different version to a user. Previously, Google had warned that cloaking was a clear violation of their terms, and that it could result in a manual action or removal from the index. After the launch of Mobile-First Indexing, Google softened their stance, saying that as long as you were altering the page for the benefit of the user, cloaking was okay, and the practice was somewhat re-cast as harmless 'selective serving'. They also changed their communications related to robots.txt: before Mobile-First Indexing, Google said that robots.txt files at the root of a domain were the best way to keep content from getting indexed; afterwards, they switched to preferring on-page robots meta tags to keep content out of the index.
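The practical difference between those two mechanisms is worth spelling out: a robots.txt Disallow only blocks crawling, so a URL can still end up indexed if other pages link to it, while a noindex signal (an on-page meta robots tag or an X-Robots-Tag response header) keeps a page out of the index but only works if the crawler is allowed to fetch the page and see it. A small Express-based sketch of both (the paths and port are illustrative):

```typescript
// Sketch of the two mechanisms the paragraph contrasts. A robots.txt Disallow
// blocks crawling of a path but not indexing of its URLs; a noindex signal
// (shown here as an X-Robots-Tag header, equivalent to an on-page
// <meta name="robots" content="noindex">) blocks indexing, but only if the
// crawler can actually fetch the page.
import express from 'express';

const app = express();

// robots.txt: stops crawlers from fetching /private/, but URLs under it can
// still be indexed if they are linked from elsewhere.
app.get('/robots.txt', (_req, res) => {
  res.type('text/plain').send('User-agent: *\nDisallow: /private/\n');
});

// noindex: the page stays crawlable, but engines are told to keep it out of
// the index entirely.
app.get('/members-only', (_req, res) => {
  res.set('X-Robots-Tag', 'noindex');
  res.send('<html><body>Not for search results</body></html>');
});

app.listen(3000);
```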

Google also created the Mobile-Friendly Test and later the URL Inspection Tool to allow users to see what GoogleBot was seeing, but these tools always used a different version of the bot than the one actually doing the crawling and indexing – presumably because what was captured by the real bot was not actually meant to be seen by real webmasters. Google began communicating that the information in the 'cache' view that had always been linked from search results was not as reliable as the information in the URL Inspection Tool. Then, at the beginning of 2024, Google eliminated the cache button from search results, and two months later also deprecated the cache: operator that worked in the address bar, replacing it with information from the WayBack Machine – which does not tell a webmaster anything about what Google's crawler is really seeing, just what users and WayBack crawlers have seen.

Local Chrome Processing, Preprocessing, CPU & GPU Usage Monitoring

In addition to this, Luca Casonato recently revealed a secret extension that is baked into all Chromium browsers but not listed in the Extensions menu. It can't be removed or changed, and it is there to monitor CPU and GPU usage in Chrome. Beyond this, the extension gives all .google.com sites access to system CPU, GPU, and memory usage data. This API is not available to other websites – only to Google's own properties. This kind of covert data collection may be enabling Google to gain insight into how users interact with websites on a deeply technical level. This kind of surveillance, exclusive to Google-owned domains, gives Google an unparalleled advantage in data collection, ensuring it can optimize its services, ads, and AI algorithms based on the most granular user activity.
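For a sense of what data is involved, Chrome's public extension platform exposes the same kind of information through the chrome.system.cpu and chrome.system.memory APIs – but an ordinary extension must declare those permissions in its manifest and be knowingly installed by the user. The sketch below uses those public APIs as a point of reference; it is not the hidden component extension itself.

```typescript
// Generic Chrome extension sketch (requires the "system.cpu" and
// "system.memory" permissions in manifest.json) showing the public APIs that
// expose CPU and memory information. The point of the reporting above is that
// Google's own sites reportedly get similar data without a user-installed
// extension or a visible permission prompt.
chrome.system.cpu.getInfo((cpu) => {
  console.log(`model: ${cpu.modelName}, logical processors: ${cpu.numOfProcessors}`);
  // Per-core cumulative usage counters:
  for (const processor of cpu.processors) {
    const { user, kernel, idle, total } = processor.usage;
    console.log(`busy fraction: ${(user + kernel) / total}, idle fraction: ${idle / total}`);
  }
});

chrome.system.memory.getInfo((mem) => {
  console.log(`capacity: ${mem.capacity} bytes, available: ${mem.availableCapacity} bytes`);
});
```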

https://youtu.be/txNT1S28U3M?si=_DAPVpRagKv1ywzH&t=1641

Based on the naming, we can infer that the monitoring is there to help Google Meet, Google's browser-based video chat platform, run smoothly. Again, while this may be the primary reason the software is needed, it is suspicious that it is built as an extension rather than baked into the code, that it is not shown as an extension, and that it can't be removed. We should also remember that in the US, once Google has the data, they are free to use it for the primary purpose that they specify, as well as for any other purpose that they want, even if it is not specified or made clear in the naming. This is not true in the EU, where data use is limited and can only be leveraged for the designated purpose. Also, if it is just for Google Meet, you would think that users who don't use Google Meet could turn it off or remove it, but they simply can't.

The Chrome app hosts a variety of folders, most of which are named, but some of which are only numbered. Dejan Petrovic, who was also instrumental in finding and parsing the Google leak, has worked out what some of the folders do, which includes some local language and potentially topic modeling; he even indicated that a version of TensorFlow Lite was in the file system – an idea that was verified by someone in the digital security space as well. TensorFlow Lite is an on-device machine learning runtime that would only be needed if Google was using locally stored data for some type of machine learning. Apparently Google says that it is used to power the Network Monitoring tab in Chrome DevTools (apparently even if DevTools is never accessed by the user).
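For readers unfamiliar with what "TensorFlow in the browser" would even look like, here is a generic sketch of on-device inference using the open-source TensorFlow.js runtime (a sibling of TensorFlow Lite). It is purely illustrative of the pattern of loading a model and scoring local data without sending it to a server; the model URL, input shape and "topic scores" output are hypothetical, not a reconstruction of anything bundled in Chrome.

```typescript
// Generic on-device inference sketch with TensorFlow.js: a pre-trained model is
// loaded once, then predictions run entirely locally in the browser process.
// The model URL and input/output semantics are hypothetical placeholders.
import * as tf from '@tensorflow/tfjs';

async function classifyLocally(features: number[]): Promise<number[]> {
  // Load a (hypothetical) pre-trained graph model from a static URL.
  const model = await tf.loadGraphModel('https://example.com/models/topic-model/model.json');

  // Wrap the locally available data in a tensor and run inference on-device.
  const input = tf.tensor2d([features]); // shape [1, features.length]
  const output = model.predict(input) as tf.Tensor;

  const scores = Array.from(await output.data()); // e.g. per-topic scores

  // Free the memory held by the tensors.
  input.dispose();
  output.dispose();

  return scores;
}
```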


https://youtu.be/txNT1S28U3M?si=m12fXX21iB9_Fn3k&t=1711

It was in 2019, one year after the launch of Mobile-First Indexing, that Google started talking about optimizing for Back/Forward Caching, or the BFCache. With this, properly optimized websites allowed Chrome to store a complete snapshot of the executed page code for a URL, so that if the user later clicked the 'back' button, the page could be reloaded without reprocessing. In 2024, Google announced that Chrome would begin to ignore the 'no-store' HTTP cache control directive (and Google officially updated their policy on Oct. 11, 2024) so that BFCache could be used more aggressively – all, of course, for the speed and benefit of Chrome users.
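Both halves of this are observable from ordinary page code: `Cache-Control: no-store` is the response header sites have historically used to say "never keep a copy of this page," and the pageshow event's persisted flag tells a page whether it was just restored from a BFCache snapshot rather than re-fetched and re-rendered. A small sketch of the detection side:

```typescript
// Detecting a BFCache restore from ordinary page code: when the user navigates
// "back", a page restored from the back/forward cache fires `pageshow` with
// event.persisted === true, meaning the stored snapshot was reused and no
// scripts were re-executed. (The `Cache-Control: no-store` response header is
// the opt-out signal that Chrome announced it would start ignoring for
// BFCache purposes.)
window.addEventListener('pageshow', (event: PageTransitionEvent) => {
  if (event.persisted) {
    console.log('Restored from the back/forward cache – no re-render needed.');
  } else {
    console.log('Loaded and rendered from scratch.');
  }
});
```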

This, of course, gives Chrome easy access to full snapshots of pages that it could be saving and sending back to Google data centers, or processing locally. A post on research.google shows exactly how this could be done, noting that the page content could be summarized and annotated, questions could be asked and answered, and navigation could be completed. (NOTE: I fail to explain this in the talk, but Michelle Robbins' picture is included with the Google animation because she was the one who brought this research to my attention.) Perhaps some variation of this is happening to snapshots locally, before data is sent to Google's index.

Even if this is not the case (or just not the case yet), it seems likely that beyond clicks and scrolls, and potentially rendering data, Google is also using Chrome to process or preprocess data locally before it is transmitted back to their data centers. As Olaf Kopp reports, there is a Google patent for on-device machine learning that is already in the public domain. This would explain the long, ongoing mystery of why Chrome takes up so much memory, heating up processors and causing laptop fans around the world to kick on in response. And maybe this is why Chrome demands to be updated so regularly now – it seems like there is never a week without a Chrome update – because the crawler has to match what the index and the algorithm are currently requesting and need for processing as they all evolve together. There are a number of things that could be causing Chrome to need such regular updates, but there does seem to be at least some correlation between increasing Chrome releases and major Google updates – obviously correlation and not causation, but still interesting.



https://youtu.be/txNT1S28U3M?si=13851Hrk0VS3hq46&t=983

Some have argued that rendering is not as big of a deal as it used to be, and that there are now APIs that can speed it up dramatically, so Google is probably not that worried about it and wouldn't need local machines to complete it. I can easily agree that JavaScript rendering has gotten faster, but I still think it would be a significant cost and compute burden that Google would want to minimize or offload entirely. Unfortunately, the cost of crawling is not something that Google tracks or even publicly talks about. In fact, when asked, Google will not discuss the cost of crawling and rendering JavaScript at all, but instead tells webmasters not to worry about it, with no further details.

My suggestion is not that every page rendered on your phone or computer is sent back to Google – they have smart systems that are likely used to minimize the load in a variety of ways. Google could be fetching information from user accounts or caches only on an as-needed basis, after completion of a Phase 1 crawl; Google could be selectively rendering content only when it has changed; Google could also be using the Core Web Vitals system to aggregate and anonymize multiple users' rendered versions of a page, or could be pre-processing flattened versions of the pages locally and simply sending that information up as part of the user account information. We just don't know.

We can also simply look at other GoogleBots, like the Shopping bot, called 'Google StoreBot', that is used to crawl and verify content for the Merchant Center. It is notoriously bad at understanding JavaScript – which at a minimum shows us that the execution of JavaScript is still costly enough that Google is not doing it to its fullest capability across all of its crawling; if it were so cheap and easy, Google would surely be crawling JavaScript here too. But of course, this bot only has one phase of indexing – at least for now, and at least as far as we know. Additionally, assuming that the cost and efficiency of the bot are the only JavaScript concerns that Google has ignores the security concerns that Google has related to JavaScript execution.

The Changing Tacit Agreement Between Google, Searchers & Publishers

This article outlined how Google could be using Chrome to help with a massive data collection effort, including potentially using Chrome as a rendering engine for its crawler. While there is not 100% proof of what Google is doing here, the circumstantial evidence does seem to pile up. The deal that we have struck with Google is changing, and most people don't realize it. Users used to be firmly on one side of the equation, consuming search results with ads, and sometimes clicking the ads, so that Google could make money. Publishers were firmly on the other side of the equation, creating content and potentially buying ads to drive traffic, and possibly sales, from the customers that Google brought to the equation.

Now, users and publishers are both on both sides of the equation, because they are all being mined and used for data to optimize sales for Google, and Google is the only party reliably benefitting from both sides. Personal data and search result quality are both being compromised, while content and the interaction it receives are being used to train AI systems to either replicate that content or monetize it more profitably.

The worst part is that in the future, Google could claim a lack of direct responsibility, if some or all of the relevant decisions and optimizations were made by AI systems rather than humans. While I still think it is reasonable to believe that the crucial decisions here were all still made by humans, Google may soon pioneer a new series of legal defenses, scapegoating AI systems for preventable abuses. We need to stop and set standards now, before things get that far. The third and final article in this series will delve deeper into the potential implications of Google’s robust data pipeline, and what regulators and people in digitally focused industries should be asking of Google.
