Unveiling Google’s Secret Influence on SEO and Data Collection: The Implications of Chrome’s Data Pipeline – Part 3 of 3

As more evidence comes to light, it becomes clear that Google’s monopolistic behavior isn’t limited to its dominance in search. The company is leveraging its browser, its vast user base, and the data it collects to stay ahead in AI development, ad targeting, and search, ensuring that no competitor can catch up. The many coincidences surrounding the launch and results of Mobile-First Indexing paint a concerning picture of Google’s operations, especially when viewed through the newly suggested lens of Chrome as a potential distributed computing system. The company’s near-total dominance of the browser market allows it to collect unprecedented amounts of data from unsuspecting users, while using that data to build stronger advertising models and to train AI, including the ad models used for Performance Max (PMax), all to further entrench its monopoly.


https://youtu.be/txNT1S28U3M?si=r-NV7bHtr_4nvGTm&t=2006

This is the third and final article in a three-part series about a potential new understanding of how Google might be getting information about pages when it crawls. The first article outlined the main assumptions and understandings that underlie the SEO community’s interactions with Google, especially related to crawling, rendering, indexing, and ranking. The second article in the series reviewed details of the new theory, specifically how the second phase of Google’s Mobile-First Indexing might be working. This last article will review the potential implications of this new understanding, how it could be tested and verified, and what Google could do to clarify the situation.

Google’s influence extends far beyond the search engine or even the online advertising world. By collecting and modeling user data, and by potentially turning Chrome into a tool for rendering and data processing, Google has essentially turned the majority of internet users into unwitting participants in its monopolistic business practices. We know that all the data Google collects can be shared across all of Google’s properties; it is part of their Terms and Conditions. Chrome data is therefore certainly being used to create detailed models of user and purchase behavior (Journeys and Journey modeling) that feed Google Discover, Google Ads, and Google Search, and which Google leverages for everything from sorting users into cohorts for better ad targeting to, potentially, further AI training.

Google’s ability to harness data from billions of users through Chrome isn’t just about improving ad targeting or search algorithms. There is growing concern that Google is also using this data to divert organic clicks toward paid ads, in order to directly feed its bottom line and please anxious investors. The cost of AI processing is high, and Google wants to be able to compete; the intentional undercutting of competing ad networks and low-margin ad models may be part of a larger plan Google needs to fund its AI development efforts.

The problem is that this appears to come at the cost of small publishers who previously made their living from websites that ranked well and drove organic traffic, then lost everything overnight in one of the Helpful Content Updates. While it is Google’s right to change its algorithm at will, it is not Google’s right to rob content creators by lifting their content into an AI Overview and showing it without attribution, or by not even showing their websites when users search for them by name, when the only justification is that Google did not approve of, or benefit enough from, their chosen method of monetization.

https://x.com/jason_kint/status/1834801152254246919/photo/1

This situation is especially concerning in light of the ongoing financial losses and resource challenges faced by other AI companies like OpenAI. Google, with its distributed processing model, may have found a way to sidestep these challenges, giving it an unassailable advantage in the AI space. Google already offers distributed and hybrid cloud solutions to its Google Cloud enterprise clients; why should we assume it is not applying lessons from those ventures to secure its own AI processing capacity for the future? Google is staffed by some of the smartest technologists in the world, and if I can see this as a potential solution, so can they; they probably did years ago. Whether intentional or not, the transition from Universal Analytics to the far inferior Google Analytics 4, and the deletion of terabytes of historical data for sites that did not migrate in time, also seems suspicious in this light.

A Call for Transparency About Chrome Data

The time has come for greater scrutiny of Google’s practices. As SEOs, digital marketers, and internet users, we need to stop blindly accepting Google’s explanations and start questioning the extent of its influence. Google’s control over data collection, search rankings, and advertising is not just a matter of market dominance — it’s a matter of ethical concern. By using our personal devices as part of their data collection and processing network, Google is overstepping boundaries in ways that could have far-reaching consequences for privacy, competition, and the future of AI.

As the demand for compute and processing power grows nearly exponentially, and with the growth of expensive and as-yet unprofitable AI processing systems, Google may be pushed to expand its use of distributed computing in Chrome even further. It’s time to demand transparency and accountability from Google before its unchecked power leads to even greater monopolistic control over the internet. Users deserve to know what data is being collected and how it’s being used, and regulators must act to limit Google’s stranglehold on the digital world before it’s too late.

https://www.forbes.com/sites/bethkindig/2024/06/20/ai-power-consumption-rapidly-becoming-mission-critical/

This theory, and even the possibility that it is true and causing harm, should serve as a wake-up call for regulators, industry leaders, and users alike. Google’s secret use of Chrome as a data collection tool is more than just unethical; it is abusive and potentially illegal. It’s time for serious action to be taken. In essence, Google has weaponized Chrome’s vast market share to gather data from over 65% of the world’s internet users without most of them realizing it.

Specific Notes for Those Who Want to Delve Deeper: 

While the large majority of feedback on this theory has been enthusiastic, the main criticism has been that there are too many logical leaps, and that “big theories require big proof.” I agree that the theory has not been 100% proven in my talk or in this article series. While a lot of interesting circumstantial evidence has been presented, some have requested more detail about the telemetry logs and packets in which the Chrome data is being sent. Others have suggested that the files would be too big to go unnoticed, and that this would be such an invasive practice that no company would allow its employees to use Chrome, especially to access sensitive corporate data, plans, or information.

The idea that Google is capturing information from Chrome is not new; the earliest instance I can find of this concept being put forward in our industry is 2011, when GoogleBot first started crawling with MediaKit. So the idea is not new, but people still seem to struggle with the concept that Google may not have explained, or gotten explicit permission for, everything it is collecting and the things it is using that data for.

We should no longer assume that Google is acting in good faith by default. Maybe Google classifies all of this data capture under the broad language of its Terms and Conditions, where we all agree that data can be used to ‘improve search quality’ or some other overly broad designation. Maybe Google has developed new, proprietary technology to compress or hide the data transfer in covertly installed extensions, user account management systems, video calls in Meet, weekly Chrome updates, or other common user behaviors. Assuming that we know about all of the technologies and techniques Google is using seems a bit naive. And Chrome is not the only problem: on Android phones, Google has direct access to even more data through the OS, and may be moving data out of Chrome telemetry to pass it under different labels.

What we know for sure is that Chrome is already collecting page rendering data (if not full page renders) for CrUX and Core Web Vitals, and that it is collecting page-level interactions from real users, all without explicit consent beyond users accepting the normal Terms and Conditions. Google stipulates all of this, so the idea that full page renders may also be captured is really not much of a stretch in my mind. Some have noted that personalization, cookies, or extensions would alter the rendered page, but Google gets a stateless browser experience in Phase 1 of crawling, so it should be able to use that as a baseline to identify when pages have been modified. Beyond this, a browser extension must be granted permissions directly in Chrome before it can modify the viewing experience on a page, so Chrome could easily omit those pages from its dataset.
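The CrUX side of this collection is at least publicly documented: the Chrome UX Report API exposes the aggregated field data Chrome gathers from real users. A minimal sketch of querying it, using only the request-body fields from the published API reference (a live call requires a CrUX API key from the Google Cloud console):

```python
import json
import urllib.request

CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"

def build_crux_query(url, form_factor="PHONE", metrics=None):
    """Build the JSON body for the CrUX API's queryRecord method.

    formFactor may be PHONE, DESKTOP, or TABLET; metrics is an optional
    list such as ["largest_contentful_paint", "interaction_to_next_paint"].
    """
    body = {"url": url, "formFactor": form_factor}
    if metrics:
        body["metrics"] = metrics
    return body

def query_crux(api_key, url):
    """Send a live query (not executed here; needs a real API key)."""
    data = json.dumps(build_crux_query(url)).encode()
    req = urllib.request.Request(
        f"{CRUX_ENDPOINT}?key={api_key}",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Show the body that would be sent for one page.
print(json.dumps(build_crux_query("https://www.example.com"), indent=2))
```

If a URL returns data here, enough real Chrome users visited it for Google to have collected and aggregated their field measurements, which is exactly the pipeline described above.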

The default settings in Chrome allow for a variety of different types of data collection and describe it as periodic rather than continuous, which would certainly make it harder to find and evaluate. We have also learned from security experts that the data Chrome sends to user accounts is heavily encrypted and impossible for anyone without the necessary decryption keys to identify, so again, we simply don’t know what data Chrome is sending.

Google could be pre-processing information so that it is smaller and less likely to be detected, rather than sending the fully rendered page. It could be requesting and receiving page data only sporadically, on an as-needed basis, after Phase 1 of a crawl has been completed. It could be sending data from pages with enough visits to the CrUX system to be anonymized and processed for Core Web Vitals before being passed on for use in the index. The data could be streamed through an as-yet-unknown extension or API that Google has baked into the code, into the user admin functionality, or into any number of other elements we take for granted when using Chrome. The truth is, Google is full of clever developers who want to make the web better; if they believe that what they are building into Chrome serves that goal, they might not question it. And we know Google has a history of talented engineers leaving the organization when they become uncomfortable with how the giant is using the technology they helped build.

I am eagerly hoping that this article and the talk that it is based on will inspire people more technical than me to start looking more deeply into exactly what Chrome is doing with our computers and phones. Even when concerns and warnings are not 100% provable, they bring light to topics that are worthy of attention, and they are important work when no one else is willing to come forward with their concerns. Already one person – Mark Williams-Cook – has set up a test to see if we can prove the theory out, and I hope others will set up their own tests too.
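One concrete starting point for anyone running such tests: Chrome can export its own network activity via chrome://net-export, which writes a JSON log of request events. A minimal sketch of tallying which hosts a capture contacted; note that the log structure assumed here is heavily simplified (real net-export output nests many more event types and fields), and the sample entries are purely illustrative:

```python
import json
from collections import Counter
from urllib.parse import urlparse

def hosts_contacted(netlog):
    """Count hostnames seen in a (simplified) net-export style log.

    Assumes a log shaped like {"events": [{"params": {"url": ...}}, ...]};
    events without a URL in their params are skipped.
    """
    counts = Counter()
    for event in netlog.get("events", []):
        url = (event.get("params") or {}).get("url")
        if url:
            host = urlparse(url).hostname
            if host:
                counts[host] += 1
    return counts

# Tiny inline sample standing in for a real capture file.
sample = {
    "events": [
        {"params": {"url": "https://clients4.google.com/some/endpoint"}},
        {"params": {"url": "https://example.com/page"}},
        {"params": {}},  # event with no URL -- ignored
    ]
}
print(hosts_contacted(sample))
```

Comparing the tally of Google-owned hosts against the pages actually visited in a session is one cheap way to surface traffic that warrants closer inspection with proper packet-capture tooling.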

At a minimum, we know that Google’s new Pixel 9 Pro XL phones are transmitting private data as often as every 15 minutes. According to the TweakTown article, “… the device is automatically connecting to device management and policy enforcement endpoints, which suggests Google has remote control capabilities,” and all of this happens while Gemini is disabled. Beyond that, according to Aras Nazarovas, a security researcher at Cybernews, Google is collecting location data even when GPS location features are turned off, and “the Pixel 9 Pro XL repeatedly uses PII for authentication, configuration, and logging. This practice doesn’t align with the industry’s best anonymization practices and appears excessive.” If this is the lax care Google has given its most recent flagship product, the oversight and overreach are unlikely to be isolated to that one instance.

Questions that Google Should be Expected to Answer:

If Google is not doing anything described in the article or video, it could simply deny it all. That seems unlikely, so instead we have suggested some questions that regulators should be asking, and that investigators should be looking into, when evaluating what data is being captured by Chrome.

  • What level of processing, pre-processing or data evaluation is happening on local computers and phones? Is this something a user can opt out of?
  • What is the purpose of including TensorFlow Lite code in Chrome? What type of information is it processing? 
  • How does Chrome prevent page engagement data, page experience data, loading data and Core Web Vitals information from being collected on pages that are behind firewalls?
  • How did private groups and documents become indexed and ranked by Google’s algorithm in February of 2024? What has been done to prevent similar things from happening again in the future?
  • How is personal data being anonymized and when is personal data being used to train Google’s AI? Are there new or evolving security or privacy measures that we should know about?
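To make the first question concrete: local pre-processing would not need to ship rendered pages at all. A page can be collapsed on-device into a fixed-size fingerprint that is useful for duplicate detection and comparison, yet tiny and unremarkable on the wire. The sketch below is purely hypothetical; nothing here is confirmed Chrome behavior, and SimHash is just one well-known technique that illustrates the idea:

```python
import hashlib

def simhash(text, bits=64):
    """Collapse arbitrary text into a fixed-size SimHash fingerprint.

    A 64-bit fingerprint is enough to compare documents for
    near-duplication, illustrating how a heavy payload could be
    reduced to a few bytes before anything leaves the device.
    """
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(
            hashlib.sha256(token.encode()).digest()[: bits // 8], "big"
        )
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Bit distance between two fingerprints: 0 means identical text."""
    return bin(a ^ b).count("1")
```

Two renders of the same page produce a distance of zero, while a modified page (by an extension, personalization, or cookies) drifts away from the stateless Phase 1 baseline, which is exactly the kind of comparison the questions above are probing for.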