<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Temporal Perspective]]></title><description><![CDATA[Reflecting on the past, analyzing the present, and pondering the future.]]></description><link>https://bakagiannis.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!wp28!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67f41c29-ff2f-45d1-9fe7-4a8352c68f35_1024x1024.png</url><title>Temporal Perspective</title><link>https://bakagiannis.substack.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 16:41:31 GMT</lastBuildDate><atom:link href="https://bakagiannis.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ioannis Bakagiannis]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[bakagiannis@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[bakagiannis@substack.com]]></itunes:email><itunes:name><![CDATA[Ioannis Bakagiannis]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ioannis Bakagiannis]]></itunes:author><googleplay:owner><![CDATA[bakagiannis@substack.com]]></googleplay:owner><googleplay:email><![CDATA[bakagiannis@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ioannis Bakagiannis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Six Approaches to AI Content Monetization]]></title><description><![CDATA[And Why None of Them Work (Yet)]]></description><link>https://bakagiannis.substack.com/p/six-approaches-to-ai-content-monetization</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/six-approaches-to-ai-content-monetization</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Tue, 16 Dec 2025 15:04:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/218b7554-28f3-4fc7-984d-632f4fde9c6c_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://context4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png" width="500" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50828,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://context4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/180416357?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2><strong>TL;DR</strong></h2><p>Search is still growing, but clicks are shrinking. Since AI summaries became mainstream, <a href="https://context4gpts.com/resource-center/ai-cannibalizing-publisher-revenue">zero-click behaviour rose from 56% to 69% (May 2024 &#8594; May 2025)</a>&#8212;a structural shift in how <a href="https://www.similarweb.com/blog/marketing/geo/citation-gap-analysis/">value moves through the web</a>].</p><p>Publishers responded with six monetization plays: training licenses, training royalties, inference licensing, pay-per-crawl, vendor marketplaces, and ads inside AI answers.</p><p>AI decouples <strong>content consumption</strong> (when a model learns or retrieves) from <strong><a href="https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision">content delivery</a></strong> (when a user receives value). Publishers used to control the delivery moment. That&#8217;s where the meter lived. 
But now the unit that used to pay you (a visit) is no longer required to extract value.</p><p>Each approach is trying to re-attach revenue to the old meter. Each breaks&#8212;economically, technically, operationally, or structurally.</p><div><hr></div><h2><strong>The Real Problem: The Billing Unit Disappeared</strong></h2><p>Publishers don&#8217;t just create information. They built a monetization system around delivery: pages, sessions, subscriptions, ad slots, recirculation, brand trust. The page view wasn&#8217;t vanity; it was the <strong>billing unit</strong>.</p><p>The open web&#8217;s economic loop was brutally simple:</p><blockquote><p><strong>Publish &#8594; rank &#8594; click &#8594; monetize</strong></p></blockquote><p>AI removes that billing unit and changes the loop to:</p><blockquote><p><strong>Retrieve/train &#8594; synthesize &#8594; answer</strong></p></blockquote><p>The &#8220;answer&#8221; happens inside the AI interface, not on your domain. The user gets value without the default monetization surface.</p><div><hr></div><h2><strong>Approach 1: Bulk Training Licensing</strong></h2><h3><strong>Lump-sum nostalgia for a training era that&#8217;s already over</strong></h3><h3><strong>What it promises</strong></h3><p>Large, one-time licensing deals offer publishers significant upfront revenue while maintaining familiar enterprise sales motions. Major publishers negotiate multi-million dollar contracts with leading AI companies, monetizing their archives at scale.</p><h3><strong>How it works</strong></h3><ul><li><p>A publisher licenses an archive (articles + metadata) for <strong>model training</strong> (and sometimes for product features like summaries/citations).</p></li><li><p>The AI company pays a large upfront fee or multi-year commitment.</p></li><li><p>The publisher treats it like syndication: monetize the back catalog while core traffic economics weaken.</p></li></ul><h3><strong>Why it fails</strong></h3><p><strong>1) It doesn&#8217;t scale past the top of the market</strong><br><a href="https://context4gpts.com/resource-center/ai-cannibalizing-publisher-revenue">68% of known commercial agreements are with News or Media companies</a>. Only the largest publishers with significant brand leverage can command meaningful deals. If you&#8217;re not News Corp, Cond&#233; Nast, or The New York Times, you&#8217;re not getting a deal worth discussing in board meetings. <a href="https://kaptur.co/the-hidden-economy-behind-ai-data-licensing-takes-center-stage/">Kaptur</a>, <a href="https://digiday.com/media/2024-in-review-a-timeline-of-the-major-deals-between-publishers-and-ai-companies/">Digiday</a></p><p><strong>2) Content Licensing Agreements Will Concentrate Markets and hurt the Open Web if it is the only viable model</strong><br>Procurement overhead, trust requirements, and integration costs push buyers toward a small number of high-authority sources. A failure to maintain non-discriminatory access will result in the consolidation of both the AI and content production markets. Sources: <a href="https://www.promarket.org/2025/11/20/content-licensing-agreements-will-concentrate-markets-without-standardized-access/">ProMarket &#8212; Content licensing agreements will concentrate markets</a></p><p><strong>3) The value center is moving from training to inference</strong><br>Compute and product value are shifting toward inference and deployment. 
<a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/">Menlo</a> reports <strong>74% of builders now say most workloads are inference</strong>. That&#8217;s where product value and distribution concentrate. Training becomes less of the <a href="https://www.infoworld.com/article/4087007/ai-is-all-about-inference-now.html">&#8220;rentable choke point.&#8221;</a></p><p><strong>4) Publishers aren&#8217;t set up to &#8220;sell datasets&#8221;</strong><br>Publishers are content creators, not data engineers. Packaging, metadata cleanliness, update semantics, provenance, and compliance are non-trivial&#8212;often pushing publishers toward intermediaries that take margin. Additionally the data that have out-of-the-box can be used only for foundational - next word prediction - training. Currently most of the compute goes to post training for instruction following (e.g. SFT) or behavioural training (e.g. RLHF).</p><p><strong>5) Renewal leverage is legally fragile</strong><br>Even when cash is real, publishers are negotiating under a moving legal landscape: if training ends up broadly protected as transformative fair use, renewals get harder and <a href="https://www.skadden.com/insights/publications/2025/07/fair-use-and-ai-training">&#8220;one-time&#8221;</a> becomes <a href="https://www.jonesday.com/en/insights/2025/06/two-us-courts-address-fair-use-in-genai-training-cases">&#8220;one-and-done.&#8221;</a></p><p><strong>Bottom line:</strong> Bulk licensing is a bridge for a few, not a market design for everyone.</p><div><hr></div><h2><strong>Approach 2: Training Royalties / Model-Output-Level Compensation</strong></h2><h3><strong>Attribution fantasy in neural networks</strong></h3><h3><strong>What it promises</strong></h3><p>&#8220;You should earn forever because the model learned from you.&#8221; Publishers can participate in the long-term upside of AI by receiving ongoing royalties based on how their content is embedded in model weights, with attribution at the model level ensuring fair compensation.</p><h3><strong>How it works</strong></h3><ul><li><p>A model is trained on large datasets that include publisher content (licensed or claimed).</p></li><li><p>An attribution system attempts to estimate how much each publisher contributed to outputs or model capability.</p></li><li><p>Royalties are distributed based on estimated contribution (often framed as &#8220;proportional use&#8221;).</p></li></ul><p>What does not belong here: ProRata&#8217;s <a href="https://www.businesswire.com/news/home/20240806000889/en/ProRata-Invents-Generative-AI-Attribution-Technology-to-Compensate-and-Credit-Content-Owners-While-Facilitating-Fairness-and-Fact">&#8220;fractional attribution&#8221;</a> since it is a solution for RAG synthesis and not training.</p><h3><strong>Why it fails</strong></h3><p><strong>1) Model-level attribution isn&#8217;t auditable in the way money requires</strong><br>The hard constraint isn&#8217;t &#8220;we need better analytics.&#8221; It&#8217;s that modern models store learned representations in distributed weights&#8212;not retrievable &#8220;source records.&#8221; When payments depend on causal attribution, you need something closer to accounting than inference.</p><p>Research confirms this: Given the architectural complexity and intrinsic limitations of today&#8217;s LLMs, failures are not outliers but structural inevitabilities, and their black-box nature makes error diagnosis and causal attribution prohibitively difficult. 
<a href="https://arxiv.org/html/2510.17256v1">LLM Explainability</a>, <a href="https://arxiv.org/html/2510.10161">Large Language Model Sourcing: A Survey</a>, <a href="https://arxiv.org/html/2404.12691v1">Data Authenticity, Consent, &amp; Provenance for AI are all broken</a></p><p><strong>2) It becomes adversarial immediately</strong><br>Once money depends on attribution, everyone optimizes for it&#8212;poisoning, laundering, prompt manipulation, strategic paraphrase.</p><p><strong>3) Output-level citation &#8800; training-level attribution</strong><br>Some systems can track what appears in an answer. That&#8217;s useful, but it&#8217;s not proof of what trained the model (or what shaped latent knowledge). For example watermarking is the prevalent technique of establishing provenance in GenAI but watermark detection tools, especially for text, may be able to provide <a href="https://www.ntia.gov/issues/artificial-intelligence/ai-accountability-policy-report/developing-accountability-inputs-a-deeper-dive/information-flow/ai-output-disclosures">only a statistical confidence score</a>, not a definitive attribution, for the content&#8217;s origins.</p><p><strong>Bottom line:</strong> Royalties that require model-level attribution collapse under audit, dispute, and adversarial pressure.</p><div><hr></div><h2><strong>Approach 3: Direct Inference Licensing</strong></h2><h3><strong>Premium publisher aristocracy that breaks the open web</strong></h3><h3><strong>What it promises</strong></h3><p>Pay at the moment of use: if an AI system retrieves your content to answer a query, you get paid. This <em>sounds</em> aligned with value delivery.</p><h3><strong>How it works</strong></h3><ul><li><p>Publishers provide licensed APIs/feeds directly to LLMs.</p></li><li><p>AI apps call them during inference for freshness/grounding. Original artifacts like articles, blogs or pieces of them are fed back to the model.</p></li><li><p>Billing is per call, per document, per token returned, or contracted tiers. c</p></li></ul><h3><strong>Why it fails</strong></h3><p><strong>1) Works best for top-tier publishers</strong><br>Procurement overhead, trust requirements, and integration costs push buyers toward a small number of high-authority sources.</p><p>The market structure mirrors Approach 1: &#8220;The likely outcome is a dual consolidation: fewer major publishers controlling content supply, and fewer major AI firms controlling demand.&#8221;</p><p><strong>2) It encourages exclusivity and competitive foreclosure</strong><br><a href="https://www.promarket.org/2025/11/18/anticompetitive-acquiescence-in-ai-content-licensing/">&#8220;Anticompetitive acquiescence&#8221;</a> describes when companies acquiesce in lawsuits, licensing, or regulation to raise rivals&#8217; costs&#8212;potentially benefiting if competitors suffer more or potential competitors never enter the market.</p><p>Publishers want guaranteed revenue and preferential placement; AI companies want stable coverage and advantage. The market consolidates around the biggest players, eliminating the long tail. 
For example ChatGPT could use only one news source for America - imagine how detrimental that would be - to ensure coverage.</p><p><strong>3) No clean mapping between value, cost, and pricing unit</strong><br>Inference costs vary wildly by output length, context, model choice, and user behaviour, making &#8220;per query&#8221; pricing hard to reconcile with unit economics.<br>Sources: <a href="https://www.cloudzero.com/blog/inference-cost/">CloudZero &#8212; Your Guide To Inference Cost</a>, <a href="https://www.getmonetizely.com/articles/the-ai-inference-cost-problem-how-to-price-when-compute-costs-vary/">Monetizely &#8212; AI Inference Cost Problem</a></p><p>Publishers want to charge per query. AI companies experience costs per token. Users expect value per answer. There&#8217;s no natural mapping between these three.</p><p>A simple query might trigger complex retrieval logic, pulling from dozens of sources, while a complex query might be answered from cached knowledge. Who pays what, and based on which metric?</p><p><strong>Bottom line:</strong> Works for a small set of premium brands; structurally hostile to broad, open supply. <strong>But</strong> it could work for vertical AI application integration with niche / expert content creators.</p><div><hr></div><h2><strong>Approach 4: Pay-Per-Crawl / Access</strong></h2><h3><strong>Metering theater with enforcement gaps</strong></h3><h3><strong>What it promises</strong></h3><p>Charge bots to access content. Simple, usage-based, publisher-controlled.</p><h3><strong>How it works</strong></h3><ul><li><p>A publisher (often via CDN/proxy like <a href="https://blog.cloudflare.com/introducing-pay-per-crawl/">Cloudflare</a>) classifies automated traffic: allow, block, or require payment.</p></li><li><p>Pricing is typically per request (crawl) or per page accessed, sometimes with tiers.</p></li><li><p>Access protocols/standards like <a href="https://rslstandard.org/rsl">RSL</a>) (an XML-based open standard enabling machine-readable licensing and automated compensation. Publishers add machine-readable terms to robots.txt files) try to formalize &#8220;what uses are allowed&#8221; and &#8220;what costs apply.&#8221;</p></li><li><p>Bots that pay gain access; bots that don&#8217;t are blocked.</p></li></ul><h3><strong>Why it fails</strong></h3><p><strong>1) It penalizes freshness and repeat payments</strong><br>Good AI systems refresh and cross-check. Pay-per-crawl makes that expensive, pushing caching and staleness. This creates an incentive to cache aggressively and crawl less frequently, leading to stale data. The economic model punishes exactly the behaviour that would improve AI quality: frequent, thorough content retrieval. Also the definition of &#8220;caching&#8221; can be extended a great deal, meaning an LLM can cache a search result into storage and never require to fetch that data again since it is getting access to raw content.</p><p><strong>2) Latency and Fragmentation</strong><br>It fragments the web into thousands of toll booths. Every toll booth adds auth/payment/policy checks&#8212;new failure modes in a latency-sensitive stack. 
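</p><p>A minimal sketch of what one of those toll booths adds to a single retrieval, assuming a hypothetical pay-per-crawl handshake: the 402 &#8220;Payment Required&#8221; flow is modelled loosely on the Cloudflare announcement above, but the header names, token scheme, and pricing are illustrative, not any real provider&#8217;s API.</p><pre><code># Illustrative only: a hypothetical pay-per-crawl handshake that a retrieval
# pipeline would have to run for every gated source. Header names and the
# auth scheme are invented for this sketch, not the API of any real provider.
import time
import requests

def fetch_with_toll(url, pay_token):
    start = time.monotonic()
    resp = requests.get(url, headers={"User-Agent": "example-ai-bot/0.1"})
    round_trips = 1
    if resp.status_code == 402:                    # the toll booth says pay first
        price = resp.headers.get("x-crawl-price")  # hypothetical pricing header (not used further here)
        resp = requests.get(                       # retry with proof of payment
            url,
            headers={
                "User-Agent": "example-ai-bot/0.1",
                "Authorization": "Bearer " + pay_token,  # hypothetical scheme
            },
        )
        round_trips += 1
    elapsed = time.monotonic() - start
    return resp.text, round_trips, elapsed

# One extra round trip per gated publisher, plus auth, policy, and billing
# checks, multiplied across every source consulted for a single answer.
</code></pre><p>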
Inference latency introduces <a href="https://www.tensormesh.ai/blog-posts/ai-inference-latency-slow-response-times-and-revenue">hidden</a> and <a href="https://context4gpts.com/resource-center/real-cost-web-scraping">opportunity</a> costs that AI companies cannot afford.</p><p><strong>3) Complexity of implementation appeals only to a handful players</strong> The <a href="https://rslstandard.org/rsl">RSL</a> implementation guide for an AI company is 11 pages long - I know because I made one for testing. And then the application will have to register with a transaction partner - most likely a DSP like in the <a href="https://blog.bidswitch.com/https/blog.bidswitch.com/announcing-the-dynamic-content-ledger">Bidswitch example</a> - in order to execute this content trade. On the other hand publishers need to implement this at the page level. The Standard that is being worked right now by the <a href="https://iabtechlab.com/announcing-content-monetization-protocols-comp-for-ai-working-group/">IAB CoMP working group</a> is an OpenRTB style standard - which is pretty lengthy and detailed - for EACH content page. Then integrate with an ecosystem partner that runs a licensing server for all these pages. I wonder who has the capacity to implement such a thing (<code>thinking emoji)</code>.</p><p><strong>4) Enforcement is the central failure mode</strong><br>Non-compliance is measurable and rising: <strong><a href="https://www.theregister.com/2025/12/08/publishers_say_no_ai_scrapers/">13.26% of AI bot requests ignored robots.txt in Q2 2025</a></strong>. CDNs and infrastructure providers will definitely try to help in that direction but in the end if AI companies can get access to content without paying they will do that. At the same time adds a ton of complexity that smaller publishers <a href="https://stytch.com/blog/how-to-block-ai-web-crawlers/">cannot afford</a>.</p><p><strong>5) Catch 22: Discovery requires the thing you&#8217;re trying to monetize</strong><br>The model assumes crawlers will <em>find</em> your content, evaluate it, and decide to pay. But discovery itself requires access&#8212;the very thing being metered. If you block crawlers by default, you&#8217;re invisible to the index. If you allow free access for discovery, you&#8217;ve already given away the data. It&#8217;s a structural catch-22: to be discovered, you must be crawled; to be crawled under pay-per-access, you must already be known. The system breaks at bootstrap.</p><p>Currently, Google&#8217;s AI Overviews exemplify this: the search index feeds the AI, and the AI reduces clicks, but the index was built on free crawling. Pay-per-crawl assumes a world where discovery and access are separate, but in practice, they collapse into the same request. You can&#8217;t sell access to something that hasn&#8217;t been discovered, and you can&#8217;t be discovered without granting access.</p><p><strong>6) Pricing remains speculative, neither market-tested nor validated at scale</strong><br>Pay-per-crawl assumes a rational pricing equilibrium, but no one has demonstrated how to price access in a way that works for both sides. Publishers set rates hoping to capture value; AI companies face unpredictable costs that compound across thousands of sources. The underlying bet is that regulatory pressure and blocking leverage will force well-capitalized players&#8212; namely Google and OpenAI&#8212;into compliance, creating a de facto standard through coercion rather than market discovery. This is not price formation. 
It&#8217;s a negotiation standoff disguised as a business model.</p><p><strong>Bottom line:</strong> Metering without enforceability RAW ACCESS becomes a leaky tax that degrades quality and still doesn&#8217;t guarantee payment with a flawed design mechanism geared towards the big players.</p><div><hr></div><h2><strong>Approach 5: Vendor Content Marketplaces</strong></h2><h3><strong>Closed platforms with no public proof</strong></h3><h3><strong>What it promises</strong></h3><p>Third-party marketplaces can solve the coordination problem by aggregating publisher content and connecting it with AI companies seeking licensed data, creating network effects and standardized access. Centralized Pay-Per-Query models.</p><h3><strong>How it works</strong></h3><ul><li><p>Publishers integrate vendor tech (gateway, authentication, metering, settlement).</p></li><li><p>The vendor aggregates supply and sells a unified pipe to AI companies.</p></li><li><p>Vendors provide analytics and payouts; they take a cut.</p></li></ul><p>Examples:</p><ul><li><p><a href="https://techcrunch.com/2024/06/26/dappier-is-building-a-marketplace-for-publishers-to-sell-their-content-to-llm-builders/">Dappier marketplace</a></p></li><li><p><a href="https://tollbit.com/blog/akamai-partnership/">TollBit &#8212; Akamai partnership</a></p></li></ul><h3><strong>Why it fails</strong></h3><p><strong>1) Cold-start dynamics are brutal</strong><br>Marketplaces struggle when premium publishers can do direct deals and the long tail can&#8217;t attract demand.</p><p><strong>2) Build on rented land (AGAIN)</strong><br>Publishers learned this lesson with Google: build on someone else&#8217;s infrastructure + distribution, and you&#8217;re subject to their terms, their margins, and their strategic pivots. When Google hit critical mass, it could unilaterally change ranking algorithms, ad share, and traffic flow&#8212;and publishers had no recourse.</p><p>Vendor marketplaces recreate this dynamic. You integrate their APIs, route traffic through their pipes, accept their analytics, and trust their settlement. When they reach scale, they control pricing, terms, and access to demand. If they change margin splits or restrict publisher controls, you have no leverage&#8212;your integration costs and workflow dependencies make switching prohibitively expensive.</p><p>This isn&#8217;t hypothetical. We&#8217;re watching it happen now: Google&#8217;s AI Overviews and AI Mode demonstrate how platforms can unilaterally insert themselves between publishers and users, extracting value without negotiation. Vendor marketplaces promise to prevent this&#8212;while building the exact same structural dependency under a different brand.</p><p>The open web doesn&#8217;t survive by replacing one intermediary with another. 
It survives when its citizens retain control over pricing, distribution, and the ability to exit without penalty.</p><p><strong>3) &#8220;Control&#8221; often becomes lock-in</strong><br>Vendor-specific integration increases switching costs; the vendor often owns demand relationships.</p><p><strong>4) Transparency remains theater, not infrastructure</strong><br><a href="https://iabtechlab.com/announcing-content-monetization-protocols-comp-for-ai-working-group/">IAB Tech Lab&#8217;s working group</a> explicitly called out &#8220;the absence of a marketplace and methods to attribute contribution of content.&#8221;</p><p>Publishers entering these marketplaces have no independent way to verify:</p><ul><li><p>Actual usage volume (how many times their content was retrieved)</p></li><li><p>Realized pricing (what AI companies actually paid per use)</p></li><li><p>Attribution methodology (how operators determine which content was &#8220;used&#8221;)</p></li><li><p>Marketplace margin (what percentage the intermediary captures)</p></li></ul><p>This isn&#8217;t new. We&#8217;ve seen this pattern before in programmatic advertising: walled gardens control measurement, reporting, and settlement, then report numbers that align with their economics, not yours. The difference is that programmatic ad exchanges at least had third-party verification and auditability standards. Vendor content marketplaces don&#8217;t.</p><p>Without open APIs, standardized reporting schemas, or third-party audits, &#8220;transparency&#8221; becomes whatever the vendor chooses to show you. And when the vendor controls both supply access and demand relationships, publishers have no leverage to demand better.</p><p><strong>Bottom line:</strong> A marketplace could be right; a <em>closed</em> marketplace becomes another dependency.</p><div><hr></div><h2><strong>Approach 6: Ads in AI Responses &amp; Affiliate Hybrids</strong></h2><h3><strong>Trust destruction for fragile yield</strong></h3><h3><strong>What it promises</strong></h3><p>Bring the most proven web monetization engine into answer interfaces. But conceal them as content.</p><h3><strong>How it works</strong></h3><ul><li><p>AI application requests content from a website.</p></li><li><p>The publisher along with the help of AdTech vendors inject paid advertising into the retrieved content.</p></li><li><p>OR publishers create advertorial content directly.</p></li><li><p>Publishers get paid in a CPM way based on content access.</p></li></ul><h3><strong>Why it fails</strong></h3><p><strong>1) Ads attack the core product asset: trust</strong><br>If users suspect <a href="https://www.nim.org/en/publications/detail/transparency-without-trust">commercial bias</a>, the answer engine loses the &#8220;utility&#8221; advantage that made it sticky. Trust is the highest adoption lever for an AI company. Losing that will lead to catastrophe.</p><p><strong>2) Open to Fraud</strong><br>It is very straight forward for someone who has been in AdTech and digital advertising to see that this mechanism can be gamed easily through botting. There will be the same MFA issue that the current open web has.</p><p><strong>3) Brand Voice breaks</strong><br>Ads are not displayed as they were integrated into the content of the publisher. The user&#8217;s LLM will do post-processing of the whole context to reply to the user. 
The brand&#8217;s messaging is most likely going to change in a way that is not controllable.</p><p><strong>4) Alignment contamination risk is real</strong><br>Sponsored outputs leaking into training/feedback loops can create persistent commercial bias and AI models do not act on <a href="https://www.alignmentforum.org/">user&#8217;s best interests</a>.</p><p><strong>Bottom line:</strong> Ads can and should exist in conversational interfaces but separate from the actual content with the right disclosure signals.</p><div><hr></div><h2><strong>What a Working Model Must Do</strong></h2><p>A durable solution has to match how AI behaves:</p><ol><li><p><strong>Monetize inference, not training</strong><br>Training is episodic. Inference is continuous. The monetization surface must live at the delivery moment (Not saying that publishers should not do that as well, but it is not the scalable economic model for the industry).</p></li><li><p><strong>Work without perfect attribution</strong><br>No payment system should depend on reconstructing causal contribution inside a black box. This means moving from &#8220;pay for what you contributed&#8221; to &#8220;participate in the value you enabled.&#8221;</p></li><li><p><strong>Prevent caching from zeroing out publisher participation</strong><br>Incentives must keep knowledge providers in the loop when their information is relied on.</p></li><li><p><strong>Preserve the long tail</strong><br>If only the top 50 brands get paid, the web shrinks into an aristocracy.</p></li><li><p><strong>Offer predictable economics for AI builders</strong><br>Cost volatility will push builders to route around the system.</p></li></ol><div><hr></div><h2><strong>Ending with The Paradigm Shift</strong></h2><h3><strong>Monetizing the artifact, not the moment of usefulness</strong></h3><p>In the 2000s, everyone thought print would remain the cash driver while websites were a quirky distribution channel. Publishers invested in print infrastructure, optimized print advertising, and treated web properties as experiments.</p><p>They were wrong. The web didn&#8217;t complement print but it replaced it as the main revenue driver. The business driver shifted from &#8220;how many newspapers do we sell&#8221; to &#8220;how much web traffic do we generate.&#8221;</p><p>Now we&#8217;re making the same mistake again.</p><p>Everyone thinks websites are the cash driver while AI is a quirky distribution channel. Publishers are investing in SEO, optimizing programmatic advertising, and treating AI licensing as an experiment.</p><p>They&#8217;re wrong again.</p><p>AI will complement websites as much as they did complement print. The business driver is shifting from &#8220;how much web traffic do we generate&#8221; to &#8220;how much value do we deliver in AI value chains.&#8221;</p><blockquote><p><strong>Value is created when the AI delivers a useful answer - not when it ingests content.</strong></p></blockquote><p>If the AI can answer from memory or cache, publisher participation goes to zero. If attribution is required, the system becomes non-auditable. If enforcement is required, the system becomes leaky. 
If exclusivity is required, the open web collapses.</p><h3><strong>From artifacts to value flows</strong></h3><p>In the 2000s, many publishers treated the web as &#8220;distribution.&#8221; It became the business model.</p><p>Now, many are treating AI as &#8220;distribution + licensing upside.&#8221; That&#8217;s not what it is.</p><p>AI is becoming the primary interface between questions and knowledge. So the strategic question isn&#8217;t:</p><blockquote><p>&#8220;How do we get paid for our content?&#8221;</p></blockquote><p>It&#8217;s:</p><blockquote><p><strong>&#8220;How does value flow in AI systems where our knowledge is used&#8212;and where can we attach a fair, enforceable, scalable price?&#8221;</strong></p></blockquote><p>Answer that, and you have a survival plan.</p><p>The future of publisher monetization won&#8217;t be a better contract. It will be <strong>a new market structure</strong>, one that prices usefulness at inference time, without impossible attribution and without turning the open web into a gated estate.</p><p>Let&#8217;s build something that works.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Hidden Cost of Web Scraping]]></title><description><![CDATA[Why AI Apps Are Burning Money on Bad Context]]></description><link>https://bakagiannis.substack.com/p/the-hidden-cost-of-web-scraping</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/the-hidden-cost-of-web-scraping</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Mon, 08 Dec 2025 19:09:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5341c5c7-0fe5-41da-b425-81b6385b6cef_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://context4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png" width="500" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50828,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://context4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/180416357?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Every AI application founder thinks web scraping is the cheapest way to get context. They&#8217;re wrong. It&#8217;s the most expensive infrastructure choice you can make but usually it is the only one.</p><p>The math seems compelling: why pay for content when you can scrape it for free? Build a few parsers, spin up some proxies, and you have access to the entire web. Your cost is just server time and a couple of engineers maintaining the code. Simple, right?</p><p>Except it&#8217;s not simple. And it&#8217;s definitely not cheap.</p><p>Here&#8217;s what actually happens: Your scrapers burn money through token bloat, create compounding engineering debt, expose you to existential legal risk, degrade answer quality, and kill user trust. 
All while you think you&#8217;re saving money.</p><p>The hidden costs are so high that AI companies relying on scraping are operating on borrowed time. The question isn&#8217;t whether they&#8217;ll realize it&#8217;s expensive&#8212;it&#8217;s whether they&#8217;ll figure it out before their competitors do.</p><h2><strong>The Token Economics Nobody Talks About</strong></h2><p>Let&#8217;s start with the most immediate, measurable cost: tokens.</p><p>When you scrape content, you&#8217;re not just paying for inefficient extraction&#8212;you&#8217;re paying for massive overconsumption. <strong>You ingest entire articles when you only need specific facts, definitions, or relevant excerpts.</strong></p><p><strong>The real comparison isn&#8217;t scraped vs. structured content. It&#8217;s what you ingest vs. what you actually need.</strong></p><h3><strong>The Overconsumption Problem</strong></h3><p>Consider what actually happens in AI applications:</p><p><strong>Scenario 1: Answering a factual question</strong></p><p>- User asks: &#8220;What is the capital of France?&#8221; (assume that the LLM does not have the answer in the training data)</p><p>- <strong>What you need</strong>: A simple fact (~10-20 tokens: &#8220;Paris is the capital of France&#8221;)</p><p>- <strong>What you ingest with scraping</strong>: Full article about Paris (~1,200 tokens)</p><p>- <strong>Waste: 98% </strong>of tokens are unnecessary</p><p><strong>Scenario 2: Getting a definition</strong></p><p>- User asks: &#8220;What is retrieval-augmented generation?&#8221;</p><p>- <strong>What you need</strong>: A concise definition (~50-100 tokens)</p><p>- <strong>What you ingest with scraping**</strong>: Full technical article (~1,500 tokens)</p><p>- <strong>Waste: 93% </strong>of tokens are unnecessary</p><p><strong>Scenario 3: Multi-source research synthesis</strong></p><p>- User asks: &#8220;Compare the economic policies of three different countries&#8221;</p><p>- <strong>What you need</strong>: Relevant excerpts from 10 sources (~100-200 tokens each = 1,000-2,000 tokens total)</p><p>- <strong>What you ingest with scraping</strong>: 10 full articles (~1,200 tokens each on avg = 12,000 tokens)</p><p>- <strong>Waste: 83% </strong>of tokens are unnecessary</p><p>Modern scraping tools like trafilatura and newspaper3k extract main content from HTML[^1]&#8212;but they still give you the entire article. 
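</p><p>A minimal sketch of that gap, assuming trafilatura and a placeholder URL; the four-characters-per-token heuristic is a rough illustrative assumption, not a real tokenizer.</p><pre><code># Illustrative sketch: even a clean extraction hands you the whole article,
# not the one fact the answer needs. URL and token heuristic are placeholders.
import trafilatura

url = "https://example.com/some-news-article"    # placeholder URL
downloaded = trafilatura.fetch_url(url)          # raw HTML of the page
article_text = trafilatura.extract(downloaded)   # main content, boilerplate stripped

approx_tokens = len(article_text or "") // 4     # rough heuristic: ~4 chars per token
print("extracted roughly", approx_tokens, "tokens")
# A targeted excerpt would be an order of magnitude smaller, but nothing in
# the scraped page tells you which 150 tokens actually matter.
</code></pre><p>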
A typical news article contains 600-900 words (~800-1,200 tokens).[^2]</p><p><strong>The problem isn&#8217;t extraction quality&#8212;it&#8217;s that you&#8217;re ingesting 10-100x more content than you need.</strong></p><h3><strong>The Real Token Economics</strong></h3><p><strong>100,000 conversations per month scenario:</strong></p><p>- 2 content retrievals per conversation (VERY conservative)</p><p>- Model: Claude Sonnet 4.5 ($3 input per million tokens)</p><p><strong>Current approach (scraping full articles):</strong></p><p>- Average per retrieval: 1,200 tokens</p><p>- Monthly: 100K &#215; 2 &#215; 1,200 = 240M tokens = <strong>$720/month</strong></p><p><strong>What you actually need (relevant excerpts/facts):</strong></p><p>- Average per retrieval: 150-200 tokens (targeted information)</p><p>- Monthly: 100K &#215; 2 &#215; 175 = 35M tokens = <strong>$105/month</strong></p><p><strong>Real waste: $615/month or $7,380/year</strong></p><p>That&#8217;s not 15-30% overhead, <strong>it&#8217;s 85% waste</strong>.</p><p>At 1 million conversations per month, you&#8217;re burning <strong>$73,800 per year</strong> on unnecessary tokens&#8212;ingesting content you never needed in the first place.</p><p>And this is just input tokens. Output tokens cost more (3-5x input pricing), and when models process massive irrelevant context, they generate longer, less precise outputs&#8212;compounding the waste further.</p><h2><strong>Technical Performance: How Scraping Kills Quality</strong></h2><p>Token costs are just the beginning. <strong>Web scraping degrades AI performance in measurable, research-proven ways</strong>, from context engineering to answer accuracy to user experience.</p><h3><strong>The Signal-to-Noise Problem</strong></h3><p>News articles aren&#8217;t designed for AI consumption, they&#8217;re designed for human browsing and ad monetization.</p><p>A typical HTML page includes:</p><p>- Site-wide navigation, headers, menus</p><p>- Display ads, newsletter forms, trending widgets</p><p>- The actual article (finally)</p><p>- Comment sections (often unmoderated, low-quality)</p><p>- Related articles, site maps, legal links</p><p>- Cookie notices, subscription prompts</p><p>The actual content represents a <strong>small fraction of total HTML tokens</strong>. Tools like Boilerpipe exist specifically to &#8220;detect and remove surplus clutter&#8221; because <strong>web pages contain so much boilerplate</strong> that extraction is a non-trivial engineering problem.[^34]</p><h3><strong>The Lost in the Middle Problem</strong></h3><p>Context engineering has emerged as critical for modern AI applications.[^5] The core insight: context is a finite resource with diminishing marginal returns.[^6]</p><p>Stanford research (Liu et al., 2023) demonstrated that language model performance is highest when relevant information occurs at the beginning or end of input context, and significantly degrades when models must access information in the middle.[^8] Recent research shows that <strong>context length alone hurts LLM performance even with perfect retrieval</strong>.[^9]</p><p>Now consider scraped HTML structure:</p><p>- <strong>Top</strong>: Navigation, headers, site-wide elements</p><p>- <strong>Middle</strong>: Actual article content (what you need)</p><p>- <strong>Bottom</strong>: Comments, ads, footer</p><p>You&#8217;re placing the signal exactly where the model performs worst. Structured content can place relevant information at the beginning, where models excel. 
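</p><p>A minimal sketch of the difference, assuming you already hold pre-chunked content and some relevance scorer (both are placeholders here): rank the evidence and put the strongest chunks at the front of the context, instead of wherever the page layout left them.</p><pre><code># Illustrative sketch: order retrieved chunks by relevance so the strongest
# evidence sits at the start of the context window. The relevance() function
# is a placeholder for whatever similarity score your retriever produces.
def build_context(question, chunks, relevance, top_k=5):
    ranked = sorted(chunks, key=lambda c: relevance(question, c), reverse=True)
    kept = ranked[:top_k]            # keep only what the answer plausibly needs
    return "\n\n".join(kept)         # best material lands first, not mid-context

# With raw scraped HTML there are no chunks to rank: one undifferentiated page,
# with the useful part stranded in the middle.
</code></pre><p>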
Scraping locks you into a structure optimized for humans, not AI.</p><h3><strong>RAG Performance Degradation</strong></h3><p>Multiple research studies document that <strong>retrieval noise and redundancy degrade output quality in RAG systems</strong>.[^35][^36]</p><p>1. RAG systems suffer when encountering noisy or irrelevant documents</p><p>2. Misalignment between retrieved evidence and generated text leads to hallucinations</p><p>3. As context passages increase, &#8220;noise&#8221; also increases</p><p>4. Reader performance may plateau or degrade&#8212;sometimes beyond no-context performance</p><p><strong>Web scraping introduces systematic noise</strong>&#8212;navigation, ads, comments, boilerplate&#8212;that no model architecture can fully compensate for. Scraped HTML often contains all three corruption types within a single page: the article is relevant, the sidebar is irrelevant, and comments may contain counterfactual claims.</p><h3><strong>Hallucination Correlation</strong></h3><p>Research on hallucinations identified types:[^37]</p><ul><li><p><strong>Fabricated (43%)</strong></p></li><li><p><strong>Negations (30%)</strong></p></li><li><p><strong>Contextual (17%)</strong></p></li><li><p><strong>Causality-related (10%)</strong></p></li></ul><p>Recent research demonstrates a direct, measurable relationship between context length with low signal-to-noise ratio and hallucination rates.</p><p><strong>The hallucination rate increases with context length, reaching approximately 45% when context approaches 2,000 tokens.</strong>[^48] This isn&#8217;t theoretical&#8212;it&#8217;s a measured phenomenon across multiple studies.</p><p>Research on RAG systems reveals that <strong>models get &#8220;distracted&#8221; by irrelevant content in documents, particularly in long documents where the answer isn&#8217;t obvious.</strong>[^50] When retrieval granularity is too large, retrieved blocks contain excessive irrelevant content, increasing the cognitive burden on models and causing answers to deviate from the query.[^51]</p><p>Research using mechanistic interpretability (ReDeEP, 2024) revealed the internal mechanism: hallucinations occur when Knowledge FFNs in LLMs overemphasize parametric knowledge while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content<strong>.</strong>[^53]</p><p><strong>The research is unambiguous: noisy context doesn&#8217;t just fail to help&#8212;it actively makes hallucinations more likely.</strong></p><p>Web scraping systematically introduces the exact conditions research identifies as causing hallucinations: long contexts, irrelevant content mixed with signal, poor information positioning, and high noise-to-signal ratios.</p><h3><strong>The Multi-Source Dilemma</strong></h3><p>Research shows that <strong>complete information about a query is rarely found in a single source</strong>.[^39] Natural answers require aggregating information from multiple sources.</p><p>This creates a painful dilemma:</p><p>Single-source or limited scraping:</p><ul><li><p>Lower costs and legal risk</p></li><li><p>But: Incomplete answers, lower quality</p></li></ul><p>Multi-source scraping (50+ publishers):</p><ul><li><p>Better quality and diversity</p></li><li><p>But: Exponential costs, multiplied risks</p></li></ul><p>It&#8217;s a catch-22: you need diversity for quality, but diversity multiplies cost and risk. 
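</p><p>A back-of-the-envelope sketch of that trade-off, reusing the figures quoted earlier (roughly 1,200 input tokens per scraped article versus 150-200 tokens of targeted excerpt, at the $3-per-million-token Claude Sonnet 4.5 input price); the assumption that every conversation consults ten sources is mine, chosen only to show how the waste scales with source diversity.</p><pre><code># Illustrative arithmetic with the numbers quoted above: ~1,200 tokens per
# scraped article vs. ~175 tokens of targeted excerpt, at $3 per million
# input tokens. Ten sources per conversation is an assumption for this sketch.
PRICE_PER_INPUT_TOKEN = 3 / 1_000_000

def monthly_input_cost(conversations, sources_per_answer, tokens_per_source):
    tokens = conversations * sources_per_answer * tokens_per_source
    return tokens * PRICE_PER_INPUT_TOKEN

full_articles = monthly_input_cost(100_000, 10, 1_200)  # scrape every source whole
excerpts = monthly_input_cost(100_000, 10, 175)         # only the relevant passages
print(round(full_articles), round(excerpts))            # about 3600 vs 525 dollars per month
</code></pre><p>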
And even with successful multi-source retrieval, <strong>multi-source synthesis remains challenging</strong>.[^40]</p><h3><strong>The Multi-Turn Conversation Penalty</strong></h3><p>Context bloat compounds throughout multi-turn conversations. Modern AI applications maintain conversation history, user preferences, retrieved content, and system instructions.</p><p>When scraped content consumes 3-4x more tokens than necessary, you&#8217;re forced into painful trade-offs:</p><ul><li><p>Drop older conversation turns (losing continuity)</p></li><li><p>Reduce content sources (sacrificing quality)</p></li><li><p>Compress context through additional LLM calls (adding latency and cost)</p></li><li><p>Sacrifice personalization (making responses less relevant)</p></li></ul><h3><strong>The Latency Tax</strong></h3><p>Context window size has a linear relationship with time to first token.[^11] AWS research shows that <strong>models experience substantial slowdowns when processing contexts exceeding 100,000 tokens</strong>.[^13]</p><p>Users notice lag. DAU/MAU metrics directly correlate with response quality and speed. Apps with DAU/MAU over 50% are world-class; most average 10%.[^15]</p><p>Every second of added latency pushes your stickiness ratio down<strong>.</strong> Since acquiring new users costs 5-7x more than retaining existing ones,[^16] latency isn&#8217;t just a UX problem&#8212;it&#8217;s a profitability problem.</p><h2><strong>Engineering Debt: The Cost That Never Stops</strong></h2><p>When founders evaluate web scraping, they estimate the initial build&#8212;maybe 2-4 weeks for scrapers. Then they move on.</p><p>What they miss: the initial build is only 30-40% of total engineering cost. The other 60-70% is maintenance&#8212;and it never stops.</p><h3><strong>The True Development Cost</strong></h3><p>With AI coding assistants providing 55% faster development,[^17] you have three options:</p><p><strong>Option 1: Build in-house with AI copilots</strong></p><p><strong>Initial development: $60,000-$80,000 over 2-3 months</strong></p><ul><li><p>Requirements and architecture with AI-generated boilerplate ($20K)</p></li><li><p>Core development for 20-30 publishers ($40K)</p></li><li><p>Testing, retry logic, error handling</p></li></ul><p>Hidden assumption: this only works for <strong>20-30 publishers</strong>. Scale to 50+ and complexity explodes non-linearly, pushing costs toward $120-150K.</p><p><strong>Option 2: Third-party services (Tavily, Firecrawl)</strong></p><p>Services promise to eliminate development costs entirely.[^18] But they create new problems:</p><ul><li><p><strong>Loss of control</strong>: Can&#8217;t customize extraction or optimize for your needs</p></li><li><p><strong>Quality unpredictability</strong>: At mercy of their extraction quality</p></li><li><p><strong>Costs scale</strong>: $0.005-$0.008/page compounds quickly</p></li><li><p><strong>Legal liability still yours</strong>: You&#8217;re still responsible for how content is used</p></li></ul><p>For 100K conversations/month (200K pages):</p><ul><li><p>Firecrawl: ~$1,000-$1,600/month ($12-19K/year)</p></li><li><p>Tavily: ~$1,600/month ($19K/year)</p></li></ul><p>Plus engineering time to integrate, monitor, and handle failures.</p><p><strong>Option 3: Hybrid approach (most common)</strong></p><p>Use third-party for MVP, then build custom as you scale. 
This means:</p><ul><li><p>Third-party costs while testing</p></li><li><p>Development costs when you need custom solutions</p></li><li><p><strong>Double the complexity, double the maintenance</strong></p></li></ul><p>The AI copilot productivity boost is real&#8212;but it doesn&#8217;t eliminate the fundamental problem: <strong>web scraping infrastructure is inherently brittle.</strong></p><h3><strong>The Maintenance Nightmare</strong></h3><p>Websites change constantly. Publishers redesign, update HTML, add anti-bot measures, change URL patterns. Every change breaks your scrapers.</p><p>Industry practitioners report that <strong>engineering teams spend 20-30% of their time maintaining existing scrapers</strong>.[^19] For a 5-person team, that&#8217;s one full-time engineer just keeping the lights on.</p><p>Update frequency is relentless:[^20]</p><ul><li><p>1-3 websites out of every 30 require updates each month</p></li><li><p>With strong anti-bot solutions, updates needed monthly or more</p></li><li><p>Each incident requires 2-3 developer days to fix</p></li></ul><p>The opportunity cost is staggering. Those hours could build features that differentiate your product. Instead, they&#8217;re spent reverse-engineering HTML changes and bypassing CAPTCHAs.</p><h3><strong>The Non-Linear Scaling Problem</strong></h3><p>Scraping 5 publishers? Manageable.</p><p>For early-stage prototypes serving fewer than 10,000 queries per month from 3-5 publishers, scraping can be a pragmatic short-term choice. The overhead is contained, legal exposure is minimal, and token costs remain low.</p><p>But this changes fast.</p><p>Scraping 50 publishers? Completely different problem.</p><p>Each publisher has:</p><ul><li><p>Different HTML structure requiring custom parsing logic</p></li><li><p>Different anti-scraping measures (Cloudflare, CAPTCHAs, JavaScript challenges)</p></li><li><p>Different URL patterns and content organization</p></li><li><p>Different update frequencies</p></li><li><p>Different legal terms and enforcement</p></li></ul><p>You can&#8217;t template this. Every publisher is a bespoke engineering problem.</p><p>Real-world example: building a custom solution for a difficult site required <strong>several weeks of developer time&#8212;thousands of dollars</strong> that easily outweighed third-party service fees.[^21]</p><p>Multiply that by 50+ publishers, each with their own quirks, each updating on their own schedule. The maintenance burden scales exponentially.</p><h3><strong>Infrastructure Hidden Costs</strong></h3><p><strong>Proxy Services:</strong>[^22]</p><ul><li><p>Premium residential/mobile proxies: $99+/month</p></li><li><p>Pricing by bandwidth: $6.60/GB and up</p></li><li><p>High-volume scraping: $500-2,000+/month</p></li></ul><p><strong>CAPTCHA Solving:</strong>[^23]</p><ul><li><p>2Captcha: ~$1.16 per 1,000 CAPTCHAs</p></li><li><p>Millions of retrievals per month: thousands in CAPTCHA costs</p></li></ul><p><strong>Storage and Processing:</strong></p><ul><li><p>Scraped HTML storage</p></li><li><p>Extraction pipelines</p></li><li><p>Content databases</p></li><li><p>Data freshness management</p></li></ul><p><strong>Typical monthly infrastructure: $200-$1,000+</strong> depending on scale.</p><h2><strong>Business Risk: Legal Exposure and Trust Crisis</strong></h2><p>Legal risk doesn&#8217;t show up on your monthly cost report until it destroys your company. 
And in 2025, how you handle data matters just as much as what your product does.</p><h3><strong>The Lawsuit Tsunami</strong></h3><p>The legal landscape has shifted dramatically. What was once gray area is now a minefield.</p><p><strong>The New York Times vs. OpenAI:</strong> The NYT has spent <strong>$10.8 million in legal bills</strong> fighting this case&#8212;and it&#8217;s not over. The judge ordered OpenAI to turn over <strong>20 million ChatGPT conversation logs</strong>.[^24][^25][^26]</p><p><strong>News Corp vs. Perplexity AI:</strong> Sued for &#8220;willfully copied copious amounts of copyrighted material.&#8221; Perplexity proudly marketed &#8220;skip the links&#8221;&#8212;directly threatening publisher business models. TollBit revealed Perplexity&#8217;s scrape-to-referral ratio: <strong>369 scrapes for every 1 referral</strong>.[^26][^27]</p><p><strong>Canadian Publishers vs. OpenAI:</strong> Multiple outlets sued for copyright infringement, circumvention of protective measures, breach of terms, and unjust enrichment.[^28]</p><p>The pattern is clear: publishers are aggressively defending their content across multiple jurisdictions.</p><h3><strong>Terms of Service Violations</strong></h3><p>Even if copyright law remains ambiguous, scraping violates explicit Terms of Service&#8212;creating clear breach of contract liability.</p><p><strong>Publisher TOS prohibitions:</strong>[^29]</p><ul><li><p><strong>Ryanair</strong>: Prohibits automated data extraction</p></li><li><p><strong>Meta</strong>: Prohibits collection via automated technology</p></li><li><p><strong>LinkedIn</strong>: Prohibits scraping of member profiles</p></li><li><p><strong>X Corp</strong>: Prohibits scraping in browsewrap and clickwrap agreements</p></li></ul><p>According to 404 Media, <strong>28% of &#8220;most actively maintained, critical sources&#8221; have restricted AI scraping</strong> in the last year.[^30] Researchers call this an &#8220;emerging crisis.&#8221;</p><h3><strong>The Transparency Problem</strong></h3><p>Scraping practices are becoming <strong>publicly measurable and embarrassingly transparent</strong>.</p><p>TollBit&#8217;s 2024 report exposed scrape-to-referral ratios:[^32]</p><ul><li><p><strong>OpenAI: 179:1</strong></p></li><li><p><strong>Perplexity: 369:1</strong></p></li><li><p><strong>Anthropic: 8,692:1</strong></p></li></ul><p>These numbers are cited in lawsuits, reported in press, and discussed in publisher board rooms. When your business model relies on scraping 369 times while sending back 1 referral, you&#8217;re extracting value until publishers shut you down.</p><h3><strong>Reputational Damage</strong></h3><p><strong>Perplexity AI:</strong>[^43]</p><ul><li><p>News Corp lawsuit, Forbes plagiarism accusation</p></li><li><p>Scrape-to-referral ratio of 369:1 became public embarrassment</p></li></ul><p><strong>Meta:</strong>[^44]</p><ul><li><p>Leaked documents showed scraping while ignoring robots.txt</p></li></ul><p><strong>OpenAI:</strong>[^45]</p><ul><li><p>NYT lawsuit costing $10.8M+, Indian copyright suit</p></li><li><p>High-profile legal battles creating negative brand association</p></li></ul><p>These aren&#8217;t obscure technical disputes&#8212;they&#8217;re <strong>front-page news</strong>. 
For every company like OpenAI with resources to weather the storm, dozens of smaller AI applications would be destroyed by similar controversies.</p><h3><strong>Consumer Trust</strong></h3><p>According to Cisco&#8217;s 2024 Consumer Privacy Survey:[^41]</p><p><strong>75% of consumers won&#8217;t buy from companies they don&#8217;t trust with their data</strong></p><p>The same research found:</p><ul><li><p>Consumers who trust providers spent <strong>50% more</strong> on connected devices</p></li><li><p><strong>51% of &#8220;Privacy Actives&#8221; have switched companies</strong> due to data privacy concerns</p></li><li><p><strong>49% of consumers aged 25-34 have switched</strong> over data policies</p></li></ul><p>The mechanism: <strong>How you handle others&#8217; data signals how you&#8217;ll handle users&#8217; data.</strong></p><p>When users discover your AI is built on unlicensed scraping&#8212;violating publisher terms and potentially copyright law&#8212;they infer you&#8217;ll be equally cavalier with their personal data.</p><h3><strong>Enterprise Procurement</strong></h3><p>For B2B AI applications, data sourcing practices are explicit RFP requirements.</p><p>From enterprise AI licensing guidelines:[^42]</p><p>&#8220;All content released through AI services must be: Originally created by the publisher, appropriately licensed from third-party rights holders, used as permitted by rights holders, or used as otherwise permitted by law.&#8221;</p><p>The critical clause: &#8220;<strong>Customer&#8217;s sole responsibility to ensure appropriate rights to all content input to AI service</strong>&#8220;</p><p>Translation: if your AI uses unlicensed scraped content and gets your enterprise customer sued, that&#8217;s on you&#8212;and you won&#8217;t get the contract.</p><p><strong>If you can&#8217;t prove your content is licensed, you can&#8217;t win enterprise deals.</strong></p><h3><strong>The Investor Due Diligence Problem</strong></h3><p>Web scraping isn&#8217;t just a legal risk&#8212;it&#8217;s a <strong>deal risk</strong>.</p><p>When AI companies go through fundraising, M&amp;A, or IPO processes, investor due diligence assesses:</p><ul><li><p>Violation of computer usage laws</p></li><li><p>Consumer privacy compliance</p></li><li><p>Material Non-Public Information handling</p></li><li><p>IP liability exposure</p></li></ul><p><strong>Section 204A of the Investment Advisers Act</strong> requires written policies to prevent MNPI misuse.[^31] For venture-backed companies, web scraping exposure can be a <strong>deal blocker</strong>.</p><p>M&amp;A transactions with companies using unlicensed scraping must carefully allocate liability. Acquirers don&#8217;t want to inherit your legal time bomb.</p><h3><strong>Data Ethics as Competitive Differentiation</strong></h3><p>The market is responding. 
Leading AI companies are pivoting from scraping to licensing.</p><p><strong>Major AI content licensing deals (2024):</strong>[^46]</p><ul><li><p><strong>OpenAI + News Corp</strong>: 5-year deal worth over <strong>$250M</strong></p></li><li><p><strong>OpenAI + Dotdash Meredith</strong>: Worth at least <strong>$16M</strong></p></li><li><p><strong>OpenAI + Axel Springer</strong>: <strong>$25M</strong> one-off payment plus variable fees</p></li></ul><p>PwC&#8217;s 2024 Trust Survey found that <strong>67% of customers prioritize hearing how companies protect data</strong>&#8212;but fewer executives (32%, down from 42%) are actually disclosing privacy policies.[^47]</p><p>That creates opportunity: <strong>AI companies that transparently demonstrate ethical data sourcing can differentiate on trust</strong>, not just model performance.</p><p>&#8220;Ethical data sourcing&#8221; isn&#8217;t compliance theater&#8212;it&#8217;s a <strong>competitive moat</strong>. It unlocks enterprise sales, improves retention, facilitates fundraising, and builds sustainable publisher relationships.</p><h2><strong>The Total Cost: What Scraping Really Costs</strong></h2><p>Let&#8217;s calculate the true cost for a realistic AI application.</p><p><strong>Assumptions:</strong></p><ul><li><p>100,000 conversations per month</p></li><li><p>2 content retrievals per conversation</p></li><li><p>50 target publishers for coverage</p></li><li><p>Claude Sonnet 4.5 ($3 input, $15 output per million tokens)</p></li></ul><h3><strong>Token Costs (Annual)</strong></h3><ul><li><p><strong>Scraped (full articles):</strong> 240M tokens/month = $8,640/year</p></li><li><p><strong>Targeted content (what you actually need):</strong> 35M tokens/month = $1,260/year</p></li><li><p><strong>Real token waste:</strong> $7,380/year (85% waste from overconsumption)</p></li></ul><h3><strong>Engineering Costs (Annual)</strong></h3><p><strong>In-house development:</strong></p><ul><li><p>Initial build (Year 1): $70,000</p></li><li><p>Ongoing maintenance: $120,000 (25% of 3-person team)</p></li></ul><p><strong>Third-party services:</strong></p><ul><li><p>Service costs: $15,000/year</p></li><li><p>Integration/monitoring: $24,000 (15% engineer time)</p></li><li><p>Total: $39,000/year</p></li><li><p><strong>But:</strong> Loss of control, quality unpredictability, legal liability still yours</p></li></ul><h3><strong>Infrastructure Costs (Annual)</strong></h3><ul><li><p>Proxy services: $12,000</p></li><li><p>CAPTCHA solving: $2,784</p></li><li><p>Storage/processing: $6,000</p></li><li><p><strong>Total:</strong> $20,784/year</p></li></ul><h3><strong>Legal Risk Costs (Annual)</strong></h3><ul><li><p>Cease-and-desist responses: $30,000-$75,000</p></li><li><p>Investor due diligence: $25,000-$50,000</p></li><li><p><strong>Conservative estimate:</strong> $55,000 (excluding lawsuits)</p></li></ul><h3><strong>Opportunity Costs (Annual)</strong></h3><p>Engineering time NOT spent on:</p><ul><li><p>Product features driving engagement</p></li><li><p>Model optimization</p></li><li><p>UX improvements</p></li><li><p>New capabilities</p></li></ul><p><strong>Estimated:</strong> $100,000 in lost product value</p><h3><strong>Quality Degradation Costs (Annual)</strong></h3><ul><li><p>User churn from poor quality: $250,000 in lost LTV (5% higher churn at 100K MAU, $50 LTV)</p></li><li><p>Lower DAU/MAU from latency/accuracy: $100,000 in reduced growth</p></li><li><p><strong>Total:</strong> $350,000/year</p></li></ul><h3><strong>TOTAL HIDDEN COST</strong></h3><p><strong>In-house 
(Year 1):</strong> $723,380 <strong>In-house (Ongoing):</strong> $653,380/year</p><p><strong>Third-party (Year 1):</strong> $501,380 <strong>Third-party (Ongoing):</strong> $501,380/year</p><p><strong>The Dilemma:</strong></p><p>Third-party services appear cheaper, but you sacrifice control and quality. Most teams start with third-party, hit limitations, and build custom anyway&#8212;paying for both in transition.</p><p>At 1 million conversations/month (10x scale):</p><ul><li><p><strong>Token waste alone:</strong> ~$74K/year</p></li><li><p><strong>In-house:</strong> ~$6.5M/year</p></li><li><p><strong>Third-party:</strong> ~$5M/year</p></li></ul><p>Neither option is sustainable. Both carry legal risk. Both degrade quality. Both force painful trade-offs.</p><p>This is what &#8220;free&#8221; content actually costs.</p><h2><strong>The Impossible Choice</strong></h2><p>AI application builders face an impossible dilemma.</p><p><strong>Option 1: Keep scraping</strong></p><ul><li><p>Token bloat burning money</p></li><li><p>25% of engineering time on maintenance</p></li><li><p>Legal exposure accumulating</p></li><li><p>Answer quality degrading</p></li><li><p>User trust eroding</p></li><li><p>Can&#8217;t pass enterprise RFPs</p></li><li><p>Cost: $500K-$6.5M+/year depending on scale</p></li></ul><p><strong>Option 2: Blanket licensing with major publishers</strong></p><ul><li><p>OpenAI paying $250M over 5 years to News Corp</p></li><li><p>Only works with OpenAI&#8217;s budget and leverage</p></li><li><p>Doesn&#8217;t cover long tail of publishers</p></li><li><p>Per-publisher negotiations = massive overhead</p></li><li><p>Cost: Millions upfront and ongoing</p></li></ul><p><strong>Option 3: Reduce content coverage</strong></p><ul><li><p>Limit to fewer publishers to reduce burden</p></li><li><p>Accept lower quality and completeness</p></li><li><p>Users get worse experience than competitors</p></li><li><p>Cost: Lost market share</p></li></ul><p><strong>Option 4: Build direct publisher relationships</strong></p><ul><li><p>Negotiate individual licensing deals</p></li><li><p>Requires legal team and business development</p></li><li><p>Each publisher wants different terms</p></li><li><p>Doesn&#8217;t scale beyond 10-20 publishers</p></li><li><p>Cost: Prohibitive for startups</p></li></ul><p>Every option has fatal flaws.</p><p>Scraping is expensive, risky, and degrades quality. Blanket licensing is only viable for giants. Reducing coverage kills competitiveness. Direct relationships don&#8217;t scale.</p><p><strong>The current content sourcing model for AI applications is fundamentally broken.</strong></p><h2><strong>The Strategic Window</strong></h2><p>The market is at an inflection point.</p><p>Publishers have realized AI companies are an existential threat to their traffic and revenue. They&#8217;re fighting back with lawsuits, technical countermeasures, and public pressure. 28% of critical sources have already blocked AI scraping&#8212;and that percentage is growing monthly.</p><p>Anti-bot technology is improving rapidly. Cloudflare Bot Management, PerimeterX, and similar services make scraping exponentially more difficult and expensive. The arms race favors defenders, not scrapers.</p><p>Legal precedents are being established. The NYT lawsuit, News Corp lawsuit, and Canadian publisher actions are creating case law that will make unlicensed scraping increasingly untenable.</p><p>Consumer and enterprise awareness of data ethics is rising. Users care how you source content. 
Enterprises require proof of licensing in RFPs. Investors scrutinize data practices in due diligence.</p><p>Market leaders like OpenAI are pivoting from scraping to licensing&#8212;demonstrating that even companies with effectively unlimited budgets recognize scraping is unsustainable.</p><p><strong>What worked in 2023 doesn&#8217;t work in 2025.</strong></p><p>The AI application founders who recognize web scraping&#8217;s true cost now&#8212;before the legal bills arrive, before the engineering debt becomes unmanageable, before user trust collapses&#8212;will have the opportunity to build on different infrastructure.</p><p>The most sophisticated AI teams are exploring fundamentally different approaches, not incremental fixes to scrapers, but infrastructure built for the realities of the Agentic Web where AI agents become primary internet users and content economics shift from advertising-driven traffic to value-based access.</p><p>Those who continue believing scraping is &#8220;cheap&#8221; will learn the truth the hard way: through $10M legal bills, broken scrapers consuming 30% of engineering time, user churn from poor answer quality, and lost enterprise deals because they can&#8217;t prove content licensing.</p><h2><strong>The Question</strong></h2><p>The current content sourcing model is broken.</p><p>Scraping appears free but costs millions. It burns tokens, wastes engineering time, creates legal exposure, degrades answer quality, and destroys user trust.</p><p>Most founders don&#8217;t realize the true cost&#8212;until they do the math.</p><p>The question isn&#8217;t whether AI applications need to find a better way to source content.</p><p>The question is whether you&#8217;ll figure it out before your competitors do.</p><p>The content sourcing infrastructure for AI applications is fundamentally broken. The companies that recognize this now&#8212;before million-dollar legal bills, before engineering debt consumes 30% of team capacity, before quality degradation kills user trust&#8212;will build on different foundations.</p><p>That infrastructure is being built. The question is whether you&#8217;ll adopt it while you have strategic choice, or be forced to migrate when scraping costs become undeniable.</p><p>The strategic window is open. For now.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>References</strong></h2><p>[^1]: Trafilatura and newspaper3k are popular Python libraries for content extraction from web pages. Trafilatura documentation: <a href="https://trafilatura.readthedocs.io/en/latest/evaluation.html">https://trafilatura.readthedocs.io/en/latest/evaluation.html</a></p><p>[^2]: Griffin Mott Consulting: How Many Words Are Usually In An Article? (2025). 
<a href="https://griffinmottconsulting.com/blog/ideal-article-length/">https://griffinmottconsulting.com/blog/ideal-article-length/</a></p><p>[^3]: Trafilatura can extract metadata, main body text and comments. Documentation: <a href="https://pypi.org/project/trafilatura/0.5.0/">https://pypi.org/project/trafilatura/0.5.0/</a></p><p>[^4]: OpenAI Developer Community: Markdown is 15% more token efficient than JSON. <a href="https://community.openai.com/t/markdown-is-15-more-token-efficient-than-json/841742">https://community.openai.com/t/markdown-is-15-more-token-efficient-than-json/841742</a></p><p>[^5]: Anthropic: Effective Context Engineering for AI Agents. <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p><p>[^6]: LlamaIndex: Context Engineering - What it is, and techniques to consider. <a href="https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider">https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider</a></p><p>[^7]: Chroma Research: Context Rot - How Increasing Input Tokens Impacts LLM Performance. <a href="https://research.trychroma.com/context-rot">https://research.trychroma.com/context-rot</a></p><p>[^8]: Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173. <a href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></p><p>[^9]: Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (2025). <a href="https://arxiv.org/html/2510.05381v1">https://arxiv.org/html/2510.05381v1</a></p><p>[^10]: Why Does the Effective Context Length of LLMs Fall Short? (2024). <a href="https://arxiv.org/html/2410.18745v1">https://arxiv.org/html/2410.18745v1</a></p><p>[^11]: Glean: How input token count impacts the latency of AI chat tools. <a href="https://www.glean.com/blog/glean-input-token-llm-latency">https://www.glean.com/blog/glean-input-token-llm-latency</a></p><p>[^12]: Understanding Latency, Throughput, and Context Length in LLM Hosting. <a href="https://www.databasemart.com/blog/llm-hosting-latency-throughput-context-length">https://www.databasemart.com/blog/llm-hosting-latency-throughput-context-length</a></p><p>[^13]: AWS: Optimizing AI responsiveness - Amazon Bedrock latency-optimized inference. <a href="https://aws.amazon.com/blogs/machine-learning/optimizing-ai-responsiveness-a-practical-guide-to-amazon-bedrock-latency-optimized-inference/">https://aws.amazon.com/blogs/machine-learning/optimizing-ai-responsiveness-a-practical-guide-to-amazon-bedrock-latency-optimized-inference/</a></p><p>[^14]: LongICLBench: Long-context LLMs Struggle with Long In-context Learning (2024). <a href="https://arxiv.org/html/2404.02060v3">https://arxiv.org/html/2404.02060v3</a></p><p>[^15]: CleverTap: DAU vs. MAU - App Stickiness Metrics Explained. <a href="https://clevertap.com/blog/dau-vs-mau-app-stickiness-metrics/">https://clevertap.com/blog/dau-vs-mau-app-stickiness-metrics/</a></p><p>[^16]: Gainsight: Essential Guide to DAU/MAU Ratio. <a href="https://www.gainsight.com/essential-guide/product-management-metrics/dau-mau/">https://www.gainsight.com/essential-guide/product-management-metrics/dau-mau/</a></p><p>[^17]: GitHub Copilot users complete tasks 55.8% faster than control groups. 
Source: arXiv - The Impact of AI on Developer Productivity. <a href="https://arxiv.org/abs/2302.06590">https://arxiv.org/abs/2302.06590</a></p><p>[^18]: Tavily pricing: $30-$100+/month. Firecrawl pricing: $16-$333+/month. Sources: <a href="https://docs.tavily.com/documentation/api-credits">https://docs.tavily.com/documentation/api-credits</a> and <a href="https://www.firecrawl.dev/pricing">https://www.firecrawl.dev/pricing</a></p><p>[^19]: Software Engineer average salary 2025-2026: $112-129K base, ~$160K fully loaded with benefits. Source: <a href="https://www.coursera.org/articles/software-engineer-salary">https://www.coursera.org/articles/software-engineer-salary</a></p><p>[^20]: The True Costs of a Web Scraping Project. </p><p>[^20]: How Much Does Web Scraping Cost - The Ultimate Guide. <a href="https://webautomation.io/blog/how-much-does-web-scraping-cost-the-ultimate-guide/">https://webautomation.io/blog/how-much-does-web-scraping-cost-the-ultimate-guide/</a></p><p>[^21]: How Much Does Web Scraping Cost. <a href="https://www.zenrows.com/blog/web-scraping-cost">https://www.zenrows.com/blog/web-scraping-cost</a></p><p>[^22]: Best CAPTCHA Proxies in 2025. <a href="https://www.zenrows.com/blog/captcha-proxies">https://www.zenrows.com/blog/captcha-proxies</a></p><p>[^23]: How Does Proxies Help CAPTCHA Bypass. <a href="https://www.octoparse.com/blog/use-proxies-to-bypass-captcha">https://www.octoparse.com/blog/use-proxies-to-bypass-captcha</a></p><p>[^24]: Lewis Silkin: NYT v OpenAI - Publishing Sector&#8217;s AI Content-Scraping Conundrum. <a href="https://www.lewissilkin.com/insights/2024/01/19/nyt-v-openai-the-publishing-sectors-ai-content-scraping-conundrum">https://www.lewissilkin.com/insights/2024/01/19/nyt-v-openai-the-publishing-sectors-ai-content-scraping-conundrum</a></p><p>[^25]: Judge Orders OpenAI to Hand Over 20 Million ChatGPT Logs in NYT Copyright Clash. <a href="https://www.analyticsinsight.net/news/judge-orders-openai-to-hand-over-20-million-chatgpt-logs-in-nyt-copyright-clash">https://www.analyticsinsight.net/news/judge-orders-openai-to-hand-over-20-million-chatgpt-logs-in-nyt-copyright-clash</a></p><p>[^26]: The Hollywood Reporter: NYT Has Spent $10.8M In Legal Battle With OpenAI. <a href="https://www.hollywoodreporter.com/business/business-news/new-york-times-legal-battle-openai-1236127637/">https://www.hollywoodreporter.com/business/business-news/new-york-times-legal-battle-openai-1236127637/</a></p><p>[^26]: The Register: Major publishers sue Perplexity AI for scraping content. <a href="https://www.theregister.com/2024/10/22/publishers_sue_perplexity_ai/">https://www.theregister.com/2024/10/22/publishers_sue_perplexity_ai/</a></p><p>[^27]: TechCrunch: News outlets accusing Perplexity of plagiarism and unethical web scraping. <a href="https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/">https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/</a></p><p>[^28]: American Bar Association: OpenAI Sued for Data Scraping in Canada. <a href="https://www.americanbar.org/groups/business_law/resources/business-law-today/2025-february/openai-sued-data-scraping-canada/">https://www.americanbar.org/groups/business_law/resources/business-law-today/2025-february/openai-sued-data-scraping-canada/</a></p><p>[^29]: TermsFeed: Terms &amp; Conditions to Stop Screen Scraping. 
<a href="https://www.termsfeed.com/blog/terms-conditions-stop-screen-scraping/">https://www.termsfeed.com/blog/terms-conditions-stop-screen-scraping/</a></p><p>[^30]: 404 Media: The Backlash Against AI Scraping Is Real and Measurable. <a href="https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/">https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/</a></p><p>[^31]: Akin Gump: Legal Implications of Web Scraping for Investment Firms. <a href="https://www.akingump.com/a/web/soxXRQ6Nw48FehNvwpdjJ1/2jiuhx/hflr-reprint-to-scrape-or-not-to-scrape-rappaport-altman-handschumacher-4819-0662-7801-v1.pdf">https://www.akingump.com/a/web/soxXRQ6Nw48FehNvwpdjJ1/2jiuhx/hflr-reprint-to-scrape-or-not-to-scrape-rappaport-altman-handschumacher-4819-0662-7801-v1.pdf</a></p><p>[^32]: PYMNTS: Web Scraping Wars - How Businesses Are Fighting AI Data Harvesting. <a href="https://www.pymnts.com/artificial-intelligence-2/2024/web-scraping-wars-how-businesses-are-fighting-ai-data-harvesting">https://www.pymnts.com/artificial-intelligence-2/2024/web-scraping-wars-how-businesses-are-fighting-ai-data-harvesting</a></p><p>[^33]: DropSite News: LEAKED - Top Websites Meta Is Scraping for AI. </p><p>[^34]: Stack Overflow: Algorithm for reading actual content of news articles. <a href="https://stackoverflow.com/questions/1451894/algorithm-for-reading-the-actual-content-of-news-articles-and-ignoring-noise-o">https://stackoverflow.com/questions/1451894/algorithm-for-reading-the-actual-content-of-news-articles-and-ignoring-noise-o</a></p><p>[^35]: Long Context RAG Performance of Large Language Models (2024). <a href="https://arxiv.org/html/2411.03538v1">https://arxiv.org/html/2411.03538v1</a></p><p>[^36]: arXiv: Retrieval-Augmented Generation - A Comprehensive Survey. <a href="https://arxiv.org/html/2506.00054v1">https://arxiv.org/html/2506.00054v1</a></p><p>[^37]: arXiv: A Survey on Hallucination in Large Language Models. <a href="https://arxiv.org/abs/2311.05232">https://arxiv.org/abs/2311.05232</a></p><p>[^38]: arXiv: The Dawn After the Dark - Empirical Study on Factuality Hallucination. <a href="https://arxiv.org/html/2401.03205v1">https://arxiv.org/html/2401.03205v1</a></p><p>[^39]: arXiv: MSRS - Evaluating Multi-Source Retrieval-Augmented Generation. <a href="https://arxiv.org/html/2508.20867">https://arxiv.org/html/2508.20867</a></p><p>[^40]: arXiv: Towards Multi-Source RAG via Synergizing Reasoning and Preference-Driven Retrieval. <a href="https://arxiv.org/html/2411.00689v1">https://arxiv.org/html/2411.00689v1</a></p><p>[^41]: Cisco Newsroom: How safe is our data? Consumers want to know. <a href="https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2024/m10/how-safe-is-our-data-consumers-want-to-know.html">https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2024/m10/how-safe-is-our-data-consumers-want-to-know.html</a></p><p>[^42]: Arphie: What is RFP legal requirements? <a href="https://www.arphie.ai/glossary/rfp-legal-requirements">https://www.arphie.ai/glossary/rfp-legal-requirements</a></p><p>[^43]: TechCrunch: News outlets accusing Perplexity of plagiarism and unethical web scraping. <a href="https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/">https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/</a></p><p>[^44]: DropSite News: LEAKED - Top Websites Meta Is Scraping for AI. 
</p><p>[^45]: Lewis Silkin: NYT v OpenAI - Publishing Sector&#8217;s AI Content-Scraping Conundrum. <a href="https://www.lewissilkin.com/insights/2024/01/19/nyt-v-openai-the-publishing-sectors-ai-content-scraping-conundrum">https://www.lewissilkin.com/insights/2024/01/19/nyt-v-openai-the-publishing-sectors-ai-content-scraping-conundrum</a></p><p>[^46]: Digiday: 2024 in review - Timeline of major deals between publishers and AI companies. <a href="https://digiday.com/media/2024-in-review-a-timeline-of-the-major-deals-between-publishers-and-ai-companies/">https://digiday.com/media/2024-in-review-a-timeline-of-the-major-deals-between-publishers-and-ai-companies/</a></p><p>[^47]: PwC: 2024 Trust Survey - How to earn customer trust. <a href="https://www.pwc.com/us/en/library/trust-in-business-survey/customer-trust-in-your-sector.html">https://www.pwc.com/us/en/library/trust-in-business-survey/customer-trust-in-your-sector.html</a></p><p>[^48]: K2View: RAG hallucination - What is it and how to avoid it. <a href="https://www.k2view.com/blog/rag-hallucination/">https://www.k2view.com/blog/rag-hallucination/</a></p><p>[^49]: Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173. <a href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></p><p>[^50]: TechCrunch: Why RAG won&#8217;t solve generative AI&#8217;s hallucination problem. <a href="https://techcrunch.com/2024/05/04/why-rag-wont-solve-generative-ais-hallucination-problem/">https://techcrunch.com/2024/05/04/why-rag-wont-solve-generative-ais-hallucination-problem/</a></p><p>[^51]: arXiv: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems - Progress, Gaps, and Future Directions (2025). <a href="https://arxiv.org/html/2507.18910v1">https://arxiv.org/html/2507.18910v1</a></p><p>[^52]: MDPI Mathematics: Hallucination Mitigation for Retrieval-Augmented Large Language Models - A Review (March 2025). <a href="https://www.mdpi.com/2227-7390/13/5/856">https://www.mdpi.com/2227-7390/13/5/856</a></p><p>[^53]: arXiv: ReDeEP - Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability (2024). ICLR 2025. <a href="https://arxiv.org/abs/2410.11414">https://arxiv.org/abs/2410.11414</a></p><p>[^54]: arXiv: Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought (February 2024). <a href="https://arxiv.org/html/2402.04004v2">https://arxiv.org/html/2402.04004v2</a></p><p>[^55]: ACL Anthology: RAG-HAT - A Hallucination-Aware Tuning Pipeline for LLM in Retrieval-Augmented Generation. EMNLP 2024. <a href="https://aclanthology.org/2024.emnlp-industry.113/">https://aclanthology.org/2024.emnlp-industry.113/</a></p>]]></content:encoded></item><item><title><![CDATA[The Billion Dollar Question: What Happens to Publishers When Clicks Disappear?]]></title><description><![CDATA[The traffic catastrophe is real. The alternatives don&#8217;t add up. 
And the infrastructure that could save publishers doesn&#8217;t exist yet.]]></description><link>https://bakagiannis.substack.com/p/the-billion-dollar-question-what</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/the-billion-dollar-question-what</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Mon, 01 Dec 2025 17:22:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e-LI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3004a7ff-088f-42f8-98f2-66b8d48d3783_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://context4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png" width="500" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://context4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/180416357?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Si0E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 424w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 848w, 
https://substackcdn.com/image/fetch/$s_!Si0E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Si0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdd76150-2442-4b2b-8fe4-50ac9446237f_500x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Every publisher executive has seen the charts. Traffic down 25%. Down 40%. Down 90%. The numbers keep getting worse, and the explanations keep getting vaguer. &#8220;Algorithm changes.&#8221; &#8220;Market headwinds.&#8221; &#8220;Strategic pivots.&#8221;</p><p>Let&#8217;s be direct: <strong>AI is cannibalizing publisher traffic at an accelerating rate, and no one has figured out how to replace the revenue.</strong></p><p>Between May 2024 and May 2025, <a href="https://click-vision.com/zero-click-search-statistics">zero-click searches surged from 56% to 69%</a>&#8212;a 13 percentage point jump in just 12 months. When Google&#8217;s AI Overviews appear, <a href="https://www.searchenginejournal.com/impact-of-ai-overviews-how-publishers-need-to-adapt/556843/">zero-click rates hit 80-83%</a>. The link economy that sustained digital publishing for 25 years isn&#8217;t declining&#8212;it&#8217;s collapsing.</p><p>This article examines the data no one wants to talk about: the exact magnitude of traffic losses by content vertical, the correlation between AI trust and cannibalization velocity, and most importantly, the math that shows none of the current monetization alternatives can replace what&#8217;s being lost. Then we&#8217;ll explore the open question: what would a real solution actually look like?</p><h2><strong>Part 1: The Cannibalization Pattern</strong></h2><h3><strong>The Numbers Are Worse Than Reported</strong></h3><p>The headline figure&#8212;<a href="https://digiday.com/media/google-ai-overviews-linked-to-25-drop-in-publisher-referral-traffic-new-data-shows/">publishers losing 25-27% of traffic year-over-year</a>&#8212;masks catastrophic variation by content type. When you break down the data by vertical, a disturbing pattern emerges: the content categories publishers thought were &#8220;safe&#8221; are getting hit hardest.</p><p><strong>Educational Content: The Canary in the Coal Mine</strong></p><p><a href="https://www.edtechinnovationhub.com/news/chegg-reports-24-revenue-drop-sues-google-over-ai-impact-on-online-learning">Chegg lost 49% of its non-subscriber traffic between January 2024 and January 2025</a>. Total revenues dropped 24% in Q4 2024. The company laid off 45% of its workforce&#8212;388 people&#8212;and sued Google over AI Overviews &#8220;stealing&#8221; their traffic.</p><p>Why educational content? Because AI answers &#8220;how do I solve this calculus problem?&#8221; or &#8220;explain photosynthesis in simple terms&#8221; <em>perfectly</em>. Students don&#8217;t need to click through to Chegg when ChatGPT gives them the answer in 3 seconds. 
The product&#8212;instant educational content&#8212;is identical, but the user never lands on the publisher&#8217;s site.</p><p><strong>Recipe Content: Death by Convenience</strong></p><p>Food and recipe publishers saw <a href="https://fortune.com/2025/11/26/ai-slop-recipes-thanksgiving-food-blog-collapse-traffic/">traffic declines of 30-50% in 2024</a>. Epicurious.com traffic dropped 37% year-over-year by December 2024. Thanksgiving recipe searches&#8212;traditionally a traffic bonanza&#8212;were down and some reported losses of 40% year-over-year.</p><p>The format is the problem. Recipe content is perfectly structured for AI summarization: ingredients list, numbered steps, cook time, serving size. AI Overviews can display the entire recipe without requiring a single click. When <a href="https://www.searchenginejournal.com/impact-of-ai-overviews-how-publishers-need-to-adapt/556843/">click-through rates drop 34-89% for queries with AI summaries</a>, recipe publishers lose both traffic and the ad revenue that depended on it.</p><p><strong>Travel Content: An Extinction-Level Event</strong></p><p>Travel and tourism sites saw <a href="https://www.theregister.com/2025/06/22/ai_search_starves_publishers/">20% year-over-year traffic declines</a>, but the aggregate number obscures individual catastrophes. The Planet D, a travel blog, <a href="https://www.theregister.com/2025/06/22/ai_search_starves_publishers/">shut down after losing 90% of its traffic to AI Overviews</a>. Individual travel bloggers report <a href="https://www.dangerous-business.com/how-google-and-ai-are-killing-travel-blogs-like-mine/">40% traffic drops and 34% ad income losses</a> year-over-year.</p><p>&#8220;Best hotels in Barcelona.&#8221; &#8220;3-day Rome itinerary.&#8221; &#8220;What to pack for Iceland in winter.&#8221; These are exactly the queries AI excels at answering&#8212;and exactly the high-commercial-intent queries that drove affiliate revenue for travel publishers.</p><p><strong>News: Relatively Resilient (For Now)</strong></p><p>News publishers saw more moderate declines: <a href="https://pressgazette.co.uk/media-audience-and-business-data/uk-and-us-publishers-says-google-ai-is-harming-website-traffic/">median traffic down 7% for news brands versus 14% for non-news brands</a>. Major publishers averaged 10% losses from Google Search.</p><p>Why is news more resilient? Breaking news requires real-time, attributed sources. Google&#8217;s E-E-A-T (Experience, Expertise, Authoritativeness, Trust) guidelines still prioritize established news brands for current events. Legal liability concerns limit how definitively AI can present news without attribution.</p><p>But this resilience is temporary. As AI attribution systems improve and users grow comfortable consuming news through AI interfaces, news publishers will face the same cannibalization that&#8217;s already devastated educational, recipe, and travel content.</p><p><strong>Technology Content: The Cautionary Tale</strong></p><p><a href="https://developers.slashdot.org/story/25/01/10/1729248/stackoverflow-usage-plummets-as-ai-chatbots-rise">Stack Overflow saw a 75% drop in new questions from its 2017 peak</a>. Year-over-year (December 2024 vs. 2023), questions declined 60%. Since ChatGPT launched in November 2022, usage has dropped 76%.</p><p>Stack Overflow&#8217;s collapse reveals something more fundamental than traffic loss: the entire interaction paradigm changed. 
Developers didn&#8217;t just stop clicking through to Stack Overflow&#8212;they stopped <em>needing</em> Stack Overflow. The question-and-answer model became obsolete when AI could generate, debug, and explain code in real-time.</p><p>This isn&#8217;t about Google stealing traffic. It&#8217;s about agentic systems replacing the underlying use case.</p><p>Publishers fixate on Google because they built on rented land&#8212;they never owned distribution. But even if they had, the business would still face existential transformation. The platform shift happening now is category-agnostic.</p><p>Stack Overflow&#8217;s fate isn&#8217;t a cautionary tale about SEO strategy. It&#8217;s a preview of what happens when AI doesn&#8217;t just answer questions&#8212;it <em>performs tasks</em> that eliminate the need for informational content entirely.</p><p>Publishers who think &#8220;we just need better Google rankings&#8221; are solving for 2015. The question isn&#8217;t how to reclaim traffic from AI Overviews. It&#8217;s how to provide value in a world where users don&#8217;t need to visit publisher sites at all because agentic systems handle the entire workflow.</p><p>Adapt to the paradigm shift, or face Stack Overflow&#8217;s destiny&#8212;not because Google took your traffic, but because your entire content category became functionally obsolete.</p><h3><strong>The Trust Paradox: Why &#8220;Safe&#8221; Verticals Are Most Vulnerable</strong></h3><p>Here&#8217;s the uncomfortable truth everyone&#8217;s missing: <strong>The content categories where users trust AI answers most are getting cannibalized fastest.</strong></p><p><a href="https://kpmg.com/us/en/media/news/generative-ai-consumer-trust-survey.html">KPMG&#8217;s 2024 trust survey</a> found 56% of consumers trust AI for educational resources&#8212;the highest of any application. Recipe and how-to content have low trust barriers because the stakes are low (&#8221;what&#8217;s the worst that can happen if the recipe is slightly off?&#8221;). Developers trust AI for code because they can test it immediately.</p><p>Now look at the cannibalization rates:</p><ul><li><p><strong>Educational content (56% trust):</strong> Chegg down 49%</p></li><li><p><strong>Recipe content (low-stakes trust):</strong> Down 30-50%</p></li><li><p><strong>Developer content (high trust):</strong> Stack Overflow questions down 75%</p></li><li><p><strong>Medical content (low trust):</strong> Minimal reported traffic loss</p></li></ul><p>The relationship isn&#8217;t &#8220;more AI trust = more AI usage.&#8221; It&#8217;s <strong>&#8220;more AI trust = fewer clicks to publishers.&#8221;</strong></p><p>When users trust AI answers, they don&#8217;t need to verify by clicking through to the source. The zero-click search becomes the terminal action. <a href="https://click-vision.com/zero-click-search-statistics">When AI Overviews appear, 80% of searches end without a click</a>. For high-trust verticals, that number is likely even higher.</p><p><strong>This creates a devastating implication for publishers:</strong> Medical, legal, and financial content&#8212;currently &#8220;protected&#8221; by low AI trust and regulatory concerns&#8212;will face accelerating cannibalization the moment user trust rises. <a href="https://www.salesforce.com/news/stories/trusted-ai-data-statistics/">75% of workers say accurate data is critical to AI trust</a>. 
As AI accuracy improves in these high-stakes verticals, the trust barrier will fall, and traffic will collapse.</p><p>The conventional wisdom says &#8220;build trust to survive AI.&#8221; The data suggests the opposite: <strong>Publishers in high-trust verticals should fear AI adoption more, because trust drives zero-click behavior.</strong></p><h3><strong>The Inflection Point: What Happens at 80-90% Zero-Click?</strong></h3><p>Zero-click searches jumped from 56% to 69% in 12 months&#8212;a 23% relative increase. If this trajectory continues:</p><ul><li><p><strong>2026:</strong> 75% zero-click</p></li><li><p><strong>2027:</strong> 80% zero-click</p></li><li><p><strong>2028:</strong> 85% zero-click</p></li></ul><p>At 80-90% zero-click rates, the fundamental assumption underlying digital publishing&#8212;that content creation leads to traffic, which leads to monetization&#8212;breaks completely.</p><p><a href="https://www.semrush.com/blog/semrush-ai-overviews-study/">AI Overview appearance rates have doubled</a> from January 2025 (6.49%) to June 2025 (13.14%). Longer queries (8+ words) trigger AI summaries much more frequently. As AI coverage expands from simple queries (&#8221;capital of France&#8221;) to complex queries (&#8221;compare fixed-rate vs. adjustable-rate mortgages for first-time homebuyers in 2025&#8221;), zero-click rates will accelerate.</p><p><strong>What does this mean practically?</strong></p><p>If you&#8217;re a publisher doing 10 million monthly visits today, an 80% zero-click rate means 8 million of those visits disappear. If your revenue model depends on $3 CPMs across 7 ad impressions per visit, you&#8217;ve just lost:</p><p>8,000,000 visits &#215; 7 impressions &#215; $0.003 = <strong>$168,000 per month</strong> = <strong>$2.02 million per year</strong></p><p>That&#8217;s not a &#8220;headwind.&#8221; That&#8217;s an existential threat.</p><h2><strong>Part 2: Why None of the Alternatives Work at Scale</strong></h2><p>The consulting decks all say the same thing: &#8220;diversify revenue streams.&#8221; Subscriptions. Commerce content. Licensing deals. Events. Podcasts. Consulting services.</p><p>Let&#8217;s do the math and see if any of these can actually replace what&#8217;s being lost.</p><h3><strong>Advertising: The Structural Decline</strong></h3><p>Display and programmatic advertising remains the largest revenue source for most publishers, but it&#8217;s in structural decline from multiple directions.</p><p><strong>The Numbers:</strong></p><ul><li><p><a href="https://www.publift.com/blog/programmatic-advertising-trends">U.S. programmatic advertising: $168 billion (2024)</a></p></li><li><p><a href="https://www.therebooting.com/report/the-state-of-publisher-ad-revenue/">Publishers&#8217; share of global ad investment: 27.2% (2024) versus 71% a decade ago</a></p></li><li><p><strong>Over 10 years: -43.8 percentage points</strong></p></li></ul><p>That&#8217;s not just traffic loss from AI. That&#8217;s ad dollars flowing to walled gardens (Google, Facebook, Amazon), ad blocking (<a href="https://www.admonsters.com/ad-blocking-a-54b-problem-for-publishers-in-2024/">costing publishers $54 billion in 2024</a>), and social referral collapse (<a href="https://www.socialmediatoday.com/news/facebook-publisher-referrals-decline-50-percent/715745/">Facebook referrals down 50-58% over 6 years</a>).</p><p><strong>The trajectory is unmistakable:</strong> Advertisers follow attention. 
As users spend more time in AI interfaces (ChatGPT, Perplexity, Google AI Overviews) and less time on publisher sites, ad budgets will follow. The $168 billion programmatic market is growing, but publishers&#8217; share is shrinking.</p><h3><strong>Subscriptions: Limited Addressable Market</strong></h3><p>Subscriptions are the darling of every publisher strategy deck. <a href="https://localmedia.org/2024/01/digital-subscriptions-trends-for-2024-and-what-publishers-can-do-to-grow/">80% of publishers cite digital subscriptions as their most important revenue stream</a>, up from 74% in 2020.</p><p><strong>The growth story:</strong></p><ul><li><p><a href="https://www.inma.org/blogs/reader-revenue/post.cfm/subscription-growth-vs-revenue-growth-it-matters-what-you-measure">Median user subscriptions are up 3x since 2019</a></p></li><li><p><a href="https://www.inma.org/blogs/reader-revenue/post.cfm/subscription-growth-vs-revenue-growth-it-matters-what-you-measure">Median churn rates: &lt;5%</a></p></li><li><p>Newsletter platforms growing: <a href="https://whop.com/blog/newsletter-statistics/">Substack has 5M+ paid subscribers</a>; <a href="https://blog.beehiiv.com/p/2025-state-of-email-newsletters-by-beehiiv">beehiiv sent 15.6 billion emails in 2024</a></p></li></ul><p><strong>The problem:</strong></p><p>Tripling subscriptions from a tiny base still leaves you with a tiny base. The median news brand that tripled its subscribers since 2019 now serves thousands or tens of thousands of paying readers&#8212;compared to millions of ad-supported readers they used to monetize.</p><p><strong>The math:</strong></p><p>Advertising supported a model where publishers earned $3-6 CPMs across millions of free readers. Subscriptions earn $10-20/month from thousands of paying readers. The total addressable market for paid content is a small fraction of the ad-supported audience.</p><p>Elite brands (New York Times, Wall Street Journal, The Information) can make subscriptions work. For the other 99% of publishers, subscriptions are supplemental revenue, not a replacement for advertising.</p><p><strong>The AI acceleration problem:</strong></p><p>Subscriptions depend on top-of-funnel traffic to build awareness and drive conversions. As AI Overviews answer user queries without clicks, fewer users discover publishers organically. The traffic needed to feed subscription funnels is disappearing.</p><h3><strong>AI Data Licensing: Theater vs. Business Model</strong></h3><p>This is where the headlines get confusing. News Corp signs a <a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">$250 million deal with OpenAI</a>. Financial Times and Reuters ink licensing agreements. Surely this is the solution?</p><p><strong>The market:</strong></p><ul><li><p><a href="https://www.emetresearch.ai/blogs/market-report-ai-data-licensing-deals-(2020-present)">Total AI content licensing spend: $816.7 million (2024)</a></p></li><li><p><a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">Average deal size: $24 million per publisher</a></p></li></ul><p><strong>The context:</strong></p><ul><li><p><a href="https://www.admonsters.com/ad-blocking-a-54b-problem-for-publishers-in-2024/">Global revenue lost to ad blocking alone: $54 billion (2024)</a></p></li><li><p>U.S. programmatic advertising: $168 billion (2024)</p></li></ul><p><strong>Do the math:</strong></p><p>The entire 2024 AI licensing market ($816.7 million) is <strong>1.5% of what publishers lost to ad blocking</strong> in the same year. 
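</p><p>The gap is easier to see as a worked calculation. The figures below are the ones cited in this article; the 2030 licensing number is the projection referenced here, and the 10,000-publisher split is purely an illustrative assumption.</p><pre><code># Worked comparison of AI licensing revenue vs. what publishers already lose.
# Figures in billions of USD, as cited in this article; publisher count is illustrative.

licensing_2024 = 0.8167       # total AI content licensing spend, 2024
ad_blocking_loss_2024 = 54.0  # revenue lost to ad blocking, 2024
programmatic_2024 = 168.0     # U.S. programmatic ad market, 2024
licensing_2030 = 11.16        # projected licensing market, 2030

print(f"2024 licensing vs. ad-blocking losses: {licensing_2024 / ad_blocking_loss_2024:.1%}")    # ~1.5%
print(f"2030 licensing vs. current programmatic spend: {licensing_2030 / programmatic_2024:.1%}")  # ~6.6%

# If the 2030 pool were split evenly across 10,000 publishers (illustrative only):
per_publisher = licensing_2030 * 1e9 / 10_000
print(f"Average per publisher: ${per_publisher:,.0f}")  # ~$1.1M, before any skew toward mega-publishers
</code></pre><p>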
Even if licensing reaches $11.16 billion by 2030, that&#8217;s <strong>6.6% of current programmatic spend</strong>&#8212;shared across thousands of publishers globally.</p><p>News Corp&#8217;s $250 million deal sounds impressive until you realize News Corp generates $10+ billion in annual revenue. That licensing deal is 2.5% of annual revenue&#8212;and it&#8217;s one of the largest deals ever signed.</p><p><strong>The distribution problem:</strong></p><p>The &#8220;average&#8221; deal of $24 million is massively skewed by mega-publisher outliers. <a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">OpenAI offers smaller publishers $1-5 million per year</a> for archive access. Mid-tier publishers get far less. Long-tail publishers get nothing.</p><p><strong>The structure problem:</strong></p><p>Most licensing deals pay for <strong>training data</strong>&#8212;historical archives that AI companies use to train their models. This is a one-time value proposition. The ongoing value&#8212;when AI uses publisher content to generate billions of answers per month&#8212;goes largely uncompensated.</p><p><a href="https://digiday.com/media/2024-in-review-a-timeline-of-the-major-deals-between-publishers-and-ai-companies/">Perplexity launched a &#8220;Publishing Program&#8221; in July 2024</a> offering revenue share based on citations, but adoption is limited and the model unproven at scale. Haven&#8217;t heard about this initiative in a while now.</p><p><strong>The uncomfortable truth:</strong> AI licensing deals are PR wins, not business model solutions.</p><h3><strong>Commerce Content &amp; Affiliate Revenue: Killed by the Thing They Pivoted To</strong></h3><p>In the mid-2010s, many publishers pivoted to commerce content and affiliate revenue to reduce ad dependence. <a href="https://www.taboola.com/press-release/tabooladigidaysurvey">87% of publishers now use commerce content as a revenue contributor</a>. For some, it became a top-3 revenue source.</p><p><strong>The success stories:</strong></p><ul><li><p><a href="https://www.inma.org/blogs/world-congress/post.cfm/conde-nast-is-growing-commerce-through-editorial-content">Cond&#233; Nast: $600 million in product sales via editorial content (2024)</a></p></li><li><p><a href="https://project-aeon.com/blogs/media-publishers-are-becoming-e-commerce-powerhouses">New York Times Wirecutter: $101.3 million in affiliate referral revenue (9 months, 2023)</a></p></li></ul><p><strong>The problem:</strong></p><p>Commerce content&#8212;product reviews, buying guides, &#8220;best of&#8221; lists, how-to articles&#8212;is <em>exactly</em> the content type AI cannibalizes most effectively.</p><p><a href="https://www.inma.org/blogs/advertising-initiative-newsletter/post.cfm/how-publishers-are-capturing-revenue-in-the-post-traffic-era">Affiliate revenue share dropped significantly between 2023-2024</a>. Publishers report revenue drops up to 50% following Google AI Overviews rollout in May 2024.</p><p><strong>Why?</strong> Because AI Overviews can summarize product recommendations, synthesize reviews from multiple sources, and provide buying guidance&#8212;all without users clicking through to publisher sites. No click = no affiliate commission.</p><p><strong>The tragic irony:</strong> Publishers pivoted to commerce content to escape advertising dependence. Commerce content became the most vulnerable category to AI cannibalization. 
Publishers optimized for the exact queries AI handles best.</p><h3><strong>The Other Options: Niche, Non-Scalable, or Both</strong></h3><p><strong>Podcasts:</strong> <a href="https://www.iab.com/wp-content/uploads/2024/05/IAB_US_Podcast_Advertising_Revenue_Study_FY2023_May_2024.pdf">U.S. podcast advertising hit $2.43 billion in 2024</a>, growing 12% year-over-year. CPM rates are stable around <a href="https://libsyn.com/blog/september-2024-podcast-ad-rates/">$21-22 for 60-second spots</a>. Great! Except $2.43 billion is 1.4% of the $168 billion programmatic market. Podcasts work for publishers with strong audio offerings, but can&#8217;t replace advertising losses at scale.</p><p><strong>Events:</strong> The global events industry reached <a href="https://www.marketresearchfuture.com/reports/events-industry-market-12035">$1,505.53 billion in 2024</a>, growing at 11.8% CAGR. But most of this goes to venues, production, and B2B conference companies. Publishers capture a sliver&#8212;and only elite publishers with brand equity and infrastructure (WSJ conferences, TechCrunch Disrupt) can monetize meaningfully. For mid-tier publishers losing millions in ad revenue, events can&#8217;t move the needle.</p><p><strong>Consulting, White-Label Partnerships, Syndication:</strong> These are service businesses that don&#8217;t scale. <a href="https://www.hulkapps.com/blogs/ecommerce-hub/publisher-strategies-how-leading-media-houses-are-optimizing-revenue-streams-in-2024">The Independent&#8217;s wine club</a> might be &#8220;a real money spinner,&#8221; but it&#8217;s not replacing tens of millions in lost advertising revenue. These are distractions from the core structural problem.</p><h3><strong>The Revenue Gap That No One Wants to Acknowledge</strong></h3><p>Let&#8217;s put all the numbers in one place:</p><p><strong>What&#8217;s Being Lost:</strong></p><ul><li><p>Publishers&#8217; share of ad investment decline (10-year): <strong>-43.8 percentage points</strong></p></li><li><p>Traffic decline from zero-click search: <strong>-25-90% depending on vertical</strong></p></li></ul><p><strong>What&#8217;s Being Gained:</strong></p><ul><li><p>AI licensing (total market): <strong>$816.7 million (2024)</strong></p></li><li><p>Podcast advertising (total market): <strong>$2.43 billion (2024)</strong></p></li><li><p>Subscription growth: <strong>3x digital subs since 2019 (but from tiny base)</strong></p></li></ul><p>If you&#8217;re a Chief Revenue Officer or Chief Business Officer staring at 30-50% traffic declines and board questions about AI strategy, none of these alternatives&#8212;subscriptions, licensing, commerce, podcasts&#8212;close the gap at the scale and speed your P&amp;L requires. The revenue replacement math doesn&#8217;t work.</p><p>Even if AI licensing reaches $11.16 billion by 2030 and subscriptions double from current levels, publishers face a <strong>multi-billion-dollar structural revenue gap</strong> with no clear path to close it.</p><p>This isn&#8217;t a transition. It&#8217;s not a rough patch. It&#8217;s a fundamental breakdown of the economic model that sustained digital publishing for 25 years.</p><h2><strong>Part 3: The Open Question&#8212;What Would a Real Solution Look Like?</strong></h2><p>The current approaches&#8212;bilateral licensing deals, subscription paywalls, commerce pivots&#8212;aren&#8217;t closing the revenue gap. They&#8217;re rearranging deck chairs on the Titanic.</p><p>So what&#8217;s missing? 
What would an AI monetization solution that <em>actually works at scale</em> need to do?</p><p>The emerging consensus among publishers, AI researchers, and marketplace architects points to infrastructure that doesn&#8217;t exist yet&#8212;but that follows clear requirements based on market dynamics and publisher economics.</p><h3><strong>1. Pay Per Answer, Not Per Archive: The Inference-Time Shift</strong></h3><p>Most AI licensing deals pay for <strong>training data</strong>&#8212;access to historical archives that AI companies use to train their models. This is backward-looking and one-time.</p><p>The real value is <strong>inference-time</strong>&#8212;when AI uses publisher content to generate an answer for a user. This happens billions of times per day across all AI platforms. ChatGPT alone handles an estimated 2.5 billion queries per day, according to industry estimates. Perplexity, Google AI Overviews, Claude, Gemini&#8212;each generates billions of inference events.</p><p>If even a fraction of these inferences use publisher content, and if publishers could capture $0.001 per use, the math changes dramatically:</p><ul><li><p>2.5 billion inferences/day &#215; 30% attribution rate = 750 million paid uses/day</p></li><li><p>750 million uses/day &#215; 365 days &#215; $0.001 = <strong>$274 million/year</strong> (single platform, low estimate)</p></li><li><p>Scale across all AI platforms: <strong>$1B+ annual market</strong></p></li></ul><p>AI interfaces still account for only around 2% of total search volume, and that share is expected to rise significantly, if not become dominant, over the next 5 to 10 years.</p><p><strong>The problem:</strong> Neither the attribution systems that could track which content influenced which AI output, nor the marketplaces that could capture that value, exist today.</p><p><strong>What&#8217;s needed:</strong></p><ul><li><p>Real-time tracking of content usage in AI responses</p></li><li><p>Automated micropayments triggered at inference time, with verifiable provenance</p></li><li><p>Systems that work across all AI platforms, not bilateral deals</p></li></ul><h3><strong>2. From Winner-Take-Most to Long-Tail Economics</strong></h3><p>Current AI licensing is a winner-take-most market. News Corp gets $250 million. New York Times, Financial Times, Wall Street Journal secure deals. Everyone else gets crumbs or nothing.</p><p><a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">The average deal is $24 million</a>, but that&#8217;s skewed by mega-publisher outliers. <a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">OpenAI offers smaller publishers $1-5 million per year</a>&#8212;if they even get a call back.</p><p>Long-tail publishers&#8212;the 10,000+ sites producing quality content in niche verticals&#8212;are completely shut out. Yet collectively, they produce enormous value for AI systems. A specialized cycling publication&#8217;s gear reviews inform AI answers about bikes. A regional news site&#8217;s local reporting shows up in AI summaries. They get nothing.</p>
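<p>To make the inference-time and long-tail arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. Every input is an illustrative assumption taken from, or consistent with, the figures cited above (per-use rate, attribution share, query volume); the niche publisher&#8217;s citation count is hypothetical, and none of this reflects any real platform&#8217;s accounting.</p><pre><code># Back-of-the-envelope model of inference-time micropayments.
# All numbers are illustrative assumptions, not measured data.

PRICE_PER_USE = 0.001            # dollars paid per attributed content use
QUERIES_PER_DAY = 2_500_000_000  # single large AI platform (estimate cited above)
ATTRIBUTION_RATE = 0.30          # share of answers that draw on publisher content

paid_uses_per_day = QUERIES_PER_DAY * ATTRIBUTION_RATE
platform_pool_per_year = paid_uses_per_day * 365 * PRICE_PER_USE
print(f"Annual pool, one platform: ${platform_pool_per_year:,.0f}")  # ~$274M

# What a single long-tail publisher might see if its content is cited
# in a modest number of answers per day (hypothetical figure).
citations_per_day = 20_000
publisher_revenue_per_year = citations_per_day * 365 * PRICE_PER_USE
print(f"One niche publisher, 20k citations/day: ${publisher_revenue_per_year:,.0f}/yr")  # ~$7,300
</code></pre><p>Even under these generous assumptions, a niche site earns only a few thousand dollars a year at $0.001 per use, which is why the per-use price, the attribution rate, and the number of participating platforms matter far more than any single headline deal.</p>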
<p><strong>What&#8217;s needed:</strong></p><ul><li><p>Marketplace or platform model (not bilateral negotiations)</p></li><li><p>Low barriers to entry (automated licensing, simple onboarding)</p></li><li><p>Micropayment infrastructure that enables compensation for small publishers</p></li><li><p>Collective bargaining power through network effects</p></li></ul><p>Think of how programmatic advertising scaled: small sites could earn ad revenue through exchanges and SSPs without negotiating directly with every advertiser. AI licensing needs the same infrastructure.</p><h3><strong>3. Provide Granular Publisher Control</strong></h3><p>Right now, publishers have two options: allow AI crawling or block it (via robots.txt), assuming crawlers respect that signal, which many do not. There is no other way for AI applications to reach publisher content.</p><p>Publishers want:</p><ul><li><p><strong>Transparent access:</strong> A clear and legitimate way to share their insights and data.</p></li><li><p><strong>IP protection:</strong> Assurance that shared data is used ONLY for inference purposes.</p></li><li><p><strong>Value-based dynamic pricing:</strong> The ability to charge based on the value delivered to the end user of the AI application.</p></li><li><p><strong>Usage analytics:</strong> Visibility into which content is being used, how often, and by whom.</p></li></ul><p><strong>What&#8217;s needed:</strong></p><ul><li><p>Content licensing APIs or AI-native integrations with granular controls</p></li><li><p>Standardized access and licensing terms</p></li><li><p>Dynamic pricing mechanisms like auctions</p></li><li><p>Real-time dashboards showing usage and revenue</p></li></ul><h3><strong>4. Flip the Power Dynamic: When AI Platforms Need You</strong></h3><p>The current power dynamic is broken. AI companies scrape freely from the public web. They only pay when threatened by lawsuits (New York Times vs. OpenAI) or when they want premium brand partnerships.</p><p>Publishers are in a weak negotiating position because AI companies can:</p><ol><li><p>Scrape public content for free</p></li><li><p>Train models on it without permission</p></li><li><p>Generate answers without attribution</p></li><li><p>Capture all the economic value</p></li></ol><p><strong>What&#8217;s needed to flip this dynamic:</strong></p><ul><li><p><strong>Incentive alignment:</strong> The easiest path to adoption is through incentives. Why would a business start paying for a &#8220;product&#8221; (content) that it already gets for free? There have to be shared incentives at play.</p></li><li><p><strong>Network effects:</strong> Once enough publishers join a marketplace, AI platforms <em>must</em> participate or risk inferior answer quality.</p></li></ul><p><strong>What you should not wait for, because your company will be dead by then:</strong></p><ul><li><p><strong>Regulatory/legal pressure:</strong> Copyright litigation and potential legislation making unlicensed use illegal or expensive.</p></li></ul><h3><strong>5. Standardize Access &amp; Market Dynamics</strong></h3><p>Bilateral deals between individual AI companies and publishers won&#8217;t scale because they reinforce power asymmetry. Without standards, the market fragments into proprietary systems. OpenAI builds one licensing system. Google builds another. Anthropic does something different. Publishers must integrate with each separately, multiplying complexity and reducing adoption.</p>
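<p>To illustrate what &#8220;one integration instead of N&#8221; could look like, here is a minimal sketch of a hypothetical standardized content-access exchange. No such standard exists today; every field name below is an assumption for illustration only:</p><pre><code class="language-python"># Hypothetical standardized content-access exchange (illustrative only; no such spec exists).
# One schema that any AI platform could call, instead of N proprietary integrations.
from dataclasses import dataclass

@dataclass
class ContentRequest:
    query: str            # the user intent the AI platform is answering
    use: str              # "inference-only" vs "training", per publisher policy
    max_price_usd: float  # what the platform is willing to pay for this call
    platform_id: str      # verified identity of the calling AI platform

@dataclass
class ContentResponse:
    payload: str          # licensed summary or excerpt, not the full archive
    source_url: str       # attribution the platform must surface
    price_usd: float      # cleared price for this single use
    license_id: str       # auditable receipt for provenance and payment

def handle(req: ContentRequest) -> ContentResponse:
    """Sketch of a publisher-side handler enforcing granular policy and posted pricing."""
    if req.use != "inference-only":
        raise PermissionError("training use is not licensed")  # granular control
    price = min(req.max_price_usd, 0.001)                      # illustrative posted rate
    return ContentResponse(
        payload="licensed summary of the matching article",
        source_url="https://publisher.example/article",
        price_usd=price,
        license_id="lic-0001",
    )

print(handle(ContentRequest("best gravel bike under $2,000", "inference-only", 0.002, "ai-platform-123")).price_usd)  # 0.001
</code></pre><p>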
The AI industry needs what the advertising industry achieved with programmatic infrastructure: <strong>standard protocols, transparent pricing, and interoperability</strong>.</p><p><strong>What&#8217;s needed:</strong></p><ul><li><p><strong>Universal content access</strong>: AI platforms query a standardized access gateway to retrieve external information and trigger payments.</p></li><li><p><strong>Trust mechanisms</strong>: Similar to how Google measures domain authority, AI content marketplaces need to verify the quality of the &#8220;product&#8221; exchanged, prevent fraud, content theft etc.</p></li><li><p><strong>Open pricing exchanges</strong>: Instead of opaque bilateral negotiations, marketplaces where supply (publisher content) and demand (AI platform usage) set clearing prices through auctions or posted rates. Publishers see what comparable content earns. AI platforms compare pricing across sources.</p></li><li><p><strong>Interoperable payment rails</strong>: Micropayment infrastructure that works across platforms&#8212;whether it&#8217;s blockchain-based, traditional fintech, or hybrid. Publishers shouldn&#8217;t need separate payment integrations for each AI company.</p></li></ul><p>The programmatic advertising analogy is instructive: Before ad exchanges and SSPs standardized inventory access, publishers negotiated directly with every advertiser. Inefficient, non-transparent, limited to large players. OpenRTB and header bidding protocols changed everything. Small publishers gained access to global demand. Advertisers could reach niche audiences at scale.</p><h3><strong>What Would It Take to Actually Solve This?</strong></h3><p>If marketplace infrastructure is the answer, how do you evaluate whether a solution is real or vaporware? Here are the non-negotiable requirements:</p><ol><li><p><strong>Inference-time access &amp; attribution</strong>: Does it track and compensate every usage?</p></li><li><p><strong>Liquidity on both sides</strong>: Are AI platforms already using it, or is it theoretical demand?</p></li><li><p><strong>Transparent pricing discovery</strong>: Can you see market rates, or are you negotiating blind?</p></li><li><p><strong>Low integration friction</strong>: Does it take hours, days or months to go live?</p></li><li><p><strong>Provenance &amp; verification</strong>: Can you prove that the value exchange is auditable?</p></li></ol><p>Any marketplace that can&#8217;t deliver on all five isn&#8217;t solving the structural problem&#8212;it&#8217;s another band-aid.</p><h3><strong>Why Doesn&#8217;t This Exist Yet?</strong></h3><p>If the solution is conceptually clear, why isn&#8217;t anyone building it? 
Short answer: It is SUPER hard.</p><p><strong>Technical barriers:</strong></p><ul><li><p>Content value exchange mechanisms do not exist, not even theoretically</p></li><li><p>Monitoring billions of AI queries across hundreds of platforms requires massive scale</p></li></ul><p><strong>Economic barriers:</strong></p><ul><li><p>AI companies have no incentive to pay voluntarily (marginal cost of scraping = $0)</p></li><li><p>Revenue share reduces AI company margins</p></li><li><p>Chicken-and-egg: publishers won&#8217;t invest without guaranteed buyers; AI companies won&#8217;t pay without publisher participation</p></li></ul><p><strong>Market structure barriers:</strong></p><ul><li><p>Power asymmetry (Big 4 AI companies control 80% of market; thousands of publishers compete)</p></li><li><p>No unified publisher front for collective bargaining</p></li><li><p>Winner-take-most dynamics favor elite publishers</p></li></ul><p><strong>Coordination problems:</strong></p><ul><li><p>No industry standards</p></li><li><p>Free rider problem (individual publisher opt-out doesn&#8217;t stop AI)</p></li><li><p>Regulatory lag (copyright law unclear on AI training; lawsuits pending but slow)</p></li></ul><h3><strong>The Provocative Questions</strong></h3><p>This is where conventional thought leadership would pivot to &#8220;our product solves this.&#8221; Instead, let&#8217;s be honest about what we don&#8217;t know.</p><p><strong>Who will build the content value exchange infrastructure?</strong></p><p>Will it be AI companies (unlikely&#8212;not in their interest)? A consortium of publishers (possible, but coordination is hard)? A third-party marketplace? A regulatory mandate that forces industry standardization?</p><p><strong>Can publishers coordinate to demand fair compensation before it&#8217;s too late?</strong></p><p>The window for collective action is narrowing. AI models are already trained on vast amounts of publisher content. The longer publishers wait, the weaker their negotiating position becomes. But publisher fragmentation&#8212;thousands of independent businesses with different strategies&#8212;makes coordination nearly impossible without external forcing function.</p><p><strong>Is there a business model that aligns AI platforms and content creators?</strong></p><p>The hardest question of them all. This will need its own article.</p><p>Maybe we&#8217;re watching the end of ad-supported digital publishing as we&#8217;ve known it. Maybe only elite publishers with subscription models and massive brand equity survive. Maybe long-tail and mid-tier publishers simply disappear, and the Agentic Web runs on a handful of mega-publishers and AI-generated content.</p><h2><strong>Where do we end up after all that?</strong></h2><p>Let&#8217;s return to the data:</p><ul><li><p>Zero-click searches jumped from 56% to 69% in 12 months</p></li><li><p>Publishers have lost 25-90% of traffic depending on vertical</p></li><li><p>The entire AI licensing market ($816.7M) is 1.5% of publisher ad-blocking losses ($54B)</p></li><li><p>No alternative revenue model scales to replace advertising</p></li></ul><p><strong>The question is not &#8220;what happens to publishers when clicks disappear?&#8221;</strong></p><p><strong>The question is: &#8220;What do we build so publishers don&#8217;t disappear with the clicks?&#8221;</strong></p><p>The infrastructure to capture inference-time value doesn&#8217;t exist at scale. The attribution systems are immature. The payment rails are nascent. The industry standards are absent. 
The regulatory framework is unclear.</p><p>But the need is urgent. Publisher traffic is collapsing <em>right now</em>. Revenues are declining <em>right now</em>. The &#8220;extinction-level event&#8221; some observers describe is not hypothetical&#8212;it&#8217;s happening.</p><p>The infrastructure to solve this doesn&#8217;t exist at scale yet. But the window to shape it is closing. Publishers who join early marketplaces now&#8212;during alpha and beta phases&#8212;will set pricing benchmarks, influence platform design, and capture premium positioning before the market commodifies.</p><p>In 24 months, marketplace access will be table stakes. The question is whether you&#8217;re setting the terms or accepting them.</p><p><strong>What do you think?</strong> Are marketplace economics the answer, or is there another path forward? I&#8217;m deep in the weeds in this space and talking to publishers navigating these decisions.</p><p>If you&#8217;re a working through publisher and content creator AI monetization strategy, I&#8217;d love to hear your perspective. Reach out on <a href="https://www.linkedin.com/in/ibakagiannis/">LinkedIn</a> or through the <a href="https://context4gpts.com">website</a>.</p><h2><strong>Sources</strong></h2><ul><li><p><a href="https://www.searchenginejournal.com/impact-of-ai-overviews-how-publishers-need-to-adapt/556843/">Search Engine Journal - Impact of AI Overviews on Publishers</a></p></li><li><p><a href="https://click-vision.com/zero-click-search-statistics">Click Vision - Zero-Click Search Statistics 2025</a></p></li><li><p><a href="https://digiday.com/media/google-ai-overviews-linked-to-25-drop-in-publisher-referral-traffic-new-data-shows/">Digiday - 25% Drop in Publisher Referral Traffic</a></p></li><li><p><a href="https://www.edtechinnovationhub.com/news/chegg-reports-24-revenue-drop-sues-google-over-ai-impact-on-online-learning">EdTech Innovation Hub - Chegg Revenue Drop</a></p></li><li><p><a href="https://fortune.com/2025/11/26/ai-slop-recipes-thanksgiving-food-blog-collapse-traffic/">Fortune - AI Impact on Recipe Traffic</a></p></li><li><p><a href="https://www.grocerslist.com/blog/ai-overviews-recipe-traffic-strategy">Grocers List - AI Overviews Recipe Strategy</a></p></li><li><p><a href="https://www.theregister.com/2025/06/22/ai_search_starves_publishers/">The Register - AI Search Starves Publishers</a></p></li><li><p><a href="https://www.dangerous-business.com/how-google-and-ai-are-killing-travel-blogs-like-mine/">Dangerous Business - Google Killing Travel Blogs</a></p></li><li><p><a href="https://developers.slashdot.org/story/25/01/10/1729248/stackoverflow-usage-plummets-as-ai-chatbots-rise">Slashdot - Stack Overflow Usage Plummets</a></p></li><li><p><a href="https://pressgazette.co.uk/media-audience-and-business-data/uk-and-us-publishers-says-google-ai-is-harming-website-traffic/">Press Gazette - Google AI Harming Website Traffic</a></p></li><li><p><a href="https://kpmg.com/us/en/media/news/generative-ai-consumer-trust-survey.html">KPMG - Generative AI Consumer Trust Survey</a></p></li><li><p><a href="https://www.salesforce.com/news/stories/trusted-ai-data-statistics/">Salesforce - Trusted AI Data Statistics</a></p></li><li><p><a href="https://www.designrush.com/agency/search-engine-optimization/trends/zero-click-searches">Design Rush - Zero-Click Searches 2025</a></p></li><li><p><a href="https://www.publift.com/blog/programmatic-advertising-trends">Publift - Programmatic Advertising Trends</a></p></li><li><p><a 
href="https://www.therebooting.com/report/the-state-of-publisher-ad-revenue/">The Rebooting - State of Publisher Ad Revenue</a></p></li><li><p><a href="https://www.admonsters.com/ad-blocking-a-54b-problem-for-publishers-in-2024/">AdMonsters - Ad Blocking $54B Problem</a></p></li><li><p><a href="https://www.socialmediatoday.com/news/facebook-publisher-referrals-decline-50-percent/715745/">Social Media Today - Facebook Publisher Referrals Decline 50%</a></p></li><li><p><a href="https://localmedia.org/2024/01/digital-subscriptions-trends-for-2024-and-what-publishers-can-do-to-grow/">Local Media - Digital Subscriptions Trends 2024</a></p></li><li><p><a href="https://www.inma.org/blogs/reader-revenue/post.cfm/subscription-growth-vs-revenue-growth-it-matters-what-you-measure">INMA - Subscription Growth vs Revenue Growth</a></p></li><li><p><a href="https://voices.media/as-the-paid-reader-base-grows-more-slowly-reducing-churn-is-the-focus-for-publishers-going-into-2024/">Voices Media - Reducing Churn Focus 2024</a></p></li><li><p><a href="https://whop.com/blog/newsletter-statistics/">Whop - Newsletter Statistics</a></p></li><li><p><a href="https://blog.beehiiv.com/p/2025-state-of-email-newsletters-by-beehiiv">beehiiv - 2025 State of Email Newsletters</a></p></li><li><p><a href="https://www.cbinsights.com/research/ai-content-licensing-deals/">CB Insights - AI Content Licensing Deals</a></p></li><li><p><a href="https://www.emetresearch.ai/blogs/market-report-ai-data-licensing-deals-(2020-present)">Emet Research - AI Data Licensing Market Report</a></p></li><li><p><a href="https://digiday.com/media/2024-in-review-a-timeline-of-the-major-deals-between-publishers-and-ai-companies/">Digiday - Major Deals Between Publishers and AI Companies</a></p></li><li><p><a href="https://www.taboola.com/press-release/tabooladigidaysurvey">Taboola - Publishers Using Commerce Content</a></p></li><li><p><a href="https://project-aeon.com/blogs/media-publishers-are-becoming-e-commerce-powerhouses">Project Aeon - Publishers E-commerce Powerhouses</a></p></li><li><p><a href="https://www.inma.org/blogs/advertising-initiative-newsletter/post.cfm/how-publishers-are-capturing-revenue-in-the-post-traffic-era">INMA - Post-Traffic Era Revenue</a></p></li><li><p><a href="https://www.iab.com/wp-content/uploads/2024/05/IAB_US_Podcast_Advertising_Revenue_Study_FY2023_May_2024.pdf">IAB - US Podcast Advertising Revenue 2023</a></p></li><li><p><a href="https://libsyn.com/blog/september-2024-podcast-ad-rates/">Libsyn - September 2024 Podcast Ad Rates</a></p></li><li><p><a href="https://www.marketresearchfuture.com/reports/events-industry-market-12035">Market Research Future - Events Industry Market</a></p></li><li><p><a href="https://www.hulkapps.com/blogs/ecommerce-hub/publisher-strategies-how-leading-media-houses-are-optimizing-revenue-streams-in-2024">Hulk Apps - Publisher Strategies 2024</a></p></li><li><p><a href="https://aimagazine.com/articles/how-can-ai-firms-pay-publishers-perplexity-has-a-plan">AI Magazine - Perplexity Plan to Pay Publishers</a></p></li><li><p><a href="https://digitalcontentnext.org/blog/2025/03/06/ai-content-licensing-lessons-from-factiva-and-time/">Digital Content Next - AI Licensing Lessons from TIME</a></p></li><li><p><a href="https://www.prnewswire.com/news-releases/content-credits-revolutionizes-online-content-accessibility-for-publishers-businesses-and-consumers-302183782.html">PR Newswire - Content Credits Launch</a></p></li><li><p><a 
href="https://aijourn.com/why-ai-makes-micropayments-essential-for-publishers-and-creators/">AI Journal - Why AI Makes Micropayments Essential</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Agentic Web, Part 4: From Search Bars to Gateways]]></title><description><![CDATA[Search returns links; gateways deliver outcomes. This essay defines what an agentic gateway is, provides a practical framework to assess one, and surveys where those gateways are likely to live.]]></description><link>https://bakagiannis.substack.com/p/the-agentic-web-part-4-from-search</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/the-agentic-web-part-4-from-search</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Tue, 05 Aug 2025 16:10:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e-LI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3004a7ff-088f-42f8-98f2-66b8d48d3783_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://ads4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DKn-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png" width="500" height="100" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20780,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://ads4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/167158003?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!DKn-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>A search bar used to be <em>the</em> gateway to the internet. You typed a query; you got ten links. Today, the dominant pattern is different: synthesized answers, suggested actions, and task flows. Google&#8217;s own guidance acknowledges the shift away from the classic &#8220;ten blue links&#8221; toward richer, multimodal results, and its AI Overviews formalize the answer-first pattern.</p><p>But the Agentic Web is much more than an AI Overview. Today, users increasingly express goals -&#8220;book a refundable flight under &#8364;300 next Friday,&#8221; &#8220;migrate my site to HTTPS without downtime,&#8221; &#8220;draft and file this paperwork&#8221; - and expect systems to execute. 
<strong>Thesis:</strong> the front door to the web is becoming an <em>Agentic Gateway</em>: the place where intent is captured, context is grounded, and actions are orchestrated across tools and services.</p><blockquote><p><strong>Previously in this series</strong></p><ul><li><p><strong><a href="https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision">Introduction to the Agentic Web: Vision</a></strong> &#8212; Why the web is shifting from pages to agents and what that enables.</p></li><li><p><strong><a href="https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of">The Agentic Web, Part 2: Anatomy of an Agent</a></strong> &#8212; What components a competent agent needs (memory, tools, planning).</p></li><li><p><strong><a href="https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web">Part 3: Evolution of Web Infrastructure</a></strong> &#8212; How infrastructure (APIs, auth, payments) unlocks end-to-end execution.</p></li></ul></blockquote><h2><strong>Definition: what is an Agentic Gateway?</strong></h2><p>An <strong>Agentic Gateway</strong> is the front door to an autonomous capability. It&#8217;s the layer a user or a system touches to <em>state intent</em>, and the layer that then <em>interprets, plans, executes, verifies,</em> and <em>hands back outcomes</em>, often by coordinating one or more large language models (LLMs) with tools, data, and people.</p><p>Think of it as the mission control for an agentic workflow:</p><ul><li><p>It translates ambiguous goals (&#8220;ship this feature by Friday,&#8221; &#8220;negotiate a better rate,&#8221; &#8220;compile an investment brief&#8221;) into machine-actionable plans.</p></li><li><p>It orchestrates models, tools, and services to execute those plans.</p></li><li><p>It manages context, personalization, and permissions so the agent does the <em>right</em> thing with the <em>right</em> information.</p></li><li><p>It reports progress, asks for clarification when needed, and knows when it&#8217;s done.</p></li></ul><p>Agentic Gateways can be closed-world or open-world. The difference lies in how well defined the agentic workflow is and how clear the definition of success is. Closed-world gateways have clear(er) feedback loops even though they can still interact with the open world (e.g., a coding agent). When we are talking about the web, we are talking about open-world gateways.</p>
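<p>To make the &#8220;mission control&#8221; description concrete, here is a minimal sketch of the loop an open-world gateway runs. It is an illustrative skeleton under assumptions of my own (the planner and the tools below are toy stand-ins), not any vendor&#8217;s implementation:</p><pre><code class="language-python"># Illustrative agentic-gateway loop; the planner and tools are toy stand-ins.
from typing import Callable

def plan(intent: str, context: dict) -> list[dict]:
    # Stand-in planner: a real gateway would use an LLM to decompose the intent.
    return [{"tool": "search", "args": {"query": intent}},
            {"tool": "book", "args": {"budget": context.get("budget", 300)}}]

def run_gateway(intent: str, context: dict, tools: dict[str, Callable]) -> dict:
    outcomes = []
    for step in plan(intent, context):                # 1. translate the goal into steps
        result = tools[step["tool"]](**step["args"])  # 2. execute via a scoped tool
        if result.get("error"):                       # 3. verify; re-plan or ask the user on failure
            return {"status": "needs_clarification", "step": step, "detail": result["error"]}
        outcomes.append(result)
    return {"status": "done", "outcomes": outcomes}   # 4. hand back an outcome, not a list of links

# Toy usage:
tools = {"search": lambda query: {"offers": [f"refundable fare for: {query}"]},
         "book": lambda budget: {"confirmation": f"booked under {budget} EUR"}}
print(run_gateway("flight to Lisbon next Friday", {"budget": 300}, tools))
</code></pre>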
<h2>The Gateway Analysis Framework (GATE)</h2><p>To evaluate any gateway, we&#8217;ll use the <strong>GATE Framework</strong>&#8212;a four-part lens that maps exactly to the capabilities you need to get outcomes.</p><blockquote><p><strong>Framework G.A.T.E.</strong></p><ol><li><p><strong>Grounding Context (G):</strong> <em>Internally available context</em>&#8212;what the gateway already knows.</p></li><li><p><strong>Action Surface (A):</strong> <em>Execution-connected functionality</em>&#8212;tools and extensibility (plugins, APIs, automations).</p></li><li><p><strong>Translation Layer (T):</strong> <em>Web 2.0 compatibility</em>&#8212;ability to render and interact with today&#8217;s sites (forms, cookies), and to fall back to browsing.</p></li><li><p><strong>Engine Competence (E):</strong> <em>Model attributes</em>&#8212;reasoning, planning, multimodality, latency, model size, deployment infrastructure, and how well the gateway chains skills.</p></li></ol></blockquote><h3>G) Grounding Context (Internal Context &amp; Identity)</h3><p><strong>Definition:</strong> The user-specific state the gateway can access by default (profiles, preferences, organization policy, past tasks) and the authorization practices that go along with it.</p><p><strong>Why it matters:</strong><em> Personalization.</em> Without grounding, the agent guesses. With grounding, it minimizes clarification loops, respects constraints, and avoids wrong actions (e.g., booking with the wrong card).</p><p><strong>Example:</strong> An OS copilot with account access knows your calendar, Wi-Fi networks, and installed apps.</p><p><strong>Implication:</strong> Gateways with durable, privacy-aware memory produce faster, more accurate outcomes.</p><h3>A) Action Surface (Execution Tools)</h3><p><strong>Definition:</strong> The set of functions the gateway can call: native tools, third-party APIs, and a mechanism to add new ones safely (scopes, rate limits, audit logs).</p><p><strong>Why it matters:</strong> Outcomes require verbs&#8212;search, fill, sign, buy, deploy. A bare LLM without tools is a talker, not a doer.</p><p><strong>Example:</strong> A &#8220;pay now&#8221; step via Stripe Checkout or Sessions API inside an agentic flow.</p><p><strong>Implication:</strong> Without tools, you get summaries. With tools, you get completed tasks.</p><h3>T) Translation Layer (Web 2.0 Compatibility)</h3><p><strong>Definition:</strong> The ability to interact with today&#8217;s web in a way that respects how it already works: rendering pages, submitting forms, generating content, responding to email, and so on.</p><p><em><strong>Why it matters:</strong></em> Agentic APIs will lag. A pragmatic gateway must interact with legacy sites and forms while staying inside compliance boundaries.</p><p><strong>Example:</strong> An agentic browser that can open a live product page for inspection and still extract facts or complete checkout.</p><p><strong>Implication:</strong> Compatibility buys coverage; it keeps the agent useful before every site exposes an agent API.</p><h3>E) Engine Attributes (Model &amp; Orchestration)</h3><p><strong>Definition:</strong> The reasoning, planning, and multimodal capabilities that translate intent into plans, call tools, and verify outputs, under latency and cost constraints. It also covers where this engine lives.</p><p><em><strong>Why it matters:</strong></em> Weak planning leads to looped prompts and partial results. Strong engines can decompose tasks, check their work, and recover from tool failure. Infrastructure requirements often dictate how powerful the engine can be.</p><p><strong>Example:</strong> Gateways that cite sources, summarize uncertainty, and integrate tool use tightly reduce error rates.<br><strong>Implication:</strong> Confidence isn&#8217;t just model IQ; it&#8217;s the full reliability stack.</p><p><strong>Putting it all together</strong></p><p>Booking a complex multi-city trip requires stored traveler profiles &amp; preferences (G), the ability to book hotels and air travel (A), robust multi-step reasoning with fast responses (E), and validation that the offers and reservations are actually made on the vendor&#8217;s website (T).</p>
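<p>As a reading aid, the GATE lens can be written down as a tiny scorecard. A minimal sketch follows; the weights and the example scores are illustrative assumptions, not measurements:</p><pre><code class="language-python"># Minimal GATE scorecard; scores (0-5) and weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GateScore:
    grounding: int    # G: internal context and identity
    actions: int      # A: execution-connected tools
    translation: int  # T: Web 2.0 compatibility
    engine: int       # E: model and orchestration quality

    def overall(self, weights=(0.25, 0.30, 0.20, 0.25)) -> float:
        parts = (self.grounding, self.actions, self.translation, self.engine)
        return sum(w * s for w, s in zip(weights, parts))

# Hypothetical comparison of two gateway styles (numbers are made up for illustration):
agentic_browser = GateScore(grounding=4, actions=4, translation=5, engine=3)
chat_app = GateScore(grounding=3, actions=4, translation=2, engine=5)
print(round(agentic_browser.overall(), 2), round(chat_app.overall(), 2))  # 3.95 3.6
</code></pre>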
<h2>User Agentic Gateways Evaluation</h2><p>The &#8220;natural home&#8221; of the gateway could be the browser, but user behavior is shifting from passive browsing to conversational and task-centric flows. We have mapped four routes users are actually taking, and we will evaluate each with GATE.</p><h3>Apps (ChatGPT, Claude, etc.)</h3><p>Dedicated AI apps are the <em>default conversational gateways</em> today.</p><ul><li><p><strong>G:</strong> Solid personal memory features are emerging, but portability and third-party authentication are cumbersome.</p></li><li><p><strong>A:</strong> Mature tool ecosystems for search, code, and data. Commerce is emerging as third-party tooling.</p></li><li><p><strong>T:</strong> Can search, scrape, and summarize, but offer no direct compatibility with the traditional web.</p></li><li><p><strong>E:</strong> State-of-the-art reasoning, competitive latency, and probably the best orchestration.</p></li></ul><p><strong>Strength:</strong> Fast innovation and broad tool coverage.</p><p><strong>Risk:</strong> Fragmentation of identity, memory/personalization, and tooling across apps.</p><h3>Integrated AI / Co-pilots (Chrome Extensions)</h3><p>Co-pilots meet users where they work (docs, email, IDEs).</p><ul><li><p><strong>G:</strong> Access to local browsing history and tabs gives rich situational context (if permissions are well-scoped), but there is little to no memory or personalization.</p></li><li><p><strong>A:</strong> Extension APIs execute actions inside the browser (form-fill, DOM click) and call external services; reliability depends on site stability. They are also limited to the page or session at hand.</p></li><li><p><strong>T:</strong> Excellent for Web 2.0 compatibility because they operate &#8220;where the user is,&#8221; but fragile when sites change layouts.</p></li><li><p><strong>E:</strong> The model lives behind a third-party API call, and orchestration options are limited by extension size constraints.</p></li></ul><p><strong>Strength:</strong> Low friction; meets the user in-flow.</p><p><strong>Risk:</strong> They live and die by the attached app/browser. A solution only for today.</p><h3>Agentic Browsers</h3><p>Browsers that ship an agent as a first-class feature attempt to <em>merge gateway and renderer</em>.</p><ul><li><p><strong>G:</strong> A bird&#8217;s-eye view of the user&#8217;s web activity, with built-in identity and workspace memory. Authentication solutions are the most mature.</p></li><li><p><strong>A:</strong> Native headless modes, automation primitives, and deep extensions make them strong executors.
But they unfortunately rely heavily on existing search indexes and scraping, practices that will not stay relevant in the future.</p></li><li><p><strong>T:</strong> Best-in-class rendering and interaction fidelity by design.</p></li><li><p><strong>E:</strong> Competitive models, but heavy memory requirements on device (especially mobile).</p></li></ul><p><strong>Strength:</strong> Deepest integration with the legacy web and the device.</p><p><strong>Risk:</strong> Data privacy issues and incentive alignment with the content providers they are scraping.</p><h3>OS-Embedded AI (phones, computers)</h3><p>The operating system can become the <em>universal gateway</em> across apps, files, and hardware.</p><ul><li><p><strong>G:</strong> Deepest personal context (files, emails, calendars, sensors) with system-level permissions.</p></li><li><p><strong>A:</strong> Can orchestrate across apps (mail, calendar, messages) and invoke device capabilities.</p></li><li><p><strong>T:</strong> Limited direct web manipulation (same as apps), though nothing stops developers from building on-device functionality.</p></li><li><p><strong>E:</strong> Private/local models are increasingly capable, with mixed cloud offload for heavy tasks. This is the biggest potential strength, but it&#8217;s currently a limitation.</p></li></ul><p><strong>Strength:</strong> Strongest personalization with privacy and local identity management.</p><p><strong>Risk:</strong> Model performance. Reasoning is critical in an open-world system, and on-device models can lag the state of the art by a wide margin.</p><h2>Agent-to-Agent Gateways</h2><p>In an open-world setting like the internet, intents range from &#8220;compare 12 EV models&#8221; to &#8220;pay a customs duty&#8221; to &#8220;rebook my flight and preserve seat 14A.&#8221; It is <em>exponentially</em> hard&#8212;practically impossible&#8212;for one agent to include all context, contracts, and capabilities. Two forces make <strong>third-party agents</strong> desirable:</p><ol><li><p><strong>Specialized value beats generality</strong>: a bank&#8217;s agent knows card rules; a retailer&#8217;s agent sees inventory; a logistics agent owns carrier APIs.</p></li><li><p><strong>Fair representation and efficiency</strong>: brands, businesses, and publishers want to speak for themselves, and at the same time gateways don&#8217;t want to re-research settled facts.</p></li></ol><p>For third-party agents to exist, though, there have to be deeper, undeniable incentives tied to existential or monetary value. Generally, we see three reasons for such agents to exist; a small sketch of what the first could look like follows the list.</p><ol><li><p><strong>Monetize execution &amp; context</strong>: Charge per call when the agent makes the gateway &#8220;better&#8221;, that is, when the capability or context adds concrete value. <em>Example:</em> a Stripe payments agent processing checkout, or a sports publisher&#8217;s agent providing the live score of a game.</p></li><li><p><strong>Sell downstream</strong>: Recommend or fulfill products/services and earn margin. <em>Example:</em> BYD&#8217;s agent presenting trims and inventory, or a retail network offering tailored recommendations from its partner stores.</p></li><li><p><strong>Gain distribution</strong>: Use responses to route attention to a creator or brand. <em>Example:</em> Joe Rogan&#8217;s podcast agent offering an opinion about &#8220;who wins: tiger or gorilla&#8221;.</p></li></ol>
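<p>Here is that sketch: a minimal, hypothetical capability manifest for a third-party agent that monetizes execution and context. There is no agreed standard for this today, so every field and price below is an assumption for illustration:</p><pre><code class="language-python"># Hypothetical capability manifest for a third-party agent; all fields are illustrative.
SPORTS_PUBLISHER_AGENT = {
    "agent": "example-sports-publisher",
    "capabilities": [
        {
            "verb": "get_live_score",        # context that makes the gateway's answer better
            "inputs": {"league": "str", "match_id": "str"},
            "price_usd_per_call": 0.002,     # monetize execution and context
            "license": "inference-only",     # payload may not be retained for training
            "attribution_required": True,    # the gateway must cite the source
        },
        {
            "verb": "get_match_report",
            "inputs": {"match_id": "str"},
            "price_usd_per_call": 0.01,
        },
    ],
}

def quote(manifest: dict, verb: str) -> float:
    """Return what a gateway would pay to call this verb, per the manifest."""
    for capability in manifest["capabilities"]:
        if capability["verb"] == verb:
            return capability["price_usd_per_call"]
    raise KeyError(verb)

print(quote(SPORTS_PUBLISHER_AGENT, "get_live_score"))  # 0.002
</code></pre>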
<p>To connect them to a gateway, there are two linking models.</p><h3>Direct-to-Agent</h3><p>The gateway calls a known external agent via an &#8220;agentic API&#8221; like MCP, often with identity and permissions already established.<br><strong>Why it matters:</strong> Low latency, predictable UX, clear accountability.<br><strong>Example:</strong> A user&#8217;s default <em>Payments Agent</em> (e.g., Stripe) handles checkout inside the flow, with pre-authorized methods and receipts.</p><h3>Agentic Marketplace</h3><p>The gateway routes a request to a <em>network</em> to discover the best agent for the intent, then negotiates capabilities and terms.<br><strong>Why it matters:</strong> Coverage and competition, which are useful when the gateway doesn&#8217;t know &#8220;who&#8221; to call.<br><strong>Example:</strong> The user asks &#8220;is the Tesla stock a Buy, a Hold, or a Sell?&#8221;; the gateway then requests information from the network about earnings calls, the latest financials, and expert opinions. The Morningstar agent and the Yahoo agent respond with context that helps the gateway craft a well-rounded response.</p><blockquote><p><strong>Call to Arms:</strong><br>We are working on something exciting in this area. The hardest problems are incentive design, safety, and attribution. Let&#8217;s tackle them together. If you are as passionate about this as we are, reach out!</p></blockquote><h2>Why should I care?</h2><p>We haven&#8217;t found the definitive Agentic Gateway yet&#8212;but we now have a clear way to evaluate contenders with GATE. In the near term, Agentic Browsers are poised to win: they sit where users already act and bridge today&#8217;s Web 2.0 forms and flows. Over the longer horizon, OS-level solutions will most likely prevail by combining deep personal context, permissions, and cross-app execution.</p><p>What this means for you: if you&#8217;re a <strong>publisher, content creator, SaaS, e-commerce store, or platform</strong>, or any of the other internet &#8220;actors&#8221;, you need to ship a third-party agent NOW. Make your capabilities callable (not just readable): expose verbs (quote, book, pay, modify) and content as AI context. Distribution is shifting from pages to agent calls; those without agents will be quietly routed around. Build an agent, get measured by outcomes, and stay visible through the transition.</p><h2>Next and final perspective: Money makes the world go around</h2><p>In the next installment, we&#8217;ll follow the money trail and examine the economic architectures that emerge when agents stop merely reading the web and start signing the checks&#8212;who pays whom, for what, and when.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[The Battle of the Agentic Web Has Begun]]></title><description><![CDATA[Cloudflare&#8217;s decision to block crawlers by default shows that the ad-supported &#8220;link economy&#8221; is dying. A new market for data, discovery, and monetization is coming&#8212;fast.]]></description><link>https://bakagiannis.substack.com/p/the-battle-of-the-agentic-web-has</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/the-battle-of-the-agentic-web-has</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Fri, 04 Jul 2025 09:44:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e-LI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3004a7ff-088f-42f8-98f2-66b8d48d3783_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pq8k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pq8k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pq8k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png" width="564" height="112.8" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43fec168-21af-4972-a22a-1d0586c752cc_500x100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:500,&quot;resizeWidth&quot;:564,&quot;bytes&quot;:20780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/167507003?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pq8k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!Pq8k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fec168-21af-4972-a22a-1d0586c752cc_500x100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>A Not-So-Quiet Flip of the Switch</strong></h2><p>On July 1st 2025 Cloudflare, gatekeeper to roughly a fifth of the internet, turned on crawler blocking for every new customer and rolled out a <strong>Pay Per Crawl</strong> program that lets publishers charge bots for every request. (<a href="https://www.theverge.com/news/695501/cloudflare-block-ai-crawlers-default?utm_source=chatgpt.com">theverge.com</a>)<br>With one configuration change, the company challenged a core assumption of Web 2.0: that anyone&#8212;or anything&#8212;may scrape your content as long as it boosts traffic.</p><h2><strong>How Web 2.0 Incentives Broke</strong></h2><h3>Blame Google</h3><p>Let&#8217;s be honest&#8212;this all starts with Google. If Google had been transparently extractive from the beginning, maybe the flawed incentive design of Web 2.0 would&#8217;ve been exposed much sooner. 
The vicious cycle of:</p><ol><li><p><strong>Intent &#8594; Google Search</strong></p></li><li><p><strong>Search &#8594; Clicks</strong></p></li><li><p><strong>Clicks &#8594; On-page ads</strong></p></li><li><p><strong>Ads &#8594; Revenue</strong></p></li></ol><p>Kept everyone on the hamster spinning wheel. Welcome to the link economy. Google&#8217;s crawler infrastructure made it all work&#8212;indexing and ranking the world&#8217;s information, for free, so long as you played by its rules. The crawler was the cost of doing business, the ad auction the profit engine.</p><p>That engine required bots/crawlers that tirelessly roamed the web, harvesting and scoring content. That was acceptable in a world where humans clicked the links. But the AI companies broke that covenant. An agent that answers in its own window never hands the user back to the publisher, so no ad loads, no CPM, no paycheck.</p><h3>Scraping Was Not Cool</h3><p>For most of the past two decades, automated scraping sat just north of plagiarism and just south of denial-of-service on the moral spectrum. Google got a pass because its crawler sent readers back to us, and those eyeballs converted into ad revenue. That quid-pro-quo kept the link economy humming but also kept the need to accommodate crawlers.</p><p>Then came AI. Foundation-model builders vacuumed up massive corpora, and every new chatbot feature&#8212;summaries, opinions, breaking news&#8212;demanded fresh pages. Suddenly scraping wasn&#8217;t vandalism; it was <em>business as usual</em>.</p><h2><strong>Cloudflare Fires the First Shot</strong></h2><p>To Cloudflare&#8217;s credit, they see what&#8217;s going on&#8212;and, most importantly, they take action. Likely driven by pure capitalist motives, as Saanya Ojha <a href="https://saanyaojha.substack.com/p/the-internet-just-flipped-its-default">pointed out</a>, because the opportunity ahead is simply that great. The current scraping model is unsustainable. Publishers are losing traffic and revenue while AI companies profit from their data. Something has to change.</p><h3>The Plan In Action</h3><p>As of July 1 2025, all new customer sites are set to block known AI bots by default unless site owners explicitly allow them. This marks a transition from the old opt-out model to a new, permission-based framework where publishers can choose to allow, block, or monetize access for specific AI crawlers. To support this, Cloudflare introduced a &#8220;Pay&#8209;Per&#8209;Crawl&#8221; feature and then a marketplace that lets site owners charge a flat fee per crawl request, with payments facilitated directly through HTTP protocols.</p><h3>Three Truths Were Spoken</h3><ul><li><p><strong>Misaligned incentives are unsustainable.</strong> If bots keep draining value, creators will lock down their sites or vanish.</p></li><li><p><strong>Permission is non-negotiable.</strong> A default block forces every AI company to declare itself and ask.</p></li><li><p><strong>Publishers must exist in an agent-friendly web.</strong> Data holders deserve a business model that doesn&#8217;t require an ad stack.</p></li></ul><h3><strong>But They Got Some Things Wrong</strong></h3><ul><li><p>Crawling HTML is a crude way to feed an LLM. It drags along layout debris and forces publishers to run parallel CMS instances (web, feed, llms.txt, etc.).</p></li><li><p>Curating relationships between AI companies and every publisher in the world who might have data that would benefit the users of the AI, is not realistic. 
This would need to happen though if there is an API key for each partnership.</p></li><li><p>Offloading discovery to a marketplace <em>without</em> building search defeats the purpose. Matching the right datum to the right query is the hard (and expensive) part.</p></li></ul><p>Also one thing was missed completely: <strong>Discovery economics.</strong> A marketplace is meaningless without matching. Google subsidised retrieval with ads. Who funds matching in a post-ad world? (talk to me if this area is interesting wink wink)</p><h2><strong>Why AI Still Needs the Open Internet</strong></h2><h3><strong>Why Does AI Need Open Internet Data?</strong></h3><p><strong>Pre-training<br></strong> <em>What the model needs:</em> rich language, stylistic nuance, broad domain knowledge<br> <em>Typical sources:</em> historical web, books, structured corpora</p><p><strong>Fine-tuning<br></strong> <em>What the model needs:</em> task-specific examples, up-to-date terminology<br> <em>Typical sources:</em> partner datasets, proprietary logs</p><p><strong>Inference<br></strong> <em>What the model needs:</em> fresh facts, time-sensitive signals, authoritative context<br> <em>Typical sources:</em> APIs, live feeds, plugins</p><p>Three main buckets of external data that power inference usage:</p><ol><li><p><strong>Perspective &amp; opinion.</strong> Essays, forums, niche newsletters etc</p></li><li><p><strong>Live feeds of reality.</strong> Prices, weather, sports scores, shipping schedules etc anything that is being created at live speed or it is time-sensitive.</p></li><li><p><strong>Credibility signals.</strong> Citations, peer review, historical revisions.</p></li></ol><p>Large-scale training demands bulk access; live inference needs low-latency access. Either way, the web remains the richest, messiest, most up-to-date dataset available. No private corpus can match its breadth.</p><h2><strong>Rethinking Internet&#8217;s Data Business Models</strong></h2><p>So what business models actually make sense? I argue that we have to separate these models based on the intended use of the data.</p><h3>Training Data &#8212; the &#8220;Bottle of Wine&#8221; Scenario</h3><p>Once your words flow into the model&#8217;s weights, they stay there, the way wine poured into a stew can never be poured back into the bottle. From my point of view there are three ways of licensing the data:</p><ol><li><p><strong>Metered Royalty<br></strong> <em>Charge a fee every time the AI uses knowledge traced back to your work.</em></p><ul><li><p><em>Appeal:</em> Feels equitable. Pay me when you profit from me.</p></li><li><p><em>Problem:</em> Detecting those moments is like spotting a single grape in that stew. Even the vendor can&#8217;t do it currently, and they have every incentive to undercount.</p></li></ul></li><li><p><strong>Revocable Lease<br></strong> <em>Grant access now, pull it later if terms sour.</em></p><ul><li><p><em>Appeal:</em> Keeps pressure on the AI company to behave.</p></li><li><p><em>Problem:</em> Impossible to &#8220;un-train&#8221; a model without wrecking it; the wine is already simmering. 
The threat is an illusion.</p></li></ul></li><li><p><strong>One-Time, Perpetual Licence<br></strong> <em>Sell the rights up front&#8212;no strings, no meter.</em></p><ul><li><p><em>Appeal:</em> Zero tracking overhead, zero litigation about who owes what.</p></li><li><p><em>Problem:</em> You must be comfortable never clawing back control.</p></li></ul></li></ol><p><strong>Best fit</strong>: <strong>One-Time, Perpetual Licence</strong></p><p>Technical enforcement for options 1 and 2 simply doesn&#8217;t exist at production scale, and every extra audit hop slows the product you hope to monetize. Choosing the perpetual route is less about generosity and more about admitting physics: once the model swallows your data, policing bites is fantasy.</p><h3>Inference Data &#8212; the &#8220;Rental Car&#8221; Scenario</h3><p>Here, your content sits outside the model. The AI calls for it only when a user&#8217;s question needs it, much like renting a car for a day trip. Sounds easy to meter&#8212;until you spot the loopholes. The main one is that inference data often becomes training data after the fact. If the AI company logs the conversation or fine-tunes its orchestrator, your &#8220;real-time&#8221; data just becomes &#8220;forever&#8221; data. I am seeing three main options for monetizing such data:</p><ol><li><p><strong>Pay-Per-Call API<br></strong> <em>Tiny fee every time the model hits your endpoint.</em></p><ul><li><p><em>Upside:</em> Straightforward invoice: X calls &#215; Y cents. No data manipulation.</p></li><li><p><em>Snag:</em> You must trust the AI company&#8217;s logs, or fund a third-party auditor that slows everything down. And if those logs later feed training, you&#8217;re accidentally back in the &#8220;Bottle of Wine.&#8221; Also you have to maintain the API.</p></li></ul></li><li><p><strong>Pay-Per-Crawl </strong>(Cloudflare&#8217;s pitch)<br> <em>Same as above but with a unified interface.</em></p><ul><li><p><em>Upside:</em> Using one &#8220;connector&#8221; to the data instead of managing XXX APIs.</p></li><li><p><em>Snag:</em> Same as above plus you have to correctly route traffic to the &#8220;correct&#8221; spots.</p></li></ul></li><li><p><strong>Gate &amp; Transform<br></strong> <em>License each retrieval <strong>and</strong> strip the payload to the bare minimum: summaries, embeddings, or partial snippets.</em></p><ul><li><p><em>Upside:</em> Your core IP never lands whole on the AI company&#8217;s disk, making downstream training far less valuable.</p></li><li><p><em>Snag:</em> It does not exist. Hit me up if this is interesting to you.</p></li></ul></li></ol><p><strong>Best fit</strong>: <strong>Pay-Per-Crawl (today) but Gate &amp; Transform (tomorrow)</strong><br>The best starting point would be to align with Cloudflare. Limit bot exposure and start getting back something for the consumed data. But if you truly don&#8217;t want your prose immortalized inside someone else&#8217;s model, you must control every retrieval and obscure the raw source. Anything less is a polite invitation for the vendor to turn today&#8217;s rental into tomorrow&#8217;s permanent fleet car.</p><h2>Where Next?</h2><p>For the Agentic Web to materialise, we still need serious infrastructure and protocol work&#8212;see my other articles for the deep dive. 
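</p><p>One concrete piece of that work is the &#8220;Gate &amp; Transform&#8221; idea above. Here is a minimal sketch of what such a licensed-retrieval endpoint could return; the shape and the field names are assumptions for illustration, not an existing protocol:</p><pre><code class="language-python"># Hypothetical "Gate and Transform" retrieval: license each call, never ship the raw article.
import hashlib

ARTICLE = {
    "url": "https://publisher.example/analysis",
    "full_text": "the publisher's original reporting, which never leaves the publisher",
    "summary": "Two-sentence licensed summary of the analysis.",
}

def gate_and_transform(query: str, platform_id: str, price_usd: float = 0.002) -> dict:
    """Return a stripped, licensed payload plus an auditable receipt."""
    receipt = hashlib.sha256(f"{platform_id}|{query}|{ARTICLE['url']}".encode()).hexdigest()[:16]
    return {
        "payload": ARTICLE["summary"],  # summary or embedding only; the raw text stays gated
        "attribution": ARTICLE["url"],
        "price_usd": price_usd,         # charged per retrieval
        "license": "inference-only",    # downstream training is not permitted
        "receipt": receipt,             # provenance trail for later auditing
    }

print(gate_and_transform("what does the analysis conclude?", "ai-platform-123"))
</code></pre><p>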
The bottom line is clear: AI agents must become first-class citizens of the new Internet, and that means fresh rules and new monetisation options.</p><p>I applaud Cloudflare&#8217;s move; it&#8217;s a strong first step, but there&#8217;s plenty left to do. These are exciting times. If this prospect excites you too&#8212;and you&#8217;d like to get involved as a collaborator, investor, or stakeholder with a monetization angle&#8212;drop me a line. I&#8217;d love to talk.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Agentic Web Part 3: Evolution of Web Use]]></title><description><![CDATA[Understand how current web primitives are being transformed through AI-first interaction.]]></description><link>https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Mon, 30 Jun 2025 15:42:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wp28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67f41c29-ff2f-45d1-9fe7-4a8352c68f35_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.ads4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bUVD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bUVD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png" width="500" height="100" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.ads4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/167185859?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bUVD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!bUVD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a4b958-e1f9-41b2-9153-aec631a6e5ec_500x100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>In Part 1</strong>, we defined the <em>Agentic Web</em>: a shift from static pages to outcome-driven interactions powered by AI agents.<br><strong>In Part 2</strong>, we examined the anatomy of an AI agent and its web.</p><p>Here, we dive into how core web use is transforming. Browsing gives way to <strong>delegation</strong>. 
The web stops being a place to click and becomes a system that acts.</p><h2>Web 2.0 Core Use Cases</h2><p><strong>Stay informed</strong><br>&#8211; Manually visit news sites, RSS, newsletters<br>&#8211; Search for &#8220;what happened?&#8221;</p><p><strong>Learn &amp; research</strong><br>&#8211; Keyword search &#8594; skim multiple sources &#8594; bookmark or copy&#8211;paste notes</p><p><strong>Communicate &amp; build community</strong><br>&#8211; Email, chat apps, social media feeds, forums</p><p><strong>Consume entertainment</strong><br>&#8211; Stream video/music, play web games, scroll memes</p><p><strong>Discover &amp; buy</strong><br>&#8211; Search + ad/social referrals &#8594; compare offers &#8594; fill checkout forms</p><p><strong>Manage money</strong><br>&#8211; Log in to online banking, trading dashboards, crypto wallets</p><p><strong>Do work &amp; create</strong><br>&#8211; SaaS dashboards, cloud docs, CMS/blog editors</p><p><strong>Book &amp; coordinate services</strong><br>&#8211; Flight portals, ride-hailing, food delivery, tele-health portals</p><p><strong>Self-development &amp; education</strong><br>&#8211; MOOCs, language apps, digital training platforms</p><h2>From One-Size-Fits-All to Adaptive Automation</h2><p>In the early days of AI on the web, interaction was treated as a one-size-fits-all experience: enter a prompt, let the model run, accept the output.</p><p>But this oversimplifies reality. Human behaviour isn&#8217;t uniform&#8212;it's contextual, emotionally layered, and risk-sensitive. Users calibrate trust in AI systems based on the stakes, emotional significance, and potential consequences of each action.</p><h3>Why Maslow Still Matters in the Agentic Web</h3><p>To design automation that feels trustworthy, we must align it with <strong>Maslow&#8217;s Hierarchy of Needs</strong>:</p><ul><li><p><strong>Physiological Needs</strong> &#8211; food, shelter, basic goods</p></li><li><p><strong>Safety Needs</strong> &#8211; health, financial stability, protection</p></li><li><p><strong>Belonging &amp; Love</strong> &#8211; relationships, community, connection</p></li><li><p><strong>Esteem</strong> &#8211; status, achievement, personal value</p></li><li><p><strong>Self-Actualization</strong> &#8211; growth, creativity, purpose</p></li></ul><p>The further up the pyramid a task falls, the more <strong>emotional weight</strong>, <strong>irreversibility</strong>, and <strong>regulatory impact</strong> it tends to carry. 
Consequently, the more <strong>nuanced and collaborative</strong> automation must become.</p><h3>Trust Calibration: Matching Automation to Human Psychology</h3><p><strong>Factor: Cost or Risk</strong></p><ul><li><p><em>Low-Stakes Tasks</em>: $5 household item, news digest</p></li><li><p><em>High-Stakes Tasks</em>: Designer goods, healthcare, legal matters</p></li></ul><p><strong>Factor: Emotional Weight</strong></p><ul><li><p><em>Low-Stakes Tasks</em>: &#8220;Refill dog food&#8221;</p></li><li><p><em>High-Stakes Tasks</em>: &#8220;Plan my wedding menu&#8221;</p></li></ul><p><strong>Factor: Reversibility</strong></p><ul><li><p><em>Low-Stakes Tasks</em>: Easily undone (cancel, edit, re-order)</p></li><li><p><em>High-Stakes Tasks</em>: Difficult to unwind (legal filings, medical decisions)</p></li></ul><p><strong>Factor: Regulation</strong></p><ul><li><p><em>Low-Stakes Tasks</em>: Light or none</p></li><li><p><em>High-Stakes Tasks</em>: Heavily regulated (finance, health, privacy, compliance)</p></li></ul><p><strong>Key Insight</strong>:</p><ul><li><p><strong>Basic-level tasks</strong> &#8594; full automation</p></li><li><p><strong>Mid to upper-level tasks</strong> &#8594; consultative, agent-supported experiences</p></li></ul><h2>Autonomy Spectrum for Core Web Use Cases</h2><p>As we examined, not every task on the web requires&#8212;or deserves&#8212;the same level of oversight. Some can be fully delegated to agents, while others demand active human involvement. The <strong>Autonomy Spectrum</strong> illustrates how common use cases divide across three modes of control: <strong>Agent-Led (full autonomy)</strong>, <strong>Collaborative (partial autonomy)</strong>, and <strong>User-Led (low autonomy)</strong>.</p><p><strong>Use Case: Stay Informed</strong></p><ul><li><p><em>Agent-Led</em>: Daily news digest, sentiment alerts</p></li><li><p><em>Collaborative</em>: Curated deep-dive</p></li><li><p><em>User-Led</em>: Op-ed comparison</p></li></ul><p><strong>Use Case: Learn &amp; Research</strong></p><ul><li><p><em>Agent-Led</em>: Collect abstracts</p></li><li><p><em>Collaborative</em>: Draft literature review</p></li><li><p><em>User-Led</em>: Final thesis</p></li></ul><p><strong>Use Case: Communicate</strong></p><ul><li><p><em>Agent-Led</em>: Auto-sort inbox</p></li><li><p><em>Collaborative</em>: Suggest talking points</p></li><li><p><em>User-Led</em>: Deliver bad news</p></li></ul><p><strong>Use Case: E-Commerce</strong></p><ul><li><p><em>Agent-Led</em>: Restock consumables</p></li><li><p><em>Collaborative</em>: Laptop shortlist</p></li><li><p><em>User-Led</em>: One-of-a-kind art</p></li></ul><p><strong>Use Case: Finance</strong></p><ul><li><p><em>Agent-Led</em>: Pay utilities</p></li><li><p><em>Collaborative</em>: Portfolio rebalance</p></li><li><p><em>User-Led</em>: High-risk investment</p></li></ul><p><strong>Use Case: Travel &amp; Logistics</strong></p><ul><li><p><em>Agent-Led</em>: Book commutes</p></li><li><p><em>Collaborative</em>: Business trip planning</p></li><li><p><em>User-Led</em>: Honeymoon</p></li></ul><p><strong>Use Case: Creative Work</strong></p><ul><li><p><em>Agent-Led</em>: Resize images</p></li><li><p><em>Collaborative</em>: First-pass ad copy</p></li><li><p><em>User-Led</em>: Final brand voice</p></li></ul><p><strong>Use Case: Security &amp; Compliance</strong></p><ul><li><p><em>Agent-Led</em>: Patch vulnerabilities</p></li><li><p><em>Collaborative</em>: Flag unusual logins</p></li><li><p><em>User-Led</em>: Regulatory reports</p></li></ul><h2>Deep Dive: E-Commerce at Two 
Extremes</h2><p><strong>Household Staples (Toilet Paper)</strong></p><ul><li><p><strong>Intent</strong>: &#8220;Buy the usual brand, cheapest price, deliver tomorrow.&#8221;</p></li><li><p><strong>Agent Action</strong>:</p><ul><li><p>Checks price/coupons</p></li><li><p>Verifies discounts</p></li><li><p>Executes payment</p></li></ul></li><li><p><strong>User Involvement</strong>:</p><ul><li><p>Push notification: &#8220;Order placed: $11.20, arrives Tue.&#8221;</p></li></ul></li><li><p><strong>Why It Works</strong>: Low cost, reversible, no emotional weight.</p></li></ul><p><strong>Luxury Apparel (Designer Dress)</strong></p><ul><li><p><strong>Intent</strong>: &#8220;Find a black cocktail dress, budget &#8364;800, deliver before July 10.&#8221;</p></li><li><p><strong>Agent Action</strong>:</p><ul><li><p>Curates options with return policies</p></li><li><p>Flags shipping estimates</p></li></ul></li><li><p><strong>User Involvement</strong>:</p><ul><li><p>Reviews shortlist</p></li><li><p>Confirms preference and payment</p></li></ul></li><li><p><strong>Why Collaboration Matters</strong>: High cost, taste sensitivity, potential return hassle.</p></li></ul><h2>Behaviour Shift: From Browsing to Outcomes</h2><p>We established that web usage is changing from the bottom up. The Agentic Web reframes the question from <strong>&#8220;Where should I click?&#8221;</strong> to <strong>&#8220;What should happen?&#8221;</strong></p><p><strong>Browsing</strong>, the core user behaviour of the current web, is shaken to its core. Many businesses are built around this &#8220;random walk&#8221; pattern of influencing users. With outcome-driven agents, much of this activity diminishes, but not all of it.</p><h3>Behaviours Likely to Fade</h3><p><strong>Fading Task: Typing search queries and clicking through 10 blue links</strong></p><p><em>Why It Disappears</em>: Agents gather, rank, and synthesize facts; users receive direct answers or auto-performed actions.</p><p><strong>Fading Task: Hand-comparing prices and coupon codes</strong></p><p><em>Why It Disappears</em>: Agents benchmark and buy when target conditions are met.</p><p><strong>Fading Task: Filling repetitive forms</strong></p><p><em>Why It Disappears</em>: Agents transmit verified identity and payment tokens via secure APIs.</p><p><strong>Fading Task: Daily email triage</strong></p><p><em>Why It Disappears</em>: Agents auto-sort, draft replies, or resolve routine items.</p><p><strong>Fading Task: SEO-driven &#8220;listicle&#8221; content farms</strong></p><p><em>Why It Disappears</em>: Thin content loses relevance as agents filter for decision-ready information.</p><p><strong>Fading Task: Banner and pre-roll advertising</strong></p><p><em>Why It Disappears</em>: Agents filter non-value ads; commerce shifts to API-level offers and rev-share models.</p><p><strong>Fading Task: Manual social cross-posting &amp; scheduling</strong></p><p><em>Why It Disappears</em>: Agents generate, localize, A/B-test, and auto-publish across platforms.</p><p><strong>Fading Task: One-size-fits-all learning modules</strong></p><p><em>Why It Disappears</em>: Adaptive tutors offer personalized flows.</p><p><strong>Fading Task: First-level customer support chat trees</strong></p><p><em>Why It Disappears</em>: Domain-specific agents resolve routine queries; humans intervene only for edge cases.</p><h3>Why We&#8217;ll Still Load the Site</h3><p>Automation will handle low-stakes tasks, but humans will continue to access traditional websites in critical situations&#8212;where <strong>trust, 
regulation, or experience</strong> are key.</p><p><strong>Reason: Trust and liability</strong></p><p><em>When It Matters</em>: Medical, legal, and financial content where users need to verify the author, credentials, and source authority.</p><p><strong>Reason: Immersive shopping</strong></p><p><em>When It Matters</em>: Augmented reality demos, 3D product views, and virtual try-ons that enhance purchase confidence.</p><p><strong>Reason: Community and story</strong></p><p><em>When It Matters</em>: Forums, comment sections, live events, and newsletters that foster social engagement and ongoing participation.</p><p><strong>Reason: Complex interactivity</strong></p><p><em>When It Matters</em>: Configurators, dashboards, and simulation tools that require real-time input and responsive interfaces.</p><p><strong>Reason: Identity and transactions</strong></p><p><em>When It Matters</em>: Secure checkouts, user account portals, and Know Your Customer (KYC) processes where manual review or confirmation is essential.</p><p><strong>Reason: Emotional or milestone decisions</strong></p><p><em>When It Matters</em>: Life events like planning a wedding, choosing a school, or evaluating surgical options&#8212;situations that demand deep content, visual context, and deliberate comparison.</p><p><strong>Rule of thumb</strong>:<br>If the user must <strong>feel</strong>, <strong>prove</strong>, or <strong>experience</strong> something&#8212;<strong>emotionally</strong>, <strong>legally</strong>, or <strong>interactively</strong>&#8212;they will still open the website or original source.</p><h2>Next: Agentic Interfaces</h2><p>We&#8217;ll shift from back-end rails to the <strong>touchpoints where humans and agents converge</strong>, tracing the emergent patterns that let software signal intent, share control, and fade elegantly into the background.</p><p>In short, <strong>Part 4</strong> maps how interface design must evolve when <strong>autonomy</strong>, not clicks, becomes the primary mode of interaction.</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for your interest in my thoughts. 
Now pass the knowledge on!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/p/agentic-web-part-3-evolution-of-web?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><p>Learn more about the way ADS4GPTS is changing the monetization of the internet by aligning human and AI incentives</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.ads4gpts.com&quot;,&quot;text&quot;:&quot;Visit ADS4GPTS&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.ads4gpts.com"><span>Visit ADS4GPTS</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Agentic Web Part 2: Anatomy of the Agentic Web]]></title><description><![CDATA[Examine the technical components and workflows of the Agentic web.]]></description><link>https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Mon, 30 Jun 2025 15:35:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wp28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67f41c29-ff2f-45d1-9fe7-4a8352c68f35_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.ads4gpts.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dhnw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dhnw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png" width="500" height="100" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.ads4gpts.com&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/167164907?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dhnw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!Dhnw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda460c1b-d800-474c-ad0a-7d70cde03b7b_500x100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>Welcome to Part 2.</strong><br><em>Imagine you open an invoice link and your screen fills with a 600-line JSON blob.<br></em> A procurement bot could parse, validate, and pay that bill in under a second.<br> You, meanwhile, are left hunting for the total and wondering if your browser just broke.</p><p>That tiny hiccup captures the core design problem of the <strong>Agentic Web</strong>: people and software agents want the <strong>same data for different reasons via radically different routes.</strong></p><h2><strong>Web 2.0 Actors &#8211; Who&#8217;s Really Online?</strong></h2><p>Ask the average passer-by who &#8220;uses&#8221; the internet and they&#8217;ll picture a human in front of a screen. Reality is more crowded. 
In 2025, automated traffic&#8212;everything from search-engine spiders to health-check pings&#8212;now edges out human clicks, accounting for <strong>51 % of total web requests</strong>.(<a href="https://www.securityweek.com/bot-traffic-surpasses-humans-online-driven-by-ai-and-criminal-innovation/?utm_source=chatgpt.com">securityweek.com</a>) Much of that spike comes from AI crawlers such as those run by OpenAI and Anthropic, which vacuum up content to train their models.</p><p>Understanding the <strong>intent</strong> behind each request&#8212;not just its interface&#8212;matters, because two identical HTTP calls can serve wildly different purposes. Here&#8217;s the cast:</p><h3>Humans (Users)</h3><ul><li><p><strong>What They Care About:</strong><br>Clear outcomes &#8212; trust, delight, task completion</p></li><li><p><strong>Core Traits:</strong><br>Sensory, context-rich, emotion-driven</p></li><li><p><strong>Primary Constraints:</strong><br>Cognitive load, accessibility, privacy expectations</p></li></ul><h3>Bots (Crawlers / Scrapers)</h3><ul><li><p><strong>What They Care About:</strong><br>A comprehensive, fresh link graph; maximum page fetches per crawl cycle</p></li><li><p><strong>Core Traits:</strong><br>Headless, pattern-matching, largely stateless</p></li><li><p><strong>Primary Constraints:</strong><br><code>robots.txt</code>, CAPTCHA walls, IP blocks, bandwidth ceilings</p></li></ul><h3>Application System Processes</h3><p><em>(APIs, webhooks, schedulers, service-mesh sidecars, infra probes)</em></p><ul><li><p><strong>What They Care About:</strong><br>Reliable machine-to-machine orchestration &#8212; payments, health checks, ETL jobs, retries</p></li><li><p><strong>Core Traits:</strong><br>Deterministic, idempotent, authenticated</p></li><li><p><strong>Primary Constraints:</strong><br>Auth tokens/HMAC, exponential back-off, graceful degradation</p></li></ul><p><strong>Interface &#8800; Intent.</strong> A single URL might be fetched by a smartphone browser, a price-monitoring bot, or a serverless cron job. The packet payloads look the same; the motives&#8212;and therefore the design constraints&#8212;do not.</p><h2><strong>Why Bots Exist</strong></h2><p>Automation wins whenever a task can be distilled into repeatable HTTP calls and the payoff per request outstrips the cost of spinning up compute. Cloud minutes are cheap, bandwidth is plentiful, and HTTP is permission-less by design. 
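</p><p>To see why, here is a deliberately rough back-of-the-envelope calculation. Every number is an illustrative assumption, not a measured figure; the only point is the asymmetry between cost and expected payoff.</p><pre><code># Illustrative economics of an automated probe campaign (all numbers assumed).
requests         = 1_000_000     # probes fired in one campaign
cost_per_request = 0.000005      # dollars of compute + bandwidth per request (assumed)
hit_rate         = 0.0005        # fraction of probes that "pay off" (assumed)
value_per_hit    = 2.00          # dollars gained per successful probe (assumed)

total_cost   = requests * cost_per_request           # $5
expected_win = requests * hit_rate * value_per_hit   # $1,000

print(f"cost ${total_cost:,.2f} vs expected payoff ${expected_win:,.2f}")
</code></pre><p>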
That math makes an automated probe or fetch &#8220;always worth a try.&#8221;</p><h3><strong>The Good Side: Essential Infrastructure</strong></h3><ul><li><p><strong>Indexing &amp; Discovery.</strong> Search-engine crawlers trawl billions of pages so humans can find one relevant result in milliseconds&#8212;a feat no manual workforce could fund or finish.</p></li><li><p><strong>Performance &amp; Health.</strong> CDN nodes, uptime monitors, and service-mesh sidecars fire constant pings to keep modern apps reliable.</p></li><li><p><strong>Market Efficiency.</strong> Price-comparison bots, accessibility readers, and research aggregators turn raw pages into actionable data streams.</p></li></ul><p>These &#8220;good bots&#8221; underpin everything from Google Search to real-time stock quotes.</p><h3><strong>The Dark Side: Exploitation</strong></h3><p>The same cost asymmetry fuels a myriad of bad things:</p><h4>Account Takeover</h4><ul><li><p><strong>Bot Tactic:</strong> Credential-stuffing scripts hammer login endpoints</p></li><li><p><strong>Impact:</strong> Mass breaches, identity theft</p></li></ul><h4>Scalping &amp; Reselling</h4><ul><li><p><strong>Bot Tactic:</strong> Millisecond-level checkout automation</p></li><li><p><strong>Impact:</strong> Empty shelves, inflated resale prices</p></li></ul><h4>Content Theft</h4><ul><li><p><strong>Bot Tactic:</strong> Full-site scrapers ignore <code>robots.txt</code></p></li><li><p><strong>Impact:</strong> SEO dilution, lost ad and subscription revenue</p></li></ul><h4>Ad Fraud</h4><ul><li><p><strong>Bot Tactic:</strong> Headless browsers spoof impressions and clicks</p></li><li><p><strong>Impact:</strong> Billions in wasted ad spend</p></li></ul><h3>Scale Works Against You</h3><p>In 2024, <strong>automated traffic officially overtook human traffic</strong> for the first time:</p><blockquote><p>&#128314; Bots generated <strong>51% of all web traffic</strong>, with <strong>37% classified as &#8220;bad&#8221;</strong><br>&#8212; <a href="https://www.imperva.com/">Imperva, 2024</a></p></blockquote><h4><strong>One Protocol, Divergent Motives</strong></h4><p>Two GETs to the same URL can arrive from a Chrome tab, a polite crawler, or a credential-stuffing botnet. The packet footprints match; the intentions diverge.</p><h2><strong>New Kid on the Block: AI Agents</strong></h2><p>Web 2.0 bots execute pre-baked scripts; Web 4.0 agents <strong>reason, plan, and adapt.</strong> They consume a goal (&#8220;book me the cheapest flight tomorrow morning&#8221;), fetch just-in-time context, decide on a sequence of calls, and then act&#8212;without a human hand on every click. 
Researchers have begun calling this emergent layer the <strong>&#8220;Agentic Web,&#8221;</strong> where autonomous software negotiates, purchases, and publishes alongside us.(<a href="https://www.linkedin.com/pulse/agentic-web-how-40-ai-generative-intelligence-redefining-jha-1ydhc?utm_source=chatgpt.com">linkedin.com</a>,<a href="https://www.frontiersin.org/journals/blockchain/articles/10.3389/fbloc.2025.1591907/full?utm_source=chatgpt.com"> frontiersin.org</a>)</p><p>The upgrade isn&#8217;t merely faster scripting; it&#8217;s a shift from <strong>automation</strong> (repeatable tasks) to <strong>autonomy</strong> (goal-directed workflows):</p><ul><li><p><strong>What They Care About:</strong><br>Immediate, accurate outcomes with minimal latency &#8212; clean data, unambiguous instructions, and deterministic results</p></li><li><p><strong>Core Traits:</strong><br>Code-driven, data-hungry, task-oriented<br>Endowed with memory and planning capabilities</p></li><li><p><strong>Primary Constraints:</strong><br>Rate limits<br>Strong authentication<br>Observability<br>Predictable side effects</p></li></ul><p>In short, the web&#8217;s newest participant isn&#8217;t just another bot; it&#8217;s a decision-maker. Understanding its incentives and guardrails is crucial, because when agents act, they do so at machine speed&#8212;but with stakes that feel distinctly human.</p><h3><strong>The Universal Agentic Workflow Framework</strong></h3><p>Autonomous agents outperform dumb bots because they follow a disciplined, <strong>human-modeled loop</strong>: they set a goal, plan, gather context, act, check their own work, and learn from the outcome. Below is the canonical seven-phase cycle every reliable agent must execute on the Agentic Web.</p><h4><strong>1. Intent (Query)</strong></h4><ul><li><p><strong>What Happens:</strong> Extract the user&#8217;s goal and constraints. Clarify ambiguities (e.g., &#8220;Which Dr. Lewis?&#8221;).</p></li><li><p><strong>Why It Matters:</strong> Clear intent prevents downstream rework and misalignment.</p></li><li><p><strong>Actor:</strong> &#129489; Human (user prompt)</p></li></ul><h4><strong>2. Reasoning</strong></h4><ul><li><p><strong>What Happens:</strong> Decompose the goal into ordered sub-tasks. Choose a compliant and efficient approach.</p></li><li><p><strong>Why It Matters:</strong> Poor reasoning in regulated domains can lead to liability.</p></li><li><p><strong>Actor:</strong> &#129302; Interface agent</p></li></ul><h4><strong>3. Context Gathering</strong></h4><ul><li><p><strong>What Happens:</strong> Pull relevant data &#8212; personal preferences, credentials, policy limits, inventory. May coordinate with other agents.</p></li><li><p><strong>Why It Matters:</strong> Even flawless logic fails on stale or incomplete data.</p></li><li><p><strong>Actor:</strong> &#129302; Interface + external agents / data APIs</p></li></ul><h4><strong>4. 
Execution (Tool Calls)</strong></h4><ul><li><p><strong>What Happens:</strong> Call APIs, complete forms, trigger RPA &#8212; all as atomic and reversible steps.</p></li><li><p><strong>Why It Matters:</strong> This is where latency, rate limits, and edge cases become visible.</p></li><li><p><strong>Actor:</strong> &#129302; Interface agent (local tools) + external services</p></li></ul><h4><strong>5. Reflection</strong></h4><ul><li><p><strong>What Happens:</strong> Verify the outcome against the original intent. Compare before/after state. Log discrepancies.</p></li><li><p><strong>Why It Matters:</strong> Catches silent failures and powers the learning loop.</p></li><li><p><strong>Actor:</strong> &#129302; Interface agent</p></li></ul><h4><strong>6. Human Audit</strong></h4><ul><li><p><strong>What Happens:</strong> Pause for review, approval, or override&#8212;especially in high-stakes scenarios.</p></li><li><p><strong>Why It Matters:</strong> Satisfies ethical, legal, and emotional thresholds.</p></li><li><p><strong>Actor:</strong> &#129489; Human</p></li></ul><h4><strong>7. Iterative Feedback</strong></h4><ul><li><p><strong>What Happens:</strong> Store explicit &#128077; / &#128078; or learn from implicit corrections. Continuously update behavior.</p></li><li><p><strong>Why It Matters:</strong> Turns one success into a pattern of accuracy gains.</p></li><li><p><strong>Actor:</strong> &#129489; Human + &#129302; Interface agent</p></li></ul><h4><strong>Who Does What?</strong></h4><ul><li><p><strong>Humans</strong>: supply the <em>query</em>, review high-risk actions, and provide feedback.</p></li><li><p><strong>Interface Agent</strong>: The AI that directly communicates with the human. Plans, personalizes via memory, executes trusted tools, and self-reflects.</p></li><li><p><strong>External Agents/Capabilities</strong>: enrich context (e.g., web search, partner APIs) and perform domain-specific tool calls the Interface agent can&#8217;t.</p></li></ul><p>Two capabilities super-charge that middle layer:</p><ol><li><p><strong>Search</strong> &#8211; live, scoped retrieval of facts the agent doesn&#8217;t yet know.</p></li><li><p><strong>Computer Use</strong> &#8211; browser automation for any site that lacks a public API, keeping agents as versatile as humans with a mouse.</p></li></ol><h3><strong>Agentic Protocol Layer</strong></h3><p>Agents need more than raw HTTP. They rely on a thin but crucial protocol layer that standardises how context is loaded, tasks are handed off, work is orchestrated, and results are audited. 
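</p><p>Before surveying the individual specs, here is a minimal sketch of the seven-phase loop above expressed as code. Every function is a hypothetical stub, not an existing framework API; the point is only to show where planning, tool calls, reflection, and the human audit sit relative to each other.</p><pre><code># Sketch of the seven-phase agentic loop (all functions are illustrative stubs).

def clarify_intent(prompt):                        # 1. Intent
    return {"goal": prompt, "constraints": {"budget_eur": 800}}

def plan(intent):                                  # 2. Reasoning
    return ["search_offers", "rank_offers", "purchase"]

def gather_context(intent):                        # 3. Context gathering
    return {"preferences": {"colour": "black"}, "payment_token": "tok_demo"}

def execute(step, intent, context):                # 4. Execution (tool calls)
    return {"step": step, "status": "ok"}          # a real agent would call external APIs here

def reflect(results, intent):                      # 5. Reflection
    return all(r["status"] == "ok" for r in results)

def human_audit(results, high_stakes):             # 6. Human audit
    return input("Approve? [y/N] ").lower() == "y" if high_stakes else True

def record_feedback(ok):                           # 7. Iterative feedback
    print("feedback stored:", "positive" if ok else "negative")

def run_agent(prompt, high_stakes=False):
    intent  = clarify_intent(prompt)
    steps   = plan(intent)
    context = gather_context(intent)
    results = [execute(step, intent, context) for step in steps]
    ok = reflect(results, intent) and human_audit(results, high_stakes)
    record_feedback(ok)
    return results if ok else None

run_agent("find a black cocktail dress under 800 EUR", high_stakes=True)
</code></pre><p>The interesting engineering lives inside gather_context and execute, which is exactly where the protocol layer comes in. 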
Below is a snapshot of that layer, aligned to today&#8217;s live specs and vendor road-maps.</p><h4><strong>Context &amp; Tool Access</strong></h4><p><strong>Purpose:</strong> Expose data and executable functions to models<br><strong>Key Specs:</strong></p><ul><li><p>Model Context Protocol (MCP) &#8212; <em>"USB-C for AI context"</em> (<a href="https://www.anthropic.com/">Anthropic</a>)</p></li><li><p>Function-calling / Agent-invocation APIs (<a href="https://platform.openai.com/">OpenAI</a>, <a href="https://docs.aws.amazon.com/">AWS</a>, <a href="https://cloud.google.com/">Google Cloud</a>)</p></li></ul><h4><strong>Agent-to-Agent Collaboration</strong></h4><p><strong>Purpose:</strong> Structured task hand-off and negotiation between autonomous agents<br><strong>Key Specs:</strong></p><ul><li><p>A2A (Agent-to-Agent) Protocol (<a href="https://linuxfoundation.org/">Linux Foundation</a>)</p></li><li><p>ACP (Agent Communication Protocol) (<a href="https://agentcommunicationprotocol.dev/">agentcommunicationprotocol.dev</a>)</p></li></ul><h4><strong>Workflow &amp; Orchestration</strong></h4><p><strong>Purpose:</strong> Chain function calls, manage state, retries, and branching logic<br><strong>Key Specs:</strong></p><ul><li><p>LangGraph Patterns (<a href="https://langchain-ai.github.io/">langchain-ai.github.io</a>)</p></li><li><p>Microsoft AutoGen multi-agent workflow engine</p></li></ul><h4><strong>Discovery &amp; Registry</strong></h4><p><strong>Purpose:</strong> Publish and locate agent capabilities<br><strong>Key Specs:</strong></p><ul><li><p>A2A &#8220;Agent Cards&#8221; endpoint</p></li><li><p>OpenAI Plugin Manifest &amp; Function Registry (in development)</p></li></ul><h4><strong>Control-Plane</strong></h4><p><strong>Purpose:</strong> Enforce policy, authentication, rate limits, and capture telemetry<br><strong>Key Specs:</strong></p><ul><li><p>mTLS for agent-to-agent trust</p></li><li><p>OpenTelemetry Gen-AI semantic conventions (<a href="https://opentelemetry.io/blog/2024/otel-generative-ai/">opentelemetry.io</a>)</p></li></ul><h4>Still Missing</h4><p>While this stack reflects the current state of the field, there are key gaps &#8212; especially in:</p><ul><li><p>Event-driven architectures</p></li><li><p>Pub/Sub messaging</p></li><li><p>Multimodal context streaming</p></li><li><p>Agent memory standardization</p></li><li><p>and more</p></li></ul><p>The field is evolving fast, but many foundational elements remain fragmented or immature.</p><p><strong>Why it matters</strong>:<br> <em>HTTP moves the bytes; these specs move intent with accountability.</em> Together they let any compliant agent discover tools, invoke them safely, delegate subtasks to other agents, and surface verifiable receipts&#8212;turning the open web into a programmable operating system rather than a patchwork of ad-hoc scrapers.</p><p><strong>Agents != Bots</strong></p><p>Bots were a necessary workaround, not our destiny. By treating agents as welcomed guests&#8212;complete with their own front door and house rules&#8212;we can ditch the cat-and-mouse games, shrink CAPTCHA fatigue, and make the internet faster, fairer, and more open for everyone.</p><h2><strong>The Fork in the Road: One Web or Two?</strong></h2><p>Over the next 24 months every product team will have to decide whether to extend the current, human-centred web or stand up a parallel rail optimised for autonomous agents. 
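</p><p>To ground the &#8220;Context &amp; Tool Access&#8221; row above before we weigh the options: exposing a function to a model usually means publishing a declaration in roughly the JSON-Schema style sketched below. The field names vary by vendor, and this particular tool is invented for illustration.</p><pre><code># Schematic tool declaration (illustrative; exact field names differ per vendor).
get_offer_tool = {
    "name": "get_offer",
    "description": "Return the current price and stock level for a product.",
    "parameters": {
        "type": "object",
        "properties": {
            "sku":      {"type": "string", "description": "Product identifier"},
            "currency": {"type": "string", "enum": ["EUR", "USD"]},
        },
        "required": ["sku"],
    },
}
</code></pre><p>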
Both paths are open; neither is free of trade-offs.</p><p>Put simply: do AI agents imitate humans (moving a mouse, tapping keys, swiping on a phone), or do they plug directly into a system&#8217;s action controls?</p><h3><strong>Option A &#8212; Keep a Single Surface</strong></h3><p>Sites continue to serve the same URLs humans visit, but enrich them with machine-readable cues (JSON-LD, schema.org, micro-data) or a /.well-known/mcp endpoint so an agent can ask for agent+json while a browser still receives HTML (a minimal sketch of this content negotiation follows at the end of this section).</p><p><strong>Why teams like it</strong></p><ul><li><p><strong>Zero new stack debt.</strong> You evolve, rather than rebuild, your web tier.</p></li><li><p><strong>Universal reach.</strong> Browsers, crawlers, screen-readers, LLMs&#8212;everyone hits the same address.</p></li><li><p><strong>SEO continuity.</strong> Backlinks and ranking signals keep working.</p></li></ul><p><strong>Why it strains over time</strong></p><ul><li><p><strong>Heavy pages hurt agents.</strong> The median desktop page now ships ~2.6 MB of CSS, images, and ads&#8212;bloat that an agent must load, parse, and pay to tokenize, and that also hurts LLM performance. (<a href="https://almanac.httparchive.org/en/2024/page-weight?utm_source=chatgpt.com">almanac.httparchive.org</a>)</p></li><li><p><strong>Blurry security signals.</strong> Helpful booking agents and credential-stuffing botnets look identical in the logs.</p></li><li><p><strong>Publisher revenue risk.</strong> If an LLM scrapes content only to give answers, publishers lose the incentive to publish new content, since their ad-based monetization model simply stops working.</p></li></ul><h3><strong>Option B &#8212; Stand Up a Dedicated Agent Rail</strong></h3><p>Expose a slim, authenticated interface for everything around AI agents: Model Context Protocol (MCP) for data fetches, Agent-to-Agent (A2A) for secure task hand-offs. Agents identify themselves, negotiate rate limits, and receive terse JSON: no CSS, no CAPTCHAs, no token waste.</p><p><strong>Why it&#8217;s compelling</strong></p><ul><li><p><strong>Efficiency gains.</strong> JSON payloads are 20-50&#215; lighter than full HTML, slicing latency, GPU time and carbon.</p></li><li><p><strong>Built-in governance.</strong> Scoped OAuth 2.1 tokens, mTLS and execution receipts make abuse throttling explicit.</p></li><li><p><strong>New business levers.</strong> Context and agentic capabilities become valuable commodities that owners can build novel business models around.</p></li></ul><p><strong>Where it bites</strong></p><ul><li><p><strong>Resource gap.</strong> Small publishers may lack time or talent to run and police a second interface.</p></li><li><p><strong>Fragmentation risk.</strong> Without shared specs, we could repeat the browser-compatibility wars.</p></li><li><p><strong>Discovery reset.</strong> Ranking agents, not pages, demands fresh search paradigms and tooling.</p></li><li><p><strong>Innovation parallels mobile APIs.</strong> Remember when JSON REST services unlocked the app economy? Agent rails promise similar uplift&#8212;atomic capabilities, composable in any interface.</p></li></ul><p>Watch <a href="https://www.ads4gpts.com">ADS4GPTS</a> and our other stealth project closely for innovations in this space.</p><p>Yet the &#8220;two-web&#8221; future is not automatic. Coordination will decide whether we get <strong>USB-C-level interoperability</strong> or a <strong>VHS-vs-Betamax</strong> rerun. 
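</p><p>For Option A, the single-surface approach is, at its core, content negotiation: one URL, two representations. Here is a minimal standard-library sketch that serves terse JSON when a client asks for the agent+json media type mentioned above and plain HTML otherwise. It is an illustration of the idea, not a production pattern, and the article data is a placeholder.</p><pre><code># Minimal dual-surface sketch: same URL, HTML for browsers, JSON for agents.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

ARTICLE = {"title": "Example article", "body": "Full article text goes here."}

class DualSurfaceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        accept = self.headers.get("Accept", "")
        if "application/agent+json" in accept:             # agent asks for the terse view
            payload = json.dumps(ARTICLE).encode()
            ctype = "application/agent+json"
        else:                                               # browsers get the usual HTML page
            html = "&lt;h1&gt;" + ARTICLE["title"] + "&lt;/h1&gt;&lt;p&gt;" + ARTICLE["body"] + "&lt;/p&gt;"
            payload = html.encode()
            ctype = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), DualSurfaceHandler).serve_forever()
</code></pre><p>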
Projects such as open-source MCP servers and Linux Foundation&#8217;s A2A spec give cause for optimism, but only if product teams treat them as <em>baseline plumbing</em>, not vendor lock-in.(<a href="https://www.gravitee.io/blog/googles-agent-to-agent-a2a-and-anthropics-model-context-protocol-mcp?utm_source=chatgpt.com">gravitee.io</a>,<a href="https://devblogs.microsoft.com/azure-sdk/introducing-the-azure-mcp-server/?utm_source=chatgpt.com"> devblogs.microsoft.com</a>)</p><p>Finally it will be interesting to see the decisions of incumbents in the internet, search and publishing space. Cloudflare&#8217;s CEO Matthew Prince is one of the first to openly talk about this and his stance is clear: this is the end of scraping. This means that Cloudflare will be betting and working on a dedicated rail for agentic data-hungry workflows (<a href="https://www.youtube.com/watch?v=H5C9EL3C82Y">axios</a>).</p><h2><strong>Agents as First-Class Citizens of the Web</strong></h2><p>Whether we choose a single mixed surface or a dedicated rail, one principle must survive the transition: <strong>autonomous agents deserve the same design respect as human users.</strong> An agent is not &#8220;just another bot.&#8221; It carries a person&#8217;s intent, wallet, and liability into the network. Ignoring that status&#8212;forcing agents to scrape, spoof headers, or dodge CAPTCHAs&#8212;doesn&#8217;t merely slow them down; it erodes the very trust we rely on when we delegate tasks to software.</p><h3><strong>Why &#8220;First-Class&#8221; Matters</strong></h3><ol><li><p><strong>Delegated Authority<br></strong> When you ask an agent to &#8220;rebook my flight&#8221; or &#8220;move &#8364;10 000 to Treasury bills,&#8221; you&#8217;ve handed over legal and financial agency. The web must recognise that authority with explicit identity, scoped credentials, and auditable logs. Think of it as a human with power of attorney.</p></li><li><p><strong>Predictable Contracts<br></strong> Human-centred rate limits assume seconds between clicks; agents operate in milliseconds. Treating them as first-class citizens means publishing machine-negotiable SLAs and quota ceilings, so the system fails gracefully instead of rate-banning the user&#8217;s entire day.</p></li><li><p><strong>Security Through Transparency<br></strong> If an agent can declare who it represents and <em>what</em> capability it is invoking, orchestrators can block bad actors with surgical precision. No more collateral damage from blanket CAPTCHA gates or IP blacklists.</p></li><li><p><strong>Economic Alignment<br></strong> Publishers worry about lost ad impressions; users worry about token costs; providers worry about GPU bills. First-class treatment lets us meter, price, and share value explicitly, turning today&#8217;s friction into tomorrow&#8217;s business model.</p></li></ol><h4><strong>The Strategic Upshot</strong></h4><ul><li><p><strong>For Developers:</strong> embracing first-class agents early means fewer brittle work-arounds, lower infra bills, and a cleaner architecture when regulations tighten.</p></li><li><p><strong>For Publishers:</strong> authenticated agents offer a chance to charge for data instead of losing it to unmetered scraping.</p></li><li><p><strong>For Users:</strong> reliable delegation frees them from micro-management, because their digital proxy enjoys the kind of predictable service quality humans already expect.</p></li></ul><p>The web did this once before: browsers became first-class citizens when we moved from Telnet to HTTP 1.0. 
Repeating that leap for autonomous software will decide whether the Agentic Web becomes an open commons or a patchwork of paywalls, scrapers, and broken UX. Treat agents with parity now, and the ecosystem will repay us with speed, safety, and entirely new modes of value creation.</p><h2>Next: Evolution of Web Use</h2><p>We now turn from infrastructure to behaviour, tracking how familiar rituals like searching, shopping, learning, compress into terse prompts once agents shoulder the work. Part 3 sketches this shift from manual browsing to outcome-oriented delegation, revealing what disappears, what endures, and what entirely new habits emerge.</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for your interest in my thoughts. Now pass the knowledge on!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/p/the-agentic-web-part-2-anatomy-of?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><p>Learn more about the way ADS4GPTS is changing the monetization of the internet by aligning human and AI incentives</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.ads4gpts.com&quot;,&quot;text&quot;:&quot;Visit ADS4GPTS&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.ads4gpts.com"><span>Visit ADS4GPTS</span></a></p>]]></content:encoded></item><item><title><![CDATA[Introduction to the Agentic Web: Vision and Definitions]]></title><description><![CDATA[Explore the internet shift from human-driven interactions to autonomous AI agents.]]></description><link>https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision</link><guid isPermaLink="false">https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision</guid><dc:creator><![CDATA[Ioannis Bakagiannis]]></dc:creator><pubDate>Mon, 30 Jun 2025 15:32:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wp28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67f41c29-ff2f-45d1-9fe7-4a8352c68f35_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DKn-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DKn-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 424w, 
https://substackcdn.com/image/fetch/$s_!DKn-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png" width="500" height="100" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bakagiannis.substack.com/i/167158003?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DKn-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 424w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 848w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1272w, https://substackcdn.com/image/fetch/$s_!DKn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F218bb88f-8eae-42b3-bf64-bd8f0d76f79d_500x100.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>Imagine a World...</strong></h2><p>Consider this scenario: at 7:02 a.m., before you even silence your morning alarm, your personal AI assistant has quietly booked you a cheaper, lower-carbon flight, seamlessly adjusting your calendar to accommodate this change. Later that morning, as you prepare for a marathon scheduled for Sunday, you instruct your AI agent to procure the ideal pair of running shoes within your preferred price range. It swiftly evaluates dozens of retailers, assesses user reviews, checks inventory availability, and completes the transaction. All you experience is the assurance that the best possible outcome has been delivered effortlessly.</p><h2><strong>What Is the "Agentic Web"?</strong></h2><p>The Agentic Web represents the fourth generation of internet evolution, marking a profound shift from human-driven interactions to autonomous AI agents. Unlike previous iterations&#8212;Web 1.0&#8217;s static pages, Web 2.0&#8217;s interactive and social content, and Web 3.0&#8217;s focus on decentralized data&#8212;the Agentic Web is characterized by proactive, context-aware agents executing tasks on behalf of users.</p><p>In simple terms, as articulated by industry commentators, users "no longer interact directly with applications or APIs, but with intelligent agents acting as active, autonomous intermediaries" (dev.to). These agents are not merely passive tools; they possess the capability to perceive context, reason about goals, and autonomously execute tasks, effectively turning the internet from a passive information repository into a dynamic, collaborative ecosystem.</p><h2><strong>The Evolutionary Journey of the Web</strong></h2><p>To truly appreciate the revolutionary potential of the Agentic Web, let's briefly revisit the previous web generations:</p><ul><li><p><strong>Web 1.0 (1990s&#8211;early 2000s):</strong> Primarily read-only, characterized by static HTML pages and limited interactivity.</p></li><li><p><strong>Web 2.0 (mid-2000s&#8211;2010s):</strong> User-generated content and social interactions, exemplified by platforms like Facebook, Wikipedia, and YouTube.</p></li><li><p><strong>Web 3.0 (2010s&#8211;2020s):</strong> Emphasized decentralization, linked data, semantic content, and user data ownership via technologies such as blockchain.</p></li><li><p><strong>Web 4.0 (2020s&#8211;):</strong> Autonomous AI agents become the primary actors, enabling users to simply declare intentions while agents proactively manage complex interactions, transcending manual tasks and navigation.</p></li></ul><blockquote><p><em>As succinctly summarized by one analyst: "If Web 1.0 was read-only, Web 2.0 let us interact and collaborate, and Web 3.0 focused on decentralization and connected data, Web 4.0 introduces autonomous agents capable of reasoning, acting, and collaborating" (dev.to).</em></p></blockquote><h2><strong>Key Differentiators of the Agentic Web</strong></h2><p>What fundamentally distinguishes Web 4.0 from its predecessors is the shift from explicit, manual interactions to implicit, intent-driven experiences. 
Rather than users manually comparing flights or creating complex dashboards, AI agents autonomously navigate across multiple services, perform comparisons, and assemble customized results.</p><p>This reduces cognitive load, increases efficiency, and enables personalized, contextually relevant outcomes.</p><p>Further, personalization in Web 4.0 moves beyond limited recommendation algorithms, evolving into real-time, context-aware adaptability. Agents continuously learn and remember user preferences, past requests, and behaviours, collaborating dynamically among themselves to fulfill complex tasks in a manner completely tailored to each user&#8217;s immediate context (gate.com).</p><p>This represents a move away from one-size-fits-all interfaces to fully bespoke, agent-generated experiences.</p><h2><strong>Why Now? &#8211; Market and Technological Drivers</strong></h2><p>Several crucial factors are driving the timely emergence of the Agentic Web:</p><p><strong>GPU Economics:</strong> The cost of GPU-based computation, essential for training and running sophisticated AI models, has dramatically fallen&#8212;approximately 70% year-over-year. This significant reduction makes the deployment of continuous, autonomous AI agents economically viable, allowing them to operate efficiently in the background without substantial costs.</p><p><strong>AI Efficiency and Execution:</strong> Advances in machine learning, notably in large language models (LLMs), have significantly increased the efficiency, reliability, and effectiveness of AI agents. Today&#8217;s AI can manage complex multi-step tasks, communicate seamlessly with other agents, and maintain consistent, reliable performance.</p><h2><strong>The Two-Gear Internet: Agent vs. Human Speed</strong></h2><p>The introduction of autonomous agents will create a dual-speed internet:</p><ul><li><p><strong>Agent-to-Agent Interactions:</strong> Fast, efficient, continuous, and automatic communication between AI agents.</p></li><li><p><strong>Human-to-Agent and Human-to-Human Interactions:</strong> Necessarily slower due to human processing limitations, but optimized by AI assistance to ensure maximum efficiency and effectiveness.</p></li></ul><p>Interfaces and gateways capable of seamlessly bridging these speeds will be critical. The future of the internet thus includes dynamic, adaptive interfaces designed specifically to mediate various communication channels, optimizing interactions based on context and participants involved.</p><h2><strong>Desired Features of the Agentic Web</strong></h2><ul><li><p><strong>Accountability and Transparency:</strong><br>AI agents must maintain clear audit trails of their decision-making processes, enabling human oversight and accountability. 
High-stakes decisions should require explainability, ensuring trust and compliance with emerging regulatory frameworks.</p></li><li><p><strong>Security and Robustness:</strong><br>Agents must operate within secure, sandboxed environments, utilizing zero-trust architectures and robust authentication protocols to mitigate risks from malicious actors or inadvertent misuse.</p></li><li><p><strong>Privacy Protection:</strong><br>Strong data protection measures, including on-device data processing, encryption, federated learning, and comprehensive user consent frameworks, should be integral to agent design, aligning with stringent data regulations.</p></li><li><p><strong>Fairness and Ethical Compliance:</strong><br>AI agents must actively mitigate biases and promote fairness, undergoing regular bias audits and adhering to clearly defined ethical guidelines and codes of conduct to ensure equitable outcomes.</p></li><li><p><strong>Human Autonomy and Control:</strong><br>Meaningful human oversight must remain central, particularly for critical decisions, preserving human agency and preventing dependency or deskilling.</p></li><li><p><strong>Human-AI Alignment:</strong><br>AI incentives, optimization and monetization procedures should align with human interests.</p></li><li><p><strong>International Collaboration and Standardization:</strong><br>Cross-border cooperation on regulatory frameworks, ethical standards, and technical interoperability is vital to avoid fragmentation and ensure coherent governance across the global digital ecosystem.</p></li></ul><h2><strong>Where Next: </strong></h2><p>As we delve deeper in the subsequent parts of this series, we will examine the gateways, business models, and ethical considerations inherent in the development of the Agentic Web. This exploration will further illustrate the profound implications and opportunities presented by this next-generation internet ecosystem, fundamentally altering how we engage with digital technology.</p><p>Part 2 is the X-ray that shows which bones bend, which joints break, and where entirely new organs are forming.</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for your interest in my thoughts. 
Now pass the knowledge on!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bakagiannis.substack.com/p/introduction-to-the-agentic-web-vision?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><p>Learn more about the way ADS4GPTS is changing the monetization of the internet by aligning human and AI incentives</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.ads4gpts.com&quot;,&quot;text&quot;:&quot;Visit ADS4GPTS&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.ads4gpts.com"><span>Visit ADS4GPTS</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://bakagiannis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Temporal Perspective! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>