Cloudflare crawl endpoint

2026-03-10 22:27 | developers.cloudflare.com

Browser Rendering's new /crawl endpoint lets you submit a starting URL and automatically discover, render, and return content from an entire website as HTML, Markdown, or structured JSON.

You can now crawl an entire website with a single API call using Browser Rendering's new /crawl endpoint, available in open beta. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. This is great for training models, building RAG pipelines, and researching or monitoring content across a site.

Crawl jobs run asynchronously. You submit a URL, receive a job ID, and check back for results as pages are processed.

curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://blog.cloudflare.com/"}'

curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'
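The two calls above fit naturally in a small polling loop. A minimal sketch in Python; the status values ("running", "completed", "failed") are assumptions, so check the API's actual response schema:

```python
import time

def wait_for_crawl(get_status, poll_interval=5.0, max_polls=120):
    """Poll an async crawl job until it reaches a terminal state.

    `get_status` is any callable returning the job's current status
    string, e.g. a thin wrapper around the GET /crawl/{job_id} call
    above. The status names checked here are illustrative guesses.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)  # job still running; check back later
    raise TimeoutError("crawl job did not finish within the polling budget")
```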

Key features:

  • Multiple output formats - Return crawled content as HTML, Markdown, and structured JSON (powered by Workers AI)
  • Crawl scope controls - Configure crawl depth, page limits, and wildcard patterns to include or exclude specific URL paths
  • Automatic page discovery - Discovers URLs from sitemaps, page links, or both
  • Incremental crawling - Use modifiedSince and maxAge to skip pages that haven't changed or were recently fetched, saving time and cost on repeated crawls
  • Static mode - Set render: false to fetch static HTML without spinning up a browser, for faster crawling of static sites
  • Well-behaved bot - Honors robots.txt directives, including crawl-delay
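Put together, a request exercising these controls might look like the sketch below. Only `url`, `render`, `modifiedSince`, and `maxAge` are named in the announcement; the other field names (`depth`, `limit`, `formats`, `exclude`) are hypothetical placeholders, so check the endpoint documentation for the real schema:

```python
import json

def build_crawl_request(url, depth=2, limit=100, render=True,
                        formats=("markdown",), exclude=()):
    """Assemble a hypothetical /crawl request body."""
    return {
        "url": url,
        "depth": depth,            # how many link-hops from the start URL
        "limit": limit,            # cap on total pages crawled
        "render": render,          # False = fetch static HTML, no browser
        "formats": list(formats),  # e.g. "html", "markdown", "json"
        "exclude": list(exclude),  # wildcard patterns to skip
    }

payload = build_crawl_request("https://blog.cloudflare.com/",
                              exclude=("/tag/*",))
body = json.dumps(payload)  # POST this as the JSON request body
```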

Available on both the Workers Free and Paid plans.

To get started, refer to the crawl endpoint documentation. If you are setting up your own site to be crawled, review the robots.txt and sitemaps best practices.



Comments

  • By jasongill 2026-03-10 23:08 (9 replies)

    I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?

    Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

    • By cortesoft 2026-03-11 4:35 (1 reply)

      Well, the conversion process into the JSON representation is going to take CPU, and then you have to store the result, in essence doubling your cache footprint.

      Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

      Cache footprint management is a huge factor in the cost and performance of a CDN: you want to get the most out of your storage, and you want to serve as many pages from cache as possible.

      I know from my experience working for a CDN that we did all sorts of things to try to maximize the hit rate for our cache. In fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting: instead of pre-caching content, you do 'second hit caching', where you only store a copy in the cache if a piece of content is requested a second time. The idea is that a lot of content is requested only once by one user and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, you keep those single-use pages out of your cache, and you don't hurt overall performance much, because the content that is most useful to cache is requested a lot, and you only have to make one extra origin request.
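The second-hit caching idea described in this comment fits in a few lines. A sketch, with the origin fetch injected as a callable:

```python
class SecondHitCache:
    """Cache a value only once its key has been requested twice.

    The first request for a key just records that the key was seen;
    only a second request actually stores the fetched value, which
    keeps one-off pages out of the cache.
    """
    def __init__(self):
        self.seen = set()   # keys requested exactly once so far
        self.cache = {}     # keys requested twice or more

    def get(self, key, fetch_from_origin):
        if key in self.cache:
            return self.cache[key]          # cache hit
        value = fetch_from_origin(key)      # miss: go to origin
        if key in self.seen:
            self.cache[key] = value         # second hit: worth caching now
        else:
            self.seen.add(key)              # first hit: just remember the key
        return value
```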

      • By DeepSeaTortoise 2026-03-11 12:32

        > Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

        Isn't this solving a slightly, but very significantly, different problem?

        You could serve the very same data in two different ways: one to present to users and one to hand over to scrapers. Of course, some sites would be too difficult or costly to transform into a common underlying cache format, but people who WANT their sites accessible to scrapers could easily help the process along a bit, or serve their site in the necessary format in the first place.

        But the key is:

        A tool using a "pre-scraped" version of a site very likely has very different requirements for how a CDN caches that site. And this could easily be customizable by those using this endpoint.

        Want a free version? OK, give us the list of all the sites you want, then come back in 10 minutes and grab everything in one go; the data will be kept ready for 60s. Got an API token? 10 free near-real-time requests for you, and they'll recharge at a rate of 2 per hour. Want to play nice? Ask the CDN to have the requested content ready in 3 hours. Got deep pockets? Pay for just as many real-real-time requests as you need.

        What makes this so different is that unless customers are willing to hand over a lot of money, you don't need to cache anything to serve requests at all. Potentially not even later, if you have enough capacity to serve the data for scheduled requests from the storage network directly.

        You just generate an immediate promise response to the request telling them to come back later. And depending on what you put into that promise, you've got quite a lot of control over the schedule yourself.

        - Got a "within 10min" request but your storage network has plenty of capacity in 30s? Just tell them to come back in 30s.

        - A customer is pushing new data into your network around 10am and many bots are interested in getting their hands on it as soon as possible, making requests for 10am to 10:05? Just bundle their requests.

        - Expected data still not around at 10:05? Unless the bots set an "immediate" flag (or whatever) indicating that they want whatever state the site is in right now, just reply with a second promise when they come back. And a third if necessary... and so on.

    • By selcuka 2026-03-11 0:24 (2 replies)

      Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

      > Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

      [1] https://blog.cloudflare.com/markdown-for-agents/
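Under the scheme quoted here, a client expresses the Markdown preference with an ordinary `Accept` header. A sketch using Python's stdlib (the request is built but never sent):

```python
import urllib.request

def markdown_request(url):
    """Build (but don't send) a request that prefers Markdown.

    q-values rank the preferences: text/markdown first, with HTML
    as a fallback for zones without the feature enabled.
    """
    req = urllib.request.Request(url)
    req.add_header("Accept", "text/markdown, text/html;q=0.8")
    return req

req = markdown_request("https://blog.cloudflare.com/markdown-for-agents/")
```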

      • By jasongill 2026-03-11 12:14

        Interesting - it sounds like this could be combined with some creative cache parsing on their side to provide this feature to sites that want it.

      • By Muromec 2026-03-12 7:11

        so... we will get reader mode with one header set in a browser?

    • By michaelmior 2026-03-10 23:50 (2 replies)

      > I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

      It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

      • By janalsncm 2026-03-11 0:27 (3 replies)

        How would they know the content hasn’t changed without hitting the website?

        • By coreq 2026-03-11 2:36

          They wouldn't. There are ETags and the like, but revalidation is still a layer-7 round trip to the origin. The usual pattern is for the origin to declare in the response headers how long the content stays fresh, and to cache for that duration. For example, a bitcoin price aggregator might say the page is good for 60 seconds (with disclaimers on the page that this isn't market data), while My Little Town News might say an article is good for an hour (to allow updates) and the homepage for 5 minutes, so a breaking news article doesn't appear too far behind.
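Both mechanisms in this comment are plain header handling. A sketch of the two pieces: reading the freshness TTL a site declares, and building the ETag/Last-Modified revalidation round trip:

```python
import re

def max_age_seconds(cache_control):
    """Extract the max-age freshness lifetime from Cache-Control, or None."""
    m = re.search(r"max-age=(\d+)", cache_control or "")
    return int(m.group(1)) if m else None

def conditional_headers(etag=None, last_modified=None):
    """Headers for a revalidation request; the origin can answer
    304 Not Modified instead of resending the body."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```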

        • By cortesoft 2026-03-11 4:36

          Keeping track of when content changes is literally the primary function of a CDN.

        • By OptionOfT 2026-03-11 3:16

          Caching headers?

          (Which, on Akamai, are by default ignored!)

      • By binarymax 2026-03-11 0:11

        Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.

    • By csomar 2026-03-11 0:01 (1 reply)

      It’s a bit more complicated than that. This is their product Browser Rendering, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than a simple curl scraping.

      • By randomtools 2026-03-11 8:46

        So does that mean it can replace serpapi or similar?

    • By cmsparks 2026-03-10 23:33

      That would prolly work for simple sites, but you still need the dedicated scraping service with a browser to render sites that are more complex (i.e. SPAs)

    • By hrmtst93837 2026-03-11 8:54 (1 reply)

      Offering wholesale cache dumps blows up every assumption about origin privacy and copyright. Suddenly you are one toggle away from someone else automatically harvesting and reselling your work with Cloudflare as the unwitting middle tier.

      You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.

      It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.

    • By brookst 2026-03-11 11:42

      That was my first thought when I read the headline. It would make perfect sense, and would allow some websites to have best of both worlds: broadcasting content without being crushed by bots. (Not all sites want to broadcast, but many do).

    • By Fokamul 2026-03-11 12:37

      But think about poor phishers and malware devs protected by Cloudflare.

    • By ryan14975 2026-03-12 1:16

      This makes a lot of sense. Cloudflare already has the rendered content at edge — serving a structured snapshot from cache would eliminate redundant crawling entirely.

      What I'd love to see is site owners being able to opt in and control the format. Something like a /cdn-cgi/structured endpoint that respects your robots.txt directives but gives crawlers clean markdown or JSON instead of making them parse raw HTML. The site owner wins (less bot traffic), the crawler wins (structured data), and Cloudflare wins (less load on origin).

  • By ljm 2026-03-10 23:35 (12 replies)

    Is Cloudflare becoming a mob outfit? They're selling scraping countermeasures, but now they're selling scraping too.

    And they can pull it off because of their reach over the internet with the free DNS.

      • By oefrha 2026-03-11 3:28

        That's not the perfect defense you think it is. Plenty of robots.txt files[1] technically allow scraping their main content pages as long as your user-agent isn't explicitly disallowed, but in practice they're behind Cloudflare, so they still throw up the Cloudflare bot check if you actually attempt to crawl.

        And forget about crawling. If you have a less reputable IP (basically every IP in third-world countries is less reputable, for instance), you can be CAPTCHA'ed to no end by Cloudflare even as a human user on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.

        [1] E.g. https://www.wired.com/robots.txt to pick an example high up on HN front page.

    • By rendaw 2026-03-11 6:17 (2 replies)

      I think the simple explanation is that they weren't selling scraping countermeasures, they were selling web-based denial of service protection (which may be caused by scrapers).

      • By dewey 2026-03-11 13:33

        This was always also sold as bot protection and anti-scraping / crawling features like https://www.cloudflare.com/lp/pg-ai-crawl-control/

      • By PeterStuer 2026-03-11 7:47 (4 replies)

        Ask yourself, why would a scraper ddos? Why would a ddos-protection vendor ddos?

        • By wongarsu 2026-03-11 10:18 (1 reply)

          Because the scraper is either impatient, careless or indifferent; and if they scrape for training data they don't plan to come back. If they don't plan to come back they don't care if you tighten up crawling protections after they have moved on. In fact they are probably happy that they got their data and their competition won't

          • By wiether 2026-03-11 11:38

            > they don't plan to come back

            To me the current behavior of those scrapers tells me that "they don't plan", period.

            Looks like they hired a bunch of excavators and are digging 2 meters deep on whole fields, looking for nuggets of gold, and piling the dirt into a huge mountain.

            Once they realize the field was bereft of any gold but full of silver? Or that the gold was actually 2.5 meters deep?

            They have to go through everything again.

        • By junaru 2026-03-11 10:48 (1 reply)

          > Ask yourself, why would a scraper ddos?

          No need to ask; I can tell you exactly: because they have no regard for anything but their own profit.

          Let me give you an example of this mom-and-pop shop known as Anthropic.

          You see, they have this thing called ClaudeBot, and at least initially it scraped by iterating through IPs.

          Now you have these things called shared hosting servers, typically running 1,000-10,000 domains of actual low-volume websites on 1-50 or so IPs.

          Guess what happens when it's your network's turn to bend over? The whole hosting company's infrastructure goes down as each server has hundreds of ClaudeBots crawling hundreds of vhosts at the same time.

          This happened for months. It's the reason they are banned in WAFs by half the hosting industry.

          • By PeterStuer 2026-03-11 17:19

            So how would you avoid this specific situation as a web crawler that tries to be well behaved? You strictly adhere to robots.txt as specified by each domain. The problem is not with any of the sites but with the density (1,000-10,000) at which the hoster packed them. Even if the crawler had a 1-second-between-pages governor when robots.txt specifies no rate (which, to be fair, is very reasonable), this packing could still lead to high server load.
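A "seconds between pages" governor like the one described here is small to implement. A sketch with the clock and sleep functions injected so the behavior is testable:

```python
import time
from collections import defaultdict

class PerHostGovernor:
    """Enforce a minimum gap between successive requests to one host."""
    def __init__(self, min_gap=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.clock = clock
        self.sleep = sleep
        self.last = defaultdict(lambda: float("-inf"))  # last request per host

    def wait(self, host):
        """Block until at least min_gap has passed since the last
        request to `host`, then record the new request time."""
        remaining = self.last[host] + self.min_gap - self.clock()
        if remaining > 0:
            self.sleep(remaining)
        self.last[host] = self.clock()
```

Note this governs per host, not per IP, so it would not by itself prevent the pile-up on dense shared hosting that the parent comment describes.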

        • By c0balt 2026-03-11 10:40

          The number of git forges behind Anubis et al and the numerous public announcements should be enough.

          Scrapers seem to be exceedingly careless in using public resources. The problem is often not even DDoS (as in overwhelming bandwidth usage) but rather DoS through excessive hits on expensive routes.

        • By drcongo 2026-03-11 12:11

          Ask yourself, why would everyone except you say that they do?

    • By its-kostya 2026-03-11 0:00 (1 reply)

      Cloudflare has been trying to mediate between publishers and AI companies. If publishers are behind Cloudflare and Cloudflare's bot detection stops scrapers at the request of publishers, the publishers can allow their data to be scraped (via this endpoint) for a price. It creates market scarcity. I don't believe the target audience is you and me, unless you own a very popular blog that AI companies would pay you for.

      • By PeterStuer 2026-03-11 7:34

        Next step will be their default "free" anti-bot denying all but their own bot. They know full well nearly nobody changes the default.

    • By pocksuppet 2026-03-11 3:31 (2 replies)

      Was it ever not one? They protect a lot of DDoS-for-hire sites from DDoS by their competitors. In return they increase the quantity of DDoS on the internet. They offer you a service for $150, then months later suddenly demand $150k in 24 hours or they shut down your business. If you use them as a DNS registrar they will hold your domain hostage.

    • By theamk 2026-03-10 23:56 (2 replies)

      no? it takes 10 seconds to check:

      > The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

      You don't need any scraping countermeasures for crawlers like those.
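The robots.txt behaviour quoted above can be reproduced locally with Python's stdlib parser. A sketch against a made-up robots.txt (the crawler name is a placeholder):

```python
import urllib.robotparser

# A sample robots.txt, parsed inline rather than fetched from a live site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("MyCrawler", "https://example.com/blog/post")
blocked = rp.can_fetch("MyCrawler", "https://example.com/private/page")
delay = rp.crawl_delay("MyCrawler")   # seconds between requests, or None
```

A crawler honoring these directives would report the blocked URL rather than fetch it, which matches the `"status": "disallowed"` behaviour quoted above.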

      • By Macha 2026-03-11 1:10 (2 replies)

        So what’s the user agent for their bot? They don’t seem to specify the default in the docs, and it looks like it’s user-configurable. So it's yet another opt-out bot that you need your web server to match with special behaviour to block.

          • By Macha 2026-03-11 9:25

            No, hence all their examples using User-Agent: *

        • By gruez 2026-03-11 1:14 (2 replies)

          >So yet another opt out bot which you need your web server to match on special behaviour to block

          Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.

          • By AdamN 2026-03-11 9:45

            Not 'allegedly' - it's just a fact. Even if you're not malicious however it's still sometimes necessary because the server may have different sites for different browsers and check user agents for the experience they deliver. So then even for legitimate purposes you need to at least use the prefix of the user agent that the server expects.

          • By Macha 2026-03-11 14:05

            It is cloudflare who made the claim that they are well behaved unlike those other bots and that their behaviour can be controlled by robots.txt

            If I need to treat cloudflare bots the same as malicious bots, that undermines their claim.

      • By PeterStuer 2026-03-11 7:50

        Like they explain in the docs, their crawler will respect the robots.txt disallowed user-agents, right after the section that explains how to change your user-agent.

    • By subscribed 2026-03-11 0:58

      I think there's some space between the countless bots of everyone, which ignore everything and pull from residential proxies while absolutely snuffing sites, and this supposedly slower, well-behaved, smarter bot.

      Like there's a difference between dozens of drunk teenagers thrashing the city streets in an illegal street race vs a taxi driver.

    • By isodev 2026-03-11 4:26

      They always have been.

      They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.

      So yeah, we’ve created another mega corp monster that will hurt for years to come.

    • By andrepd 2026-03-11 10:36

      Well this scraper honours robots.txt so I'm sure most AI crawlers will find it useless.

    • By iso-logi 2026-03-10 23:57 (1 reply)

      Their free DNS is only a small piece of the pie.

      The fact that 30%+ of the web relies on their caching, routability, and DDoS protection services is the main pull.

      Their DNS is only really for data collection and to front as "good will"

      • By jen729w 2026-03-11 4:09 (1 reply)

        > The fact that 30%+ of the web relies on their caching services

        30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.

        It might be the case for the biggest 1% of that 30%. But not the whole lot.

        • By reddalo 2026-03-11 8:21

          >'Relies on' implies that it wouldn't work without them

          Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.

    • By rrr_oh_man 2026-03-10 23:45 (2 replies)

      [flagged]

      • By stri8ted 2026-03-10 23:58 (2 replies)

        Do you have any evidence to support this view?

        • By rolymath 2026-03-11 0:55 (1 reply)

          Read who and how it was founded. It's not a secret at all.

          • By rrr_oh_man 2026-03-11 18:52

            It’s funny how I got immediately downvoted and flagged

        • By pocksuppet 2026-03-11 3:32

          Who else would MITM 30% of the internet?

      • By mtmail 2026-03-10 23:58

        Any kind of source for the claim?

    • By Retr0id 2026-03-10 23:48 (1 reply)

      For a long time cloudflare has proudly protected DDoS-as-a-service sites (but of course, they claim they don't "host" them)

      • By Dylan16807 2026-03-11 3:31 (1 reply)

        Are you using the word "claim" to call them wrong or for a more confusing reason?

        Because I'm pretty sure they are not in fact wrong.

        • By Retr0id 2026-03-11 3:46 (1 reply)

          The distinction between a caching proxy and an origin server is pretty meaningless when you're serving static content, if you ask me.

          • By Dylan16807 2026-03-11 4:36

            There's a blurry line there, true.

            On the other hand when a page is small and static enough that it's basically just a flyer, I also care a lot less about who hosts it.

    • By giancarlostoro 2026-03-11 0:27

      If they ever sell or the CEO shifts, yes. For the meantime, they have not given any strong indication that they're trying to bully anybody. I could see things changing drastically if the people in charge are swapped out.

  • By RamblingCTO 2026-03-11 8:43 (8 replies)

    Doesn't work for pages protected by Cloudflare, in my experience. What a shame: they could've produced the problem and sold the solution.

    • By paxys 2026-03-11 15:31 (3 replies)

      That’s what they are doing. This is a textbook protection racket.

      “Buy Cloudflare bot protection, otherwise it would be a shame if your site got scraped and ddos’d.”

      Who is doing the scraping and ddosing? Cloudflare.

      • By tracker1 2026-03-11 16:49

        In this case, sure... that said, I've worked on a few sites where more than half the traffic was bots, because the content was useful to other sites (a classic-car classifieds/sales site). The fact that just over half the page requests were actually search query results meant a lot of optimization steps in practice: implementing a "search" database (MongoDB and Elasticsearch were pretty new at the time), denormalizing a lot of the data structures on the "enterprise" SQL side for search and display to not-logged-in users, heavier caching, donut caching, etc.

        It was an interesting and sometimes fun part of my career. Working on a site/application that isn't necessarily a tech site, and that I have a personal interest in, was pretty great... some of the pace for sales/commercial features less so, with sales making deals requiring deep integrations on impossible timelines. You learn a lot when a self-hosted site is being kicked while it's down... hence the cloud migration to get better use of flexible resources, etc.

      • By kentonv 2026-03-11 18:29

        You can trivially block Cloudflare crawl via robots.txt. You don't need to buy Cloudflare's bot protection -- this is not a malicious bot.

        https://x.com/CloudflareDev/status/2031745285517455615

        (Disclosure: I work for Cloudflare but not on this product. I get pretty tired of the conspiracy theories TBH.)

    • By kentonv 2026-03-11 18:21

      Cloudflare crawl respects robots.txt. It does not attempt to bypass any anti-crawling measures. If the site doesn't want to be crawled -- whether it uses Cloudflare or not -- this product will not help you crawl it.

      Some sites actually want crawlers -- e.g. sites that are selling a product, documentation, etc. That's what this product is meant for.

      https://x.com/CloudflareDev/status/2031745285517455615

      (Disclosure: I work for Cloudflare but not on this product.)

    • By tyingq 2026-03-11 14:22 (1 reply)

      That's too funny. If true, really looking forward to the Cloudflare response here. I'm unsure how you would spin that in a way that didn't seem self-serving.

      • By morpheuskafka 2026-03-11 16:31 (1 reply)

        It's very clearly disclosed in the linked docs already: Cloudflare Bot Protection will block it the same as all other bots, unless you choose to allow it as an exception. If they didn't do it that way, people would accuse them of either bypassing their own product (possibly anticompetitive) or just having a low-quality one.

        • By tyingq 2026-03-11 17:51 (1 reply)

          So it doesn't take any action to work around other bot protections? Feels like that would be on the list of features an AI company wanting to scrape would ask for.

    • By GodelNumbering 2026-03-11 13:04

      I imagine that would cause a backlash from the website owners trusting cloudflare to keep their content 'safe'

    • By chvid 2026-03-11 9:03

      As long as it gets past Azure's bot protection ...

    • By antonyh 2026-03-11 15:01

      Wait. What?

      Is this just a way to strong-arm non-cloudflarians into adopting their platform if you don't want your site crawled? It does sound like they are selling the solution to avoid their own content crawler.

    • By davidhariri 2026-03-11 13:46 (3 replies)

      Came here to write this. I am getting much better results from Firecrawl (not affiliated with them, just a happy customer).

      • By oasisbob 2026-03-11 16:51

        As someone who helps keep a site online with a lot of content, I have mixed feelings on Firecrawl.

        On one hand, their bots seem much more well behaved than others.

        However, running a crawler fleet that is deceptive and evasive in its identification and doesn't honor REP (the Robots Exclusion Protocol) is no way to build a business.

      • By kordlessagain 2026-03-11 18:11

        I'd love for you to kick the tires on https://grubcrawler.dev

      • By RamblingCTO 2026-03-11 14:12 (2 replies)

        fuck firecrawl. they showed interest in my product, then copied my idea and used their YC money to give it all out for free. fuck nick in particular. I'm still salty over this

        • By xeornet 2026-03-11 15:02 (1 reply)

          "they copied my idea by showing interest in my product and then copied it". What exactly is revolutionary about Firecrawl or your product? Scraping APIs have been around for over a decade.

          • By RamblingCTO 2026-03-11 16:30 (2 replies)

            I was the first to return markdown and use reader-mode techniques to strip irrelevant content. There's copying, and then there's talking to the founder, sounding interested, while having your team copy what I did in the background. One is fair game, the other is a dickhead move.

            • By xeornet 2026-03-11 17:00

              Not sure about the first claim. But yes, talking to the founder, sharing details and having it stolen is not a good look. Sorry that happened to you.

            • By keeda 2026-03-11 18:22 (1 reply)

              I think that is a neat idea and it sucks this happened, but how long before somebody simply saw that feature and replicated it? I'm curious, had you considered a deeper moat than that?

              This is especially relevant given AI is making this kind of thing easy at an industrial scale. I think we should all be looking for alternative moats.

              • By gopher_space 2026-03-11 19:23

                Sometimes timing is your moat and that's all you need. That being said I'll probably start limiting my public releases to revolve around standards I want implemented.

                I'm rethinking the sources of value moats are built around. It seems like the landscape is changing and dimensions such as location, perspective, experience, and attention weigh more than they used to.

                > but how long before somebody simply saw that feature and replicated it?

                This is a good example. The, idk, "value store" of your org just switched from products and services to the employees who understand your process from a couple angles and can write well.

        • By neversupervised 2026-03-11 14:23

          Tell more. Crawling is not a new idea. How did they abuse you?

    • By ekropotin 2026-03-11 15:10

      Please tell me you are joking

HackerNews