Show HN: An API that takes a URL and returns a file with browser screenshots

2025-02-06 18:48 · 208 points · 100 comments · github.com

An API that takes a URL and gives back a file with browser screenshots. - US-Artificial-Intelligence/scraper

You run the API on your machine, you send it a URL, and you get back the website data as a file plus screenshots of the site. Simple as.

This project was made to support Abbey, an AI platform. Its author is Gordon Kamer.

Some highlights:

  • Scrolls through the page and takes screenshots of different sections
  • Runs in a Docker container
  • Browser-based (will run websites' JavaScript)
  • Gives you the HTTP status code and headers from the first request
  • Automatically handles 302 redirects
  • Handles download links properly
  • Tasks are processed in a queue with configurable memory allocation
  • Blocking API
  • Zero state or other complexity

This web scraper is resource intensive but higher quality than many alternatives. Websites are scraped using Playwright, which launches a Firefox browser context for each job.
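
For orientation, here is a minimal sketch of the kind of Playwright flow described above, using the viewport and JPEG defaults listed later in this README (1280x2000, quality 85); it is an illustration only, not the project's actual worker code:

from playwright.sync_api import sync_playwright

# Illustrative sketch only; the real worker adds queueing, memory limits,
# scrolling logic, and multipart packaging on top of a flow like this.
with sync_playwright() as p:
    browser = p.firefox.launch(headless=True)
    context = browser.new_context(viewport={"width": 1280, "height": 2000})
    page = context.new_page()
    first_response = page.goto("https://us.ai", wait_until="load")
    print(first_response.status)          # HTTP status of the first request
    page.screenshot(path="section_0.jpg", type="jpeg", quality=85)
    page.mouse.wheel(0, 2000)             # scroll down one viewport
    page.screenshot(path="section_1.jpg", type="jpeg", quality=85)
    html = page.content()                 # the rendered page source
    context.close()
    browser.close()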

You should have Docker and docker compose installed.

  1. Clone this repo
  2. Run docker compose up (a docker-compose.yml file is provided for your use)

...and the service will be available at http://localhost:5006. See the Usage section below for details on how to interact with it.

You may set an API key using a .env file inside the /scraper folder (same level as app.py).

You can set as many API keys as you'd like; any environment variable whose name starts with SCRAPER_API_KEY defines an allowed key. For example, here is a .env file that provides three valid keys:

SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too

API keys are sent to the service using the Authorization Bearer scheme.
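
As a rough illustration of how these two pieces fit together (a hypothetical sketch, not the project's actual validation code; the is_authorized helper name is made up here), the server only has to collect the matching environment variables and compare them against the bearer token:

import os

# Hypothetical sketch: any environment variable whose name starts with
# SCRAPER_API_KEY is treated as an allowed key.
ALLOWED_KEYS = {v for k, v in os.environ.items() if k.startswith("SCRAPER_API_KEY")}

def is_authorized(authorization_header: str) -> bool:
    # If no keys are configured, authentication is effectively disabled (assumption).
    if not ALLOWED_KEYS:
        return True
    if not authorization_header.startswith("Bearer "):
        return False
    return authorization_header[len("Bearer "):] in ALLOWED_KEYS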

The root path / returns status 200 if online, plus some Gilbert and Sullivan lyrics (you can go there in your browser to see if it's online).
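
For example, a quick liveness check from Python (assuming the default http://localhost:5006 address from the compose setup):

import requests

resp = requests.get("http://localhost:5006/")
print(resp.status_code)  # 200 when the service is online
print(resp.text[:120])   # a snippet of the Gilbert and Sullivan lyrics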

The only other path is /scrape, to which you send a JSON-formatted POST request and (if all goes well) receive a multipart/mixed response.

The response will be either:

  • Status 200: multipart/mixed response where the first part is type application/json with information about the request; the second part is the website data (usually text/html); and the remaining parts are up to 5 screenshots.
  • Not status 200: application/json response with an error message under the "error" key.

Here's a sample cURL request:

curl -X POST "http://localhost:5006/scrape" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://us.ai"}'

Here is a code example using Python and the requests_toolbelt library showing how to call the API and parse the multipart response:

import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder
import sys
import json

data = {
    'url': "https://us.ai"
}
# The Authorization header is only needed if you've configured an API key
headers = {
    'Authorization': 'Bearer Your-API-Key'
}

response = requests.post('http://localhost:5006/scrape', json=data, headers=headers, timeout=30)
if response.status_code != 200:
    my_json = response.json()
    message = my_json['error']
    print(f"Error scraping: {message}", file=sys.stderr)
else:
    decoder = MultipartDecoder.from_response(response)
    for i, part in enumerate(decoder.parts):
        if i == 0:  # First is some JSON
            json_part = json.loads(part.content)
            req_status = json_part['status']  # An integer
            req_headers = json_part['headers']  # Headers from the request made to your URL
            metadata = json_part['metadata']  # Information like the number of screenshots and their compressed / uncompressed sizes
            # ...
        elif i == 1:  # Next is the actual content of the page
            content = part.content
            headers = part.headers  # Will contain info about the content (text/html, application/pdf, etc.)
            # ...
        else:  # Other parts are screenshots, if they exist
            img = part.content
            headers = part.headers  # Will tell you the image format
            # ...
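
If you also want to persist the decoded parts, a small helper along these lines works; save_parts, the filenames, and the .jpg extension are illustrative choices here, and the actual formats are reported in each part's headers:

import os
from requests_toolbelt.multipart.decoder import MultipartDecoder

def save_parts(decoder: MultipartDecoder, out_dir: str = ".") -> None:
    # Part 0 is JSON metadata; part 1 is the page content; the rest are screenshots.
    for i, part in enumerate(decoder.parts):
        if i == 0:
            continue  # inspect the JSON metadata separately
        name = "content.bin" if i == 1 else f"screenshot_{i - 1}.jpg"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(part.content)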

Navigating to untrusted websites is a serious security issue. Risks are somewhat mitigated in the following ways:

  • Runs as isolated container (container isolation)
  • Each website is scraped in a new browser context (process isolation)
  • Strict memory limits and timeouts for each task
  • Checks the URL to reject suspicious targets (loopback addresses, non-HTTP schemes, etc.); a sketch of this kind of check follows this list
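
As a concrete illustration of the URL screening mentioned in the last bullet (a hedged sketch of the idea, not the project's exact checks; looks_safe is a made-up helper name):

import ipaddress
import socket
from urllib.parse import urlparse

def looks_safe(url: str) -> bool:
    # Rough screen: HTTP(S) schemes only, and no loopback/private/link-local hosts.
    # This is a sketch of the idea; real SSRF defenses need more than this.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (socket.gaierror, ValueError):
        return False
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)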

You may take additional precautions depending on your needs, like:

  • Only giving the API trusted URLs (or otherwise screening URLs)
  • Running this API on isolated VMs (hardware isolation)
  • Using one API instance per user
  • Not making any secret files or keys available inside the container (besides the API key for the scraper itself)

If you'd like to make sure that this API is up to your security standards, please examine the code and open issues! It's not a big repo.

You can control memory limits and other variables at the top of scraper/worker.py. Here are the defaults:

MEM_LIMIT_MB = 4_000  # 4 GB memory threshold for child scraping process
MAX_SCREENSHOTS = 5  # Up to 5 screenshots are returned per scrape
SCREENSHOT_JPEG_QUALITY = 85  # JPEG compression quality for screenshots
BROWSER_HEIGHT = 2000  # Browser viewport height in pixels
BROWSER_WIDTH = 1280  # Browser viewport width in pixels


Comments

  • By xnx 2025-02-06 19:39 · 7 replies

    For anyone who might not be aware, Chrome also has the ability to save screenshots from the command line using: chrome --headless --screenshot="path/to/save/screenshot.png" --disable-gpu --window-size=1280,720 "https://www.example.com"

    • By cmgriffing 2025-02-06 21:17 · 4 replies

      Quick note: when trying to do full page screenshots, Chrome does a screenshot of the current view, then scrolls and does another screenshot. This can cause some interesting artifacts when rendering pages with scroll behaviors.

      Firefox does a proper full page screenshot and even allows you to set a higher DPR (device pixel ratio) value. I use this a lot when making video content.

      Check out some of the args in FF using: `:screenshot --help`

      • By wereHamster 2025-02-06 22:57 · 2 replies

        That's not the behavior I'm seeing (with Puppeteer). Any elements positioned relative to the viewport stay within the area specified by screen size (eg. 1200x800) which is usually the top of the page. If the browser would scroll down these would also move down (and potentially appear multiple times in the image). Also intersection observers which are further down on the page do not trigger when I do a full-page screenshot (eg. an element which starts animation when it enters into the viewport).

        • By genewitch 2025-02-07 10:24

          bravo for puppeteer, i guess? "singlefile" is the only thing i've ever seen not do weird artifacts in the middle of some site renders, or, like on reddit, just give up rendering comments and render blank space instead until the footer.

          anyhow i've been doing this exact thing for a real long time, e.g.

          https://raw.githubusercontent.com/genewitch/opensource/refs/...

          using bash to return json to some stupid chat service we were running

      • By ranger_danger 2025-02-07 0:02 · 1 reply

        Where would you type that command in?

      • By xg15 2025-02-06 23:01 · 1 reply

        I mean, if you have some of those annoying "hijack scrolling and turn the page into some sort of interactive animation experience" sites, I don't think "full page" would even be well-defined.

        • By sixothree 2025-02-07 13:32

          Pretty sure this refers to sticky headers. They have caused me many headaches when trying to get a decent screenshot.

    • By input_sh 2025-02-06 21:17 · 5 replies

      Firefox equivalent:

          firefox -screenshot file.png https://example.com --window-size=1280,720
      
      A bit annoyingly, it won't work if you have Firefox already open.

      • By UnlockedSecrets 2025-02-06 22:01 · 1 reply

        Does it work if you use a different profile with -p?

      • By genewitch 2025-02-07 10:43

        on my firefox if i right click on a part of the page the website hasn't hijacked, it gives the option to "take screenshot" - which i think required enabling a setting somewhere. I hope it wasn't in about:config or wherever the dark-art settings are. I use that feature of FF to screenshot youtube videos with the subtitles moved and the scrub bar cropped out, i feel like it's a cleaner and smaller clipboard copy than using win+shift+s. Microsoft changed a lot about how windows handles ... files ... internally and screenshots are huge .png now, making me miss the days of huge .bmp.

        also as mentioned above, if you need entire sites backed up the firefox extension "singlefile" is the business. if image-y things? bulk image downloader (costs money but 100% worth; you know it if you need it: BID); and yt-dlp + ffmpeg for video, in powershell (get 7.5.0 do yourself a favor!)

        ```powershell
        $userInput = Read-Host -Prompt '480 video download script enter URL'
        Write-Output "URL:`t`t$userInput"
        c:\opt\yt-dlp.exe `
        -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' `
        --write-auto-subs --write-subs `
        --fragment-retries infinite `
        $userInput
        ```

      • By blueflow 2025-02-06 21:42 · 1 reply

        > it won't work if you have Firefox already open

        now try to figure out how you could isolate these instances so they cannot see each other. this leads into a rabbit hole of bad design.

        • By yjftsjthsd-h 2025-02-07 2:05 · 1 reply

          > now try to figure out how you could isolate these instances so they cannot see each other. this leads into a rabbit hole of bad design.

          Okay, done:

            PROFILEDIR="$(mktemp -d)"
            firefox --no-remote  --profile "$PROFILEDIR" --screenshot $PWD/output.png https://xkcd.com
            rm -r "$PROFILEDIR"
          
          What's the rabbit hole?

          • By blueflow 2025-02-07 11:24 · 1 reply

            What's with the dbus interface?

            • By yjftsjthsd-h 2025-02-07 15:14

              What?

              (If you're trying to point out that two firefoxes are capable of talking to each other via system IPC, then yes, fully isolating any two programs on the same machine requires at least containers but probably full VMs, which has nothing to do with Firefox itself, and you'd need to explain why in this situation we should care)

      • By amelius 2025-02-06 22:40

        > A bit annoyingly, it won't work if you have Firefox already open.

        I hate it when applications do this.

      • By cmgriffing 2025-02-06 21:20

        LOL, you and I posted very similar replies at the same time.

    • By azhenley 2025-02-06 20:04

      Very nice, I didn't know this. I used pyppeteer and selenium for this previously which seemed excessive.

    • By martinbaun 2025-02-06 20:04

      Oh man, I needed this so many times didn't even think of doing it like this. I tried using Selenium and all different external services. Thank you!

      Works in chromium as well.

    • By antifarben 2025-02-09 22:32

      Does anyone know whether this would also be possible with Firefox, including explicit extensions (i.e. uBlock) and explicit configured block lists or other settings for these extensions?

    • By hulitu 2025-02-09 9:56

      > Chrome also has the ability to save screenshots

      Too bad that no browser is able to print a web page.

    • By Onavo 2025-02-06 20:39 · 2 replies

      What features won't work without GPU?

      • By kylecazar 2025-02-06 21:22

        This flag isn't valid anymore in the new Chrome headless. --disable-gpu doesn't exist unless you're on the old version (and then, it was meant as a workaround for Windows users only).

        I've used this via selenium not too long ago

      • By xnx 2025-02-06 20:47 · 1 reply

        [flagged]

        • By dingnuts 2025-02-06 20:59

          oh good an AI summary with none of the facts checked, literally more useless than the old lmgtfy and somehow more rude

          "here's some output that looks relevant to your question but I couldn't even be arsed to look any of it up, or copy paste it, or confirm its validity"

  • By jot 2025-02-06 20:56 · 5 replies

    If you’re worried about the security risks, edge cases, maintenance pain, and scaling challenges of self-hosting, there are various solid hosted alternatives:

    - https://browserless.io - low level browser control

    - https://scrapingbee.com - scraping specialists

    - https://urlbox.com - screenshot specialists*

    They’re all profitable and have been around for years so you can depend on the businesses and the tech.

    * Disclosure: I work on this one and was a customer before I joined the team.

    • By ALittleLight 2025-02-06 23:22 · 1 reply

      Looking at your urlbox - pretty funny language around the quota system.

      >What happens if I go over my quota?

      >No need to worry - we won't cut off your service. We automatically upgrade you to the next tier so you benefit from volume discounts. See the pricing page for more details.

      So... If I go over the quota you automatically charge me more? Hmm. I would expect to be rejected in this case.

      • By jot 2025-02-06 23:43 · 1 reply

        I’m sure we can do better here.

        In my experience our customers are more worried about having the service stop when they hit the limit of a tier than they are about being charged a few more dollars.

        • By ALittleLight 2025-02-06 23:47 · 1 reply

          Maybe I'm misreading. It sounds like you're stepping the user up a pricing tier - e.g. going from 50 a month to 100 and then charging at the better rate.

          I would also worry about a bug on my end that fires off lots of screenshots. I would expect a quota or limit to protect me from that.

          • By jot 2025-02-07 8:29 · 1 reply

            That’s right. On our standard self-service plans we automatically charge a better rate as volume increases. You only pay the difference between tiers as you move through them.

            It’s rare that anyone makes that kind of mistake. It probably helps that our rate limits are relatively low compared to other APIs and we email you when you get close to stepping up a tier. If you did make such a mistake we would, like all good dev tools, work with you to resolve. If it happened a lot we might introduce some additional controls.

            We’ve been in this business for over 12 years and currently have over 700 customers so we’re fairly confident we have the balance right.

            • By ALittleLight 2025-02-07 18:04

              I'm not a customer, so don't take what I say too seriously, but to me it seems like you are unilaterally making a purchasing decision on my behalf. That is, I agreed to pay you 50 dollars a month and you are deciding I should pay 100 (or more) - to "upgrade" my service. My intuition is that this is probably not legal, and, if I were a customer, I would not pay for a charge that I didn't explicitly agree to - if you tried to charge me I would reject it at the credit card level.

              If I sign up for a service to pay X and get Y, then I expect to pay X and get Y - even if my automated tools request more than Y - they should be rejected with a failure message (e.g. "quota limit exceeded").

    • By edm0nd 2025-02-06 21:09

      https://www.scraperapi.com/ is good too. Been using them to scrape via their API on websites that have a lot of captchas or anti scraping tech like DataDome.

    • By rustdeveloper 2025-02-06 21:22 · 1 reply

      Happy to suggest another web scraping API alternative I rely on: https://scrapingfish.com

      • By xeornet 2025-02-07 12:05

        What’s the chance you’re affiliated? Almost every one of your comments links to it. And curiously similar interest in Rust from the official HN page and yours. No need to be sneaky.

    • By bbor 2025-02-06 22:21 · 2 replies

      Do these services respect norobot manifests? Isn't this all kinda... illegal...? Or at least non-consensual?

      • By basilgohar 2025-02-06 22:32 · 2 replies

        robots.txt isn't legally binding. I am interested to know if and how these services even interact with it. It's more like a clue about where the interesting content for scrapers is on your site. This is how I imagine it goes:

        "Hey, don't scrape the data here."

        "You know what? I'm gonna scrape it even harder!"

        • By bbor 2025-02-07 0:02

          Soooo nonconsensual.

          Maybe bluesky is right… are we the baddies?

        • By tonyhart7 2025-02-07 1:56

          it is legally binding if your company is based in SV (only California implements this law) and they can prove it

      • By fc417fc802 2025-02-06 23:08

        [dead]

    • By theogravity 2025-02-06 23:13 · 1 reply

      There's also our product, Airtop (https://www.airtop.ai/), which falls under the scraping specialist / browser automation category and can generate screenshots too.

      • By kevinsundar 2025-02-06 23:19 · 1 reply

        Hey, I'm curious what your thoughts are: do you need a full-blown agent that moves the mouse and clicks to extract content from webpages, or is a simpler tool that just scrapes pages + takes screenshots and passes them through an LLM generally pretty effective?

        I can see niche cases like videos or animations being better understood by an agent, though.

        • By theogravity 2025-02-07 17:57

          Airtop is designed to be flexible: you can use it as part of a full-blown agent that interacts with webpages or as a standalone tool for scraping and screenshots.

          One of the key challenges in scraping is dealing with anti-bot measures, CAPTCHAs, and dynamic content loading. Airtop abstracts much of this complexity while keeping it accessible through an API. If you're primarily looking for structured data extraction, passing pages through an LLM can work well, but for interactive workflows (e.g., authentication, multi-step navigation), an agent-based approach might be better. It really depends on the use case.

  • By jchw 2025-02-07 0:04 · 4 replies

    One thing to be cognizant of: if you're planning to run this sort of thing against potentially untrusted URLs, the browser might be able to make requests to internal hosts in whatever network it is on. It would be wise, on Linux, to use network namespaces, and block any local IP range in the namespace, or use a network namespace to limit the browser to a wireguard VPN tunnel to some other network.

    • By leptons 2025-02-07 9:22 · 1 reply

      This is true for practically every web browser anyone uses on any site that they don't personally control.

      • By jchw 2025-02-07 11:43

        This is true, although I think in a home environment, there aren't as many interesting things to hit, and you're limited by Same Origin Policy, as well as certain mitigations that web browsers deploy against attacks like DNS Rebinding. However, if you're running this on a server, there's a much greater likelihood that interesting services are behind the firewall, e.g. maybe the Kubernetes API server. Code execution could potentially be a form post away.

    • By remram 2025-02-07 1:27

      Very important note! This is called Server-Side Request Forgery (SSRF).

    • By anonzzzies 2025-02-07 12:22

      Is there a self hosted version that does this properly?

    • By jot 2025-02-07 8:33

      Too many developers learn this the hard way.

      It’s one of the top reasons larger organisations prefer to use hosted services rather than doing it themselves.

HackerNews