Why my Rust rewrite of Mozilla's Readability is better than the original

2025-11-24 19:45 · github.com

A Rust port of Mozilla's standalone readability library - theiskaa/readabilityrs


readabilityrs extracts article content from HTML web pages using Mozilla's Readability algorithm. The library identifies and isolates the main article text while removing navigation, advertisements, and other clutter.

This is a Rust port of Mozilla's Readability.js, which powers Firefox's Reader View. The implementation passes 93.8% of Mozilla's test suite with full document preprocessing support.

Add to your project:
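
cargo add readabilityrs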

Or add to your Cargo.toml:

[dependencies]
readabilityrs = "0.1.0"

The library provides a simple API for parsing HTML documents. Create a Readability instance with your HTML content, an optional base URL for resolving relative links, and optional configuration settings. Call parse() to extract the article and access properties like title, content, author, excerpt, and publication time. The extracted content is returned as clean HTML suitable for display in reader applications.

use readabilityrs::Readability;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <html>
        <head><title>Example Article</title></head>
        <body>
        <article>
        <h1>Article Title</h1>
        <p>This is the main article content.</p>
        </article>
        </body>
        </html>
    "#;

    let readability = Readability::new(html, None, None)?;

    if let Some(article) = readability.parse() {
        println!("Title: {}", article.title.unwrap_or_default());
        println!("Content: {}", article.content.unwrap_or_default());
        println!("Length: {} chars", article.length);
    }

    Ok(())
}

The library uses Mozilla's content scoring algorithm to identify the main article. Elements are scored based on tag types, text density, link density, and class name patterns. Document preprocessing removes scripts and styles, unwraps noscript tags, and normalizes deprecated elements before extraction, improving accuracy by 2.3 percentage points compared to parsing raw HTML.
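
As a rough illustration of the link-density part of that scoring (a sketch of the general technique, not the crate's actual internals), a candidate's score can be discounted by the fraction of its text that sits inside links:

// Illustrative sketch only: the real scorer also weighs tag types, class
// name patterns, and text density, as described above.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / text_len as f64
}

fn score_candidate(base_score: f64, text_len: usize, link_text_len: usize) -> f64 {
    // Elements whose text is mostly inside <a> tags (menus, related-article
    // lists) are discounted; long runs of plain text keep their score.
    base_score * (1.0 - link_density(text_len, link_text_len))
}

fn main() {
    let paragraph = score_candidate(25.0, 800, 40); // article text, few links
    let nav_menu = score_candidate(25.0, 200, 180); // navigation, mostly links
    println!("paragraph: {paragraph:.1}, nav menu: {nav_menu:.1}");
}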

Metadata is extracted from JSON-LD, OpenGraph, Twitter Cards, Dublin Core, and standard meta tags in that priority order. The library detects authors through rel="author" links and common byline patterns, extracts clean titles by removing site names, and generates excerpts from the first substantial paragraph.
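
The priority order matters when sources disagree. A hypothetical sketch of that fallback chain (the struct and its fields are illustrative, not part of the readabilityrs API):

// Hypothetical sketch of the fallback order described above; the real
// extraction logic lives inside readabilityrs.
struct MetadataSources {
    json_ld: Option<String>,
    open_graph: Option<String>,
    twitter_card: Option<String>,
    dublin_core: Option<String>,
    meta_tag: Option<String>,
}

fn pick_title(sources: &MetadataSources) -> Option<String> {
    // The first source that provides a value wins, in priority order.
    sources
        .json_ld
        .clone()
        .or_else(|| sources.open_graph.clone())
        .or_else(|| sources.twitter_card.clone())
        .or_else(|| sources.dublin_core.clone())
        .or_else(|| sources.meta_tag.clone())
}

fn main() {
    let sources = MetadataSources {
        json_ld: None,
        open_graph: Some("Example Article".to_string()),
        twitter_card: Some("Example Article | Example Site".to_string()),
        dublin_core: None,
        meta_tag: None,
    };
    // JSON-LD is absent, so the OpenGraph title wins over Twitter Cards.
    assert_eq!(pick_title(&sources).as_deref(), Some("Example Article"));
}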

Configure parsing behavior through ReadabilityOptions using the builder pattern. Options include debug logging, character thresholds, candidate selection, class preservation, and link density scoring.

use readabilityrs::{Readability, ReadabilityOptions};

let options = ReadabilityOptions::builder()
    .debug(true)
    .char_threshold(500)
    .nb_top_candidates(5)
    .keep_classes(false)
    .classes_to_preserve(vec!["page".to_string()])
    .disable_json_ld(false)
    .link_density_modifier(0.0)
    .build();

let readability = Readability::new(&html, None, Some(options))?;

Provide a base URL to convert relative links to absolute URLs. This ensures images, anchors, and embedded content maintain correct paths when displayed outside the original context.
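
A minimal sketch, reusing the constructor signature shown in the other examples (the example.com URLs are placeholders, and the `?` conversion assumes the crate's error type implements std::error::Error):

use readabilityrs::Readability;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<article>
        <p>See the <a href="/about">about page</a> and <img src="images/photo.jpg"></p>
    </article>"#;

    // With a base URL, relative paths in the extracted content are resolved
    // against it, e.g. /about -> https://example.com/about.
    let readability = Readability::new(html, Some("https://example.com/post"), None)?;
    if let Some(article) = readability.parse() {
        println!("{}", article.content.unwrap_or_default());
    }
    Ok(())
}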

The library returns Result types for operations that can fail. Common errors include invalid URLs and parsing failures.

use readabilityrs::{Readability, error::ReadabilityError};

fn extract_article(html: &str, url: &str) -> Result<String, ReadabilityError> {
    let readability = Readability::new(html, Some(url), None)?;
    let article = readability.parse().ok_or(ReadabilityError::NoContentFound)?;
    Ok(article.content.unwrap_or_default())
}

Built in Rust for performance and memory safety, the library relies on zero-cost abstractions, so these optimizations add no runtime overhead. Efficient string handling and DOM traversal keep allocations minimal during parsing, and typical news articles are processed in milliseconds on modern hardware. Memory usage scales with document size and typically stays under 10 MB for standard web pages. The Rust implementation is significantly faster than the original JavaScript version while maintaining a lower memory footprint.

The implementation passes 122 of 130 tests from Mozilla's test suite, achieving 93.8% compatibility with full document preprocessing support. The 8 failing tests reflect editorial judgment differences rather than implementation errors. Four cases involve more sensible choices in this implementation, such as avoiding bylines extracted from related-article sidebars and preferring author names over timestamps. The other four involve subjective paragraph selection for excerpts, where both the reference and this implementation make valid choices. In other words, the results are 93.8% identical to Mozilla's, and the remaining differences are arguably improvements to the extraction logic.

For information regarding contributions, please refer to the CONTRIBUTING.md file.

Licensed under the Apache License, Version 2.0. See LICENSE file for details.


Read the original article

Comments

  • By emschwartz 2025-11-25 12:59

    How does the approach here differ from dom_smoothie (https://github.com/niklak/dom_smoothie)?

  • By janpio 2025-11-24 19:50 · 1 reply

    So why? The link just goes to the project's GitHub repo, and the README does not explain, as far as I can see.

    • By fsiefken 2025-11-24 21:19

      The README.md is 9k of dense text, but does explain it: faster, more efficient, more accurate & more sensible.

      Rust port feature: The implementation "passes 93.8% of Mozilla's test suite (122/130 tests)" with full document preprocessing support.

      Test interpretation/sensibility: The 8 failing tests "represent editorial judgment differences rather than implementation errors." It notes four cases involving "more sensible choices in our implementation such as avoiding bylines extracted from related article sidebars and preferring author names over timestamps."

      This means that the results are 93.8% identical, and the remaining differences are arguably an improvement. Further improvement, extraction accuracy: Document preprocessing "improves extraction accuracy by 2.3 percentage points compared to parsing raw HTML."

      Performance:

        * Built in Rust for performance and memory safety
        * The port uses "Zero-cost abstractions enable optimizations without runtime overhead."
        * It uses "Minimal allocations during parsing through efficient string handling and DOM traversal."
        * The library "processes typical news articles in milliseconds on modern hardware."
      
      It's not explicitly written, but I think it's a reasonable assumption that its "millisecond" processing time is significantly faster than the original JavaScript implementation, based on these 4 points. Perhaps it's also better memory-wise.

      I would add a comparison benchmark (memory and processing time), perhaps with bar charts, plus the 8 examples of the differing editorial judgement, to make it clearer for people who scan-read.

  • By M95D 2025-11-25 8:12

    Why not contribute to Ladybird instead?

HackerNews