Struggling to turn web tables into clean data? Here’s a fast, reliable workflow for converting messy HTML tables into structured CSV or JSON without brittle scrapers.
Inspired by this community example from Simon Willison: HTML Table Extractor. The safest pattern: parse deterministically first, then use AI sparingly for cleanup.
Why HTML tables are tricky
- Rowspan/colspan, multi-row headers, and nested tables break naive scrapers.
- Repeated header rows across paginated sections confuse column alignment.
- Hidden rows/cells (CSS), footnotes, and superscripts pollute cell text.
- Inconsistent number formats (commas, currency, percents) and units.
A 3-step extractor workflow
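- 1) Deterministic parse first: Use pandas.read_html (backed by lxml or bs4) to pull candidate tables. read_html does not take CSS selectors itself, so narrow to the right table with its match and attrs arguments, or pre-select the element with a CSS selector (for example via BeautifulSoup) and pass only that HTML. Keep the raw HTML alongside the parsed text.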
- 2) Normalize structure: Flatten multi-row headers, expand rowspans, and carry forward group labels into each row. Standardize number parsing (currencies, percents) and coerce types.
- 3) Add AI only for cleanup: Use an LLM to map messy headers to a canonical schema, harmonize units/abbreviations, or fix edge-case rows. Constrain output to a JSON schema and use a low temperature so results are more repeatable (low temperature reduces variation but does not guarantee identical output).
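To make step 2 concrete, here is a minimal, stdlib-only sketch of expanding rowspans/colspans into a rectangular grid and then flattening a two-row header. The names TableGrid, expand_spans, flatten_headers, and SAMPLE_HTML are illustrative, not from any library. It assumes a single, non-nested table whose cell tags are explicitly closed; real pages may need more care.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<table>
<tr><th rowspan="2">Region</th><th colspan="2">Sales</th></tr>
<tr><th>2023</th><th>2024</th></tr>
<tr><td rowspan="2">North</td><td>10</td><td>12</td></tr>
<tr><td>7</td><td>9</td></tr>
</table>
"""


def _span(value):
    # Treat missing, zero, or non-numeric span attributes as 1.
    return max(1, int(value)) if value and value.isdigit() else 1


class TableGrid(HTMLParser):
    """Collect each row's cells as (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read, else None
        self._spans = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._spans = (_span(a.get("rowspan")), _span(a.get("colspan")))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.rows[-1].append((text, *self._spans))
            self._cell = None


def expand_spans(html):
    """Return a rectangular list of rows with spanned cells repeated."""
    parser = TableGrid()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def flatten_headers(header_rows, sep="_"):
    """Join stacked header cells per column, skipping repeats and blanks."""
    flat = []
    for parts in zip(*header_rows):
        uniq = []
        for p in parts:
            if p and (not uniq or uniq[-1] != p):
                uniq.append(p)
        flat.append(sep.join(uniq))
    return flat


grid = expand_spans(SAMPLE_HTML)
print(flatten_headers(grid[:2]))  # ['Region', 'Sales_2023', 'Sales_2024']
```

After expansion, the "North" group label appears on both data rows, so each row can be read on its own.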
Validation and quality checks
- Schema: Enforce required columns, types, and allowed value ranges.
- Arithmetic: Totals should equal the sum of their parts, and ratios should fall between 0 and 1.
- Redundancy: Cross-parse with a second method (e.g., bs4 vs. read_html) and diff results.
- Provenance: Keep the table HTML, URL, timestamp, and extraction parameters for audits.
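The schema, arithmetic, and redundancy checks above might look something like this sketch. The column names in SCHEMA, the tolerance, and the helper names check_row and diff_grids are all hypothetical; adapt them to your own table.

```python
import math
from itertools import zip_longest

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"region": str, "q1": float, "q2": float, "total": float, "share": float}


def check_row(row, schema=SCHEMA, parts=("q1", "q2"), total="total", ratios=("share",)):
    """Return a list of problems for one parsed row; an empty list means it passed."""
    errors = []
    for col, typ in schema.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            errors.append(f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}")
    if errors:
        return errors  # skip arithmetic checks on malformed rows

    # Totals should equal the sum of parts (small tolerance for rounding).
    parts_sum = sum(row[p] for p in parts)
    if not math.isclose(parts_sum, row[total], abs_tol=0.01):
        errors.append(f"{total}: {row[total]} != sum of parts {parts_sum}")

    # Ratios should fall between 0 and 1.
    for col in ratios:
        if not 0.0 <= row[col] <= 1.0:
            errors.append(f"{col}: {row[col]} is outside 0-1")
    return errors


def diff_grids(a, b):
    """Cell-level differences between two parses of the same table."""
    diffs = []
    for r, (row_a, row_b) in enumerate(zip_longest(a, b, fillvalue=[])):
        for c, (x, y) in enumerate(zip_longest(row_a, row_b, fillvalue=None)):
            if x != y:
                diffs.append((r, c, x, y))
    return diffs


good = {"region": "North", "q1": 10.0, "q2": 12.0, "total": 22.0, "share": 0.4}
bad = dict(good, total=25.0, share=1.4)
print(check_row(good))  # []
print(check_row(bad))   # two problems: total mismatch and ratio out of range
```

An empty diff_grids result between two independent parsers is a cheap signal that the structure was read the same way twice.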
When to add AI
- Use rules when the table is consistent across pages or releases.
- Use AI when headers are semi-structured, cells mix notes with values, or OCR artifacts exist.
- Guardrails: Require structured JSON, reject unparsable rows, and log all LLM prompts/outputs.
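As a rough illustration of those guardrails, the sketch below validates a header mapping returned by an LLM before it is applied. No model is called here: raw_output stands in for whatever text your LLM client returns, and CANONICAL, accept_header_mapping, and the logger name are made-up for the example.

```python
import json
import logging

logger = logging.getLogger("table_cleanup")  # hypothetical logger name

# Hypothetical canonical schema the LLM is allowed to map headers onto.
CANONICAL = {"region", "revenue_usd", "growth_pct"}


def accept_header_mapping(raw_output, source_headers, canonical=CANONICAL):
    """Parse and validate an LLM header mapping; raise ValueError if it is unusable."""
    logger.info("LLM output: %s", raw_output)  # keep every output for audits
    try:
        mapping = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object")
    if set(mapping) != set(source_headers):
        raise ValueError("mapping keys must match the source headers exactly")

    # Each value must be a known canonical column, or null for "drop this column".
    bad = [v for v in mapping.values()
           if v is not None and not (isinstance(v, str) and v in canonical)]
    if bad:
        raise ValueError(f"values outside the canonical schema: {bad!r}")

    targets = [v for v in mapping.values() if v is not None]
    if len(targets) != len(set(targets)):
        raise ValueError("two headers map to the same canonical column")
    return mapping


raw_output = '{"Region ": "region", "Rev. ($)": "revenue_usd", "Notes": null}'
mapping = accept_header_mapping(raw_output, ["Region ", "Rev. ($)", "Notes"])
print(mapping["Rev. ($)"])  # revenue_usd
```

Rejecting the whole response on any violation keeps the deterministic parse as the source of truth; the model only gets to rename columns that already exist.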
Quick wins
- Target the right table before parsing, e.g., by its id attribute or a CSS selector.
- Strip hidden elements (display:none) and footnote markers prior to number parsing.
- Flatten multi-row headers by joining with “_” or “: ” to preserve hierarchy.
- Keep a raw-to-clean mapping so you can trace any cell back to its source.
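Here is one way the footnote stripping, number parsing, and raw-to-clean mapping could fit together. It is a sketch under stated assumptions: US-style separators (comma for thousands, dot for decimals), footnote markers that trail the value as "[1]", "*", or "†", and parentheses meaning a negative amount. The names FOOTNOTE_RE and clean_cell are illustrative.

```python
import re

# Trailing footnote markers such as "[1]", "[a]", "*", "†" (assumed conventions).
FOOTNOTE_RE = re.compile(r"(?:\s*(?:\[\w{1,3}\]|[*†‡§]))+$")


def clean_cell(raw):
    """Return the raw text, a parsed numeric value (or None), and a unit hint."""
    text = FOOTNOTE_RE.sub("", raw.strip())

    # Accounting-style negatives: "(300)" means -300.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]

    unit = None
    if text.endswith("%"):
        unit, text = "percent", text[:-1]
    elif text and text[0] in "$€£":
        unit, text = text[0], text[1:]

    digits = text.replace(",", "").strip()
    value = None
    if re.fullmatch(r"[+-]?\d+(?:\.\d+)?", digits):
        value = float(digits)
        if negative:
            value = -value
        if unit == "percent":
            value /= 100  # store percents as ratios
    # Keeping the raw text next to the parsed value makes each cell traceable.
    return {"raw": raw, "value": value, "unit": unit}


for cell in ["$1,234.50[1]", "12.5%*", "(300)", "n/a †"]:
    print(clean_cell(cell))
```

Note that a superscript digit that has already been flattened into the text (for example "1,2341") cannot be separated again at this stage, which is why markers are best removed from the HTML before text extraction.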
Sources
- Simon Willison — HTML Table Extractor
- Pandas — read_html documentation
- MDN — HTML tables guide
Takeaway
Deterministic first, AI last. Parse and normalize with proven tools, then let an LLM polish headers and edge cases under strict schema checks.

