Converts HTML to plain text. Preserves structure, handles tables, and can track document elements.
```bash
pip install git+https://github.com/accessibleapps/html_to_text.git
```

Or with uv:

```bash
uv pip install git+https://github.com/accessibleapps/html_to_text.git
```

```python
from html_to_text import html_to_text

html = """
<h1>Title</h1>
<p>First paragraph with <strong>bold</strong> text.</p>
<p>Second paragraph.</p>
"""

text = html_to_text(html)
print(text)
```

Output:

```
Title
First paragraph with bold text.
Second paragraph.
```

- Structure preservation: Block elements (`<p>`, `<div>`, `<h1>` to `<h6>`) get appropriate spacing
- Table handling: Tables are parsed and tracked if callbacks are provided
- Link extraction: Internal and external links can be captured
- Pre-formatted text: `<pre>` and `<code>` blocks preserve whitespace
- Heading hierarchy: Tracks document structure through heading levels
- Element tracking: Optional callbacks for building document indexes

Parameters:

- `item` (str | lxml.etree.Element): HTML string or parsed lxml element
- `node_parsed_callback` (callable, optional): Function called when elements are parsed. Receives `(parent, tag_type, content, **kwargs)`
- `startpos` (int, optional): Starting position offset for tracking
- `file` (str, optional): File path for resolving relative links

Returns: Plain text string

Track document structure by providing a callback function:
```python
from html_to_text import html_to_text


def track_elements(parent, tag_type, content, **kwargs):
    """
    Called for each tracked element.

    Args:
        parent: Parent element ID (for hierarchical structures)
        tag_type: 'heading', 'link', 'table', 'tr', 'td', 'th', 'id', 'page'
        content: Element content (text for links, None for structural elements)
        **kwargs: Element-specific data (start, end, level, href, etc.)

    Returns:
        Dictionary with 'id' key (used as parent for child elements)
    """
    print(f"{tag_type}: {content}")
    return {'id': f"{tag_type}_{kwargs.get('start', 0)}"}


html = """
<h1>Main Title</h1>
<h2>Subsection</h2>
<p>Text with <a href="/page">link</a>.</p>
<table>
<tr><td>Cell</td></tr>
</table>
"""

text = html_to_text(html, node_parsed_callback=track_elements)
```

Callback parameters by element type:

| Type | parent | content | kwargs |
|---|---|---|---|
| `heading` | Parent heading ID | None | start, end, tag, level |
| `link` | None | Link text | start, end, href |
| `table` | Parent table ID | None | start, attrs |
| `tr`, `td`, `th` | Parent table element ID | None | start, attrs |
| `id` | None | Element ID | start |
| `page` | None | Page identifier | start, pagenum |

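
The callback contract in the table above can be exercised without running the converter. The sketch below is only an illustration: `StructureCollector` is a hypothetical helper, the calls at the bottom are simulated by hand with made-up positions, and it assumes the value passed as `parent` is the `'id'` from the dictionary returned for the parent element.

```python
from typing import Any, Dict, List, Optional


class StructureCollector:
    """Hypothetical callback object that sorts reported elements by type."""

    def __init__(self) -> None:
        self.headings: List[Dict[str, Any]] = []
        self.links: List[Dict[str, Any]] = []
        self._next_id = 0

    def __call__(self, parent: Optional[int], tag_type: str,
                 content: Optional[str], **kwargs: Any) -> Dict[str, Any]:
        node_id = self._next_id
        self._next_id += 1
        if tag_type == "heading":
            # Headings carry no content; level and positions arrive in kwargs.
            self.headings.append({
                "id": node_id,
                "parent": parent,
                "level": kwargs.get("level"),
                "start": kwargs.get("start"),
                "end": kwargs.get("end"),
            })
        elif tag_type == "link":
            # For links, content is the link text and href arrives in kwargs.
            self.links.append({
                "id": node_id,
                "text": content,
                "href": kwargs.get("href"),
            })
        # Other types (table, tr, td, th, id, page) are only given an ID here.
        return {"id": node_id}


collector = StructureCollector()

# Simulated calls with made-up positions, standing in for the converter.
h1 = collector(None, "heading", None, start=0, end=10, tag="h1", level=1)
h2 = collector(h1["id"], "heading", None, start=12, end=22, tag="h2", level=2)
collector(None, "link", "link", start=34, end=38, href="/page")
collector(None, "table", None, start=40, attrs={})
```

Because the object is callable with the `(parent, tag_type, content, **kwargs)` signature, it could be passed as `node_parsed_callback` in place of a plain function.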
- `<script>`
- `<style>`
- `<title>`
- Elements with `class="pagenum"`

Headings (`<h1>` to `<h6>`): Surrounded by double newlines. Hierarchy is tracked in callbacks.

Blocks (`<p>`, `<div>`, `<blockquote>`, `<center>`): Double newlines before and after.

Line breaks (`<br>`): Converted to a single newline.

Horizontal rule (`<hr>`): Becomes a line of 80 dashes.

Pre-formatted (`<pre>`, `<code>`): Whitespace preserved exactly.

Links: Text extracted; URLs captured via callbacks if provided.

Tables: Structure tracked via callbacks; text extracted in reading order.

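
To make the spacing rules above concrete, here is a rough standalone sketch built on the standard library's `html.parser`. It is not this package's implementation (which depends on lxml); `SpacingSketch` and `sketch_to_text` are hypothetical names, and details such as the newline placed around the dash line are assumptions.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "blockquote", "center",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class SpacingSketch(HTMLParser):
    """Illustrates the spacing rules only; links and tables are not handled."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = ""
        self.pre_depth = 0

    def _ensure_newlines(self, count: int) -> None:
        # Pad the output so it ends with at least `count` newlines,
        # without adding blank lines at the very start.
        if not self.out:
            return
        trailing = len(self.out) - len(self.out.rstrip("\n"))
        self.out += "\n" * max(0, count - trailing)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._ensure_newlines(2)
        elif tag == "br":
            self.out += "\n"
        elif tag == "hr":
            self._ensure_newlines(1)
            self.out += "-" * 80 + "\n"
        elif tag in ("pre", "code"):
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._ensure_newlines(2)
        elif tag in ("pre", "code") and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            # Inside <pre>/<code>: keep whitespace exactly as written.
            self.out += data
            return
        collapsed = re.sub(r"\s+", " ", data)
        if not self.out or self.out.endswith("\n"):
            collapsed = collapsed.lstrip(" ")
        self.out += collapsed


def sketch_to_text(html: str) -> str:
    parser = SpacingSketch()
    parser.feed(html)
    parser.close()
    return parser.out.strip("\n")


sample = "<h1>Title</h1>\n<p>One<br>two</p><hr><pre>a  b\n  c</pre>"
result = sketch_to_text(sample)
print(result)
```

The sketch pads to a minimum number of trailing newlines instead of collapsing newlines afterwards, so whitespace inside `<pre>` content is never touched.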
- Python ≥ 3.8
- lxml
```python
from html_to_text import html_to_text

doc_structure = []


def index_callback(parent, tag_type, content, **kwargs):
    # Record every tracked element; the list index doubles as the element ID.
    entry = {
        'id': len(doc_structure),
        'type': tag_type,
        'parent': parent,
        'start': kwargs.get('start'),
        'end': kwargs.get('end')
    }
    if tag_type == 'heading':
        entry['level'] = kwargs['level']
    elif tag_type == 'link':
        entry['text'] = content
        entry['href'] = kwargs['href']
    doc_structure.append(entry)
    return entry


# Read the document and close the file handle promptly.
with open('document.html') as f:
    html = f.read()

text = html_to_text(html, node_parsed_callback=index_callback, file='document.html')

# Now doc_structure contains the full document index with positions
for item in doc_structure:
    if item['type'] == 'heading':
        indent = ' ' * int(item['level'])
        heading_text = text[item['start']:item['end']]
        print(f"{indent}{heading_text}")
```

See LICENSE file.