Trafilatura

Python web text and metadata extractor

Free, Open Source, Apache-2.0, CLI

Description

Trafilatura, by Adrien Barbaresi, is a Python library and command-line tool that extracts the cleaned main text and structured metadata (title, date, author, language) from arbitrary HTML pages. It works across many languages and can write its results in formats such as JSON and XML-TEI.
