How to load HTML

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

This guide covers how to load HTML documents into LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.

Loading HTML with Unstructured

%pip install "unstructured[html]"

from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)
[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../../docs/integrations/document_loaders/example_data/fake-content.html'})]
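The file path above points into the LangChain repository, so the example only runs as written from a checkout of that repository. As a minimal sketch for trying the loaders elsewhere, the snippet below writes a small stand-in page to a temporary directory. The markup is an assumption inferred from the outputs shown on this page (a title, one heading, one paragraph); the real fake-content.html may differ, and `write_sample_html` is a hypothetical helper, not part of LangChain.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for fake-content.html; the real file's markup may differ.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head><title>Test Title</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""


def write_sample_html(directory: str) -> str:
    """Write the stand-in page into `directory` and return its path as a string."""
    path = Path(directory) / "fake-content.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return str(path)


# Create a fresh temporary directory and place the sample page in it.
tmp_dir = tempfile.mkdtemp()
file_path = write_sample_html(tmp_dir)
print(file_path)
```

The resulting `file_path` can be passed to either loader on this page in place of the repository-relative path.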

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This extracts the text from the HTML into page_content and stores the page title in metadata under the title key.

%pip install bs4

from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]
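To make the shape of that result concrete, here is a rough sketch of the same idea (page text into page_content, page title into metadata) using only Python's standard-library `html.parser`. This is not how BSHTMLLoader is implemented, and it does not try to reproduce BeautifulSoup's whitespace handling; `TitleAndTextParser` and `html_to_record` are hypothetical names, the inline HTML is an assumed stand-in for the example file, and the result is a plain dict rather than a LangChain Document.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> text and every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text inside <title> is recorded as the title and also kept as page text.
        if self._in_title:
            self.title += data
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_record(html: str, source: str) -> dict:
    """Return a dict shaped like a Document: page_content plus metadata."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.chunks),
        "metadata": {"source": source, "title": parser.title.strip()},
    }


# Assumed stand-in markup; the real example file may differ.
html = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
record = html_to_record(html, "fake-content.html")
print(record)
```

As in the loader output above, the title text appears both in the page text and under the title key in the metadata.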
