How to load Markdown

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

Here we cover how to load Markdown documents into LangChain Document objects that we can use downstream.

We will cover:

  • Basic usage;
  • Parsing of Markdown into elements such as titles, list items, and text.

LangChain implements an UnstructuredMarkdownLoader object, which requires the Unstructured package. First we install it:

# !pip install "unstructured[md]"

Basic usage ingests a Markdown file into a single document. Here we demonstrate on LangChain's README:

from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document

markdown_path = "../../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

data = loader.load()
assert len(data) == 1
assert isinstance(data[0], Document)
readme_content = data[0].page_content
print(readme_content[:250])
🦜️🔗 LangChain

⚡ Build context-aware reasoning applications ⚡

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith.
LangSmith is a unified developer platform for building,

Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements".

loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")

data = loader.load()
print(f"Number of documents: {len(data)}\n")

for document in data[:2]:
    print(f"{document}\n")
Number of documents: 65

page_content='🦜️🔗 LangChain' metadata={'source': '../../../../README.md', 'last_modified': '2024-04-29T13:40:19', 'page_number': 1, 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../../..', 'filename': 'README.md', 'category': 'Title'}

page_content='⚡ Build context-aware reasoning applications ⚡' metadata={'source': '../../../../README.md', 'last_modified': '2024-04-29T13:40:19', 'page_number': 1, 'languages': ['eng'], 'parent_id': 'c3223b6f7100be08a78f1e8c0c28fde1', 'filetype': 'text/markdown', 'file_directory': '../../../..', 'filename': 'README.md', 'category': 'NarrativeText'}

Note that in this case we recover three distinct element types:

print(set(document.metadata["category"] for document in data))
{'Title', 'NarrativeText', 'ListItem'}
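
Once elements are kept separate, the "category" metadata can drive downstream processing. The following is a minimal sketch, not part of LangChain: count_categories and group_by_title are hypothetical helper names, and the sketch assumes that elements arrive in document order and that each Title starts a new section. It uses small stand-in objects with page_content and metadata attributes so that it runs without any files.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(documents):
    """Count how many elements fall into each category (hypothetical helper)."""
    return Counter(doc.metadata.get("category", "Unknown") for doc in documents)


def group_by_title(documents):
    """Collect the text that follows each Title element (hypothetical helper).

    Assumes the elements are in document order.
    """
    sections = []  # list of (title, [body texts])
    for doc in documents:
        if doc.metadata.get("category") == "Title":
            sections.append((doc.page_content, []))
        elif sections:
            sections[-1][1].append(doc.page_content)
        else:
            # Text before the first title goes into an untitled section.
            sections.append(("", [doc.page_content]))
    return [(title, "\n\n".join(body)) for title, body in sections]


# Stand-in objects for illustration only; in practice, pass the `data`
# list returned by loader.load() with mode="elements".
sample = [
    SimpleNamespace(page_content="Intro", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Some text.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="First item", metadata={"category": "ListItem"}),
    SimpleNamespace(page_content="Usage", metadata={"category": "Title"}),
    SimpleNamespace(page_content="More text.", metadata={"category": "NarrativeText"}),
]

print(count_categories(sample))
for title, body in group_by_title(sample):
    print(f"{title!r}: {body!r}")
```

Because the helpers only read page_content and metadata, they should work unchanged on the Document objects loaded above, for example count_categories(data).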
