A modern C#/.NET library for converting a wide range of document formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, etc.) into high-quality Markdown suitable for Large Language Models (LLMs), search indexing, and text analytics. The project mirrors the original Microsoft Python implementation while embracing .NET idioms, async APIs, and new integrations.
- Features
- Format Support
- Quick Start
- Usage
- Architecture
- Development & Contributing
- Roadmap
- Performance
- Configuration
- License
- Acknowledgments
- Support
- Modern .NET: Targets .NET 9.0 with up-to-date language features
- NuGet Package: Drop-in dependency for libraries and automation pipelines
- Async/Await: Fully asynchronous pipeline for responsive apps
- LLM-Optimized: Markdown tailored for AI ingestion and summarisation
- Extensible: Register custom converters or plug in additional caption/transcription services
- Smart Detection: Automatic MIME, charset, and file-type guessing (including data/file URIs)
- High Performance: Stream-friendly, minimal allocations, zero temp files
| Format | Extension | Status | Description |
|---|---|---|---|
| HTML | .html, .htm | Supported | Full HTML to Markdown conversion |
| Plain Text | .txt, .md | Supported | Direct text processing |
| PDF | .pdf | Supported | Adobe PDF documents with text extraction |
| Word | .docx | Supported | Microsoft Word documents with formatting |
| Excel | .xlsx | Supported | Microsoft Excel spreadsheets as tables |
| PowerPoint | .pptx | Supported | Microsoft PowerPoint presentations |
| Images | .jpg, .png, .gif, .bmp, .tiff, .webp | Supported | Exif metadata extraction + optional captions |
| Audio | .wav, .mp3, .m4a, .mp4 | Supported | Metadata extraction + optional transcription |
| CSV | .csv | Supported | Comma-separated values as Markdown tables |
| JSON | .json, .jsonl, .ndjson | Supported | Structured JSON data with formatting |
| XML | .xml, .xsd, .xsl, .rss, .atom | Supported | XML documents with structure preservation |
| EPUB | .epub | Supported | E-book files with metadata and content |
| ZIP | .zip | Supported | Archive processing with recursive file conversion |
| Jupyter Notebook | .ipynb | Supported | Python notebooks with code and markdown cells |
| RSS/Atom Feeds | .rss, .atom, .xml | Supported | Web feeds with structured content and metadata |
| YouTube URLs | YouTube links | Supported | Video metadata extraction and link formatting |
| Wikipedia Pages | wikipedia.org | Supported | Article-only extraction with clean Markdown |
| Bing SERPs | bing.com/search | Supported | Organic result summarisation |
- Headers (H1-H6) → Markdown headers
- Bold/Strong text → **bold**
- Italic/Emphasis text → *italic*
- Links → [text](url)
- Images → ![alt](src)
- Lists (ordered/unordered)
- Tables with header detection and Markdown table output
- Code blocks and inline code
- Blockquotes, sections, semantic containers
- Text extraction with page separation
- Header detection based on formatting
- List item recognition
- Title extraction from document content
- Word (.docx): Headers, paragraphs, tables, bold/italic formatting
- Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
- PowerPoint (.pptx): Slide-by-slide content with title recognition
- Automatic table formatting with headers
- Proper escaping of special characters
- Support for various CSV dialects
- Handles quoted fields and embedded commas
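The CSV behaviour listed above (quoted fields, embedded commas, escaping of special characters) can be illustrated with a small standalone sketch. This is not the library's CsvConverter; `CsvMarkdownSketch` and its methods are hypothetical names, and the parser assumes one record per line (quoted fields that span lines are not handled).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CsvMarkdownSketch
{
    // Splits one CSV line into fields, honouring double-quoted fields,
    // embedded commas, and doubled quotes ("") as an escaped quote.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote inside a quoted field
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields;
    }

    // Escapes characters that would otherwise break a Markdown table cell.
    public static string EscapeCell(string value) =>
        value.Replace("\\", "\\\\").Replace("|", "\\|").Replace("\n", " ").Trim();

    // Renders the lines as a Markdown table; the first line is treated as the header.
    public static string ToMarkdownTable(IReadOnlyList<string> lines)
    {
        if (lines.Count == 0) return string.Empty;

        var rows = lines.Select(ParseLine).ToList();
        int width = rows.Max(r => r.Count);
        var sb = new StringBuilder();

        for (int r = 0; r < rows.Count; r++)
        {
            // Pad short rows with empty cells so every row has the same width.
            var cells = Enumerable.Range(0, width)
                .Select(i => i < rows[r].Count ? EscapeCell(rows[r][i]) : string.Empty);
            sb.Append("| ").Append(string.Join(" | ", cells)).Append(" |\n");

            if (r == 0)
            {
                sb.Append('|')
                  .Append(string.Join("|", Enumerable.Repeat(" --- ", width)))
                  .Append("|\n");
            }
        }

        return sb.ToString();
    }
}
```

Calling `CsvMarkdownSketch.ToMarkdownTable(File.ReadAllLines("data.csv"))` would produce a table whose pipe characters inside cells are backslash-escaped.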
- Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
- JSON Lines Support: Processes .jsonl and .ndjson files line by line
- Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
- Nested Objects: Handles complex nested structures with proper indentation
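To make the hierarchy and data-type points concrete, here is a rough sketch that renders JSON as an indented Markdown list using System.Text.Json. It is an illustration only; `JsonMarkdownSketch` is a hypothetical name and the output layout is an assumption, not the format the library's JsonConverter emits.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class JsonMarkdownSketch
{
    public static string ToMarkdown(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var sb = new StringBuilder();
        Write(doc.RootElement, sb, depth: 0);
        return sb.ToString();
    }

    private static void Write(JsonElement element, StringBuilder sb, int depth)
    {
        // Two spaces per nesting level keep nested lists valid Markdown.
        string indent = new string(' ', depth * 2);

        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var property in element.EnumerateObject())
                    WriteEntry(property.Name, property.Value, sb, indent, depth);
                break;

            case JsonValueKind.Array:
                int index = 0;
                foreach (var item in element.EnumerateArray())
                    WriteEntry($"[{index++}]", item, sb, indent, depth);
                break;

            default:
                sb.Append(indent).Append("- ").Append(Scalar(element)).Append('\n');
                break;
        }
    }

    private static void WriteEntry(
        string label, JsonElement value, StringBuilder sb, string indent, int depth)
    {
        bool isContainer = value.ValueKind is JsonValueKind.Object or JsonValueKind.Array;
        sb.Append(indent).Append("- **").Append(label).Append("**");

        if (isContainer)
        {
            sb.Append('\n');
            Write(value, sb, depth + 1); // recurse one level deeper
        }
        else
        {
            sb.Append(": ").Append(Scalar(value)).Append('\n');
        }
    }

    // GetRawText keeps the JSON spelling, so "1" (string) and 1 (number)
    // remain distinguishable in the output.
    private static string Scalar(JsonElement value) => $"`{value.GetRawText()}`";
}
```

For .jsonl or .ndjson input, the same method could be applied to each non-empty line in turn.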
- Structure Preservation: Maintains XML hierarchy as Markdown headings
- Attributes Handling: Converts XML attributes to Markdown lists
- Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
- CDATA Support: Properly handles CDATA sections as code blocks
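The XML mapping described above (hierarchy as headings, attributes as lists, CDATA as code blocks) might look roughly like the following sketch built on System.Xml.Linq. `XmlMarkdownSketch` is a hypothetical name, and the exact output shape is an assumption rather than what XmlConverter produces.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class XmlMarkdownSketch
{
    // Built at runtime so the sample itself contains no literal code fence.
    private static readonly string Fence = new string('`', 3);

    public static string ToMarkdown(string xml)
    {
        var root = XDocument.Parse(xml).Root;
        if (root is null) return string.Empty;

        var sb = new StringBuilder();
        Write(root, sb, level: 1);
        return sb.ToString();
    }

    private static void Write(XElement element, StringBuilder sb, int level)
    {
        // Markdown defines six heading levels, so deeper nesting is clamped.
        sb.Append(new string('#', Math.Min(level, 6)))
          .Append(' ')
          .Append(element.Name.LocalName)
          .Append('\n');

        // Attributes become a bullet list under the element's heading.
        foreach (var attribute in element.Attributes())
        {
            sb.Append("- ").Append(attribute.Name.LocalName)
              .Append(": ").Append(attribute.Value).Append('\n');
        }

        foreach (var node in element.Nodes())
        {
            switch (node)
            {
                // XCData derives from XText, so it must be matched first.
                case XCData cdata:
                    sb.Append(Fence).Append('\n')
                      .Append(cdata.Value).Append('\n')
                      .Append(Fence).Append('\n');
                    break;

                case XText text when !string.IsNullOrWhiteSpace(text.Value):
                    sb.Append(text.Value.Trim()).Append('\n');
                    break;

                case XElement child:
                    Write(child, sb, level + 1);
                    break;
            }
        }
    }
}
```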
- Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
- Content Order: Processes content files in proper reading order using spine information
- HTML Processing: Converts XHTML content using the HTML converter
- Table of Contents: Maintains document structure from the original EPUB
- Recursive Processing: Extracts and converts all supported files within archives
- Structure Preservation: Maintains original file paths and organization
- Multi-Format Support: Processes different file types within the same archive
- Error Handling: Continues processing even if individual files fail
- Size Limits: Protects against memory issues with large files
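The archive behaviour above (per-entry conversion, path preservation, continuing after failures, size limits) can be sketched with System.IO.Compression. Everything here is illustrative: `ZipWalkSketch`, the `convertEntry` delegate, and the 10 MB `MaxEntryBytes` threshold are assumptions, not the library's actual names or limits.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ZipWalkSketch
{
    // Hypothetical limit chosen for the example only.
    public const long MaxEntryBytes = 10 * 1024 * 1024;

    // Walks every file entry and delegates conversion to the caller.
    // A failing or oversized entry is reported and the walk continues.
    public static string Summarise(Stream zipStream, Func<string, Stream, string> convertEntry)
    {
        var sb = new StringBuilder();
        using var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true);

        foreach (var entry in archive.Entries)
        {
            // Directory entries end with a slash and carry no content.
            if (entry.FullName.EndsWith("/", StringComparison.Ordinal)) continue;

            // The full path inside the archive is kept as the section heading.
            sb.Append("## ").Append(entry.FullName).Append('\n');

            if (entry.Length > MaxEntryBytes)
            {
                sb.Append("*Skipped: entry exceeds size limit.*\n\n");
                continue;
            }

            try
            {
                using var entryStream = entry.Open();
                sb.Append(convertEntry(entry.FullName, entryStream)).Append("\n\n");
            }
            catch (Exception ex)
            {
                sb.Append("*Conversion failed: ").Append(ex.Message).Append("*\n\n");
            }
        }

        return sb.ToString();
    }
}
```

Nested archives could be handled by having `convertEntry` call `Summarise` again for entries that end in .zip, after buffering the entry into a seekable stream.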
- Cell Type Support: Processes markdown, code, and raw cells appropriately
- Metadata Extraction: Extracts notebook title, kernel information, and language details
- Code Output Handling: Captures and formats execution results, streams, and errors
- Syntax Highlighting: Preserves language information for proper code block formatting
- Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
- Feed Metadata: Extracts title, description, last update date, and author information
- Article Processing: Converts feed items with proper title linking and content formatting
- Date Formatting: Normalizes publication dates across different feed formats
- URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
- Metadata Extraction: Extracts video ID and URL parameters with descriptions
- Embed Integration: Provides thumbnail images and multiple access methods
- Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
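As an illustration of the URL recognition and parameter parsing described above, the sketch below extracts a video ID and query parameters from youtube.com and youtu.be links. `YouTubeUrlSketch` is a hypothetical helper written for this example; the real YouTubeUrlConverter may recognise more URL shapes.

```csharp
using System;
using System.Collections.Generic;

public static class YouTubeUrlSketch
{
    // Returns true when the URL is a recognised YouTube link with a video ID.
    public static bool TryParse(
        string url, out string videoId, out Dictionary<string, string> parameters)
    {
        videoId = string.Empty;
        parameters = new Dictionary<string, string>(StringComparer.Ordinal);

        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return false;

        // Decode "key=value" pairs from the query string.
        foreach (var pair in uri.Query.TrimStart('?')
                     .Split('&', StringSplitOptions.RemoveEmptyEntries))
        {
            var parts = pair.Split('=', 2);
            parameters[Uri.UnescapeDataString(parts[0])] =
                parts.Length > 1 ? Uri.UnescapeDataString(parts[1]) : string.Empty;
        }

        string host = uri.Host.ToLowerInvariant();
        if (host == "youtu.be")
        {
            // Short links carry the ID in the path: https://youtu.be/<id>
            videoId = uri.AbsolutePath.Trim('/');
        }
        else if ((host == "youtube.com" || host.EndsWith(".youtube.com", StringComparison.Ordinal))
                 && parameters.TryGetValue("v", out var v))
        {
            // Standard links carry the ID in the "v" parameter.
            videoId = v;
        }

        return videoId.Length > 0;
    }
}
```

Parameters such as `t` (timestamp) and `list` (playlist) end up in the dictionary alongside `v`, so a caller can format them into the Markdown summary.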
- Support for JPEG, PNG, GIF, BMP, TIFF, WebP
- Exif metadata extraction via exiftool (optional)
- Optional multimodal image captioning hook (LLM integration ready)
- Graceful fallback when metadata/captioning unavailable
- Handles WAV/MP3/M4A/MP4 containers
- Extracts key metadata (artist, album, duration, channels, etc.)
- Optional transcription delegate for speech-to-text results
- Markdown summary highlighting metadata and transcript
Install via NuGet Package Manager:
# Package Manager Console
Install-Package ManagedCode.MarkItDown
# .NET CLI
dotnet add package ManagedCode.MarkItDown
# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="1.0.0" />
- .NET 9.0 SDK or later
- Compatible with .NET 9 apps and libraries
- PDF Support: Provided via PdfPig (bundled)
- Office Documents: Provided via DocumentFormat.OpenXml (bundled)
- Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
- Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
- Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)
Note: External tools are optional; MarkItDown degrades gracefully when they are absent.
using MarkItDown;
// Convert a DOCX file and print the Markdown
var markItDown = new MarkItDown();
DocumentConverterResult result = await markItDown.ConvertAsync("report.docx");
Console.WriteLine(result.Markdown);
using System.IO;
using System.Text;
using MarkItDown;
using var stream = File.OpenRead("invoice.html");
var streamInfo = new StreamInfo(
    mimeType: "text/html",
    extension: ".html",
    charset: Encoding.UTF8,
    fileName: "invoice.html");
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync(stream, streamInfo);
Console.WriteLine(result.Title);
using MarkItDown;
using Microsoft.Extensions.Logging;
using var loggerFactory = LoggerFactory.Create(static builder => builder.AddConsole());
using var httpClient = new HttpClient();
var markItDown = new MarkItDown(
    logger: loggerFactory.CreateLogger<MarkItDown>(),
    httpClient: httpClient);
DocumentConverterResult urlResult = await markItDown.ConvertFromUrlAsync("https://contoso.example/blog");
Console.WriteLine(urlResult.Title);
using Azure;
using MarkItDown;
var options = new MarkItDownOptions
{
    // Plug in your own services (Azure AI, OpenAI, etc.)
    ImageCaptioner = async (bytes, info, token) =>
        await myCaptionService.DescribeAsync(bytes, info, token),
    AudioTranscriber = async (bytes, info, token) =>
        await speechClient.TranscribeAsync(bytes, info, token),
    DocumentIntelligence = new DocumentIntelligenceOptions
    {
        Endpoint = "https://<your-resource>.cognitiveservices.azure.com/",
        Credential = new AzureKeyCredential("<document-intelligence-key>")
    }
};
var markItDown = new MarkItDown(options);
Create your own format converters by implementing IDocumentConverter:
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using MarkItDown;

public sealed class MyCustomConverter : IDocumentConverter
{
    // Priority tier for converters that target one specific file format.
    public int Priority => ConverterPriority.SpecificFileFormat;

    // Accept only inputs whose extension is ".mycustom" (case-insensitive).
    public bool AcceptsInput(StreamInfo streamInfo) =>
        string.Equals(streamInfo.Extension, ".mycustom", StringComparison.OrdinalIgnoreCase);

    public Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default)
    {
        // Rewind so the whole stream is read from the start.
        stream.Seek(0, SeekOrigin.Begin);

        // leaveOpen: the caller owns the stream, so it is not disposed here.
        using var reader = new StreamReader(stream, leaveOpen: true);
        var markdown = "# Converted from custom format\n\n" + reader.ReadToEnd();
        return Task.FromResult(new DocumentConverterResult(markdown, "Custom document"));
    }
}
var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter());
- MarkItDown - Main entry point for conversions
- IDocumentConverter - Interface for format-specific converters
- DocumentConverterResult - Contains the converted Markdown and optional metadata
- StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
- ConverterRegistration - Associates converters with priority for selection
- PlainTextConverter - Handles text, JSON, NDJSON, Markdown, etc.
- HtmlConverter - Converts HTML to Markdown using AngleSharp
- PdfConverter - PdfPig-based extraction with Markdown heuristics
- Docx/Xlsx/Pptx Converters - Office Open XML processing
- ImageConverter - Exif metadata + optional captions
- AudioConverter - Metadata + optional transcription
- WikipediaConverter - Article-only extraction from Wikipedia
- BingSerpConverter - Summaries for Bing search result pages
- YouTubeUrlConverter - Video metadata markdown
- ZipConverter - Recursive archive handling
- RssFeedConverter, JsonConverter, CsvConverter, XmlConverter, JupyterNotebookConverter, EpubConverter
- Priority-based dispatch (lower values processed first)
- Automatic stream sniffing via StreamInfoGuesser
- Manual overrides via MarkItDownOptions or StreamInfo
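The priority rule above (lower values are tried first) can be pictured with a minimal registry sketch. `SketchConverter` and `SketchRegistry` are hypothetical stand-ins written for this example, and the numeric priorities are made up; the library's own ConverterRegistration and ConverterPriority types may behave differently in detail.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// A converter reduced to the parts that matter for dispatch.
public sealed record SketchConverter(string Name, int Priority, Func<string, bool> Accepts);

public sealed class SketchRegistry
{
    private readonly List<SketchConverter> _converters = new();

    public void Register(SketchConverter converter) => _converters.Add(converter);

    // Picks the accepting converter with the lowest priority value.
    // OrderBy is a stable sort, so equal priorities keep registration order.
    public SketchConverter? Resolve(string extension) =>
        _converters
            .OrderBy(c => c.Priority)
            .FirstOrDefault(c => c.Accepts(extension));
}
```

With a catch-all converter registered at priority 10 and an HTML converter at priority 0, `Resolve(".html")` returns the HTML converter even if the catch-all was registered first, while `Resolve(".bin")` falls through to the catch-all.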
# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown
# Build the solution
dotnet build
# Run tests
dotnet test
# Create NuGet package
dotnet pack --configuration Release
dotnet test --collect:"XPlat Code Coverage"
The command emits standard test results plus a Cobertura coverage report at tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml. Tools such as ReportGenerator can turn this into HTML or Markdown dashboards.
├── src/
│   ├── MarkItDown/                # Core library
│   │   ├── Converters/            # Format-specific converters (HTML, PDF, audio, etc.)
│   │   ├── MarkItDown.cs          # Main conversion engine
│   │   ├── StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
│   │   ├── MarkItDownOptions.cs   # Runtime configuration flags
│   │   └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
│   └── MarkItDown.Cli/            # CLI host (under active development)
├── tests/
│   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors (WIP)
├── Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document
- Fork the repository.
- Create a feature branch (git checkout -b feature/my-feature).
- Add tests with xUnit/Shouldly mirroring relevant Python vectors.
- Run dotnet test (CI enforces green builds + coverage upload).
- Update docs or samples if behaviour changes.
- Submit a pull request for review.
- Azure Document Intelligence converter (options already scaffolded)
- Outlook .msg ingestion via MIT-friendly dependencies
- Expanded CLI commands (batch mode, globbing, JSON output)
- Richer regression suite mirroring Python test vectors
- Plugin discovery & sandboxing
- Built-in LLM caption/transcription providers
- Incremental/streaming conversion APIs
- Cloud-native samples (Functions, Containers, Logic Apps)
MarkItDown is designed for high performance with:
- Stream-based processing: Avoids writing temporary files by default
- Async/await everywhere: Non-blocking I/O with cancellation support
- Minimal allocations: Smart buffer reuse and pay-for-play converters
- Fast detection: Lightweight sniffing before converter dispatch
- Extensible hooks: Offload captions/transcripts to background workers
var options = new MarkItDownOptions
{
    EnableBuiltins = true,
    EnablePlugins = false,
    ExifToolPath = "/usr/local/bin/exiftool",
    ImageCaptioner = async (bytes, info, token) =>
    {
        // Call your preferred vision or LLM service here
        return await Task.FromResult("A scenic mountain landscape at sunset.");
    },
    AudioTranscriber = async (bytes, info, token) =>
    {
        // Route to speech-to-text provider
        return await Task.FromResult("Welcome to the MarkItDown demo.");
    }
};
var markItDown = new MarkItDown(options);
This project is licensed under the MIT License - see the LICENSE file for details.
This project is a C# conversion of the original Microsoft MarkItDown Python library. The original project was created by the Microsoft AutoGen team.
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Create an issue for support
Star this repository if you find it useful!
Made with ❤️ by ManagedCode