Document Extract Tool

document_extract converts documents you upload to GoClaw into clean markdown that your LLM can actually read. It handles formats that most LLMs can’t open natively: PDFs (including scanned ones), Microsoft Office files, EPUB / MOBI e-books, HTML, and plain text.

Under the hood it uses the go-markitdown library; embedded images and OCR route through your configured agent model chain.

When the agent uses it

Whenever you upload a supported document to GoClaw (over Telegram, Discord, the TUI, HTTP, etc.), the gateway annotates the attachment with a hint telling the agent it can call document_extract to actually read the contents. The agent decides whether to call it — you don’t have to ask.

You can also explicitly ask things like:

“Read this PDF and summarise it.”

and the agent will call document_extract on whatever file you attached.

Supported formats

| Format | Notes |
| --- | --- |
| PDF | Text extraction; OCR optional (via vision chain); `<!-- Page N of M -->` markers at page boundaries |
| DOCX | Microsoft Word; footnotes and endnotes preserved as `[^fn-N]` anchors with a `## Footnotes` section |
| XLSX | Microsoft Excel; every sheet rendered under its own `## SheetName` heading; embedded pictures described via the vision chain |
| PPTX | Microsoft PowerPoint; `<!-- Slide number: N -->` markers per slide |
| EPUB / MOBI | E-books; `<!-- Page N of M -->` markers at page boundaries |
| HTML / XHTML | Web pages; inline base64 `data:` images flow through the vision chain |
| Markdown / Plain text | Pass-through |

Images are not supported here — use the regular image tools (read with an image path) or send the image directly; vision models handle those natively.

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| path | | Path to the document. Absolute, relative to the working directory, or a media-store path like ./media/uploads/... |
| ocr | false | Run OCR on pages with no extractable text via the agent vision chain. Slow, costs API credits |
| include_images | false | Describe embedded images inline via the agent vision chain. Costs API credits per image |
| metadata | false | Prepend YAML front-matter with title / author / page count where available |
| max_vision_calls | 50 | Cap on combined OCR + image description calls per extraction (0 = unlimited). Remaining images become references with a truncated_images count in the result |
| force_refresh | false | Bypass the cache and re-extract |
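
To make the defaults concrete, here is a minimal Python sketch that assembles an argument object for document_extract. The parameter names and defaults come from the table above; the helper name build_extract_args, the dict-based call shape, the validation rules, and the sample path docs/report.pdf are hypothetical, not part of GoClaw.

```python
# Defaults copied from the parameter table. Passing them as a dict is an
# assumption about how a caller might represent the tool arguments.
DEFAULTS = {
    "ocr": False,
    "include_images": False,
    "metadata": False,
    "max_vision_calls": 50,
    "force_refresh": False,
}


def build_extract_args(path, **overrides):
    """Return a document_extract argument dict with table defaults filled in."""
    if not isinstance(path, str) or not path:
        raise ValueError("path is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    args = {"path": path, **DEFAULTS, **overrides}
    if args["max_vision_calls"] < 0:
        raise ValueError("max_vision_calls must be >= 0 (0 means unlimited)")
    return args


# Hypothetical path, relative to the working directory.
args = build_extract_args("docs/report.pdf", ocr=True, max_vision_calls=10)
```

Only the overridden flags differ from the table; everything else keeps its documented default.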

Response

A successful call returns structured JSON:

{
  "ok": true,
  "format": "pdf",
  "output_path": "./media/extracted/abcdef...-5f9c.md",
  "bytes": 148210,
  "lines": 1842,
  "title": "Q3 Report",
  "toc": ["Introduction", "Revenue", "Outlook"],
  "preview": "# Q3 Report\n\nRevenue increased...",
  "image_count": 7,
  "truncated_images": 0,
  "warnings": [],
  "cached": false
}
  • preview is a fixed-size head of the markdown (~1500 characters). For more, the agent uses the read tool against output_path:

    {"path": "./media/extracted/abcdef...-5f9c.md", "start_line": 120, "end_line": 200}
    
  • cached: true means this extraction was served from disk with no re-processing. Cache keys include a hash of the document contents and a hash of the flags (ocr, include_images, metadata), so two calls with different flags don’t collide.
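
As a rough illustration of reading past the preview, the sketch below turns a successful result into a series of read-tool argument objects that cover the whole extracted file in windows. The field names (output_path, lines, start_line, end_line) are taken from the examples above; the helper plan_reads, the window size, and the assumption that line numbers are 1-based and inclusive are hypothetical.

```python
def plan_reads(result, window=200):
    """Build read-tool argument dicts that page through the extracted markdown.

    Assumes start_line/end_line are 1-based and inclusive, which the
    documentation does not state explicitly.
    """
    if not result.get("ok"):
        return []
    total = result["lines"]
    return [
        {
            "path": result["output_path"],
            "start_line": start,
            "end_line": min(start + window - 1, total),
        }
        for start in range(1, total + 1, window)
    ]


# Trimmed copy of the sample response shown above.
sample = {
    "ok": True,
    "output_path": "./media/extracted/abcdef...-5f9c.md",
    "lines": 1842,
}
reads = plan_reads(sample, window=500)
```

In practice an agent would more likely use the toc and preview to pick one or two windows than read every line.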

Errors

| Code | Meaning |
| --- | --- |
| unsupported_format | File type isn’t supported by the library |
| no_text | PDF/scan has no extractable text; retry with ocr: true |
| password_protected | Document requires a password |
| corrupt | File is malformed |
| read_failed | Could not read the file from disk |
| invalid_path | Path could not be resolved |
| invalid_input | Missing/invalid path |
| extraction_failed | Other, non-typed extraction error |

Errors include a short hint where a retry strategy is obvious (for example, no_text suggests enabling OCR).
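
The no_text retry strategy could be sketched in Python as follows. The exact shape of an error response is not documented here, so the "error" and "hint" field names are assumptions, as are the call_tool callable, the fake_tool stand-in, and the sample path.

```python
def extract_with_ocr_fallback(call_tool, path):
    """Call document_extract once; if it reports no_text, retry with OCR on.

    call_tool is a hypothetical callable that takes an argument dict and
    returns the tool's JSON result as a dict.
    """
    result = call_tool({"path": path})
    # Assumed error shape: {"ok": false, "error": "<code>", "hint": "..."}
    if not result.get("ok") and result.get("error") == "no_text":
        result = call_tool({"path": path, "ocr": True})
    return result


def fake_tool(args):
    """Stand-in for the real tool: succeeds only when OCR is requested."""
    if args.get("ocr"):
        return {"ok": True, "format": "pdf", "cached": False}
    return {"ok": False, "error": "no_text", "hint": "retry with ocr: true"}


result = extract_with_ocr_fallback(fake_tool, "scans/invoice.pdf")
print(result["ok"])  # True
```

Other codes such as password_protected or corrupt are returned unchanged, since retrying with OCR would not help.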

Storage & retention

Extracted markdown is cached under the extracted media category at media/extracted/<contentHash>-<flagsHash>.md (plus a small .json sidecar). Defaults:

  • TTL: 30 days
  • Quota: 2 GB

Adjust both in the setup wizard or TUI config editor under Media Storage → Extracted Documents. The category is ephemeral — cleanup is automatic, and anything you delete will simply be regenerated on the next call.
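
To illustrate why different flag combinations map to different cache files, here is a small Python sketch of a <contentHash>-<flagsHash>.md naming scheme. The hash function (SHA-256), the JSON serialisation of the flags, and the 8-character truncation are all assumptions chosen for illustration; the actual hashing GoClaw uses is not described here.

```python
import hashlib
import json


def cache_filename(content, ocr=False, include_images=False, metadata=False):
    """Derive a cache file name from document bytes plus extraction flags."""
    # Assumption: SHA-256 over the raw document bytes.
    content_hash = hashlib.sha256(content).hexdigest()
    # Assumption: flags are serialised deterministically before hashing.
    flags = json.dumps(
        {"ocr": ocr, "include_images": include_images, "metadata": metadata},
        sort_keys=True,
    )
    flags_hash = hashlib.sha256(flags.encode("utf-8")).hexdigest()[:8]
    return f"{content_hash}-{flags_hash}.md"


doc = b"example document bytes"
plain = cache_filename(doc)
with_ocr = cache_filename(doc, ocr=True)
```

Here plain and with_ocr share the same content-hash prefix but differ in the flags suffix, so an OCR run never overwrites or reuses the non-OCR result.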

Vision chain notes

  • OCR and image descriptions use the agent model chain. If that chain has no vision-capable model, the tool degrades gracefully: images become references, OCR reports no_text, and a warning is surfaced in the response.
  • Every vision call goes through normal failover + cooldown, so a single bad provider won’t break an extraction.
  • The max_vision_calls cap protects against pathological documents (e.g. a 200-image PDF) running up an unbounded bill. Remaining images are reported in truncated_images; raise the cap or set it to 0 to disable.
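
The effect of the cap can be sketched in a few lines of Python. This is a simplified model that only covers image descriptions (the real cap also counts OCR calls), and the function name, the return shape, and the assumption that images are processed in document order are hypothetical.

```python
def apply_vision_cap(image_ids, max_vision_calls=50):
    """Split images into those described and those left as plain references."""
    if max_vision_calls == 0:
        # 0 disables the cap, so every image is described.
        described, referenced = list(image_ids), []
    else:
        described = list(image_ids[:max_vision_calls])
        referenced = list(image_ids[max_vision_calls:])
    return {
        "described": described,
        "referenced": referenced,
        "truncated_images": len(referenced),
    }


# A 200-image document with the default cap of 50.
capped = apply_vision_cap(list(range(200)))
print(capped["truncated_images"])  # 150
```

With max_vision_calls set to 0 the same document would report truncated_images as 0, at the cost of 200 vision calls.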

Disable the tool

Under Tools → Document Extract in the setup wizard or TUI editor, toggle documentExtract.enabled off. The config key is tools.documentExtract.enabled (default true).