6 MIN READ

How to Reason Over Large Documents with AI


Ask an AI agent to analyze a 200-page PDF and it will grab a few chunks that look relevant, stuff them into the prompt, and hope for the best.

This works for simple questions. “What’s the company name?” Sure. Vector search finds that in one chunk.

It fails for real questions. “Compare the methodology across all three experiments.” “What assumptions does section 4 make that contradict section 7?” “Extract every dimension and tolerance from this engineering drawing.”

These questions require reading multiple sections, cross-referencing, and building up understanding over time. A single retrieval pass can’t do that.

Why RAG Falls Short on Long Documents

The standard approach to document Q&A:

  1. Chunk the document into 512-token pieces
  2. Embed each chunk
  3. When a question comes in, find the top 5 most similar chunks
  4. Pass those chunks to the LLM with the question

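As a sketch, the whole pipeline fits in a few lines. The bag-of-words `embed` here is a toy stand-in for a real embedding model; the function names are illustrative:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(document, question, k=5, size=512):
    # Steps 1-2: fixed-size chunks, one vector per chunk.
    chunks = [document[i:i + size] for i in range(0, len(document), size)]
    vectors = [embed(c) for c in chunks]
    # Steps 3-4: score each chunk against the question, keep the top k.
    q = embed(question)
    ranked = sorted(zip(chunks, vectors), key=lambda cv: cosine(cv[1], q), reverse=True)
    return [c for c, _ in ranked[:k]]
```

Everything the LLM will ever see is whatever `top_k_chunks` returns.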
This has a fundamental problem: you’re betting that the answer lives in 5 chunks out of potentially hundreds. For a 200-page document, that’s maybe 2% of the content.

Some questions need more than 2%. A question about “the main findings” might require reading the abstract, results, discussion, and conclusion. That’s four different sections. If your top-5 retrieval only grabs three of them, the answer is incomplete.

And semantic similarity is the wrong signal for structural questions. “Summarize the introduction” doesn’t have a semantically similar chunk — every chunk in the introduction is equally relevant. “What’s on page 47” is a structural query that vector search has no mechanism for.

The Agent Approach

Instead of retrieving and hoping, give the AI tools to explore the document on its own.

This is how a human reads a long document when answering a complex question:

  1. Look at the table of contents
  2. Read the sections that seem relevant
  3. Search for specific terms or numbers
  4. Read more sections based on what you found
  5. Keep going until you have enough to answer

An AI agent can do the same thing — if you give it the right tools.

Building a Document Agent

We built this in Nia. The document agent has six tools:

Get document tree — reads the table of contents and section hierarchy. The agent sees the full structure before deciding where to look.

Search — semantic search over the indexed document. Finds sections related to a concept.

Grep — exact text and regex search. When the agent needs a specific number, formula, reference, or term that semantic search might miss.

Read section — reads the full content of a section by its ID from the document tree. Not a chunk — the entire section.

Read page range — reads specific pages. Useful when the agent knows exactly where to look.

List pages — shows what pages exist and how the document is organized. Helps the agent plan its research.
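To make the grep tool concrete — this is a rough illustration, not Nia's actual implementation — a regex search over extracted page text can return page-numbered hits so every match is citable:

```python
import re

def grep_tool(pages, pattern):
    # Regex search over a list of page strings; returns (page_number, line)
    # pairs so each hit can be cited by page in the final answer.
    hits = []
    for page_no, text in enumerate(pages, start=1):
        for line in text.splitlines():
            if re.search(pattern, line):
                hits.append((page_no, line.strip()))
    return hits
```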

The agent runs in a loop. It decides which tool to use, reads the result, thinks about what it still needs, and calls another tool. It keeps going — up to 20 iterations — until it has enough evidence to answer.

Then it writes its response with page-level citations for every claim.
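The loop itself is generic. A hypothetical sketch, where `llm_step` stands in for one model call that returns the next tool invocation, and `finish_analysis` is the stop signal:

```python
def run_document_agent(llm_step, tools, question, max_iters=20):
    # Generic tool-use loop: the model picks a tool, observes the result,
    # and repeats until it calls the finishing tool or hits the cap.
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        action = llm_step(transcript)  # -> {"tool": ..., "args": {...}}
        if action["tool"] == "finish_analysis":
            return action["args"]["answer"]
        result = tools[action["tool"]](**action["args"])
        transcript.append({"tool": action["tool"], "result": result})
    return None  # iteration budget exhausted without an answer
```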

What This Looks Like in Practice

Here’s how a question like “Analyze the methodology and identify potential weaknesses” plays out against a research paper:

  1. Agent calls get_document_tree — sees the paper has sections for Introduction, Related Work, Methods (with subsections for Data Collection and Model Architecture), Results, and Discussion
  2. Agent calls read_section on the Methods section
  3. Agent calls search for “limitations” and “assumptions”
  4. Agent calls read_section on Discussion (which usually mentions limitations)
  5. Agent calls grep for specific statistical methods mentioned in Methods
  6. Agent calls finish_analysis and writes its answer

Six tool calls — five to gather evidence, one to finish. The answer covers the actual methodology with specific citations, not a summary of whatever chunks happened to be nearest to “methodology weaknesses” in embedding space.

Structured Output

Sometimes you don’t want prose. You want data.

Pass a JSON schema and the agent extracts structured data from the document:

nia document agent <source-id> "Extract all authors and their affiliations" \
  --schema '{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"affiliation":{"type":"string"}}}}'

The agent uses a return_structured_result tool that validates the output against your schema. Useful for pulling tables, entity lists, or specific fields out of PDFs programmatically.
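Schema validation can be as simple as walking the schema. A minimal sketch covering only the three types in the example above — a real implementation would use a full JSON Schema validator:

```python
schema = {"type": "array",
          "items": {"type": "object",
                    "properties": {"name": {"type": "string"},
                                   "affiliation": {"type": "string"}}}}

def matches(value, node):
    # Handles only "array", "object", and "string"; full JSON Schema
    # validation covers far more (required fields, formats, etc.).
    t = node["type"]
    if t == "array":
        return isinstance(value, list) and all(matches(v, node["items"]) for v in value)
    if t == "object":
        return isinstance(value, dict) and all(
            matches(value[k], s) for k, s in node["properties"].items() if k in value)
    return t == "string" and isinstance(value, str)
```

If the extracted result fails validation, the agent can be re-prompted until it conforms.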

Extended Thinking

The agent runs on Claude Opus 4.6 with extended thinking enabled by default. Before each tool call, the model reasons through what it knows, what it still needs, and which tool to use next.

This matters more than it sounds. Without thinking, the agent tends to grab the first relevant section and answer immediately. With thinking, it plans a research strategy — “I should check the Methods section first, then cross-reference with Results, and grep for the specific metric they mention.”

Streaming

You can watch the agent work in real-time:

nia document agent <source-id> "What are the key findings?" --stream

The stream shows thinking chunks, tool calls with arguments, tool results, and the final answer as it’s written. Useful for understanding what the agent is doing and why.

When to Use the Agent

Regular search (nia search query) is faster. It does a single retrieval pass and synthesizes an answer. Good for simple, direct questions where the answer lives in one or two chunks.

The document agent is for questions that need research:

  • Questions that span multiple sections
  • Questions that require cross-referencing
  • Structural questions (“summarize each section”, “compare chapters 3 and 5”)
  • Extraction tasks (pull all tables, all references, all entities)
  • Questions where you need high confidence with citations

Try It

Index any PDF or document with Nia, then run the agent:

# From the CLI
nia document agent <source-id> "your question"

# Via API
curl -X POST https://apigcp.trynia.ai/v2/document/agent \
  -H "Authorization: Bearer $NIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_id": "your-source-id",
    "query": "your question",
    "model": "claude-opus-4-6-1m",
    "thinking_enabled": true
  }'
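The same request from Python, using only the standard library. `build_request` is an illustrative helper that mirrors the curl payload above:

```python
import json
import os
import urllib.request

API_URL = "https://apigcp.trynia.ai/v2/document/agent"

def build_request(source_id, query, api_key):
    # Same payload and headers as the curl example.
    payload = {
        "source_id": source_id,
        "query": query,
        "model": "claude-opus-4-6-1m",
        "thinking_enabled": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To actually send it:
# req = build_request("your-source-id", "your question", os.environ["NIA_API_KEY"])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```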

Documents have structure. Your AI should use it.

Try Nia at trynia.ai.