The naive approach fails
Most RAG systems chunk documents the same way: slide a window of N tokens across the text, split, embed, store. It works well enough for blog posts and documentation. It fails catastrophically for legal and regulatory text.
Consider DORA Article 25, which spans three paragraphs covering ICT third-party risk management. A naive 512-token chunker might split it like this:
- Chunk 1: End of Article 24 + beginning of Article 25, Paragraph 1
- Chunk 2: End of Paragraph 1 + Paragraph 2
- Chunk 3: End of Paragraph 2 + beginning of Article 26
Every chunk contains fragments of different legal provisions. When a user asks "What does Article 25 require?", the retriever finds chunks that partially contain the answer, mixed with irrelevant content from adjacent articles.
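The failure mode is easy to reproduce. Here is a minimal sketch of naive fixed-window chunking (a hypothetical helper, not from any particular RAG library), run over a token stream where one article ends and the next begins:

```python
# Naive fixed-window chunking: slide a window of `size` tokens with
# `overlap`, ignoring document structure entirely.

def naive_chunks(tokens, size=8, overlap=2):
    """Split a token list into fixed-size windows with overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# Two "articles" flattened into one token stream, as the chunker sees them.
doc = ("Article 24 text about incident reporting ends here . "
       "Article 25 paragraph one on third-party risk begins here .").split()

for chunk in naive_chunks(doc):
    print(" ".join(chunk))
```

The middle window straddles the Article 24 / Article 25 boundary: no chunk aligns with a single legal provision, which is exactly the problem in the bullet list above.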
Structure-aware chunking
Regulatory documents have explicit hierarchical structure: Regulation → Chapter → Section → Article → Paragraph → Sub-paragraph. This structure isn't decorative — it's legally meaningful. A paragraph's meaning depends on which article it belongs to, which chapter, which regulation.
Hierarchical chunking respects this structure:
- Parse the document tree (regulation → chapters → articles → paragraphs)
- Chunk at the paragraph level — the smallest legally self-contained unit
- Attach parent context to each chunk (article title, chapter, regulation)
- Store relationships between chunks (same article, same chapter, cross-references)
The result: every chunk knows exactly where it sits in the legal hierarchy. When retrieved, it carries enough context for the language model to produce a precise, well-cited answer.
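The first three steps can be sketched as follows, under a simplifying assumption: the document tree is already parsed into nested dicts (a real pipeline would build this from the source text). Each paragraph becomes one chunk carrying its full parent context. The field and variable names are illustrative, not from any specific implementation:

```python
# Walk the parsed tree (regulation -> chapters -> articles -> paragraphs)
# and emit one chunk per paragraph, with parent context attached.

def chunk_tree(regulation):
    chunks = []
    for chapter in regulation["chapters"]:
        for article in chapter["articles"]:
            for i, para in enumerate(article["paragraphs"], start=1):
                chunks.append({
                    "text": para,
                    "paragraph": i,
                    "article": article["title"],
                    "chapter": chapter["title"],
                    "regulation": regulation["title"],
                })
    return chunks

# A toy tree with one chapter, one article, two paragraphs.
dora = {
    "title": "DORA (Regulation (EU) 2022/2554)",
    "chapters": [{
        "title": "Chapter V: ICT Risk Management",
        "articles": [{
            "title": "Article 25: ICT third-party risk",
            "paragraphs": [
                "Financial entities shall manage ICT third-party risk...",
                "Policies shall be reviewed at regular intervals...",
            ],
        }],
    }],
}

chunks = chunk_tree(dora)
print(len(chunks), chunks[0]["article"])
```

Step 4 (storing relationships between chunks) is covered in the graph section below; here every chunk already knows its position in the hierarchy.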
The context attachment pattern
A hierarchical chunk doesn't just contain the paragraph text. It includes structured metadata:
```yaml
chunk:
  text: >
    Financial entities shall manage ICT third-party
    risk as an integral component of ICT risk...
  article: "Article 25: ICT third-party risk"
  chapter: "Chapter V: ICT Risk Management"
  regulation: "DORA (Regulation (EU) 2022/2554)"
  paragraph: 1
  cross_refs: ["Article 28", "Article 31"]
```
This metadata serves three purposes:
- Retrieval precision — the embedding captures both content and context
- Citation accuracy — the model can cite "DORA Article 25(1)" rather than "the document says..."
- Relationship traversal — follow cross-references to related provisions
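The first two purposes can be made concrete with a short sketch. Field names follow the example above; the two helpers are hypothetical, not an established API. The embedding input prepends parent context so the vector captures where the text sits, and the citation is assembled from structured fields rather than guessed by the model:

```python
# Build the context-prefixed string sent to the embedding model, and a
# precise citation string, from a chunk's structured metadata.

def embedding_text(chunk):
    """Context-prefixed text for embedding: hierarchy first, then content."""
    return (f'{chunk["regulation"]} | {chunk["chapter"]} | '
            f'{chunk["article"]}\n{chunk["text"]}')

def citation(chunk):
    """Precise legal citation like 'DORA Article 25(1)'."""
    article_no = chunk["article"].split(":")[0]  # "Article 25"
    return f'DORA {article_no}({chunk["paragraph"]})'

chunk = {
    "text": "Financial entities shall manage ICT third-party risk...",
    "article": "Article 25: ICT third-party risk",
    "chapter": "Chapter V: ICT Risk Management",
    "regulation": "DORA (Regulation (EU) 2022/2554)",
    "paragraph": 1,
}

print(citation(chunk))  # -> DORA Article 25(1)
```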
Scale considerations
Eryndal processes regulations, security frameworks, and CVE databases — over eight million nodes in total. At this scale, chunking strategy directly impacts retrieval quality and cost.
Hierarchical chunks are typically smaller than naive chunks (a single paragraph vs. 512 tokens), which means more chunks per document. But they're more precise, which means fewer chunks retrieved per query and less noise in the context window.
The tradeoff is worth it. Five precisely targeted chunks with full legal context produce better answers than ten large chunks with broken boundaries.
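Some back-of-the-envelope context-window math illustrates the tradeoff; the token counts here are illustrative assumptions, not measurements:

```python
# Rough context-window usage per query (all counts are assumptions).
naive = 10 * 512           # 10 window chunks of 512 tokens each
hier = 5 * (150 + 30)      # 5 paragraph chunks (~150 tokens) + ~30 tokens
                           # of attached hierarchy metadata each

print(naive, hier)  # 5120 vs 900 tokens
```

Even with metadata overhead, the precise chunks consume a fraction of the window, leaving room for instructions and multi-turn history.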
Beyond text: graph relationships
The final layer is connecting chunks through a knowledge graph. When DORA Article 25 references Article 28 (contractual arrangements), that cross-reference becomes an edge in the graph. When a user asks about third-party risk, the system can traverse from Article 25 to related provisions automatically.
This turns retrieval from a flat similarity search into a structured graph traversal — following the same relationships that a human regulatory expert would follow when researching a question.
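A minimal sketch of that traversal, using the cross-references from the example above as directed edges (the edge data beyond Articles 28 and 31 is illustrative):

```python
from collections import deque

# Directed edges: article -> articles it cross-references.
graph = {
    "Article 25": ["Article 28", "Article 31"],
    "Article 28": ["Article 30"],  # illustrative edge
    "Article 31": [],
    "Article 30": [],
}

def related(start, max_hops=1):
    """Collect provisions reachable within max_hops cross-references (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for ref in graph.get(node, []):
            if ref not in seen:
                seen.add(ref)
                frontier.append((ref, depth + 1))
    seen.discard(start)
    return sorted(seen)

print(related("Article 25"))              # direct cross-references
print(related("Article 25", max_hops=2))  # transitively related provisions
```

Capping `max_hops` keeps the retrieved neighborhood small; one or two hops typically mirrors how far a human expert chases cross-references before the trail stops being relevant.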