Document Loaders
Every RAG pipeline starts with loading data. @hazeljs/rag ships 11 built-in document loaders covering the most common sources — local files, PDFs, Word documents, web pages, YouTube transcripts, and entire GitHub repositories — all returning the same Document[] interface, ready for chunking and indexing.
How loaders work
Every loader extends BaseDocumentLoader and returns an array of Document objects:
interface Document {
id?: string;
content: string; // the text that gets embedded
metadata?: Record<string, unknown>; // source, heading, page, url, etc.
embedding?: number[]; // populated after indexing
}
BaseDocumentLoader exposes a protected helper createDocument() that fills in a source path and a unique ID automatically, so custom loaders stay concise.
Loader overview
| Loader | Source | Zero extra deps? |
|---|---|---|
| TextFileLoader | .txt files | ✅ |
| MarkdownFileLoader | .md / .mdx — heading splits + YAML front-matter | ✅ |
| JSONFileLoader | .json arrays or objects — textKey / JSON Pointer extraction | ✅ |
| CSVFileLoader | .csv rows → documents with configurable column mapping | ✅ |
| HtmlFileLoader | .html tag stripping; CSS selectors via cheerio | ✅ / opt. |
| DirectoryLoader | Recursive walk; auto-detects loader by extension | ✅ |
| PdfLoader | PDFs via pdf-parse; split by page or full document | npm i pdf-parse |
| DocxLoader | Word docs via mammoth; plain text or HTML output | npm i mammoth |
| WebLoader | HTTP scraping; retry + timeout; CSS selectors via cheerio | ✅ / opt. |
| YouTubeTranscriptLoader | YouTube transcripts; no API key; segment by duration | ✅ |
| GitHubLoader | GitHub REST API; filter by directory / extension / maxFiles | ✅ |
File loaders
TextFileLoader
The simplest loader — reads a plain .txt file and returns it as a single document.
import { TextFileLoader } from '@hazeljs/rag';
const docs = await new TextFileLoader({
filePath: './notes/meeting.txt',
encoding: 'utf-8', // default
}).load();
// docs[0].metadata.source === '/abs/path/notes/meeting.txt'
MarkdownFileLoader
Parses Markdown files with two extra features:
- Heading splits — each H1, H2, or H3 section becomes its own Document with metadata.heading
- YAML front-matter — front-matter fields (title, date, tags, etc.) are copied into metadata
import { MarkdownFileLoader } from '@hazeljs/rag';
const docs = await new MarkdownFileLoader({
filePath: './docs/guide.md',
splitByHeading: true, // creates one Document per section
parseYamlFrontMatter: true, // merges front-matter into metadata
}).load();
docs.forEach(d => {
console.log(`${d.metadata?.heading} — ${d.content.slice(0, 80)}...`);
});
// "Installation — npm install @hazeljs/rag..."
// "Quick Start — The simplest way to get started..."
This loader is particularly powerful for documentation sites and knowledge bases where Markdown files have rich structure.
JSONFileLoader
Loads JSON files in two modes:
- Array mode (default) — each element in the root array becomes a document
- Object mode — the root object is treated as one document
import { JSONFileLoader } from '@hazeljs/rag';
// Array of articles — use 'body' field as content
const articleDocs = await new JSONFileLoader({
filePath: './data/articles.json',
textKey: 'body', // field used as document content
// metadataKeys: ['id', 'author', 'date'], // fields moved to metadata
}).load();
// Nested JSON — navigate with a JSON Pointer
const nestedDocs = await new JSONFileLoader({
filePath: './data/export.json',
jsonPointer: '/results/items', // navigate to the array at this path
textKey: 'description',
}).load();
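The jsonPointer option follows JSON Pointer (RFC 6901) paths. The sketch below is a simplified illustration of what that navigation does, not the library's actual implementation:

```typescript
// Simplified JSON Pointer navigation: split the path on '/', unescape
// "~1" → "/" and "~0" → "~", then walk the object one token at a time.
function getByPointer(root: unknown, pointer: string): unknown {
  if (pointer === '') return root;
  return pointer
    .slice(1) // drop the leading '/'
    .split('/')
    .map(t => t.replace(/~1/g, '/').replace(/~0/g, '~'))
    .reduce<unknown>((node, token) => {
      if (node == null || typeof node !== 'object') return undefined;
      return (node as Record<string, unknown>)[token];
    }, root);
}

const exported = {
  results: { items: [{ description: 'first' }, { description: 'second' }] },
};
// '/results/items' resolves to the nested array, whose elements then
// become documents via textKey.
const items = getByPointer(exported, '/results/items') as Array<{ description: string }>;
```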
CSVFileLoader
Converts each CSV row into a document. You control which columns become content and which become searchable metadata.
import { CSVFileLoader } from '@hazeljs/rag';
const docs = await new CSVFileLoader({
filePath: './data/products.csv',
contentColumns: ['name', 'description'], // concatenated as content
metadataColumns: ['sku', 'category', 'price'],
delimiter: ',', // default
hasHeader: true, // first row is headers
}).load();
// docs[0].content === 'Widget Pro — The best widget for professionals'
// docs[0].metadata === { sku: 'WP-001', category: 'Tools', price: '29.99' }
The built-in CSV parser handles quoted fields, escaped commas, and multi-line values — no external dependency required.
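Those quoting rules can be illustrated with a simplified sketch. This is not the library's parser, only a demonstration of the behavior described above (quoted fields may contain delimiters, newlines, and doubled quotes as escapes):

```typescript
// Minimal RFC 4180-style CSV parsing: inside quotes, delimiters and
// newlines are literal and "" is an escaped quote character.
function parseCsv(text: string, delimiter = ','): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') { field += '"'; i++; } // escaped quote
        else inQuotes = false;                          // closing quote
      } else field += ch;                               // literal, even ',' or '\n'
    } else if (ch === '"') inQuotes = true;
    else if (ch === delimiter) { row.push(field); field = ''; }
    else if (ch === '\n') { row.push(field); rows.push(row); row = []; field = ''; }
    else if (ch !== '\r') field += ch;
  }
  if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

const rows = parseCsv('sku,description\nWP-001,"Widget, ""Pro"" edition"\n');
// rows[1] is ['WP-001', 'Widget, "Pro" edition']
```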
HtmlFileLoader
Strips HTML tags and extracts clean text. Optionally uses cheerio for CSS-selector-based extraction when you only want a specific part of the page.
import { HtmlFileLoader } from '@hazeljs/rag';
// Basic — strips all tags
const docs = await new HtmlFileLoader({
filePath: './pages/about.html',
}).load();
// docs[0].metadata.title === 'About Us'
// With cheerio selector — only extract the <article> element
const articleDocs = await new HtmlFileLoader({
filePath: './pages/blog-post.html',
selector: 'article.post-content', // requires: npm install cheerio
}).load();
DirectoryLoader — bulk ingest
DirectoryLoader walks a directory tree, identifies each file's type by extension, and delegates to the appropriate loader automatically. It is the fastest way to index an entire knowledge base from disk.
import { DirectoryLoader } from '@hazeljs/rag';
const docs = await new DirectoryLoader({
dirPath: './knowledge-base',
recursive: true, // include subdirectories
// Optional filters
extensions: ['.md', '.txt', '.csv'], // only these extensions
exclude: ['**/node_modules/**', '**/.git/**'],
}).load();
const sources = [...new Set(docs.map(d => d.metadata?.source as string))];
console.log(`Loaded ${docs.length} documents from ${sources.length} files`);
Auto-detected extensions:
.txt → TextFileLoader · .md / .mdx → MarkdownFileLoader · .json → JSONFileLoader · .csv → CSVFileLoader · .html / .htm → HtmlFileLoader · .pdf → PdfLoader · .docx → DocxLoader
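Conceptually, that dispatch is an extension-to-loader lookup. The sketch below is a simplified illustration, not DirectoryLoader's actual code (which also consults the DocumentLoaderRegistry for custom loaders):

```typescript
// Map each supported extension to the loader that handles it.
const loaderByExtension: Record<string, string> = {
  '.txt': 'TextFileLoader',
  '.md': 'MarkdownFileLoader',
  '.mdx': 'MarkdownFileLoader',
  '.json': 'JSONFileLoader',
  '.csv': 'CSVFileLoader',
  '.html': 'HtmlFileLoader',
  '.htm': 'HtmlFileLoader',
  '.pdf': 'PdfLoader',
  '.docx': 'DocxLoader',
};

// Resolve a loader name from a file path; unknown extensions are skipped.
function loaderFor(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return undefined;
  return loaderByExtension[filePath.slice(dot).toLowerCase()];
}
```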
PDF and DOCX loaders
These loaders use optional peer dependencies, so you only install them when you need them.
PdfLoader
npm install pdf-parse
import { PdfLoader } from '@hazeljs/rag';
// One document per page — each page is independently searchable
const byPage = await new PdfLoader({
filePath: './reports/annual-report.pdf',
splitByPage: true,
}).load();
// byPage[0].metadata.page === 1
// byPage[1].metadata.page === 2
// Whole document as one chunk
const full = await new PdfLoader({
filePath: './contracts/terms.pdf',
splitByPage: false,
}).load();
splitByPage: true is recommended for long PDFs so that retrieval returns the specific page rather than the entire document.
DocxLoader
npm install mammoth
import { DocxLoader } from '@hazeljs/rag';
// Plain text output (best for embedding)
const textDocs = await new DocxLoader({
filePath: './contracts/agreement.docx',
outputFormat: 'text', // default
}).load();
// HTML output (preserves tables, lists, headings)
const htmlDocs = await new DocxLoader({
filePath: './reports/summary.docx',
outputFormat: 'html',
}).load();
WebLoader — scrape any URL
WebLoader fetches one or more URLs, strips HTML, and returns clean text. It includes configurable retry logic and timeout so it handles flaky external sites gracefully.
import { WebLoader } from '@hazeljs/rag';
// Single URL
const docs = await new WebLoader({
urls: ['https://hazeljs.com/docs/installation'],
timeout: 10_000, // ms
maxRetries: 3,
}).load();
// docs[0].metadata.url === 'https://hazeljs.com/docs/installation'
// docs[0].metadata.title === 'Installation — HazelJS'
// Multiple URLs at once
const batchDocs = await new WebLoader({
urls: [
'https://hazeljs.com/docs/packages/ai',
'https://hazeljs.com/docs/packages/rag',
'https://hazeljs.com/docs/packages/agent',
],
}).load();
// CSS selector (requires cheerio)
const articleOnly = await new WebLoader({
urls: ['https://hazeljs.com/blog/graphrag'],
selector: 'article.post-content', // npm install cheerio
}).load();
YouTubeTranscriptLoader
Downloads the transcript of any public YouTube video — no API key or developer account needed. The loader fetches the transcript directly from YouTube's internal API.
import { YouTubeTranscriptLoader } from '@hazeljs/rag';
// Full URL or just the video ID
const docs = await new YouTubeTranscriptLoader({
videoUrl: 'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
segmentDuration: 60, // group transcript lines into 60-second chunks
// null → return entire transcript as one document
}).load();
docs.forEach(d => {
const { startTime, endTime } = d.metadata as { startTime: number; endTime: number };
console.log(`[${startTime}s – ${endTime}s]: ${d.content.slice(0, 80)}`);
});
Setting segmentDuration is strongly recommended for long videos. Chunks of 60–120 seconds are a good starting point: short enough for precise retrieval, long enough to keep context intact.
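A hypothetical sketch of the grouping segmentDuration performs. The transcript-line shape and the bucketing rule here are assumptions for illustration, not the loader's internals:

```typescript
// Group timestamped transcript lines into fixed-duration buckets, so a
// 60-second setting yields chunks covering [0, 60), [60, 120), and so on.
interface TranscriptLine { start: number; text: string }

function segmentTranscript(lines: TranscriptLine[], segmentDuration: number) {
  const chunks: { startTime: number; endTime: number; content: string }[] = [];
  for (const line of lines) {
    const startTime = Math.floor(line.start / segmentDuration) * segmentDuration;
    const current = chunks[chunks.length - 1];
    if (current && current.startTime === startTime) {
      current.content += ' ' + line.text;          // same bucket: append
    } else {
      chunks.push({ startTime, endTime: startTime + segmentDuration, content: line.text });
    }
  }
  return chunks;
}
```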
GitHubLoader
Loads files from any GitHub repository using the GitHub REST API. No special setup is required for public repos, though adding a personal access token increases the rate limit from 60 to 5000 requests per hour.
import { GitHubLoader } from '@hazeljs/rag';
// Load the entire docs folder from a public repo
const docs = await new GitHubLoader({
owner: 'hazeljs',
repo: 'hazel',
ref: 'main', // branch or tag
directory: 'docs', // sub-directory (omit for root)
extensions: ['.md', '.mdx'], // only Markdown files
maxFiles: 100, // safety limit
maxFileSize: 200_000, // skip files > 200 KB
token: process.env.GITHUB_TOKEN, // optional but recommended
}).load();
// Each document's metadata includes the repo path and sha
docs.forEach(d => {
console.log(d.metadata?.path); // 'docs/installation.md'
});
For private repositories, a personal access token with repo scope is required.
Custom loaders
Using BaseDocumentLoader
Extend BaseDocumentLoader to add any data source. Call this.createDocument() to get consistent metadata (auto-generated ID, source path, timestamp):
import { BaseDocumentLoader, Loader, DocumentLoaderRegistry } from '@hazeljs/rag';
@Loader({
name: 'NotionLoader',
description: 'Loads pages from a Notion database',
extensions: [],
mimeTypes: ['application/vnd.notion'],
})
export class NotionLoader extends BaseDocumentLoader {
constructor(
private readonly databaseId: string,
private readonly token: string,
) {
super();
}
async load() {
const pages = await this.fetchNotionDatabase(this.databaseId, this.token);
return pages.map((page) =>
this.createDocument(page.content, {
source: `notion:${this.databaseId}/${page.id}`,
title: page.title,
lastEdited: page.lastEditedTime,
notionId: page.id,
}),
);
}
private async fetchNotionDatabase(id: string, token: string) {
const res = await fetch(`https://api.notion.com/v1/databases/${id}/query`, {
method: 'POST',
headers: {
Authorization: `Bearer ${token}`,
'Notion-Version': '2022-06-28',
},
});
return (await res.json()).results;
}
}
DocumentLoaderRegistry
Register custom loaders so DirectoryLoader and the registry can auto-detect and instantiate them:
// Register once at application startup
DocumentLoaderRegistry.register(
NotionLoader,
(databaseId: string) => new NotionLoader(databaseId, process.env.NOTION_TOKEN!),
);
// Resolve a loader by extension or MIME type
const loader = DocumentLoaderRegistry.getByExtension('.notion');
End-to-end example
Loading from multiple sources and indexing everything in one pipeline:
import {
DirectoryLoader,
GitHubLoader,
WebLoader,
YouTubeTranscriptLoader,
RAGPipeline,
OpenAIEmbeddings,
MemoryVectorStore,
RecursiveTextSplitter,
} from '@hazeljs/rag';
async function buildKnowledgeBase() {
// Setup pipeline
const embeddings = new OpenAIEmbeddings({ apiKey: process.env.OPENAI_API_KEY });
const vectorStore = new MemoryVectorStore(embeddings);
const pipeline = new RAGPipeline({
vectorStore,
embeddingProvider: embeddings,
textSplitter: new RecursiveTextSplitter({ chunkSize: 800, chunkOverlap: 150 }),
});
await pipeline.initialize();
// Load from all sources in parallel
const [localDocs, githubDocs, webDocs, youtubeDocs] = await Promise.all([
new DirectoryLoader({ dirPath: './knowledge-base', recursive: true }).load(),
new GitHubLoader({ owner: 'hazeljs', repo: 'hazel', directory: 'docs', extensions: ['.md'] }).load(),
new WebLoader({ urls: ['https://hazeljs.com/docs', 'https://hazeljs.com/blog'] }).load(),
new YouTubeTranscriptLoader({ videoUrl: 'https://www.youtube.com/watch?v=...', segmentDuration: 60 }).load(),
]);
// Index everything at once
const allDocs = [...localDocs, ...githubDocs, ...webDocs, ...youtubeDocs];
const ids = await pipeline.addDocuments(allDocs);
console.log(`Indexed ${ids.length} chunks from ${allDocs.length} source documents`);
return pipeline;
}
Best practices
Tag documents at load time
Adding tags to metadata at ingest time makes filtered search possible later:
const docs = (await new GitHubLoader({ owner: 'hazeljs', repo: 'hazel', directory: 'docs' }).load())
.map(d => ({ ...d, metadata: { ...d.metadata, source_type: 'github', repo: 'hazel' } }));
Use splitByPage for PDFs
A 100-page PDF loaded as one document creates a single enormous chunk. Set splitByPage: true to make each page independently retrievable.
Set segmentDuration for YouTube
Long videos (>10 min) should always be segmented. 60–120 seconds gives a good balance between specificity and context.
Rate-limit web scraping
Scraping many pages at once can trigger bot detection. Use a small batch size and add a delay between requests when scraping large sites.
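One way to add that batching and delay yourself. Here loadUrl is a placeholder for whatever per-URL loading you do (for example, a WebLoader constructed with a single URL), and the batch size and delay values are arbitrary starting points:

```typescript
// Polite batch loading: process a few URLs at a time and pause between
// batches so large sites are not hit with a burst of parallel requests.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function loadInBatches<T>(
  urls: string[],
  loadUrl: (url: string) => Promise<T>,
  batchSize = 3,
  delayMs = 1000,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(loadUrl))));
    if (i + batchSize < urls.length) await sleep(delayMs); // pause between batches
  }
  return results;
}
```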
Authenticate GitHub requests
Unauthenticated GitHub API calls are limited to 60 requests per hour per IP. Always set GITHUB_TOKEN in production.
Next steps
- GraphRAG Guide — build a knowledge graph from your loaded documents
- RAG Patterns Guide — advanced retrieval patterns
- Agentic RAG Guide — autonomous multi-hop retrieval
- RAG Package Reference — full API documentation