Document Loaders

Every RAG pipeline starts with loading data. @hazeljs/rag ships 11 built-in document loaders covering the most common sources — local files, PDFs, Word documents, web pages, YouTube transcripts, and entire GitHub repositories — all returning the same Document[] interface, ready for chunking and indexing.

How loaders work

Every loader extends BaseDocumentLoader and returns an array of Document objects:

interface Document {
  id?: string;
  content: string;             // the text that gets embedded
  metadata?: Record<string, unknown>;  // source, heading, page, url, etc.
  embedding?: number[];        // populated after indexing
}

BaseDocumentLoader exposes a protected helper createDocument() that fills in a source path and a unique ID automatically, so custom loaders stay concise.
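As a rough sketch of that behavior (this is an illustration of what the helper does, not the library's actual implementation — the ID-generation strategy here is an assumption):

```typescript
import { randomUUID } from 'node:crypto';

interface Document {
  id?: string;
  content: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical sketch of what createDocument() does: attach an
// auto-generated unique ID and merge in the caller-supplied metadata.
function createDocument(
  content: string,
  metadata: Record<string, unknown> = {},
): Document {
  return {
    id: randomUUID(),            // unique ID per document
    content,
    metadata: { ...metadata },   // source path, heading, page, etc.
  };
}

const doc = createDocument('Hello world', { source: './notes/hello.txt' });
```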

Loader overview

  • TextFileLoader — .txt files
  • MarkdownFileLoader — .md / .mdx; heading splits + YAML front-matter
  • JSONFileLoader — .json arrays or objects; textKey / JSON Pointer extraction
  • CSVFileLoader — .csv rows → documents with configurable column mapping
  • HtmlFileLoader — .html tag stripping; CSS selectors via optional cheerio
  • DirectoryLoader — recursive walk; auto-detects loader by extension
  • PdfLoader — PDFs via pdf-parse; split by page or full document (npm i pdf-parse)
  • DocxLoader — Word docs via mammoth; plain text or HTML output (npm i mammoth)
  • WebLoader — HTTP scraping; retry + timeout; CSS selectors via optional cheerio
  • YouTubeTranscriptLoader — YouTube transcripts; no API key; segment by duration
  • GitHubLoader — GitHub REST API; filter by directory / extension / maxFiles

All loaders work with zero extra dependencies except where noted above (pdf-parse, mammoth, and the optional cheerio).

File loaders

TextFileLoader

The simplest loader — reads a plain .txt file and returns it as a single document.

import { TextFileLoader } from '@hazeljs/rag';

const docs = await new TextFileLoader({
  filePath: './notes/meeting.txt',
  encoding: 'utf-8',            // default
}).load();

// docs[0].metadata.source === '/abs/path/notes/meeting.txt'

MarkdownFileLoader

Parses Markdown files with two extra features:

  • Heading splits — each H1, H2, or H3 section becomes its own Document with metadata.heading
  • YAML front-matter — front-matter fields (title, date, tags, etc.) are copied into metadata

import { MarkdownFileLoader } from '@hazeljs/rag';

const docs = await new MarkdownFileLoader({
  filePath: './docs/guide.md',
  splitByHeading: true,           // creates one Document per section
  parseYamlFrontMatter: true,     // merges front-matter into metadata
}).load();

docs.forEach(d => {
  console.log(`${d.metadata?.heading} — ${d.content.slice(0, 80)}...`);
});
// "Installation — npm install @hazeljs/rag..."
// "Quick Start — The simplest way to get started..."

This loader is particularly powerful for documentation sites and knowledge bases where Markdown files have rich structure.

JSONFileLoader

Loads JSON files in two modes:

  • Array mode (default) — each element in the root array becomes a document
  • Object mode — the root object is treated as one document

import { JSONFileLoader } from '@hazeljs/rag';

// Array of articles — use 'body' field as content
const articleDocs = await new JSONFileLoader({
  filePath: './data/articles.json',
  textKey: 'body',              // field used as document content
  // metadataKeys: ['id', 'author', 'date'],  // fields moved to metadata
}).load();

// Nested JSON — navigate with a JSON Pointer
const nestedDocs = await new JSONFileLoader({
  filePath: './data/export.json',
  jsonPointer: '/results/items', // navigate to the array at this path
  textKey: 'description',
}).load();

CSVFileLoader

Converts each CSV row into a document. You control which columns become content and which become searchable metadata.

import { CSVFileLoader } from '@hazeljs/rag';

const docs = await new CSVFileLoader({
  filePath: './data/products.csv',
  contentColumns: ['name', 'description'],  // concatenated as content
  metadataColumns: ['sku', 'category', 'price'],
  delimiter: ',',                            // default
  hasHeader: true,                           // first row is headers
}).load();

// docs[0].content === 'Widget Pro — The best widget for professionals'
// docs[0].metadata === { sku: 'WP-001', category: 'Tools', price: '29.99' }

The built-in CSV parser handles quoted fields, escaped commas, and multi-line values — no external dependency required.
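The library's parser is internal, but as a rough illustration of the technique, a minimal RFC 4180-style quoted-field parser looks like this (a sketch, not @hazeljs/rag's actual code):

```typescript
// Minimal illustration of quoted-field CSV parsing: fields are
// comma-separated; a quoted field may contain commas and newlines,
// and a doubled quote ("") inside quotes is an escaped quote.
function parseCsvRecord(record: string): string[] {
  const fields: string[] = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < record.length; i++) {
    const ch = record[i];
    if (inQuotes) {
      if (ch === '"' && record[i + 1] === '"') { field += '"'; i++; }  // escaped quote
      else if (ch === '"') inQuotes = false;                           // closing quote
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;                                                 // opening quote
    } else if (ch === ',') {
      fields.push(field); field = '';                                  // field boundary
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

parseCsvRecord('"Widget, Pro","He said ""hi""",29.99');
// → ['Widget, Pro', 'He said "hi"', '29.99']
```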

HtmlFileLoader

Strips HTML tags and extracts clean text. Optionally uses cheerio for CSS-selector-based extraction when you only want a specific part of the page.

import { HtmlFileLoader } from '@hazeljs/rag';

// Basic — strips all tags
const docs = await new HtmlFileLoader({
  filePath: './pages/about.html',
}).load();
// docs[0].metadata.title === 'About Us'

// With cheerio selector — only extract the <article> element
const articleDocs = await new HtmlFileLoader({
  filePath: './pages/blog-post.html',
  selector: 'article.post-content',   // requires: npm install cheerio
}).load();

DirectoryLoader — bulk ingest

DirectoryLoader walks a directory tree, identifies each file's type by extension, and delegates to the appropriate loader automatically. It is the fastest way to index an entire knowledge base from disk.

import { DirectoryLoader } from '@hazeljs/rag';

const docs = await new DirectoryLoader({
  dirPath: './knowledge-base',
  recursive: true,                         // include subdirectories

  // Optional filters
  extensions: ['.md', '.txt', '.csv'],     // only these extensions
  exclude: ['**/node_modules/**', '**/.git/**'],
}).load();

const sources = [...new Set(docs.map(d => d.metadata?.source as string))];
console.log(`Loaded ${docs.length} documents from ${sources.length} files`);

Auto-detected extensions:

.txt → TextFileLoader · .md / .mdx → MarkdownFileLoader · .json → JSONFileLoader · .csv → CSVFileLoader · .html / .htm → HtmlFileLoader · .pdf → PdfLoader · .docx → DocxLoader
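Conceptually, this dispatch is just an extension-to-loader lookup; a hypothetical sketch (the map mirrors the list above, but this is not the library's internal code):

```typescript
import { extname } from 'node:path';

// Hypothetical extension → loader-name map, mirroring DirectoryLoader's
// auto-detection table.
const loaderByExtension: Record<string, string> = {
  '.txt': 'TextFileLoader',
  '.md': 'MarkdownFileLoader',
  '.mdx': 'MarkdownFileLoader',
  '.json': 'JSONFileLoader',
  '.csv': 'CSVFileLoader',
  '.html': 'HtmlFileLoader',
  '.htm': 'HtmlFileLoader',
  '.pdf': 'PdfLoader',
  '.docx': 'DocxLoader',
};

// Case-insensitive lookup; unknown extensions return undefined (skipped).
function detectLoader(filePath: string): string | undefined {
  return loaderByExtension[extname(filePath).toLowerCase()];
}

detectLoader('./docs/Guide.MD');   // → 'MarkdownFileLoader'
```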


PDF and DOCX loaders

These loaders use optional peer dependencies, so you only install them when you need them.

PdfLoader

npm install pdf-parse

import { PdfLoader } from '@hazeljs/rag';

// One document per page — each page is independently searchable
const byPage = await new PdfLoader({
  filePath: './reports/annual-report.pdf',
  splitByPage: true,
}).load();
// byPage[0].metadata.page === 1
// byPage[1].metadata.page === 2

// Whole document as one chunk
const full = await new PdfLoader({
  filePath: './contracts/terms.pdf',
  splitByPage: false,
}).load();

splitByPage: true is recommended for long PDFs so that retrieval returns the specific page rather than the entire document.

DocxLoader

npm install mammoth

import { DocxLoader } from '@hazeljs/rag';

// Plain text output (best for embedding)
const textDocs = await new DocxLoader({
  filePath: './contracts/agreement.docx',
  outputFormat: 'text',   // default
}).load();

// HTML output (preserves tables, lists, headings)
const htmlDocs = await new DocxLoader({
  filePath: './reports/summary.docx',
  outputFormat: 'html',
}).load();

WebLoader — scrape any URL

WebLoader fetches one or more URLs, strips HTML, and returns clean text. It includes configurable retry logic and timeout so it handles flaky external sites gracefully.

import { WebLoader } from '@hazeljs/rag';

// Single URL
const docs = await new WebLoader({
  urls: ['https://hazeljs.com/docs/installation'],
  timeout: 10_000,    // ms
  maxRetries: 3,
}).load();
// docs[0].metadata.url === 'https://hazeljs.com/docs/installation'
// docs[0].metadata.title === 'Installation — HazelJS'

// Multiple URLs at once
const batchDocs = await new WebLoader({
  urls: [
    'https://hazeljs.com/docs/packages/ai',
    'https://hazeljs.com/docs/packages/rag',
    'https://hazeljs.com/docs/packages/agent',
  ],
}).load();

// CSS selector (requires cheerio)
const articleOnly = await new WebLoader({
  urls: ['https://hazeljs.com/blog/graphrag'],
  selector: 'article.post-content',   // npm install cheerio
}).load();

YouTubeTranscriptLoader

Downloads the transcript of any public YouTube video — no API key or developer account needed. The loader fetches the transcript directly from YouTube's internal API.

import { YouTubeTranscriptLoader } from '@hazeljs/rag';

// Full URL or just the video ID
const docs = await new YouTubeTranscriptLoader({
  videoUrl: 'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
  segmentDuration: 60,   // group transcript lines into 60-second chunks
                          // null → return entire transcript as one document
}).load();

docs.forEach(d => {
  const { startTime, endTime } = d.metadata as { startTime: number; endTime: number };
  console.log(`[${startTime}s – ${endTime}s]: ${d.content.slice(0, 80)}`);
});

Setting segmentDuration is strongly recommended for long videos. Chunks of 60–120 seconds are a good starting point — short enough for precise retrieval but long enough to keep context intact.
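The grouping itself is straightforward; as an illustrative sketch (not the loader's internals), timed lines can be bucketed by fixed-duration windows like this:

```typescript
interface TranscriptLine {
  text: string;
  start: number;   // seconds from the beginning of the video
}

// Illustrative duration-based segmentation: each line falls into the
// bucket floor(start / segmentDuration), and each bucket becomes one chunk.
function segmentTranscript(
  lines: TranscriptLine[],
  segmentDuration: number,
): { startTime: number; endTime: number; content: string }[] {
  const buckets = new Map<number, TranscriptLine[]>();
  for (const line of lines) {
    const key = Math.floor(line.start / segmentDuration);
    const group = buckets.get(key) ?? [];
    group.push(line);
    buckets.set(key, group);
  }
  return [...buckets.entries()].map(([key, group]) => ({
    startTime: key * segmentDuration,
    endTime: (key + 1) * segmentDuration,
    content: group.map(l => l.text).join(' '),
  }));
}

const segments = segmentTranscript(
  [
    { text: 'Welcome back.', start: 5 },
    { text: 'Today we cover RAG.', start: 30 },
    { text: 'First, loaders.', start: 65 },
  ],
  60,
);
// → two segments: one covering 0s–60s, one covering 60s–120s
```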


GitHubLoader

Loads files from any GitHub repository using the GitHub REST API. No special setup is required for public repos, though adding a personal access token increases the rate limit from 60 to 5000 requests per hour.

import { GitHubLoader } from '@hazeljs/rag';

// Load the entire docs folder from a public repo
const docs = await new GitHubLoader({
  owner: 'hazeljs',
  repo: 'hazel',
  ref: 'main',                   // branch or tag
  directory: 'docs',             // sub-directory (omit for root)
  extensions: ['.md', '.mdx'],   // only Markdown files
  maxFiles: 100,                 // safety limit
  maxFileSize: 200_000,          // skip files > 200 KB
  token: process.env.GITHUB_TOKEN,  // optional but recommended
}).load();

// Each document's metadata includes the repo path and sha
docs.forEach(d => {
  console.log(d.metadata?.path);  // 'docs/installation.md'
});

For private repositories, a personal access token with repo scope is required.


Custom loaders

Using BaseDocumentLoader

Extend BaseDocumentLoader to add any data source. Call this.createDocument() to get consistent metadata (auto-generated ID, source path, timestamp):

import { BaseDocumentLoader, Loader, DocumentLoaderRegistry } from '@hazeljs/rag';

@Loader({
  name: 'NotionLoader',
  description: 'Loads pages from a Notion database',
  extensions: ['.notion'],
  mimeTypes: ['application/vnd.notion'],
})
export class NotionLoader extends BaseDocumentLoader {
  constructor(
    private readonly databaseId: string,
    private readonly token: string,
  ) {
    super();
  }

  async load() {
    const pages = await this.fetchNotionDatabase(this.databaseId, this.token);

    return pages.map((page) =>
      this.createDocument(page.content, {
        source: `notion:${this.databaseId}/${page.id}`,
        title: page.title,
        lastEdited: page.lastEditedTime,
        notionId: page.id,
      }),
    );
  }

  private async fetchNotionDatabase(id: string, token: string) {
    const res = await fetch(`https://api.notion.com/v1/databases/${id}/query`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Notion-Version': '2022-06-28',
      },
    });
    return (await res.json()).results;
  }
}

DocumentLoaderRegistry

Register custom loaders so DirectoryLoader and the registry can auto-detect and instantiate them:

// Register once at application startup
DocumentLoaderRegistry.register(
  NotionLoader,
  (databaseId: string) => new NotionLoader(databaseId, process.env.NOTION_TOKEN!),
);

// Resolve a loader by extension or MIME type
const loader = DocumentLoaderRegistry.getByExtension('.notion');

End-to-end example

Loading from multiple sources and indexing everything in one pipeline:

import {
  DirectoryLoader,
  GitHubLoader,
  WebLoader,
  YouTubeTranscriptLoader,
  RAGPipeline,
  OpenAIEmbeddings,
  MemoryVectorStore,
  RecursiveTextSplitter,
} from '@hazeljs/rag';

async function buildKnowledgeBase() {
  // Setup pipeline
  const embeddings = new OpenAIEmbeddings({ apiKey: process.env.OPENAI_API_KEY });
  const vectorStore = new MemoryVectorStore(embeddings);
  const pipeline = new RAGPipeline({
    vectorStore,
    embeddingProvider: embeddings,
    textSplitter: new RecursiveTextSplitter({ chunkSize: 800, chunkOverlap: 150 }),
  });
  await pipeline.initialize();

  // Load from all sources in parallel
  const [localDocs, githubDocs, webDocs, youtubeDocs] = await Promise.all([
    new DirectoryLoader({ dirPath: './knowledge-base', recursive: true }).load(),
    new GitHubLoader({ owner: 'hazeljs', repo: 'hazel', directory: 'docs', extensions: ['.md'] }).load(),
    new WebLoader({ urls: ['https://hazeljs.com/docs', 'https://hazeljs.com/blog'] }).load(),
    new YouTubeTranscriptLoader({ videoUrl: 'https://www.youtube.com/watch?v=...', segmentDuration: 60 }).load(),
  ]);

  // Index everything at once
  const allDocs = [...localDocs, ...githubDocs, ...webDocs, ...youtubeDocs];
  const ids = await pipeline.addDocuments(allDocs);

  console.log(`Indexed ${ids.length} chunks from ${allDocs.length} source documents`);
  return pipeline;
}

Best practices

Tag documents at load time

Adding tags to metadata at ingest time makes filtered search possible later:

const docs = (await new GitHubLoader({ owner: 'hazeljs', repo: 'hazel', directory: 'docs' }).load())
  .map(d => ({ ...d, metadata: { ...d.metadata, source_type: 'github', repo: 'hazel' } }));

Use splitByPage for PDFs

A 100-page PDF loaded as one document creates a single enormous chunk. Set splitByPage: true to make each page independently retrievable.

Set segmentDuration for YouTube

Long videos (>10 min) should always be segmented. 60–120 seconds gives a good balance between specificity and context.

Rate-limit web scraping

Scraping many pages at once can trigger bot detection. Use a small batch size and add a delay between requests when scraping large sites.
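One way to do that is a small batching helper. This is a generic sketch, not a @hazeljs/rag API — with this package, the loadBatch callback would be something like batch => new WebLoader({ urls: batch }).load():

```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical helper: process URLs in small batches with a pause between
// batches, so large scrapes stay under bot-detection thresholds.
async function loadInBatches<T>(
  urls: string[],
  loadBatch: (batch: string[]) => Promise<T[]>,
  batchSize = 3,
  delayMs = 1_000,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    results.push(...(await loadBatch(urls.slice(i, i + batchSize))));
    if (i + batchSize < urls.length) await sleep(delayMs);   // pause before the next batch
  }
  return results;
}
```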

Authenticate GitHub requests

Unauthenticated GitHub API calls are limited to 60 requests per hour per IP. Always set GITHUB_TOKEN in production.


Next steps