# Crawler for LLMs

Lightweight web crawler optimized for scraping content to feed large language models.
## Overview

A high-performance, LLM-optimized web crawler built with TypeScript and Next.js (which exposes the crawler through an HTTP API). It efficiently crawls websites, extracts meaningful content, and processes the data into structured formats suitable for Large Language Models (LLMs). It supports content parsing, chunking, and retrieval mechanisms that facilitate fine-tuning and retrieval-augmented generation (RAG) workflows, and the processed output can be fed directly into any LLM API.
## Features
- Efficient Crawling: Uses modern TypeScript async/await patterns for concurrent crawling with configurable depth and page limits.
- Smart Content Processing: Extracts and structures web data using `cheerio` and `node-html-markdown`.
- LLM-Optimized Output: Leverages `RecursiveCharacterTextSplitter` from LangChain for context-aware text chunking (see the chunking sketch after this list).
- Edge Runtime Support: Designed for execution on edge computing platforms for improved speed and performance.
- Highly Configurable: Allows customization of crawl depth, maximum pages, chunk sizes, and overlap parameters.
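
The chunking step can be illustrated with a short sketch. This is a minimal example rather than the project's actual splitting code; the chunk sizes are illustrative, and the import path varies across LangChain versions (older releases export the splitter from `langchain/text_splitter`).

```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Markdown produced by the crawler for a single page (placeholder text here).
const pageMarkdown = "# Example Docs\n\nLong page content converted from HTML...";

// chunkSize and chunkOverlap are illustrative values, not the project's defaults.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// Attach the source URL as metadata so every chunk keeps its provenance.
const docs = await splitter.createDocuments(
  [pageMarkdown],
  [{ source: "https://example.com/docs/intro" }],
);

console.log(`${docs.length} chunks ready for embedding or fine-tuning data.`);
```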
## Technical Architecture

### Core Components
- Crawler Engine (`crawler.ts`)
  - Implements depth-first traversal for efficient crawling (a minimal sketch follows this list).
  - Manages URL deduplication and request throttling to prevent redundant requests.
  - Extracts and normalizes links for recursive crawling.
  - Ensures robust HTML content extraction with structured error handling.
- Seed Manager (`seed.ts`)
  - Orchestrates crawling sessions and document generation.
  - Utilizes LangChain’s text-splitting techniques to format data for LLM consumption.
  - Maintains source tracking for better data provenance.
- API Layer (`route.ts`)
  - Implements edge-compatible HTTP endpoints.
  - Handles request validation, error management, and streaming responses.
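
To make the crawler engine concrete, here is a minimal sketch of a depth-first crawl with URL deduplication and HTML-to-Markdown conversion, built on `cheerio` and `node-html-markdown`. It is an illustration, not the actual `crawler.ts`: the function names and options shape are assumptions, and throttling and structured error handling are omitted for brevity.

```ts
import * as cheerio from "cheerio";
import { NodeHtmlMarkdown } from "node-html-markdown";

interface CrawlOptions {
  maxDepth: number;
  maxPages: number;
}

interface Page {
  url: string;
  markdown: string;
}

// Depth-first crawl starting from `startUrl`, staying on the same origin.
export async function crawl(startUrl: string, options: CrawlOptions): Promise<Page[]> {
  const visited = new Set<string>(); // URL deduplication
  const pages: Page[] = [];
  const origin = new URL(startUrl).origin;

  async function visit(url: string, depth: number): Promise<void> {
    if (depth > options.maxDepth || pages.length >= options.maxPages || visited.has(url)) return;
    visited.add(url);

    const res = await fetch(url);
    if (!res.ok || !(res.headers.get("content-type") ?? "").includes("text/html")) return;
    const html = await res.text();

    // Strip non-content markup, then convert the remaining HTML to Markdown.
    const $ = cheerio.load(html);
    $("script, style, nav, footer").remove();
    pages.push({ url, markdown: NodeHtmlMarkdown.translate($.html()) });

    // Extract, normalize, and recursively follow same-origin links.
    for (const el of $("a[href]").toArray()) {
      const href = $(el).attr("href");
      if (!href) continue;
      let next: URL;
      try {
        next = new URL(href, url);
      } catch {
        continue; // skip malformed links
      }
      next.hash = "";
      if (next.origin === origin) await visit(next.toString(), depth + 1);
    }
  }

  await visit(startUrl, 0);
  return pages;
}
```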
## Usage
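
A typical call to the crawl endpoint might look like the following. The route path (`/api/crawl`), port, and request fields are assumptions for illustration; adjust them to match your deployment and the parameters actually defined in `route.ts`.

```ts
// Hypothetical client call to the crawl endpoint of a local dev server.
async function crawlSite(): Promise<void> {
  const res = await fetch("http://localhost:3000/api/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://example.com/docs", // site to crawl
      maxDepth: 2,                     // how many link hops to follow
      maxPages: 25,                    // hard cap on fetched pages
    }),
  });

  if (!res.ok) throw new Error(`Crawl request failed: ${res.status}`);

  const { documents } = await res.json();
  console.log(`Received ${documents.length} chunks, ready for a vector store or LLM prompt.`);
}

crawlSite().catch(console.error);
```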
### Real-World Applications
1. Training Data Generation
- Create domain-specific datasets for fine-tuning LLMs.
- Build high-quality knowledge bases for Retrieval-Augmented Generation (RAG).
2. Content Aggregation
- Automate documentation extraction and organization.
- Construct searchable knowledge repositories for enterprises.
- Implement content summarization pipelines for AI-based applications.
3. SEO & Website Analysis
- Perform content audits for websites.
- Analyze site structure for better indexing.
- Assess internal linking strategies for optimization.
## Contributing
We welcome contributions to enhance this project! You can contribute in the following areas:
- Rate limiting & politeness policies
- Support for additional content types (PDFs, dynamic content, etc.)
- Improved metadata extraction (schema, structured data, etc.)
- Parallel crawling optimizations for large-scale scraping
- Advanced filtering and content selection strategies
- Support for browser-based crawling (JavaScript-rendered content)
### Getting Started
1. Fork the repository.
2. Clone your fork (e.g., `git clone <your-fork-url>`).
3. Install dependencies (e.g., `npm install`).
4. Run the development server (e.g., `npm run dev`).
## Future Roadmap
- Browser automation integration for JavaScript-heavy pages.
- Advanced text processing algorithms for better data segmentation.
- Distributed crawling architecture for large-scale data gathering.
- Content deduplication strategies for reducing redundancy.
- Enhanced API rate limiting & caching mechanisms.
- Custom output formats & processors for better usability.
## License
This project is released under the MIT License.
🚀 Join us in scaling this project! Whether you’re improving crawling efficiency, enhancing data processing, or integrating with AI-powered applications, we welcome your contributions to make this web crawler more powerful and versatile for LLM applications.