# Crawler for LLMs

Lightweight web crawler optimized for scraping content to feed large language models.
## Overview

A high-performance, LLM-optimized web crawler built with TypeScript and Next.js (which exposes the crawler through an HTTP API). It efficiently crawls websites, extracts meaningful content, and processes the data into structured formats suitable for Large Language Models (LLMs). It supports content parsing, chunking, and retrieval mechanisms that facilitate fine-tuning and retrieval-augmented generation (RAG) workflows, and the processed output can be fed directly into any LLM API.
## Features
- Efficient Crawling: Uses modern TypeScript async/await patterns for concurrent crawling with configurable depth and page limits.
- Smart Content Processing: Extracts and structures web data using `cheerio` and `node-html-markdown`.
- LLM-Optimized Output: Leverages `RecursiveCharacterTextSplitter` from LangChain for context-aware text chunking (see the chunking sketch after this list).
- Edge Runtime Support: Designed for execution on edge computing platforms for improved speed and performance.
- Highly Configurable: Allows customization of crawl depth, maximum pages, chunk sizes, and overlap parameters.
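
The chunking step can be illustrated with a short sketch. This is a minimal example rather than the project's actual splitting code; the chunk sizes are illustrative, and the import path varies across LangChain versions (older releases export the splitter from `langchain/text_splitter`).

```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Markdown produced by the crawler for a single page (placeholder text here).
const pageMarkdown = "# Example Docs\n\nLong page content converted from HTML...";

// chunkSize and chunkOverlap are illustrative values, not the project's defaults.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// Attach the source URL as metadata so every chunk keeps its provenance.
const docs = await splitter.createDocuments(
  [pageMarkdown],
  [{ source: "https://example.com/docs/intro" }],
);

console.log(`${docs.length} chunks ready for embedding or fine-tuning data.`);
```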
## Technical Architecture

### Core Components
- Crawler Engine (`crawler.ts`)
  - Implements depth-first traversal for efficient crawling (a minimal sketch follows this list).
  - Manages URL deduplication and request throttling to prevent redundant requests.
  - Extracts and normalizes links for recursive crawling.
  - Ensures robust HTML content extraction with structured error handling.
- Seed Manager (`seed.ts`)
  - Orchestrates crawling sessions and document generation.
  - Utilizes LangChain’s text-splitting techniques to format data for LLM consumption.
  - Maintains source tracking for better data provenance.
- API Layer (`route.ts`)
  - Implements edge-compatible HTTP endpoints.
  - Handles request validation, error management, and streaming responses.
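
To make the crawler engine concrete, here is a minimal sketch of a depth-first crawl with URL deduplication and HTML-to-Markdown conversion, built on `cheerio` and `node-html-markdown`. It is an illustration, not the actual `crawler.ts`: the function names and options shape are assumptions, and throttling and structured error handling are omitted for brevity.

```ts
import * as cheerio from "cheerio";
import { NodeHtmlMarkdown } from "node-html-markdown";

interface CrawlOptions {
  maxDepth: number;
  maxPages: number;
}

interface Page {
  url: string;
  markdown: string;
}

// Depth-first crawl starting from `startUrl`, staying on the same origin.
export async function crawl(startUrl: string, options: CrawlOptions): Promise<Page[]> {
  const visited = new Set<string>(); // URL deduplication
  const pages: Page[] = [];
  const origin = new URL(startUrl).origin;

  async function visit(url: string, depth: number): Promise<void> {
    if (depth > options.maxDepth || pages.length >= options.maxPages || visited.has(url)) return;
    visited.add(url);

    const res = await fetch(url);
    if (!res.ok || !(res.headers.get("content-type") ?? "").includes("text/html")) return;
    const html = await res.text();

    // Strip non-content markup, then convert the remaining HTML to Markdown.
    const $ = cheerio.load(html);
    $("script, style, nav, footer").remove();
    pages.push({ url, markdown: NodeHtmlMarkdown.translate($.html()) });

    // Extract, normalize, and recursively follow same-origin links.
    for (const el of $("a[href]").toArray()) {
      const href = $(el).attr("href");
      if (!href) continue;
      let next: URL;
      try {
        next = new URL(href, url);
      } catch {
        continue; // skip malformed links
      }
      next.hash = "";
      if (next.origin === origin) await visit(next.toString(), depth + 1);
    }
  }

  await visit(startUrl, 0);
  return pages;
}
```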
## Usage
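
A typical call to the crawl endpoint might look like the following. The route path (`/api/crawl`), port, and request fields are assumptions for illustration; adjust them to match your deployment and the parameters actually defined in `route.ts`.

```ts
// Hypothetical client call to the crawl endpoint of a local dev server.
async function crawlSite(): Promise<void> {
  const res = await fetch("http://localhost:3000/api/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://example.com/docs", // site to crawl
      maxDepth: 2,                     // how many link hops to follow
      maxPages: 25,                    // hard cap on fetched pages
    }),
  });

  if (!res.ok) throw new Error(`Crawl request failed: ${res.status}`);

  const { documents } = await res.json();
  console.log(`Received ${documents.length} chunks, ready for a vector store or LLM prompt.`);
}

crawlSite().catch(console.error);
```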
### Real-World Applications
1. Training Data Generation
- Create domain-specific datasets for fine-tuning LLMs.
- Build high-quality knowledge bases for Retrieval-Augmented Generation (RAG).
2. Content Aggregation
- Automate documentation extraction and organization.
- Construct searchable knowledge repositories for enterprises.
- Implement content summarization pipelines for AI-based applications.
3. SEO & Website Analysis
- Perform content audits for websites.
- Analyze site structure for better indexing.
- Assess internal linking strategies for optimization.
## Contributing
We welcome contributions to enhance this project! You can contribute in the following areas:
- Rate limiting & politeness policies
- Support for additional content types (PDFs, dynamic content, etc.)
- Improved metadata extraction (schema, structured data, etc.)
- Parallel crawling optimizations for large-scale scraping
- Advanced filtering and content selection strategies
- Support for browser-based crawling (JavaScript-rendered content)
### Getting Started
1. Fork the repository.
2. Clone your fork (e.g., `git clone <your-fork-url>`).
3. Install dependencies (e.g., `npm install`).
4. Run the development server (e.g., `npm run dev`).
## Future Roadmap
- Browser automation integration for JavaScript-heavy pages.
- Advanced text processing algorithms for better data segmentation.
- Distributed crawling architecture for large-scale data gathering.
- Content deduplication strategies for reducing redundancy.
- Enhanced API rate limiting & caching mechanisms.
- Custom output formats & processors for better usability.
## License
This project is released under the MIT License.
🚀 Join us in scaling this project! Whether you’re improving crawling efficiency, enhancing data processing, or integrating with AI-powered applications, we welcome your contributions to make this web crawler more powerful and versatile for LLM applications.