Crawl Data Flow (Admin)
This flow shows how crawl jobs are created and executed, and how the crawled content is added to the tenant knowledge base.
Key Steps
- Create a crawl job in Admin → Crawl Data (URL plus link and depth limits); see the request sketch after this list.
- The server stores a pending job in crawl_jobs.
- crawlWorker picks up pending jobs and crawls pages via Playwright; see the worker sketch after this list.
- The crawled HTML is cleaned and converted to Markdown, then saved to the job record.
- The UI polls the job status and shows the results; a polling sketch follows the diagram.
- Add to Knowledge Base uploads the Markdown and indexes it in the tenant vector store; an upload sketch follows the diagram.
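The first two steps can be illustrated with a small client-side call. This is a minimal sketch: the JSON body fields (url, maxLinks, maxDepth) and the jobId in the response are assumptions, not the confirmed API contract.

```ts
// Hypothetical request shape for the crawl create endpoint.
interface CreateCrawlJobRequest {
  url: string;
  maxLinks: number;
  maxDepth: number;
}

// POST the job to the admin API; the server stores it in crawl_jobs with status "pending".
async function createCrawlJob(tenantId: string, req: CreateCrawlJobRequest): Promise<string> {
  const res = await fetch(`/api/admin/tenant/${tenantId}/crawl/create`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`Crawl job creation failed: ${res.status}`);
  const { jobId } = await res.json(); // assumed response field
  return jobId; // the pending job is later picked up by crawlWorker
}
```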
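The worker side of the flow roughly combines Playwright for rendering with turndown for HTML-to-Markdown conversion. The sketch below is simplified to a single page and a single pass; the db helper standing in for crawl_jobs access is hypothetical, and the real worker also follows links up to the configured limits.

```ts
import { chromium } from "playwright";
import TurndownService from "turndown";

// Hypothetical data-access helper standing in for the real crawl_jobs table access.
import { db } from "./db";

const turndown = new TurndownService();

// One crawlWorker pass: claim a pending job, render the page with Playwright,
// strip obvious chrome, convert the HTML to Markdown, and save it on the job record.
export async function processPendingJob(): Promise<void> {
  const job = await db.crawlJobs.findFirstPending(); // assumed helper
  if (!job) return;

  await db.crawlJobs.update(job.id, { status: "running" });
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(job.url, { waitUntil: "networkidle" });

    // Drop navigation/script noise before conversion; the real cleaning step is likely richer.
    await page.evaluate(() => {
      document.querySelectorAll("script, style, nav, footer").forEach((el) => el.remove());
    });

    const html = await page.content();
    const markdown = turndown.turndown(html);
    await db.crawlJobs.update(job.id, { status: "completed", markdown });
  } catch (err) {
    await db.crawlJobs.update(job.id, { status: "failed", error: String(err) });
  } finally {
    await browser.close();
  }
}
```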
Mermaid Flow
flowchart TD
A["Admin Panel: Crawl Data"] --> B["Enter URL + max links/depth"]
B --> C["POST /api/admin/tenant/{tenantId}/crawl/create"]
C --> D["DB: create crawl_jobs record (status: pending)"]
D --> E["Worker: crawlWorker polls pending jobs"]
E --> F["Playwright crawls pages + extracts content"]
F --> G["Convert HTML to Markdown (turndown)"]
G --> H["Update crawl_jobs with markdown + status"]
H --> I["UI polls /crawl/jobs and shows status"]
I --> J["Click Add to Knowledge Base (completed only)"]
J --> K["Create markdown file from job"]
K --> L["POST /api/admin/tenant/{tenantId}/files/upload"]
L --> M["Upload markdown to OpenAI (assistants)"]
M --> N["Create or get tenant vector store"]
N --> O["Add file to vector store"]
O --> P["Update DB: tenant_files + vectorStoreId + openaiFileId"]
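The UI polling step (I in the diagram) can be approximated as a simple fetch loop. The jobs endpoint path, response shape, and status values below are assumptions based on the flow above.

```ts
// Assumed job shape returned by the jobs endpoint.
interface CrawlJob {
  id: string;
  status: "pending" | "running" | "completed" | "failed";
}

// Poll the tenant's crawl jobs until the given job reaches a terminal status.
async function waitForJob(tenantId: string, jobId: string): Promise<CrawlJob> {
  for (;;) {
    const res = await fetch(`/api/admin/tenant/${tenantId}/crawl/jobs`);
    const jobs: CrawlJob[] = await res.json();
    const job = jobs.find((j) => j.id === jobId);
    if (job && (job.status === "completed" || job.status === "failed")) return job;
    await new Promise((resolve) => setTimeout(resolve, 3000)); // poll every 3 s
  }
}
```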
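The Add to Knowledge Base branch (J through P) maps roughly onto the OpenAI SDK's file and vector store endpoints, as sketched below. The db helpers, field names, and temp-file path are assumptions, and depending on SDK version the vector store client may live under openai.beta.vectorStores rather than openai.vectorStores.

```ts
import fs from "node:fs";
import OpenAI from "openai";

// Hypothetical data-access helpers for crawl_jobs, tenants, and tenant_files.
import { db } from "./db";

const openai = new OpenAI();

// Write the job's Markdown to a file, upload it for assistants use, ensure the tenant
// has a vector store, attach the file, and record the resulting IDs.
export async function addToKnowledgeBase(tenantId: string, jobId: string): Promise<void> {
  const job = await db.crawlJobs.get(jobId); // assumed helper
  const path = `/tmp/crawl-${jobId}.md`;
  fs.writeFileSync(path, job.markdown);

  // 1. Upload the Markdown file to OpenAI with the "assistants" purpose.
  const file = await openai.files.create({
    file: fs.createReadStream(path),
    purpose: "assistants",
  });

  // 2. Reuse the tenant's vector store, or create one on first upload.
  const tenant = await db.tenants.get(tenantId); // assumed helper
  const vectorStoreId =
    tenant.vectorStoreId ??
    (await openai.vectorStores.create({ name: `tenant-${tenantId}` })).id;

  // 3. Index the uploaded file in the vector store.
  await openai.vectorStores.files.create(vectorStoreId, { file_id: file.id });

  // 4. Persist the linkage so the admin UI can list the file later.
  await db.tenantFiles.create({
    tenantId,
    jobId,
    openaiFileId: file.id,
    vectorStoreId,
  });
}
```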