Crawl Data Flow (Admin)

This flow shows how crawl jobs are executed and how their output is added to the knowledge base.

Key Steps

  • Create a crawl job in Admin → Crawl Data (URL + limits).
  • Server stores a pending job in crawl_jobs.
  • crawlWorker picks up pending jobs and crawls pages via Playwright.
  • HTML is cleaned and converted to Markdown with turndown, then saved to the job record (a sketch of this worker step follows the list).
  • UI polls job status and shows results.
  • Add to Knowledge Base uploads the crawled markdown to OpenAI and indexes it in the tenant's vector store (see the sketch after the Mermaid flow).
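
The worker step can be pictured with a minimal sketch. It assumes Playwright and turndown (both named in the flow); the db.updateCrawlJob helper is a placeholder for the project's actual crawl_jobs access, and link-following plus the max links/depth limits are omitted for brevity.

```typescript
import { chromium } from "playwright";
import TurndownService from "turndown";

// Placeholder persistence layer; the real project presumably wraps its own DB client.
const db = {
  async updateCrawlJob(id: string, patch: Record<string, unknown>): Promise<void> {
    console.log("update crawl_jobs", id, patch);
  },
};

const turndown = new TurndownService();

// Process one pending crawl_jobs row: render the page, convert to Markdown,
// and store the result (or the error) back on the job record.
async function processPendingJob(job: { id: string; url: string }): Promise<void> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(job.url, { waitUntil: "domcontentloaded" });

    // Rendered HTML -> Markdown. A real worker would also strip nav/script noise
    // and follow links up to the configured max links/depth.
    const html = await page.content();
    const markdown = turndown.turndown(html);

    await db.updateCrawlJob(job.id, { status: "completed", markdown });
  } catch (err) {
    await db.updateCrawlJob(job.id, { status: "failed", error: String(err) });
  } finally {
    await browser.close();
  }
}
```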

Mermaid Flow

flowchart TD
    A["Admin Panel: Crawl Data"] --> B["Enter URL + max links/depth"]
    B --> C["POST /api/admin/tenant/{tenantId}/crawl/create"]
    C --> D["DB: create crawl_jobs record (status: pending)"]
    D --> E["Worker: crawlWorker polls pending jobs"]
    E --> F["Playwright crawls pages + extracts content"]
    F --> G["Convert HTML to Markdown (turndown)"]
    G --> H["Update crawl_jobs with markdown + status"]
    H --> I["UI polls /crawl/jobs and shows status"]
    I --> J["Click Add to Knowledge Base (completed only)"]
    J --> K["Create markdown file from job"]
    K --> L["POST /api/admin/tenant/{tenantId}/files/upload"]
    L --> M["Upload markdown to OpenAI (assistants)"]
    M --> N["Create or get tenant vector store"]
    N --> O["Add file to vector store"]
    O --> P["Update DB: tenant_files + vectorStoreId + openaiFileId"]
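
The final nodes (M through P) can be sketched with the OpenAI Node SDK. The tenant shape, the file naming, and the returned mapping are placeholders for the project's own helpers, not the actual implementation.

```typescript
import OpenAI, { toFile } from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload a completed job's markdown and attach it to the tenant's vector store.
async function addJobToKnowledgeBase(
  tenant: { id: string; vectorStoreId?: string },
  markdown: string,
) {
  // 1. Upload the markdown as an OpenAI file with purpose "assistants".
  const file = await openai.files.create({
    file: await toFile(Buffer.from(markdown, "utf8"), `crawl-${tenant.id}.md`),
    purpose: "assistants",
  });

  // 2. Reuse the tenant's vector store if one exists, otherwise create one.
  //    (Older SDK releases expose this under openai.beta.vectorStores.)
  const vectorStoreId =
    tenant.vectorStoreId ??
    (await openai.vectorStores.create({ name: `tenant-${tenant.id}` })).id;

  // 3. Add the uploaded file to the vector store so file_search can retrieve it.
  await openai.vectorStores.files.create(vectorStoreId, { file_id: file.id });

  // 4. The caller persists this mapping to tenant_files (vectorStoreId + openaiFileId).
  return { openaiFileId: file.id, vectorStoreId };
}
```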