Crawl Data Flow (Admin)
This flow shows how crawl jobs are created and executed, and how the crawled content is added to the tenant knowledge base.
Key Steps
- Create a crawl job in Admin → Crawl Data (URL plus link and depth limits); see the request sketch after this list.
- The server stores a pending job in crawl_jobs.
- crawlWorker picks up pending jobs and crawls pages via Playwright; see the worker sketch after this list.
- The crawled HTML is cleaned and converted to Markdown, then saved to the job record.
- The UI polls the job status and shows the results; a polling sketch follows the diagram.
- Add to Knowledge Base uploads the Markdown and indexes it in the tenant vector store; an upload sketch follows the diagram.
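The first two steps can be illustrated with a small client-side call. This is a minimal sketch: the JSON body fields (url, maxLinks, maxDepth) and the jobId in the response are assumptions, not the confirmed API contract.

```ts
// Hypothetical request shape for the crawl create endpoint.
interface CreateCrawlJobRequest {
  url: string;
  maxLinks: number;
  maxDepth: number;
}

// POST the job to the admin API; the server stores it in crawl_jobs with status "pending".
async function createCrawlJob(tenantId: string, req: CreateCrawlJobRequest): Promise<string> {
  const res = await fetch(`/api/admin/tenant/${tenantId}/crawl/create`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`Crawl job creation failed: ${res.status}`);
  const { jobId } = await res.json(); // assumed response field
  return jobId; // the pending job is later picked up by crawlWorker
}
```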
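The worker side of the flow roughly combines Playwright for rendering with turndown for HTML-to-Markdown conversion. The sketch below is simplified to a single page and a single pass; the db helper standing in for crawl_jobs access is hypothetical, and the real worker also follows links up to the configured limits.

```ts
import { chromium } from "playwright";
import TurndownService from "turndown";

// Hypothetical data-access helper standing in for the real crawl_jobs table access.
import { db } from "./db";

const turndown = new TurndownService();

// One crawlWorker pass: claim a pending job, render the page with Playwright,
// strip obvious chrome, convert the HTML to Markdown, and save it on the job record.
export async function processPendingJob(): Promise<void> {
  const job = await db.crawlJobs.findFirstPending(); // assumed helper
  if (!job) return;

  await db.crawlJobs.update(job.id, { status: "running" });
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(job.url, { waitUntil: "networkidle" });

    // Drop navigation/script noise before conversion; the real cleaning step is likely richer.
    await page.evaluate(() => {
      document.querySelectorAll("script, style, nav, footer").forEach((el) => el.remove());
    });

    const html = await page.content();
    const markdown = turndown.turndown(html);
    await db.crawlJobs.update(job.id, { status: "completed", markdown });
  } catch (err) {
    await db.crawlJobs.update(job.id, { status: "failed", error: String(err) });
  } finally {
    await browser.close();
  }
}
```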
Mermaid Flow
flowchart TD
A["Admin Panel: Crawl Data"] --> B["Enter URL + max links/depth"]
B --> C["POST /api/admin/tenant/{tenantId}/crawl/create"]
C --> D["DB: create crawl_jobs record (status: pending)"]
D --> E["Worker: crawlWorker polls pending jobs"]
E --> F["Playwright crawls pages + extracts content"]
F --> G["Convert HTML to Markdown (turndown)"]
G --> H["Update crawl_jobs with markdown + status"]
H --> I["UI polls /crawl/jobs and shows status"]
I --> J["Click Add to Knowledge Base (completed only)"]
J --> K["Create markdown file from job"]
K --> L["POST /api/admin/tenant/{tenantId}/files/upload"]
L --> M["Upload markdown to OpenAI (assistants)"]
M --> N["Create or get tenant vector store"]
N --> O["Add file to vector store"]
O --> P["Update DB: tenant_files + vectorStoreId + openaiFileId"]
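The UI polling step (I in the diagram) can be approximated as a simple fetch loop. The jobs endpoint path, response shape, and status values below are assumptions based on the flow above.

```ts
// Assumed job shape returned by the jobs endpoint.
interface CrawlJob {
  id: string;
  status: "pending" | "running" | "completed" | "failed";
}

// Poll the tenant's crawl jobs until the given job reaches a terminal status.
async function waitForJob(tenantId: string, jobId: string): Promise<CrawlJob> {
  for (;;) {
    const res = await fetch(`/api/admin/tenant/${tenantId}/crawl/jobs`);
    const jobs: CrawlJob[] = await res.json();
    const job = jobs.find((j) => j.id === jobId);
    if (job && (job.status === "completed" || job.status === "failed")) return job;
    await new Promise((resolve) => setTimeout(resolve, 3000)); // poll every 3 s
  }
}
```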
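The Add to Knowledge Base branch (J through P) maps roughly onto the OpenAI SDK's file and vector store endpoints, as sketched below. The db helpers, field names, and temp-file path are assumptions, and depending on SDK version the vector store client may live under openai.beta.vectorStores rather than openai.vectorStores.

```ts
import fs from "node:fs";
import OpenAI from "openai";

// Hypothetical data-access helpers for crawl_jobs, tenants, and tenant_files.
import { db } from "./db";

const openai = new OpenAI();

// Write the job's Markdown to a file, upload it for assistants use, ensure the tenant
// has a vector store, attach the file, and record the resulting IDs.
export async function addToKnowledgeBase(tenantId: string, jobId: string): Promise<void> {
  const job = await db.crawlJobs.get(jobId); // assumed helper
  const path = `/tmp/crawl-${jobId}.md`;
  fs.writeFileSync(path, job.markdown);

  // 1. Upload the Markdown file to OpenAI with the "assistants" purpose.
  const file = await openai.files.create({
    file: fs.createReadStream(path),
    purpose: "assistants",
  });

  // 2. Reuse the tenant's vector store, or create one on first upload.
  const tenant = await db.tenants.get(tenantId); // assumed helper
  const vectorStoreId =
    tenant.vectorStoreId ??
    (await openai.vectorStores.create({ name: `tenant-${tenantId}` })).id;

  // 3. Index the uploaded file in the vector store.
  await openai.vectorStores.files.create(vectorStoreId, { file_id: file.id });

  // 4. Persist the linkage so the admin UI can list the file later.
  await db.tenantFiles.create({
    tenantId,
    jobId,
    openaiFileId: file.id,
    vectorStoreId,
  });
}
```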