Senior Web Intelligence & Parsing Engineer
Remote | Full-Time | Flexible Hours (3+ hrs/day EST overlap)
Comiq.ai is an AI-native platform helping investment teams build stronger capital relationships through smarter outreach. We combine structured data, AI enrichment, and real-time signals to replace generic fundraising workflows with precision targeting and contextual messaging.
Behind the scenes, our platform transforms noisy, fragmented web data into clean, structured inputs — powering enrichment pipelines and insight generation at scale. We’re hiring a Senior Web Intelligence & Parsing Engineer to lead the ingestion layer of our stack. Your mission: extract and structure data from 10M+ company websites, team pages, filings, and LinkedIn-style public profiles—then convert that chaos into clean, structured signals ready for AI enrichment and vector search.
🔧 What You’ll Own
- Scraping at Scale
Build and manage high-throughput scrapers for structured and semi-structured web sources
Extract people/org/fund info from websites, portals, filings, dark web sources, and public directories (e.g., at LinkedIn scale)
Operate large-scale crawlers with stealth browsers, IP rotation, session spoofing, and full automation
- Smart Parsing & Structuring (Perplexity/Manus-style)
Use open-source frameworks like Unstructured.io, pdfplumber, and trafilatura to segment and extract content
Chunk and tag content sections (e.g. mandate, team, strategy) and output structured JSON + full text
Integrate prompt-based parsing or LLM-assisted validation when needed
- Framework Integration & Agent Management
Leverage and extend tools like LangChain, LlamaIndex, and Haystack to manage parsing agents
Orchestrate crawlers and parsers across source types, retry logic, and evolving page structures
Monitor performance, schema compliance, and pipeline yield at scale
- Collaboration with Backend/LLM Engineer
Deliver structured data for downstream LLM enrichment (RAG search, entity resolution, semantic classification)
Coordinate pipeline triggers, batch jobs, and API endpoints with backend engineering
🛠️ Our Stack
Scraping
Playwright · Puppeteer · Stealth Browsers · Proxy Pools · Tor
Parsing
Unstructured.io · pdfplumber · trafilatura · BeautifulSoup · regex + prompt hybrids
Orchestration
LangChain · LlamaIndex · Haystack · Docker · GitHub Actions
Storage & Flow
PostgreSQL · Redis · ClickHouse · PeerDB
LLMs & Embeddings
OpenAI · Hugging Face · SentenceTransformers · GH200s via Lambda Labs
✅ You’re a Fit If You...
- Have 5–10+ years of experience in large-scale web data extraction and pipeline engineering
- Have previously scraped or managed datasets of 1M+ entities (e.g., PDL, Apollo, Crunchbase, custom OSINT stacks)
- Know how to deal with anti-bot protections, dynamic DOMs, session control, and throttling
- Write modular, maintainable parsing code that won’t break with minor site changes
- Are comfortable structuring messy data into well-formed, schema-compliant outputs
Bonus: Experience with financial, regulatory, or investor data parsing; Perplexity-style parsing; dark web scraping
🌍 Why This Role Matters
We’re not just scraping web pages—we’re building a structured knowledge engine for global capital intelligence. Your work will turn the open web into machine-readable signal that powers everything from mandate search to warm investor targeting to real-time enrichment.
The job ad is no longer active