Senior Web Intelligence & Parsing Engineer

Remote | Full-Time | Flexible Hours (3+ hrs/day EST overlap)

 

Comiq.ai is an AI-native platform helping investment teams build stronger capital relationships through smarter outreach. We combine structured data, AI enrichment, and real-time signals to replace generic fundraising workflows with precision targeting and contextual messaging.

 

Behind the scenes, our platform transforms noisy, fragmented web data into clean, structured inputs that power enrichment pipelines and insight generation at scale. We’re hiring a Senior Web Intelligence & Parsing Engineer to lead the ingestion layer of our stack. Your mission: extract and structure data from 10M+ company websites, team pages, filings, and LinkedIn-style public profiles, then convert that chaos into structured signals ready for AI enrichment and vector search.

 

🔧 What You’ll Own

  • Scraping at Scale
    Build and manage high-throughput scrapers for structured and semi-structured web sources
    Extract people/org/fund info from websites, portals, filings, dark web sources, and LinkedIn-scale public directories
    Operate large-scale crawlers with stealth browsers, IP rotation, session spoofing, and full automation
  • Smart Parsing & Structuring (Perplexity/Manus-style)
    Use open-source frameworks like Unstructured.io, pdfplumber, and trafilatura to segment and extract content
    Chunk and tag content sections (e.g. mandate, team, strategy) and output structured JSON + full text
    Integrate prompt-based parsing or LLM-assisted validation when needed
  • Framework Integration & Agent Management
    Leverage and extend tools like LangChain, LlamaIndex, and Haystack to manage parsing agents
    Orchestrate crawlers and parsers across source types, handling retries and evolving page structures
    Monitor performance, schema compliance, and pipeline yield at scale
  • Collaboration with Backend/LLM Engineer
    Deliver structured data for downstream LLM enrichment (RAG search, entity resolution, semantic classification)
    Coordinate pipeline triggers, batch jobs, and API endpoints with backend engineering
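
As a rough illustration of the chunk-and-tag step described above (section labels such as mandate, team, and strategy, output as structured JSON plus full text), here is a minimal Python sketch. It is only an assumption about what such output could look like, not a description of Comiq.ai's actual parser: the keyword map and the `tag_chunk` and `structure_page` helpers are hypothetical, and a real pipeline would more likely rely on the parsing frameworks named in this posting or on LLM-assisted tagging than on keyword matching.

```python
import json
import re

# Hypothetical keyword map: the section labels come from the posting,
# the keywords themselves are illustrative assumptions.
SECTION_KEYWORDS = {
    "mandate": ("mandate", "invests in", "ticket size"),
    "team": ("team", "partner", "managing director"),
    "strategy": ("strategy", "thesis", "approach"),
}


def tag_chunk(chunk: str) -> str:
    """Return the first section label whose keywords appear in the chunk."""
    lowered = chunk.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "other"


def structure_page(url: str, text: str) -> dict:
    """Split text on blank lines, tag each chunk, and keep the full text."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    return {
        "url": url,
        "sections": [{"label": tag_chunk(c), "text": c} for c in chunks],
        "full_text": text,
    }


if __name__ == "__main__":
    sample = (
        "Our mandate covers growth equity in Europe.\n\n"
        "The team is led by two founding partners.\n\n"
        "Contact us at the address below."
    )
    page = structure_page("https://example.com/about", sample)
    print(json.dumps(page, indent=2))
```

Keeping `full_text` alongside the tagged sections means downstream enrichment can still fall back to the raw page when a chunk lands in the catch-all "other" bucket.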

     

🛠️ Our Stack

 

Scraping: Playwright · Puppeteer · Stealth Browsers · Proxy Pools · Tor

Parsing: Unstructured.io · pdfplumber · trafilatura · BeautifulSoup · regex + prompt hybrids

Orchestration: LangChain · LlamaIndex · Haystack · Docker · GitHub Actions

Storage & Flow: PostgreSQL · Redis · ClickHouse · PeerDB

LLMs & Embeddings: OpenAI · Hugging Face · SentenceTransformers · GH200s via Lambda Labs

 

✅ You’re a Fit If You...

  • Have 5–10+ years of experience in large-scale web data extraction and pipeline engineering
  • Have scraped or managed datasets of 1M+ entities (e.g., PDL, Apollo, Crunchbase, custom OSINT stacks)
  • Know how to deal with anti-bot protections, dynamic DOMs, session control, and throttling
  • Write modular, maintainable parsing code that won’t break with minor site changes
  • Are comfortable structuring messy data into well-formed, schema-compliant outputs
  • Bonus: Experience with financial, regulatory, or investor data parsing; Perplexity-style parsing; dark web scraping

     

🌍 Why This Role Matters

We’re not just scraping web pages—we’re building a structured knowledge engine for global capital intelligence. Your work will turn the open web into machine-readable signal that powers everything from mandate search to warm investor targeting to real-time enrichment.

