Data Engineer (AI and Data Pipeline Focus)
Do you want to deepen your data engineering skills on a complex, high-impact AI product? Here is your opportunity to apply your knowledge and grow across every area of our robust data ecosystem!
Join Aniline.ai! We are a forward-thinking technology company dedicated to harnessing the power of AI across various sectors, including HR, facility monitoring, retail analytics, marketing, and learning support systems. Our mission is to transform data into actionable insights and innovative solutions.
We are seeking a highly skilled Data Engineer with a strong background in building scalable data pipelines, optimizing high-load data processing, and supporting AI/LLM architectures. In this critical role, you will be the backbone of our data operations, ensuring quality, reliability, and efficient delivery of data across our entire platform.
Key Responsibilities & Focus Areas
You will be a key contributor across our platform, with a primary focus on the following data engineering areas:
1. Data Pipeline Design & Automation (Primary Focus)
- Design, build, and maintain scalable data pipelines and ETL/ELT processes.
- Automate the end-to-end data pipeline for the periodic collection, processing, and deployment of results to production. This includes transitioning manual processes to robust automated solutions.
- Manage the ingestion of raw data (company reviews from various sources) into our GCP Data Lake and its subsequent transformation and loading into the GCP Data Warehouse (e.g., BigQuery); a minimal sketch of this load step follows this list.
- Set up and maintain systems for pipeline orchestration.
- Develop ETL/ELT processes to update client-facing databases like Firebase and refresh reference data in PostgreSQL.
- Integrate data from various sources, ensuring data quality and reliability for analytics and reporting.
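To give a flavor of this work, here is a minimal sketch of the data-lake-to-warehouse load step referenced above, using the google-cloud-bigquery client. The bucket, project, dataset, and table names are hypothetical placeholders, not our actual resources.

```python
# A minimal sketch of loading raw review files from a GCS data lake into
# BigQuery. All resource names below are hypothetical, for illustration only.
from google.cloud import bigquery


def load_reviews_to_warehouse() -> None:
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # infer the schema from the raw review records
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Hypothetical data-lake location and warehouse destination.
    uri = "gs://example-data-lake/raw/company_reviews/*.json"
    table_id = "example-project.analytics.company_reviews"

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job completes

    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")


if __name__ == "__main__":
    load_reviews_to_warehouse()
```

In practice a step like this would run under the pipeline-orchestration systems mentioned above rather than as a standalone script.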
2. AI Data Support & Integration
- Engineer data flow specifically for AI/LLM solutions, focusing on contextual retrieval and input data preparation.
- Automate the pipeline for updating contexts in the Pinecone vector database for our Retrieval-Augmented Generation (RAG) architecture (see the sketch after this list).
- Prepare processed and analyzed data for loading into result tables (including statistics and logs), which serve as the foundation for LLM inputs and subsequent client reporting.
- Perform general Python development tasks to maintain and support existing data-handling code, including LangChain logic and data processing within Jupyter Notebooks.
- Collaborate with cross-functional teams (data scientists and AI engineers) to ensure data requirements are met for LLM solution deployment and prompt optimization.
- Perform data analysis and reporting using BI tools (Looker, Power BI, Tableau, etc.).
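Here is a minimal sketch of the Pinecone context-refresh task referenced above, pairing the OpenAI embeddings endpoint with a Pinecone index. The index name, embedding model, and record shape are assumptions for illustration, not our production setup.

```python
# A minimal sketch of refreshing RAG contexts: embed new review chunks and
# upsert them into a Pinecone index. Names and the record shape are
# hypothetical, for illustration only.
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("company-reviews")  # hypothetical index name


def refresh_contexts(chunks: list[dict]) -> None:
    """Embed text chunks and upsert them as vectors with metadata."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed model choice
        input=[chunk["text"] for chunk in chunks],
    )
    vectors = [
        {
            "id": chunk["id"],
            "values": item.embedding,
            "metadata": {"text": chunk["text"], "source": chunk["source"]},
        }
        for chunk, item in zip(chunks, response.data)
    ]
    index.upsert(vectors=vectors)


refresh_contexts([
    {"id": "rev-001", "text": "Great benefits and flexible hours.", "source": "glassdoor"},
])
```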
3. Infrastructure & Optimization
- Work with cloud platforms (preferably GCP) to manage, optimize, and secure data lakes and data warehouses.
- Apply algorithmic knowledge and complexity analysis (including Big O notation) to select the most efficient algorithms for high-load data processing; a small worked example follows this list.
- Conduct thorough research and analysis of existing infrastructure, data structures, and code bases to ensure seamless integration and stability of new developments.
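To illustrate the Big O point above: deduplicating review IDs with a quadratic list scan versus a linear set-based pass. This is illustrative only, not code from our codebase.

```python
# The quadratic version is fine for small batches but degrades badly at
# data-lake scale; the set-based version stays linear.

def dedupe_quadratic(ids: list[str]) -> list[str]:
    """O(n^2): each membership test scans the output list."""
    unique: list[str] = []
    for i in ids:
        if i not in unique:  # list lookup is O(n)
            unique.append(i)
    return unique


def dedupe_linear(ids: list[str]) -> list[str]:
    """O(n): a set gives O(1) average-case membership tests."""
    seen: set[str] = set()
    unique: list[str] = []
    for i in ids:
        if i not in seen:
            seen.add(i)
            unique.append(i)
    return unique


assert dedupe_linear(["a", "b", "a"]) == ["a", "b"]
```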
Requirements
- Proven experience as a Data Engineer, focusing on building and optimizing ETL/ELT processes for large datasets.
- Strong proficiency in Python development and the data stack (Pandas, NumPy).
- Hands-on experience with cloud-based data infrastructure (GCP is highly preferred), including Data Warehouses (BigQuery) and Data Lakes.
- Familiarity with database technologies including PostgreSQL, NoSQL (Firebase), and, crucially, vector databases (Pinecone, FAISS, or similar).
- Experience supporting LLM-based solutions and frameworks like LangChain is highly desirable.
- Solid grasp of software engineering best practices, including Git and CI/CD.
Nice-to-Have Skills
- Experience integrating OpenAI API or similar AI services.
- Experience in a production environment with multi-agent systems.
Next Steps
We are keen to see your practical data engineering work! We would highly value a submission that includes a link to a Git repository demonstrating your expertise in building a robust data pipeline, especially one that interfaces with LLM/RAG components (e.g., updating a vector database).
Ready to architect our next-generation data ecosystem? Apply today!
Required languages
| Language | Level |
| --- | --- |
| English | B1 - Intermediate |
| Ukrainian | Native |