Data Science UA

Joined in 2019

Data Science UA is a service company with strong data science and AI expertise.

Our journey began in 2016 with the organization of the first Data Science UA conference, setting the foundation for our growth. Over the past 8 years, we have diligently fostered the largest Data Science Community in Eastern Europe, boasting a network of over 30,000 top AI engineers.

For over 4 years, we have been at the forefront of establishing AI R&D centers in Western Europe, catering to esteemed product companies from the USA and UK.

We offer diverse cooperation models, including outsourcing and outstaffing, where we assemble top-notch tech teams of industry experts to craft optimal solutions tailored to your business requirements.

At Data Science UA, our core focus revolves around AI consulting. Whether you possess a clearly defined request or just a nascent idea, we are eager to collaborate and explore possibilities together.

Additionally, our comprehensive recruiting service extends beyond AI and data science specialists: at your request, we find the best candidate for any level and specialization, bolstering teams worldwide.

Embrace the opportunity to partner with us at Data Science UA, and together, we can achieve extraordinary milestones for your enterprise. Reach out to us today, and let’s embark on this transformative journey hand in hand!

  • 15 views · 0 applications · 5d

    Computer Vision Lead

    Full Remote · Countries of Europe or Ukraine · 5 years of experience · B2 - Upper Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with the organization of the first Data Science UA conference, setting the foundation for our growth. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the role:
    We seek an experienced AI/ML Team Leader to join our Client’s startup team. As the Technical Lead, you will start with individual technical contributions and later take on an engineering manager role for the team being hired. You will create cutting-edge end-to-end AI camera solutions, engage with customers and partners to understand their requirements, and oversee projects from inception to completion to ensure their success.

    Requirements:
    - Proven leadership experience with a track record of managing and developing technical teams;
    - Excellent customer-facing skills to understand and address client needs effectively;
    - Master's Degree in Computer Science or related field (PhD is a plus);
    - Solid grasp of machine learning and deep learning principles;
    - Strong experience in Computer Vision, including object detection, segmentation, tracking, keypoint/pose estimation;
    - Proven R&D mindset: capable of formulating and validating hypotheses independently, exploring novel approaches, and diving deep into model failures;
    - Proficiency in Python and deep learning frameworks;
    - Practical experience with state-of-the-art models, including different versions of YOLO and Transformer-based architectures (e.g., ViT, DETR, SAM);
    - Expertise in image and video processing using OpenCV;
    - Experience in model training, evaluation, and optimization;
    - Fluent written and verbal communication skills in English.

    Would be a plus:
    - Experience applying ML techniques to embedded or resource-constrained environments (e.g. edge devices, mobile platforms, microcontrollers);
    - Ideally, you have led projects where ML models were optimized, deployed, or fine-tuned for embedded systems, ensuring high performance and low latency under hardware limitations.

    We offer:
    - Free English classes with a native speaker and compensation for external courses;
    - PE support by professional accountants;
    - 40 days of PTO;
    - Medical insurance;
    - Team-building events, conferences, meetups, and other activities;
    - Many other benefits that you’ll learn about at the interview.

  • 42 views · 8 applications · 10d

    Computer Vision Lead

    Full Remote · Countries of Europe or Ukraine · 6 years of experience · B2 - Upper Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with the organization of the first Data Science UA conference, setting the foundation for our growth. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the role:
    We seek an experienced AI/ML Team Leader to join our Client’s startup team. As the Technical Lead, you will start with individual technical contributions and later take on an engineering manager role for the team being hired. You will create cutting-edge end-to-end AI camera solutions, engage with customers and partners to understand their requirements, and oversee projects from inception to completion to ensure their success.

    Requirements:
    - Proven leadership experience with a track record of managing and developing technical teams;
    - Excellent customer-facing skills to understand and address client needs effectively;
    - Master's Degree in Computer Science or related field (PhD is a plus);
    - Solid grasp of machine learning and deep learning principles;
    - Strong experience in Computer Vision, including object detection, segmentation, tracking, keypoint/pose estimation;
    - Proven R&D mindset: capable of formulating and validating hypotheses independently, exploring novel approaches, and diving deep into model failures;
    - Proficiency in Python and deep learning frameworks;
    - Practical experience with state-of-the-art models, including different versions of YOLO and Transformer-based architectures (e.g., ViT, DETR, SAM);
    - Expertise in image and video processing using OpenCV;
    - Experience in model training, evaluation, and optimization;
    - Fluent written and verbal communication skills in English.

    Would be a plus:
    - Experience applying ML techniques to embedded or resource-constrained environments (e.g. edge devices, mobile platforms, microcontrollers);
    - Ideally, you have led projects where ML models were optimized, deployed, or fine-tuned for embedded systems, ensuring high performance and low latency under hardware limitations.

    We offer:
    - Free English classes with a native speaker and compensation for external courses;
    - PE support by professional accountants;
    - 40 days of PTO;
    - Medical insurance;
    - Team-building events, conferences, meetups, and other activities;
    - Many other benefits that you’ll learn about at the interview.

  • 59 views · 4 applications · 20d

    Data Engineer

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel the Ukrainian LLM and NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling the Data Scientists and ML Engineers to develop cutting-edge language models.

    You will work at the intersection of data engineering and machine learning, ensuring that the datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context.

    Requirements:
    - Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.
    - NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project’s focus. Understanding of FineWeb2 or a similar processing pipeline approach.
    - Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.
    - Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.
    - Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL), including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as the NLP applications may require embedding storage and fast similarity search.
    - Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.
    - Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.
    - Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.

    Nice to have:
    - Advanced Tools & Frameworks: Experience with distributed data processing frameworks (such as Apache Spark or Databricks) for large-scale data transformation, and with message streaming systems (Kafka, Pub/Sub) for real-time data pipelines. Familiarity with data serialization formats (JSON, Parquet) and handling of large text corpora.
    - Web Scraping Expertise: Deep experience in web scraping, using tools like Scrapy, Selenium, or Beautiful Soup, and handling anti-scraping challenges (rotating proxies, rate limiting). Ability to parse and clean raw text data from HTML, PDFs, or scanned documents.
    - CI/CD & DevOps: Knowledge of setting up CI/CD pipelines for data engineering (using GitHub Actions, Jenkins, or GitLab CI) to test and deploy changes to data workflows. Experience with containerization (Docker) to package data jobs and with Kubernetes for scaling them is a plus.
    - Big Data & Analytics: Experience with analytics platforms and BI tools (e.g., Tableau, Looker) used to examine the data prepared by the pipelines. Understanding of how to create and manage data warehouses or data marts for analytical consumption.
    - Problem-Solving: Demonstrated ability to work independently in solving complex data engineering problems, optimizing existing pipelines, and implementing new ones under time constraints. A proactive attitude to explore new data tools or techniques that could improve the workflows.

    Responsibilities:
    - Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information.
    - Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.
    - Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to the language modeling efforts.
    - Implement NLP/LLM-specific data processing: cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, and detection and deletion of personal data.
    - Build specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as a teacher.
    - Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.
    - Automate data processing workflows and ensure their scalability and reliability.
    - Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.
    - Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs.
    - Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.
    - Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models.
    - Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.
    - Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.
    - Manage data security, access, and compliance.
    - Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • 31 views · 0 applications · 20d

    Senior/Middle Data Scientist (Data Preparation, Pre-training)

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior/Middle Data Scientist with a passion for Large Language Models (LLMs) and cutting-edge AI research. In this role, you will focus on designing and prototyping data preparation pipelines, collaborating closely with data engineers to transform your prototypes into scalable production pipelines, and actively developing model training pipelines with other talented data scientists. Your work will directly shape the quality and capabilities of the models by ensuring we feed them the highest-quality, most relevant data possible.

    Requirements:
    Education & Experience:
    - 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
    - Proven experience in data preprocessing, cleaning, and feature engineering for large-scale datasets of unstructured data (text, code, documents, etc.).
    - Advanced degree (Master’s or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    NLP Expertise:
    - Good knowledge of natural language processing techniques and algorithms.
    - Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
    - Familiarity with LLM training and fine-tuning techniques.
    ML & Programming Skills:
    - Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
    - Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
    - Ability to write efficient, clean code and debug complex model issues.
    Data & Analytics:
    - Solid understanding of data analytics and statistics.
    - Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
    - Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    Deployment & Tools:
    - Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
    - Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
    - Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training.
    Communication & Personality:
    - Experience working in a collaborative, cross-functional environment.
    - Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.
    - Ability to rapidly prototype and iterate on ideas.

    Nice to have:
    Advanced NLP/ML Techniques:
    - Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
    - Understanding of FineWeb2 or a similar processing pipeline approach.
    Research & Community:
    - Publications in NLP/ML conferences or contributions to open-source NLP projects.
    - Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    Domain & Language Knowledge:
    - Familiarity with the Ukrainian language and context.
    - Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project’s focus.
    MLOps & Infrastructure:
    - Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    Problem-Solving:
    - Innovative mindset with the ability to approach open-ended AI problems creatively.
    - Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Design, prototype, and validate data preparation and transformation steps for LLM training datasets, including cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, detection and deletion of personal data, etc.
    - Build specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as a teacher.
    - Analyze large-scale raw text, code, and multimodal data sources for quality, coverage, and relevance.
    - Develop heuristics, filtering rules, and cleaning techniques to maximize training data effectiveness.
    - Collaborate with data engineers to hand over prototypes for automation and scaling.
    - Research and develop best practices and novel techniques in LLM training pipelines.
    - Monitor and evaluate data quality impact on model performance through experiments and benchmarks.
    - Research and implement best practices in large-scale dataset creation for AI/ML models.
    - Document methodologies and share insights with internal teams.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • 32 views · 2 applications · 19d

    Data Engineer (NLP-Focused)

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel the Ukrainian LLM and NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling the Data Scientists and ML Engineers to develop cutting-edge language models.

    You will work at the intersection of data engineering and machine learning, ensuring that the datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context.

    Requirements:
    - Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.
    - NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project’s focus. Understanding of FineWeb2 or a similar processing pipeline approach.
    - Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.
    - Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.
    - Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL), including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as the NLP applications may require embedding storage and fast similarity search.
    - Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.
    - Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.
    - Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.

    Responsibilities:
    - Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information.
    - Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.
    - Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to the language modeling efforts.
    - Implement NLP/LLM-specific data processing: cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, and detection and deletion of personal data.
    - Build specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as a teacher.
    - Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.
    - Automate data processing workflows and ensure their scalability and reliability.
    - Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.
    - Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs.
    - Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.
    - Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models.
    - Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.
    - Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.
    - Manage data security, access, and compliance.
    - Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • 25 views · 1 application · 10d

    Senior Data Scientist/NLP Lead

    Hybrid Remote · Ukraine (Kyiv) · Product · 5 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior Data Scientist / NLP Lead to spearhead the development of cutting-edge natural language processing solutions for the Ukrainian LLM project. You will lead the NLP team in designing, implementing, and deploying large-scale language models and NLP algorithms that power the products.

    This role is critical to the mission of advancing AI in the Ukrainian language context, and offers the opportunity to drive technical decisions, mentor a team of data scientists, and shape the future of AI capabilities in Ukraine.

    Requirements:
    - Education & Experience: 5+ years of experience in data science or machine learning, with a strong focus on NLP. Proven track record of developing and deploying NLP or ML models at scale in production environments. An advanced degree (Master's or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    - NLP Expertise: Deep understanding of natural language processing techniques and algorithms. Hands-on experience with modern NLP approaches, including embedding models, text classification, sequence tagging (NER), and transformers/LLMs. Deep understanding of transformer architectures, knowledge of LLM training and fine-tuning techniques, hands-on experience developing LLM-based solutions, and knowledge of linguistic nuances in Ukrainian or other languages.
    - Advanced NLP/ML Techniques: Experience with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency. Background in information retrieval or RAG (Retrieval-Augmented Generation) is a plus for building systems that augment LLMs with external knowledge.
    - ML & Programming Skills: Proficiency in Python and common data science libraries (pandas, NumPy, scikit-learn). Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models. Ability to write efficient, clean code and debug complex model issues.
    - Data & Analytics: Solid understanding of data analytics and statistics. Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance. Experience building a representative LLM benchmarking framework from business requirements. Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    - Deployment & Tools: Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications. Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML). Experience with cloud platforms (AWS, GCP or Azure) and big data technologies (Spark, Hadoop) for scaling data processing or model training is a plus. Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Leadership & Communication: Demonstrated ability to lead technical projects and mentor junior team members. Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.

    Nice to have:
    - LLM training & evaluation experience: Hands-on experience building tokenizers and applying SFT and RLHF techniques. Knowledge of evaluating model toxicity, ethical aspects, hallucinations, and security, and of building LLM guardrails.
    - Research & Community: Publications in NLP/ML conferences or contributions to open-source NLP projects. Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    - Domain & Language Knowledge: Familiarity with the Ukrainian language and context. Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - MLOps & Infrastructure: Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    - Problem-Solving: Innovative mindset with the ability to approach open-ended AI problems creatively. Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Lead end-to-end development of NLP and LLM models - from data exploration and model prototyping to validation and production deployment. This includes designing novel model architectures or fine-tuning state-of-the-art transformer models (e.g., BERT, GPT) to solve project-specific language tasks.
    - Analyze large text datasets (Ukrainian and multilingual corpora) to extract insights and build robust training datasets.
    - Guide data collection and annotation efforts to ensure high-quality data for model training.
    - Develop and implement NLP algorithms for a range of tasks such as text classification, named entity recognition, semantic search, and conversational AI.
    - Stay up-to-date with the latest research to apply transformer-based models, embeddings, and other modern NLP techniques in the solutions.
    - Establish evaluation metrics and validation frameworks for model performance, including accuracy, factuality, and bias.
    - Design A/B tests and statistical experiments to compare model variants and validate improvements.
    - Deploy and integrate NLP models into production systems in collaboration with engineers - ensuring models are scalable, efficient, and well-monitored in a real-world setting.
    - Optimize model inference and troubleshoot issues such as model drift or data pipeline bottlenecks.
    - Provide technical leadership and mentorship to the NLP/ML team.
    - Review code and research, uphold best practices in ML (version control, reproducibility, documentation), and foster a culture of continuous learning and innovation.
    - Collaborate cross-functionally with product managers, software engineers, and MLOps engineers to align NLP solutions with product goals and infrastructure capabilities.
    - Communicate complex data science concepts to stakeholders and incorporate their feedback into model development.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • · 30 views · 0 applications · 18d

    Senior/Middle Data Scientist

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior/Middle Data Scientist with a passion for Large Language Models (LLMs) and cutting-edge AI research. In this role, you will focus on designing and prototyping data preparation pipelines, collaborating closely with data engineers to transform your prototypes into scalable production pipelines, and actively developing model training pipelines with other talented data scientists. Your work will directly shape the quality and capabilities of the models by ensuring we feed them the highest-quality, most relevant data possible.

    Requirements:
    Education & Experience:
    - 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
    - Proven experience in data preprocessing, cleaning, and feature engineering for large-scale datasets of unstructured data (text, code, documents, etc.).
    - Advanced degree (Master's or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    NLP Expertise:
    - Good knowledge of natural language processing techniques and algorithms.
    - Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
    - Familiarity with LLM training and fine-tuning techniques.
    ML & Programming Skills:
    - Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
    - Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
    - Ability to write efficient, clean code and debug complex model issues.
    Data & Analytics:
    - Solid understanding of data analytics and statistics.
    - Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
    - Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    Deployment & Tools:
    - Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
    - Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
    - Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training.
    Communication & Personality:
    - Experience working in a collaborative, cross-functional environment.
    - Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.
    - Ability to rapidly prototype and iterate on ideas.

    Nice to have:
    Advanced NLP/ML Techniques:
    - Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
    - Understanding of the FineWeb2 approach or similar data processing pipelines.
    Research & Community:
    - Publications in NLP/ML conferences or contributions to open-source NLP projects.
    - Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    Domain & Language Knowledge:
    - Familiarity with the Ukrainian language and context.
    - Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project's focus.
    MLOps & Infrastructure:
    - Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    Problem-Solving:
    - Innovative mindset with the ability to approach open-ended AI problems creatively.
    - Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Design, prototype, and validate data preparation and transformation steps for LLM training datasets, including cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, detection and deletion of personal data, etc.
    - Form specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as teacher.
    - Analyze large-scale raw text, code, and multimodal data sources for quality, coverage, and relevance.
    - Develop heuristics, filtering rules, and cleaning techniques to maximize training data effectiveness.
    - Collaborate with data engineers to hand over prototypes for automation and scaling.
    - Research and develop best practices and novel techniques in LLM training pipelines.
    - Monitor and evaluate data quality impact on model performance through experiments and benchmarks.
    - Research and implement best practices in large-scale dataset creation for AI/ML models.
    - Document methodologies and share insights with internal teams.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • · 21 views · 0 applications · 14d

    Senior/Middle Data Scientist

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior/Middle Data Scientist with a passion for Large Language Models (LLMs) and cutting-edge AI research. In this role, you will focus on designing and prototyping data preparation pipelines, collaborating closely with data engineers to transform your prototypes into scalable production pipelines, and actively developing model training pipelines with other talented data scientists. Your work will directly shape the quality and capabilities of the models by ensuring we feed them the highest-quality, most relevant data possible.

    Requirements:
    Education & Experience:
    - 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
    - Proven experience in data preprocessing, cleaning, and feature engineering for large-scale datasets of unstructured data (text, code, documents, etc.).
    - Advanced degree (Master's or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    NLP Expertise:
    - Good knowledge of natural language processing techniques and algorithms.
    - Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
    - Familiarity with LLM training and fine-tuning techniques.
    ML & Programming Skills:
    - Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
    - Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
    - Ability to write efficient, clean code and debug complex model issues.
    Data & Analytics:
    - Solid understanding of data analytics and statistics.
    - Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
    - Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    Deployment & Tools:
    - Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
    - Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
    - Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training.
    Communication & Personality:
    - Experience working in a collaborative, cross-functional environment.
    - Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.
    - Ability to rapidly prototype and iterate on ideas.

    Nice to have:
    Advanced NLP/ML Techniques:
    - Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
    - Understanding of the FineWeb2 approach or similar data processing pipelines.
    Research & Community:
    - Publications in NLP/ML conferences or contributions to open-source NLP projects.
    - Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    Domain & Language Knowledge:
    - Familiarity with the Ukrainian language and context.
    - Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project's focus.
    MLOps & Infrastructure:
    - Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    Problem-Solving:
    - Innovative mindset with the ability to approach open-ended AI problems creatively.
    - Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Design, prototype, and validate data preparation and transformation steps for LLM training datasets, including cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, detection and deletion of personal data, etc.
    - Form specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as teacher.
    - Analyze large-scale raw text, code, and multimodal data sources for quality, coverage, and relevance.
    - Develop heuristics, filtering rules, and cleaning techniques to maximize training data effectiveness.
    - Collaborate with data engineers to hand over prototypes for automation and scaling.
    - Research and develop best practices and novel techniques in LLM training pipelines.
    - Monitor and evaluate data quality impact on model performance through experiments and benchmarks.
    - Research and implement best practices in large-scale dataset creation for AI/ML models.
    - Document methodologies and share insights with internal teams.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • · 32 views · 0 applications · 14d

    Data Engineer (NLP-Focused)

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel the Ukrainian LLM and NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling the Data Scientists and ML Engineers to develop cutting-edge language models.

    You will work at the intersection of data engineering and machine learning, ensuring that the datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context.

    Requirements:
    - Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor's or Master's degree in Computer Science, Engineering, or a related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.
    - NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project's focus. Understanding of FineWeb2 or a similar processing pipeline approach.
    - Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.
    - Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.
    - Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL), including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as the NLP applications may require embedding storage and fast similarity search.
    - Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.
    - Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.
    - Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.

    Responsibilities:
    - Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information.
    - Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.
    - Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to the language modeling efforts.
    - Implement NLP/LLM-specific data processing: cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, and detection and deletion of personal data.
    - Form specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as teacher.
    - Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.
    - Automate data processing workflows and ensure their scalability and reliability.
    - Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.
    - Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs.
    - Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.
    - Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models.
    - Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.
    - Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.
    - Manage data security, access, and compliance.
    - Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • · 33 views · 1 application · 14d

    Senior/Middle Data Scientist (Data Preparation, Pre-training)

    Full Remote · Ukraine · Product · 3 years of experience · B1 - Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior/Middle Data Scientist with a passion for Large Language Models (LLMs) and cutting-edge AI research. In this role, you will focus on designing and prototyping data preparation pipelines, collaborating closely with data engineers to transform your prototypes into scalable production pipelines, and actively developing model training pipelines with other talented data scientists. Your work will directly shape the quality and capabilities of the models by ensuring we feed them the highest-quality, most relevant data possible.

    Requirements:
    Education & Experience:
    - 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
    - Proven experience in data preprocessing, cleaning, and feature engineering for large-scale datasets of unstructured data (text, code, documents, etc.).
    - Advanced degree (Master's or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    NLP Expertise:
    - Good knowledge of natural language processing techniques and algorithms.
    - Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
    - Familiarity with LLM training and fine-tuning techniques.
    ML & Programming Skills:
    - Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
    - Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
    - Ability to write efficient, clean code and debug complex model issues.
    Data & Analytics:
    - Solid understanding of data analytics and statistics.
    - Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
    - Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    Deployment & Tools:
    - Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
    - Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
    - Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training.
    Communication & Personality:
    - Experience working in a collaborative, cross-functional environment.
    - Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.
    - Ability to rapidly prototype and iterate on ideas.

    Nice to have:
    Advanced NLP/ML Techniques:
    - Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
    - Understanding of the approach used in FineWeb2 or similar data processing pipelines.
    Research & Community:
    - Publications in NLP/ML conferences or contributions to open-source NLP projects.
    - Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    Domain & Language Knowledge:
    - Familiarity with the Ukrainian language and context.
    - Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - Knowledge of Ukrainian text sources and data sets, or experience with multilingual data processing, can be an advantage given the project’s focus.
    MLOps & Infrastructure:
    - Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    Problem-Solving:
    - Innovative mindset with the ability to approach open-ended AI problems creatively.
    - Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Design, prototype, and validate data preparation and transformation steps for LLM training datasets, including cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, detection and deletion of personal data, etc.
    - Build specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as teacher.
    - Analyze large-scale raw text, code, and multimodal data sources for quality, coverage, and relevance.
    - Develop heuristics, filtering rules, and cleaning techniques to maximize training data effectiveness.
    - Collaborate with data engineers to hand over prototypes for automation and scaling.
    - Research and develop best practices and novel techniques in LLM training pipelines.
    - Monitor and evaluate data quality impact on model performance through experiments and benchmarks.
    - Research and implement best practices in large-scale dataset creation for AI/ML models.
    - Document methodologies and share insights with internal teams.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • 20 views · 0 applications · 13d

    Senior/Middle Data Scientist (Benchmarking and Alignment)

    Hybrid Remote · Ukraine (Kyiv) · Product · 3 years of experience · B2 - Upper Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an experienced Senior/Middle Data Scientist with a passion for Large Language Models (LLMs) and cutting-edge AI research. In this role, you will design and implement a state-of-the-art evaluation and benchmarking framework to measure and guide model quality, and personally train LLMs with a strong focus on Reinforcement Learning from Human Feedback (RLHF). You will work alongside top AI researchers and engineers, ensuring the models are not only powerful but also aligned with user needs, cultural context, and ethical standards.

    Requirements:
    Education & Experience:
    - 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
    - Proven experience in machine learning model evaluation and/or NLP benchmarking.
    - Advanced degree (Master’s or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
    NLP Expertise:
    - Good knowledge of natural language processing techniques and algorithms.
    - Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
    - Familiarity with LLM training and fine-tuning techniques.
    ML & Programming Skills:
    - Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
    - Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
    - Solid understanding of RLHF concepts and related techniques (preference modeling, reward modeling, reinforcement learning).
    - Ability to write efficient, clean code and debug complex model issues.
    Data & Analytics:
    - Solid understanding of data analytics and statistics.
    - Experience creating and managing test datasets, including annotation and labeling processes.
    - Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
    - Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
    Deployment & Tools:
    - Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
    - Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
    - Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training.
    Communication:
    - Experience working in a collaborative, cross-functional environment.
    - Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies clearly.

    Nice to have:
    Advanced NLP/ML Techniques:
    - Prior work on LLM safety, fairness, and bias mitigation.
    - Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
    - Knowledge of data annotation workflows and human feedback collection methods.
    Research & Community:
    - Publications in NLP/ML conferences or contributions to open-source NLP projects.
    - Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations) indicating a passion for staying at the forefront of the field.
    Domain & Language Knowledge:
    - Familiarity with the Ukrainian language and context.
    - Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
    - Knowledge of Ukrainian benchmarks, or familiarity with other evaluation datasets and leaderboards for large models, can be an advantage given the project’s focus.
    MLOps & Infrastructure:
    - Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
    - Experience in working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
    Problem-Solving:
    - Innovative mindset with the ability to approach open-ended AI problems creatively.
    - Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.

    Responsibilities:
    - Analyze benchmarking datasets, define gaps, and design, implement, and maintain a comprehensive benchmarking framework for the Ukrainian language.
    - Research and integrate state-of-the-art evaluation metrics for factual accuracy, reasoning, language fluency, safety, and alignment.
    - Design and maintain testing frameworks to detect hallucinations, biases, and other failure modes in LLM outputs.
    - Develop pipelines for synthetic data generation and adversarial example creation to challenge the model’s robustness.
    - Collaborate with human annotators, linguists, and domain experts to define evaluation tasks and collect high-quality feedback.
    - Develop tools and processes for continuous evaluation during model pre-training, fine-tuning, and deployment.
    - Research and develop best practices and novel techniques in LLM training pipelines.
    - Analyze benchmarking results to identify model strengths, weaknesses, and improvement opportunities.
    - Work closely with other data scientists to align training and evaluation pipelines.
    - Document methodologies and share insights with internal teams.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.

  • 28 views · 1 application · 7d

    MLOps Engineer

    Hybrid Remote · Ukraine (Kyiv) · Product · 2 years of experience · B2 - Upper Intermediate

    About us:
    Data Science UA is a service company with strong data science and AI expertise. Our journey began in 2016 with uniting top AI talents and organizing the first Data Science tech conference in Kyiv. Over the past 9 years, we have diligently fostered one of the largest Data Science & AI communities in Europe.

    About the client:
    Our client is an IT company that develops technological solutions and products to help companies reach their full potential and meet the needs of their users. The team comprises over 600 specialists in IT and Digital, with solid expertise in various technology stacks necessary for creating complex solutions.

    About the role:
    We are looking for an MLOps Engineer specializing in Large Language Model (LLM) infrastructure to design and maintain the robust platform on which the AI models are developed, deployed, and monitored. As an MLOps Engineer, you will build the backbone of the machine learning operations, from scalable training pipelines to reliable deployment systems, ensuring that the NLP models (including LLMs) can be trained on large datasets and served to end-users efficiently.

    This role sits at the intersection of software engineering, DevOps, and machine learning, and is crucial for accelerating the R&D in the Ukrainian LLM project. You’ll work closely with data scientists and software engineers to implement best-in-class infrastructure and workflows for the continuous delivery of AI innovations.

    Requirements:
    Experience & Background:
    - 4+ years of experience in DevOps, MLOps, or ML Infrastructure roles.
    - Strong foundation in software engineering and DevOps principles as they apply to machine learning.
    - A Bachelor’s or Master’s in Computer Science, Engineering, or a related field is preferred.
    Cloud & Infrastructure:
    - Extensive experience with cloud platforms (AWS, GCP, or Azure) and designing cloud-native applications for ML.
    - Comfortable using cloud services for compute (EC2, GCP Compute, Azure VMs), storage (S3, Cloud Storage), container registry, and serverless components where appropriate.
    - Experience managing infrastructure with Infrastructure-as-Code tools like Terraform or CloudFormation.
    Containerization & Orchestration:
    - Proficiency in container technologies (Docker) and orchestration with Kubernetes.
    - Ability to deploy, scale, and manage complex applications on Kubernetes clusters; experience with tools like Helm for Kubernetes package management.
    - Knowledge of container security and networking basics in distributed systems.
    CI/CD & Automation:
    - Strong experience implementing CI/CD pipelines for ML projects.
    - Familiar with tools like Jenkins, GitLab CI, or GitHub Actions for automating testing and deployment of ML code and models.
    - Experience with specialized ML CI/CD (e.g., TensorFlow Extended TFX, MLflow for model deployment) and GitOps workflows (Argo CD) is a plus.
    Programming & Scripting:
    - Strong coding skills in Python, with experience in writing pipelines or automation scripts related to ML tasks.
    - Familiarity with shell scripting and one or more general-purpose languages (Go, Java, or C++) for infrastructure tooling.
    - Ability to debug and optimize code for performance (both in data pipelines and model inference code).
    ML Pipeline Knowledge:
    - Solid understanding of the machine learning lifecycle and tools.
    - Experience building or maintaining ML pipelines, possibly using frameworks like Kubeflow, Airflow, or custom solutions.
    - Knowledge of model serving frameworks (TensorFlow Serving, TorchServe, NVIDIA Triton, or custom Flask/FastAPI servers for ML).
    Monitoring & Reliability:
    - Experience setting up monitoring for applications and models (using Prometheus, Grafana, CloudWatch, or similar) and implementing alerting for anomalies.
    - Understanding of model performance metrics and how to track them in production (e.g., accuracy on a validation stream, response latency).
    - Familiarity with concepts of A/B testing or canary deployments for model updates in production.
    Security & Compliance:
    - Basic understanding of security best practices in ML deployments, including data encryption, access control, and dealing with sensitive data in compliance with regulations.
    - Experience implementing authentication/authorization for model endpoints and ensuring infrastructure complies with organizational security policies.
    Team Collaboration:
    - Excellent collaboration skills to work with cross-functional teams.
    - Experience interacting with data scientists to translate model requirements into scalable infrastructure.
    - Strong documentation habits for outlining system designs, runbooks for operations, and lessons learned.

    Nice to have:
    LLM/AI Domain Experience:
    - Previous experience deploying or fine-tuning large language models or other large-scale deep learning models in production.
    - Knowledge of specialized optimizations for LLMs (such as model parallelism, quantization techniques like 8-bit or 4-bit quantization, and use of libraries like DeepSpeed or Hugging Face Accelerate for efficient training) will be highly regarded.
    Distributed Computing:
    - Experience with distributed computing frameworks such as Ray for scaling up model training across multiple nodes.
    - Familiarity with big data processing (Spark, Hadoop) and streaming data (Kafka, Flink) to support feeding data into ML systems in real time.
    Data Engineering Tools:
    - Some experience with data pipelines and ETL.
    - Knowledge of tools like Apache Airflow, Kafka, or dbt, and how they integrate into ML pipelines.
    - Understanding of data warehousing concepts (Snowflake, BigQuery) and how processed data is used for model training.
    Versioning & Experiment Tracking:
    - Experience with ML experiment tracking and model registry tools (e.g., MLflow, Weights & Biases, DVC).
    - Ability to ensure that every model version and experiment is logged and reproducible for auditing and improvement cycles.
    Vector Databases & Retrieval:
    - Familiarity with vector databases (Pinecone, Weaviate, FAISS) and retrieval systems used in conjunction with LLMs for augmented generation is a plus.
    High-Performance Computing:
    - Exposure to HPC environments or on-prem GPU clusters for training large models.
    - Understanding of how to maximize GPU utilization, manage job scheduling (with tools like Slurm or Kubernetes operators for ML), and profile model performance to remove bottlenecks.
    Continuous Learning:
    - Up to date with the latest developments in MLOps and LLMOps (Large Language Model Ops).
    - Active interest in new tools or frameworks in the MLOps ecosystem (e.g., model optimization libraries, new orchestration tools) and a drive to evaluate and introduce them to improve the processes.

    Responsibilities:
    - Design and implement modern, scalable ML infrastructure (cloud-native or on-premises) to support both experimentation and production deployment of NLP/LLM models. This includes setting up systems for distributed model training (leveraging GPUs or TPUs across multiple nodes) and high-throughput model serving (APIs, microservices).
    - Develop end-to-end pipelines for model training, validation, and deployment.
    - Automate the ML workflow from data ingestion and feature processing to model training and evaluation, using technologies like Docker and CI/CD pipelines to ensure reproducibility and reliability.
    - Collaborate with Data Scientists and ML Engineers to design MLOps solutions that meet model performance and latency requirements.
    - Architect deployment patterns (batch, real-time, streaming inference) appropriate for various use-cases (e.g., a real-time chatbot vs. offline analysis).
    - Implement and uphold best practices in MLOps, including automated testing of ML code, continuous integration/continuous deployment for model updates, and rigorous version control for code, data, and model artifacts.
    - Ensure every model and dataset is properly versioned and reproducible.
    - Set up monitoring and alerting for deployed models and data pipelines.
    - Use tools to track model performance (latency, throughput) and accuracy drift in production.
    - Implement logging and observability frameworks to quickly detect anomalies or degradations in model outputs.
    - Manage and optimize our Kubernetes-based deployment environments. Containerize ML services and use orchestration (Kubernetes, Docker Swarm, or similar) to scale model serving infrastructure.
    - Handle cluster provisioning, health, and upgrades, possibly using Helm charts for managing LLM services.
    - Maintain infrastructure-as-code (e.g., Terraform, Ansible) for provisioning cloud resources and ML infrastructure, enabling reproducible and auditable changes to the environment.
    - Ensure the infrastructure is scalable, cost-effective, and secure.
    - Perform code reviews and guide other engineers (both MLOps and ML developers) on building efficient and maintainable pipelines.
    - Troubleshoot issues across the ML lifecycle, from data processing bottlenecks to model deployment failures, and continuously improve system robustness.

    The company offers:
    - Competitive salary.
    - Equity options in a fast-growing AI company.
    - Remote-friendly work culture.
    - Opportunity to shape a product at the intersection of AI and human productivity.
    - Work with a passionate, senior team building cutting-edge tech for real-world business use.
