IOIX Ukraine

Joined in 2019
100% answers

We have specialized in data mining, scraping, and news analysis systems for major Japanese clients for 20 years.

  • Web Crawling/scraping Engineer (5+), Google, Twitter platforms to $5000

    Part-time · Full Remote · Ukraine · Product · 5 years of experience · Intermediate

    Bypassing Google CAPTCHA and Twitter's hidden bans for data collection through crawling the search results on these platforms.

     

    Preference will be given to candidates with ready-made solutions who can demonstrate their work. Respond to this job offer only if you comply.

     

    Work Experience: Minimum of 5 years of experience in web crawling and automation.

     

    Key responsibilities include collecting statistics on various types of blocks and developing automated tests to check the responses of websites. The specialist also works on automating data structure management, such as user profiles.

     

    An important part of the job is simulating user behavior on Google, social networks, and other sites, including with the use of artificial intelligence. The specialist should not be merely a consultant-analyst: they must be able to gather and analyze data independently, rather than just requesting prepared information about the frequency of CAPTCHA blocks and the types of CAPTCHAs shown to different users.

     

    Candidate Requirements:

     

    Experience:

    • Proven experience in developing web scrapers and crawlers for collecting data from complex web applications (5+ years).
    • Deep understanding of the principles of HTTP, HTML, CSS, and JavaScript.
    • Experience with HTML parsing tools (Beautiful Soup, lxml, etc.).
    • Experience with browser automation tools (Selenium, Puppeteer).
    • Experience in bypassing blocks and CAPTCHAs.
    • Experience with proxy servers and VPNs.

       

    Skills:

    • Ability to analyze the structure of web pages and APIs.
    • Ability to develop effective and reliable blocking bypass algorithms.
    • Ability to work with asynchronous requests.
    • Ability to write clean, maintainable, and well-documented code.

     

  • LLM QA Engineer / Analyst to $2500

    Part-time · Full Remote · Ukraine · Product · 3 years of experience · Intermediate

    We're looking for an LLM QA Engineer/Analyst to analyze and improve the quality of LLM results. This role involves collaborating with LLM engineers to enhance overall performance.

     

    The LLM QA engineer has basic knowledge of testing, training, and fine-tuning language models, can assess results immediately and make the necessary adjustments, and helps identify the model best suited to our needs in terms of quality and performance.

     

    Many pre-trained models with good performance are already available.

     

    The work strategy for LLM QA involves:  
    - conducting quick tests,  
    - making comparisons,  
    - selecting the best pre-trained model for fine-tuning,  
    - using advanced LLMs such as Anthropic's Claude or OpenAI's o1 to create a training dataset,  
    - fine-tuning our selected pre-trained model, like Llama, to achieve optimal results.

     

    Key Responsibilities:

    • Conduct automated and manual checks to verify model responses.
    • Prepare automated data sets from raw data according to specified requirements.
    • Create automation scripts for statistical analysis to handle typical requests, such as:
      • Identifying the most and least popular items from a data set.
      • Sorting or filtering data sets based on predefined criteria.
    • Develop automated tests for large data sets, ensuring success based on simple criteria, such as predefined numeric or textual values, or ranges of values.
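
    The statistical scripts and range-based tests described above might look roughly like the following. This is a minimal sketch, not the team's actual tooling: the function names, the field names, and the rule format (a set of allowed values or an inclusive numeric range) are assumptions made for illustration.

```python
from collections import Counter


def popularity_extremes(items):
    """Return ((most_common_value, count), (least_common_value, count))."""
    ranked = Counter(items).most_common()  # sorted by count, descending
    if not ranked:
        raise ValueError("items must not be empty")
    return ranked[0], ranked[-1]


def check_record(record, rules):
    """Return the names of fields that violate their rule.

    A rule is either a set of allowed values or a (low, high) tuple
    describing an inclusive numeric range (hypothetical rule format).
    """
    failures = []
    for field, rule in rules.items():
        value = record.get(field)
        if isinstance(rule, tuple):
            low, high = rule
            ok = isinstance(value, (int, float)) and low <= value <= high
        else:
            ok = value in rule
        if not ok:
            failures.append(field)
    return failures


if __name__ == "__main__":
    rules = {"score": (0.0, 1.0), "label": {"YES", "NO"}}
    print(popularity_extremes(["a", "b", "a", "c", "a", "b"]))
    print(check_record({"score": 1.2, "label": "YES"}, rules))
```

    Returning the failing field names, rather than a single pass/fail flag, makes it easier to aggregate statistics over a large data set.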

    Candidate Selection Criteria:

    • A portfolio demonstrating experience with similar tasks.
    • Code examples of automation that can be reviewed.
    • Platform: Linux. Language preferences include Python, Bash, Java, and JavaScript. Lesser interest in Ruby, Rust, Go, and other rising technologies.

     

    Respond to this job offer with a list of your skills and experience that match what is required to handle such tasks.

  • Senior LLM engineer - custom LLMs creation and fine-tuning

    Part-time · Full Remote · Worldwide · Product · 3 years of experience · Pre-Intermediate

    Custom LLM Models Creation and Fine-Tuning

    (for Japanese and English social media content)

     

    The project aims to develop a custom machine learning model to 

    1) accurately detect country names and 

    2) determine if texts pertain to Japanese national elections

    ... improving upon common issues found in standard language models.

     

    To respond to this offer, list the competencies needed for this task and confirm they align with your skills and experience. 

     

    Introduction

    Custom model tuning aims to enhance accuracy in entity detection and handling special cases. Standard models like Llama, Mistral, Llama3, QWQ, and Gemma often face several issues:

    1. Inaccurate or vague responses unsuitable for extracting entity names or feature values.
    2. Inconsistent response formats that complicate parsing.
    3. Incorrect outputs in feature detection and attribute tasks.
    4. Semantic errors in entity detection and evaluation tasks.

     

    The custom model tuning effort addresses problem #3. This is a relatively simple task for an LLM, typically solved quickly (less than 2 seconds per request) when filtering arrays of 10-100 thousand elements.

     

    Task Examples: 

    1. Identify countries referenced in the text, either directly or indirectly. 

    2. Assess whether the text pertains to national elections in Japan. 

    Both tasks analyze Japanese or mixed Japanese-English texts up to 500 characters from Twitter, encompassing official news, personal opinions, and dialogues. 

     

    IMPORTANT NOTE: Identifying a country's name in the text is straightforward and can be handled effectively with regular expressions, since it is a conventional NLP task rather than one that needs an LLM. The challenge lies in detecting indirect references to a country, as specified in the task definition. For instance, mentions of the "White House" or a US political party point to the USA, while the name of a Japanese political party indicates Japan. This is the main difficulty of the task. 

     

    Likewise, an explicit mention of a country, language, or nationality should preferably be left out of the list when it is not the subject or object that forms the main part of the sentence, or its semantic agent. 

     

     Country List Detection with LLM model: 

    Common errors: 

    • Including geographical locations (regions, prefectures) 
    • Including continents and organization names 
    • Missing indirect references 
    • Incorrect detection based on political parties, positions, politician names 

    Result: delimited list of country names in Japanese (0 to 4 elements) 
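
    A reply in this format can be checked mechanically before it is accepted. The sketch below assumes the delimiter is a comma (ASCII, full-width, or the Japanese "、"), which the description does not specify; the function name is hypothetical.

```python
import re

MAX_COUNTRIES = 4  # upper bound stated in the task description


def parse_country_list(raw):
    """Split a model reply into country names, keeping the original order.

    Empty entries and duplicates are dropped. Raises ValueError when the
    reply holds more names than the task allows.
    """
    parts = [p.strip() for p in re.split(r"[,、，]", raw)]
    names = []
    for part in parts:
        if part and part not in names:
            names.append(part)
    if len(names) > MAX_COUNTRIES:
        raise ValueError(
            f"expected at most {MAX_COUNTRIES} countries, got {len(names)}"
        )
    return names


if __name__ == "__main__":
    print(parse_country_list("日本、アメリカ"))
```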

     
    Election Topic Detection with LLM model: 

    Task: determine if Japanese national elections are the main text topic 

     

    Positive cases include: voting processes, preparation, election campaigns, program discussions 

    Common errors: 

    • Wrong country identification 
    • Incorrect election type detection 
    • False positives on keyword mentions 

    Result: clear boolean format (YES/NO)  
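
    Because inconsistent response formats are one of the listed problems, a strict parser for the YES/NO result can help surface malformed replies instead of silently guessing. This is a minimal sketch; the function name and the choice to return None for anything other than a clean YES or NO are assumptions.

```python
def parse_yes_no(raw):
    """Map a model reply to True/False; return None if it is not a clean YES/NO."""
    answer = raw.strip().strip(".!\"'").upper()
    if answer == "YES":
        return True
    if answer == "NO":
        return False
    return None  # anything else counts as a format error


if __name__ == "__main__":
    for reply in ["YES", " no.\n", "Yes, the text is about elections"]:
        print(repr(reply), parse_yes_no(reply))
```

    Counting the None results over a test set gives a simple measure of how often a model breaks the required format.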

     

    Platforms and Tools

     

    The main analysis tool is Ollama, utilized as both a CLI and REST HTTP service for experimental research and routine processing. Ideally, the custom model should be manageable via Ollama, though this is not mandatory. If Ollama is unavailable, the model must offer a REST HTTP service for local deployment on a dedicated server.
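
    For illustration, a call to Ollama's REST service could be sketched as follows, using only the Python standard library. The address is Ollama's default local one and `/api/generate` is its text generation endpoint; the model name in the usage example and the 2-second timeout (taken from the response-time target mentioned earlier) are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=2.0):
    """Send one prompt and return the model's text reply."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


if __name__ == "__main__":
    # "llama3" is a placeholder; use whichever model is installed locally.
    print(generate("llama3", "Reply with YES or NO: is this text about elections?"))
```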

     

    Datasets of 10 000 items will be provided, along with small sets (10-100) of typical error cases for selective testing.

    That is, we need to split the task into two stages or steps: 

    1. Preparation of quality datasets for training 
    2. Training, tuning, or creating a model from scratch. 
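
    The small sets of typical error cases lend themselves to a simple regression check that can be rerun after each tuning round. The sketch below is an assumption about how such a check could be organized: `predict` stands for any function that wraps the model, and each case is a (text, expected) pair.

```python
def evaluate(predict, cases):
    """Run predict(text) on each (text, expected) pair.

    Returns (accuracy, failures), where failures lists
    (text, expected, got) for every mismatch.
    """
    failures = []
    for text, expected in cases:
        got = predict(text)
        if got != expected:
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures


if __name__ == "__main__":
    # A naive keyword baseline, shown only to illustrate a false positive.
    def keyword_baseline(text):
        return "選挙" in text

    cases = [
        ("衆議院選挙の投票日", True),
        ("今日の天気", False),
        ("学校の選挙", False),
    ]
    print(evaluate(keyword_baseline, cases))
```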

    _______

     

    Candidate requirements

     

    The candidate should have strong experience with these specific tools rather than a broader but shallower knowledge of many frameworks. This focused approach aligns better with the project's specific goals of building a fast, accurate multilingual classification system.

     

    Machine Learning & NLP Expertise:

    • Strong background in Natural Language Processing (NLP) and text classification
    • Experience with multilingual text processing (specifically Japanese and English)
    • Proficiency in developing and fine-tuning machine learning models
    • Knowledge of modern language models and their applications

     

    Programming & Tools:

    • Proficiency in Python (or C++, Ruby, JavaScript, or another language platform depending on a framework and model basics) and relevant ML/NLP libraries
    • Experience with text processing and classification frameworks
    • Familiarity with large language models (LLMs)

     

    Data Processing Skills:

    • Experience in handling multilingual datasets
    • Ability to work with various data formats and sources
    • Knowledge of data cleaning and preprocessing techniques
    • Experience with social media data processing (particularly Twitter/X data)

     

    Performance Optimization:

    • Ability to optimize models for speed (requirement of 2-second response time)
    • Experience in handling large-scale data processing (10,000-100,000 items)
    • Skills in model optimization and efficiency improvement

     

    Task-Specific Experience:

    • Text classification and entity recognition (specifically for country detection)
    • Context-based classification (such as election-related content detection)
    • Experience with short-text classification (500 characters or less)

     

    Language Requirements:

    • Proficiency in Japanese language processing
    • Experience with mixed language content (Japanese-English)
    • Understanding of multilingual NLP challenges

     

    Education and Experience:

    • Master's or Ph.D. in AI/LLMs
    • Minimum 3-5 years of experience in LLM and ML/NLP development
    • Demonstrated experience with similar text classification projects, with examples of specialized models fine-tuned or created from scratch 

     

    Additional Desired Qualifications:

    • Experience with Japanese language NLP tools and frameworks
    • Knowledge of social media content analysis
    • Background in building production-ready ML systems
    • Understanding of ML model deployment and scaling

     

    1.  Experience with Major LLM Platforms and Their Tools:

    •  Meta's LLaMA ecosystem (especially LLaMA 2)

    •  Experience with Ollama deployment and management

    •  Knowledge of other major LLMs: OpenAI API, Anthropic Claude, Cohere

    •  Understanding of open-source LLM deployment and fine-tuning

     

    2.  Fine-tuning and Adaptation Skills:

    •  Experience in adapting pre-trained models for specific tasks

    •  Knowledge of efficient fine-tuning techniques (LoRA, QLoRA, PEFT)

    •  Understanding of prompt engineering and few-shot learning

    •  Experience with model quantization and optimization

     

    3.  Practical Skills:

    •  Ability to evaluate and choose appropriate base models

    •  Experience in model deployment and serving

    •  Knowledge of cost-effective approaches to model adaptation

    •  Understanding of inference optimization techniques

     

    4.  Task-Specific Requirements:

    •  Experience with multilingual models (Japanese-English specifically)

    •  Knowledge of entity recognition fine-tuning

    •  Understanding of context classification 

    •  Experience with short text processing optimization

     

    The task of detecting country names in text is called Named Entity Recognition (NER). Extracting them when they are explicitly present is Named Entity Extraction (NEE); when entities are not explicitly mentioned and must be formulated from context, it is Named Entity Identification (NEI). 

     

    Manual verification of each result is very important because otherwise, even a small percentage of incorrect data can significantly damage the trained model and reduce the statistical quality of its future performance. 

     

    The first task is preparing quality samples for model training. It focuses on the end result: quality datasets with verified results for both tasks, the country list and the Japanese elections topic. 

     

    We expect quality datasets of various sizes suitable for LLM training as output. 


    We can stipulate that the first small dataset (0.5-2K) should be provided within a week, and larger ones (>5K) later... 

    A sample dataset will be provided in JSON format, filtered by topic and likely to contain target entities: 
     

    - 55K element archive 
    - 1.5K element dataset with results from different models for the country list task, which we already have and which may be useful for comparison examples. 
     

    The task should be split into two stages or steps: 

    1. Preparation of quality datasets for training 
    2. Training, tuning, or creating a model from scratch. 