Senior Python Developer with Big Data experience (PySpark) (57) Offline

Data engineers use big data technology to create best-in-industry analytics capability. This position is an opportunity to use Hadoop and Spark ecosystem tools and technology for micro-batch and streaming analytics. The work includes ingestion, standardization, metadata management, business rule curation, data enhancement, and statistical computation against data sources that include relational, XML, JSON, streaming, REST API, and unstructured data. The role is responsible for understanding, preparing, processing, and analyzing data to drive operational, analytical, and strategic business decisions.

About the Customer:

The customer is an American company based in Chicago. It accelerates digital transformation for the insurance and automotive industries with AI, IoT and workflow solutions.

About the Project:

The customer has been working on an analytics platform since 2018. The platform runs on Hadoop and the Hortonworks Data Platform, and the customer plans to move it to Amazon EMR in 2021. Data from all of the customer's products flows into a single data lake on this platform, which also lets the customer run next-generation analytics on the amassed data.

Architecture:

Hortonworks is the current vendor and will be replaced by Amazon EMR. Tableau will be the BI vendor. MicroStrategy is currently in use and will be phased out by early 2023.

All data is sent to the data lake, which lets the customer do industry reporting. The data is also used by a data science team to build new products and an AI model.

We will be moving to real-time streaming using Kafka and S3. We are running a proof of concept (POC) to evaluate Dremio and Presto as the query engine.

We're migrating to version 2.0 using Amazon EMR and S3, and the query engine work falls under the 2.0 project.

Project Advantages:

Cross-product analytics

Analytics for every new product the customer launches. The analytics team's products are how the customer demonstrates each product's value to clients

Quarterly Business Review meetings use data to explain how the customer’s products help clients in their business

You'll get to work with a cross-functional team

You'll learn the customer’s business

Project Tech Stack:

The technologies used are all open source: Hadoop, Hive, PySpark, Airflow, and Kafka, to name a few

Project Stage:

Active Development

Must Have Qualifications:

Proficiency in Python and PySpark

3+ years of experience building, maintaining, and supporting complex data flows with structured and unstructured data

Experience working with distributed applications

Experience working with big data tools such as HDFS, Hive, or Sqoop

Ability to use SQL for data profiling and data validation

Master’s or Bachelor’s degree

Nice to have (candidates without this experience are expected to be willing to learn or to have experience with similar tools, because these tools are part of the project ecosystem):

Understanding of the AWS ecosystem and services such as EMR and S3

Familiarity with Apache Kafka and Apache Airflow

Experience with Unix commands and scripting

Experience and understanding of Continuous Integration and Continuous Delivery (CI/CD)

Understanding of performance tuning in distributed computing environments (such as a Hadoop cluster or EMR)

Familiarity with BI tools (such as Tableau or MicroStrategy)

Responsibilities:

Build end-to-end data flows from sources to fully curated and enhanced data sets. This can include the effort to locate and analyze source data, create data flows to extract, profile, and store ingested data, define and build data cleansing and imputation, map to a common data model, transform to satisfy business rules and statistical computations, and validate data content

Modify, maintain, and support existing data pipelines to provide business continuity and fulfill product enhancement requests

Provide technical expertise to diagnose errors reported by production support teams

Coordinate with on-site teams and work seamlessly with the US team

An ideal candidate will develop and maintain exceptional SQL code bases and expand our capability through Python scripting

English level:

Intermediate+

Advantages of Working with Exadel:

You'll build your expertise with Sales Support, which provides assistance with existing and potential projects

You can join any Exadel community or create your own to communicate with like-minded colleagues

There are opportunities for continuing education as a mentor or speaker

You can take part in internal and external meetups as a speaker or listener

You'll have the chance to improve your English skills with the help of native speakers

We participate in cultural, sport, charity, and entertainment events, and we'd love to have you there, too!

The job ad is no longer active
Job unpublished on 10 October 2021