Senior AI Engineer
TARGET
Total years of experience: 10 years, 5 months
• Led a team of 4 engineers in developing and maintaining a high-performance, distributed feature dataset with more than 200 features
• Built and managed data pipelines that extracted data from diverse sources, transformed it into usable formats, and loaded it into storage and analytics platforms
• Designed and maintained a framework for automated ETL processes, ensuring reliable execution of data integration and transformation tasks while minimizing manual intervention
• Led the migration of on-premise data systems to Google Cloud Platform (GCP) with minimal disruption, improving efficiency of data storage, processing, and management
• Developed a library of common PySpark functions and deployed it to a shared virtual environment used across multiple teams, reducing development time
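A minimal sketch of the kind of helper such a shared library might contain; the function name and normalization rule are illustrative assumptions, not the actual library's API. The core logic is kept as a plain Python function so it can be unit-tested without a Spark session and then registered as a UDF via `pyspark.sql.functions.udf` where needed.

```python
import re

# Hypothetical shared helper: in the real library, a function like this
# would be wrapped as a PySpark UDF; the pure-Python core stays testable.
def normalize_column_name(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs to underscores."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()
```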
• Instituted data quality measures, data governance protocols, and data validation checks, achieving a 60% reduction in data errors and improving the accuracy of downstream analyses and decision-making
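As an illustrative sketch of a row-level validation check of the kind described above, the field names and rules below are hypothetical, not taken from the actual pipeline:

```python
# Hypothetical row-level validation: returns a list of error strings
# for one record (an empty list means the record passed all checks).
def validate_record(record: dict) -> list:
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    amount = record.get("amount")
    if amount is not None and amount < 0:
        errors.append("negative amount")
    return errors
```

Checks like this are typically run per-batch, with failing records routed to a quarantine table rather than dropped silently.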
• Designed and maintained data warehouses/data lakes to store structured and unstructured data efficiently, defining schemas and optimizing for query performance
• Migrated several CPU-based AI/ML models to GPU execution using Docker, Kubernetes, and GPU Array, improving efficiency across multiple models and reducing runtime
• Provided technical mentorship to junior team members, conducting code and design reviews, and enforcing coding standards and best practices
• Led the migration of petabytes of unstructured/semi-structured data from legacy systems (Teradata, CR, and Informatica) to AWS
• Built and maintained a data lake housing more than 1 PB of data, enabling data-driven decision-making for critical business initiatives
• Developed an efficient framework for staging, cleansing, transforming, and loading data using HDP, HDFS, Spark, Hive, and Sqoop
• Optimized multiple batch and stream processing workflows for increased performance and reliability
• Worked closely with the data science team to understand their requirements and deliver data in the necessary formats
• Designed and implemented a real-time data pipeline for processing semi-structured data, ingesting 150 million raw records from more than 30 data sources using Kafka and PySpark
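A sketch of the kind of flattening step such a pipeline applies when landing semi-structured Kafka payloads; the JSON layout and function name are assumptions for illustration. In the actual pipeline, logic like this would run inside a PySpark transformation over the Kafka message stream:

```python
import json

# Hypothetical message-flattening step: parse one JSON Kafka payload
# and flatten nested keys into dotted paths for tabular landing.
def flatten_event(raw: bytes) -> dict:
    event = json.loads(raw)
    flat = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for k, v in value.items():
                walk(f"{prefix}.{k}" if prefix else k, v)
        else:
            flat[prefix] = value

    walk("", event)
    return flat
```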
• Developed an in-house Python library for parsing and reformatting data from external vendors, reducing the data pipeline error rate by 7%
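A minimal sketch of the kind of vendor-file reformatting such a library performs; the pipe delimiter, column handling, and function name are illustrative assumptions, not the library's real interface:

```python
import csv
import io

# Hypothetical vendor-file parser: read pipe-delimited text, normalize
# header names, strip whitespace from values, and skip blank rows.
def reformat_vendor_rows(text: str) -> list:
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = []
    for row in reader:
        cleaned = {k.strip().lower(): (v or "").strip() for k, v in row.items()}
        if any(cleaned.values()):
            rows.append(cleaned)
    return rows
```

Normalizing headers and values at ingest keeps vendor-specific quirks out of downstream transformations, which is one common way such a library lowers pipeline error rates.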
• Created various lambda functions for data cleansing and transformation using Scala and the Spark API