Atif Afsar

Work Experience

Total years of experience: 13 years, 3 months

Technology Expert at ThoughtWorks

India - Gurgaon
August 2022 to present

● Led a team of 20 on observability, reliability, and resilience projects for multiple clients, each presenting distinct challenges and opportunities for improvement.
● Managed the resolution of complex service reliability problems, frequently in collaboration with multiple customer teams.
● Took ownership of roadmaps outlining the improvements and enhancements planned for customers.
● Provided consultancy to other ThoughtWorks clients, designing and implementing resiliency patterns that kept systems responsive, reliable, and available in the face of unforeseen challenges or failures.
● Conducted SLO workshops for numerous ThoughtWorks SRE projects. These included implementing quality gates based on SLIs, SLOs, and SLAs, and establishing error budgets.
● Shaped ThoughtWorks SRE solutions by creating a comprehensive service delivery framework, developing proposals, and defining archetypes that reflected clients' needs and challenges. These efforts produced SRE solutions tailored to each client's requirements that delivered measurable business value.

Senior SRE Project Manager at Yatra

India - Gurgaon
February 2021 to August 2022

● Continually assessed, developed, and delivered the SRE strategy while overseeing the day-to-day IT operational performance of the end-to-end infrastructure and flight services platforms.
● Partnered with leaders in product, engineering, business, and operations to identify and address risks, performance bottlenecks, and system limits before they led to large-scale issues. Also managed SLOs and error budgets for service teams.
● Responsible for the availability and reliability of applications, in particular driving incident and problem RCAs toward cost-effective solutions.
● Integrated AWS OpenSearch to use its machine learning capabilities (based on Random Cut Forest) for automatic anomaly detection.
● Collaborated with various flights teams to understand pain points and identify where SRE could remove toil and manual, repetitive tasks.
● For 8 months, I was an active member of Yatra’s Change Advisory Board, where I drove improvements in release and change management aligned with SRE principles. I led initiatives like implementing full-stack observability, adopting progressive deployment strategies, and optimizing incident response. These changes significantly enhanced system reliability and aligned us better with our Service Level Objectives (SLOs).
● Provided extensive root-cause analysis and recommendations for issues identified during proactive monitoring in the field.
● Adhered to problem management practices focused on root-cause analysis and prevention of future problems.
● Built strong partnerships with external vendors (payment gateways, airline suppliers, etc.) to ensure platform stability and exceed the expectations of internal business partners.
● Defined and promoted Observability Driven Development (ODD) standards across the organization.
● Implemented monitoring and alarming at the operational and service levels for infrastructure and applications (physical and virtualized hardware, network and communications equipment, operating systems, databases, application servers, etc.).
● Consulted with microservice teams to drive reliability efforts, including monitoring, alerting, deployment practices, application tuning, and chaos resiliency. This effort significantly increased the percentage of critical infrastructure teams with effective monitoring and alerting.

Senior Project Manager at Heal Inc

India - Gurgaon
March 2011 to February 2021

● SRE, Product Reliability & High Availability - Led a 24x7 team of 10 site reliability engineers accountable for handling latency, availability, quality, and saturation issues in a Kubernetes-enabled DevOps environment. Planned and provided high availability capabilities, including equipping and training the SRE team with the tools needed to troubleshoot, research, analyse, and diagnose complicated technical issues by diving into backend systems and logs. Also responsible for designing and documenting procedures, analyses, insight models, and alerting techniques to ensure services work optimally, including error and fault tracking, data audits, and alerting triggers.
● Product Implementation, Management & Delivery - Complete end-to-end product management of APM (Appnomics home-grown tool) from implementation to delivery. Actively involved in creating use cases, identifying product gaps, understanding business requirements/issues from stakeholders, and turning them into solutions.
● Incident Detection, Response & Risk Management - Took a central role as Incident Manager at Yatra.com and OBC Bank for production-critical incidents, focusing on minimizing MTTR and MTTD. Provided 24x7 on-call emergency response and support as necessary for critical incidents. Acted as an escalation point/SPOC for critical issues related to bank services, delivery channels, airline suppliers, and payment gateways. Performed periodic risk assessments of service availability and processes to identify emerging risks and gaps.
● Data Ingestion & Architecting Solutions - Years of experience building solutions around site reliability, integrating APIs, enhancing alert capabilities, and identifying anomalies. Built applications such as a central incident management system that ingests all incidents into a single window displaying issue correlation and root cause. Wrote several scripts to automate payment gateway switching. Built several web bots using Python and Selenium to check the reliability of airline pricing, and consumed the Telegram API for alerting. Integrated Xdistributor (flight middleware) via its API with Elasticsearch/Kibana to measure various supplier KPIs (Amadeus, Navitaire, Galileo, etc.). Above all, developed a highly regarded AIOps-like alerting system to find anomalies in LOBs.
● Business Continuity Management & Disaster Recovery - Led and drove business continuity management engagements, supporting Yatra and OBC Bank in their continuity and resilience needs. Defined business continuity strategies based on the results of business impact analyses and risk assessments, and drafted business continuity plans (BCPs) in line with the defined strategy. Also ensured that the IT disaster recovery plans for DR-protected systems, associated procedures, and supporting documentation were maintained, tested, and improved over time.

Education

Master's degree, MBA

at Swami Vivekanand Subharti University
April 2011

Specialties & Skills

IT Service Management

Training and Certifications

RHCE (Certificate). Date attended: December 2012. Valid until: February 2013.

Certified Ethical Hacker (Certificate). Date attended: March 2013. Valid until: March 2013.

ITIL V3 Certified (Certificate). Date attended: August 2009. Valid until: September 2009.

Masters Diploma in Internet Architecture (Certificate). Date attended: March 2004. Valid until: April 2004.
