Atif Afsar, Technology Expert

ThoughtWorks

Location: United Arab Emirates
Education: Master's degree, MBA
Experience: 13 years, 3 months


Work Experience

Total years of experience: 13 years, 3 months

Technology Expert at ThoughtWorks
  • India - Gurgaon
  • My current job since August 2022

● Led a team of 20 on observability, reliability, and resilience projects for multiple clients, each presenting distinct challenges and opportunities for improvement.
● Manage the resolution of complex service reliability issues, often in collaboration with multiple customer teams.
● Own the roadmaps that lay out the improvements and enhancements planned for customers.
● Provided consultancy to other ThoughtWorks clients, designing and implementing resiliency patterns that kept systems responsive, reliable, and available through unforeseen challenges and failures.
● Conducted SLO workshops for numerous ThoughtWorks SRE projects. These covered implementing quality gates based on SLIs, SLOs, and SLAs, and establishing error budgets to keep service performance on target.
● Shaped ThoughtWorks SRE solutions by creating a service delivery framework, developing proposals, and defining archetypes that reflected clients' needs and challenges. The resulting SRE solutions were tailored to each client's requirements and delivered measurable business value.

Senior SRE Project Manager at Yatra
  • India - Gurgaon
  • February 2021 to August 2022

● Continually assessed, developed, and delivered the SRE strategy while overseeing the day-to-day IT operational performance of the end-to-end infrastructure and flight services platforms.
● Partnered with leaders in product, engineering, business, and operations to identify and address risks, performance bottlenecks, and system limits before they led to large-scale issues. Also managed SLOs and error budgets for service teams.
● Responsible for the availability and reliability of applications, in particular driving incident and problem RCAs to cost-effective solutions.
● Integrated AWS OpenSearch to use its machine learning capabilities (based on Random Cut Forest) for automatic anomaly detection.
● Collaborated with various flight teams to understand pain points and identify where SRE could remove toil and repetitive manual tasks.
● For 8 months, I was an active member of Yatra’s Change Advisory Board, where I drove improvements in release and change management aligned with SRE principles. I led initiatives like implementing full-stack observability, adopting progressive deployment strategies, and optimizing incident response. These changes significantly enhanced system reliability and aligned us better with our Service Level Objectives (SLOs).
● Provided extensive root-cause analysis and recommendations for issues identified during proactive monitoring in the field.
● Adhered to problem management practices that focus on root cause analysis and prevention of future problems.
● Built strong partnerships with external vendors (payment gateways, airline suppliers, etc.) to ensure platform stability and to exceed the expectations of internal business partners.
● Defined and promoted Observability Driven Development (ODD) standards across the organization.
● Implemented operational-level and service-level monitoring and alerting for infrastructure and applications (physical and virtualized hardware, network and communications equipment, operating systems, databases, application servers, etc.).
● Consulted with microservice teams to drive reliability efforts, including monitoring, alerting, deployment practices, application tuning, and chaos resiliency. This effort significantly increased the percentage of critical infrastructure teams with effective monitoring and alerting.

Senior Project Manager at Heal Inc
  • India - Gurgaon
  • March 2011 to February 2021

● SRE, Product Reliability & High Availability - Led a 24x7 team of 10 site reliability engineers accountable for handling latency, availability, quality, and saturation issues in a Kubernetes-enabled DevOps environment. Planned and provided high availability capabilities, including equipping and training the SRE team with the tools needed to troubleshoot, research, analyse, and diagnose complicated technical issues by diving into backend systems and logs. Also responsible for designing and documenting procedures, analyses, insight models, and alerting techniques to ensure services work optimally, including error and fault tracking, data audits, and alerting triggers.
● Product Implementation, Management & Delivery - Managed the APM product (Appnomic's home-grown tool) end to end, from implementation to delivery. Actively involved in creating use cases, identifying product gaps, understanding business requirements and issues from stakeholders, and turning them into solutions.
● Incident Detection, Response & Risk Management - Served as Incident Manager for Yatra.com and OBC Bank on production-critical incidents, focusing on minimizing MTTR and MTTD. Provided 24x7 on-call emergency response and support as necessary for critical incidents. Acted as the escalation point/SPOC for critical issues related to bank services, delivery channels, airline suppliers, and payment gateways. Performed periodic risk assessments of service availability and processes to identify emerging risks and gaps.
● Data Ingestion & Architecting Solutions - Years of experience building solutions around site reliability, integrating APIs, enhancing alerting capabilities, and identifying anomalies. Built applications such as a central incident management system that ingests all incidents into a single window and displays issue correlation and root cause. Wrote several scripts to automate payment gateway switching. Built several web bots using Python and Selenium to check the reliability of airline pricing, and used the Telegram API for alerting. Integrated Xdistributor (flight middleware) via its API with Elasticsearch/Kibana to measure various KPIs of suppliers (Amadeus, Navitaire, Galileo, etc.). Also developed an AIOps-like alerting system to find anomalies in LOBs, which was highly regarded.
● Business Continuity Management & Disaster Recovery - Led and drove business continuity management engagements, supporting Yatra and OBC Bank in their continuity and resilience needs. Defined business continuity strategies based on the results of business impact analyses and risk assessments, and drafted business continuity plans (BCPs) in line with the defined strategy. Also ensured that the IT disaster recovery plans for DR-protected systems, associated procedures, and supporting documentation were maintained, tested, and improved over time.

Education

Master's degree, MBA
  • at Swami Vivekanand Subharti University
  • April 2011

Specialties & Skills

IT Service Management
IT Management
Project Management
Linux Administrator
MANAGEMENT
RELIABILITY
SERVICE DELIVERY
RESILIENCE
RISK MANAGEMENT
BUSINESS CONTINUITY
KUBERNETES
PERFORMANCE MANAGEMENT
RESEARCH
Network Engineer
Linux Professional
Python Application Developer

Languages

English
Intermediate

Training and Certifications

RHCE (Certificate)
  • Date Attended: December 2012
  • Valid Until: February 2013

Certified Ethical Hacker (Certificate)
  • Date Attended: March 2013
  • Valid Until: March 2013

ITIL V3 Certified (Certificate)
  • Date Attended: August 2009
  • Valid Until: September 2009

Masters Diploma in Internet Architecture (Certificate)
  • Date Attended: March 2004
  • Valid Until: April 2004