Senior Design Engineer
British Telecom
Total years of experience: 13 years, 4 months
Managing a team that is focused on software deployment and automation in compliance with PCI (Payment Card Industry) norms.
Take ownership of all existing products and solutions utilized by BT customers.
Obsessed with uptime of the solutions that fall under my responsibility.
Docker configuration, support and design, including a private internal registry solution.
Puppet experience, configuration and setup.
Ansible, AWX/Ansible Tower
Spacewalk, Nexus.
Shell scripting/ Python development.
Software versioning - GitLab
Jenkins
PCI DSS compliance
VMware ESXi Template creation.
• Worked in an agile development environment, collaborating with application teams.
• Collaborated with internal and external teams on infrastructure development.
• Supported high-end problem management, identifying complex problems and developing workarounds and resolutions.
Part of a team that is focused on automation and integration.
GIT (Version Control)
Puppet
Docker
Nagios/Sensu Server Monitoring System
Bash
Python
VMware/vSphere
EMC Networker (Backups)
- Currently a member of DevOps team which manages Souq's production infrastructure.
- Experience in working with globally distributed, multicultural teams operating in different time zones.
- Worked in environments with ITIL based change management procedures as well as with fast paced, startup environments.
- Worked on AWS Cloud.
- Experience with configuration management with Puppet.
- Experience with setting up monitoring and trending.
- Expertise in RHEL/CentOS, Debian/Ubuntu. Red Hat Certified Engineer (RHEL 6).
- Python/Bash
Monitoring and fixing various Linux-based server related issues.
Take ownership and follow up on recurring issues.
Adjust monitoring as required to reduce the level of ‘noise’ and non-actionable alerts.
Tracking and troubleshooting major website outages.
Onboarding alerts pertaining to a particular component.
Refine alerting priority to align with SLA.
Troubleshooting network-related issues.
Automation using Python and Bash.
Configuration Management System - Puppet
Fulfillment of Site Up responsibility.
Participate in incident postmortem discussion.
Maintain runbooks and technical documentation that help in troubleshooting issues.
Configure monitoring and reporting tools such as Nagios.
• Take individual responsibility for stabilizing incidents escalated by the Operations Center (OC).
• Perform advanced level troubleshooting on issues escalated by OC or partner.
• Provide world-class customer service and support our partners at every opportunity.
• Provide a rapid response to escalations, reducing response time and mean time to resolution (MTTR).
• Aggressively troubleshoot and multitask incidents of varying difficulty and priority with a focus on prioritization of tasks, ensuring that higher priority items are addressed first.
• Maintain runbooks and technical documentation that help in troubleshooting issues.
• Automation.
• Participate in postmortem discussion.
• Attend change management reviews for supported properties.
• Perform changes which are in line with Site Up responsibility.
• Maintain the strict SLAs for bugs outlined in the bug management guidelines documentation.
• Take ownership and follow up on recurring issues during daily stand-ups and monthly bug/incident review meetings with the Service Engineering Team (L3).
• Attend daily stand-ups, provide details of events worked on, and highlight recurring issues for a permanent solution with the Dev team.
• Partner appropriately with the onboarding team when moving alerts and/or monitoring activities from SE to the SRE team.
• Adjust monitoring as required to reduce the level of ‘noise’ and non-actionable alerts.
• Continuously refine property monitoring by changing alert thresholds and developing alert correlation to reduce mean time to detect.
• Continuously refine alerting priority to align with SLAs and proper urgency.
• Conduct OC (L1) training and knowledge transfer for properties transitioning from SRE (L2) to OC.
• Write knowledge base steps for alerts transitioned to OC. Conduct training on new technologies in various properties or as requested by OC or SE.
• Set up and configure monitoring and reporting tools such as Nagios.
• Complete all assigned monitoring change bugs by the due date.
• Build Domain Knowledge in peripheral technologies to assist with property dependency incidents.
• Maintaining 100+ servers (Linux, Apache, MySQL, PHP stack) along with DNS and email servers.
• Monitoring services using Nagios, a network monitoring application.
• Server and site account migrations. Managing partitions and filesystems, with fair knowledge of physical volumes, logical volumes and volume groups.
• Network configuration and troubleshooting.
• Configuring Network File Sharing services.
• Setting up new servers and performing all required installations, configuration and setups.
• Constantly monitoring and investigating server resource abuse.
• Basic MySQL tuning and database management as per customer needs and requirements.
• Performing application/user account migrations among servers to ensure better capacity management.
• Server audits and monitoring tools configuration and management.
• Suggesting, analyzing, and monitoring server audits, security checks and upgrades.
• Resolve service requests submitted by users as per the SLA.
• Setting up DNS zones and DNS records such as CNAME, A, MX, PTR and SPF, setting up rDNS records, and fixing DNS issues for customers.
• Troubleshooting email server related issues.
• Installing and managing software firewalls (CSF, APF, iptables) on the servers.
• Worked on cPanel based webservers.
• Worked on an internal PHP development project: a process- and SLA-driven work management system that monitors the productivity, performance and technical know-how of our technicians round the clock. My contribution was the candidate/employee technical evaluation component, an online skill-testing interface for all new technicians.
Visvesvaraya Technological University, Belgaum, Karnataka