Incident Response Manager - Infrastructure Engineering
ByteDance
Responsibilities
The Data Systems Infrastructure (DSI) team sits within the ByteDance global technology structure and supports the company's fast growth by building and operating hyper-scale datacenters, managing the life cycle of the server fleet, providing cloud solutions, and developing various infrastructure services, making sure they are scalable and reliable.
We are seeking a technically skilled and detail-oriented professional to serve as a front-line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will have a strong foundation in facility operations, broad knowledge across IT, infrastructure, or engineering disciplines, experience in critical environments, and the ability to analyze incidents, manage them calmly, identify trends, and drive sustained improvements. This role requires performance under pressure, data-driven thinking, and a proactive approach to continuous improvement and operational resilience.
- Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as Server Automation, Data Center Infrastructure Management, Network monitoring, Grafana, and related systems.
- Respond promptly to events including but not limited to:
  - Data Center Environmental systems (e.g. high temperature, humidity, power fluctuations or failures)
  - IT infrastructure (e.g. server performance issues, network outages, system failures)
  - Facility and environmental alerts relevant to operations (e.g. flooding)
  - External Facing Services (e.g. colocation maintenance notices, service requests from CDN partners, and critical notifications)
- Conduct detailed investigations to diagnose the root cause of events, assess their impact, and determine appropriate response actions.
- Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks. Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.
- Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.
- Manage incidents efficiently, performing in-depth investigations to determine root causes and impacts, while promptly engaging and coordinating with the designated resolver teams to facilitate timely resolution.
- Draft detailed incident reports and conduct post-mortem reviews to document lessons learned.
- Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.
- Analyze trends and patterns in events to identify opportunities for improvement and optimization.
- Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.
- Develop and maintain a comprehensive library of Standard Operating Procedures (SOPs), Methods of Procedure (MOPs), runbooks, and operational guides to ensure consistency and readiness across teams.
- Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance. Collaborate with cross-functional teams to implement engineering solutions and process optimizations.
- Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.
Qualifications
Minimum Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field.
- Strong technical background, with priority given to experience in Data Center Facility Operations Center (DC FOC) management; experience in IT infrastructure, network operations, or systems monitoring is also desirable.
- Proven ability to analyze complex systems, investigate incidents, and identify root causes effectively.
- Familiarity with monitoring and alerting tools such as Grafana, Nagios, or similar platforms.
- Experience in incident and problem management processes, with the ability to drive corrective actions and coordinate cross-functional teams.
- Strong communication skills to draft reports, conduct reviews, and liaise with technical and non-technical stakeholders.
- Proactive mindset with a focus on continuous improvement and operational excellence.
Preferred Qualifications:
- 5+ years of experience in IT environments, such as data centers or enterprise systems, combined with hands-on incident and problem management experience.
- Proven experience in facility management across mechanical, electrical, and plumbing (MEP) systems.
- Proven ability to perform effectively under pressure and within tight time constraints to resolve issues and meet deliverables.
- Hands-on experience with ticketing systems, monitoring tools such as Grafana, server infrastructure, and data center systems.
- Working knowledge of and/or certifications in one or more of the following: ITIL Foundation, CompTIA Server+, Schneider Electric Data Center Certified Associate (DCCA), Cisco Certified Network Associate (CCNA), Project Management Professional (PMP), or Data Analytics and Visualization tools or methodologies.
- Demonstrated experience in driving or contributing to improvement projects focused on operational efficiency, security enhancements, or infrastructure reliability.
- Ability to manage multiple tasks and projects, ensuring timely delivery and alignment with organizational goals.
- Strong adaptability and problem-solving skills in ambiguous and rapidly changing environments.
- Willingness to be on call during weekends, nights, and holidays.
Job Information
About Us
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.