
Let’s grow together.
NETSKRT JOB OPENINGS
Live Operations / Systems Reliability Engineer
Netskrt is looking for a Live Operations / Systems Reliability Engineer to join the Live Operations team that oversees our managed service. Netskrt’s eCDN service consists of three major components: intelligent content collection, staging and distribution; adaptive networking, leveraging connectivity as and when available; and an edge cache that allows users to access the content they want locally, using the apps and subscriptions that they already have.
Your prime responsibility and priority is to ensure customer excellence. You are passionate about system reliability, and you use that passion to influence and support the strategic Systems Reliability Engineering mission. As a Live Operations / Systems Reliability Engineer, you are responsible for monitoring and maintaining the health of the system.
We are a highly motivated team, dedicated to delivering products and services that improve the customer experience when accessing internet video at the edges of the network. We are developing a set of inter-related technologies targeting businesses that offer WiFi to their customers but have limited bandwidth.
You are somebody who enjoys solving problems and has a customer-centric mindset. You should be passionate not only about learning new technologies, but also about running systems and software in the real world. You must enjoy a close-knit team environment of shared responsibility, be a team player and a self-starter.
You have exceptional technical skills, and enjoy solving challenging problems. You are a quick learner, you adapt easily and you have great interpersonal and communication skills.
Netskrt offers the opportunity to obtain hands-on experience with storage, networking, security, and cloud technologies. As part of the Netskrt team, you will have the opportunity to design and implement solutions to challenging problems in a startup environment, working with accomplished engineers and a leadership team with a proven track record of success.
Key Responsibilities:
As a Live Operations / Systems Reliability Engineer, you are responsible for monitoring, supporting and maintaining system health, performance and reliability. Your mission is to ensure that our service is fast, highly available, scalable and able to withstand unprecedented load. The Live Operations team will be at the heart of solving deployment and production problems, building automation tools for system health checks, for production acceptance tests that validate production changes, and for ongoing live monitoring and reporting.
You will collaborate closely with the Engineering and Networking & Infrastructure teams to ensure a holistic approach to troubleshooting and implementing preventative measures, and to ensure the system is well instrumented and highly fault tolerant.
The successful candidate will possess an outstanding record of professional experience and will thrive in an environment that demands accountability. You will be a key member of a team that understands the big-picture perspective and instills a customer-first attitude.
Specific areas of responsibility include:
- Monitor and analyze our managed service and operational procedures and processes to improve and maintain quality standards
- Ensure availability, latency, scalability and efficiency by instilling engineering reliability into our deployed systems, with a focus on fault-tolerant approaches
- Establish, implement and maintain tools and processes that ensure system quality and reliability
- Establish and implement optimum release and deployment tools
- Implement metrics-driven processes to ensure service quality targets are met
- Engage, influence, and evangelize SRE practices with development, operational and product groups to align technology service/solution delivery
Required Qualifications, Skills, Experience:
- Degree in Computer Science or related technical field
- Minimum of 5 years of experience supporting, developing and deploying large-scale software systems
- Outstanding knowledge of release engineering/management and performance optimization
- Prior successful experience as a systems performance or systems reliability engineer
- Solid experience in the use of Linux/Unix
- Solid experience in coding / scripting languages (e.g., C++, PHP, Python, Perl)
- Solid experience in the use of fault-tolerant approaches in large-scale distributed environments and high-performance systems
- Demonstrated experience working in large, complex systems environments
- Deep understanding of internet and networking protocols
- Analytical mind with excellent problem-solving skills
- Excellent written and verbal communication skills
- Excellent time management, decision-making, presentation, and organizational skills
Desired Qualifications:
- Experience in system and server administration and large system deployments
- Experience developing and maintaining CI/CD pipelines
- Wide knowledge of networking, security, database and cloud systems
- Configuration/container management (Kubernetes, Chef, Puppet, Mesos)
- Monitoring tools, e.g., Zabbix, Nagios
- Knowledge of patch management and intrusion detection/prevention systems
- Cloud computing and cloud technologies (AWS, OpenStack)
- Experience with caching and CDN (content delivery network) technologies (Netflix, Amazon, Google, Limelight, Akamai, Fastly)
