This automated remediation tool creates online versions of runbooks and can record debug sessions to capture best practices.
Incident automation company Shoreline.io has a new tool for site reliability engineers: Notebooks. This online tool captures debug data in real time and records fleetwide repair commands. Notebooks can also be tied to alarms, making it easier to resolve incidents.
The Notebooks can record repair sessions along with the data used by the on-call team. These recordings can be used for training and for post-mortem analyses of security and other incidents.
Anurag Gupta, founder and CEO of Shoreline, said in a press release that the new service combines documented best practices with real-time diagnostic data.
“Just as Jupyter Notebooks transformed data science, Shoreline Notebooks are transforming on-call operations,” he said. “Our Notebooks make it easier to onboard new team members and to safely empower everyone on-call.”
Data scientists use Jupyter Notebooks to create and share documents that contain live code, equations, visualizations and narrative text. This open source web application makes it easy to extract data with code and collaborate with other data scientists.
Runbooks serve a similar purpose for sysadmins and site reliability engineers, but these documents are often static files. These reference guides include procedures to start, stop and debug a system and can be physical books or electronic files. Shoreline’s Notebooks make these guides available on the web and more interactive.
Gupta is familiar with the challenges of keeping cloud deployments up and running: He was a vice president at AWS for almost eight years and ran the analytic and relational database services on the AWS Database team. He founded Shoreline.io to make managing a fleet of servers as easy as working with a single box and to build site reliability tools that make fixing a problem permanently as easy as fixing it the first time.
Pros and cons of automated remediation
Naveen Chhabra, a senior analyst for infrastructure and operations, said Shoreline offers a platform that helps remediate operational issues automatically. The company focuses on public cloud assets and services, in contrast to other vendors, which have served data centers.
Chhabra said that automated remediation tools can deliver significant value but sometimes fail to do so.
“Automated remediation can only be applied to known issues and known resolutions,” he said. “If any of these two variables are unknown, automated resolution will barely even move a step.”
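Chhabra's point can be illustrated with a minimal sketch. This is not Shoreline's product or API; the issue names, the `KNOWN_RESOLUTIONS` registry and the `remediate` function are all hypothetical, chosen only to show that automation acts when both the issue and its resolution are known and otherwise hands off to a person.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Outcome:
    issue: str
    automated: bool  # True only when a known resolution was applied
    detail: str


# Hypothetical registry: each known issue maps to a known resolution.
# The lambdas only describe an action; a real tool would execute it.
KNOWN_RESOLUTIONS: Dict[str, Callable[[str], str]] = {
    "disk_full": lambda host: f"rotated logs on {host}",
    "service_down": lambda host: f"restarted service on {host}",
}


def remediate(issue: str, host: str) -> Outcome:
    """Apply a known resolution, or escalate when none is registered."""
    action = KNOWN_RESOLUTIONS.get(issue)
    if action is None:
        # Unknown issue or unknown fix: automation stops, a human takes over.
        return Outcome(issue, False, f"no known resolution; paging on-call for {host}")
    return Outcome(issue, True, action(host))


if __name__ == "__main__":
    print(remediate("disk_full", "web-1"))
    print(remediate("kernel_panic", "web-1"))
```

In this sketch, an unregistered issue such as `kernel_panic` never reaches an automated fix, which is the limitation Chhabra describes.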
Tech silos still exist, Chhabra said, and they are a problem for developing solutions that require significant organizational collaboration across many teams, including infrastructure, applications, security, operations and others.
Ongoing maintenance and the complexity of most tech stacks are further challenges for automated remediation tools.
“Today’s IT is so full of heterogeneous technology stacks that it is virtually impossible for any one remediation solution to support those all,” he said.
Chhabra said that automated remediation tools offer immense potential if tech leaders can identify the problem surface and foster collaboration among teams to address these issues proactively.