AWS US-East-1 had issues on Tuesday. And it brought a lot of the internet down with it.
Usually when this sort of thing occurs, people scream that the solution is to “use multiple regions!” That’s a great idea, but it’s hard to implement.
It also ignores the fact that many AWS users have to use US-East-1, either directly (because it’s Amazon’s biggest region) or indirectly (because that's where some AWS core global services run).
At the same time, using a different cloud service provider won't solve the problem; everyone can make mistakes. The only way to survive is to have a hybrid system.
Reliability 101 tells us that “two is one” and “one is none.”
Most people apply this for simple high availability: Have at least two servers, and make sure one is always up.
But what if the “blast radius” is bigger? Use multiple availability zones, and make sure you have enough spare capacity to survive an availability-zone failure. Then pray that you don't depend on the services affected by the outage, and pray that your CFO won't fire you when the new bill arrives on her desk.
If you're big enough, you want to go fancier and do multiple regions. Then you realize that most systems are regional, and it's your problem to figure out how to make things work in multiple regions.
Go find an architect with experience in this, hope your services are amenable to running in multiple regions, and be ready to pay the bill. And the paychecks, because you'll need a big team to manage all this and run the requisite fire drill exercises. After all, a system that isn't tested isn't reliable.
Maybe you're even bigger and can afford to go multi-cloud to make sure everything works all the time.
That all works great in theory. But then you get your team to figure this out, and they check the hundreds of CSP pages on the topic, all giving advice that boils down to “it depends.” And your team is stumped. You decide to hire someone who is an expert in this because, as AWS’s CTO would say, “There's no compression algorithm for experience.”
Now you have a system working in multiple regions or clouds, as well as a well-trained team frequently running failure simulations and prioritizing fixes to ensure you're always up. Then you realize that you can't ship new features, because the system you pushed the team to quickly make robust isn't actually architected for full reliability due to “historical reasons,” and most of the work you're doing is non-differentiated.
Congrats! You're fired after spending years and tens of millions.
Or you can find someone to solve the problem for you. Someone who has solved this already, and who has run the fire drills and checks to make sure things work.
Someone, moreover, who has built isolation between the control plane and the data plane (surprised by these terms? Yeah, you're not the only one) so that failures stay contained, and who gives you a single pane of glass while letting workloads run wherever they need to run to stay healthy. Someone who is a subject matter expert.
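To make that control/data-plane split concrete, here is a toy Python sketch (all class and field names are hypothetical, not Verta's implementation): the data plane serves requests from its last-known-good configuration, so a control-plane outage degrades configuration updates but not serving.

```python
# Hypothetical sketch: a data plane that keeps serving from cached
# configuration even when the control plane is unreachable.

class ControlPlaneDown(Exception):
    """Raised when the control plane cannot be reached."""

class ControlPlane:
    def __init__(self):
        self.healthy = True
        self._config = {"model_version": "v1"}

    def fetch_config(self):
        if not self.healthy:
            raise ControlPlaneDown("control plane unreachable")
        return dict(self._config)

class DataPlane:
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self._cached_config = control_plane.fetch_config()

    def sync(self):
        # Refresh config when possible; fall back to last-known-good otherwise.
        try:
            self._cached_config = self.control_plane.fetch_config()
        except ControlPlaneDown:
            pass  # serving continues on the cached config

    def serve(self, request):
        # Request handling depends only on local state, not on the control plane.
        return f"served {request} with {self._cached_config['model_version']}"

cp = ControlPlane()
dp = DataPlane(cp)
cp.healthy = False        # simulate a control-plane outage
dp.sync()                 # the sync fails quietly...
print(dp.serve("req-1"))  # ...but serving is unaffected: served req-1 with v1
```

The point of the separation is exactly this asymmetry: losing the control plane should cost you the ability to change the system, not the ability to run it.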
How does this relate to what my company, Verta, does?
Well, the challenges and lessons learned in achieving reliability and availability apply to our core competency - our subject matter expertise, if you will - of model operations, and they highlight the crucial need for a hybrid, multi-cloud approach.
Relying on a single cloud service provider or region can pose significant risks to business operations and innovation for organizations that rely on AI and machine learning models.
Verta’s model operations platform offers a hybrid, multi-CSP, multi-region solution, so that organizations can navigate the complexities of multi-cloud environments while ensuring redundancy and resilience.
If you're looking to maximize reliability and unlock the full potential of your machine learning initiatives, let’s talk. We have the expertise and tools to help you deploy and manage models effectively, minimize downtime and drive innovation forward.
Get in touch, and let’s have a chat about how we can elevate your model operations to new heights.
In the meantime, here is a summary of some lessons learned (again and again) about best practices to ensure system reliability in model operations:
System reliability is of paramount importance in modern operations
In model operations, system reliability ensures that machine learning models are available and perform predictably in production environments. Unreliable systems can lead to downtime, data inconsistencies and degraded model performance, resulting in costly business disruptions and financial impact.
Adopting a hybrid system approach with multiple CSPs or regions enhances resilience and reduces the risk of widespread failures
By diversifying infrastructure across multiple CSPs or regions, model operations can mitigate the impact of service disruptions. This approach provides redundancy, allowing models to remain accessible even if one CSP or region experiences issues. It also helps maintain continuous availability and minimizes the risk of single-point failures.
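As a minimal illustration of that redundancy (the provider names and the call helper below are made up, not a real CSP client), a serving client can try each provider in priority order and fall back on failure:

```python
# Hypothetical provider-level failover: try providers in priority
# order and return the first successful response.

class ProviderError(Exception):
    pass

def call_provider(name, healthy):
    # Stand-in for a real inference call against one CSP's endpoint.
    if not healthy:
        raise ProviderError(f"{name} unavailable")
    return f"response from {name}"

def predict_with_failover(providers):
    """providers: list of (name, healthy) pairs in priority order."""
    errors = []
    for name, healthy in providers:
        try:
            return call_provider(name, healthy)
        except ProviderError as exc:
            errors.append(str(exc))
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# The primary CSP is down; the request transparently falls back.
result = predict_with_failover([("aws-us-east-1", False), ("gcp-us-central1", True)])
print(result)  # response from gcp-us-central1
```

The failure mode to watch for is the last line of `predict_with_failover`: a total outage must still surface a clear error rather than hang.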
Implementing multiple regions or clouds requires careful architectural planning, expert knowledge and significant resource investment
Model operations must consider architectural design principles when implementing multiple regions or clouds. This includes understanding data replication, synchronization and load balancing across regions. Expert knowledge is required to ensure proper implementation, configuration and management of the infrastructure. Adequate resource allocation and budgeting are crucial to support the increased complexity and associated costs.
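A deliberately simplified sketch of the replication and routing concerns above (region names are examples, and the in-memory dicts stand in for real replicated stores; production systems typically replicate asynchronously and reconcile conflicts):

```python
# Toy cross-region replication and region-aware reads.

stores = {"us-east-1": {}, "eu-west-1": {}}
healthy = {"us-east-1": True, "eu-west-1": True}

def write(key, value):
    # Synchronous fan-out keeps the example simple; real systems
    # usually replicate asynchronously for latency reasons.
    for region_store in stores.values():
        region_store[key] = value

def read(key, preferred_order):
    # Route to the first healthy region; the rest act as standbys.
    for region in preferred_order:
        if healthy[region]:
            return stores[region][key]
    raise RuntimeError("no healthy region")

write("model:churn", "v3")
healthy["us-east-1"] = False  # simulate a regional outage
print(read("model:churn", ["us-east-1", "eu-west-1"]))  # v3, served from eu-west-1
```

Even this toy version shows why expert knowledge matters: every design choice here (sync vs. async replication, routing order, failure detection) trades cost against consistency and availability.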
Skilled architects and experienced teams are crucial for designing, implementing and regularly testing complex systems to ensure their reliability
Model operations rely heavily on skilled architects and experienced teams to design and implement reliable systems. They are responsible for developing robust architectures, defining deployment strategies and integrating models seamlessly. Regular testing, including failure simulations and performance evaluations, helps identify potential issues, optimize configurations and ensure that the models operate reliably under various conditions.
Conducting regular testing, simulations, and fire drills is essential to identify weaknesses, vulnerabilities and failure points in the system
Testing is critical in model operations to detect and address any weaknesses or vulnerabilities in the system. By simulating failure scenarios, teams can proactively identify potential points of failure, evaluate recovery mechanisms and fine-tune system resilience. Regular fire drills ensure that response plans are effective and provide opportunities for optimization and continuous improvement.
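One way to picture such a fire drill (the classes below are illustrative, not a real chaos-engineering tool): deliberately fail a replica, then assert that the service still answers before restoring it.

```python
# A toy fire drill: kill one replica at random and verify the
# service keeps responding.

import random

class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(self.name)
        return f"{self.name}: ok"

class Service:
    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        # Fall through to the next replica on connection failure.
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("total outage")

def fire_drill(service, rng):
    # Fail one replica on purpose, check the service, then restore.
    victim = rng.choice(service.replicas)
    victim.up = False
    response = service.handle("drill-request")
    victim.up = True
    return response

svc = Service([Replica("a"), Replica("b")])
print(fire_drill(svc, random.Random(0)))
```

Running drills like this on a schedule, against production-like environments, is what turns "we think we can survive a failure" into evidence.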
Striking a balance between system reliability and innovation is vital to avoid stagnation and foster continuous improvement
Model operations should aim for a balance between ensuring system reliability and enabling innovation. While reliability is crucial, it should not hinder the development and deployment of new features and models. A balanced approach ensures that the system remains stable and performant while embracing advancements in machine learning and adapting to changing business needs.
Leveraging subject matter experts can provide valuable guidance and tools for building and managing highly available and fault-tolerant systems
Subject matter experts who specialize in model operations have in-depth knowledge and experience in designing, implementing and managing reliable systems for machine learning models. They can provide guidance on best practices, recommend appropriate tools and technologies, and offer insights to address challenges specific to model operations, ultimately improving system reliability and performance.
These best practices help organizations ensure that their machine learning models are deployed in a reliable and resilient manner, minimizing disruptions and maximizing their value to the business.