This is the third part of the series on cloud computing reliability. From the customer's point of view, cloud services should just work. However, as discussed earlier in this series, service interruptions are inevitable. The question is not whether one will happen, but when.
No matter how carefully an online service is designed and built, unexpected incidents will occur. What sets service providers apart is how well they anticipate these situations and how quickly they recover from them, so that the customer experience is protected.
Guiding design principles
Three design principles guide cloud services:
- Data integrity
- Fault tolerance
- Fast recovery
These are the three attributes that customers expect, at a minimum, from the services they use. Data integrity means preserving the fidelity of the information that customers entrust to the service.
Fault tolerance is the ability of a service to detect failures and automatically take corrective action so that the service is not interrupted. Fast recovery is the ability to restore service quickly and completely when an unanticipated failure occurs.
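As a rough illustration of the fault tolerance principle (detect a failure, then correct it automatically), here is a minimal Python sketch. The names `call_with_failover`, `primary`, and `secondary` are hypothetical and do not come from any particular platform. The sketch assumes that a failure shows up as an exception and that the corrective action is simply to try the next replica.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none succeeded."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            # Detection: an exception is treated as a failed replica.
            return replica()
        except Exception as exc:
            # Corrective action: record the error and move to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas, used only for illustration.
def primary() -> str:
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    return "served by secondary"


result = call_with_failover([primary, secondary])
print(result)
```

The caller never sees the failure of the primary replica, which is the behavior the principle describes. A real service would also need health checks, timeouts, and limits on retries, none of which are shown here.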
As service providers, we need to identify as many potential failures as possible, as early as possible, and account for them during the service design phase. This kind of careful planning helps us decide exactly how the service should react when unexpected challenges occur.
The service must be able to recover from these failures with minimal interruption. Although we cannot predict every failure point or every failure mode, foresight, business continuity planning, and a great deal of practice allow us to put procedures in place for handling emergencies.
By its nature, cloud computing is a complex system that relies on shared infrastructure and loosely coupled dependencies, many of which are outside the provider's direct control.
Traditionally, many companies maintained on-premises computing environments that gave them direct control over their applications, infrastructure, and related services. However, as the use of cloud computing continues to grow, many companies have chosen to give up some of that control in order to reduce costs, take advantage of resource elasticity (for example, compute, storage, and network resources), improve business agility, and use their IT resources more effectively.
Understand the role of the team
From the perspective of an engineering team, designing and building a service (as opposed to a boxed product or a solution deployed within the enterprise) means taking on a broader scope of responsibility. When designing a solution deployed within the enterprise, the engineering team only needs to design, build, and test the product, package it, and release it together with recommendations describing the computing environment in which the software should run.
In contrast, after a service engineering team designs, builds, and tests a service, it must also deploy and monitor it to ensure that it keeps running. When an incident occurs, the team must make sure it is resolved as quickly as possible. And the team often has less control over the computing environment in which the service runs.
Use failure mode and effects analysis
Many service teams use fault modeling (FMA) and root cause analysis (RCA) to help them improve service reliability and prevent failures. In my opinion, these are necessary but not sufficient. In addition, the design team should use failure mode and effects analysis (FMEA) to help ensure more effective results.
FMA is a repeatable design process that aims to identify and mitigate faults in the service design. RCA consists of identifying the factors (their nature, magnitude, location, and timing) that led to a harmful outcome.
The main benefit of a holistic, end-to-end method such as FMEA is a comprehensive map of failure points and failure modes, which yields a prioritized list of engineering investments for mitigating known failures.
FMEA applies systematic techniques developed by reliability engineers to study the problems that can arise when (complex) systems fail. Each potential problem is studied to understand the effect of the failure by assessing its severity, frequency of occurrence, and detectability, so that the required engineering investments can be prioritized according to risk.
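One common way to turn these three assessments into a priority list is the risk priority number (RPN), the product of the severity, occurrence, and detection scores. The article does not name a scoring scheme, so treat the following Python sketch as an assumption: the 1 to 10 scales, the `FailureMode` class, and the example failure modes and scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest risk."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical failure modes with made-up scores, for illustration only.
modes = [
    FailureMode("storage node loss", severity=8, occurrence=3, detection=2),
    FailureMode("expired TLS certificate", severity=6, occurrence=2, detection=7),
    FailureMode("DNS misconfiguration", severity=9, occurrence=2, detection=5),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

Note that a high detection score here means the failure is hard to detect, so a failure that is moderately severe but nearly invisible can outrank one that is severe but obvious.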
- Preparation stage: In this step, it is important to understand the system as a whole and to produce a complete logic diagram of it, including its components, data sources, and data flows. Completing the diagram from a template improves the overall analysis, because the template suggests possible points of failure that the design team can use as prompts.
- Discover the interactions between components: Everything is in scope during this step. Starting from the logic diagram produced in the preparation stage, determine which components are prone to failure. Understand all of the interactions (connectors) between components and how each component functions within the complete system.
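The logic diagram and its connectors can be represented as a small dependency graph, which makes it easy to ask which components are affected when one of them fails. The following Python sketch is only an illustration: the component names and the `affected_by` helper are hypothetical, and each connector is assumed to be a one-way "caller depends on dependency" relationship.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Connector = Tuple[str, str]  # (caller, dependency)

# Hypothetical logic diagram, for illustration only.
connectors = [
    ("web frontend", "auth service"),
    ("web frontend", "order API"),
    ("order API", "database"),
    ("auth service", "database"),
]


def build_dependents(edges: Iterable[Connector]) -> Dict[str, Set[str]]:
    """Map each component to the components that call it directly."""
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for caller, dependency in edges:
        dependents[dependency].add(caller)
    return dependents


def affected_by(component: str, edges: Iterable[Connector]) -> Set[str]:
    """Return every component that depends on `component`, directly or indirectly."""
    dependents = build_dependents(edges)
    seen: Set[str] = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for caller in dependents[current]:
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return seen


impact = affected_by("database", connectors)
print(sorted(impact))
```

In this made-up diagram, a database failure reaches every other component, which marks the database as a failure point that deserves close attention in the analysis.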