Was building the Shandong Mobile BOSS emergency system a waste?
Building an IT emergency (disaster recovery) system can seem like a thankless task. A great deal of money is spent to guard against a "just in case" event. If that event never happens, much of the investment looks wasted, yet the precaution still has to be taken. The worst outcome is that, despite all the spending, the emergency system fails to work when the "just in case" event really does occur.


After years of construction, China Mobile Shandong Corporation (hereinafter "Shandong Mobile") has established a relatively complete emergency support system for its BOSS system. In December 2009, Shandong Mobile's BOSS emergency system construction project won the China Mobile Group's "2009 Best Emergency Guarantee Efficiency Award". In January 2010, China Billing Network (Telelink) announced the results of its "2009 China Telecom Industry Operation Support & IT System Annual Selection", and the cloud computing application project that Shandong Mobile implemented while building the BOSS emergency system won the "Annual Management Innovation Project Award". Here I share some of the experience gained in building the emergency system, for the reference of IT colleagues.
"Three more and one smaller" emergency system
Shandong Mobile's approach to building the BOSS emergency system can be summarized as "three more and one smaller": multi-level redundancy, with data first; multi-level plans, upgraded gradually; multi-level linkage, with key services guaranteed; and minimal business impact.
Multi-level redundancy, data first. The first level is redundant configuration of hosts and disk arrays. Each host's power supplies, memory, disks, fiber cards, and network cards are all redundant, and the redundant network cards and fiber cards must be connected to different switches. The second level is redundancy of the computer room, air conditioning, and power supply. Power must come from independent dual UPS units over dual power feeds. The third level is redundancy of physical transmission routes: redundant fiber links must follow different physical routes. The fourth level is redundancy of the data centers. The three centers back each other up for disaster recovery, and if any one of them fails, the other two can take over its services.
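The rule that redundant network cards and fiber cards must land on different switches lends itself to an automated check. The following is a minimal sketch only: the inventory layout, the host name, and the switch names are hypothetical and are not taken from Shandong Mobile's tooling.

```python
def shared_switch_violations(hosts):
    """Return (host, card_type) pairs whose redundant cards are not
    spread across distinct switches.

    `hosts` is a hypothetical inventory of the form
    {host: {card_type: [switch_name_per_card, ...]}}.
    """
    violations = []
    for host, cards in sorted(hosts.items()):
        for card_type, switches in sorted(cards.items()):
            not_redundant = len(switches) < 2
            shares_switch = len(set(switches)) < len(switches)
            if not_redundant or shares_switch:
                violations.append((host, card_type))
    return violations


# Hypothetical inventory: the fiber cards of boss-db-1 share one switch.
inventory = {
    "boss-db-1": {"nic": ["sw1", "sw2"], "fc": ["fsw1", "fsw1"]},
    "boss-db-2": {"nic": ["sw1", "sw2"], "fc": ["fsw1", "fsw2"]},
}
```

A check of this kind could run before each drill, so that a wiring mistake is found on paper rather than when a switch is powered off.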
Multi-level plans, gradual upgrade. The first level is the business emergency plan: when a system fails, a local operating mode for the business is activated so that the business is affected little or not at all. This is achieved through business deployment, redundant business processing modules, and similar methods. The second level is the local takeover plan: when a single point fails or equipment is taken down for planned maintenance, each BOSS business system performs a local HA (high availability) takeover, and key systems such as business, billing, roaming, and interfaces are taken over automatically. The third level is the disaster recovery switching plan. Shandong Mobile has built a three-center disaster recovery system with load sharing and mutual backup. The database layer uses a one-to-two architecture and combines storage-layer synchronous replication with scheduled snapshot replication to guard against both physical and logical data errors. The fourth level is the backup and recovery system, which provides snapshots and tape backups of production data. Daily full and incremental backups are written directly to the remote centers over a remote SAN, for emergency recovery from serious incidents. From the first level to the fourth, the damage caused by the failure grows, the handling time grows, and the impact on the business grows with them. Depending on the nature and impact of the incident, the plan with the smallest business impact is tried first, and the response is upgraded step by step as needed, with the aim of keeping the impact on the business to a minimum.
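The "lowest-impact plan first, then escalate" rule can be sketched as a simple selection loop. This is an illustrative sketch, not Shandong Mobile's actual decision logic: the `Incident` fields and the applicability conditions are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class PlanLevel(IntEnum):
    """The four plan levels, ordered by increasing business impact."""
    BUSINESS_EMERGENCY = 1
    LOCAL_TAKEOVER = 2
    DR_SWITCH = 3
    BACKUP_RESTORE = 4


@dataclass
class Incident:
    # Hypothetical flags describing what is still usable after the failure.
    business_workaround_ok: bool   # business-level emergency mode is enough
    local_standby_healthy: bool    # local HA peer can take over
    remote_center_healthy: bool    # another data center can take over


def is_applicable(level, incident):
    """Assumed applicability rules; backup restore is the last resort."""
    if level is PlanLevel.BUSINESS_EMERGENCY:
        return incident.business_workaround_ok
    if level is PlanLevel.LOCAL_TAKEOVER:
        return incident.local_standby_healthy
    if level is PlanLevel.DR_SWITCH:
        return incident.remote_center_healthy
    return True


def choose_plan(incident):
    """Return the lowest-impact plan level that can handle the incident."""
    for level in PlanLevel:  # iterates from level 1 up to level 4
        if is_applicable(level, incident):
            return level
    raise RuntimeError("unreachable: level 4 always applies")
```

For example, an incident in which the business workaround is not enough but the local HA peer is healthy resolves to the level 2 local takeover, without touching the disaster recovery switch.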
Multi-level linkage, key services guaranteed. Only a sound emergency management system can ensure that the emergency organization operates effectively when an emergency occurs. After several years of practice, Shandong Mobile has established an emergency monitoring and early warning mechanism, an information communication mechanism, an emergency decision-making and coordination mechanism, and a tiered responsibility and response mechanism. It has straightened out the relationships between business and IT departments in responding to emergencies, between emergency command and the implementing departments, and between the integrated emergency departments and the supporting vendors. It has also established a unified emergency management mechanism based on information integration, management alignment, resource sharing, and collaboration, so as to mobilize all parties involved in emergency management. On the one hand, internal processes were reviewed and channels were opened between the provincial and city levels, the accounting centers, and other departments. On the other hand, vendor support processes were reviewed and cooperative relationships were established with supporting vendors. The result is a "province-city-partner" multi-level linkage emergency support system.
Minimize business impact. The cost of emergency handling differs greatly between services and systems, and so do the emergency methods that follow from it. The goal is the "fastest emergency response" at the "minimum cost". For key services such as account opening, payment, and business changes, Shandong Mobile independently developed a separate small system for critical business support. It is independent of the BOSS system but has an automatic interface to it, so that it can be activated in an emergency to handle the most critical business. During system anomalies, version releases, local takeovers, and disaster recovery switches, this critical business assurance subsystem can be enabled, giving a BOSS system in which critical business is not interrupted.
Business-driven independent innovation
The core business of Shandong Mobile's BOSS system is built on EMC equipment and software, including the EMC Symmetrix DMX storage array, SRDF, snapshots, and so on. EMC also has extensive experience in business continuity assurance. By established practice, operators generally hand the lead on system architecture design and construction to manufacturers or system integrators. In building the BOSS emergency system, Shandong Mobile insisted on leading the work itself while interacting with manufacturers and making full use of vendor technologies, products, services, and experience. This independent innovation has achieved good results.
Shandong Mobile's "multi-center business disaster recovery" model is an independent innovation in emergency system architecture. The specific approach is to combine the disaster recovery computer rooms with the production computer rooms. For example, there are three computer rooms, A, B, and C. Each has a complete BOSS system and carries the business of some of the cities. Computer room C is the largest and provides disaster recovery for computer rooms A and B at the same time. If any one computer room has a serious problem, the other two can take over all of its services.
The key to "multi-center business disaster recovery" is the vertical split of business processing, proposed on the basis of years of system maintenance experience. Shandong Mobile chose the vertical split from the standpoint of its business applications, with the aim of keeping the impact of failures on customers to a minimum. It is also a prerequisite for running multiple centers well. Under this scheme, Shandong Mobile distributes business processing across three data centers, and in normal operation each center runs a complete BOSS system that carries the business of one district. When a system fails, only that district is affected, and its business can be switched to the other districts' systems for emergency handling. A horizontal split, in contrast, has all users in the province share one system of each type, such as a province-wide business system and a province-wide accounting system, so that when one system fails the whole province is affected.
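The difference in blast radius, and the takeover by the two surviving centers, can be illustrated with a small sketch. The district names, the assignment table, and the "least-loaded survivor" rule are assumptions for the example. The article does not describe how Shandong Mobile actually distributes a failed center's districts.

```python
CENTERS = ("A", "B", "C")


def affected_districts(assignment, failed_center):
    """With a vertical split, only districts hosted on the failed center
    are affected. With a horizontal split it would be every district."""
    return sorted(d for d, c in assignment.items() if c == failed_center)


def fail_over(assignment, failed_center, centers=CENTERS):
    """Move the failed center's districts to the surviving centers.

    Assumed rule: each district goes to the survivor that currently
    hosts the fewest districts (ties broken by center name).
    """
    survivors = [c for c in centers if c != failed_center]
    if len(survivors) == len(centers):
        raise ValueError(f"unknown center: {failed_center}")
    load = {c: sum(1 for v in assignment.values() if v == c) for c in survivors}
    new_assignment = dict(assignment)
    for district in affected_districts(assignment, failed_center):
        target = min(survivors, key=lambda c: (load[c], c))
        new_assignment[district] = target
        load[target] += 1
    return new_assignment


# Hypothetical normal-state assignment of districts to computer rooms.
normal = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "C", "d6": "C"}
```

A production version would weight the choice by spare capacity rather than district count, since the article notes that computer room C is the largest.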
This also helps to improve the availability of the emergency platform. When business processing is in the "normal state", the business load is balanced, the processing pressure on the emergency system is low, and business services respond quickly. When the business system of one data center is in an "emergency state", only that data center's business resources need to be switched, so the emergency response can be made quickly.
When working out the details of the split, Shandong Mobile made full use of vendor resources and drew on EMC's business load analysis tools to divide business processing sensibly.
Technically, "multi-center business disaster recovery" uses virtualization to place production resources and disaster recovery resources in a unified resource pool, and dynamically allocates disaster recovery resources to production applications during holidays or business emergencies. This approach coincides with the currently popular topic of cloud computing, and Shandong Mobile's practice is a model case of cloud computing being put into production successfully.
"Dynamic resource management" is an innovation in emergency management methods. Resources are allocated according to business volume and actual need, and a temporary, concentrated pool of resources is provided for business peaks, business emergencies, and major events. This can raise system processing capacity immediately and so supports the effectiveness of the emergency system. During a business peak, or in situations such as inefficient application software or an HA takeover, resources can be adjusted dynamically to keep the system running stably. For example, on December 1, 2008, a server CPU failed and caused system downtime, and database node B of the first business district switched over to node A. Because business volume is high at the beginning of the month, node A came under very heavy load. Disaster recovery resources from other partitions on the same machine were dynamically reassigned to node A, which kept the front-end system running stably. Similarly, when accounting runs or production reports are processed at night at the end of the month, resources from other partitions can be moved to the accounting system and returned to their original systems once the task is complete.
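The lend-and-return cycle described above can be sketched as a small pool object. This is a simplified illustration: the class, the CPU counts, and the partition names are hypothetical, and a real virtualized platform would move resources through its own management interface.

```python
class ResourcePool:
    """Toy model of a shared pool with an idle disaster recovery reserve."""

    def __init__(self, total_cpus, dr_reserve_percent=10):
        self.total_cpus = total_cpus
        # CPUs held back for disaster recovery during normal operation.
        self.free = total_cpus * dr_reserve_percent // 100
        self.loans = {}  # partition name -> CPUs currently lent to it

    def lend(self, partition, cpus):
        """Lend up to `cpus` from the reserve; return how many were granted."""
        granted = min(cpus, self.free)
        self.free -= granted
        self.loans[partition] = self.loans.get(partition, 0) + granted
        return granted

    def reclaim(self, partition):
        """Return everything lent to `partition` to the reserve."""
        returned = self.loans.pop(partition, 0)
        self.free += returned
        return returned


# Hypothetical scenario: node A absorbs node B's load after an HA takeover.
pool = ResourcePool(total_cpus=100)           # 10 CPUs held in reserve
granted = pool.lend("district1-node-a", 6)    # boost the overloaded node
```

Once the failed node is repaired, calling `pool.reclaim("district1-node-a")` puts the borrowed CPUs back into the reserve, mirroring the "return to the original system" step in the text.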
The scale effect of the resource pool saves a great deal of resources. During normal operation, 10% of the resources are allocated to disaster recovery, whereas an active/standby design would need about 50%. By way of comparison, the national standard is that each additional user adds about 20 yuan on average to the investment in building the business support system, while Shandong Mobile needs only about 10 yuan.
Opening a "green emergency channel" for key businesses is another independent innovation of Shandong Mobile. The primary task of the BOSS system is to serve customers well: to improve customer satisfaction, to make payment and service start-up more timely, and to minimize business impact. Shandong Mobile has opened green emergency channels for 8 types of services in 6 scenarios. For example, when a customer has paid and the start-up delay reaches 30 seconds, the green channel opens automatically at the service level. The user's service is started first, and the standard process is completed once the system recovers.
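The payment example amounts to "serve first, settle later" once a delay threshold is crossed. The sketch below is an assumption-laden illustration: only the 30-second threshold comes from the article, while the class, method names, and return values are hypothetical.

```python
class GreenChannel:
    """Start service immediately when the normal path is too slow, and
    queue the request so the standard process can run after recovery."""

    def __init__(self, threshold_s=30):
        self.threshold_s = threshold_s
        self.pending = []  # user ids awaiting the standard process

    def handle_payment(self, user_id, startup_delay_s):
        """Return which path served the user."""
        if startup_delay_s >= self.threshold_s:
            # Green channel: start the user's service now, settle later.
            self.pending.append(user_id)
            return "green_channel"
        return "standard"

    def reconcile(self, standard_process):
        """After recovery, run the standard process for queued users."""
        results = [standard_process(user_id) for user_id in self.pending]
        self.pending.clear()
        return results


channel = GreenChannel()
fast = channel.handle_payment("user-001", startup_delay_s=3)
slow = channel.handle_payment("user-002", startup_delay_s=30)
```

Keeping the pending list separate from the decision makes the later reconciliation explicit, which matters because the green channel deliberately skips steps that still have to be completed.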
At present, Shandong Mobile has applied for 9 patents for the BOSS emergency system.
"Kung Fu is outside the poem"
The leadership of Shandong Mobile attaches great importance to the business support system, and this is the driving force and guarantee behind the strengthening of the BOSS emergency system. Company leaders require the BOSS system to use the best equipment and require backup equipment at every key link. Increasing investment in the system to ensure reliable and stable operation is also a strong guarantee of the company's "customer-centric" service concept.
The last point to emphasize is that building an emergency system should not be defined narrowly. "Kung Fu is outside the poem": the real effort lies outside the task itself, just as one must stay fit and healthy in order to fall ill less often. This is in the same spirit as the traditional Chinese medicine idea of treating disease before it arises. Building a sound system that has no problems, or fewer problems, so that the emergency system rarely has to be activated, is the foundation of the emergency system. For example, Shandong Mobile's bill query system and billing system are separate. This reduces the load on the billing system, lets it run lightly, and ensures the timeliness of payment and start-up. This too counts as part of the emergency system. Shandong Mobile has also deployed EMC enterprise flash drives in the BOSS system to speed up reads of customer data and so raise the processing capacity of the whole system, which likewise counts as part of the emergency system.
In addition, drills of the emergency system are very important. Shandong Mobile conducts a drill every quarter. It has defined 6 types of emergency scenarios and runs drills on those that can be exercised in practice. Every drill yields significant lessons. Drills are carried out at several levels. Small-scale drills include turning off a switch, checking whether two network cards are connected to the same switch, turning off a UPS, or shutting down an HA node. Large-scale drills include stopping a given service across an entire computer room. The system is optimized continuously through these drills. One finding from the drills was that guiding the process with disaster recovery navigation software improves both emergency response speed and handling accuracy.
Since the Shandong Mobile emergency system was formally established, the BOSS system's out-of-service time indicator has fallen month by month, the customer complaint rate has dropped significantly, and the BOSS system's customer service satisfaction has risen significantly. Business halls rarely encounter failures, and the contribution of the emergency system is obvious. According to the internal evaluation by the business departments, the group assessment, and external customer satisfaction surveys, satisfaction with Shandong Mobile's BOSS system ranks among the highest in the country. The system's unplanned daily out-of-service time has been shortened by a factor of several tens. Before the emergency system was built, annual out-of-service time was several hundred minutes. It is now under 100 minutes a year, and users hardly notice any interruption. Support complaints per 10,000 users have dropped from 0.4 to around 0.05. Start-up after payment is also much faster, down from several minutes originally to an average on the order of a few seconds.
It is hoped that the experience of Shandong Mobile can inspire IT colleagues.