IT in Manufacturing


Enhanced rescue procedures in industrial applications

April 2014 IT in Manufacturing

Computing platforms inevitably suffer from corruptions that creep into the operating system, or elsewhere into the software platform. It is possible to reinstall individual applications or the operating system, but such remedies are slow and expensive both in terms of money and downtime. For this reason, system administrators typically rely upon system recovery applications, several of which are currently available on the market.

Unfortunately, most of these system recovery solutions are inadequate in one or several ways. For automated industrial systems the clearest point of failure is their lack of automation: currently, none of the most widely known software recovery systems offers any means of configuring automated, smart reactions to system events. Achieving a reliably automated system recovery solution, however, requires BIOS level engineering and careful planning for the customer’s end needs. In the white paper that follows, we describe the key scenarios that a suitably smart, automated software platform recovery tool must address, and the technology needed to produce it.

Expertise, efficiency and speed

Automation advances in industrial processes increasingly involve using industrial PCs for the control and monitoring of every sort of machinery and process. Yet as industrial PCs penetrate ever further into the automation stack, the problem of data loss and system corruption becomes a critical consideration, both for the security issues it raises and for overall system availability. As an operating system ages, small corruptions work their way into the root system’s software and it slows down. Eventually – and sometimes quickly and suddenly – small corruptions become big corruptions, bringing down the entire OS.

The most vexing problems are, of course, those which remote troubleshooting and reconfiguration cannot fix: if a systems administrator cannot reliably and quickly bring a hung system back online, then the most effective solution is to replace the system (either hardware or software) as quickly as possible. Unfortunately, the expertise of the staff who maintain and operate these automated facilities – whether at electrical substations or wind farms, in hazardous environments such as oil and gas fields, or in transport environments such as trains and ships – typically does not include computer administration, troubleshooting, and repair.

The problem is compounded by the fact that, in recent years, computers used in automated processes are increasingly installed as embedded devices with no ready means of local manipulation or input. Keyboards, mice and even monitors are often absent, leaving many controller devices accessible only over networked connections, via the SCADA system or a remote control station. Even for people who are comfortable troubleshooting a desktop system, and can make their way around an office computer or an enterprise server, administering multiple embedded computers from a remote SCADA system is often a new and knotty challenge.

The industry is already knocking about

These problems are increasingly worrisome for administrators tasked with keeping remote systems available 24 hours a day, 7 days a week. Preventive maintenance has become a critical consideration both for technology at the edge and for industrial solutions built on established technologies. Examples are easy to find. One leading multinational integrator of wind systems has been looking for a fully automated, fail-safe means of rescuing a computing platform’s software system: when devices are mounted high on an industrial-grade wind turbine, sending maintenance and repair personnel up to reinstall an operating system becomes a costly distraction. An East Asian systems integrator for on-board train solutions required an automated network solution to re-flash the computing hardware for any platform along the route. A Chinese high-speed rail company required a means of restoring computer systems at controller stations where the devices lacked any means of input or local systems monitoring. Finally, a European manufacturer of railway vehicles required an automated solution that could be called immediately by system users on board the vehicle, with no need for formal training or consultation with remote system administrators.

In each of these cases, the requirements were beyond what most other rescue software provides, running the gamut from regular maintenance updates for preventive measures to rescues on crashed systems that are no longer able to boot up. These are industrial challenges, and they are too much for currently available rescue and rewrite solutions. None of the currently available solutions is a fully automated, standalone system capable of rescuing the entire platform from a permanent system crash. Instead, the market’s current offerings all suffer from a fatal design flaw: they operate in user-space, which means that if an operating system crashes from software corruption, remote human interaction is required to reset the system. While it is theoretically possible to configure a software solution that allows remote administration, doing so on a case-by-case basis requires a lot of detailed (and costly) coding and reliability testing. Even then, bugs are likely to creep in, and the result is a networked solution that still requires human management and oversight.

Giving the customer what the customer wants

In contrast, the most effective platform rescue is one that can both rewrite the operating system once it has slowed down from corruption and also rescue it from a full crash where the machine can no longer even boot up. For this, an automated mechanism integrated with the hardware platform at the BIOS level is required, something that can re-write the entire suite of installed software – operating system, all applications, and the full system configuration – at the block level, from a cached copy, and then reboot into normal operations. A solution of this sort would be capable of resetting the entire platform to its earliest configuration state, effectively returning the entire software system to a pristine operating condition.

Moreover, as the four examples cited above indicate, a wide variety of solutions are needed to serve the full industrial horizon. For wind farm solutions, for instance, the recovery system will need to be local, standalone, and capable of automatically initialising in response to pre-set conditions with no human input whatsoever. In the case of a smart grid solution, the system must be capable of all that, and also able to respond to commands sent to it from a control centre (either on-site or distantly remote). Finally, for the last of the train systems, it must also be possible for a user to initiate a rescue at the physical device itself, without any need for specialised computer knowledge or administrative expertise.

Four fundamental needs

This paper presents four automated re-write procedures which, taken together, will allow any automated re-write mechanism to maximise predictive maintenance efficacy and convenience while automatically resolving all possible system failures attributable to software corruption. These procedures may be broken down thus:

1. Automated re-writes at scheduled intervals.

2. Fully automated recoveries on system crashes or slowdowns.

3. Remotely initiated automated recoveries.

4. Manual recoveries initiated at the device’s physical location.
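
The four procedures listed above can be modelled as four distinct triggers for a single recovery routine. The following is only an illustrative sketch in Python, not the firmware logic itself: the names `RecoveryTrigger` and `select_trigger` are hypothetical, and the priority order between simultaneous triggers is an assumption that the paper does not specify.

```python
from enum import Enum
from typing import Optional


class RecoveryTrigger(Enum):
    """The four re-write procedures, named here for illustration only."""
    SCHEDULED = "automated re-write at a scheduled interval"
    BOOT_FAILURE = "automated recovery on crash or slowdown"
    REMOTE = "remotely initiated recovery"
    MANUAL = "manual recovery at the device"


def select_trigger(*, manual_key_present: bool, remote_requested: bool,
                   boot_timed_out: bool, schedule_due: bool) -> Optional[RecoveryTrigger]:
    """Pick the reason for a recovery, or None if no recovery is needed.

    Assumed priority: explicit human requests first, then failures,
    then routine maintenance.
    """
    if manual_key_present:
        return RecoveryTrigger.MANUAL
    if remote_requested:
        return RecoveryTrigger.REMOTE
    if boot_timed_out:
        return RecoveryTrigger.BOOT_FAILURE
    if schedule_due:
        return RecoveryTrigger.SCHEDULED
    return None
```

Whatever the trigger, the recovery that follows is the same block level re-write; only the initiating condition differs.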

BIOS-initiated block level recoveries

Simple file-level rewrites managed from within the OS are not enough. When software becomes so corrupted that the OS refuses to boot, automated recoveries can only be initiated before the operating system kicks in. Effective platform recovery automation must, therefore, start at the pre-boot stage and be integrated with both hardware and software. The device must support an alternate storage mechanism for use as a dedicated cache – from which the recovery image and operating environment are read – while the BIOS itself must be enhanced with a watchdog that measures boot times, registers when the platform is crashing or performing poorly, and then automatically calls up the recovery environment while monitoring the entire process for success or failure.

Perhaps most importantly, engineers should be able to simply and conveniently configure these automated system re-writes from within the OS, using only tags and a watchdog timer. After noting the precise time it takes for the system to boot up, administrators may open a dialogue to set a timeout. Whenever the platform’s boot process slows beyond the configured timeout, the BIOS will automatically call the recovery procedure, switching the system’s boot procedure over to an alternate recovery platform stored on a separate storage drive.
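
The timeout decision described above can be sketched as follows. This is a minimal model in Python of logic that would in practice live in the BIOS; `BootWatchdogConfig`, `should_trigger_recovery` and the `margin` safety factor are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootWatchdogConfig:
    """Hypothetical settings an administrator stores from within the OS."""
    measured_boot_seconds: float  # boot time observed on a healthy system
    margin: float = 1.5           # assumed safety factor, not from the paper

    @property
    def timeout_seconds(self) -> float:
        # The timeout is the healthy boot time plus some headroom.
        return self.measured_boot_seconds * self.margin


def should_trigger_recovery(config: BootWatchdogConfig,
                            elapsed_seconds: float,
                            boot_completed: bool) -> bool:
    """Return True when the boot is unfinished and has overrun the timeout."""
    return (not boot_completed) and elapsed_seconds > config.timeout_seconds
```

With a measured boot time of 40 seconds and the assumed margin of 1.5, a boot still unfinished after 60 seconds would switch the system over to the recovery platform.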

However, to guarantee accurate and uncorrupted re-writes one more design feature must be carefully attended to: re-writes should take place at the block level, recopying the system bit by bit rather than file by file. When the system is copied at the block level, corruptions are far less likely to creep in than when software is copied over at the file level. File level rewrites cannot compensate for corruption of the physical storage device, but bit level rewrites can. Only by guaranteeing that every bit is successfully re-written to the platform’s physical storage medium – whether disk or solid state – can the system recovery mechanism guarantee that a successful recovery procedure has been completed, and that every fragment of data has been successfully returned to its initial post-install state.
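
A block level re-write with verification, as described above, can be illustrated with a short Python sketch. It assumes the recovery image and the target are both available as seekable binary streams; the function name `rewrite_blocks` and the 4096-byte block size are assumptions made for the example, and a real implementation would run in the recovery environment against the raw storage device.

```python
from typing import BinaryIO

BLOCK_SIZE = 4096  # assumed block size; real devices may differ


def rewrite_blocks(image: BinaryIO, target: BinaryIO,
                   block_size: int = BLOCK_SIZE) -> int:
    """Copy `image` onto `target` block by block, verifying each block.

    Returns the number of blocks written. Raises IOError on the first
    block that does not read back identically.
    """
    image.seek(0)
    index = 0
    while True:
        block = image.read(block_size)
        if not block:
            break  # end of the recovery image
        offset = index * block_size
        target.seek(offset)
        target.write(block)
        target.flush()
        # Read the block back and compare it with what was written.
        # On real hardware this check must bypass any OS cache.
        target.seek(offset)
        if target.read(len(block)) != block:
            raise IOError(f"verification failed at block {index}")
        index += 1
    return index
```

Because every block is read back after it is written, the routine can report either a complete, verified re-write or the exact block at which the storage medium failed.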

Automated recoveries at scheduled intervals

As systems deteriorate over time, the computing platform slows down. This can be a troublesome, debilitating problem for finely tuned automation networks that demand extreme precision at the process level. Preventive maintenance procedures are, for these situations, a critical tool in an administrator’s maintenance arsenal. A system recovery option that allows system administrators to configure a scheduled software recovery eradicates this worry.

To combat the persistent problem of system slowdowns, administrators must be able to configure a system to re-write itself at scheduled intervals. After estimating when the system begins to suffer and slow down from routine use, an administrator may set it to perform a scheduled recovery to its initial post-install state, allowing engineers to push the efficiency of their predictive maintenance routine further than before. By resetting a software platform at set times, administrators can ensure that a healthy hardware setup continues to function at the benchmarks for which it was initially configured. By periodically refreshing the entire computing platform, every computer in the network will consistently perform as it did in its freshest post-install configuration.
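
The scheduling arithmetic behind such interval-based re-writes is simple, as the following Python sketch shows. The function names and the 30-day interval in the usage note are hypothetical; the paper does not prescribe an interval, which the administrator would estimate per system.

```python
from datetime import datetime, timedelta


def recovery_is_due(last_recovery: datetime, interval: timedelta,
                    now: datetime) -> bool:
    """Return True once at least one full interval has elapsed."""
    return now - last_recovery >= interval


def next_scheduled_recovery(last_recovery: datetime, interval: timedelta,
                            now: datetime) -> datetime:
    """Return the first scheduled recovery time strictly after `now`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < last_recovery:
        return last_recovery + interval
    # Whole intervals that have already passed since the last recovery.
    elapsed_intervals = (now - last_recovery) // interval
    return last_recovery + (elapsed_intervals + 1) * interval
```

For example, with a last recovery on 1 April 2014 and an assumed interval of 30 days, a check on 15 May 2014 reports that a recovery is due and that the next slot on the schedule falls on 31 May 2014.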

Fully automated recoveries on system crashes or slowdowns

An intelligent recovery mechanism will need to be configured to automatically re-write the system whenever the boot process fails to complete within a specified period of time.

As mentioned above, administrators should be able to simply set a platform to return itself to a fresh post-install state whenever the boot process crashes. Engineers configure a BIOS timeout counter; whenever the system fails to boot up by the appointed time, a truly smart recovery system will automatically re-boot the system into a secure recovery environment, from which it will then re-write the entire platform, bit by bit. Once this is completed, the system will again re-boot into the original platform. If the new initialisation fails, then the smart recovery mechanism will continue attempting the re-writes until one of two basic conditions is met:

1. The system successfully recovers.

2. The system consistently fails, whereupon the recovery mechanism concludes its work and takes the system off-line. The number of times the system will attempt the recovery process will be configured by the system administrator.

Of course, if preferred, the system may also be allowed to continue its recovery attempts. Depending on administrator preferences, repeated boot failures may actually serve as a notification mechanism for critical maintenance.
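
The retry behaviour described above can be sketched as a bounded loop. This Python model is illustrative only: `run_recovery_cycle` and its two callback parameters are hypothetical stand-ins for the BIOS-level re-write and boot check, and passing `max_attempts=None` models the option of letting the system continue its attempts indefinitely.

```python
from typing import Callable, Optional, Tuple


def run_recovery_cycle(rewrite_platform: Callable[[], None],
                       boot_succeeds: Callable[[], bool],
                       max_attempts: Optional[int] = 3) -> Tuple[bool, int]:
    """Re-write and re-boot until the boot succeeds or attempts run out.

    Returns (recovered, attempts_made). With max_attempts=None the loop
    only ends when a boot succeeds.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        rewrite_platform()      # block level re-write from the cached image
        if boot_succeeds():     # re-boot into the original platform
            return True, attempts
    # Consistent failure: the caller takes the system off-line.
    return False, attempts
```

A result of `(False, n)` corresponds to the second condition above, where the mechanism concludes its work after the administrator-configured number of attempts.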

Remotely initiated automated recoveries

While full automation is useful, certain situations will demand user-initiated recoveries as well. User-initiated recoveries may be broken down into two basic types: recoveries initiated remotely, via a SCADA system or control room, and recoveries initiated by a user present at the physical device.

‘Remote’ recoveries include not only procedures that are initiated from a far-distant control room, but also those called from a local control station located on-site. The mechanism, in either case, is straightforward: when a system administrator perceives that a computer is crashing, or suffering from problematic slowdown, they may send a call to the device that begins an automated rewrite. With little more than a click of a mouse, the administrator will take the device offline and return the platform to its earliest, most pristine configuration. Because the rewrite is at the bit level, administrators can confidently recover the system whenever they feel the need, for whatever reasons they feel are justified.

Remote rewrites give administrators the power of tuning up or rescuing a system remotely, whenever the need arises. These recoveries can become a powerful tool beyond system rescues, allowing administrators to evaluate the condition of remote sites whenever they need.

Manual recoveries initiated at the device’s physical location

Manual recoveries, on the other hand, take place at the device itself. With an automated system recovery solution, administrators may build a USB key that automatically triggers a full system rewrite. Manual recovery keys are well suited to user-initiated recoveries where authorised or trained engineers are not available, as is often the case at remote sites where computers are used as HMIs for heavy industrial machinery: for instance, ships, oil platforms or solar farms.

Manual recoveries are simplicity itself: after determining that the system requires software maintenance, users only need to insert the USB key into the device and then restart the computer. The recovery then proceeds automatically and securely with no further interaction required. Once the process is completed, the platform will either be returned to its earliest post-install state, or its permanent failure confirmed.

In conclusion

An effective smart recovery solution is a secure, fully automated, intelligent BIOS level platform recovery that copies all software at the block level. It offers convenient usage modes and configuration options to suit the needs of industrial computing and automation systems. Configuring the recovery solution should be the last step automation engineers complete before deployment, after every other element of the platform has been set up and configured. In today’s automation environment, every remote embedded computer should be equipped with the hardware and BIOS enhancements that deliver this powerful improvement on traditional maintenance and rescue procedures.







© Technews Publishing (Pty) Ltd | All Rights Reserved