In the last two decades, Ethernet and the TCP/IP suite of protocols have become the de facto standards not only for home networks and the Internet, but also for mission-critical control, automation and protection systems used in the industrial and utility sectors.
Originally used mostly for monitoring devices and systems, and for facilitating production monitoring and communication between departments, the network is becoming more and more essential to actual protection and automation functionality. This means that the network itself is becoming mission-critical, and that stable, redundant communications are essential.
This article will explore the criticality of the network, its initial planning, design, commissioning and installation, as well as the importance of having dedicated personnel to maintain and expand it as necessary.
A strong backbone supports a healthy nervous system
The modern-day TCP/IP network can and should be viewed as the nervous system of an industrial or utility application and should be treated with the same importance and respect. The network is often seen as just a secondary system that is not as important as the end devices connected to it. While this is true in a high-level sense, it is important to realise that the network services and supports all of these end devices and as such is just as important as they are, if not more so. Having redundancy and backups for the end devices means nothing if the network does not allow those devices to communicate correctly with one another.
Because the importance of the network is not well understood, the wrong people are often put in charge of designing or maintaining it. The result is a substandard implementation that does not work at full efficiency, is not stable or reliable enough, or costs much more than it needs to. Alternatively, a good network may be designed but then not commissioned or monitored correctly, leading to a loss of efficiency in the future.
The initial planning and design of the network is one of the more critical phases. As with most systems, a communications network that is not planned correctly from the beginning can be inefficient and fail to provide the service, reliability or availability required for mission-critical control or automation. One can also end up with a bloated network that over-caters for future expansion and adaptation, meaning a very large capital expenditure that is often not fully utilised over time.
Stay flexible without over-stretching
Catering for expansion is important. These days, however, this can be done with modular hardware that is expanded by installing modules at a later stage, rather than by buying high-port-count (and high-cost) switches that are never planned to be fully populated with end devices. A modular setup like this can also be used to cater for spares in a much more efficient way.
Often on these networks, two or three general ‘categories’ of switches may be identified, such as a small switch (low port count), a larger switch (higher port count for end devices) and possibly a backbone switch (low port count but high bandwidth for connecting various sections of the network together).
Modular switch options often allow for the modules to be shared between different switches, so one could hold one or two of each chassis type (small, large, backbone) on hand, with a selection of common modules which can be installed into any of the chassis at a moment’s notice to replace a failed switch or expand a network section.
Similarly, SFP (small form-factor pluggable) modules can be utilised, which allow installations that can provide a variety of different copper and fibre options. A common option, for instance, could be to select a unit that comes with a set number of copper RJ45 ports and then a set number of SFP slots. SFPs could be kept separately, to be installed without delay when required, to provide the appropriate fibre or copper cable interface.
These types of replacement/spares strategies are cost-effective while still allowing network administrators to react promptly to any failure or change on the network, without having to wait through long procurement lead times for new hardware. The procurement delay becomes non-critical: failed hardware is replaced from on-hand stock, and replacement stock is ordered while the network continues running. Such strategies also allow for a much ‘slimmer’ network that does not have an abundance of unused ports, but which can still react very quickly to any requirement for expansion.
Prepare properly before stepping into the ring
Once planned, designed and approved, the network still needs to be commissioned and configured for its required role. This, again, is an essential step that should not be underestimated. A proper and detailed network design means nothing if it is not implemented correctly.
Configuration should be performed by qualified personnel in a controlled and unhurried fashion. Where possible, initial configuration and testing should be done in a lab environment rather than on the live system itself, especially where interruptions to the network could result in production impacts. This lab environment should match the planned final system as closely as possible from a logical point of view. This means that if software in a control room will speak with a device on site via a routed connection, this routed connection must be in place during testing with as close a logical match to the site as possible. Often a very well thought out design is wasted by an incorrect commissioning phase, leading to networks that do not fully match the design or that have not been tested for unforeseen issues.
Proper testing at this stage is also critical, both to confirm that the configuration and commissioning were done correctly and to identify other possible issues as mentioned above. In this author’s experience, end systems are often tested across ‘flat’ networks which do not include different VLANs, IP ranges, routers, etc. This confirms only that the system itself works at a base level, yet the system is signed off on that basis. When the system is commissioned on site, it is often found that the site network is not as ‘flat’ as the test network, and unforeseen issues can arise, especially when firewalls are in place between different sections of the network. As such, it is critical that testing be done on a closely matching logical network including routers, firewalls, correct end software and so on.
Rectifying an issue, whether major or minor, is much simpler in a controlled test system than it is on a live system, where troubleshooting is often simply impossible without arranging a shutdown of the entire site. Ensuring that the initial commissioning is done properly (including regular configuration backups, which are also often overlooked) and that it matches the documented design makes future troubleshooting much simpler and less intrusive.
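As a rough illustration of why a ‘flat’ test network can hide problems, the following Python sketch lists which planned communication flows cross a subnet boundary and would therefore pass through a router or firewall on site. The subnets, addresses and flow names are invented for the example, and the helper names `subnet_of` and `routed_flows` are hypothetical; it is a planning aid under those assumptions, not a test tool.

```python
import ipaddress


def subnet_of(address, networks):
    """Return the documented network containing the address, or raise if none does."""
    ip = ipaddress.ip_address(address)
    for network in networks:
        if ip in network:
            return network
    raise ValueError(f"{address} is not in any documented subnet")


def routed_flows(flows, subnets):
    """Return the names of flows whose endpoints sit in different subnets."""
    networks = [ipaddress.ip_network(s) for s in subnets]
    crossing = []
    for name, source, destination in flows:
        if subnet_of(source, networks) != subnet_of(destination, networks):
            crossing.append(name)
    return crossing


# Hypothetical site design: one control-room subnet and one field subnet.
site_subnets = ["192.168.10.0/24", "192.168.20.0/24"]
planned_flows = [
    ("SCADA poll of field RTU", "192.168.10.5", "192.168.20.40"),
    ("HMI to historian", "192.168.10.5", "192.168.10.9"),
]

# Flows listed here need a router (and any firewall rules) present in the lab test.
print(routed_flows(planned_flows, site_subnets))
```

Any flow the sketch reports is one that a flat lab network would never exercise properly, so it is a candidate for an explicit routed test case.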
On-site installation is generally a much simpler step, especially when most of the initial commissioning is done in a lab environment. During the installation phase the goal is to have the hardware be simply plug-and-play, meaning that once mounted, powered and connected with relevant communications cables, the hardware should be all ready to go. A final on-site test can then be implemented, which does not have to be as comprehensive as the lab testing but must ensure the end systems and network are performing to expectations. Where possible, no configuration of hardware should be required at this stage.
Knocked down doesn’t mean knocked out
A critical part of industrial Ethernet networks is proper link redundancy, meaning that if certain backbone links are damaged or disconnected for any reason, a redundant backup link is activated to allow traffic to continue around the network unimpeded. Networks are often planned and implemented with multiple levels of redundancy, giving the high availability and reliability that is key. However, if this redundancy is not monitored, which is often the case, then no one reacts to failures, and over time the network loses its redundancy.
For instance, if the backbone of the network is connected in a ring fashion, then we have a single level of link redundancy (meaning one link can be lost without experiencing network failure of any kind). If a link in this ring is damaged or disconnected for any reason, the network should still be fully available, i.e., the redundancy will do its job. However, if the failed link is not rectified, then the network is no longer redundant and a second link failure will then cause a break in communications between sections of the network. Correct monitoring and regular maintenance of the network, if implemented, would pick up the original link failure soon after it occurred, allowing proactive rather than reactive maintenance.
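The ring example can be checked with a small Python sketch. It models only the physical topology of a hypothetical four-switch ring (the switch names are made up) and tests whether every switch can still reach every other one after links are removed. It does not model the redundancy protocol that would normally keep one ring link on standby to avoid loops.

```python
from collections import defaultdict, deque


def is_connected(nodes, links):
    """Return True if every node can reach every other node over the given links."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes = list(nodes)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:  # breadth-first search from the first node
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def surviving(links, failed):
    """Return the links that remain after the failed links are removed."""
    failed_set = {frozenset(link) for link in failed}
    return [link for link in links if frozenset(link) not in failed_set]


# Hypothetical backbone: four switches connected in a ring.
switches = ["SW1", "SW2", "SW3", "SW4"]
ring = [("SW1", "SW2"), ("SW2", "SW3"), ("SW3", "SW4"), ("SW4", "SW1")]

one_failure = surviving(ring, [("SW2", "SW3")])
two_failures = surviving(ring, [("SW2", "SW3"), ("SW4", "SW1")])

print(is_connected(switches, ring))          # True: healthy ring
print(is_connected(switches, one_failure))   # True: redundancy absorbs one failure
print(is_connected(switches, two_failures))  # False: the network is split in two
```

The middle case is the dangerous one in practice: the network still works, so an unmonitored failure goes unnoticed until the second link is lost.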
Replacing the original link can be done over a period of time, in the knowledge that the network is still operating correctly while the link is being repaired. However, in the event of two link failures, communications will be interrupted, meaning operation and safety may be compromised until the link failures are resolved, which could affect production negatively. Similarly, a device with two redundant power supplies can lose one and continue running, but if the faulty supply is not replaced, then the device has lost power supply redundancy.
Maintain a healthy lifestyle
Proactive monitoring and maintenance of the network and attached devices can be performed in a number of different ways, but a highly important and recommended method is to use some form of NMS (network management system). This is a software application that automatically monitors the network and/or attached devices.
Using a common open protocol called SNMP (simple network management protocol), one can have the NMS actively query network and end devices on a schedule (good for non-critical information such as port utilisation, etc.). Similarly, one could set the end devices or network devices themselves to send an active notification of any issues, an approach often used for more critical events such as ports going up or down. The NMS can then be used to store and aggregate all these network events, as well as to monitor things like port utilisation over time.
Any detected issue can be configured to trigger a user notification, which can normally be pushed to an email address or cellphone number, allowing the engineer to step in and resolve it. These systems automate a large portion of the network monitoring and eliminate much of the manual maintenance otherwise required. Often, they can also be used to monitor end devices at the same time, reporting on things such as HDD utilisation in servers or temperatures and conditions in cameras. They also provide other useful functionality such as asset accounting, statistics reporting, visual topologies and more.
Another important consideration, at the beginning and throughout the lifetime of a network, is the set of policies surrounding maintenance and changes to the network. This includes not only security considerations (such as password control and availability, access to devices, firewall implementations, etc.), which are outside the scope of this article, but also more standard maintenance considerations.
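To make the scheduled-polling case concrete, the sketch below shows how port utilisation is commonly derived from two successive readings of an interface octet counter, such as ifInOctets from the standard interfaces MIB. It assumes a 32-bit counter that wraps at most once between polls, and all sample values are invented. The SNMP query itself is left out; the function names are hypothetical and only the arithmetic is shown.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to zero at this value


def octet_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two polls, allowing for a single counter wrap."""
    return (current - previous) % modulus


def utilisation_percent(previous, current, interval_s, speed_bps):
    """Average utilisation of a port over the polling interval, as a percentage."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits = octet_delta(previous, current) * 8
    return 100.0 * bits / (interval_s * speed_bps)


# Invented example: two polls taken 300 seconds apart on a 100 Mbit/s port.
first_poll = 1_000_000
second_poll = 38_500_000
print(utilisation_percent(first_poll, second_poll, 300, 100_000_000))  # 1.0

# A counter that wrapped between polls still yields the correct difference.
print(octet_delta(2 ** 32 - 1000, 500))  # 1500
```

On fast links a 32-bit counter can wrap more than once within a polling interval, which is one reason to prefer 64-bit counters or shorter intervals where the devices support them.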
One example is keeping track of IP addresses. Often, especially where many third-party contractors or different departments within a single organisation are involved, the IP addresses assigned to devices are not properly documented and administered. This can lead to duplicate IP addresses on the network, which will cause communication faults, or to incorrectly subnetted and supernetted IP ranges, which in many cases lead to breakdowns in security and communications. IP address assignment (and other logical design changes or additions) should be handled by a single individual or team, with all requests being formally submitted, approved or denied, and then documented.
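A documented address register also lends itself to simple automated checks. The following Python sketch uses the standard `ipaddress` module to flag duplicate host addresses and overlapping subnet definitions. The device names, addresses and ranges are invented, and the two function names are hypothetical; a real register would more likely live in a spreadsheet or database than in code.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_addresses(assignments):
    """Return {address: [devices]} for every address assigned more than once."""
    devices_by_address = defaultdict(list)
    for device, address in assignments:
        devices_by_address[str(ipaddress.ip_address(address))].append(device)
    return {a: d for a, d in devices_by_address.items() if len(d) > 1}


def find_overlapping_subnets(subnets):
    """Return pairs of documented subnets whose address ranges overlap."""
    networks = [ipaddress.ip_network(s) for s in subnets]
    overlaps = []
    for index, first in enumerate(networks):
        for second in networks[index + 1:]:
            if first.overlaps(second):
                overlaps.append((str(first), str(second)))
    return overlaps


# Invented register entries: two devices were given the same address.
register = [
    ("plc-01", "10.0.1.10"),
    ("hmi-01", "10.0.1.11"),
    ("rtu-07", "10.0.1.10"),
]
# Invented subnet plan: the /22 supernet swallows both /24 ranges.
documented_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.0.0/22"]

print(find_duplicate_addresses(register))
print(find_overlapping_subnets(documented_subnets))
```

Running a check like this whenever an address request is approved helps catch conflicts before a device is connected to the network.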
Get a good coach on your side
It is important to realise that the long-term maintenance of a mission-critical network is not a hugely time-consuming operation, especially when the initial design and implementation of the network were done correctly and according to best practices. However, being able to react quickly to failures and changes is essential. As such, it is often not critical to have a permanent staff member on the payroll handling the day-to-day network maintenance, but neither should it be handed off as a side responsibility to an engineer whose focus should be elsewhere. Rather, it is worth having an agreement with a service provider that can provide the technical knowledge and professional services for the network maintenance as required.
Initial design, planning and implementation phases should enlist the services of a professional as well, to ensure a strong, reliable and cost-effective network that provides the uptime and reliability you require without completely breaking the bank.
© Technews Publishing (Pty) Ltd | All Rights Reserved