Human factors: security incident management (advanced)

This section complements and concretises the processes used to manage cybersecurity incidents. Attention is paid to the requirements that proper operation imposes on the people involved. Education and training are central but more is needed.

Lifecycle of security incident management

The figure presents a simplified management process for security incidents (based mainly on NIST SP 800-61). An organisation must prepare for security incidents, handle them when they occur, and follow up on them after they have been closed.

Security incident management makes use of all the capabilities and tools described earlier in this chapter. Prevention and response must be kept in balance. Complete prevention has proven impossible, partly for usability and cost reasons and partly because attackers have more techniques and imagination than system designers can anticipate. How resources are divided between prevention and response therefore depends on the organisation. The division must be considered carefully: prevention increases costs, while relying on response can be fatal if the organisation is unable to recover from an incident. Responding appropriately to security incidents also incurs costs that must not be overlooked. Risk assessment is therefore an essential part of security incident management.
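The trade-off between prevention and response can be illustrated with a small expected-cost calculation. This is only a sketch: the `Strategy` class, the strategy names, and every figure are invented for illustration, and a real risk assessment would use the organisation's own estimates and more than a single number per strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One hypothetical way of splitting budget between prevention and response."""
    name: str
    yearly_cost: float           # spend on controls and response capability
    incident_probability: float  # estimated chance of a serious incident per year
    loss_per_incident: float     # estimated loss if that incident happens


def expected_yearly_cost(s: Strategy) -> float:
    """Yearly spend plus the probability-weighted loss for one year."""
    return s.yearly_cost + s.incident_probability * s.loss_per_incident


# All figures below are invented for illustration only.
strategies = [
    Strategy("prevention-heavy", 200_000, 0.05, 1_000_000),
    Strategy("response-heavy", 80_000, 0.20, 400_000),
]

for s in strategies:
    print(f"{s.name}: {expected_yearly_cost(s):,.0f}")

cheapest = min(strategies, key=expected_yearly_cost)
```

An expected value hides the worst case. If a single incident could end the organisation's operations, the cheaper average is not necessarily the better choice, which is why the comparison belongs inside a broader risk assessment.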

Prepare: planning security incident management (advanced)

The first phase of security incident management is to put appropriate processes and capabilities in place before anything happens. This is also a statutory requirement (see NIS2) for all operators of critical infrastructure, such as electricity, water, healthcare, and logistics.

The creation of policies and procedures depends on the organisation’s structure and sector. Senior management must be involved in policy-making so that the scope of incident management, the organisational structure, and the performance and reporting practices are defined correctly. Policies should include formal action plans based on risk assessment, serving as a roadmap for responding to security incidents. The plans specify standard actions from which incident handlers can quickly select concrete measures when an incident is underway. All these practices, plans, and procedures are organisation- and sector-specific and are influenced by various regulations. For financial organisations, for example, these include Basel II and Sarbanes–Oxley.
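A pre-agreed action plan can be thought of as a lookup from incident type to standard actions. The sketch below is one possible representation, not a prescribed format: the `Playbook` class, the incident types, the actions, and the contact names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    """A hypothetical action plan for one type of incident."""
    incident_type: str
    standard_actions: tuple[str, ...]
    escalation_contact: str


# Invented example plans; real ones come from the organisation's risk assessment.
PLAYBOOKS = {
    "ransomware": Playbook(
        "ransomware",
        ("Isolate affected hosts from the network",
         "Preserve logs and disk images as evidence",
         "Check that backups are intact before restoring"),
        "ir-lead",
    ),
    "phishing": Playbook(
        "phishing",
        ("Reset credentials of affected accounts",
         "Block the sender and the linked domain",
         "Warn staff about the campaign"),
        "service-desk",
    ),
}

# Fallback so that an unrecognised incident still has a defined first step.
DEFAULT_PLAYBOOK = Playbook(
    "unknown", ("Escalate to the incident response lead",), "ir-lead"
)


def select_playbook(incident_type: str) -> Playbook:
    """Return the plan for the given incident type, or the fallback plan."""
    return PLAYBOOKS.get(incident_type.strip().lower(), DEFAULT_PLAYBOOK)


for step in select_playbook("Ransomware").standard_actions:
    print("-", step)
```

Having a fallback entry reflects the point that handlers should never have to improvise their first step under pressure.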

An important part of preparation relates to communication. Regulations now generally require that security incidents be reported to the authorities. It may also be important to establish trusted communication channels in advance with technology and service providers (such as software vendors and Internet service providers). Similar channels should be established between peer organisations to facilitate the sharing of early warnings and best practices. International organisations and facilitators of information exchange include the Task Force on Computer Security Incident Response Teams (TF-CSIRT), the Forum of Incident Response and Security Teams (FIRST), and the European Union Agency for Cybersecurity (ENISA).

Communication with customers, the media, and the general public must also be planned in advance. Organisations must be prepared to communicate when a cybersecurity incident affects customers or becomes public. For example, the General Data Protection Regulation (GDPR) requires that the individuals concerned be informed of personal data breaches that are likely to put them at high risk. GDPR compliance requirements affect cybersecurity, as organisations must protect and monitor their systems in order to comply.
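Notification obligations usually come with deadlines, so it helps to compute them as soon as an incident is confirmed. As one example, GDPR Article 33 asks the controller to notify the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of a personal data breach. The sketch below calculates that deadline; the function names are hypothetical, and other regulations set different windows.

```python
from datetime import datetime, timedelta, timezone

# GDPR Art. 33 window for notifying the supervisory authority.
GDPR_AUTHORITY_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time for the authority notification, given when the breach was noticed."""
    if became_aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across time zones, so reject them.
        raise ValueError("use a timezone-aware timestamp")
    return became_aware_at + GDPR_AUTHORITY_WINDOW


def time_remaining(became_aware_at: datetime, now: datetime) -> timedelta:
    """Time left before the deadline; negative if it has already passed."""
    return notification_deadline(became_aware_at) - now


aware = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
print(notification_deadline(aware).isoformat())
```

Working in timezone-aware UTC timestamps avoids disputes about when the clock started if the response team is spread across several countries.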

Responding to security incidents is largely a personnel‑intensive task associated with crisis management and requires the ability to work under pressure. Internally, it must be possible to prevent the spread of the incident and a shutdown of organisational operations. Externally, pressure from management, regulators, and the media must be handled. This requires qualified personnel who have practised their roles in advance. Continuous training is also necessary to keep pace with the latest developments in threats. Integrating key personnel into relevant communities, such as CERTs, also helps with information sharing and ensures that best practices are exchanged within the appropriate community.

Handle: actual incident response (advanced)

Handling security incidents requires three different kinds of activity: analysis, mitigation, and communication.

Analysis means investigating the security incident in order to understand its scope and the damage caused to systems, particularly to data. If data has been lost or altered, the damage may be significant. The investigation must therefore establish precisely what has been compromised and what has not, as well as when this occurred. This is extremely difficult for three reasons: attacks may last for months, attackers employ various techniques to remain hidden (for example, deleting logs, wiping systems, or encrypting communications), and it is difficult to operate on systems under attack (for example, if attackers detect remediation attempts, they may begin destroying data). Throughout the analysis, evidence must be collected in a way that preserves its integrity.
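One common way to show that collected evidence has not changed since collection is to record a cryptographic hash of each file at the moment it is gathered. The sketch below uses Python's standard `hashlib`; the `custody_record` function and its field names are hypothetical, and real forensic procedures involve considerably more than this (write blockers, documented handling, and so on).

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large disk images fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: Path, collected_by: str) -> dict:
    """Build a minimal record describing one collected evidence file."""
    return {
        "file": str(path),
        "sha256": sha256_of_file(path),
        "size_bytes": path.stat().st_size,
        "collected_by": collected_by,
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Recomputing the hash later and comparing it with the stored value shows whether the file is still identical to what was collected. The record itself should be stored somewhere the attacker cannot reach.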

Mitigation involves the use of emergency measures that can contain the incident and limit its impact. Mitigation must first limit damage to systems (such as data deletion or disclosure), which attackers may initiate if they are detected. It must also ensure that attackers do not move on to other systems. Containment may include blocking network access in certain areas or shutting down services, systems, or communications. Containment can have adverse effects on necessary operations. For example, cutting network access prevents attackers from communicating with compromised systems, but also complicates system recovery or backup operations.

It is common for containment measures to reveal additional information about the attacker, their methods, and their targets. In this way, analysis and mitigation influence each other (the loop shown in the figure).

Communication is an essential part of handling security incidents, as highlighted in the Prepare section. Once the extent of the damage has been established, it may be necessary to notify authorities and comply with regulations as required.

Follow-up: post-incident actions (advanced)

The final phase in responding to a security incident is to ensure that the attack has ended and to clean up the system.

After a security incident, it is important to review team performance and procedures in order to improve them. This is often difficult, because handling new incidents and returning to normal operations consume resources that could otherwise be devoted to long-term improvement (for example, faster or more accurate mitigation). The impact of an incident also tends to determine whether it is reviewed at all: major incidents usually lead to post-mortem analysis and policy changes, whereas low-impact incidents may be excluded from follow-up procedures. In practice, however, low-impact incidents often consume a large share of resources and should therefore also be analysed.
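Whether low-impact incidents really consume a large share of resources can be checked from the team's own records. The sketch below sums handling hours per severity level; the record format and all values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical closed incidents as (severity, handling_hours) pairs.
incidents = [
    ("low", 3.0), ("low", 2.5), ("high", 12.0),
    ("low", 4.0), ("medium", 6.0), ("low", 3.5),
]


def hours_by_severity(records):
    """Total handling hours for each severity level."""
    totals = defaultdict(float)
    for severity, hours in records:
        totals[severity] += hours
    return dict(totals)


def share_by_severity(records):
    """Fraction of all handling hours spent on each severity level."""
    totals = hours_by_severity(records)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {severity: 0.0 for severity in totals}
    return {severity: hours / grand_total for severity, hours in totals.items()}


for severity, share in sorted(share_by_severity(incidents).items()):
    print(f"{severity}: {share:.0%}")
```

In this invented data set the four low-severity incidents take 13 of the 31 hours, more than the single high-severity incident, which is the pattern that justifies including them in follow-up reviews.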

Communication is also an important part of follow-up. Lessons learned from security incidents should inform training in order to stay up to date with attackers’ methods. It should also be possible to share information with peers so that best practices spread throughout the community. This enables learning from incidents that do not directly affect one’s own organisation.

In post-incident follow-up, attribution (tracking the attacker) is useful (see more on attribution). Its goal is to understand where the attack originated and why, in particular the attacker’s motives. This helps restore the system to a functional state and prevent follow-up attacks. Attribution is, however, extremely expensive, especially if the goal is to use forensics to support legal action. At present, forensics and attribution are highly specialised fields and are not part of routine security operations and incident management. They require expertise, tools, and time beyond what SIEM analysts are able to provide.

Restoring a system is closely related to reliability engineering. In this context, it is worth noting that reliability, safety, and cybersecurity are converging. One reason for this is the increasing prevalence of IoT devices, which are having a growing impact on safety-critical systems while at the same time requiring secure remote maintenance. In such cyber‑physical environments, those responsible for security incident management must broaden their expertise. Conversely, specialists focused on reliability engineering are also increasingly required to operate in purely IT environments (for example cloud computing platforms), which must be resilient against all kinds of failures, such as power outages.