- COMP.SEC.100
- 12. Operating Systems and Virtualisation
- 12.5 Procedures for isolation and mediation
Procedures for isolation and mediation¶
This large section, accounting for over a quarter of the text in this module, refines the view of security mechanisms in operating systems. Most of them were already introduced in the introduction. The following topics belong to this theme, but some are treated elsewhere and several are “advanced” content (marked with * in the list):
- user authentication, although this mainly refers to another module.
- access control: access lists and * capabilities.
- removal of data from memory and storage media. This also belongs to operating system functionality, but is raised here only in this list, as it has been treated thoroughly in the context of “reverse” forensics.
- * memory protection, implemented in software and with hardware assistance.
- * hardware-provided protection rings. The information mediation mentioned in the heading is most clearly visible here, when transitions are made from one ring to another.
- accounting, that is, logging. This is mainly a reference to an earlier module and, like the following topic, is presented only in this introductory section.
- administration, that is, how the operating system itself must be managed.
- in the final submodule, embedded systems and IoT devices are discussed very briefly.
Pay attention to the fact—since it is not explicitly emphasized—that only privileged code is able to configure the required isolation and ensure checks for all operations. Some contextual background is provided beyond the core learning objectives. This can be useful, as many of these ideas were already emerging in the 1960s. We therefore start from there.
Multics was the first large operating system in the 1960s that was designed from the outset with security in mind. Although it never became very popular, many of its security innovations are still found in today’s most widely used operating systems. Even if some individual features were not invented by the Multics team, their integration into a coherent, working, security-driven operating system design was novel. Multics provided protection rings, virtual memory, segment-based memory protection, and a hierarchical file system that supported both discretionary and mandatory access control (DAC and MAC). The MAC added to Multics at the request of the U.S. military was largely a software implementation of the Bell–LaPadula model. Finally, Multics ensured that its many small software components were strongly encapsulated and accessible only through their published interfaces.
Multics was very advanced, perhaps even beyond some modern operating systems, but this was also its problem. The system became so large and complex that it violated the principle of psychological acceptability, at least in the view of some of its developers. Frustrated, Ken Thompson and Dennis Ritchie decided to write a new and much simpler operating system. As a pun, they called it “Unics”, later spelled Unix. Like all of today’s most common operating systems, it relied on a small number of primitives for isolating its security domains. Even smaller was—and remains—the teaching-oriented operating system Minix. Its restrictive license was one of the reasons for the emergence of the now very widely used Linux (1991). Its name refers both to Linus Torvalds and to Unix, with which it shares many similarities, especially regarding the shell.
The subsections below mostly deal with security arrangements located fairly deep within the operating system. Before delving into them, the following is a list-style overview of topics that belong to the subject but are somewhat closer to the user. Examples are taken from Linux systems.
- User accounts and login
- password systems, encrypted password storage, and hiding the storage location
- home directory, user profile; at system level /etc/profile and /etc/passwd; for the user .login and .cshrc (or .profile).
- privileges of the superuser (root); other special users (e.g. certain service processes such as printer or web servers); granting situational root privileges to non‑administrators (the sudo command, “s. user and do”, where the interpretation of “s” has evolved: super → switch → substitute).
- groups (a default group and additional groups; in some implementations, only one group is active at a time and switching is required).
- Access control, especially for files:
- basic discretionary access control using a simplified access list: the “ugo” actors (user, group, others) and “rwx” operations (x = execute). More detailed ACLs may also be in use (e.g. AIX, HP‑UX, Linux). It is important to understand how ACLs are interpreted, especially in the presence of conflicts or wildcards.
- additional issues: changing access rights or ownership; default protections (umask, whose bits remove permission bits from rwxrwxrwx), and their effects when creating, copying, and moving files; file deletion—whether it is final, intermediate, or merely removal of the name and directory entry link (as in Linux).
- use of elevated privileges via the suid and sgid mechanisms (“set {user|group} id”, where the process started from a file runs with the file owner’s or group’s ID). Group mechanisms and sgid may be safer than direct suid in, for example, device control.
- mounting files from other devices into the local view using the mount command (also from the USB bus or encrypted partitions). This involves risks, mitigated using options such as ‘nosuid’ and ‘nodev’, which respectively ignore set‑id bits and device files on the mounted volume. The latter is needed, for example, to prevent a stored kmem file from appearing as /dev/kmem, which represents memory used by the OS kernel.
- the global file /etc/hosts.equiv specifies hosts from which users are treated as locally authenticated under the same identity; the user‑specific file .rhosts can offer similar behavior under a different name. These mechanisms suffer from address spoofing, so ssh should be used instead, with configuration specifying at least RhostsAuthentication no. If RhostsRSAAuthentication yes is set, then .rhosts data is combined with RSA authentication of the remote host.
- Accounting, which in addition to managing resource usage relates to detecting security incidents (see an earlier module)
- lastlog, which directly relates to login events; other log files
- accounting also exists in the file system: file metadata (inode) contains three timestamps—last access (atime), modification (mtime), and metadata change (ctime, e.g. owner or permissions).
- since the superuser can ultimately do anything in Unix, keeping a fully tamper‑proof audit log is not possible. Special arrangements are needed for administration—centralized logging, remote logging, write‑once media.
- Configuration and administration
- installation, configuration; restricting default settings; possible separation of privileges between technical, user, and security administrators.
- user management: updating user information (new, changed, removed); application updates and troubleshooting.
- security administration: identifying weak passwords; searching for other vulnerabilities; detecting and handling attacks.
- technical administration: updating hardware and operating systems, including security and other patches and new versions.
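The umask rule in the list above can be captured in one line: bits set in the mask are cleared from the mode a program requests at file creation. A minimal Python sketch (the function name is illustrative, not part of any real API):

```python
def effective_mode(requested: int, umask: int) -> int:
    """Bits set in the umask are removed from the requested permission bits."""
    return requested & ~umask

# A typical creation request of rw-rw-rw- (0o666) under the common umask 022
# yields rw-r--r-- (0o644); a stricter umask 027 also removes group write
# and all "other" bits from a full 0o777 request.
print(oct(effective_mode(0o666, 0o022)))  # 0o644
print(oct(effective_mode(0o777, 0o027)))  # 0o750
```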
Authentication and identification¶
Since authentication is the topic of a later module, it is only noted here that alongside or instead of traditional passwords, many other methods are now used: smart cards and other token devices, as well as fingerprints and other biometrics. Regardless of the authentication method, the operating system maintains for each user an account identifier and possibly information about group membership, which affects access rights (e.g. student, teacher, administrator). Most operating systems associate the user identity with every process and use it to track file ownership and access permissions.
Credentials are the data that a user provides for authentication, and the same term is also used for the data against which the operating system compares them. The former are no longer needed after use, but the secure storage of the latter is critically important. Many modern operating systems rely on hardware to protect such sensitive data. For example, they may use a TPM module or an isolated virtual machine as a credential store, so that even a compromised virtual machine cannot directly access credentials. Similar features appear in the storage of cryptographic keys, and some credentials are in fact keys.
Access lists¶
In the Multics file system, each data object had an access list, that is, an access control list (ACL). Conceptually, access control can be pictured as a table of users and data objects; an ACL stores, for each object, which users have which access rights to it. Most modern operating systems have adopted some form of ACLs, typically for file systems. In Unix‑based systems, the default access control is very simple: each file is owned by one user and one group. Each user may belong to one or more groups. The bits rwx indicate read, write, and execute permissions (for a directory, x means permission to access the files within, which additionally requires the files’ own r, w, or x bits). There are three such bit groups: the first applies to the owner, the second to group members, and the third to all other users of the same machine.
Basic Unix file permissions are thus simple, but modern systems (such as Linux and Windows) also support extended ACLs (for example, explicit permissions for multiple users or groups). Whenever someone attempts to read, write, or execute a file, the operating system checks the ACL to see whether the required permissions are present. In addition, Unix access control is typically discretionary, because the file owner can assign these rights to others.
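The basic ugo/rwx check described above can be sketched as follows. Note a classic subtlety of the traditional Unix semantics: the first matching class decides alone, so an owner whose own bits deny access is refused even if the group or other bits would allow it. This is a toy model, not the kernel's actual code:

```python
def may_access(uid, gids, file_owner, file_group, class_bits, want):
    """Classic ugo check: the first matching class (owner, group, other)
    decides alone. class_bits is e.g. ("rw", "r", "") for mode 640."""
    owner, group, other = class_bits
    if uid == file_owner:
        cls = owner
    elif file_group in gids:
        cls = group
    else:
        cls = other
    return want in cls

# Owner may write, a group member may only read, everyone else gets nothing:
print(may_access(1000, {1000}, 1000, 100, ("rw", "r", ""), "w"))  # True
print(may_access(1001, {100},  1000, 100, ("rw", "r", ""), "w"))  # False
```

The first-match rule means that may_access(1000, {100}, 1000, 100, ("", "r", ""), "r") is False: the owner class applies and denies, even though the group class would permit reading.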
In addition to discretionary access control (DAC), Multics also introduced mandatory access control (MAC). Although it took other systems a long time to adopt it, MAC is now available in many operating systems inspired by Multics. Linux even provides frameworks that allow different access control solutions to be connected to reference monitors. These mediate references in accordance with access control settings, meaning that operations deemed security‑sensitive are checked. Several MAC solutions exist; the best known is probably Security‑Enhanced Linux (SELinux). It is a set of Linux security enhancements originally developed by the U.S. National Security Agency (NSA).
SELinux assigns users and processes a context consisting of three strings: username, role, and domain. The inclusion of roles provides a natural link to RBAC. However, many systems use MAC only with domains, setting the username and role to the same value for all users. In addition to processes, SELinux contexts apply to resources such as files, network ports, and devices. Based on this configuration, administrators can define system‑wide access control policies. For example, they can specify which domains are allowed to perform particular operations (r, w, x, or establishing connections) on a resource. Policies can also be far more complex, including multiple security levels and information‑flow control in accordance with the Bell–LaPadula model.
Mandatory access control in systems such as SELinux is based on a single system‑wide policy. It is set by the administrator and does not change. Untrusted processes cannot define or modify their own policies. Such ideas have, however, been explored in research systems.
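A domain-based MAC policy of the kind SELinux enforces can be pictured as a fixed, default-deny table from (domain, resource type) pairs to permitted operations. The type names below echo common SELinux conventions, but the model is a deliberately tiny sketch, not SELinux itself:

```python
# System-wide policy, set by the administrator and immutable to processes:
POLICY = {
    ("httpd_t", "web_content_t"): {"r", "x"},
    ("httpd_t", "http_port_t"): {"connect"},
}

def mac_allows(domain, resource_type, op):
    """Default deny: anything not explicitly granted by the policy is refused."""
    return op in POLICY.get((domain, resource_type), set())

print(mac_allows("httpd_t", "web_content_t", "r"))  # True
print(mac_allows("httpd_t", "web_content_t", "w"))  # False: not granted
print(mac_allows("shell_t", "web_content_t", "r"))  # False: unknown domain
```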
Capabilities (advanced)¶
An alternative to the ACL technique is capabilities, which date back as far as 1966. Unlike ACLs, capabilities do not require resource‑centric management with explicit information about who is allowed to perform which operation. Instead, users present a capability that itself proves that the requested access is permitted. Possession of a capability grants all rights defined in it, and whenever a process wishes to perform an operation on a resource, it must present the appropriate capability.
Naturally, forging capabilities must be impossible, so that users cannot arbitrarily grant themselves access to any object they desire. Consequently, the operating system must either store capabilities in a secure location (for example, in the kernel) or protect them against forgery using dedicated hardware support or software‑based cryptography. In a hardware-based approach, the operating system may store a process’s capabilities in a protected in‑kernel table, and when the process wants to operate on a file, it provides only a reference to the capability (such as a file descriptor) without directly manipulating the capability itself. In a software-based approach, the process may handle the capability directly, but any attempt to modify it improperly is detected through cryptographic verification.
Capabilities are highly flexible and enable convenient delegation policies. For example, a process that fully owns a capability can grant another process the same rights or some subset of them. This makes discretionary access control easy to implement. On the other hand, in some situations it may be undesirable to freely copy and distribute capabilities. For this reason, most capability‑based systems augment capabilities with a few additional bits indicating such restrictions: whether copying is allowed, whether the lifetime of the capability should be limited to the duration of a procedure call, and so on.
When comparing ACLs and capabilities, another difference becomes apparent: ACLs typically revolve around users (“a user with ID X may read and write this object”), whereas capabilities can be extremely fine‑grained. For example, separate capabilities may be used for sending and receiving data. In accordance with the least privilege principle, executing every process with the user’s “full power” is less secure than executing a process with only the capabilities it has been granted. Running processes with the credentials of the user who launched them, as is common in modern operating systems, is known as ambient authority. This violates the least privilege principle much more readily than equipping a process with fine‑grained capabilities tailored to exactly what it needs. In addition, capabilities do not even allow a process to name an object unless it has the corresponding capability, whereas ACL‑based systems allow all objects to be named, because access checks occur only when an operation is attempted. A further drawback of ACLs is that they may become very large as the number of users, permissions, and objects grows.
On the other hand, revoking a particular user’s access to a specific object is straightforward with ACLs: one simply removes the relevant permissions from the object’s ACL. With capabilities, revoking access to a given object is more difficult, because it may not even be known which users or processes possess the capability. Introducing an additional level of indirection can help somewhat. For example, capabilities may be made to point to an indirect object, which in turn refers to the real object. To revoke the capability (from all users or processes), the operating system can invalidate the indirect object. But what if the goal is to revoke the capability only from some processes? Although solutions exist beyond scanning all processes, revocation remains the most challenging aspect of capabilities.
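The delegation and indirect-object revocation ideas above can be sketched together. A capability carries a rights set and may be copied with a subset of those rights; revoking the indirect object invalidates every capability derived from it at once. All class and method names here are illustrative:

```python
class Indirect:
    """A level of indirection: revoking it invalidates every derived capability."""
    def __init__(self, obj):
        self.obj = obj
    def revoke(self):
        self.obj = None

class Capability:
    def __init__(self, target, rights):
        self.target = target              # an Indirect wrapper, not the object itself
        self.rights = frozenset(rights)

    def delegate(self, rights):
        """Hand another process a copy carrying a subset of these rights."""
        if not set(rights) <= self.rights:
            raise PermissionError("cannot delegate rights one does not hold")
        return Capability(self.target, rights)

    def use(self, op):
        if self.target.obj is None:
            raise PermissionError("capability revoked")
        if op not in self.rights:
            raise PermissionError("operation not granted")
        return (op, self.target.obj)
```

For example, a process holding a {"r", "w"} capability to a file can delegate a read-only copy; after Indirect.revoke(), both the original and the copy stop working, without anyone having to track down who holds them.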
Today, many major operating systems also provide at least some support for capabilities. For example, the L4 microkernel—used in many mobile devices—has included capabilities since its 2008 version. The formally verified security variant seL4 is based on a similar access control approach. Linux has offered so‑called POSIX capabilities since 1997, but these are limited and differ substantially from full capability systems.
Memory protection and address spaces (advanced)¶
A process should not normally be able to read another process’s data without an appropriate access control check. Multics and nearly all subsequent operating systems (such as Unix and Windows) isolate data by giving each process (a) its own processor state (registers, program counter, etc.) and (b) its own portion of memory. Whenever the operating system decides to replace the currently running process P1 with process P2, it performs a so‑called context switch: it first stops P1 and saves the entire processor state to memory in a region inaccessible to other processes. Next, it loads P2’s corresponding state into the CPU (or initializes it), updates the bookkeeping that determines which parts of physical memory are available, and begins executing P2 at the address indicated by its program counter. Because user processes cannot manipulate this bookkeeping, P2 cannot access any of P1’s data. Most modern operating systems maintain memory bookkeeping using page tables and store in a register a pointer to the top‑level table. This register is part of the processor state that is saved and restored during context switches.
The primary purpose of page tables is to give each process its own virtual address space, even though the amount of physical memory may be much smaller. From a security perspective, the key consequence of this structure is that a process can access memory data only if its page tables define a mapping for it. These mappings are controlled by the operating system, which can therefore decide precisely which memory is private, which is shared, and with whom. Enforcement is carried out by hardware, the memory management unit (MMU). If the translation from a virtual address to a physical one is not found in the small but extremely fast cache known as the translation lookaside buffer (TLB), the MMU searches for it by traversing the page tables step by step.
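A toy model of the mechanism: the MMU's job reduces to looking up the virtual page number, checking permission bits, and composing the physical address. Real hardware walks multiple table levels and caches translations in the TLB, both of which the single-level sketch below omits:

```python
PAGE = 4096  # toy page table: virtual page number -> (physical frame, perms)

def translate(page_table, vaddr, write=False):
    """Return the physical address, or None on a page/protection fault."""
    entry = page_table.get(vaddr // PAGE)
    if entry is None:
        return None                      # no mapping: page fault
    frame, perms = entry
    if write and "w" not in perms:
        return None                      # read-only page: protection fault
    return frame * PAGE + vaddr % PAGE

# Virtual page 0 maps writable to frame 5; page 1 maps read-only to frame 7:
pt = {0: (5, "rw"), 1: (7, "r")}
print(translate(pt, 100))                # frame 5, offset 100
print(translate(pt, PAGE + 4, write=True))  # None: write to read-only page
```

Because each process gets its own page table, an address that is meaningless in one process can be a valid mapping in another; the isolation follows directly from the kernel controlling which entries exist.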
Page tables are the primary mechanism for managing memory usage in modern operating systems. Another method, now largely obsolete, is segmentation. The basic idea (see the introduction) is nevertheless worth understanding, because similar complex and multi‑level address translation also appears in virtualized environments.
Hardware-based memory protection (advanced)¶
Hardware‑assisted memory protection continues to evolve, often rediscovering old ideas. Intel’s Memory Protection Keys (MPK), introduced in the 2010s, are an example of this (see a discussion from 2015, which also refers back to the 1960s—perhaps it makes sense to reinvent the wheel when the vehicle, in this case memory, offers new possibilities). MPK allows the use of four previously unused bits in page table entries, yielding one of 16 possible values. These allow developers to partition memory into protection domains and, for example, permit only a specific cryptographic library to access encryption keys. An additional 32‑bit register associates two bits with each key value, indicating whether reading or writing pages marked with that key is allowed. Through this register, access can be adjusted easily without traversing all page table entries for the protected domain.
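The MPK access check can be modelled directly from the description above: for key k, bit 2k of the 32-bit register (PKRU on x86) disables all access and bit 2k+1 disables writes. A sketch of that lookup, hedged as a model rather than the hardware's exact microarchitecture:

```python
def mpk_allows(pkru, key, write=False):
    """Model of the per-key check: bit 2k = access-disable (AD),
    bit 2k+1 = write-disable (WD) for protection key 'key' (0..15)."""
    ad = (pkru >> (2 * key)) & 1
    wd = (pkru >> (2 * key + 1)) & 1
    if ad:
        return False                  # all access to pages with this key denied
    return not (write and wd)

pkru = 0b1000                         # WD set for key 1: key 1 is read-only
print(mpk_allows(pkru, 1))            # True: reads still allowed
print(mpk_allows(pkru, 1, write=True))  # False: writes blocked
```

Flipping two bits in this register changes the rights for every page tagged with that key, which is exactly why no page-table traversal is needed.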
Some processor models support even finer‑grained memory protection. In ARM processors, this is known as Memory Tagging Extensions (MTE). The idea is simple but powerful—and not entirely new. The processor assigns a so‑called tag to each memory granule (for example, 16‑byte blocks) in hardware. Corresponding tags are also attached to memory pointers, which are software‑stored data. The tags are small, for example 4 bits, allowing them to be conveniently stored in the top byte of a 64‑bit pointer, which would otherwise be unused. Whenever a program allocates memory, the allocator assigns a random tag and sets it in the pointer. From that point on, access via that pointer is permitted only if the tags match. This effectively prevents most spatial and temporal memory errors.
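A toy model of tag checking: the allocator tags both the memory granule and the top byte of the pointer, and every access compares the two. The granule size and 4-bit tag width follow the figures in the text; everything else (names, the dict standing in for hardware tag storage) is illustrative:

```python
TAG_SHIFT = 56    # tag lives in the otherwise-unused top byte of a 64-bit pointer
GRANULE = 16      # memory is tagged per 16-byte granule

def tag_pointer(addr, tag):
    """Model the allocator embedding a random tag into the pointer it returns."""
    return (tag << TAG_SHIFT) | addr

def load(ptr, memory_tags):
    """Permit the access only when the pointer tag matches the granule tag."""
    tag = (ptr >> TAG_SHIFT) & 0xF
    addr = ptr & ((1 << TAG_SHIFT) - 1)
    if memory_tags.get(addr // GRANULE) != tag:
        raise MemoryError("tag mismatch (possible use-after-free or overflow)")
    return addr

tags = {0x1000 // GRANULE: 0x7}       # hardware-side tag for one granule
p = tag_pointer(0x1000, 0x7)
print(hex(load(p, tags)))             # matching tags: access allowed
```

A stale pointer whose granule has since been re-tagged (the use-after-free case) fails the comparison, which is how MTE catches temporal errors.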
At the same time, some processors—especially in low‑power devices—lack even a full‑featured MMU. Instead, they provide a much simpler Memory Protection Unit (MPU). Its sole purpose is memory protection, implemented in a way reminiscent of the MPK technique described above. In the MPU model, operating systems define a set of memory regions with specific access rights and attributes. For example, the MPU of ARMv8‑M processors supports up to 16 regions. According to the configured regions, the MPU monitors all memory accesses—both instructions and data.
So far, we have assumed that the operating system needs protection against untrusted user applications. The opposite situation is also possible. A user may be running a security‑sensitive application atop a potentially compromised operating system or in a cloud environment where the service provider is not fully trusted. It should then be possible to protect data and applications without trusting other software. For this purpose, processors may offer hardware support for executing highly sensitive code in a protected, isolated environment. Such trusted execution environments include ARM TrustZone and Intel Software Guard Extensions (SGX). They offer somewhat different primitives. For example, code running in an SGX environment is intended to be part of a normal user process, but its data is always encrypted when it is written to memory outside the CPU. SGX also provides hardware support for attestation, allowing another party—possibly remote—to verify that code is executing in isolation and is the intended code. ARM TrustZone, by contrast, separates the regular operating system and user applications from a “secure world” with its own smaller operating system and a limited number of security‑related applications. Code in the secure world is invoked in a manner similar to a normal system call.
One interesting application of special environments such as TrustZone or Intel’s SMM mode is to monitor at runtime the integrity of a general‑purpose operating system, as an attempt to detect malware or rootkits early. Although trusted environments are closely related to operating system security, they are not explored further here. It is nevertheless important to note that in recent years several side channels have been discovered even in hardware‑based trusted environments.
It may be that the operating system is sound, but the hardware is not. Malicious or faulty devices may use direct memory access (DMA) to read or overwrite data that should be beyond their reach. Furthermore, with some standards (such as Thunderbolt over USB‑C), the computer’s PCIe links may be directly exposed to devices that users connect (PCIe, Peripheral Component Interconnect Express, uses point‑to‑point links rather than a shared bus).
Unfortunately, it is difficult for users to be certain that even a display or power cable does not contain malicious circuitry. A partial mitigation in most modern architectures is a dedicated MMU for device‑initiated memory access: the IOMMU (Input‑Output Memory Management Unit). It maps virtual addresses used by DMA devices to physical addresses, mimicking page‑based memory protection. In other words, devices may reference virtual addresses that the IOMMU translates, checks for permission, and blocks if the page is not mapped for the device or if protection bits do not permit the requested access. While this offers some protection against malicious devices (or even drivers), it is important to understand that the IOMMU was designed primarily to facilitate virtualization and should not be viewed as a full security solution. Many things can still go wrong. For example, an administrator may want to revoke a device’s access to a memory page. Because updating IOMMU page tables is slow, operating systems often delay such updates and perform them opportunistically together with other operations. As a result, there may be a short window during which the device still has access to a memory page, even though the permissions appear to have been revoked.
The increasing number of transistors per unit area gives manufacturers opportunities to integrate ever more hardware extensions onto processor chips. The mechanisms described above are not the last security‑related extensions. Other examples include cryptographic units, memory encryption, instructions for efficiently switching extended page tables, and pointer authentication, where hardware detects modifications of pointer values. New features will appear in future processors, and operating systems must adapt to make meaningful use of them. A broader perspective on these issues can be found in the module on hardware.
Protection rings (advanced)¶
One of Multics’ revolutionary ideas was protection rings, which form a hierarchy in which the innermost ring (0) has the greatest privileges and the outermost rings the least. Untrusted user processes execute in the outer ring, whereas the trusted and privileged kernel, which interacts directly with hardware, executes in ring 0. Intermediate rings may be used for system processes with varying degrees of privilege.
Protection rings generally require hardware support. Most general‑purpose processors provide such support today, although the number of rings varies. For example, the Honeywell 6180 supported eight rings in the 1970s, Intel x86 supports four, ARM v7 supports three (plus an additional one for TrustZone), and PowerPC supported two in the 1990s. As will be seen later, the situation becomes more complex because some modern processors have added additional execution modes. For now, it suffices to note that most common operating systems use only two rings: one for the operating system and one for user processes.
Whenever code requires an operation that demands more privilege than it has, it “calls” a lower ring to request the service. Thus, only trusted, privileged code can execute the most sensitive instructions or handle the most sensitive data. Provided a process does not trick more privileged code into performing unintended actions, rings offer strong protection. Multics originally envisioned transitions between rings via special call gates, which enforce strict checking and mediation. Code in an outer ring cannot call arbitrary inner‑ring functionality, but only predefined interfaces where the call and its arguments are validated against policy.
Although processors such as x86 still support call gates, few operating systems use them because they are relatively slow. Instead, user processes enter the operating system kernel by performing a system call via a software interrupt or trap. Typically this uses a highly optimized system‑call instruction whose name depends on the architecture (SYSCALL, SYSENTER, SVC, SCALL, etc.). Many operating systems place system call arguments in a predefined set of registers. As with call gates, traps and system calls ensure that execution continues at a predetermined kernel address, where argument checks are performed and the appropriate operation is invoked.
In addition to user processes calling the kernel, most operating systems also allow the kernel to invoke user processes. For example, Unix‑based systems support signals, which the operating system uses to notify user programs of events: errors, expired timers, interrupts, messages from another process, and so on. If a user process has registered a signal handler, the operating system pauses the current execution, saves the processor state on the stack into a so‑called signal frame, and resumes execution at the handler. When the handler returns, the process executes the sigreturn system call, causing the operating system to restore the processor state from the stack and resume execution.
The boundary between security domains, such as the operating system kernel and user‑space processes, is an ideal place to validate both system calls and their arguments. For example, in capability‑based operating systems the kernel validates capabilities, and in systems such as Minix‑3 processes are allowed to perform only certain types of calls, so any attempt to do something outside an approved list is flagged as a violation. Similarly, Windows‑ and Unix‑based systems must validate the arguments of many system calls. A particularly important example is the read system call, by which a user process requests data to be copied from a file or socket into a buffer. The operating system verifies that the process owns the target memory region. The same applies to the complementary write system call, where ownership of the write target must be checked.
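The argument check described for read can be sketched as a range test against the caller's memory mappings: the destination buffer must lie entirely within one mapping, and since the kernel will write into it, that mapping must be writable. A toy model with invented names, not any real kernel's code:

```python
def validate_user_buffer(mappings, buf, length):
    """mappings: (start, end, perms) ranges of the calling process.
    read() copies data INTO the buffer, so the range must be mapped writable."""
    return any(start <= buf and buf + length <= end and "w" in perms
               for start, end, perms in mappings)

maps = [(0x1000, 0x2000, "rw"), (0x4000, 0x5000, "r")]
print(validate_user_buffer(maps, 0x1800, 0x100))  # True: inside writable mapping
print(validate_user_buffer(maps, 0x1F80, 0x100))  # False: spills past the mapping
print(validate_user_buffer(maps, 0x4000, 0x10))   # False: mapping is read-only
```

The second case illustrates why checking only the start address is not enough: a buffer that begins inside the process's memory but extends beyond it would let the kernel overwrite unmapped or foreign pages.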
After executing a system call, the operating system returns control to the calling process. Here as well, the operating system must ensure that it does not produce results that compromise security. If a process, for example, uses the memory‑mapping system call mmap to request additional memory, the operating system must ensure that the memory pages it returns do not contain sensitive data from a previous process—by initializing every byte to zero.
Initialization issues can be very subtle. For example, compilers often introduce padding bytes in data structures for alignment. Because these padding bytes are not visible at the programming‑language level, the compiler may not initialize them. A security violation occurs if the operating system returns such a data structure in response to a system call and the uninitialized padding contains confidential data from the kernel or another process.
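The padding effect is easy to observe from Python via ctypes, which applies native alignment rules: a structure of one char and one 4-byte int typically occupies 8 bytes, the 3 bytes in between being compiler-inserted padding that no source-level assignment ever initializes.

```python
import ctypes

class Msg(ctypes.Structure):
    # One char followed by a 4-byte int: native alignment inserts
    # (typically) 3 padding bytes after 'flag'.
    _fields_ = [("flag", ctypes.c_char), ("value", ctypes.c_int)]

# Larger than the 5 bytes of declared fields on common platforms;
# the extra bytes are invisible at the language level.
print(ctypes.sizeof(Msg))
```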
Even the previously mentioned Unix signaling mechanism is interesting from a security perspective. In sigreturn, the processor state saved on the stack is restored. An attacker may be able to corrupt a process’s stack and place a forged signal frame there. If the attacker can then invoke sigreturn, they effectively set the entire processor state—including all register values—at once. This is a variant of return‑address manipulation attacks; see Return‑oriented programming (ROP) and SigROP (SROP).
Ring bypasses (advanced)¶
The situation regarding protection rings has become somewhat confusing, because modern processors offer virtualization instructions that allow hypervisors to access hardware functionality normally reserved for ring 0. To support this, processors have added what appears to be a level below ring 0. Since ring 0 on x86 has become synonymous with the operating system kernel (and ring 3 with user mode), this new hypervisor level is commonly referred to as ring –1. This also reflects the fact that guest operating systems running in virtual machines can execute ring‑0 instructions directly. Strictly speaking, the purpose of ring –1 differs from that of the original rings, and the name can therefore be misleading.
The situation becomes even more complex because some modern processors include additional execution modes. For example, x86 provides System Management Mode (SMM). At system boot, firmware controls the hardware and prepares the system for the operating system. When SMM is enabled, however, firmware regains control whenever a specific interrupt is delivered to the CPU. Firmware may, for instance, request an interrupt whenever the power button is pressed. In that case, normal execution halts and firmware takes over. It may save the processor state, perform necessary actions, and then return control to the operating system to continue an orderly shutdown. In some discussions, SMM is considered to be even below the other rings (ring –2).

Intel later added what some refer to as ring –3 in the form of the Intel Management Engine (ME). ME is a completely independent system, present in nearly all Intel chipsets; it runs proprietary firmware on a separate microprocessor and is always active: during boot, while the system is running, while it is in sleep mode, and even when it is powered off. As long as the computer is connected to power, communication with ME over the network is possible, for example to install updates. Although extremely powerful, its functionality is largely opaque, except that it runs its own small operating system (version 11 is reportedly based on Minix‑3).

Additional auxiliary processors that accompany the main CPU—whether Intel ME, Apple’s T2 chip, or Google’s Titan—raise an important question: can an operating system running on the main processor alone truly meet modern security requirements? At the very least, the trend appears to be toward complementing it with specialized security subsystems in both hardware and software.
Embedded systems and IoT devices¶
Many of the features described above are present in one form or another in most general‑purpose processor architectures. This is not necessarily the case in IoT devices or embedded systems more generally, which typically use customized operating systems. Simple microcontrollers usually lack MMUs, and sometimes even MPUs, protection rings, or other advanced features common in general‑purpose operating systems. These systems are usually small, which reduces their attack surface, and their applications are trusted (and possibly verified). Nevertheless, the embedded nature of such devices makes their security difficult to inspect or even test. Wherever such devices play a role in security‑related functions, isolation and mediation‑based security should be externally monitored by the surrounding environment. Broader IoT challenges are discussed in the context of cyber‑physical systems.