Prevention of Microarchitectural Covert Channels on an Open-Source 64-bit RISC-V Core

Nils Wistoff
ETH Zurich and RWTH Aachen
and HENSOLDT Cyber GmbH
Zurich, Switzerland
nwistoff@ethz.ch

Frank K. Gürkaynak
ETH Zurich
Zurich, Switzerland
kgf@iis.ee.ethz.ch

Luca Benini
ETH Zurich
Zurich, Switzerland
lbenini@iis.ee.ethz.ch

Gernot Heiser
UNSW Sydney and Data61 CSIRO
Sydney, Australia
gernot@unsw.edu.au

ABSTRACT
Covert channels enable information leakage across security boundaries of the operating system. Microarchitectural covert channels exploit changes in execution timing resulting from competing access to limited hardware resources. We use the recent experimental support for time protection, aimed at preventing covert channels, in the seL4 microkernel and evaluate the efficacy of the mechanisms against five known channels on Ariane, an open-source 64-bit application-class RISC-V core. We confirm that without hardware support, these defences are expensive and incomplete. We show that the addition of a single-instruction extension to the RISC-V ISA, that flushes microarchitectural state, can enable the OS to close all five evaluated covert channels with low increase in context switch costs and negligible hardware overhead. We conclude that such a mechanism is essential for security.

CCS CONCEPTS
• Security and privacy → Hardware security implementation; Operating systems security; • Computer systems organization → Reduced instruction set computing.

KEYWORDS
covert channels, timing channels, computer architecture, microarchitecture, operating systems, system security, time protection

1 INTRODUCTION
A covert channel is an information flow that uses a mechanism not intended for information transfer [20], and thereby violates a system’s security policy that the OS is meant to enforce. For example, some untrusted code, such as a mail client, may be given access to secrets but should be confined to only communicate with the outside world via an encrypted channel. A covert channel can enable the mailer to leak the raw secrets, bypassing encryption.

Covert channels that utilise OS-managed spatial resources (storage channels) can be eliminated completely, as was proved for the seL4 microkernel [23]. Harder to control are channels that target physical quantities not directly managed by the OS, such as processor temperature [22] or power draw [17]. Particularly dangerous are timing channels, which exploit information encoded in the timing of events, as they can be exploited remotely.

2 BACKGROUND
2.1 Threat Model
We examine covert-channel leakage under a confinement scenario [20]: An untrusted program possesses a secret, and the OS encapsulates the program’s execution in a security domain that only
A second, unconfined, and also untrusted security domain contains (This is a somewhat simplified description— in general it is necessary to randomise the access order to prevent interference from prefetching, but that is not an issue on our processor.) With a correctly-sized priming buffer, this leaves the hardware resource in a state where further accesses by the spy within the same address range are fast, as illustrated on the left of Figure 1.

At the end of its time slice, the OS preempts the spy and switches to the Trojan, which accesses a subset of the hardware resource to encode the secret. Given a D-cache of \( n \) lines, the Trojan can transmit a secret \( s \leq n \), the input signal, by touching \( s \) cache lines, thereby replacing the spy’s content. The resulting state is illustrated on the right of Figure 1. Obviously, more complex encodings are possible to increase the amount of data transferred in a time slice (the channel capacity), but for our purposes, the simple encoding is sufficient, as we want to prevent any leakage.

When execution switches back to the spy, it again traverses (probes) the whole buffer, observing its execution time. Each entry replaced by the Trojan’s execution leads to a cache miss, and results in an increase in probe time. If the latency of a hit is \( t_{\text{hit}} \) and that of a miss is \( t_{\text{miss}} \), the total latency increase is \( s \cdot (t_{\text{miss}} - t_{\text{hit}}) \). For our simple encoding scheme, the output signal is the total probe time, which is directly correlated to the input signal. A more sophisticated encoding scheme would have to measure the time of each individual access and perform a more complex analysis.

### 2.4 Time Protection

Time protection is a recently proposed, principled approach to eliminating timing channels [9]. While the established notion of memory protection prevents interference between security domains through unauthorised memory accesses, time protection aims to prevent interference that affects observable timing behaviour.

Time protection requires that all shared hardware resources, including non-architected ones, must be partitioned between security domains, either temporally (secure time multiplexing) or spatially. Ge et al. show that (physically-addressed) off-core caches can be effectively partitioned through cache colouring [16], which leverages the associative cache lookup to force different partitions into disjoint subsets of the cache. They demonstrate that colouring is effective in preventing cache channels in both intra-core and cross-core attacks and comes with low overhead.

Spatial partitioning is generally impossible for on-core resources for lack of hardware support. These are usually also fairly small and highly utilised by a single program, so partitioning would result in unacceptable performance degradation. Furthermore, on-core resources are accessed by virtual address, which is not under OS control, making approaches such as colouring infeasible.

This leaves temporal partitioning for on-core resources. In order to prevent any interference between security domains, each such resource must be reset to a state that is independent of execution history before handing it to a different domain. This means that the OS must be provided with the means to perform such a reset of all microarchitectural state, creating the requirement of extending the hardware-software contract to refer (in a highly abstract way) to such non-architected state [11]. The authors specifically show that contemporary Intel and Arm processors lack the mechanisms required for implementing time protection.

### 2.5 Proposed Temporal Fence

We introduce such a mechanism in the form of a temporal fence instruction, fence t, which isolates the timing of any subsequent execution from what happened before. Our fence instruction specifically applies to on-core state only, as off-core state can be spatially partitioned. We realise, however, that this definition is less abstract than one...

---

#### Figure 1: A cache before and after the Trojan encodes the secret \( s \).

![Cache Diagram](image-url)
We run each attack for a number of iterations, the visual representation of leakage, and the mutual information for quantifying channel capacity we define as the output value its probe latency, a randomly chosen secret, and knowledge of the second random variable. If both random variables are highly correlated (i.e., there exists a covert channel), the information gained by observing S is low and the mutual information becomes high. Conversely, if both random variables are uncorrelated, we have $H(T)|S) = 0$.

Zero Leakage Upper Bound $M_0$. Since all measurements are affected by noise, $M$ will never be zero, even if there is no channel. We use a Monte Carlo simulation for estimating the apparent channel produced by this noise. Specifically, we rearrange the input/output pairs into uniformly random pairs and thus remove any correlation between them, while retaining their original value ranges and spreads. Any mutual information that is measured from this data can only be due to noise. We repeat this process 1000 times and then compute the 95%-confidence interval $M_0$ for an experiment without a channel. We conclude that a channel is definitely present if $M > M_0$, else that the result is consistent with no channel.

We use the leakiEst tool [3] to compute mutual information $M$ and zero leakage upper bounds $M_0$.

### 3.2 Evaluation Platform

#### 3.2.1 Ariane. The hardware platform for evaluating channels and defences is based on Ariane, an open-source, RV64GC, 6-stage RISC-V core developed at ETH Zurich [33]. It is implemented in SystemVerilog and publicly available on GitHub [32]. It features three privilege levels and virtual memory (SV39) from the privileged ISA specification [30], and thus supports full-fledged operating systems. Its configurability, simplicity, and openness make it a good candidate for architectural exploration.

Setup. We instantiate the Ariane core on an FPGA (Digilent Genesys II), running at 50 MHz, using the standard configuration with an 8-way, 32 KiB write-through L1-D and a 4-way, 16 KiB L1-I cache. Both use 16-byte lines and a pseudo-random replacement strategy driven by an 8-bit linear-feedback shift register (LFSR). The L1-D is accessed by the load-, store-, and memory-management units, with concurrent accesses arbitrated round-robin. The branch predictor has a 64-entry BHT and a 16-entry BTB. There is a single-level, unified, fully associative, 16-entry TLB using a pseudo-LRU replacement policy. For reducing write-stalls we increase the write buffer to 40 entries. We add some off-core components, including a timer and a 512 KiB write-back L2 cache [26] that is connected to DRAM. Figure 5 shows the memory architecture.
4 COVERT-CHANNEL CAPACITIES

4.1 Baseline: No Time Protection

To establish a baseline and compare to other architectures, we apply the P&P attacks to our Ariane RV64GC core, with time protection disabled in sEL4. We observe strong channels through each of the five microarchitectural resources targeted. As shown in the Unmitigated column of Table 1, capacities range from 0.4 to 4 bits. The \( M_0 \) are all well below 1 mb, indicating that the channels are real. To put those numbers into context: Assuming the OS uses a 1 ms time slice, Trojan and spy will each execute 500 times per second. The 1.6 b capacity of the D-cache thus means the channel has a bandwidth of 833 b/s, able to leak a 1024-bit RSA key in just over a second. These channels use a rather primitive encoding scheme; more sophistication could increase the bandwidth significantly.

The channel capacities we observe agree nicely with the prior work, which showed unmitigated capacities of 0.3–4 b on Intel and 7.5 mb to 2.5 b on Arm processors [9].

Figure 6a and Figure 7a show the unmitigated channel matrices for the L1-D cache and the BHT, respectively; \( N \) is the number of iterations. The clear diagonal pattern indicates a strong correlation of output with input signals, establishing efficient channels. For example, Figure 7a shows that if the spy observes a probe time of 380 cycles, it can infer with a high confidence that the Trojan has encoded the value 48.

4.2 Using Existing Instructions Only

Ge et al. report that neither the x86 nor the Arm architectures provide sufficient mechanisms for implementing time protection, with Arm coming closer in at least providing targeted L1 cache flushes. The Intel architecture does not have those, and the authors implemented software flushing by touching all cache lines, similar to the prime phase of the P&P attack. Such an approach is expensive and obviously brittle, as it must make assumptions on the replacement policy which may not hold in reality. Unsurprisingly, they find that this defense is incomplete, leaving residual channels that the OS is unable to close.

With RISC-V, the situation is presently worse, as specification of cache management is still under discussion. While implementations generally support some cache management, this is consequently not standardised. To explore this aspect, we implement a “software only” defense, where the OS uses only mechanisms defined in the ISA as presently specified. This basically forces the OS to resort to the priming approach in an attempt to erase any microarchitectural state left by the Trojan’s execution.

Figure 6b shows the result for the L1-D cache channel, where the OS performs two priming runs per context switch. While fuzzier than in the unmitigated case, a clear diagonal pattern persists, and the measured capacity is only reduced by 70%, making this defense highly ineffective, a result that is much worse than Ge et al. observed for Intel. The reasons are for one the Ariane’s replacement policy, which uses a pseudo-random sequence with a period of 256. This makes it practically impossible to flush the cache reliably through priming. Furthermore, there is secondary state that is even harder to reset, as we will find in Section 4.3.2.

4.3 Using a Temporal Fence

As discussed in Section 2.5 we add a new fence. t. instruction to the Ariane. When fence. t. is committed, Ariane’s controller sends a flush signal to all stateful microarchitectural components. To give the operating system maximum control, an immediate value selects the components to be flushed.

Table 1: Mutual information and corresponding zero leakage upper bound in millibits.

<table>
<thead>
<tr>
<th></th>
<th>( M )</th>
<th>( M_0 )</th>
<th>( M )</th>
<th>( M_0 )</th>
<th>( M )</th>
<th>( M_0 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>L1 D$</td>
<td>1667.3</td>
<td>0.5</td>
<td>10.4</td>
<td>6.0</td>
<td>33.3</td>
<td>39.1</td>
</tr>
<tr>
<td>L1 I$</td>
<td>1905.0</td>
<td>0.5</td>
<td>8.3</td>
<td>4.9</td>
<td>37.9</td>
<td>39.4</td>
</tr>
<tr>
<td>TLB</td>
<td>408.7</td>
<td>0.1</td>
<td>5.0</td>
<td>5.9</td>
<td>3.1</td>
<td>7.7</td>
</tr>
<tr>
<td>BTB</td>
<td>3211.4</td>
<td>0.1</td>
<td>35.7</td>
<td>59.3</td>
<td>28.2</td>
<td>60.3</td>
</tr>
<tr>
<td>BHT</td>
<td>3770.6</td>
<td>0.2</td>
<td>45.2</td>
<td>58.8</td>
<td>44.1</td>
<td>60.8</td>
</tr>
</tbody>
</table>

With RISC-V, the situation is presently worse, as specification of cache management is still under discussion. While implementations generally support some cache management, this is consequently not standardised. To explore this aspect, we implement a “software only” defense, where the OS uses only mechanisms defined in the ISA as presently specified. This basically forces the OS to resort to the priming approach in an attempt to erase any microarchitectural state left by the Trojan’s execution.
5 COST

5.1 Context-Switch Latency

Time protection resets hardware state on a switch of security partition, which implies a full context switch. seL4’s IPC is essentially a user-triggered context switch [12] with roughly the same cost as a time-slice preemption, and the seL4 benchmark suite [21] provides a convenient rig for measuring its latency. We use inter-address-space IPC for evaluating flush cost. Table 2 compares the latencies of various configurations. Here “hot” measures the best-case of switching for and back in a tight loop, where the whole working set fits into the L1 caches. The cold-cache scenario is the realistic baseline for our purposes, as a security-domain switch is normally set fits into the L1 caches. The cold-cache scenario is the realistic baseline for our purposes, as a security-domain switch is normally

We include these components in the flush and re-run the experiments, the results are shown as “Final fence. t” in Table 1. All measured channel capacities are now clearly below the zero-capacity threshold, meaning that there is no evidence of a residual channel. The channel matrices confirm this: Figure 6d and Figure 7b only show patterns that appear to be random noise (confirmed by visual comparison between multiple runs).

Figure 6: L1 data cache channel matrices.

(a) Unmitigated. \( N = 10^6, M = 1667.3\text{mb}, M_0 = 0.5\text{mb} \).

(b) Attempted reset by priming twice. \( N = 10^6, M = 515.7\text{mb}, M_0 = 1.1\text{mb} \).

(c) Original fence clearing first-order state. \( N = 10^6, M = 10.4\text{mb}, M_0 = 6.0\text{mb} \).

(d) Improved fence clearing first- and second-order state. \( N = 10^6, M = 33.3\text{mb}, M_0 = 39.1\text{mb} \).

Figure 7: BHT channel matrices.

(a) Unmitigated. \( N = 10^6, M = 3770.6\text{mb}, M_0 = 0.2\text{mb} \).

(b) Improved fence. \( N = 10^6, M = 44.1\text{mb}, M_0 = 60.8\text{mb} \).

• the LFSR used for victim selection in L1 cache replacement
• the round robin arbiter of the L1 data cache
• the pseudo-LRU tree for the TLB replacement strategy.

...
Table 2: sel4 IPC latencies and standard deviations in cycles.

<table>
<thead>
<tr>
<th></th>
<th>Unmitigated</th>
<th>Mitigated</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hot</td>
<td>430 (±7.0)</td>
<td>51,877 (±256)</td>
</tr>
<tr>
<td>Cold</td>
<td>1,180 (±1.0)</td>
<td>1,502 (±0.9)</td>
</tr>
<tr>
<td>D-cache prime</td>
<td></td>
<td></td>
</tr>
<tr>
<td>fence. t</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

other than the L1-D. In contrast, the temporal fence, which we have found to be highly effective against all channels, only adds 320 cycles (less than 30%) to the cold-cache latency. With a switch rate of no more than 1 kHz, this adds negligible cost.

The dominating contribution to the direct latency of the fence. t instruction is the cache flush. A write-through cache is flushed by clearing all valid bits. This is a constant-time operation, which could in theory be performed in a single cycle. However, in Ariane’s write-through cache, the valid bits are stored together with the tags in sequentially accessible SRAM, allowing for an invalidation of only one set per cycle, and thus resulting in a latency of 256 cycles. All other state can be reset in a single cycle.

A write-back L1-D cache would be more expensive to flush, as each dirty line must we written back to the L2. Since the L2 of our platform can process up to 8 B per cycle, the theoretical latency for a write-back variant varies between 0 cycles (clean cache) and 4,096 cycles (all lines dirty). In such a case of a variable latency, the OS must pad to the worst-case latency, to prevent the flush latency becoming a covert channel [11].

5.2 Hardware Overhead

To estimate the hardware costs incurred by the fence. t instruction, we examine the resource utilization of our FPGA. The number of deployed LUTs remains within 1% of the original size. Hence, the mechanism should not cause a notable increase in chip area or power draw of the design.

6 RELATED WORK

Past work has approached the hardware mitigation of microarchitectural covert channels from different angles. Page [24] propose static partitioning of the L1 cache while Wang and Lee [28] propose locking cache lines. While spatial partitioning can certainly prevent attacks, in the case of the L1, the reduction of available cache space would have a major impact on application performance. Wang and Lee [29] instead aim to defeat attacks by dynamically remapping cache lines.

A hardware feature that aims to detect an ongoing cache-based covert channel attack is proposed by Chen and Venkataramani [2]. Fang et al. present a method to scramble information transmitted over such a channel by leveraging cache prefetchers [5, 6]. This is not applicable to Ariane, which lacks prefetching.

Fadieh et al. [4] suggest a formal method for detecting vulnerable microarchitectural components within the HW design. While such an approach could prove crucial for the systematic uncovering of microarchitectural covert channels, the question of their mitigation remains open.

Our work extends that of Ge et al., who propose time protection and the need for flushing all microarchitectural on-core state on a partition switch, and demonstrate the need for hardware support [8, 9, 11], which is what our temporal fence provides.

7 CONCLUSION

On a simple in-order application-class RISC-V processor we evaluate microarchitectural covert timing channels, previously demonstrated on x86 and Arm processors, and find that they exist with similar capacities on the RISC-V core. We confirm the finding of Ge et al. [10] that existing architectured mechanisms are insufficient for preventing those channels. Answering their request for improved hardware support that will enable a principled prevention of such channels, we propose a temporal fence instruction, fence. t.

An implementation of fence. t on our RISC-V core shows that the naive approach of just clearing all valid bits on cache lines is insufficient. Instead we find that secondary state, in our case the state machine controlling cache-line replacement, can also be exploited as a covert channel, and must be reset as well. We then demonstrate that a complete state flush is successful in eliminating all channels to well below measurement accuracy. We also find that while the (largely ineffective) attempts to close channels with existing instructions are extremely costly, the overhead of fence. t is very low, about 320 cycles on our core, which is insignificant at typical partition-switch rates of 1 kHz or lower. We similarly find that the area and power overheads of fence. t are insignificant.

Our findings show that the mechanisms requested by OS researchers for principled timing-channel prevention are feasible and low cost, and there seems to be no good reason not to include them into the architecture. This confirms that security should be seen as a hardware-software codesign problem, where OS researchers and architects must collaborate closely.

We hope our findings will support current work that aims at provably eliminating microarchitectural timing channels [13].

ACKNOWLEDGMENTS

We thank Qian Ge and Curtis Millar for their support with the covert channel measurement framework, Wolfgang Rönninger for providing us with an L2 cache, and Florian Zaruba for his help and insights on Ariane. The support of HENSOLDT Cyber and the IDEA League for Wistoff’s work at ETH is gratefully acknowledged. Heiser’s work was supported by Australian Research Council (ARC) grant DP190103743 and the US Asian Office of Aerospace Research and Development (AOARD).

REFERENCES
