Embedding Weak Memory Models within Eager Sequentialization

Ermenegildo Tomasco¹, Truc L. Nguyen¹, Bernd Fischer², Salvatore La Torre³, and Gennaro Parlato¹

¹ Electronics and Computer Science, University of Southampton, UK
² Division of Computer Science, Stellenbosch University, South Africa
³ Dipartimento di Informatica, Università di Salerno, Italy
{et1m11,tnl2g10,gennaro}@ecs.soton.ac.uk, bfischer@cs.sun.ac.za, slatorre@unisa.it

Abstract. Sequentialization is one of the most promising approaches for the symbolic analysis of concurrent programs. However, existing sequentializations assume sequential consistency, which modern hardware architectures no longer guarantee. In this paper we describe an approach to embed weak memory models within eager sequentializations (à la Lal/Reps). Our approach is based on the separation of intra-thread computations from inter-thread communications by means of a shared memory abstraction (SMA). We give details of SMA implementations for the SC, TSO, and PSO memory models that are based on the idea of individual memory unwindings, and sketch an extension to the Power memory model. We use our approach to implement a new, efficient BMC-based bug finding tool for multi-threaded C programs under SC, TSO, or PSO based on these SMAs, and show experimentally that it is competitive with existing tools.

1 Introduction

Developing correct concurrent programs is a complex and difficult task, due to the large number of possible concurrent executions that must be considered. Modern multi-core hardware architectures with weak memory models (WMMs) have made this task even harder, because they introduce additional executions that can lead to seemingly counter-intuitive results that confound the developers’ reasoning.

Testing remains the most widely used approach to finding bugs; however, it is ineffective for bugs that manifest themselves only rarely and are difficult to reproduce [26]. Such “Heisenbugs” are unfortunately more prevalent with WMMs. Static verification approaches that handle individual executions explicitly face the same state space explosion as testing, even with optimizations that eliminate redundant executions. We thus need approaches that can handle multiple concurrent executions symbolically.

However, building efficient symbolic verification tools for realistic programming languages like C is hard and extending them for concurrency is harder yet. Tools thus often fold the concurrency handling deep into their general verification approaches (see [1,3,6,8,30,31]), focussing on a specific memory model, typically sequential consistency (SC). This introduces a strong coupling between the two aspects, which makes it hard to reuse existing tools and to generalize solutions to other memory models.
Our goal here is to break this coupling and to separate the computation (i.e., individual threads) and the communication (i.e., shared memory) concerns of concurrent programs, without losing the efficiency of existing approaches.

More specifically, we develop an approach to combine eager sequentializations with different memory models in the style of a plug-and-play architecture. For this, we define and describe an interface that we call shared memory abstraction (SMA). The SMA captures the standard concurrency operations in multi-threaded programs such as shared memory reads, writes, and allocations, thread creation and termination, and synchronization operations such as thread join and mutex locking and unlocking. We then assume that all operations involving concurrency are performed by invoking the corresponding SMA operations (which can easily be achieved by rewriting non-conforming programs). In this way, we achieve the desired separation of concerns—in fact, we can even view a multi-threaded program as the composition of two independent sub-systems, one comprising all threads and one capturing the concurrency (including the memory model), which synchronize using the SMA.

As a first contribution we introduce the concepts of thread-wise equivalence and thread-asynchronous closure of transition systems. We show that reachability is preserved if we exchange the transition system of a program for a thread-wise equivalent one (assuming the SMA is thread-asynchronous) or an SMA for its thread-asynchronous closure. This has two important consequences. First, it allows us to extend existing concurrent verification algorithms to different memory models simply by implementing the corresponding different SMAs. Second, it gives us a degree of freedom in designing concurrent verification algorithms, since it allows us to rearrange the order in which the verifier explores the execution of the statements among different threads. This is implicitly exploited by some algorithms from the literature (e.g., [20,16,27]). All these algorithms can be recast in our setting and thus be extended to WMMs.

However, the way the computation and communication concerns are combined affects the scalability of the resulting verification tool. As a second contribution, we thus instantiate our general approach to achieve an efficient BMC-based bug-finding tool. We give efficient SMA implementations for SC, total store ordering (TSO), and partial store ordering (PSO) that are based on the idea of individual memory-location unwindings, and we show through experiments that our tool compares well with existing tools. We finally discuss how to extend this to other relaxed memory models such as POWER.

2 Weak Memory Models

A shared memory is a sequence of memory locations of fixed size. The content of each location can be read or written using an explicit memory operation. The semantics of read and write operations depend upon the adopted memory model. Besides SC, we also consider TSO and PSO, which are implemented in modern computer architectures.

Sequential consistency (SC). SC is the “standard model”, where a write into the shared memory is performed directly on the memory location. This has the effect that the newly written value is instantaneously visible to all the other threads [21].

Total store ordering (TSO). The behaviour of the TSO memory model can be described using a simplified architecture with explicit store buffers [22]. Each thread $t$ is equipped with a local store buffer that is used to cache the write operations performed by $t$ according to a FIFO policy. Updates to the shared memory occur nondeterministically along the computation, by selecting a thread, removing the oldest write operation from its store buffer, and then updating the shared memory valuation accordingly. Before updating, the effect of a cached write is visible only to the thread that has performed it. A read by $t$ of a variable $y$ retrieves the value from the shared memory unless there is a cached write to $y$ pending in its store buffer; in that case, the value of the most recent write in $t$’s store buffer is returned. A thread can also execute a fence-operation to block its execution until its store buffer has been emptied.
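The store-buffer description above can be made concrete with a small executable model. The following is only a sketch under stated assumptions: the type `tso_state`, the functions `tso_write`, `tso_read`, `tso_flush_one` and `tso_fence`, and the bounds `NLOC`, `NTHR` and `BUFCAP` are hypothetical names chosen for illustration, and the fence is modelled by draining the buffer rather than by blocking the thread.

```c
#include <assert.h>

#define NLOC   4   /* number of shared locations (assumed bound) */
#define NTHR   2   /* number of threads (assumed bound) */
#define BUFCAP 8   /* store-buffer capacity (assumed bound) */

typedef struct { unsigned loc; int val; } pending_write;

typedef struct {
    int mem[NLOC];                    /* shared memory */
    pending_write buf[NTHR][BUFCAP];  /* one FIFO store buffer per thread */
    unsigned head[NTHR], len[NTHR];   /* ring-buffer bookkeeping */
} tso_state;

void tso_init(tso_state *s) {
    for (unsigned v = 0; v < NLOC; v++) s->mem[v] = 0;
    for (unsigned t = 0; t < NTHR; t++) s->head[t] = s->len[t] = 0;
}

/* A write is only appended to the writer's own buffer. */
void tso_write(tso_state *s, unsigned v, int val, unsigned t) {
    assert(v < NLOC && t < NTHR && s->len[t] < BUFCAP);
    unsigned tail = (s->head[t] + s->len[t]) % BUFCAP;
    s->buf[t][tail].loc = v;
    s->buf[t][tail].val = val;
    s->len[t]++;
}

/* A read returns t's most recent buffered write to v, else shared memory. */
int tso_read(const tso_state *s, unsigned v, unsigned t) {
    assert(v < NLOC && t < NTHR);
    for (unsigned k = s->len[t]; k > 0; k--) {
        const pending_write *w = &s->buf[t][(s->head[t] + k - 1) % BUFCAP];
        if (w->loc == v) return w->val;
    }
    return s->mem[v];
}

/* One memory update: commit the oldest buffered write of t, if any. */
int tso_flush_one(tso_state *s, unsigned t) {
    assert(t < NTHR);
    if (s->len[t] == 0) return 0;
    pending_write w = s->buf[t][s->head[t]];
    s->mem[w.loc] = w.val;
    s->head[t] = (s->head[t] + 1) % BUFCAP;
    s->len[t]--;
    return 1;
}

/* A fence is modelled here by draining the whole buffer of t. */
void tso_fence(tso_state *s, unsigned t) {
    while (tso_flush_one(s, t)) { }
}
```

With this model, the classic store-buffering pattern is observable: after thread 0 writes `x` and thread 1 writes `y`, each thread can still read the old value of the other variable until the corresponding buffer is flushed.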

Partial store ordering (PSO). The semantics of PSO is the same as for TSO except that each thread is endowed with a store buffer for each shared memory location.

3 Multi-Threaded Programs over Shared Memory Abstractions

In this paper, we consider multi-threaded programs with a C-like syntax (see Fig. 3 in the Appendix) including pointer arithmetic and dynamic memory allocation. We further consider POSIX-like threads with dynamic thread creation, thread join, and mutex locking and unlocking operations for thread synchronization, but no thread communication primitives: threads communicate only via the shared memory. We also assume a fence-statement that commits all pending write operations of a thread into the shared memory; for TSO and PSO this means it flushes all store buffers of a thread.

3.1 Shared memory abstractions

The semantics of multi-threaded programs ultimately depends on the underlying memory model. In order to combine existing concurrent verification techniques with different memory models we define a “concurrency interface” or shared memory abstraction (SMA) that abstracts away the shared memory operations in the syntax of multi-threaded programs. The intended meaning of the SMA’s functions is standard; note that most functions carry the calling thread $t$ as an extra argument to allow the SMA to update its internal state. In detail, the SMA API is formed of the following functions:

- init() initializes the SMA and the shared variables; this must be the first statement in the program;
- terminate($t$) ends the execution of $t$; each thread must explicitly call it;
- address($v$, $t$) returns the memory address of the shared variable $v$;
- malloc($n$, $t$) allocates a contiguous block of $n$ memory locations and returns the base address of the block;
- read($v$, $t$) (resp. ind_read($a$, $t$)) returns the valuation of the shared variable $v$ (resp. memory location with address $a$) as seen by $t$;
- write($v$, val, $t$) (resp. ind_write($a$, val, $t$)) sets the valuation of the shared variable $v$ (resp. memory location with address $a$) to the value $val$;
- fence($t$) commits all pending write operations of $t$ into the shared memory;
- lock($m$, $t$) and unlock($m$, $t$) are the standard thread synchronization primitives that acquire and release a mutex $m$ for $t$; if $m$ is currently acquired, the lock operation is blocking for $t$, i.e., $t$ is suspended until $m$ is released and then acquired;
- create($f$, $t$) spawns a new thread that starts from function $f$, and returns a fresh thread identifier for this thread;
- join($t'$, $t$) pauses the execution of $t$ until $t'$ has terminated its execution.
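To illustrate what rewriting a program against this interface might look like, the sketch below backs a fragment of the API with a trivial SC store and rewrites a two-statement thread body. The `sma_` prefix, the bound `SMA_NVARS`, the variable indices `X` and `Y`, and the function `thread_body` are assumptions made for this example only; the paper's functions are named `init`, `read`, `write` and `fence`, and its actual implementations are the IMU-based ones described later.

```c
#include <assert.h>

#define SMA_NVARS 4   /* number of shared variables (assumed bound) */

static int sma_mem[SMA_NVARS];   /* SC: a single global store, no buffers */

void sma_init(void) {
    for (unsigned v = 0; v < SMA_NVARS; v++) sma_mem[v] = 0;
}

/* Under SC every thread sees the same valuation, so t is unused here. */
int sma_read(unsigned v, unsigned t) {
    (void)t;
    assert(v < SMA_NVARS);
    return sma_mem[v];
}

void sma_write(unsigned v, int val, unsigned t) {
    (void)t;
    assert(v < SMA_NVARS);
    sma_mem[v] = val;
}

/* Nothing is ever pending under SC, so a fence has no effect. */
void sma_fence(unsigned t) { (void)t; }

enum { X = 0, Y = 1 };   /* indices of two shared variables */

/* Original thread body:   x = y + 1;   y = x * 2;
   Rewritten so that every shared access goes through the SMA. */
void thread_body(unsigned t) {
    sma_write(X, sma_read(Y, t) + 1, t);
    sma_write(Y, sma_read(X, t) * 2, t);
}
```

Because the thread body touches shared state only through these calls, swapping in a different memory model would amount to relinking against a different implementation of the same functions.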

3.2 Multi-threaded programs as composition of transition systems

The formal semantics of multi-threaded programs is often given by a transition system (see Appendix for the formal definitions) that captures the program computations by interleaving the computations of each thread. Analogously to previous work (e.g., [25]), we exploit the separation between the control flow and the shared memory aspects introduced with the notion of SMA, and give the semantics of a multi-threaded program as the composition \( C|M \) of the control-flow transition system \( C \) that captures the control flow of the program and the shared memory abstraction transition system \( M \) that implements the behaviours of the SMA. This allows us to keep the semantics of the sequential part and re-interpret it in different ways with different WMMs; it also aligns nicely with different SMA implementations.

These two transition systems are synchronized over the SMA API that defines an alphabet that labels the transitions of \( C \) and \( M \). More precisely, this alphabet \( \Sigma_{SMA} \) consists of the calls to the SMA API functions that do not return values, and the calls augmented with a parameter denoting the returned value for the others. For example, read(3, v, t) is the letter corresponding to a call read(v, t) that returns value 3.

**Control-flow transition system.** The states of the control-flow transition system \( C \) are the tuples of thread configurations. A thread configuration consists of a program counter, an evaluation of the thread-local variables and a call stack, as usual. \( C \) has a unique initial state that corresponds to the empty configuration (i.e., no threads are active in the beginning).

The transitions correspond to the execution of any of the statements. Transitions corresponding to invocations of SMA API functions are labeled with the corresponding letter from \( \Sigma_{SMA} \). In particular, transitions from the initial state are labeled with init() and enter a state with the starting configuration of the main thread. No other transitions are labeled with init(). Transitions corresponding to SMA functions that return a value are handled as assignments of the corresponding variables with the returned values. Additionally, on a thread creation the tuple of thread configurations is augmented with the starting configuration of the newly created thread. Similarly, the effect of a transition on terminate(t) is to delete the configuration of the terminated thread. The remaining transitions labeled with \( \Sigma_{SMA} \) letters just update the program counter. Transitions corresponding to all other (i.e., sequential) statements are labeled with the empty word \( \varepsilon \) and update the configuration of the issuing thread as usual.

**Shared memory abstraction transition system.** In general, an SMA transition system \( M \) has an initial state and a state for each possible configuration of the corresponding memory model. The transitions update memory configurations to capture the memory model’s intended meaning. Note that from the initial state there are only outgoing transitions, which are all labeled with init(), and no other transitions have this label.

For SC, the system \( M_{sc} \) can enter from the initial state any state that has only one thread (which must be active), has any number of shared locations (which must all have the value of zero), and has any number of mutexes (which must all be unlocked). All other transitions update the state of \( M_{sc} \) according to the meaning given in Section 2.

Note that in SC there are no fence-transitions. Further, in a transition on terminate(t), $M_{sc}$ enters a state where the status of t is terminated. From any such state only states where the t status remains terminated can be reached, and no other transitions corresponding to invocations of API functions from t are allowed. The final states are all states where all threads are terminated.

For the WMMs we denote the corresponding SMA transition system with $M_{tso}$ and $M_{pso}$, respectively. The states of both systems also account for the content of the thread store buffers, the transitions on reads and writes reflect the corresponding semantics as described in Section 2, and there are fence(t)-transitions on calls to fence by t and $\varepsilon$-transitions for store buffer updates.

4 Verification with thread-asynchronous SMAs

The basis of our approach is the separation between the intra-thread control flows and the SMA discussed in Section 3. Conceptually, a verification tool is thus composed of an SMA implementation and a search algorithm that explores the program executions. This by itself allows for a convenient way to extend verification methods to other memory models by simply replacing the SMA implementation. However, this might not result in scalable verification tools: to preserve the correct semantics of the memory operations, these must be invoked in the same order as they appear along the run. This may be a bottleneck when we explore the state space of the program, both for analyses based on summaries (e.g., BDD-based model checking) and for bounded model checking.

In the former, we must keep a cross-product of the states of all threads in the configurations, which leads to state-space explosion. In the latter, since context-switches can happen at any point, we must encode into the SAT/SMT formula the code of all threads for each of the context-switch points in the underlying bounded multi-threaded program, which leads to large formulas.

Some approaches from the literature instead explore the program executions by rearranging the order in which the memory operations of the different threads are executed, e.g., by simulating each thread to completion [16,20]. Another example is the sequentialization presented in [27] where each thread is executed in isolation with respect to a memory unwinding (i.e., a sequence of writes that is guessed at the beginning).

We generalize the ad-hoc approaches above (see the Appendix for their re-formulation in our setting), and present a general framework in which to design concurrent program verification approaches. This requires that the SMA implementation used is thread-asynchronous, that is, its behaviours are insensitive to how the threads are interleaved. This allows us to freely transform the threads as long as we stay within the class of thread-wise equivalent programs, that is, programs where the intra-thread ordering of the statements remains the same.

For a thread $t$, we denote with $\Sigma_{t}^{SMA}$ the maximal subset of $\Sigma_{SMA}$ containing only letters that are issued by $t$. Clearly, for threads $t$ and $t'$ with $t \neq t'$, $\Sigma_{t}^{SMA}$ and $\Sigma_{t'}^{SMA}$ are disjoint. For a thread $t$ and a word $\alpha$ over $\Sigma_{SMA}$, let $\alpha_{t}$ be the projection of $\alpha$ onto $\Sigma_{t}^{SMA}$, i.e., the word obtained from $\alpha$ by deleting all the letters that do not belong to $\Sigma_{t}^{SMA}$. If $t_1, \ldots, t_h$ are all the threads that issue at least one letter in $\alpha$, we define $\pi(\alpha)$ as the map $\pi(\alpha)(t_i) = \alpha_{t_i}$ for $i \in [1, h]$.

A language \( L \) of words over \( \Sigma_{SMA} \) is thread-asynchronous if for each \( \alpha \in L \) and for each \( \alpha' \) starting with init() s.t. \( \pi(\alpha) = \pi(\alpha') \), also \( \alpha' \in L \). The thread-asynchronous closure of a language \( L \), denoted by \( L^\# \), is the smallest thread-asynchronous language such that \( L \subseteq L^\# \).

Let \( A_1 \) and \( A_2 \) be two transition systems over the alphabet \( \Sigma_{SMA} \). We say that \( A_1 \) and \( A_2 \) are thread-wise equivalent if for each word \( \alpha \) accepted by one of them there is a word \( \alpha' \) accepted by the other one such that \( \pi(\alpha) = \pi(\alpha') \).
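The condition \( \pi(\alpha) = \pi(\alpha') \) used in these definitions can be checked mechanically. The sketch below is an illustration only: the `letter` encoding (a thread identifier plus an integer `op` standing for the call, its arguments and its returned value) and the function name `same_projection` are assumptions of this example, not part of the formal development.

```c
#include <assert.h>
#include <stddef.h>

/* One letter of a word over the SMA alphabet, in a simplified encoding:
   `thread` is the issuing thread, `op` stands for call, arguments and
   returned value together. */
typedef struct { unsigned thread; int op; } letter;

/* Returns 1 iff both words have the same projection onto every thread,
   i.e. they differ at most in how the threads are interleaved.
   Precondition: all thread identifiers are below nthreads. */
int same_projection(const letter *a, size_t na,
                    const letter *b, size_t nb, unsigned nthreads) {
    for (size_t i = 0; i < na; i++) assert(a[i].thread < nthreads);
    for (size_t j = 0; j < nb; j++) assert(b[j].thread < nthreads);
    if (na != nb) return 0;
    for (unsigned t = 0; t < nthreads; t++) {
        size_t i = 0, j = 0;
        for (;;) {
            /* skip letters issued by other threads */
            while (i < na && a[i].thread != t) i++;
            while (j < nb && b[j].thread != t) j++;
            if (i == na || j == nb) {
                if (i != na || j != nb) return 0;  /* one projection is longer */
                break;
            }
            if (a[i].op != b[j].op) return 0;      /* projections diverge */
            i++;
            j++;
        }
    }
    return 1;
}
```

Two words that pass this check are indistinguishable for a thread-asynchronous language: if one belongs to it (and the other starts with init()), so does the other.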

A standard analysis for multi-threaded programs is to search for the reachability of an error program counter of a given thread (local error state), often denoted by an error label or a false-assertion. In the following, we give two theorems stating sufficient conditions under which the reachability (in accepting runs) of local error states is preserved.

The first theorem states that if the SMA is thread-asynchronous we can transform a program \( P_1 \) into a thread-wise equivalent program \( P_2 \) such that a local error state is reachable in the resulting program \( P_2 \) if and only if it is reachable in \( P_1 \). Intuitively, this theorem holds since the fact that the SMA transition system is thread-asynchronous ensures that the interaction of each thread with the SMA is independent of how threads are interleaved; in particular, by fixing a run \( \rho \), the values returned by the read operations performed by a thread are ensured to be the same in all the possible interleavings of the projections of \( \rho \) onto each thread. Since we assume that the sequences of SMA operations issued along the runs of \( P_1 \) and \( P_2 \) may differ only as caused by different interleavings of the threads, we get that reachability is preserved.

**Theorem 1.** Let \( C_i \) be a control-flow transition system for \( i = 1, 2 \) and \( M \) be an SMA transition system. If \( C_1 \) and \( C_2 \) are thread-wise equivalent, and \( M \) is thread-asynchronous, then a local error state is reachable in \( C_1|M \) iff it is reachable in \( C_2|M \).

Theorem 1 states a crucial property for our approach: we can implement a thread-asynchronous SMA, and combine it with any transformation of the program that rearranges the interleaving among threads and still get a correct verification approach.

The second theorem shows that we can replace an SMA \( M_1 \) with another SMA \( M_2 \) that captures its thread-asynchronous closure, and still preserve reachability of local error states. The interesting case of the proof is when a sequence \( \alpha \) is accepted by \( M_2 \) but not by \( M_1 \). In this case, since the returned values are visible in \( \Sigma_{SMA} \) letters and there must be a sequence \( \alpha' \) that is accepted by \( M_1 \) such that \( \pi(\alpha) = \pi(\alpha') \), we get that the sequences of local states visited by any thread of any program \( P \) are the same for both \( \alpha \) and \( \alpha' \). Therefore, the following theorem holds.

**Theorem 2.** Let \( C \) be a control-flow transition system and \( M_i \) be an SMA transition system for \( i = 1, 2 \). If \( L(M_2) = (L(M_1))^\# \), then a local error state is reachable in \( C|M_1 \) iff it is reachable in \( C|M_2 \).

By combining both theorems, we can easily show the correctness of WMM extensions of correct verification methods that transform programs by keeping the ordering of the sequence of the operations within each thread, such as the methods from
In fact, we just need to provide an SMA that captures the thread-asynchronous closure of the memory model.

5 Individual Memory-Location Unwindings

We now discuss an implementation of thread-asynchronous SMAs for SC, TSO and PSO. The key notion is the individual memory-location unwinding (IMU), a set containing exactly one sequence of writes for each shared memory location (location unwinding, LU for short) and such that the unique timestamps associated to each write determine a total order among all the writes of all the LUs (where each timestamp denotes the time of occurrence of a write according to a discrete-time global clock).

Precisely, an LU for a memory location v, denoted by v-LU, is a sequence of triples (t, val, d) where t and val denote the thread identifier and the value of the write and d > 0 is the associated timestamp. If Var is the set of location names and μv a v-LU for each v ∈ Var, an IMU is a set {μv | v ∈ Var} such that: a) the triples in each LU are ordered by increasing timestamps, and b) for each pair of different location names v1, v2 ∈ Var and all triples (t1, val1, d1) in μv1 and (t2, val2, d2) in μv2, we have d1 ≠ d2 (thus the timestamps define a total order among all the writes in the IMU).
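Conditions a) and b) can be read as a simple well-formedness check over a table of guessed writes. The following sketch assumes, for illustration only, that every LU holds exactly `W` writes and that there are `V` locations; the type `lu_entry` and the function `imu_well_formed` are hypothetical names.

```c
#include <assert.h>

#define V 2   /* number of locations (assumed) */
#define W 3   /* writes per location unwinding (assumed fixed) */

/* One triple (t, val, d) of a location unwinding. */
typedef struct { unsigned thread; int val; unsigned d; } lu_entry;

/* Returns 1 iff imu satisfies d > 0, condition a) and condition b). */
int imu_well_formed(lu_entry imu[V][W]) {
    for (unsigned v = 0; v < V; v++)
        for (unsigned i = 0; i < W; i++) {
            if (imu[v][i].d == 0) return 0;             /* timestamps are positive */
            if (i > 0 && imu[v][i - 1].d >= imu[v][i].d)
                return 0;                               /* a) increasing within an LU */
        }
    for (unsigned v1 = 0; v1 < V; v1++)
        for (unsigned v2 = v1 + 1; v2 < V; v2++)
            for (unsigned i = 0; i < W; i++)
                for (unsigned j = 0; j < W; j++)
                    if (imu[v1][i].d == imu[v2][j].d)
                        return 0;                       /* b) distinct across LUs */
    return 1;
}
```

For example, the two unwindings with timestamps 1, 4, 5 and 2, 3, 6 form a valid IMU, whereas reusing timestamp 4 in the second one violates b).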

IMU-based SMA for SC. A transition system $M^{imu}_{sc}$ for an IMU-based implementation of SMA first guesses an IMU on the init()-transition and then executes the operations consistently with this guess. Namely, it keeps for each thread the current timestamp in the IMU (i.e., the timestamp of the last executed SMA operation) and for any input sequence α, it ensures that:

− on write(v, val, t) (resp. ind_write(a, val, t)), the next write in the v-LU (resp. the LU identified by the address a) for thread t matches the value val; the current timestamp of t is updated to the timestamp of the matched write in the next state;
− on read(val, v, t) (resp. ind_read(val, a, t)), there must be in the v-LU (resp. the LU identified by the address a) a write with timestamp d that assigns value val to v such that either d is the timestamp of the most recent (before t’s current timestamp) write to v or d is between t’s current timestamp and the timestamp of t’s next write; in the latter case t’s current timestamp is updated to d in the next state;
− for each thread, the writes are matched according to the global ordering given by the timestamps.

In order to accept α, create(t, f, t’) must occur in α for each thread t with writes guessed in the IMU and the writes in the IMU should be mapped 1-to-1 to the writes in α.

The transition system $M^{imu}_{sc}$ is thread-wise equivalent to $M_{sc}$, and additionally, it can execute all computations of $M_{sc}$ by advancing each involved thread in any order. Moreover, due to the fact that all writes are guessed in advance, the ordering in which we interleave the threads is irrelevant. We thus get the following lemma.

Lemma 1. \(L(M^{imu}_{sc}) = (L(M_{sc}))^\#\).

IMU-based SMA for TSO and PSO. To capture the TSO and PSO semantics, we introduce into the IMU a second timestamp for each write. In particular, we now make a distinction between the time a write occurs (occurrence timestamp) and the time the shared memory is updated with an occurred write (update timestamp). For correctness, we impose on the IMU that for each write the occurrence timestamp should not be greater than the update timestamp.

For TSO, to ensure the FIFO policy for the store buffers along any program execution, we also require that the writes of each thread appear in the same order whether they are sorted by occurrence timestamp or by update timestamp. For PSO, this requirement is replaced by a weaker one that enforces the FIFO policy only among the writes to the same location performed by the same thread.
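The difference between the two requirements can be seen in a small checker over guessed writes. This is a sketch under assumptions: the flat `ts_write` record, the `model` enumeration and the function `timestamps_ok` are invented for this illustration, and timestamps are assumed to be pairwise distinct within each of the two sequences.

```c
#include <assert.h>
#include <stddef.h>

/* A guessed write with its occurrence (occ) and update (upd) timestamps. */
typedef struct { unsigned thread, loc, occ, upd; } ts_write;
typedef enum { MODEL_TSO, MODEL_PSO } model;

/* Returns 1 iff the guessed timestamps respect occ <= upd and the FIFO
   requirement of the chosen model. */
int timestamps_ok(const ts_write *w, size_t n, model m) {
    for (size_t a = 0; a < n; a++) {
        if (w[a].occ > w[a].upd) return 0;   /* a write cannot commit before it occurs */
        for (size_t b = 0; b < n; b++) {
            if (w[a].thread != w[b].thread) continue;
            /* PSO only constrains writes to the same location */
            if (m == MODEL_PSO && w[a].loc != w[b].loc) continue;
            /* FIFO: an earlier write must not reach memory later */
            if (w[a].occ < w[b].occ && w[a].upd > w[b].upd) return 0;
        }
    }
    return 1;
}
```

A guess in which a thread writes location 0 at time 1 (committed at 6) and location 1 at time 2 (committed at 3) is rejected under TSO but accepted under PSO, since the two writes go through different per-location buffers.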

We denote with $M^{imu}_{tso}$ and $M^{imu}_{pso}$ the IMU-based SMA transition systems corresponding to the TSO and PSO memory models, respectively. $M^{imu}_{tso}$ can be obtained from $M^{imu}_{sc}$ with a few changes: on the init()-transition we now guess the IMU with occurrence and update timestamps as observed above; in a read of location $v$ by a thread $t$ the position of the matching write is the last occurred write still in the store buffer of $t$ (i.e., the current timestamp of $t$ is between the occurrence timestamp and the update timestamp of the last write of $v$ by $t$), if any, and the last updated write of $v$, otherwise (this case works as the read in $M^{imu}_{sc}$); the current timestamp of a thread $t$ is also updated to the occurrence timestamp of a write when this is executed; a fence(t)-transition updates the current timestamp to the largest update timestamp of the already occurred writes performed by $t$. Obtaining $M^{imu}_{pso}$ from $M^{imu}_{tso}$ is very simple: the only difference is hidden in the properties that are required on the guessed IMU as observed above.

By the above observations we can derive that the described transition systems capture the semantics of the corresponding memory models. Moreover, since all the writes are guessed in advance, the ordering in which we interleave the threads is irrelevant. Thus, we get the following lemma:

Lemma 2. For $m \in \{tso, pso\}$, $L(M^{imu}_{m}) = (L(M_{m}))^\#$.

Verification by IMU. By composing the transformation of the control flow from [27] with the SMA implementations $M^{imu}_{sc}$, $M^{imu}_{tso}$ and $M^{imu}_{pso}$ we get new methods for the verification of multi-threaded programs under SC, TSO and PSO semantics, respectively. The correctness of such methods is a consequence of the lemmas given above, and Theorems 1 and 2.

6 IMU-based SMA implementations

In this section, we discuss concrete C-implementations of SMAs whose semantics is captured by $M^{imu}_{sc}$, $M^{imu}_{tso}$ and $M^{imu}_{pso}$, respectively. Each of them implements the SMA API defined in Section 3. In the remainder of this section we will give some details of the implemented code; a full version is in the Appendix. Our code is optimized for an efficient analysis using BMC tools but implementations for other backends are possible.

IMU implementation for SC. The implementation is parameterized over several constants. $\mathbb{N}$ and $\mathbb{U}$ denote the number of locations with names (i.e., shared scalar variables)
Data structures and invariants. We use several scalar variables and arrays to maintain the LUs and support the implementation of the SMA operations. We sketch below the main ones that are relevant to the read and write operations; others are used to model thread creation, join, and termination, and the dynamic memory allocation (see Appendix). All are declared global such that they are visible and can be modified in all the functions. For simplicity, we assume that all data is represented by unsigned integers.

The triples \( (t, \mathit{val}, d) \) of the LUs are maintained by three different arrays \( \text{thread}, \text{value}, \text{tstamp} \). For every location \( v \in [0, V-1] \) and \( i \in [0, W-1] \), the triple at position \( i \) in the \( v \)-LU is stored in \( \text{thread}[v][i], \text{value}[v][i] \) and \( \text{tstamp}[v][i] \). We link the writes of a same thread in each LU by an additional array \( \text{th_next_write} \). All these arrays are nondeterministically assigned in the function \( \text{init} \) and never changed in the program execution. \( \text{init} \) also ensures that:

- timestamps are assigned in increasing order for each LU;
- no two writes in the IMU are assigned the same timestamp;
- for every location \( v \in [0, V-1] \), position \( i \in [0, W-1] \) and thread identifier \( t \in [0, T-1] \), \( \text{th_next_write}[v][i][t] \) is the first position in the \( v \)-LU after \( i \) that corresponds to a write by \( t \), if any; otherwise, it is set to \( W \), denoting that no further writes of \( v \) by \( t \) are expected.
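The intended content of th_next_write can be illustrated by computing it directly from a given thread array. Note that this is a deterministic sketch for illustration: in the implementation described here the array is guessed nondeterministically and then constrained, whereas the function `fill_next_write` and the small bounds below are assumptions of this example.

```c
#include <assert.h>

#define V 1   /* locations (assumed) */
#define W 4   /* positions per LU (assumed) */
#define T 2   /* threads (assumed) */

/* For every v, i, t: th_next_write[v][i][t] becomes the first position
   after i in the v-LU whose write is by t, or W if there is none. */
void fill_next_write(unsigned thread[V][W], unsigned th_next_write[V][W][T]) {
    for (unsigned v = 0; v < V; v++)
        for (unsigned t = 0; t < T; t++) {
            unsigned next = W;                    /* no later write by t seen yet */
            for (unsigned i = W; i-- > 0; ) {     /* scan the LU right to left */
                th_next_write[v][i][t] = next;    /* strictly after position i */
                if (thread[v][i] == t) next = i;
            }
        }
}
```

For an LU whose writes are issued by threads 0, 1, 0, 1 in that order, the entry for position 0 and thread 0 is 2, and every entry for position 3 is `W`.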

To keep track of the execution of each thread in the IMU, we use the arrays \( \text{th_pos}, \text{last_write}, \text{cur_tstamp} \), and maintain the following invariants for every location \( v \in [0, V-1] \) and thread identifier \( t \in [0, T-1] \):

- \( \text{th_pos}[v][t] \) stores the current position of thread \( t \) in the \( v \)-LU;
- \( \text{last_write}[v] \) stores the position \( i \in [0, W-1] \) of the last executed write operation of location \( v \) in the \( v \)-LU;
- \( \text{cur_tstamp}[t] \) stores the current timestamp of thread \( t \) during its simulation.

Verification stubs. We only discuss here the implementation of the API functions \( \text{read} \) and \( \text{write} \), which is given in Fig. 1. Both functions first check whether the execution of the simulated thread has been stopped, and return immediately if this is the case.

For a read operation of thread \( t \) from location \( v \), we first jump forward into \( v \)-LU by invoking the auxiliary function \( \text{Jump} \) and then return the value of \( v \) at this new position of \( v \)-LU. \( \text{Jump} \) (cf. Fig. 1) works as follows. If the timestamp of the selected write is past the current thread timestamp, the latter is updated to this value, acknowledging the fact that the corresponding write into the shared memory has occurred. The value of \( \text{jump} \) is selected nondeterministically within a range of proper values. Namely, \( \text{jump} \) should not pass the last legal write position for \( v \) and must be strictly less than the position of the next write of \( v \) by the same thread \( t \) (that has not occurred yet). Further, we require that the timestamp at position \( \text{jump}+1 \) is greater than the current timestamp of \( t \), as we must point to a write of \( v \) that is not superseded by already occurred writes.
/* SC read: advance t's view of the v-LU and return the value found there. */
int read(uint v, uint t) {
    if (is_terminated(t)) return 0;
    uint jump = Jump(t, v);
    return value[v][jump];
}

/* SC write: match the next guessed write of v by t; prune the run otherwise. */
void write(uint v, int val, uint t) {
    if (is_terminated(t)) return;
    uint i, jump;
    i = th_pos[v][t];
    jump = th_next_write[v][i][t];
    assume (jump <= last_write[v] && value[v][jump] == val && tstamp[v][jump] > cur_tstamp[t]);
    th_pos[v][t] = jump;
    cur_tstamp[t] = tstamp[v][jump];
}

/* Nondeterministically pick the position in the v-LU observed by a read of t. */
uint Jump(uint t, uint v) {
    uint jump = *;                   /* nondeterministic choice */
    uint j = th_pos[v][t];
    uint ts_jump = tstamp[v][jump];
    assume (jump <= last_write[v] && jump < th_next_write[v][j][t] && tstamp[v][jump+1] > cur_tstamp[t]);
    cur_tstamp[t] = (ts_jump > cur_tstamp[t]) ? ts_jump : cur_tstamp[t];
    return jump;
}

Fig. 1. Read, write, and jump functions.

With the stated invariants we get that Jump identifies a position \( i \) in the \( v \)-LU that is correct w.r.t. the \( v \)-LU (in the sense that it is not jumping over the next write of \( v \) by \( t \)). However, note that the corresponding timestamp could still be larger than that of the next write by \( t \) (for a different location); we catch this while executing the next write of \( t \), when the current timestamp of \( t \) will be larger than the one of that write.

In a write operation, we first move forward to the position of the next write by \( t \) in the \( v \)-LU and block the execution if the value to be written differs from that stored in the \( v \)-LU at the position. We also check that the timestamp associated with the new \( v \)-LU position for \( t \) is greater than the current timestamp of \( t \); if this is not the case, we are then in the error case generated by a wrong update of the thread timestamp in a read as described above, and thus the execution is aborted. If all checks are passed, we update the current position of thread \( t \) in the \( v \)-LU and the current timestamp accordingly, thus maintaining the stated invariants.
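The checks performed by a write can be replayed on a fixed guess. The sketch below is a deterministic illustration, not the verification stub itself: the `assume` is modelled by a function `try_write` that returns 0 where the stub would block, the concrete LU is invented, and position 0 is treated as holding the initial value (an assumption of this example).

```c
#include <assert.h>

enum { V = 1, W = 4, T = 2 };

/* A fixed guess for the only location: t0 writes 5, t1 writes 7, t0 writes 9. */
static const unsigned thread_of[V][W] = {{0, 0, 1, 0}};
static const int      value[V][W]     = {{0, 5, 7, 9}};
static const unsigned tstamp[V][W]    = {{0, 1, 2, 3}};
static const unsigned th_next_write[V][W][T] = {{{1, 2}, {3, 2}, {3, W}, {W, W}}};
static const unsigned last_write[V]   = {3};

static unsigned th_pos[V][T];      /* current position of each thread in the LU */
static unsigned cur_tstamp[T];     /* current timestamp of each thread */

/* Returns 1 and advances t if the write matches the guess, 0 otherwise. */
int try_write(unsigned v, int val, unsigned t) {
    unsigned jump = th_next_write[v][th_pos[v][t]][t];
    if (!(jump <= last_write[v] && value[v][jump] == val &&
          tstamp[v][jump] > cur_tstamp[t]))
        return 0;                          /* the stub would block this run */
    assert(thread_of[v][jump] == t);       /* the matched write belongs to t */
    th_pos[v][t] = jump;
    cur_tstamp[t] = tstamp[v][jump];
    return 1;
}
```

In this replay the threads can be advanced in any order: the writes of thread 1 are matched against the same guess regardless of how far thread 0 has progressed, which is the behaviour expected from a thread-asynchronous SMA.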

**IMU implementation for TSO.** We build this implementation incrementally on the one given for SC; the code of the functions read, fence and write is illustrated in Fig. 2. We use \( \text{tstamp}[v][i] \) to store the update timestamp of the write at position \( i \) in the \( v \)-LU, and \( \text{cur\_tstamp}[t] \) to keep track of the current timestamp in the execution of thread \( t \) (i.e., the occurrence timestamp of the read or write that occurred last). Additionally, we use two new arrays \( \text{btstamp} \) (buffer timestamps) and \( \text{ts\_lastW} \) such that:

- \( \text{btstamp}[v][i] \) is the occurrence timestamp of the write at position \( i \) in the \( v \)-LU (that is also the time at which it is stored in the local buffer of the thread that performs the write operation);
- \( \text{ts\_lastW}[t] \) is the update timestamp of the write by thread \( t \) that occurred last.

For \( init \), we nondeterministically guess the initial values for \( btstamp[v][i] \) and then impose that \( btstamp[v][i] \leq tstamp[v][i] \) must hold (i.e., the update of the shared memory according to an occurred write may be delayed w.r.t. its occurrence time). Note that here we slightly diverge from the transition system \( M_{imu} \) described...
int read(uint v, uint t){
    if(is_terminated(t)) return 0;
    uint ts_jump, i;   // ts_jump is left uninitialized, i.e., guessed nondeterministically
    i = th_pos[v][t];
    uint nxt_write=th_next_write[v][i][t];
    uint fst_write=th_next_write[v][0][t];
    assume (
        (ts_jump >= cur_tstamp[t]) &&
        (ts_jump < btstamp[v][nxt_write])
    );
    cur_tstamp[t]=ts_jump;
    if( fst_write <= i &&
        tstamp[v][i] > cur_tstamp[t]
    ) return value[v][i];
    return Read_SC(v,t);
}

void fence(uint t){
    if(ts_lastW[t]>cur_tstamp[t])
        cur_tstamp[t] = ts_lastW[t];
}

void write(uint v, int val, uint t){
    if(is_terminated(t)) return;
    uint i, jump;
    i = th_pos[v][t];
    jump=th_next_write[v][i][t];
    th_pos[v][t]=jump;
    assume(
        btstamp[v][jump] > cur_tstamp[t]
        && value[v][jump] == val
        && tstamp[v][jump] > ts_lastW[t]
    );
    ts_lastW[t] = tstamp[v][jump];
    cur_tstamp[t] = btstamp[v][jump];
}

Fig. 2. Functions read, fence and write for TSO.

in Section 5. In fact, since we do not require any other condition on the guessed update
timestamps, we can carry over an IMU with timestamps that may violate the FIFO
policy on the store buffers. We fix this by checking the proper ordering on matching the
writes (see below).

The fence-operation flushes the store buffer of the executing thread. We thus
need to synchronize the current thread timestamp with its last update timestamp, i.e.,
if ts_lastW[t] is larger than the current timestamp of t, we set
cur_tstamp[t] to ts_lastW[t]. Note that if this is not the case then the local store
buffer of t is certainly empty, since btstamp[v][i] ≤ tstamp[v][i].

The read-function first nondeterministically increases the current timestamp of
thread t such that it remains smaller than the occurrence timestamp of the next write of
v by t. Now, if at least one write of location v by t has occurred and the last write of v
by t is still in the thread buffer, then we return the value of this write. Otherwise, a read
from the shared memory is performed by invoking the auxiliary function Read_SC, which
is exactly the function read from Fig. 1.

Note that the update of the current thread timestamp by read can cause this value
to be larger than the update timestamp of the last write, which is correct. To avoid
wrongly moving the time back, fence makes the assignment only when this is
not the case.
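The interplay between read and fence can be made concrete with a small sketch. It is an illustration only, restricted to a single thread; the names `tso_thread_t`, `last_write_pending` and `tso_fence` are hypothetical, and only the conditional assignment itself is taken from Fig. 2.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int uint;

typedef struct {
    uint cur_tstamp;  /* occurrence time of the last read or write of the thread */
    uint ts_lastW;    /* update time of the last write of the thread */
} tso_thread_t;

/* True when the last write may still sit in the store buffer. */
bool last_write_pending(const tso_thread_t *th) {
    return th->ts_lastW > th->cur_tstamp;
}

/* fence as in Fig. 2: advance the thread time to the flush of the last
   write, but never move it back. */
void tso_fence(tso_thread_t *th) {
    if (last_write_pending(th))
        th->cur_tstamp = th->ts_lastW;
    assert(!last_write_pending(th));
}
```

A thread at time 4 whose last write is flushed at time 9 is moved to time 9 by the fence, whereas a thread that a read has already advanced to time 12 keeps its timestamp.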

The write-function first updates the current position in the v-LU of thread t to
the next write provided that the time of occurrence of this write is larger than the current
thread timestamp, the value of the write matches the guessed value for it and the update
timestamp of the next write is larger than that of the last occurred write (the last one
ensures that the thread store buffers are emptied according to a FIFO policy). Note that,
in the case of a wrong guess of the update timestamps in init, this condition would
not hold and thus the execution would abort. Before returning, the update timestamp of
the last write and the current timestamp of thread t are modified consistently.
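The three conjuncts checked by write can be exercised on concrete guesses. The sketch below is not the tool's code: it abstracts from locations and LU positions, takes the guessed write as an explicit argument, and uses the hypothetical names `tso_thread_t`, `guessed_write_t` and `tso_write`; returning false stands for an execution blocked by the assume-statement.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int uint;

typedef struct {
    uint cur_tstamp;  /* occurrence time of the last read or write */
    uint ts_lastW;    /* update time of the last write */
} tso_thread_t;

typedef struct {
    int  value;    /* guessed value of the write */
    uint btstamp;  /* guessed occurrence (buffering) timestamp */
    uint tstamp;   /* guessed update (flush) timestamp */
} guessed_write_t;

/* Returns false where write in Fig. 2 would block on its assume. */
bool tso_write(tso_thread_t *th, const guessed_write_t *w, int val) {
    assert(w->btstamp <= w->tstamp);          /* imposed by init */
    if (!(w->btstamp > th->cur_tstamp         /* occurs after the current time */
          && w->value == val                  /* matches the guessed value     */
          && w->tstamp > th->ts_lastW))       /* FIFO order of the store buffer */
        return false;
    th->ts_lastW = w->tstamp;
    th->cur_tstamp = w->btstamp;
    return true;
}
```

If a first write is guessed to be flushed at time 9, a later write of the same thread guessed to be flushed at time 6 is rejected, since it would leave the buffer before its predecessor; the same write with update timestamp 10 is accepted.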

**IMU implementation for PSO.** We can get a PSO-SMA by slightly modifying the
TSO-version as follows. We use a new array max_tsW instead of ts_lastW to keep
for each thread t the maximum update timestamp among all the occurred writes of t. We achieve this by replacing in `write` the update of ts_lastW[t] with the assignment of `(tstamp[v][jump] > max_tsW[t]) ? tstamp[v][jump] : max_tsW[t]` to max_tsW[t].

We further modify function `write` by removing from the assume-statement the conjunct `tstamp[v][jump] > ts_lastW[t]` (see Fig. 2). We recall that this conjunct was required in the TSO implementation to ensure that for each thread t, the guessed occurrence and update timestamps for the sequence of writes by t (that may be contained in different LUs) are indeed consistent with the store-buffer FIFO policy; in PSO, we only need to require this within each LU, which is thus ensured by the remaining constraints of `write` and `init`.
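The effect of the two modifications can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: locations are abstracted away (the two writes in the usage example are assumed to target different locations), the names `pso_thread_t`, `guessed_write_t`, `pso_write` and `pso_fence` are hypothetical, and the fence is assumed to use max_tsW in exactly the way the TSO version uses ts_lastW.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int uint;

typedef struct {
    uint cur_tstamp;  /* occurrence time of the last read or write */
    uint max_tsW;     /* maximum update time among the occurred writes */
} pso_thread_t;

typedef struct {
    int  value;    /* guessed value of the write */
    uint btstamp;  /* guessed occurrence timestamp */
    uint tstamp;   /* guessed update timestamp */
} guessed_write_t;

/* write without the cross-location FIFO conjunct of the TSO version. */
bool pso_write(pso_thread_t *th, const guessed_write_t *w, int val) {
    assert(w->btstamp <= w->tstamp);   /* imposed by init */
    if (!(w->btstamp > th->cur_tstamp && w->value == val))
        return false;
    if (w->tstamp > th->max_tsW)       /* keep the maximum update timestamp */
        th->max_tsW = w->tstamp;
    th->cur_tstamp = w->btstamp;
    return true;
}

/* Assumed PSO fence: wait until all occurred writes have reached memory. */
void pso_fence(pso_thread_t *th) {
    if (th->max_tsW > th->cur_tstamp)
        th->cur_tstamp = th->max_tsW;
}
```

Here a write flushed at time 9 followed by a write to another location flushed at time 6 is accepted, which TSO would reject; a subsequent fence moves the thread to time 9, the maximum of the two update timestamps.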

### 7 Experimental Evaluation

We have implemented our approach in the IMU-CSeq tool that analyzes C programs over the pthreads API\(^1\). It first uses modules from MU-CSeq [13,27] to transform the original multi-threaded program into a sequential one (sequentialization), then links this against an IMU-based SMA implementation, and finally verifies the resulting program with a BMC tool for sequential programs, in particular CBMC (v5.3). Depending on the chosen SMA implementations we thus obtain an efficient tool for verifying multi-threaded programs under SC, TSO, and PSO, respectively. A hybrid tool combining IMU-CSeq and MU-CSeq [29] has won the gold medal in the Concurrency-category of the TACAS Software Verification Competition (SV-COMP16) [7].

**SC benchmarks.** We first evaluated IMU-CSeq on the Concurrency-benchmarks of SV-COMP16. These cover the core features of the C programming language and the basic concurrency mechanisms well, and many state-of-the-art analysis tools have been trained on them. Since we use a BMC tool as a backend, we can only show whether an error is reachable within given bounds. We therefore evaluate IMU-CSeq only on files that have a reachable error location. In particular, we used the files from the subcategories shown in Table 1: each row shows the corresponding number of files and lines of code.

The experiments were run on a dedicated machine with a Xeon E5-2650 v2 with 2.60 GHz and 132GB RAM, running Linux 4.2.0-22-generic. We set a 15GB memory limit and a 900s time limit. The files are analyzed under SC semantics. Table 1 shows the results for the SV-COMP16 versions of CBMC [4], CIVL [32], Lazy-CSeq [13,14], the

---

Table 2. Analysis runtime under TSO/PSO parameters

<table>
<thead>
<tr>
<th></th>
<th>TSO runtime (s)</th>
<th>PSO runtime (s)</th>
<th>W</th>
<th>U</th>
<th>M</th>
<th>bug?</th>
<th>files</th>
<th>IMU-CSeq</th>
<th>CBMC</th>
<th>NIDHUGG</th>
<th>bug?</th>
<th>files</th>
<th>IMU-CSeq</th>
<th>CBMC</th>
<th>NIDHUGG</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>simple</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>dekker</td>
<td>52</td>
<td>0.76</td>
<td>0.26</td>
<td>0.04</td>
<td>1 0.76</td>
<td>0.24</td>
<td>0.04</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lamport</td>
<td>78</td>
<td>0.97</td>
<td>0.33</td>
<td>0.02</td>
<td>1 0.97</td>
<td>0.26</td>
<td>0.04</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>peterson</td>
<td>40</td>
<td>0.67</td>
<td>0.28</td>
<td>0.06</td>
<td>1 0.68</td>
<td>0.23</td>
<td>0.04</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>szymanski</td>
<td>57</td>
<td>0.84</td>
<td>0.37</td>
<td>0.11</td>
<td>1 0.84</td>
<td>0.28</td>
<td>0.08</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>db_longer_safe</td>
<td>67</td>
<td>1.21</td>
<td>1.89</td>
<td>8.89</td>
<td>1 2.50</td>
<td>7.79</td>
<td>11.93</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>complex</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>psql</td>
<td>47</td>
<td>1.92</td>
<td>0.03</td>
<td>0.01</td>
<td>1 0.69</td>
<td>0.22</td>
<td>0.04</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>stack unsafe</td>
<td>110</td>
<td>1.22</td>
<td>0.35</td>
<td>0.06</td>
<td>1 1.21</td>
<td>0.26</td>
<td>0.05</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>stack unsafe</td>
<td>110</td>
<td>1.46</td>
<td>0.45</td>
<td>0.05</td>
<td>1 1.44</td>
<td>0.38</td>
<td>0.05</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

SV-COMP15 version of MU-CSeq [27]\(^2\), and of IMU-CSeq on these benchmarks. We indicate with pass the number of correctly found bugs, with fail the number of unsuccessful analyses including tool crashes, memory limit hits, and timeouts, and with time the average time in seconds to find the bug. The results clearly show that our approach is competitive with existing tools; in particular, the IMU-based SMA-implementation improves over the MU-based MU-CSeq.

**WMM benchmarks.** We then compared IMU-CSeq against two tools with built-in support for analysis under weak memory models, CBMC [12], and Nidhugg [1], a bug-finding tool that combines stateless model checking with dynamic partial order reduction on relaxed memory executions. These experiments were run on a dedicated machine with a Xeon W3520 2.6GHz processor and 12GB RAM, running 64-bit Linux 3.0.6. For each tool and benchmark, we set the parameters to the minimum value needed to expose the error.

*Simple benchmarks.* Table 2 shows the results over a set of (relatively simple) benchmarks collected from the CBMC, Poet, and Nidhugg tools, and the SV-COMP benchmark suite. The unwind parameter was used by all three tools considered in the comparison, while W, U, and M are used only by IMU-CSeq, as detailed in Section 6. The parameter bitwidth gives the size of integers (in bits) used in the sequential analysis.

The first block contains results for some classical mutual exclusion algorithms. The implementations are correct under SC but not under TSO and PSO (as indicated by an entry in the column "bug?"). All tools find the errors, but because of the problems' small size, Nidhugg outperforms both IMU-CSeq and CBMC on these programs.

The second block contains safe and unsafe versions of one of the fibonacci-benchmarks, where two worker threads concurrently increase two shared counters, and a main thread checks whether any of the counters can reach a defined value. A full exploration of the thread interleavings is required to identify the error (or show its absence) in this program, and techniques such as partial-order reduction do not apply. Here, IMU-CSeq has a slight edge over CBMC while Nidhugg is substantially slower than both.

---

2 Note that the SV-COMP16 version of MU-CSeq is a hybrid tool that already uses IMU for the shown sub-categories. We thus use the SV-COMP15 version here.
The next block contains benchmarks derived from industrial code. *pgsql* is a well-known SQL bug [3]; it is correct under SC and TSO but not under PSO. *parker* models a semaphore-like synchronization class that breaks under TSO [1], and *stack* was taken from SV-COMP [7]. Here, all tools report the expected results; the performance differences between Nidhugg and CBMC are small, while IMU-CSeq's performance could be improved with a better implementation, as it currently parses and unparses each file nearly 20 times.

The last block shows the average results for 5803 WMM litmus tests with 297K lines of code. For TSO, both our tool and CBMC successfully identified the 277 test cases containing a reachable error, while Nidhugg failed to find one of them. For PSO, CBMC claims that there are 971 unsafe instances while Nidhugg and IMU-CSeq both find only 968 unsafe ones. Since both tools agree, we suspect an error in CBMC. Here, symbolic methods are faster, and Nidhugg has two timeouts (given a 600s time limit).

*Safestack.* Safestack [10] is a lock-free stack implementation designed for weak-memory models. It is written in C++ but we manually translated it into C, providing simulation functions for the C11 atomic functions, and analyzed this version. It contains a rare bug that is hard to find with automatic bug-finding techniques already under SC (including random testing, Nidhugg, CIVL [32], and other approaches based on BMC) [26]. The only tool we are aware of that can automatically find a genuine counter-example is Lazy-CSeq [13], which requires a minimum of 3 loop unwindings and 4 rounds of computation and more than 7 hours to expose a bug. Both Nidhugg and CBMC failed to find the bug, while IMU-CSeq required approx. 3.5 minutes and 1.5GB of memory to find it under TSO, and approx. 17 minutes and 1.8GB of memory under PSO.

### 8 Discussion

**IMU-based SMA and Relaxed Memory Models.** Alglave et al. [5] introduce an axiomatic framework to capture the semantics of memory models. Our framework instead aims at a scalable verification approach that encapsulates all differences between the models within the SMA implementation such that the designs of the verification algorithm and of the memory model simulation can be developed independently. A crucial notion we use here is that of the IMU, which captures the sequence of writes that occur in each location (thus also capturing the coherence relation from [5]). To achieve the reordering of the statements that are observed in the relaxed memory models, we guess the timestamps and check their consistency with the expected behaviours. This is done while executing the statements with appropriate `assume`-statements.

Our framework can easily be extended from TSO and PSO, as described here, to more relaxed memory models such as POWER. POWER relaxes these models essentially in that: 1) the propagation of a write in the shared memory by a thread does not need to be simultaneous for all the other threads (i.e., each thread can see the write at a different time from the others); 2) the order of execution of the statements of a thread can be liberally rearranged (w.r.t. the program order) provided that the dependency relations such as data-flow, address, control and isync are fulfilled (see [25]).

The asynchronous propagation of the writes can be easily captured in the IMU by allowing for each write a different timestamp for each thread. The dependency relations
define a partial order over the statements of a thread (see [25]). In our approach we can simulate the execution of the statements of each thread according to any linearization of such partial orders. For this, we proceed with the execution of each statement according to the program order and then simulate the actual ordering of the computation by the timestamps. In particular, we keep all the timestamps of the executed statements that have no executed successors in the partial order, and make sure that time increases moving to successors. This can be modeled inside the SMA implementation by exploiting the IMU, leaving, in contrast with [5,25], the control-flow part unchanged. This allows us to get rid of the additional control-flow nondeterminism that often represents a burden for verification and testing tools.

**Related Work.** The BMC approach from [4] handles different memory models by adding a conjunct to the formula. The verification algorithm in [2] works on a generic relaxed memory model that can be refined into actual memory models by adding constraints. Our work differs from these both in scope and in techniques. In particular, we work at the level of source code with code-to-code transformations and give a general approach that allows different verification algorithms to be combined with different implementations of memory models, not just a specific algorithm. The development of the two parts can be done independently as long as Theorems 1 and 2 hold.

Another important aspect of our approach is to identify a class of implementations of memory models that allows for a full rearrangement of the thread interleavings in the analysis. As already observed, this is a feature that has been already exploited in verifying concurrent programs [20,27] also with weak memory model semantics [6].

The idea of sequentialization was originally proposed by Qadeer and Wu [24] but became popular with the first scheme for an arbitrary but bounded number of context switches given by Lal and Reps [20]. Several implementations and algorithms have been developed since then (see [11,19,9,15,18,17]). In particular, lazy sequentialization has been recently extended to handle TSO and PSO in the CSeq framework [28]. The reachability analysis used in our algorithm is bounded on the number of writes which is an orthogonal bounding parameter with respect to bounded context-switching [23].

**Conclusions.** We have described and evaluated a new verification approach for concurrent programs over different memory models. Our main design goal was to break the coupling between computation (i.e., individual threads) and the communication (i.e., shared memory) concerns of multi-threaded programs, without losing the efficiency of existing approaches. We have introduced shared memory abstractions, which capture the standard concurrency operations in multi-threaded programs. We have then shown that reachability is preserved if we exchange a program by a thread-wise equivalent one (assuming the SMA is thread-asynchronous) or an SMA for its thread-asynchronous closure. This allows us to generalize existing concurrent verification approaches to different memory models simply by implementing the corresponding different SMAs. We have described efficient SMA implementations for SC, TSO, and PSO based on the idea of individual memory-location unwindings, which have allowed us to instantiate our approach into an efficient BMC-based bug-finding tool. Our experiments show that the resulting tool compares well with existing ones.

The main future work is the detailed formalization of the POWER and other relaxed memory models, with their implementation in our framework.
References

The syntax of multi-threaded programs is defined by the grammar shown in Fig. 3. Terminal symbols are set in typewriter font. \((n t)^*\) represents a possibly empty list of non-terminals \(n\) that are separated by terminals \(t\); \(x\) denotes a local variable, \(y\) an identifier of a shared variable, \(p\) an identifier of a pointer variable, \(m\) a mutex identifier, \(t\) a thread identifier and \(f\) a function name. We assume expressions \(e\) to be local variables, pointer values (returned by a read of a pointer variable), and integer constants that can be combined using mathematical operators. Boolean expressions \(b\) comprise the constants \(true\), \(false\), and Boolean variables, and can be combined using standard Boolean operations.

A multi-threaded program consists of an \(\text{init()}\) invocation followed by a list of functions. \(\text{init()}\) instantiates a shared memory abstraction that captures a number of shared locations. Each function has a list of zero or more typed parameters, and its body has a declaration of local variables followed by a statement.

A statement is either a sequential or a concurrent statement, or a sequence of statements enclosed in braces (compound statement).

A sequential statement can be an assume- or assert-statement, an assignment, a call to a function that takes multiple parameters (with an implicit call-by-reference parameter passing semantics), a return-statement, a conditional statement, or a loop. All variables involved in a sequential statement are local.

A concurrent statement involves an interaction with the shared memory abstraction and thus we have a different concurrent statement for each of the functions of the SMA API (other than \(\text{init()}\) that is invoked only in the beginning).

We assume that a valid program \(P\) satisfies the usual well-formedness and type-correctness conditions. We also assume that \(P\) contains a function \(\text{main}\), which is the starting function of the only thread that exists in the beginning. We call this the \textit{main thread}. We further assume that there are no calls to \(\text{main}\) in \(P\) and that no other thread can be created that uses \(\text{main}\) as starting function.
### B Transition systems

An alphabet is a set of symbols. For an alphabet \( \Sigma \), a word over \( \Sigma \) is a sequence of zero or more symbols from \( \Sigma \). The empty word, denoted by \( \varepsilon \), is the word formed of zero symbols. Recall that \( w \varepsilon = \varepsilon w = w \) for any word \( w \).

A transition system \( \mathcal{A} \) is a tuple \((Q, \Sigma, \Delta, Q_0, F)\) where \( Q \) is a set of states, \( \Sigma \) is an alphabet, \( \Delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q \) is a transition relation, \( Q_0 \subseteq Q \) is a set of initial states, and \( F \subseteq Q \) is a set of final states.

A run \( \pi \) of \( \mathcal{A} \) is a sequence \( q_0 \xrightarrow{\sigma_1} q_1 \xrightarrow{\sigma_2} \ldots \xrightarrow{\sigma_d} q_d \) where \( q_0 \in Q_0 \) and \((q_{i-1}, \sigma_i, q_i) \in \Delta \) for each \( i \in [1, d] \). Moreover, \( \pi \) is accepting if \( q_d \in F \), and \( \sigma_1 \ldots \sigma_d \) is the corresponding word. We denote by \( L(\mathcal{A}) \) the set of all words that correspond to accepting runs of \( \mathcal{A} \).

Let \( \mathcal{A}_i = (Q_i, \Sigma, \Delta_i, Q_{0,i}, F_i) \) be a transition system for \( i \in \{1, 2\} \). The composition of \( \mathcal{A}_1 \) and \( \mathcal{A}_2 \), denoted \( \mathcal{A}_1 | \mathcal{A}_2 \), is the standard cross product, i.e., \( \mathcal{A}_1 | \mathcal{A}_2 \) is the transition system \((Q_1 \times Q_2, \Sigma, \Delta, Q_{0,1} \times Q_{0,2}, F_1 \times F_2)\) where \( \Delta \) is the minimal set containing all tuples \(((q_1, q_2), \sigma, (q'_1, q'_2))\) such that one of the following cases holds: 1. \( \sigma = \varepsilon, (q_1, \varepsilon, q'_1) \in \Delta_1, q_2 = q'_2 \); or, 2. \( \sigma = \varepsilon, q_1 = q'_1, (q_2, \varepsilon, q'_2) \in \Delta_2 \); or, 3. \( \sigma \neq \varepsilon \), and \((q_i, \sigma, q'_i) \in \Delta_i \) for \( i \in \{1, 2\} \).
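The three cases of the composition can be read as a membership test for the product transition relation. The following sketch assumes a finite, explicitly listed transition relation, states and symbols encoded as integers, and the value 0 reserved for \( \varepsilon \); the names `trans_t`, `ts_t`, `has_trans` and `product_has_trans` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EPS 0   /* assumed encoding of the empty word; real symbols are > 0 */

typedef struct { int src; int sym; int dst; } trans_t;
typedef struct { const trans_t *delta; size_t n; } ts_t;

/* Does (q, sym, q2) belong to the transition relation of a? */
bool has_trans(const ts_t *a, int q, int sym, int q2) {
    assert(a != NULL);
    for (size_t i = 0; i < a->n; i++)
        if (a->delta[i].src == q && a->delta[i].sym == sym && a->delta[i].dst == q2)
            return true;
    return false;
}

/* Does ((q1, q2), sym, (p1, p2)) belong to the product relation? */
bool product_has_trans(const ts_t *a1, const ts_t *a2,
                       int q1, int q2, int sym, int p1, int p2) {
    if (sym == EPS)   /* cases 1 and 2: one component moves silently */
        return (has_trans(a1, q1, EPS, p1) && q2 == p2)
            || (q1 == p1 && has_trans(a2, q2, EPS, p2));
    /* case 3: both components synchronize on the same symbol */
    return has_trans(a1, q1, sym, p1) && has_trans(a2, q2, sym, p2);
}
```

The design mirrors the definition directly: silent moves interleave, while every proper symbol must be taken by both components at once.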

### C Thread-asynchronous SMAs for thread interfaces and memory unwinding

We briefly recall the notions of thread interface [16] and memory unwinding [27], and discuss how to recast some approaches from the literature in our setting by means of the SMAs derived from these notions.

**Thread interface.** A thread interface for a thread \( t \) summarizes computations of \( t \) across a bounded number of context-switches. Formally, it is a sequence of pairs \( (r_1, s_1), \ldots, (r_k, s_k) \) where \( r_i, s_i \) for \( i \in [1, k] \) are valuations of the shared locations. The intended meaning is that there is a computation of \( t \) such that \( t \) starts with \( r_1 \) as valuation of the shared locations and reaches \( s_1 \), is suspended and then reactivated with shared valuation \( r_2 \), and reaches \( s_2 \), and so on.

In a bounded context switch analysis we can assume that computations of programs are arranged in \( k \) rounds where threads are always scheduled according to the same fixed round-robin schedule \( t_1, \ldots, t_n \). Thus, exploring the computations of a multi-threaded program up to \( k \) rounds corresponds to computing thread interfaces and composing them [16]. We start with thread \( t_1 \) and guess the in-valuations at rounds \( 2, \ldots, k \) (i.e., the valuations \( r_2, \ldots, r_k \); note that \( r_1 \) is the initial valuation of the program and thus known); we then compute the out-valuations (i.e., \( s_1, \ldots, s_k \)) for thread \( t_1 \) and take them as the in-valuations of the next thread \( t_2 \), and so on. In the end, in order to establish that the computed thread interfaces form a computation of the program we just need to check that, for each round \( j \in [1, k-1] \), the out-valuation of the last thread \( t_n \) at round \( j \) equals the (guessed) in-valuation of the first thread \( t_1 \) at round \( j + 1 \).
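The composition of thread interfaces amounts to two equality checks per round, which the following sketch makes explicit. It assumes fixed small bounds and integer-valued locations; the names `interface_t`, `interfaces_compose`, `N_THREADS`, `N_ROUNDS` and `N_LOCS` are hypothetical and not taken from [16] or [20].

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N_THREADS 2   /* assumed number n of threads in the schedule */
#define N_ROUNDS  2   /* assumed round bound k */
#define N_LOCS    1   /* assumed number of shared locations */

/* Interface of one thread: in- and out-valuation for every round. */
typedef struct {
    int in[N_ROUNDS][N_LOCS];
    int out[N_ROUNDS][N_LOCS];
} interface_t;

/* Do the interfaces of t_1, ..., t_n form one round-robin computation? */
bool interfaces_compose(const interface_t itf[N_THREADS]) {
    assert(itf != NULL);
    for (int j = 0; j < N_ROUNDS; j++) {
        /* within a round, each thread starts where its predecessor stopped */
        for (int i = 0; i + 1 < N_THREADS; i++)
            if (memcmp(itf[i].out[j], itf[i + 1].in[j], sizeof itf[i].out[j]) != 0)
                return false;
        /* across rounds, t_1 resumes where t_n stopped in the previous round */
        if (j + 1 < N_ROUNDS &&
            memcmp(itf[N_THREADS - 1].out[j], itf[0].in[j + 1],
                   sizeof itf[0].in[j + 1]) != 0)
            return false;
    }
    return true;
}
```

In an eager scheme only the second check concerns guessed values, since the in-valuations of \( t_2, \ldots, t_n \) are computed rather than guessed; the sketch checks both so that it can be applied to arbitrary interfaces.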

This is the essence of the well-known sequentialization algorithm by Lal and Reps [20] and the fixed-point algorithm given in [16]. We can recast these two algorithms in our setting by means of an SMA that extends the standard SMA for SC by thread
interfaces. The resulting transition system $M_{ti}$ is as follows. On the init()-transition, $M_{ti}$ guesses a round schedule $t_1, \ldots, t_n$, a bound $k$, and for each thread $t_i$ an interface $I_i = (r_{i1}, s_{i1}), \ldots, (r_{ik}, s_{ik})$ such that $r_{ij} = s_{ij}$ for $j \in [1, k]$. $M_{ti}$ keeps for each thread the current round in the corresponding thread interface. If the current round of a thread is less than the round bound $k$, it can be increased by one by an $\varepsilon$-transition (i.e., it is nondeterministically either increased or left unmodified). Further, for any input sequence $\alpha$, $M_{ti}$ ensures that:

- on write($v, val, t$) (resp. ind_write($a, val, t$)), the out-valuation of the current round of thread $t$ is updated according to the write;
- on read($val, v, t$) (resp. ind_read($val, a, t$)), the out-valuation of the current round of thread $t$ must evaluate $v$ (resp. $a$) as $val$.

In order to accept $\alpha$, create$(t, f, t')$ must occur in $\alpha$ for each thread $t$ with a guessed interface, and the computed interfaces form a computation in the sense described above.
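The behaviour of $M_{ti}$ on the operations of a single thread can be sketched as follows. This is an illustrative approximation only: it covers direct reads and writes of one thread, fixes small bounds, and uses the hypothetical names `ti_state_t`, `ti_init`, `ti_write`, `ti_read` and `ti_next_round`; a `false` result stands for a transition that is not enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned int uint;

#define N_ROUNDS 2u   /* assumed round bound k */
#define N_LOCS   1u   /* assumed number of shared locations */

typedef struct {
    int in[N_ROUNDS][N_LOCS];    /* guessed in-valuations r_1, ..., r_k */
    int out[N_ROUNDS][N_LOCS];   /* out-valuations s_1, ..., s_k */
    uint round;                  /* current round of the thread */
} ti_state_t;

/* init: the out-valuations start as copies of the guessed in-valuations. */
void ti_init(ti_state_t *s, int guessed_in[N_ROUNDS][N_LOCS]) {
    memcpy(s->in, guessed_in, sizeof s->in);
    memcpy(s->out, guessed_in, sizeof s->out);
    s->round = 0;
}

/* write: update the out-valuation of the current round. */
void ti_write(ti_state_t *s, uint v, int val) {
    assert(v < N_LOCS);
    s->out[s->round][v] = val;
}

/* read: enabled only if the current out-valuation evaluates v as val. */
bool ti_read(const ti_state_t *s, uint v, int val) {
    assert(v < N_LOCS);
    return s->out[s->round][v] == val;
}

/* epsilon-move to the next round, possible only below the round bound. */
bool ti_next_round(ti_state_t *s) {
    if (s->round + 1 >= N_ROUNDS)
        return false;
    s->round++;
    return true;
}
```

After a context switch the thread reads from the guessed in-valuation of the new round, not from its own last write, which is exactly what decouples its simulation from that of the other threads.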

The transition system $M_{ti}$ is thread-wise equivalent to $M_{sc}$, and, moreover, it can execute all the computations of $M_{sc}$ by advancing each involved thread in any order. The proof of the following lemma is a consequence of the results from [16].

**Lemma 3.** $L(M_{ti}) = (L(M_{sc}))^\#$. 

We can then recast the verification technique from [20] in our setting by taking the above SMA along with the transformation of the control-flow from [20]. Lemma 3, and Theorems 1 and 2 show the correctness of the resulting verification method. Similarly, we can combine $M_{ti}$ with a control-flow part that at each transition nondeterministically selects the next thread to execute. The resulting system captures the verification technique from [16], and correctness is again ensured by Lemma 3, and Theorems 1 and 2. We remark that actual implementations of both these techniques require parameterization over the number of threads and rounds, as in the original implementations.

**Memory unwinding.** A memory unwinding (MU) [27] is a sequence of writes; each write $w$ is a triple $(t, v, val)$ where $t$ is the identifier of the thread that has performed the write operation, $v$ is the identifier of the memory location that is modified in the write and $val$ is the value of $v$ after the write. A corresponding transition system guesses an MU on the init()-transition and then executes the operations consistently with this guess. For SC, the corresponding transition system $M^{\text{mu}}_{\text{sc}}$ will keep for each thread the current position in the MU and for any input sequence $\alpha$, it ensures that:

- on write($v, val, t$) (resp. ind_write($a, val, t$)), the next write in the MU for thread $t$ matches the value $val$ and variable identifier $v$ (resp. address $a$);
- on read($val, v, t$) (resp. ind_read($val, a, t$)), there must be in the MU a write at a position $i$ from the current position of $t$ through the next write of $t$, that assigns value $val$ to the location identified by $v$ (resp. $a$); the current position of $t$ is updated to $i$ in the next state;
- for each thread, the writes are matched exactly in the same order as in the MU.

In order to accept $\alpha$, create$(t, f, t')$ must occur in $\alpha$ for each thread $t$ with writes guessed in the MU and the writes in the MU should be mapped 1-to-1 to the writes in $\alpha$. 
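The write-related conditions can be sketched over an explicitly given MU. The sketch is an assumption-laden illustration, not the implementation from [27]: reads and indirect accesses are omitted, the MU has a fixed length, and the names `mu_write_t`, `mu_state_t`, `next_write_of`, `mu_match_write` and `mu_all_matched` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int uint;

#define MU_LEN    3    /* assumed number of guessed writes */
#define N_THREADS 2u   /* assumed number of threads */

typedef struct { uint thread; uint loc; int val; } mu_write_t;

typedef struct {
    mu_write_t mu[MU_LEN];   /* the guessed memory unwinding */
    int pos[N_THREADS];      /* last matched write per thread, -1 initially */
} mu_state_t;

/* First position after `from` holding a write by t, or MU_LEN if none. */
int next_write_of(const mu_state_t *s, uint t, int from) {
    for (int i = from + 1; i < MU_LEN; i++)
        if (s->mu[i].thread == t)
            return i;
    return MU_LEN;
}

/* write(v, val, t): the next guessed write of t must match v and val. */
bool mu_match_write(mu_state_t *s, uint t, uint v, int val) {
    assert(t < N_THREADS);
    int i = next_write_of(s, t, s->pos[t]);
    if (i == MU_LEN || s->mu[i].loc != v || s->mu[i].val != val)
        return false;
    s->pos[t] = i;
    return true;
}

/* Acceptance condition on writes: every guessed write has been matched. */
bool mu_all_matched(const mu_state_t *s) {
    for (uint t = 0; t < N_THREADS; t++)
        if (next_write_of(s, t, s->pos[t]) != MU_LEN)
            return false;
    return true;
}
```

Note that thread 1 may match its write before thread 0 has matched an earlier one in the MU: each thread only consults its own position, so the threads can be advanced in any order.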
The transition system $M^{\text{mu}}_{\text{sc}}$ is thread-wise equivalent to $M_{\text{sc}}$, and additionally, it can execute all the computations of $M_{\text{sc}}$ by advancing each involved thread in any order. Moreover, due to the fact that all writes are guessed in advance, the ordering in which we interleave the threads is irrelevant. Thus, the following lemma holds.

**Lemma 4.** $L(M^{\text{mu}}_{\text{sc}}) = (L(M_{\text{sc}}))^\#$.

**Proof.** We start by showing that $L(M^{\text{mu}}_{\text{sc}}) \supseteq (L(M_{\text{sc}}))^\#$. For $\alpha \in L(M_{\text{sc}})$, denote with $\mu$ the MU that corresponds to the sequence of writes in $\alpha$ and with $\rho$ an accepting run of $M_{\text{sc}}$ over $\alpha$. We recall that $M^{\text{mu}}_{\text{sc}}$ on the init()-transition can guess any MU and is built on top of $M_{\text{sc}}$. Thus, $M^{\text{mu}}_{\text{sc}}$ on the initial transition can enter a state storing the initial configuration $\gamma$ as in $\rho$ and $\mu$. Now, since $\mu$ and the initial configuration $\gamma$ fully capture the configurations of the shared memory along $\rho$ (memory locations that are not assigned can be neglected), $M^{\text{mu}}_{\text{sc}}$ can simulate the execution $\rho$ by arbitrarily advancing the execution of each involved thread in any order. Thus, $M^{\text{mu}}_{\text{sc}}$ accepts all words in $\{\alpha\}^\#$ and therefore, $L(M^{\text{mu}}_{\text{sc}}) \supseteq (L(M_{\text{sc}}))^\#$.

For the other direction, i.e., $L(M^{\text{mu}}_{\text{sc}}) \subseteq (L(M_{\text{sc}}))^\#$, let $\alpha \in L(M^{\text{mu}}_{\text{sc}})$ and denote with $\mu$ the MU that is guessed on an accepting run over $\alpha$. Note that for each word in $\{\alpha\}^\#$ there is an accepting run of $M^{\text{mu}}_{\text{sc}}$ such that $\mu$ is the guessed MU. Now, let $\alpha' \in \{\alpha\}^\#$ be a word where the write operations are ordered as in $\mu$ and the read operations are ordered such that for each pair of matching read and write: 1) the read follows the write, and 2) there are no other writes involving the same location between them. Clearly, $\alpha' \in L(M_{\text{sc}})$ and therefore $\alpha \in (L(M_{\text{sc}}))^\#$. \qed

We can recast the verification approach from [27] in our setting by taking the above SMA along with the transformation of the control-flow from [27]. Lemma 4, and Theorems 1 and 2 show the correctness of the resulting verification method. Again, actual implementations would require parameterization on the number of writes and threads.

**Extension to weak memory models.** The discussed verification algorithms can be extended to handle programs under weak memory model semantics by giving the corresponding shared memory abstractions. This can be done for TSO and PSO by explicitly adding the store buffers to $M^{\text{mu}}_{\text{sc}}$ and $M_{ti}$, or for TSO by augmenting $M^{\text{mu}}_{\text{sc}}$ with guesses on the round when a write will be visible to all threads, as done in [6]. In the next section, we introduce a new implementation that refines the notion of MU and that works especially well for bounded model checking (BMC), and thus gives efficient BMC-implementations for verification under TSO and PSO program semantics.

### D IMU-based SMA encodings

Here we give full details of the SMA implementations for SC, TSO, and PSO.

#### D.1 IMU implementation for SC

**Data structures.** We use several data structures to maintain the LUs and serve the implementation of the SMA operations. They are parameterized over the constants given in Section 6. For simplicity, we assume that all the data is maintained as an unsigned integer (uint).

The triples \((t, \text{val}, d)\) of the LUs are maintained by three different arrays \(\text{thread}, \text{value}\) and \(\text{tstamp}\). Namely, for every location \(v \in [0, V-1]\) and \(i \in [0, W-1]\), the \((i+1)\text{th}\) triple in the \(v\)-LU is stored in \(\text{thread}[v][i], \text{value}[v][i]\) and \(\text{tstamp}[v][i]\).

To keep track of the execution on the LUs we use several auxiliary variables and arrays. Namely, for every location \(v \in [0, V-1]\), position \(i \in [0, W-1]\) and thread identifier \(t \in [0, T-1]\):

- \(\text{th\_pos}[v][t]\) is the current position of thread \(t\) in the \(v\)-LU;
- \(\text{last\_write}[v]\) stores the position \(i \in [0, W-1]\) of the last executed write operation of location \(v\) in the \(v\)-LU (a value is guessed for each \(v\)-LU in each simulated execution);
- \(\text{th\_next\_write}[v][i][t]\) is the first position after \(i\) in the \(v\)-LU that corresponds to a write by \(t\), if any; otherwise, it is set to \(W\) (denoting that no further writes of \(v\) by \(t\) are expected).

Concerning the management of threads, we keep some additional information. Variables \(\text{max\_th}\) and \(\text{th\_count}\) contain, respectively, the total number of threads that we assume will be created in the current program execution (a different value is guessed for each simulated execution) and the number of threads that have actually been created (this must match the guessed total number of threads at the end of the computation). Also, for each thread \(t \in [0, T-1]\):

- \(\text{cur\_tstamp}[t]\) keeps track of the current timestamp of thread \(t\) during its simulation;
- \(\text{last\_tstamp}[t]\) is the timestamp corresponding to the last write in the entire IMU by thread \(t\) (this value is guessed nondeterministically in the initialization and is never changed; it must match \(\text{cur\_tstamp}[t]\) at the end of a computation);
- \(\text{ret}[t]\) is set to \(1\) to mean that \(t\) has been interrupted before reaching the end of its execution;
- \(\text{terminated}[t]\) is set to \(1\) to mean that we expect that thread \(t\) will be stopped before the execution of its last statement (this value is guessed nondeterministically in the initialization and is never changed).

To handle dynamic memory allocation and pointer arithmetic, for each location \(v \in [0, V-1]\) and for each \(i \in [0, M-1]\) we use:

- \(\text{address}[v]\) to store the physical memory address of \(v\);
- \(\text{mallocP}[i]\) to store the base address for each memory block that can be allocated dynamically;
- \(\text{mallocPallocated}[i]\) to track the dynamically allocated memory blocks.

**IMU initialization.** All the variables and arrays introduced above are declared global. On initializing the IMU we impose several constraints on them (see function \(\text{init()}\) in Fig. 4).
```c
void init(){
    bool ts_used[V*W] = {0};   // timestamps already assigned in the IMU
    int v=0,w=0,t=0;
    th_count = 0;
    max_th = *;                // guessed number of threads to be created
    assume( max_th <= T );
    init_address();
    init_malloc();
    // first block: guess last_write, timestamps and writers of each LU
    while (v<V) {
        last_write[v] = *;
        assume( last_write[v] < W );
        w=0;
        while (w<W){
            tstamp[v][w] = *;
            assume( tstamp[v][w] < V*W &&
                    !(ts_used[tstamp[v][w]]) );       // distinct across the IMU
            ts_used[tstamp[v][w]]=1;
            if(w>0)
                assume(tstamp[v][w]>tstamp[v][w-1]);  // increasing within an LU
            thread[v][w] = *;
            assume(thread[v][w] < max_th);
            w=w+1;
        }
        v=v+1;
    }
    // second block: guess termination flag and last timestamp of each thread
    while (t<T) {
        terminated[t] = *;
        last_tstamp[t] = *;
        assume (last_tstamp[t] < V*W);
        t=t+1;
    }
    // sentinel W marks the end of each LU
    v=0;
    while (v<V){
        t=0;
        while (t<T) {
            th_next_write[v][W-1][t] = W;
            t=t+1;
        }
        v=v+1;
    }
    // backward pass: link the writes of each thread within each LU
    v=0;
    while (v<V) {
        w=W-2;
        while (w>=0) {
            t=1;
            while (t<T) {
                if(thread[v][w+1] == t)
                    th_next_write[v][w][t] = w+1;
                else
                    th_next_write[v][w][t] = th_next_write[v][w+1][t];
                t=t+1;
            }
            w=w-1;
        }
        v=v+1;
    }
}
```

Fig. 4. IMU initialization.

Function *init_address* ensures that array *address* is nondeterministically initialized with increasing values (i.e., *address[i] < address[i+1]* for *i ∈ [0, V−2]*). Function *init_malloc* ensures the same for array *mallocP* and additionally imposes that the address guessed for the last named location is less than the one assigned to the base location of the first memory allocation (i.e., *address[N−1]<mallocP[0]*). Functions *init_malloc()* and *init_address()* are illustrated in Fig. 5.
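The ordering constraints on *address* and *mallocP* can also be read as a predicate over a concrete layout. The following is a minimal sketch under stated assumptions: the sizes V, N and M are illustrative values, N is taken to be the number of named locations, and the function name layout_legal is hypothetical. It checks a given layout instead of guessing one nondeterministically.

```c
#include <assert.h>
#include <stdbool.h>

enum { V = 4, N = 3, M = 2 };   /* illustrative sizes (assumption); N <= V */

/* True iff the layout satisfies the three constraints stated above:
   address is strictly increasing, mallocP is strictly increasing, and
   the last named location lies below the first allocatable block. */
bool layout_legal(const unsigned address[V], const unsigned mallocP[M]) {
    assert(N >= 1 && N <= V);
    for (unsigned i = 1; i < V; i++)
        if (address[i] <= address[i - 1]) return false;
    for (unsigned i = 1; i < M; i++)
        if (mallocP[i] <= mallocP[i - 1]) return false;
    return address[N - 1] < mallocP[0];
}
```

For instance, with mallocP = {20, 30}, the layout {10, 11, 12, 40} is accepted, whereas {10, 11, 25, 40} is rejected because the last named location would lie above the first block.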

In the first while-block of Fig. 4, arrays *last_write*, *tstamp* and *thread* are nondeterministically assigned to legal values. Additionally, for each LU, timestamps are nondeterministically assigned in increasing order. The local array *ts_used* is used to ensure that different timestamps are assigned to each write in the IMU.
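To make the constraints on the guessed timestamps concrete, the sketch below checks them on a fixed matrix instead of imposing them with assume. It is only an illustration: the sizes V and W are arbitrary and the function name timestamps_legal is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { V = 2, W = 3 };   /* illustrative sizes (assumption) */

/* True iff every timestamp is below V*W, no timestamp occurs twice in the
   whole IMU, and timestamps strictly increase within each LU. */
bool timestamps_legal(unsigned ts[V][W]) {
    bool used[V * W] = { false };
    for (unsigned v = 0; v < V; v++)
        for (unsigned w = 0; w < W; w++) {
            unsigned s = ts[v][w];
            if (s >= V * W || used[s]) return false;   /* bound and distinctness */
            used[s] = true;
            if (w > 0 && s <= ts[v][w - 1]) return false;   /* order within the LU */
        }
    return true;
}
```

The matrix {{0, 2, 5}, {1, 3, 4}} passes all three checks; replacing the 3 by a 2 violates distinctness, and swapping the first two entries of the first row violates the ordering within that LU.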

Legal values of *terminated* and *last_tstamp* are nondeterministically guessed in the second while-block. The rest of *init* initializes *th_next_write* such that for each thread *t* and each location *v*, all the writes from *t* in the *v*-LU are linked in the proper order (value *W* is used as a sentinel to denote the end of each LU).
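The linking of the writes of each thread can be illustrated on a single LU. The sketch below is an assumption-laden simplification: it handles one location only, uses arbitrary sizes W and T, and the name link_writes is hypothetical. It performs the backward pass with the sentinel W described above.

```c
#include <assert.h>

enum { W = 5, T = 3 };   /* illustrative sizes (assumption) */

/* For a single LU whose position w was written by thread[w], fill
   next[i][t] with the first position after i written by t, or with the
   sentinel W if t has no further write in this LU. */
void link_writes(const unsigned thread[W], unsigned next[W][T]) {
    for (unsigned w = 0; w < W; w++)
        assert(thread[w] < T);
    for (unsigned t = 0; t < T; t++) {
        next[W - 1][t] = W;                  /* nothing follows the last position */
        for (int w = W - 2; w >= 0; w--)     /* backward pass over the LU */
            next[w][t] = (thread[w + 1] == t) ? (unsigned)(w + 1)
                                              : next[w + 1][t];
    }
}
```

With thread = {0, 1, 2, 1, 2}, the writes of thread 1 are linked as 0 → 1 → 3 → W and those of thread 2 as 0 → 2 → 4 → W.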

**Auxiliary functions.** We make use of two auxiliary functions illustrated in Fig. 6.

Function *is_terminated* returns 1 if *ret[t]* is already set to 1; otherwise, it nondeterministically chooses either to set *ret[t]* to 1 and then return 1, or to return 0. The purpose of function *Jump* is to determine the position *jump* in the *v*-LU of the write that determines the current value contained in *v*. If the timestamp of the selected write is larger than the current timestamp of the thread, the current timestamp is updated to it.

```c
void init_address()
{
    int i=0;
    while (i<V){
        address[i] = *;
        if(i>0)
            assume( address[i] > address[i-1]);   // increasing addresses
        i=i+1;
    }
}

void init_malloc()
{
    int i=0;
    while (i<M){
        mallocPallocated[i]=0;
        mallocP[i] = *;
        if(i>0)
            assume( mallocP[i] > mallocP[i-1]);   // increasing base addresses
        i=i+1;
    }
    assume( mallocP[0] > address[N-1]);   // blocks lie above the named locations
}
```

Fig. 5. Functions init_address and init_malloc.

```c
bool is_terminated(uint t){
    if(ret[t]||nondet()) {ret[t]=1; return 1;}
    return 0;
}

uint Jump(uint t, uint v){
    uint _jump = *;   // guessed position in the v-LU
    assume( (_jump <= last_write[v])
        && (_jump < th_next_write[v][th_pos[v][t]][t])   // before the next write of t
        && (tstamp[v][_jump+1] > cur_tstamp[t]) );       // following write is still in the future of t
    uint ts_jump = tstamp[v][_jump];
    cur_tstamp[t] = (ts_jump > cur_tstamp[t]) ? ts_jump : cur_tstamp[t];
    return _jump;
}
```

Fig. 6. Auxiliary functions is_terminated and Jump.
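The constraint imposed by Jump can be explored on a concrete LU. The following sketch is restricted to one location and takes the relevant entries of last_write, th_next_write and cur_tstamp as plain parameters; the size W and the names jump_allowed and advance_timestamp are hypothetical. As a further assumption of this sketch, the last position of the LU is treated as having no successor to compare against.

```c
#include <assert.h>
#include <stdbool.h>

enum { W = 4 };   /* illustrative LU length (assumption) */

/* True iff position jump is an admissible guess for Jump on this LU. */
bool jump_allowed(unsigned jump, unsigned last_write, unsigned next_own_write,
                  const unsigned tstamp[W], unsigned cur_tstamp) {
    assert(last_write < W);
    if (jump > last_write || jump >= next_own_write) return false;
    /* the write following jump must lie in the future of the thread */
    if (jump + 1 < W && tstamp[jump + 1] <= cur_tstamp) return false;
    return true;
}

/* Timestamp of the thread after reading the write with timestamp ts_jump. */
unsigned advance_timestamp(unsigned cur_tstamp, unsigned ts_jump) {
    return ts_jump > cur_tstamp ? ts_jump : cur_tstamp;
}
```

With timestamps {0, 3, 5, 9}, last write at position 3, the next own write at position 3 and current timestamp 4, only positions 1 and 2 are admissible: position 0 has already been overwritten at time 3, and position 3 is the thread's own pending write.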

**Thread creation, termination, and join.** The implementations of functions create, terminate and join are shown in Fig. 7.

In function create, if the maximal number of allowed threads is reached, the procedure immediately returns \(-1\), meaning that this thread will never be scheduled. Otherwise, the count of the created threads is incremented and the current timestamp and LU positions of the newly created thread are initialized such that they coincide with those of the parent thread.

The assume statement ensures that no write operations are attributed to the newly created thread before its creation. Since we update the positions of each thread in the LUs forward only, this also ensures that each thread will not use any LU position corresponding to a write operation that is supposed to occur before its creation.

Function terminate checks that all write operations guessed for thread \(t\) have been done (while-loop). Furthermore, the concluding assume checks that the values guessed by function init for terminated\([t]\) and last_tstamp\([t]\) are consistent with the values of ret\([t]\) and cur_tstamp\([t]\) reached at the end of the simulation of \(t\).

```c
int create(void *f, uint pt) {
    if (th_count >= max_th) return -1;   // thread will never be scheduled
    th_count++;
    uint v = 0;
    if (pt == 0) {
        while (v < V) {
            th_pos[v][th_count] = 0;
            v = v + 1;
        }
        cur_tstamp[th_count] = 0;
    } else {
        cur_tstamp[th_count] = cur_tstamp[pt];
        while (v < V) {
            th_pos[v][th_count] = th_pos[v][pt];
            // the first write of the new thread must not precede its creation
            assume(th_next_write[v][0][th_count] >= th_pos[v][th_count]);
            v = v + 1;
        }
    }
    return th_count;
}

void terminate(uint t) {
    uint i, v = 0;
    while (v < V) {
        i = th_pos[v][t];
        // no guessed write of t in the v-LU is left unexecuted
        assume(th_next_write[v][i][t] > last_write[v]);
        v = v + 1;
    }
    assume(ret[t] == terminated[t] &&
           last_tstamp[t] == cur_tstamp[t]);
}

void join(uint t1, uint t2) {
    if (is_terminated(t1)) return;
    uint v = *;
    assume(v < V);
    Jump(t1, v);
    assume((terminated[t2] == 1) ||
           (cur_tstamp[t1] > last_tstamp[t2]));
}
```

Fig. 7. Functions create, terminate and join.

**Read and write operations.** The implementation of functions read and write is illustrated in Fig. 8. For a read operation, the thread under simulation t first jumps forward into the v-LU corresponding to the variable given as parameter by invoking the auxiliary procedure Jump described above and then returns the valuation of the variable at the new position from matrix value.

In a write operation, the thread first jumps to its next write operation for that variable and blocks the simulation if the value disagrees with that in the memory sequence at the new position. Furthermore, we also check that the timestamp associated with the new position is greater than the current timestamp of t; this prevents the simulation of write operations that have already been simulated. Then we update the current position of thread t in the v-LU and the current timestamp.
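The matching of a write against the guessed LU can be sketched in plain C for a single location. This is an illustration only: the types lu_t and view_t, the size W and the function try_write are hypothetical names, the next write of the thread is found by a linear scan instead of the precomputed th_next_write, and a failed match is reported by returning false where the encoding would block the simulation via assume.

```c
#include <assert.h>
#include <stdbool.h>

enum { W = 4 };   /* illustrative LU length (assumption) */

typedef struct {            /* the LU of one location, stored column-wise */
    unsigned thread[W];
    int      value[W];
    unsigned tstamp[W];
    unsigned last_write;
} lu_t;

typedef struct { unsigned pos, cur_tstamp; } view_t;   /* state of one thread */

/* Advances th to the next write of thread t if that write stores val and
   lies in the future of t; otherwise leaves th unchanged. */
bool try_write(const lu_t *lu, view_t *th, unsigned t, int val) {
    assert(lu->last_write < W && th->pos < W);
    unsigned jump = th->pos + 1;
    while (jump < W && lu->thread[jump] != t) jump++;   /* next write by t */
    if (jump > lu->last_write) return false;            /* no write left */
    if (lu->value[jump] != val) return false;           /* wrong guessed value */
    if (lu->tstamp[jump] <= th->cur_tstamp) return false;
    th->pos = jump;
    th->cur_tstamp = lu->tstamp[jump];
    return true;
}
```

For an LU with writers {0, 1, 2, 1}, values {0, 7, 8, 9} and timestamps {0, 2, 4, 6}, thread 1 starting at position 0 can write 7 and then 9, ending at position 3 with timestamp 6; any other value, or a third write, is rejected.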

**Address and malloc operations.** Method address is used to recover the address of a given location \( v \in [0, V - 1] \). The implementation of this method is given in Fig. 9. If \( v \) corresponds to a scalar variable, the method returns the value from address[v]; otherwise it simulates the read operation at that location.

During its execution a thread can request a block of \( n \) consecutive unallocated locations by invoking malloc(n). When malloc is invoked with argument \( n \), a block is chosen nondeterministically, and if its size is at least \( n \) it is allocated by returning its base address. The malloc procedure is implemented as shown in Fig. 9. We first guess a position pos that corresponds to a block that has not yet been allocated, by checking the value of mallocPallocated at that position. Recall that the addresses stored in mallocP are in ascending order; hence, to know whether there is enough space, we simply check that mallocP[pos] + n < mallocP[pos + 1]. Then we set mallocPallocated[pos] to true to indicate that the block at position pos has been allocated. Finally, we return the base address corresponding to position pos.

```c
int read(uint v, uint t) {
    if(is_terminated(t)) return 0;
    uint jump = Jump(t, v);   // position of the write that is read
    return value[v][jump];
}

void write(uint v, int val, uint t) {
    if(is_terminated(t)) return;
    uint i, jump;
    i = th_pos[v][t];
    jump = th_next_write[v][i][t];   // next write of t in the v-LU
    assume((jump <= last_write[v])
             && (value[v][jump] == val)
             && (tstamp[v][jump] > cur_tstamp[t]));
    th_pos[v][t] = jump;
    cur_tstamp[t] = tstamp[v][jump];
}

int ind_read(uint addr, uint t) {
    if(is_terminated(t)) return 0;
    uint pos = *;   // guessed location whose address is addr
    assume(pos < V);
    assume(address(pos, t) == addr);
    return read(pos, t);
}

void ind_write(uint addr, int val, uint t) {
    if(is_terminated(t)) return;
    uint pos = *;   // guessed location whose address is addr
    assume(pos < V);
    assume(address(pos, t) == addr);
    write(pos, val, t);
}
```

Fig. 8. Read and write functions.

```c
int address (uint v, uint t) {
    if(is_terminated(t)) return 0;
    if(v < N) return address[v];   // named location
    return read(v, t);
}

int malloc(uint n, uint t) {
    if(is_terminated(t)) return 0;
    uint pos = *;   // guessed block
    assume(pos < M);
    assume(!mallocPallocated[pos]);
    assume(mallocP[pos] + n < mallocP[pos + 1]);   // block is large enough
    mallocPallocated[pos] = 1;
    return mallocP[pos];
}
```

Fig. 9. Functions address and malloc.
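The block selection performed by malloc can be made concrete with a deterministic check over a small heap. The sketch below is hypothetical in its names (heap_t, block_fits, take_block) and size M; it also assumes that the last block is never selected, since it has no successor base address against which the size test could be evaluated.

```c
#include <assert.h>
#include <stdbool.h>

enum { M = 3 };   /* illustrative number of blocks (assumption) */

typedef struct {
    unsigned base[M];        /* plays the role of mallocP */
    bool     allocated[M];   /* plays the role of mallocPallocated */
} heap_t;

/* True iff block pos is free and large enough for n locations. */
bool block_fits(const heap_t *h, unsigned pos, unsigned n) {
    assert(pos < M);
    if (h->allocated[pos]) return false;
    if (pos + 1 >= M) return false;   /* no successor to compare against */
    return h->base[pos] + n < h->base[pos + 1];
}

/* Marks block pos as allocated and returns its base address. */
unsigned take_block(heap_t *h, unsigned pos) {
    assert(pos < M && !h->allocated[pos]);
    h->allocated[pos] = true;
    return h->base[pos];
}
```

With base addresses {100, 104, 120}, block 0 can serve a request of size 3 but not of size 4, and block 1 can serve a request of size 10 exactly once.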

**Ind_read and ind_write operations.** When a read or write operation is performed using a memory address, e.g. *p = 3 for a pointer variable p, we invoke the ind_read and ind_write methods. The implementation of these procedures is straightforward (see Fig. 8). We first guess the location whose address matches the given parameter and then simulate the read/write operation at that location.

**Lock and unlock mutex variables.** A thread can take or release a lock on a shared mutex variable by calling the procedures lock and unlock, respectively; their implementations are given in Fig. 10. For a mutex variable, we assign value 0 when the lock is not held by any thread, and we assign value t if the mutex is held by thread t.

```c
void lock(uint mut, uint t) {
    write(mut, t, t);
    // the previous value of mut must be 0 (mutex was free)
    assume(ret[t] ||
           value[mut][th_pos[mut][t]-1]==0);
}

void unlock(uint mut, uint t) {
    write(mut, 0, t);
    // the previous value of mut must be t (mutex was held by t)
    assume(ret[t] ||
           value[mut][th_pos[mut][t]-1]==t);
}
```

Fig. 10. Functions lock and unlock.

For an efficient implementation, we modify the value of variable mut using a write operation. For a lock operation we first write the value of t in mut; however, it may be the case that the mutex was already held by some thread. Thus, we check that the previous value of mut was 0. The implementation of unlock is similar; the only differences are that we write 0 into mut and check that the previous value of mut was t. Note that two consecutive write operations of mut are performed by the same thread (lock and unlock). Furthermore, the values written at the even positions of the mut-LU are always 0. These constraints can be added to the init function to reduce the number of runs to consider.
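The two regularities of the mut-LU noted above can be phrased as a predicate that init could impose. The sketch below is one possible reading, not the constraint used in the tool: it assumes that position 0 holds the initial value 0, so that locks sit at odd positions and unlocks at even ones, and the names W and mutex_lu_wellformed are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { W = 5 };   /* illustrative LU length (assumption) */

/* True iff, up to last_write, even positions hold 0, odd positions hold
   the identifier of the writing thread, and each lock is followed by an
   unlock of the same thread (if the unlock has been executed at all). */
bool mutex_lu_wellformed(const unsigned thread[W], const unsigned value[W],
                         unsigned last_write) {
    assert(last_write < W);
    for (unsigned i = 0; i <= last_write; i++) {
        if (i % 2 == 0) {
            if (value[i] != 0) return false;              /* unlocked */
        } else {
            if (value[i] != thread[i]) return false;      /* held by the writer */
            if (i + 1 <= last_write && thread[i + 1] != thread[i])
                return false;                             /* foreign unlock */
        }
    }
    return true;
}
```

For example, writers {0, 1, 1, 2, 2} with values {0, 1, 0, 2, 0} are accepted, whereas a guess in which thread 2 releases a mutex acquired by thread 1 is rejected before any thread is simulated.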

#### D.2 IMU implementation for TSO

We describe this implementation incrementally, as a modification of the one given for SC.

To be consistent with the notation used in the implementation for SC, we use tstamp[v][i] to store the update timestamp concerning the (i+1)th write of location v, and cur_tstamp[t] to keep track of the current timestamp in the execution of thread t (i.e., the occurrence timestamp of the last occurred read or write). Additionally, we use two new arrays btstamp (buffer timestamps) and ts_lastW such that:

- btstamp[v][i] is the occurrence timestamp of the (i+1)th write of v (that is also the time at which it is stored in the local buffer of the thread that performs the write operation);
- ts_lastW[t] is the update timestamp of the last occurred write by thread t.

To implement the SMA API, we only need to give an implementation of fence and modify those given for SC of init, read, write, lock and unlock. The rest of the implementation is the same as for SC.

For init, we add to the implementation given for SC the following. We nondeterministically guess initial values for btstamp[v][i] and then impose that btstamp[v][i] ≤ tstamp[v][i] must hold (i.e., the update of the shared memory according to an occurred write may be delayed w.r.t. its occurrence time).

Note that here we slightly diverge from the transition system $\mathcal{M}_\text{imu}$ described in Section 5. In fact, since we do not require any other condition on the guessed update timestamps, we can carry over an IMU with timestamps that may violate the FIFO policy on the store buffers. This is fixed by checking the proper ordering when matching the writes (we return to this when discussing the write implementation).
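The FIFO requirement on the guessed timestamps can be stated as a check over the writes of a single thread. The sketch below is an offline formulation chosen for illustration; the encoding itself enforces the condition incrementally inside write. The type wstamp_t and the function fifo_consistent are hypothetical names, and timestamps are assumed to be pairwise distinct.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned occ;   /* occurrence timestamp (btstamp in the text) */
    unsigned upd;   /* update timestamp (tstamp in the text) */
} wstamp_t;

/* True iff no write of the thread reaches memory before it occurs and
   the writes reach memory in the order in which they entered the buffer. */
bool fifo_consistent(const wstamp_t *w, unsigned n) {
    assert(n == 0 || w != 0);
    for (unsigned i = 0; i < n; i++) {
        if (w[i].occ > w[i].upd) return false;
        for (unsigned j = i + 1; j < n; j++)
            if ((w[i].occ < w[j].occ) != (w[i].upd < w[j].upd))
                return false;   /* buffer order and memory order disagree */
    }
    return true;
}
```

The writes {(1, 4), (2, 5), (3, 6)} are consistent with a FIFO buffer, whereas {(1, 5), (2, 4)} are not: the second write occurs later but is guessed to reach the memory earlier.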

Function lock from Fig. 10 is modified such that the write is done by a routine Write_SC that is exactly the write given for SC instead of the write for TSO. This ensures that lock acquisition is immediately visible to all the other threads. For function unlock, we do the same and additionally call fence before returning. This way, all the writes that occurred in the critical section become immediately visible to all the other threads.

```c
int read(uint v, uint t){
  if(is_terminated(t)) return 0;
  uint i, ts_jump = *;   // guessed time at which the read occurs
  i = th_pos[v][t];
  uint nxt_write = th_next_write[v][i][t];
  uint fst_write = th_next_write[v][0][t];
  assume (
    (ts_jump >= cur_tstamp[t]) &&
    (ts_jump < btstamp[v][nxt_write])
  );
  cur_tstamp[t] = ts_jump;
  // last write of v by t is still in the store buffer of t
  if( (fst_write <= i) &&
    (tstamp[v][i] > cur_tstamp[t])
  )
    return value[v][i];
  return Read_SC(v,t);
}

void fence(uint t){
  // never move the time of t backwards
  if(ts_lastW[t] > cur_tstamp[t])
    cur_tstamp[t] = ts_lastW[t];
}

void write(uint v, int val, uint t){
  if(is_terminated(t)) return;
  uint i, jump;
  i = th_pos[v][t];
  jump = th_next_write[v][i][t];
  th_pos[v][t] = jump;
  assume(
    (btstamp[v][jump] > cur_tstamp[t]) &&
    (value[v][jump] == val) &&
    (tstamp[v][jump] > ts_lastW[t])   // FIFO order of the store buffer
  );
  ts_lastW[t] = tstamp[v][jump];
  cur_tstamp[t] = btstamp[v][jump];
}
```

Fig. 11. Functions read, fence and write for TSO.

The code of functions fence, read and write is shown in Fig. 11.

A memory fence flushes the store buffer of the thread executing it and thus we need to synchronize the current thread timestamp with its last update timestamp. Namely, if ts_lastW[t] is larger than the current timestamp cur_tstamp[t], we assign ts_lastW[t] to cur_tstamp[t]. Note that if this is not the case then the local store buffer of t is certainly empty (recall that btstamp[v][i] <= tstamp[v][i]).

Function read first nondeterministically updates the current timestamp of thread t to a value that is not smaller than the current timestamp of t and is smaller than the occurrence timestamp of the next write of v by t. Now, if at least one write of location v by t has occurred and the last write of v by t is still in the thread buffer, then we return the value of this write. Otherwise, a read from the shared memory is performed by invoking the auxiliary function Read_SC, which is exactly the function read from Fig. 8.

Observe that the update of the current thread timestamp by read can legitimately cause this value to be larger than the update timestamp of the last write. To avoid wrongly moving the time back, in fence we make the assignment only when this is not the case.
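The interplay between fence and the buffer test in read can be illustrated with two small helper functions. They are a sketch only: the names fence_timestamp and served_from_buffer are hypothetical, and the relevant array entries are passed as plain parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Timestamp of a thread after a fence: the last update timestamp is
   adopted only if it lies ahead of the current timestamp. */
unsigned fence_timestamp(unsigned cur_tstamp, unsigned ts_lastW) {
    return ts_lastW > cur_tstamp ? ts_lastW : cur_tstamp;
}

/* True iff a read is served from the store buffer: the thread has already
   written the location (fst_write <= pos) and the write at pos has not yet
   reached the memory (its update timestamp is still in the future). */
bool served_from_buffer(unsigned fst_write, unsigned pos,
                        unsigned upd_at_pos, unsigned cur_tstamp) {
    return fst_write <= pos && upd_at_pos > cur_tstamp;
}
```

A thread at time 6 whose last write has update timestamp 10 reads that write from its buffer; after a fence its time is 10, so the same read is served from the shared memory instead.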

Function write first updates the current position in the v-LU of thread t to the next write, provided that the time of occurrence of this write is larger than the current thread timestamp, the value of the write matches the guessed value for it, and the update timestamp of the next write is larger than that of the last occurred write (the last condition ensures that the thread store buffers are emptied according to a FIFO policy). Note that, in the case of a wrong guess of the update timestamps in init, this condition would not hold and thus the execution would abort. Before returning, the update timestamp of the last write and the current timestamp of thread t are modified consistently.

#### D.3 IMU implementation for PSO

We can give an implementation of SMA for PSO by slightly modifying the implementation given for TSO as follows.

We use a new array \( \text{max\_tSW} \) in place of \( \text{ts\_lastW} \) and change a few lines in the implementation of function \( \text{write} \). Array \( \text{max\_tSW} \) maintains for each thread \( t \) the maximum update timestamp among all the occurred writes of \( t \).

In function \( \text{write} \) (Fig. 12), we no longer require that the update timestamp of the current write is larger than the update timestamp of the previous write by \( t \). Recall that this was required in the TSO implementation in order to ensure that for each thread \( t \), the guessed occurrence and update timestamps for the sequence of writes by \( t \) (that may be contained in different LUs) are indeed consistent with the FIFO policy of a store buffer; in PSO we only need to ensure that the FIFO policy holds for each of the maximal subsequences containing all the writes of the same location, which is ensured by the remaining constraints and function \( \text{init} \). Moreover, the update of \( \text{ts\_lastW}[t] \) is replaced with the update of \( \text{max\_tSW}[t] \) as follows:

\[
\text{max\_tSW}[t] = (\text{tstamp}[v][\text{jump}] > \text{max\_tSW}[t]) \;?\; \text{tstamp}[v][\text{jump}] : \text{max\_tSW}[t];
\]

Fig. 12. Function write for PSO.
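The difference between the two write implementations can be isolated in two small functions. The sketch is illustrative only: tso_accepts and pso_track are hypothetical names, and the per-thread array entries are passed as plain parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* TSO: a write is accepted only if its update timestamp exceeds that of
   the previous write of the same thread (ts_lastW). */
bool tso_accepts(unsigned upd, unsigned ts_lastW) {
    return upd > ts_lastW;
}

/* PSO: no such check across locations; the thread only records the
   maximum update timestamp seen so far (max_tSW). */
unsigned pso_track(unsigned max_tSW, unsigned upd) {
    return upd > max_tSW ? upd : max_tSW;
}
```

Suppose a thread writes x with update timestamp 8 and then y with update timestamp 5. Under TSO the second write is rejected, since 5 is not larger than 8; under PSO it is allowed and the recorded maximum stays at 8.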