Avoiding Locks: Read And Write Ordering

Sometimes it is possible to avoid locking. Consider the following case from the 2.2 firewall code, which inserted an element into a single linked list in user context:

        new->next = i->next;
        i->next = new;
    

Here the author (Alan Cox, who knows what he's doing) assumes that the pointer assignments are atomic. This is important, because networking packets would traverse this list on bottom halves without a lock. Depending on their exact timing, they would either see the new element in the list with a valid next pointer, or it would not be in the list yet.

Of course, the writes must be in this order, otherwise the new element appears in the list with an invalid next pointer, and any other CPU iterating at the wrong time will jump through it into garbage. Because modern CPUs reorder, Alan's code actually read as follows:

        new->next = i->next;
        wmb();
        i->next = new;
    

The wmb() is a write memory barrier (include/asm/system.h): neither the compiler nor the CPU will allow any writes to memory after the wmb() to be visible to other hardware before any of the writes before the wmb().

As i386 does not do write reordering, this bug would never show up on that platform. On other SMP platforms, however, it will.

There is also rmb() for read ordering: to ensure any previous variable reads occur before following reads. The simple mb() macro combines both rmb() and wmb().

Dropping or gaining a spinlock, and any atomic operation are all defined to act as memory barriers (ie. as per the mb() macro).

There is a similar, but unrelated, problem with code like the following:

        if (!(ctrack->status & IPS_CONFIRMED)) {
                spin_lock_bh(&ip_conntrack_lock);
                if (!(ctrack->status & IPS_CONFIRMED)) {
                        clean_from_lists(h->ctrack);
                        h->ctrack->status |= IPS_CONFIRMED;
                }
                spin_unlock_bh(&ip_conntrack_lock);
        }
    

In this case, the author has tried to be tricky: knowing that the CONFIRMED bit is set and never reset in the status word, you can test it outside the lock, and frequently avoid grabbing the lock at all. However, the compiler could cache the value in a register, rather than rereading it once the lock is obtained, creating a subtle race. The way to get around this is to declare the status field `volatile', or use a temporary volatile pointer to achieve the same effect in this one place.