Anatomy of a Bug


Disclaimer: The following article requires at least some knowledge of how the Internet Protocol (IP) and hardware timers in microcontrollers work, as well as some passing knowledge of the C programming language, or at least of some programming language.

Introduction

A bug in Neighbor Discovery or RPL?

At the beginning of April 2020, Hauke, a colleague of mine, contacted me, as one of the experts on the network infrastructure of the operating system RIOT, asking if I could help him with a problem of his. During his experiments running the routing protocol RPL over Bluetooth Low Energy (BLE) using RIOT, he encountered increased multicast traffic and forwarding problems for his data traffic after roughly 70 minutes. To him, the main source of the multicast traffic seemed to be the Neighbor Discovery Protocol, so I, as the author of that part of RIOT, was his first point of contact. In need of some change of scenery due to the ongoing physical distancing in Germany at that time, I decided to take on the job of hunting that bug.

NDP

The Neighbor Discovery Protocol (NDP) of the Internet Protocol, version 6 (IPv6) has—among others—two major tasks:

  • address resolution and neighbor unreachability detection, and
  • router and prefix discovery.

For address resolution and neighbor unreachability detection, two message types are used: Neighbor Solicitations (NSs) and Neighbor Advertisements (NAs). Whenever the IPv6 address of a next hop is unknown to the sending node, that node sends a multicast NS to a group the searched node is expected to have joined at start-up. This NS contains the link-layer address of the sending node as an option. The searched node then replies with an NA that contains its own link-layer address as an option. This way, both nodes know each other's link-layer addresses, and the sending node is assured that the searched node is reachable.
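
The multicast group used for the NS is the target's solicited-node multicast address (ff02::1:ffXX:XXXX), which every node joins for each of its unicast addresses and which is formed from the last 24 bits of the target address. A minimal sketch of that derivation, using a hypothetical helper rather than RIOT's actual code:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: derive the solicited-node multicast address
 * ff02::1:ffXX:XXXX (RFC 4291) from a target IPv6 address.
 * Both `target` and `out` are 16-byte addresses in network byte order. */
static void solicited_node_addr(const uint8_t target[16], uint8_t out[16])
{
    static const uint8_t prefix[13] = {
        0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
    };

    memcpy(out, prefix, sizeof(prefix));
    memcpy(&out[13], &target[13], 3);   /* last 24 bits of the target */
}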

There exists an optimization for neighbor discovery when using IPv6 over Low-Power Wireless Personal Area Networks (6LoWPANs). This optimization assumes that if the searched IPv6 address is a link-local unicast address,

  1. the address was generated from the link-layer address and
  2. the node with this address is reachable.

As such, the sending node can simply resolve the address by reversing the algorithm from 1. and directly unicast the addressed packet, without requiring the NS/NA handshake. This allows ND to mostly avoid power-intensive multicast and use unicast instead.
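
For addresses whose interface identifier was generated from an EUI-64 link-layer address (as with IEEE 802.15.4 long addresses), that reversal boils down to taking the lower 64 bits of the IPv6 address and flipping the universal/local bit back. A sketch of the idea, again with a hypothetical helper rather than RIOT's actual implementation:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: recover the EUI-64 a link-local address was
 * generated from by reversing the IID construction of RFC 4291, appendix A.
 * `ipv6` is the 16-byte address, `eui64` receives the 8-byte EUI-64. */
static void iid_to_eui64(const uint8_t ipv6[16], uint8_t eui64[8])
{
    memcpy(eui64, &ipv6[8], 8);   /* the IID is the lower 64 bits */
    eui64[0] ^= 0x02;             /* undo the universal/local bit inversion */
}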

Router and prefix discovery also uses two message types: Router Solicitations (RSs) and Router Advertisements (RAs). From start-up on, routers periodically advertise themselves using multicast RAs sent to the link-local all-nodes group. Those RAs also contain information about the prefixes of the subnets they route to from the Internet backbone. Nodes within the network may send a multicast RS to the link-local all-routers group, a special multicast address, so they do not have to wait for the next RA from the upstream router.

In the 6LoWPAN optimization, again to avoid power-intensive multicast messages, only unicast RAs are sent, and only in reaction to a soliciting RS (multicast or unicast). As such, downstream nodes must use multicast router solicitations to find their upstream routers. Likewise, routers only start advertising themselves to downstream routers once they have received an RA from an upstream router. The only router advertising itself from start-up on is the border router to the wider Internet, making it the main controlling node from NDP's perspective. Once a node has found an upstream router, it only uses unicast RSs to update the information it received from that router via the RA.

RPL

RPL (typically pronounced like “ripple”) is the IPv6 Routing Protocol for Low-Power and Lossy Networks (LLNs). It organizes the nodes of a 6LoWPAN (or other IPv6-based LLNs) into a destination-oriented directed acyclic graph (DODAG). In contrast to a conventional directed acyclic graph (DAG), all paths of a DODAG end in a single node. This destination node is called “the root” in RPL. Commonly, in a 6LoWPAN the root is the same node as the border router, but it does not have to be. In fact, RPL allows the construction of multiple DODAGs with different roots over the same network.

The RPL DODAG is constructed using several message types. DODAG Information Objects (DIOs) carry information about the DODAG and are disseminated from the root over the downstream routers throughout the network. Nodes within the network can use DODAG Information Solicitations (DISs) to solicit DIOs from upstream routers. Based on the information in the DIOs, a node decides which nodes are its parents in the DODAG and sets its default route accordingly. Among other information, similar to the RAs in NDP, DIOs also carry the prefix of the subnet the DODAG represents.

Destination Advertisement Objects (DAOs) are used by routers to advertise their downstream routes to their parents (in RPL storing mode) or the root (in RPL non-storing mode), respectively. The addressee of a DAO acknowledges its reception with a Destination Advertisement Object Acknowledgement (DAO-ACK).

xtimer and evtimer

To multiplex and abstract a platform's hardware timer, RIOT uses the module xtimer. xtimer allows a user to allocate their own xtimer objects, which are managed by the xtimer module as a list. xtimer abstracts the hardware timer as a 64-bit microsecond timer, a time interval big enough not to overflow for nearly 585 000 years. The keen observer might notice, however, that a 32-bit microsecond timer overflows after an interval of about 71 minutes, awfully close to the 70-minute mark Hauke reported.
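
For scale, here is the back-of-the-envelope arithmetic behind those two numbers as a stand-alone C program (not RIOT code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* a 32-bit microsecond counter wraps after 2^32 us */
    double wrap32_minutes = ((double)UINT32_MAX + 1.0) / 1e6 / 60.0;
    /* a 64-bit microsecond counter wraps after ~2^64 us (years of 365 days) */
    double wrap64_years = (double)UINT64_MAX / 1e6 / 3600.0 / 24.0 / 365.0;

    printf("32-bit us timer wraps after ~%.1f minutes\n", wrap32_minutes); /* ~71.6 */
    printf("64-bit us timer wraps after ~%.0f years\n", wrap64_years);     /* ~584942 */
    return 0;
}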

To better manage things like network control protocols, with their multiple timeouts and send timers that operate on millisecond or even larger scales rather than the microsecond scale of xtimer, RIOT also offers the event timer evtimer. evtimer operates on a 32-bit millisecond timer and organizes timeouts and other timing events as entries in an event queue, separated by relative time offsets. The evtimer implementation wraps around xtimer, and each evtimer object contains exactly one xtimer object.
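
The "relative time offsets" mean that only the queue head carries a time measured against the underlying xtimer; every later entry just stores the difference to its predecessor, so adding or removing an event only touches its neighbors. A simplified sketch of such a delta queue, illustrating the concept rather than evtimer's actual code:

#include <stdint.h>

/* Simplified event entry: a singly-linked list node whose offset is
 * relative to its predecessor (or to "now" for the queue head). */
typedef struct event {
    struct event *next;
    uint32_t offset_ms;
} event_t;

/* Insert `ev`, due `when_ms` from now, into the delta queue at `*head`. */
static void delta_queue_add(event_t **head, event_t *ev, uint32_t when_ms)
{
    while (*head && (*head)->offset_ms <= when_ms) {
        when_ms -= (*head)->offset_ms;     /* walk past earlier events */
        head = &(*head)->next;
    }
    ev->offset_ms = when_ms;
    ev->next = *head;
    if (ev->next) {
        ev->next->offset_ms -= when_ms;    /* keep the successor relative */
    }
    *head = ev;
}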

Both the NDP and the RPL implementation in RIOT's default network stack GNRC use evtimer to time their events.

Isolating the Behavior

Together with Cenk, the original author of the RPL implementation, I first tried to reproduce this bug. While we knew that Hauke was able to reproduce it with 4 BLE nodes and even 15 IEEE 802.15.4 nodes, from my experience these numbers are way too big to trace such a bug. So we tried to reproduce at least parts of the bug in simpler setups first. We tried:

  • 3 IEEE 802.15.4 nodes with router solicitations enabled and disabled,
  • 2 IEEE 802.15.4 nodes with router solicitations enabled and disabled, and
  • 2 Ethernet (RIOT native) nodes with router solicitations enabled and disabled

Each setup had one RPL root, with the remaining nodes being normal RPL routers, and there was no 6LoWPAN border router configured.

In contrast to Hauke's setup, none of those were sending any data traffic; they were mostly idling, with the occasional control (NDP or RPL) message being sent automatically by the respective protocol implementation.

For all of our setups we saw that, on average, the RPL traffic increased drastically after ~70 minutes. However, the most reliable and most minimal setup was two IEEE 802.15.4 nodes with RS enabled, so we decided to use that for all our later steps. Mind that, to isolate this, we had to run each setup for longer than 70 minutes until the bug occurred, so it took us some time to reach that minor checkpoint on our road to a solution.

Below you can see a particularly bad example from a run of two nodes with RSs enabled on RIOT's release candidate 1 for the 2020.04 release:

Control traffic in 2020.04-RC1

We mostly used these kinds of plots to check for the occurrence of the bug.

We see three plots: one for the sent traffic of both nodes combined, and the remaining two for each node individually, determined by the link-layer source address of the sent packets.

The plots are histograms, binned per minute of the run's duration on the x-axis, with the number of packets of each type sent within that minute—distinguished by color—on the y-axis, capped at a total of 50 packets. The plot was generated from the PCAP data of an IEEE 802.15.4 sniffer from RIOT's application repository that we ran during our test runs. 86:0c:77:ec:23:33:62:78 was configured to be the RPL root; 2a:ab:dc:15:54:01:64:78 joins the RPL DODAG as a downstream router. As expected with the 6LoWPAN optimization for NDP, the only NDP messages we see, since we have no 6LoWPAN border router, are multicast router solicitations, which, due to a truncated exponential backoff, are sent in rapid succession in the first minute but then cap out at once per minute per node later.

The timing of multicast RPL DIOs is controlled by a trickle algorithm, so many multicast DIOs are expected from each node during the first minute as well, with later multicast DIOs being spaced out more and more the longer our run lasts. Unicast DIOs should only be sent by upstream routers—the root in our case—in reaction to a unicast or multicast DIS from downstream routers. Multicast DISs should only be visible at the beginning of the run, when the nodes have no parent yet, while later on, unicast DISs should only be sent every five minutes. The downstream routers report their downstream routes every minute via a unicast DAO, which is acknowledged by a DAO-ACK. So for our setup we should only see one DAO from the downstream router per minute, with an answering DAO-ACK.
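
For context, the trickle algorithm (RFC 6206) doubles its interval every time it expires, up to a maximum, and resets it to the minimum when an inconsistency is detected, which is exactly why DIOs are frequent at the start and thin out later. A stripped-down sketch of just that interval logic, leaving out the redundancy counter and the randomized transmission point:

#include <stdint.h>

/* Hypothetical, reduced trickle interval state (see RFC 6206 for the
 * complete algorithm). */
typedef struct {
    uint32_t i_min_ms;   /* smallest interval */
    uint32_t i_max_ms;   /* largest interval */
    uint32_t i_ms;       /* current interval */
} trickle_interval_t;

/* When the current interval expires, the next one is twice as long,
 * capped at the maximum. */
static void trickle_interval_expired(trickle_interval_t *t)
{
    t->i_ms = (t->i_ms > t->i_max_ms / 2) ? t->i_max_ms : (2 * t->i_ms);
}

/* An inconsistency (e.g. new DODAG information) resets the interval,
 * making multicast DIOs frequent again. */
static void trickle_inconsistency(trickle_interval_t *t)
{
    t->i_ms = t->i_min_ms;
}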

However, as seen in our plot, the downstream router starts spamming DAOs and DISs at some point around the ~70 minute mark, confirming some of the traffic patterns Hauke observed. While we did not test for route loss and did not see any multicast neighbor discovery messages, we figured for now that these were side effects of the DISs being spammed, since after three unanswered DISs the route to the parent is dropped. This in turn leads NDP to fall back to classic address resolution for global addresses, since the prefix configured by the DIOs only times out later and until then marks the addresses as on-link, leading to multicast traffic by NDP.

Looking into the PCAPs of the sniffer in detail, we also observed that DAOs and DISs seemed to synchronize. The control traffic of RPL before the bug occurred looked like this:

RPL message flow before the bug

DAOs and DISs were sent separately and responded to by the addressee accordingly.

After the bug occurred however, it looked more like this:

RPL message flow when the bug occurs

DAOs and DISs were always sent together and the respective responses came only after the DIS was sent.

While we did not dismiss outright the fact that the ~70 minutes were awfully close to the integer overflow of a 32-bit microsecond timer, we decided to look into this synchronization behavior first, which, due to the differing sending times of DAOs and DISs, was just as likely to have happened by chance exactly around the 70-minute mark. We also felt reassured by the fact that it did not happen exactly around the 71-minute mark but only some minutes later, and that the timers we worked with were a 64-bit microsecond timer and a 32-bit millisecond timer after all…

The Hunt for Offset 0

So we first tried to figure out why this synchronization of RPL messages happened and how those messages then resulted in the spamming behavior. After a lot more runs with debug messages placed strategically within the code (so as not to disrupt the expected flow), we figured out that whenever the send timer for the DAO messages fired, the reset of that timer using evtimer_del() while that event was being handled somehow resulted in the offset calculation for the new queue head of the evtimer yielding 0:

static void _update_head_offset(evtimer_t *evtimer)
{
    if (evtimer->events) {
        evtimer_event_t *event = evtimer->events;
        event->offset = _get_offset(&evtimer->timer);
        DEBUG("evtimer: _update_head_offset(): new head offset %" PRIu32 "\n", event->offset);
    }
}

We confirmed that, with high likelihood, this new event queue head was the send timer for the DIS message, as it has a similar degree of periodicity. As such, that event fired immediately after the DAO send event was handled by the RPL implementation.

That explained the synchronization, but why was the offset suddenly calculated to be 0? When we looked at earlier references to the offset in the log, it was as we expected it to be. So we dug deeper and eventually were able to confirm that, indeed, the evtimer thought that the time to trigger this event had already passed, so it had to be triggered:

static uint32_t _get_offset(xtimer_t *timer)
{
    uint64_t now_us = xtimer_now_usec64();
    uint64_t start_us = _xtimer_usec_from_ticks64(
                        ((uint64_t)timer->long_start_time << 32) | timer->start_time);
    uint64_t target_us = start_us + _xtimer_usec_from_ticks64(
                        ((uint64_t)timer->long_offset) << 32 | timer->offset);

    if (target_us <= now_us) {
        return 0;
    }
    /* ... */
}

In the snippet above, target_us was indeed way less than now_us when the bug occurred; in fact, after some more digging, it seemed like the whole upper half of the 64-bit number target_us was missing.

Further investigation then showed that the timer->long_start_time member of the evtimer's xtimer, which was used to calculate start_us, which in turn was part of the sum that makes up target_us, was always 0 for this particular timer at this particular time. However, we knew that timer->long_start_time was derived from the current system time, and now_us had long passed the 2³² mark, so it was out of a 32-bit number's range. Even when looking at the xtimer when it was set, we saw that timer->long_start_time had a value greater than 0; just at the point when it was important for that value to be correct, it was not.

At that point we basically decided to finish up our current run, summarize our results, and hand them over to the xtimer developers, as this seemed to be a bug in xtimer. However, just when this run was about to be stopped, we had the idea to use gdb, which up to this point we had not been able to use due to the long run-time and the apparent timing constraints we had already run into when using print messages to get the bug to occur. Using gdb, we put a watch-point on timer->long_start_time for when it became 0 while now_us was greater than 2³². Of course this found some instances when the timer was reset or when it fired; this is expected, since xtimer is supposed to clean itself up. Yet we also saw it happen here, in the xtimer function _update_short_timers():

else {
    timer->offset -= elapsed;
    timer->start_time = (uint32_t)*now;
    timer->long_start_time = (uint32_t)(*now >> 32);
    timer = timer->next;
}

Now it was only a matter of minutes to find out why *now suddenly became less than 2³², and it was obvious once we saw it. Just above the snippet shown before—within the same iteration—the pointed-to value was set to a new value, but using the 32-bit version of the system time getter _xtimer_now() instead of _xtimer_now64():

if (timer->offset < elapsed || timer->offset - elapsed < XTIMER_ISR_BACKOFF) {
    /* ... */
    timer = timer_list_head;
    /* update current_time */
    *now = _xtimer_now();
}

In contrast to that, the variable it points to was originally initialized using _xtimer_now64():

    now = _xtimer_now64();

update:
    /* update short timer offset and fire */
    _update_short_timers(&now);

So that explained why only certain timers (in our case RPL's evtimer) were affected by this bug and only at certain times. It would only show up when enough xtimers were registered that the iteration could touch both *now and a following timer's timer->long_start_time, and only if _xtimer_now64() would return a value > 2³². It was an integer overflow after all, caused by the wrongly typed function being used to set the current time that updates the 64-bit start time of a timer. The compiler did not warn us about the usage of the wrong function, as both versions return an unsigned number and the value space of 32-bit numbers fits nicely into the 64-bit number space. (Damn you, implicit casting!)
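
The pitfall is easy to reproduce outside of xtimer: assigning the result of a function that returns a 32-bit unsigned integer to a 64-bit variable compiles without a peep and silently zeroes everything above bit 31. A self-contained illustration with made-up getters (not RIOT code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t fake_clock = (2LLU << 32) + 12345;  /* "system time" past 2^32 */

static uint32_t now32(void) { return (uint32_t)fake_clock; }  /* truncates */
static uint64_t now64(void) { return fake_clock; }

int main(void)
{
    uint64_t now;

    now = now64();  /* correct: upper half preserved */
    printf("now64(): %" PRIu64 "\n", now);

    now = now32();  /* no warning: upper 32 bits silently become 0 */
    printf("now32(): %" PRIu64 "\n", now);
    return 0;
}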

When we replaced _xtimer_now() with _xtimer_now64() and reran our setup, the control traffic now behaved, with only one DIS/DIO pair 5 minutes after the last received DIO and one DAO/DAO-ACK pair every minute:

Control traffic after the fix

Reinvigorated by this discovery, I tried to reproduce the bug in an isolated test case.

Reproducing the Bug

I quickly wrote up a test that set the system clock to a time past 2³² and set two xtimers a few seconds apart. However, nothing happened when I reverted our fix. When I checked the long_start_time member of each timer, however, the second one was wrong, as we expected, so I added a check for that, made a PR with our fix and the test, and waited for a review.

Since a fix for the bug was highly anticipated—apart from Hauke, both Cenk's professor and mine got involved over the days this bug hunt stretched out, and we also contacted Kaspar, the original author of evtimer, for help—reviews came in quickly, but the reviewers were not able to reproduce the bug with my test… Probably the debug output I still had in my code from previous test runs skewed the results in favor of the test.

But then I noticed that the xtimer list head would always return to the right state due to these lines:

    _update_short_timers(&now);
    /* update long timer offset */
    _update_long_timers(&now);
    /* update current time */
    now = _xtimer_now64();

    if (timer_list_head) {
        /* make sure we're not setting a time in the past */
        uint32_t elapsed = (uint32_t)now - timer_list_head->start_time;
        if (timer_list_head->offset < elapsed ||
            timer_list_head->offset - elapsed < XTIMER_ISR_BACKOFF) {
            goto update;
        }
        else {
            timer_list_head->offset -= elapsed;
            timer_list_head->start_time = (uint32_t)now;
            timer_list_head->long_start_time = (uint32_t)(now >> 32);
        }
    }

Since we only had two timers, the first timer fired in _update_short_timers() and the second got broken there. However, the second then became the new timer_list_head, and its long_start_time got fixed again.

So all the test needed was a third timer that would also get broken in _update_short_timers() but not become the head (as the second would already be the head), and that did the trick:

/*
 * Copyright (C) 2020 Freie Universität Berlin
 *
 * This file is subject to the terms and conditions of the GNU Lesser
 * General Public License v2.1. See the file LICENSE in the top level
 * directory for more details.
 */

/**
 * @{
 *
 * @file
 * @author  Martine Lenders <m.lenders@fu-berlin.de>
 */

#include <stdio.h>

#include "fmt.h"
#include "kernel_defines.h"
#include "msg.h"
#include "test_utils/expect.h"
#include "thread.h"
#include "xtimer.h"

#define MAIN_MSG_QUEUE_SIZE     (4U)
#define TIMERS_NUMOF            (3U)

msg_t _main_msg_queue[MAIN_MSG_QUEUE_SIZE];
static const uint64_t _timers_offsets[TIMERS_NUMOF] = {
    /* MUST ASCEND */
    1 * US_PER_SEC,
    2 * US_PER_SEC,
    3 * US_PER_SEC,
};

int main(void)
{
    xtimer_t timers[TIMERS_NUMOF];
    msg_t msgs[TIMERS_NUMOF];
    uint64_t start;

    expect(ARRAY_SIZE(_timers_offsets) == TIMERS_NUMOF);
    msg_init_queue(_main_msg_queue, MAIN_MSG_QUEUE_SIZE);
    /* ensure that xtimer_now64() is greater than UINT32_MAX */
    _xtimer_current_time = (2LLU << 32U);
    xtimer_init();
    print_str("Setting ");
    print_u32_dec(TIMERS_NUMOF);
    print_str(" timers:\n");
    for (unsigned i = 0; i < TIMERS_NUMOF; i++) {
        msgs[i].content.value = i;
        print_str(" #");
        print_u32_dec(i);
        print_str(" in ");
        print_u64_dec(_timers_offsets[i]);
        print_str(" usec\n");
    }
    print_str("now=");
    start = xtimer_now64().ticks64;
    print_u64_dec(start);
    print_str("\n");
    expect(start > UINT32_MAX);
    /* set timers after all were printed for better timing */
    for (unsigned i = 0; i < TIMERS_NUMOF; i++) {
        xtimer_set_msg64(&timers[i], _timers_offsets[i], &msgs[i],
                         thread_getpid());
        expect(timers[i].long_start_time > 0);
    }
    while (1) {
        msg_t msg;

        msg_receive(&msg);
        print_str("#");
        print_u32_dec(msg.content.value);
        print_str(":now=");
        print_u64_dec((uint64_t)xtimer_now64().ticks64);
        print_str("\n");
        for (unsigned i = 0; i <= msg.content.value; i++) {
            /* all fired timers expired */
            expect(timers[i].long_start_time == 0);
        }
        for (unsigned i = (msg.content.value + 1); i < TIMERS_NUMOF; i++) {
            /* upper half of the remaining timers' start_time stays above 0 as
             * it is based on xtimer_now64() during the timer's callback
             * execution */
            expect(timers[i].long_start_time > 0);
        }
    }
    return 0;
}

/** @} */

And so we were able to merge this fix on Good Friday, April 10th, 2020 into RIOT's master branch—the release fix is still pending due to the holidays as of the writing of this article.

Just out of curiosity, I went back to our original setup and used gdb again to see what the third timer was: besides RPL's evtimer and the evtimer controlling NDP, there was also an xtimer controlling the trickle algorithm for the multicast DIOs. That also explained why the bug was harder to reproduce with RS disabled: without periodic RSs, and with the 6LoWPAN optimization in place, there was not much else for NDP to do, and it probably never added its evtimer.

Conclusion

I found it interesting how what at first seemed to be two IPv6 control protocols—RPL and NDP—not wanting to play nicely together turned out to be a simple typo in the timer implementation. Moreover, how the specific problem revealed itself more and more with every layer we went deeper: first, there seemed to be two control protocols destroying each other's state; then it turned out that this was just RPL misbehaving and spamming its control messages; this in turn happened due to what seemed to be an offset miscalculation in the underlying event timer queue evtimer, which in turn came into being due to a typo within the underlying xtimer that only triggered the bug under certain conditions, when enough other xtimer objects were set.

However, it shows that with perseverance and methodology even the weirdest bug can be fixed. Those who know my work already know that I quite enjoy hunting for such obscure bugs (if only I had the time to hunt all of them ☹️), and it was a lot of fun falling down this particular rabbit hole.

Update: By now, Hauke has confirmed that this solved the reported problem.