How to Investigate the Cause of a Bug with Modality

# Summary

Modality generative testing lets you quickly find the cause of bugs, whether you want to zoom straight into problematic areas, or get guidance from Modality's system analysis.

Note: This guide makes use of features available in the full version of Modality.

# Compared to Traditional Debugging

When a bug is discovered through traditional methods, you often only know that an error surfaced at a certain point. When Modality finds a potential bug through generative testing, it describes both the entire system state and the specific problematic behavior. You start this guide pre-equipped with a Modality objective to trace the cause of your bug.

Traditional debugging involves many cycles of asking a specific question about system state, and then re-running your system in hopes of finding an informative result. With Modality, generative testing has already captured a tremendous number of test cases, so you don't have to re-run your system at all.

Traditional debugging requires you to hook up your debugger, tweak break points, and add print statements every time you want to interrogate a different region of your system. With Modality instrumentation, all system events are recorded, so you can simply query whatever system behavior interests you.

Traditional debugging methods provide little guidance when a bug's behavior seems unrelated to its source. Modality can scan your whole system for statistically-suspicious events, and point to specific lines in your code.

Writing regression tests after traditional debugging is a separate chore. With Modality objectives, your investigative queries can be easily rolled into your regression test suite.

# Step 1: Describe the Bug

This guide assumes that you have discovered a new potential bug with Modality generative testing, which can happen directly during your search for undiscovered bugs, or automatically as part of your continuous integration (CI) workflow.

You start this guide equipped with an instance of a Modality objective, describing the measurements and types of mutations that Modality ran on your system during generative testing. We'll pull out the relevant elements for your investigation.

  1. Open your terminal and call modality sut use <system-under-test-name> to select your default system under test.

    • Most users have a single system under test; call modality sut list for a list of system names.
  2. Call modality session use <session-name> to make your Modality session of interest from bug discovery the default for the rest of this guide.

    • If you don't recall your <session-name>, call modality objective instances <objective-name> -v for a list of relevant sessions.

    • If you have multiple sessions from bug discovery, such as part of a CI testing suite, use the session that contains the objective instance where your measurements failed.

  3. Call modality objective inspect <objective-name> <objective-instance-number> to see the names of the failing measurement(s).

    • In the example below, the failing measurements' names are Consumer heartbeat timing and Monitor heartbeat timeouts.
     $ modality objective inspect example-objective 1
     Name: example-objective
     SUT: example-sut
     Session: example-session
     Created at: 2021-05-26 09:33:31 UTC
     Created by: example-user
     Stopped at: 2021-05-26 09:35:04 UTC
     Stopping Conditions Reached: true
     Stopping Conditions Progress:
       Time: 00:01:33
       Passing Measurements: 10
       Failing Measurements: 3
     Passing Measurements: 8
       Consumer heartbeat timing: 10
       Consumer to monitor heartbeat: 13
       Monitor checks consumer heartbeats: 13
       Monitor checks producer heartbeats: 13
       Monitor heartbeat timeouts: 10
       Producer lifecycle: 13
       Producer to consumer communications: 13
       Producer to monitor heartbeat: 13
     Failing Measurements: 2
       Consumer heartbeat timing: 3
       Monitor heartbeat timeouts: 3
     Mutations: 15
  4. Call modality objective inspect <objective-name> to look up the exact query text of your failing measurement. We will focus on investigating Consumer heartbeat timing. The failing measurement query text is everything underneath that measurement's Check: and above the next Measurement Name:

    $ modality objective inspect example-objective
    Name: example-objective
    Created at: 2021-06-26 09:33:31 UTC
    Created by: example-user
    SUTS: [example-sut]
    Instances By Session:
      Session: example-session
        Instances: [1, 2]
    Tags: [example]
    Stopping Conditions:
      Attempts: 
      Time: 2min
      Passing Measurements: 10
      Failing Measurements: 10
    Mutation Constraints:
      Max Concurrent Mutations: 1
      Mutation Precondition: 
      Mutator Tags: []
    Mutator Constraints: 2
      Mutator: heartbeat-delay-mutator
        Probe: CONSUMER
        Mutation Precondition: 
        Parameter Constraints: 1
          Parameter: heartbeat-delay
          Range: [500, 2000]
      Mutator: sample-data-mutator
        Probe: PRODUCER
        Mutation Precondition: SAMPLED @ PRODUCER
        Parameter Constraints: 1
          Parameter: sample
          Range: [0, 100]
    Measurements: 8
      Measurement Name: Consumer heartbeat timing
        Should Pass: true
        Stopping Conditions:
          Passing Measurements: 
          Failing Measurements: 
        Check:
          NEVER
              # 2 == HEARTBEAT_ID_CONSUMER
              RECVD_MSG @ MONITOR (_.payload = 2) 
              -> AFTER 600ms
              RECVD_MSG @ MONITOR (_.payload = 2)
    ...
  5. Copy and paste the text of your <failing-measurement-query> into a blank text file with the .tq extension. For example, failing-measurement-query.tq.

    • If you plan on only doing a brief investigation, you can simply copy this whole <failing-measurement-query> string to your clipboard, since a Modality query expression is simply a string of text describing a question about system behavior. However, we strongly recommend you save it in a text file for future use and modification.
    NEVER
          # 2 == HEARTBEAT_ID_CONSUMER
          RECVD_MSG @ MONITOR (_.payload = 2) 
          -> AFTER 600ms
          RECVD_MSG @ MONITOR (_.payload = 2)
    
  6. You now have a query file with a failing measurement, which is a Modality query expression describing how your system was supposed to behave, but failed in the case of this bug. You're fully equipped to move on to the next step: Investigate Failing Measurements.
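As an aside, the semantics of this failing measurement can be paraphrased in ordinary code. The sketch below is purely illustrative: the event names, payload value, and 600ms limit come from the example above, but the trace format and the treatment of -> as "followed by the next matching event" are simplifying assumptions, not Modality's implementation.

```python
# Hypothetical sketch of the "Consumer heartbeat timing" measurement:
#   NEVER
#       RECVD_MSG @ MONITOR (_.payload = 2)
#       -> AFTER 600ms
#       RECVD_MSG @ MONITOR (_.payload = 2)
# i.e. no consumer heartbeat received at the monitor may be followed by the
# next one more than 600ms later.

HEARTBEAT_ID_CONSUMER = 2  # from the example's comment: 2 == HEARTBEAT_ID_CONSUMER

def heartbeat_timing_ok(events, limit_ms=600):
    """events: (name, probe, payload, time_ms) tuples in causal order."""
    heartbeats = [t for (name, probe, payload, t) in events
                  if name == "RECVD_MSG" and probe == "MONITOR"
                  and payload == HEARTBEAT_ID_CONSUMER]
    # The measurement fails if any gap between consecutive heartbeats
    # exceeds the limit.
    return all(b - a <= limit_ms for a, b in zip(heartbeats, heartbeats[1:]))

# A healthy trace passes; a 3182ms gap (the example's violation) fails.
ok_trace  = [("RECVD_MSG", "MONITOR", 2, t) for t in (0, 500, 1000)]
bad_trace = [("RECVD_MSG", "MONITOR", 2, 0), ("RECVD_MSG", "MONITOR", 2, 3182)]
```

Keeping this mental model in mind makes the counterexamples in the next step much easier to read.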

# Step 2: Investigate Failing Measurements

You can use Modality queries to see the landscape of events around a failing measurement, which can often point to your bug's root cause. Modality lets you expand the exact sequence of events to notice that a necessary event is absent, or zoom out to see that an unexpected event is intruding from some other part of system code. Even if the cause isn't apparent, you'll gain a powerful counter-query to combine with statistical analysis in the following steps.

  1. Call modality check -v --file <failing-measurement-query.tq> to see what part of your measurement is having problems.

    • If you didn't save the text of your <failing-measurement-query> into a text file such as failing-measurement-query.tq, call modality check -v '<failing-measurement-query>' instead.

    • In this example, we see that the AFTER 600ms constraint is violated, and that the generative tests saw values of up to 3182ms, far beyond the 600ms limit. Together, this gives us a hypothesis that some part of the system is creating a bottleneck.

     
     $ modality check -v --file failing-measurement-query.tq 
     The given check had no matches. Analyzing why…
     
     Violation: Exists
         NEVER
             RECVD_MSG @ MONITOR(_.payload = 2)
             -> AFTER 600ms
             ^^^^^^^^^^^^^^
             |
             Changing this limit to 3182ms causes this assertion to pass
             RECVD_MSG @ MONITOR(_.payload = 2)
  2. Call modality check -vv --file <failing-measurement-query.tq> for a counterexample and counter-query.

    • modality check -vv --file <failing-measurement-query.tq> returns one sample counterexample, showing details from a test case where this measurement query failed. There's always a chance that this first counterexample will demonstrate an obvious cause. If not, press on.
    
    $ modality check -vv --file failing-measurement-query.tq 
    The given check had no matches. Analyzing why…
    
    Violation: Exists
        NEVER
            RECVD_MSG @ MONITOR(_.payload = 2)
            -> AFTER 600ms
            ^^^^^^^^^^^^^^
            |
            Changing this limit to 3182ms causes this assertion to pass
            RECVD_MSG @ MONITOR(_.payload = 2)
        
        Counter-Example:
            *  RECVD_MSG@MONITOR(393359338:65560:23:6, RECVD_MSG@MONITOR, payload=2)
              
            :  (5 elided events)
              
            *  RECVD_MSG@MONITOR(393359338:131073:0:6, RECVD_MSG@MONITOR, payload=2)
        
        Counter-Query:
            RECVD_MSG @ MONITOR(_.payload = 2)
            -> AFTER 600ms
            RECVD_MSG @ MONITOR(_.payload = 2)
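The counter-query is the logical complement of the measurement: a NEVER measurement passes exactly when a query for the pattern inside it returns no results. A simplified sketch of that relationship (events as hypothetical (name, probe, payload, time_ms) tuples, with -> treated as "followed by the next matching event"; this is an illustration, not Modality's query engine):

```python
def matches(events, limit_ms=600):
    """Return pairs of consumer heartbeats (payload == 2 at MONITOR) where the
    second arrives more than limit_ms after the first -- i.e. counter-query
    hits."""
    hb = [t for (name, probe, payload, t) in events
          if name == "RECVD_MSG" and probe == "MONITOR" and payload == 2]
    return [(a, b) for a, b in zip(hb, hb[1:]) if b - a > limit_ms]

def never_holds(events):
    # The NEVER measurement passes exactly when the counter-query is empty.
    return not matches(events)

trace = [("RECVD_MSG", "MONITOR", 2, 0),
         ("RECVD_MSG", "MONITOR", 2, 500),
         ("RECVD_MSG", "MONITOR", 2, 3682)]  # 3182ms gap: a counter-example
```

This is why the counter-query is so useful: instead of one pass/fail verdict, it enumerates every place the property was violated.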
  3. Save the text of the provided <counter-query> to a separate query text file, such as counter-query.tq.

    • This <counter-query> is an extremely helpful counterpart to your <failing-measurement-query>. While your failing measurement query describes a property that should hold throughout your system, this corresponding <counter-query> asks for all of the instances where that property is violated.
    RECVD_MSG @ MONITOR(_.payload = 2)
    -> AFTER 600ms
    RECVD_MSG @ MONITOR(_.payload = 2)
    
  4. Call modality query --file <counter-query.tq>, and scan this list for an overview of all the places where your measurement is failing.

    • Thanks to the many test cases from generative testing, you can see whether this measurement failure is part of a consistent violation, or a singular edge case.
    • If your <counter-query> describes a sequence of events, you can call modality query --file <counter-query.tq> --expand to see the full chain of events within each result.
     
     $ modality query --file counter-query.tq  
     Result 1:  
     ═════════
     *  RECVD_MSG@MONITOR((_.payload = 2))(950382630:131078:5:6, RECVD_MSG @ MONITOR, payload=2)
       
     :  (11 elided events)
       
     *  RECVD_MSG@MONITOR((_.payload = 2)) [2](950382630:131081:9:6, RECVD_MSG @ MONITOR, payload=2)
     ...
     Result 16:
     ══════════
     *  RECVD_MSG@MONITOR((_.payload = 2))(950382630:720918:22:6, RECVD_MSG @ MONITOR, payload=2)
       
     :  (6 elided events)
       
     *  RECVD_MSG@MONITOR((_.payload = 2)) [2](950382630:786433:1:4, RECVD_MSG @ MONITOR, payload=2)
  5. Pick one of the results and, next to one of the labeled events, note the <causal-coordinates>: a string of numbers separated by colons.

    • In the example above, 950382630:786433:1:4 are the coordinates for the second RECVD_MSG@MONITOR event in Result 16.
  6. Call modality log --from <causal-coordinates> --radius 6. This will give you the sequence of events around your event of interest. Since Modality's log of events is based on causal order rather than wall-clock time, this can catch unexpected behaviors that might be obscured by distributed components or differing internal clocks.

    • Seeing the sequence and nature of events around a failed measurement often points to a bug cause. For example, an event that should be happening before a failed measurement may be missing, or an unexpected event may be intruding from some other system side effect.
    • If you want a wider view, increase the --radius number for more events on either side.
    $ modality log --from 950382630:786433:1:4 --radius 6
         *  (950382630:786432:0:3, INIT @ MONITOR)
         
     »    CONSUMER merged a snapshot from PRODUCER
         
     *      (808713209:786436:2:3, SAMPLED @ PRODUCER, payload=5)
         
       »  MONITOR merged a snapshot from CONSUMER
         
       *    (946301812:786435:2:6, RECVD_MSG @ CONSUMER, payload=4)
         
         *  (950382630:786433:1:4, RECVD_MSG @ MONITOR, payload=2)
         
       *    (946301812:786435:2:7, TASK_LOOP @ CONSUMER)
         
         *  (950382630:786433:1:5, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
         
         *  (950382630:786433:1:6, CONSUMER_REGISTERED @ MONITOR)
         
     ════»  MONITOR merged a snapshot from PRODUCER
         
     *      (808713209:786441:6:3, SAMPLED @ PRODUCER, payload=4)
         
         *  (950382630:786434:2:6, RECVD_MSG @ MONITOR, payload=1)
         
     *      (808713209:786441:6:4, SAMPLED @ PRODUCER, payload=3)
         
         *  (950382630:786434:2:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
         
     *      (808713209:786441:6:5, SAMPLED @ PRODUCER, payload=4)
         
         *  (950382630:720918:22:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
         
         *  (950382630:720918:23:4, DEINIT @ MONITOR)
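Why causal order can disagree with wall-clock order is easiest to see with logical clocks. The sketch below is a generic Lamport-clock illustration of the principle, not Modality's internal mechanism: a receive event can carry an earlier wall-clock reading than the corresponding send when the probes' clocks are skewed, yet it is still causally last.

```python
# Minimal Lamport-clock sketch: each probe keeps a counter, incremented on
# every local event and raised past the sender's counter on message receipt.
# Sorting by this logical clock reflects causality even when the probes'
# wall clocks are skewed relative to each other.

def run():
    log = []  # (lamport, wall_ms, event) -- all names here are hypothetical
    producer_clock, monitor_clock = 0, 0

    # PRODUCER samples, then sends; its wall clock runs fast.
    producer_clock += 1
    log.append((producer_clock, 100, "SAMPLED @ PRODUCER"))
    producer_clock += 1
    log.append((producer_clock, 101, "SEND @ PRODUCER"))

    # MONITOR receives the message; its wall clock reads only 50 ms, so by
    # wall-clock time the receive appears to precede the send.
    monitor_clock = max(monitor_clock, producer_clock) + 1
    log.append((monitor_clock, 50, "RECVD_MSG @ MONITOR"))
    return log

log = run()
by_wall_clock = [e for _, _, e in sorted(log, key=lambda x: x[1])]
by_causality  = [e for _, _, e in sorted(log, key=lambda x: x[0])]
# by_wall_clock puts RECVD_MSG first (clock skew); by_causality keeps it last.
```

This is the same effect visible in the -v output above, where causally later monitor events carry smaller time_ns values than neighboring producer events.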
         
  7. If an event seems problematic to you, or if you want to see exactly where these events live in your source code, add -v to the call: modality log --from <causal-coordinates> --radius 6 -v

    • This allows you to easily move from events to source code, even if that code is unfamiliar.
    • Look for discrepancies between your source code and observed behavior, and note the precise timing of events.
    • This is also a good way to notice events that were triggered from an unexpected region of your code base.
     
     $ modality log --from 950382630:786433:1:4 --radius 6 -v
         *  (950382630:786432:0:3, INIT @ MONITOR)
             tags=[monitor]
             src="c-example/src/monitor/main.c#L78"
         
     »    CONSUMER merged a snapshot from PRODUCER
         
     *      (808713209:786436:2:3, SAMPLED @ PRODUCER, payload=5)
             tags=[measurement, producer, sample]
             src="c-example/src/producer/main.c#L131"
             clock_id=1, clock_resolution_ns=1, time_ns=2057782859635
         
       »  MONITOR merged a snapshot from CONSUMER
         
       *    (946301812:786435:2:6, RECVD_MSG @ CONSUMER, payload=4)
             tags=[communication, consumer, measurement]
             src="c-example/src/consumer/main.c#L176"
             clock_id=1, clock_resolution_ns=1, time_ns=2057682812210
         
         *  (950382630:786433:1:4, RECVD_MSG @ MONITOR, payload=2)
             tags=[communication, heartbeat, monitor]
             src="c-example/src/monitor/main.c#L150"
             clock_id=1, clock_resolution_ns=1, time_ns=2057633424101
         
       *    (946301812:786435:2:7, TASK_LOOP @ CONSUMER)
             tags=[consumer, process-loop]
             src="c-example/src/consumer/main.c#L128"
         
         *  (950382630:786433:1:5, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
             tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor]
             src="c-example/src/monitor/main.c#L159"
             clock_id=1, clock_resolution_ns=1, time_ns=2057633424101
         
         *  (950382630:786433:1:6, CONSUMER_REGISTERED @ MONITOR)
             tags=[heartbeat, monitor]
             src="c-example/src/monitor/main.c#L189"
         
     ════»  MONITOR merged a snapshot from PRODUCER
         
     *      (808713209:786441:6:3, SAMPLED @ PRODUCER, payload=4)
             tags=[measurement, producer, sample]
             src="c-example/src/producer/main.c#L131"
             clock_id=1, clock_resolution_ns=1, time_ns=2058183815739
         
         *  (950382630:786434:2:6, RECVD_MSG @ MONITOR, payload=1)
             tags=[communication, heartbeat, monitor]
             src="c-example/src/monitor/main.c#L150"
             clock_id=1, clock_resolution_ns=1, time_ns=2058092566069
         
     *      (808713209:786441:6:4, SAMPLED @ PRODUCER, payload=3)
             tags=[measurement, producer, sample]
             src="c-example/src/producer/main.c#L131"
             clock_id=1, clock_resolution_ns=1, time_ns=2058183817939
         
         *  (950382630:786434:2:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
             tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor]
             src="c-example/src/monitor/main.c#L159"
             clock_id=1, clock_resolution_ns=1, time_ns=2058092566069
         
     *      (808713209:786441:6:5, SAMPLED @ PRODUCER, payload=4)
             tags=[measurement, producer, sample]
             src="c-example/src/producer/main.c#L131"
             clock_id=1, clock_resolution_ns=1, time_ns=2058183818239
         
         *  (950382630:720918:22:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
             tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor]
             src="c-example/src/monitor/main.c#L159"
             clock_id=1, clock_resolution_ns=1, time_ns=2056996842325
         
         *  (950382630:720918:23:4, DEINIT @ MONITOR)
             tags=[monitor]
             src="c-example/src/monitor/main.c#L96"
         
  8. If you've noticed a problem, Modality is giving you the exact line in code to start fixing your bug. If the bug cause is not apparent yet, move on to the next step: Trace Suspicious System Patterns.

# Step 3: Trace Suspicious System Patterns

If there's no obvious cause to a bug, Modality lets you zoom out to look at the system as a whole. From this altitude, Modality analyzes all of your generative tests and identifies statistically-suspicious patterns. Modality can then zoom in to show the relationship between a suspicious pattern of events and your bug's failing measurement. Modality can point you to the exact code that needs addressing, and in the final steps lets you easily improve your regression tests.

In this step, you'll combine your Modality queries from prior steps to search in a powerful new way. Modality will give you a list of suspicious events, ranked by their correlation to a failing measurement, like your <failing-measurement-query> from Step 1. Modality can show you the exact path between a failing measurement and a suspicious event, pointing to specific lines of code.

  1. Call modality metrics fault-localization --mode spectrum --objective-name <objective-name> --objective-instance <objective-instance>. Behind the scenes, Modality will search all of the test cases for this objective instance, comparing failing and passing cases, and finding suspicious events that are correlated with failing cases.

     
       $ modality metrics fault-localization --mode spectrum --objective-name example-objective --objective-instance 1
       Found 3 distinct problems in 13 cases.
       Problem: At least one failure detected out of:
         [Expectation CONSUMER_TIMEOUT @ MONITOR, ObjectiveMeasurement Consumer heartbeat timing for example-objective, ObjectiveMeasurement Monitor heartbeat timeouts for example-objective]
           Failing cases: example-session [epoch 14 for objective example-objective instance 1, epoch 16 for objective example-objective instance 1, epoch 18 for objective example-objective instance 1]
           Passing cases: example-session [epoch 2 for objective example-objective instance 1, epoch 4 for objective example-objective instance 1, epoch 6 for objective example-objective instance 1, epoch 8 for objective example-objective instance 1, epoch 10 for objective example-objective instance 1, epoch 12 for objective example-objective instance 1, epoch 20 for objective example-objective instance 1, epoch 22 for objective example-objective instance 1, epoch 24 for objective example-objective instance 1, epoch 26 for objective example-objective instance 1]
           Suspicious Single Elements:
             Score: 0.866
             Occurred in 3 / 3 failing cases and 1 / 10 passing cases
             Element:
               *  (SENT_MEASUREMENT @ PRODUCER (_.payload >= 50))
               
             Score: 0.707
             Occurred in 3 / 3 failing cases and 3 / 10 passing cases
             Element:
               *  (RECVD_MSG @ CONSUMER (_.payload >= 39))
         ...
  2. You will receive a sorted list of suspicious program elements, with the most suspicious elements first. Pick one of these suspicious elements and, under Element:, note the <suspicious-pattern> in parentheses. They normally follow the pattern (<EVENT_NAME> @ <PROBE_NAME> (<OTHER_CONDITIONS>))

    • In the example above, the <suspicious-pattern> is (SENT_MEASUREMENT @ PRODUCER (_.payload >= 50))
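Incidentally, the scores in the example output are consistent with the Ochiai formula, a standard suspiciousness metric in spectrum-based fault localization. Whether Modality actually uses Ochiai is an inference from these numbers, not documented behavior:

```python
import math

def ochiai(failing_with, failing_total, passing_with):
    """Suspiciousness of an element seen in `failing_with` of `failing_total`
    failing cases and in `passing_with` passing cases (Ochiai formula from
    spectrum-based fault localization)."""
    return failing_with / math.sqrt(failing_total * (failing_with + passing_with))

# Reproduces the example output:
#   3/3 failing cases, 1/10 passing cases -> 0.866
#   3/3 failing cases, 3/10 passing cases -> 0.707
score_a = round(ochiai(3, 3, 1), 3)  # SENT_MEASUREMENT @ PRODUCER (_.payload >= 50)
score_b = round(ochiai(3, 3, 3), 3)  # RECVD_MSG @ CONSUMER (_.payload >= 39)
```

Intuitively, an element scores highest when it appears in every failing case and in few passing cases, which is exactly why the top-ranked element is the most promising starting point.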
  3. Create a blank text file to write your query in. For example, correlate-events-query.tq.

  4. Start with the <suspicious-pattern>.

      SENT_MEASUREMENT @ PRODUCER(_.payload >= 50)
    
  5. Next, add an arrow -> and then insert your <counter-query> from [Step 2: Investigate Failing Measurements].

    • This indicates that you want to match sequences where the <suspicious-pattern> is followed by the <counter-query>.

    • The correlate-events-query.tq file would now look like this:

    SENT_MEASUREMENT @ PRODUCER(_.payload >= 50)
    -> 
    RECVD_MSG @ MONITOR(_.payload = 2)
    -> AFTER 600ms
    RECVD_MSG @ MONITOR(_.payload = 2)
    
  6. Call modality query --file <correlate-events-query.tq>. This will show you how many times the suspicious event was later followed by your failing behavior in your session.
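What this combined query asks for can be paraphrased in ordinary code. The sketch below is illustrative only (events as hypothetical (name, probe, payload, time_ms) tuples in causal order, with -> simplified to "eventually followed by"; this is not Modality's query engine):

```python
def correlate(events, limit_ms=600):
    """Count matches of the combined query: a suspicious high-payload producer
    measurement, eventually followed by two consumer heartbeats at the monitor
    that arrive more than limit_ms apart."""
    hits = 0
    for i, (name, probe, payload, _) in enumerate(events):
        if name == "SENT_MEASUREMENT" and probe == "PRODUCER" and payload >= 50:
            # Look for the late-heartbeat pair after the suspicious event.
            hb = [t for (n, p, pl, t) in events[i + 1:]
                  if n == "RECVD_MSG" and p == "MONITOR" and pl == 2]
            if any(b - a > limit_ms for a, b in zip(hb, hb[1:])):
                hits += 1
    return hits

trace = [
    ("SENT_MEASUREMENT", "PRODUCER", 72, 0),  # suspicious: payload >= 50
    ("RECVD_MSG", "MONITOR", 2, 10),
    ("RECVD_MSG", "MONITOR", 2, 900),         # 890ms gap: failing behavior
]
```

The more often the suspicious pattern precedes the failing behavior across test cases, the stronger the evidence that the two are causally related rather than coincidental.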

  7. If there aren't enough results to suggest a relationship between the suspicious pattern and your counter-query, call modality metrics fault-localization --mode spectrum --objective-name <objective-name> --objective-instance <objective-instance> again and choose a different suspicious program element pattern.

    • Adjust your <correlate-events-query.tq> file by replacing the first suspicious pattern with this second suspicious pattern.
    • Move through Step 3: Trace Suspicious System Patterns again.
  8. If there are enough results to suggest a relationship, call modality query --file <correlate-events-query.tq> --expand. Adding --expand will map out the full path of events between the suspicious event and the failing/buggy behavior, so you can see every potentially disruptive step along the way.

  9. Once you've found the problem in the path of events, call modality query --file <correlate-events-query.tq> --expand -v. Adding -v shows you where each event lives in your source code, so you can move directly to fixing the bug.

# Follow up - If the bug's cause is not yet apparent:

  • If you have the benefit of more than one run of generative testing, such as from prior field tests or CI tests, call modality metrics fault-localization --mode spectrum --objective-name <objective-name> --objective-instance <other-objective-instance> and compare to see if there has been a noticeable recent change in the suspiciousness rating for any events between the two runs.

  • Call modality query --file <counter-query.tq> -vv --expand and look for signs that your embedded code might need additional Modality instrumentation to capture the full flow of events.

  • Run modality metrics coverage --using <session-name> to see if your test scripts are exercising enough of your system to produce meaningful results in your generative tests. See how to write better tests with Modality.

  • Run a broader session of generative testing for a longer time to replicate the issue and collect more statistically significant data for Modality to crunch. See how to find an undiscovered bug with Modality generative testing for how to expand the parameters of your generative testing objective.

# When you've found the bug's cause

  • First, congrats on finding the cause! Modality is designed so your efforts in one area are easily forwarded to other parts of development.

  • If you are using Modality for your regression testing, we strongly recommend you add the query from your <correlate-events-query> as an additional measurement in your regression test objectives. This will be extremely helpful to you in several ways.

    • Looping <correlate-events-query> into a Modality objective validates your system against this bug throughout further development.

    • If something similar to this type of bug happens in the future, even out in production, the next round of troubleshooting will have a huge head-start, because the <correlate-events-query> will point to the issue with much greater precision than the original measurement query did.

  • If you followed any of the follow-up steps, there's a good chance you've also permanently improved the instrumentation of your system, broadened your generative tests, or strengthened the robustness of your test scripts.