# How to Investigate the Cause of a Bug with Modality Generative Testing

# Summary

Modality generative testing lets you quickly find the cause of bugs, whether you want to zoom straight into problematic areas, or get guidance from Modality's system analysis.

# Compared to Traditional Debugging

When a bug is discovered through traditional methods, you often know only that an error was observed at a certain surface of the system. When Modality finds a potential bug through generative testing, it describes both the entire system state and the specific problematic behavior, so you start this guide already equipped with a Modality objective to trace the cause of your bug.

Traditional debugging involves many cycles of asking a specific question about system state, and then re-running your system in hopes of finding an informative result. With Modality, generative testing has already captured a tremendous number of test cases, so you don't have to re-run your system at all.

Traditional debugging requires you to hook up your debugger, tweak break points, and add print statements every time you want to interrogate a different region of your system. With Modality instrumentation, all system events are recorded, so you can simply query whatever system behavior interests you.

Traditional debugging methods provide little guidance when a bug's behavior seems unrelated to its source. Modality can scan your whole system for statistically-suspicious events, and point to specific lines in your code.

Writing regression tests after traditional debugging is a separate chore. With Modality objectives, your investigative queries can be easily rolled into your regression test suite.

# Step 1: Describe the Bug

This guide assumes that you have discovered a new potential bug with Modality generative testing, which can happen directly during your search for undiscovered bugs, or automatically as part of your continuous integration (CI) workflow.

You start this guide equipped with an instance of a Modality objective, describing the measurements and types of mutations that Modality ran on your system during generative testing. We'll pull out the relevant elements for your investigation.

  1. Open your terminal and call modality sut use < system under test name > to select your default system under test.

    • Most users have a single system under test; call modality sut list for a list of system names.
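    • For example, if your system under test is named example-sut, as in the sample output later in this guide, the call would be:
     
     $ modality sut use example-sut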
  2. Call modality session use < session name > to make your Modality session of interest from bug discovery the default for the rest of this guide.

    • If you don't recall your < session name >, call modality objective instances < objective name > -v for a list of relevant sessions.

    • If you have multiple sessions from bug discovery, such as part of a CI testing suite, use the session that contains the objective instance where your measurements failed.
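    • For example, using the objective and session names from the sample output below (substitute your own):
     
     $ modality objective instances demo-objective -v
     $ modality session use long-running-experiment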

  3. Call modality objective inspect < objective name > < objective instance number > to see the names of the failing measurement(s).

    • In the example below, the only failing measurement is named Heartbeat message receive interval.
     
     $ modality objective inspect demo-objective 1
     Name: demo-objective
     SUT: example-sut
     Session: long-running-experiment
     Created at: 2021-05-24 12:50:22 UTC
     Created by: example-user
     Stopped at: 2021-05-24 13:02:26 UTC
     Stopping Conditions Reached: true
     Stopping Conditions Progress:
       Time: 00:12:04
       Passing Measurements: 80
       Failing Measurements: 18
     Passing Measurements: 4
       Consumer lifecycle: 98
       Heartbeat message receive interval: 80
       Heartbeat sender IDs are checked: 98
       Producer to consumer communications: 98
     Failing Measurements: 1
       Heartbeat message receive interval: 18
     Mutations: 100
     
     
  4. Call modality objective inspect < objective name > to look up the exact query text of your failing measurement. The failing measurement query text is everything underneath that measurement's Check: and above the next Measurement Name: entry.

    
    $ modality objective inspect demo-objective
    Name: demo-objective
    Created at: 2021-05-24 12:48:44 UTC
    Created by: example-user
    SUTS: [example-sut]
    Instances By Session:
      Session: long-running-experiment
        Instances: [1, 2]
      Session: nominal
        Instances: []
    Stopping Conditions:
      Attempts: 
      Time: 30sec
      Passing Measurements: 5
      Failing Measurements: 5
    Mutation Constraints:
      Max Concurrent Mutations: 1
      Mutation Precondition: 
      Mutator Tags: [sample]
    Mutator Constraints: 1
      Mutator: sample-data-mutator
        Probe: PRODUCER
        Mutation Precondition: MATCH SAMPLED @ PRODUCER
        Parameter Constraints: 1
          Parameter: sample
          Range: [-120, 120]
    Measurements: 4
    ...
      Measurement Name: Heartbeat message receive interval
        Should Pass: true
        Stopping Conditions:
          Passing Measurements: 
          Failing Measurements: 
        Check:
          NEVER
              # 2 == HEARTBEAT_ID_CONSUMER
              (RECVD_MSG @ MONITOR AND payload = 2) AS RecvdMsgA,
              (RECVD_MSG @ MONITOR AND payload = 2) AS RecvdMsgB
          WHEN
              RecvdMsgA -> RecvdMsgB AFTER 600ms
    
    ...
    
    
  5. Copy and paste the text of your < failing measurement query > into a blank text file with the .tq extension. For example, failing-measurement-query.tq

    • If you only plan on a brief investigation, you can simply copy this whole < failing measurement query > string to your clipboard, since Modality query expressions are just text describing a question about your system's behavior. However, we strongly recommend saving the query in a text file for future use and modification.
    NEVER
        # 2 == HEARTBEAT_ID_CONSUMER
        (RECVD_MSG @ MONITOR AND payload = 2) AS RecvdMsgA,
        (RECVD_MSG @ MONITOR AND payload = 2) AS RecvdMsgB
    WHEN
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    
  6. You now have a query file with a failing measurement: a Modality query expression describing how your system was supposed to behave, which your system violated in the case of this bug. You're fully equipped to move on to the next step: Investigate Failing Measurements.

# Step 2: Investigate Failing Measurements

You can use Modality queries to see the landscape of events around a failing measurement, which can often point to your bug's root cause. Modality lets you expand the exact sequence of events and notice that a necessary event is absent, or zoom out to see that an unexpected event is intruding from some other part of your system code. Even if the cause isn't apparent, you'll gain a powerful counter-query to combine with statistical analysis in the following steps.

  1. Call modality check -v --file < failing-measurement-query.tq > to see what part of your measurement is having problems.

    • If you didn't save the text of your < failing measurement query > into a text file such as failing-measurement-query.tq, call modality check -v '< failing measurement query >' instead.

    • In this example, we see that the AFTER 600ms constraint is violated, and that the generative tests observed intervals as long as 3182ms, more than five times the limit. Together, this gives us a hypothesis that some part of the system is creating a bottleneck.

     
     $ modality check -v --file failing-measurement-query.tq 
     The given check had no matches. Analyzing why…
     
     Violation: Exists
         NEVER
             (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
             (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
         WHEN
             RecvdMsgA -> RecvdMsgB AFTER 600ms
                                    ^^^^^^^^^^^
                                    |
                                    Changing this limit to 3182ms causes this assertion to pass
     
     
  2. Call modality check -vv --file < failing-measurement-query.tq > for a counterexample and counter-query.

    • modality check -vv --file < failing-measurement-query.tq > returns one sample counterexample, showing details from a test case where this measurement query failed. There's always a chance that this first counterexample will demonstrate an obvious cause. If not, press on.
    
    $ modality check -vv --file failing-measurement-query.tq 
    The given check had no matches. Analyzing why…
    
    Violation: Exists
        NEVER
            (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
            (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
        WHEN
            RecvdMsgA -> RecvdMsgB AFTER 600ms
                                   ^^^^^^^^^^^
                                   |
                                   Changing this limit to 3182ms causes this assertion to pass
       
        Counter-Example:
            *  RecvdMsgA(393359338:65560:23:6, RECVD_MSG @ MONITOR, payload=2)
              
            :  (5 elided events)
              
            *  RecvdMsgB(393359338:131073:0:6, RECVD_MSG @ MONITOR, payload=2)
    
    Counter-Query:
        MATCH
          (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
          (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
        WHEN
          RecvdMsgA -> RecvdMsgB AFTER 600ms
    
  3. Save the text of the provided < counter-query > to a separate query text file, such as counter-query.tq.

    • This < counter-query > is an extremely helpful counterpart to your < failing measurement query >. While your failing measurement query describes a property that should hold throughout your system, this corresponding < counter-query > asks for all of the instances where that property is violated.
    MATCH
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    
  4. Call modality query --file < counter-query.tq >, and scan the results for an overview of all the places where your measurement is failing.

    • Thanks to the many test cases from generative testing, you can see whether this measurement failure is part of a consistent violation, or a singular edge case.
    • If your < counter-query > describes a sequence of events, you can call modality query --file < counter-query.tq > --expand to see the full chain of events within each result.
     
     $ modality query --file counter-query.tq  
     Result 1:  
     ═════════
     *  RecvdMsgA(393359338:65560:23:6, RECVD_MSG @ MONITOR, payload=2)
       
     :  (5 elided events)
       
     *  RecvdMsgB(393359338:131073:0:6, RECVD_MSG @ MONITOR, payload=2)
    

    ...

     Result 160:
     ═══════════
     *  RecvdMsgA(393359338:6553624:23:6, RECVD_MSG @ MONITOR, payload=2)

     :  (5 elided events)

     *  RecvdMsgB(393359338:6619137:0:6, RECVD_MSG @ MONITOR, payload=2)
  5. Pick one of the results, and next to one of the labeled events, note the < causal coordinates >, which are a string of numbers separated by colons.

    • In the example above, 393359338:6619137:0:6 are the coordinates for event RecvdMsgB in Result 160.
  6. Call modality log --from < causal coordinates > --radius 6. This will give you the sequence of events around your event of interest. Since Modality's log of events is based on causal order rather than wall-clock time, it can catch unexpected behaviors that might be obscured by distributed components or differing internal clocks.

    • Seeing the sequence and nature of events around a failed measurement often points to a bug's cause. For example, an event that should happen before the failed measurement may be missing, or an unexpected event may be intruding from some other system side effect.
    • If you want a wider view, increase the --radius number for more events on either side.
     
     $ modality log --from 393359338:6619137:0:6 --radius 6
       *      (393359338:6619136:0:3, INIT @ MONITOR)
           
         «  CONSUMER merged a snapshot from PRODUCER
           
           *  (957217087:6619137:0:9, SENT_HEARTBEAT @ PRODUCER, payload=1)
           
       «    MONITOR merged a snapshot from CONSUMER
           
         *    (561128184:6619138:1:4, RECVD_MSG @ CONSUMER, payload=-1)
           
       *      (393359338:6619137:0:6, RECVD_MSG @ MONITOR, payload=2)
           
         *    (561128184:6619138:1:5, TASK_LOOP @ CONSUMER)
           
       *      (393359338:6619137:0:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
           
     ══════»  PRODUCER merged a snapshot from modalityd
           
       *      (393359338:6619137:0:8, CONSUMER_REGISTERED @ MONITOR)
           
       «══  MONITOR merged a snapshot from PRODUCER
           
           *  (957217087:6619139:1:5, SAMPLED @ PRODUCER, payload=-2)
           
       *      (393359338:6619138:1:5, RECVD_MSG @ MONITOR, payload=1)
           
       *      (393359338:6619138:1:6, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
           
       *      (393359338:6553624:23:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS)
           
       *      (393359338:6553624:24:4, DEINIT @ MONITOR)
           
     
     
  7. If an event seems problematic to you, or if you want to see exactly where these events live in your source code, add -v to the call: modality log --from < causal coordinates > --radius 6 -v

    • This allows you to easily move from events to source code, even if that code is unfamiliar.
    • Look for discrepancies between your source code and observed behavior, and check the precise timing of events.
    • This is also a good way to notice events that were triggered from an unexpected region of your code base.
     
     $ modality log --from 393359338:6619137:0:6 --radius 6 -v
       *      (393359338:6619136:0:3, INIT @ MONITOR, tags=[monitor], src="c-example/src/monitor/main.c#L64")
           
         «  CONSUMER merged a snapshot from PRODUCER
           
           *  (957217087:6619137:0:9, SENT_HEARTBEAT @ PRODUCER, payload=1, tags=[communication, heartbeat, producer], src="c-example/src/producer/main.c#L191", clock_id=1, clock_resolution_ns=1, time_ns=1610176333461709)
           
       «    MONITOR merged a snapshot from CONSUMER
           
         *    (561128184:6619138:1:4, RECVD_MSG @ CONSUMER, payload=-1, tags=[communication, consumer, measurement], src="c-example/src/consumer/main.c#L163", clock_id=1, clock_resolution_ns=1, time_ns=1610176338595119)
           
       *      (393359338:6619137:0:6, RECVD_MSG @ MONITOR, payload=2, tags=[communication, heartbeat, monitor], src="c-example/src/monitor/main.c#L119", clock_id=1, clock_resolution_ns=1, time_ns=1610176333514916)
           
         *    (561128184:6619138:1:5, TASK_LOOP @ CONSUMER, tags=[consumer, process-loop], src="c-example/src/consumer/main.c#L128")
           
       *      (393359338:6619137:0:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS, tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor], src="c-example/src/monitor/main.c#L128", clock_id=1, clock_resolution_ns=1, time_ns=1610176333514916)
           
     ══════»  PRODUCER merged a snapshot from modalityd
           
       *      (393359338:6619137:0:8, CONSUMER_REGISTERED @ MONITOR, tags=[heartbeat, monitor], src="c-example/src/monitor/main.c#L158")
           
       «══  MONITOR merged a snapshot from PRODUCER
           
           *  (957217087:6619139:1:5, SAMPLED @ PRODUCER, payload=-2, tags=[measurement, producer, sample], src="c-example/src/producer/main.c#L128", clock_id=1, clock_resolution_ns=1, time_ns=1610176433594272)
           
       *      (393359338:6619138:1:5, RECVD_MSG @ MONITOR, payload=1, tags=[communication, heartbeat, monitor], src="c-example/src/monitor/main.c#L119", clock_id=1, clock_resolution_ns=1, time_ns=1610176343641931)
           
       *      (393359338:6619138:1:6, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS, tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor], src="c-example/src/monitor/main.c#L128", clock_id=1, clock_resolution_ns=1, time_ns=1610176343641931)
           
       *      (393359338:6553624:23:7, HEARTBEAT_ID_CHECK @ MONITOR, outcome=PASS, tags=[EXPECTATION, SEVERITY_10, heartbeat, monitor], src="c-example/src/monitor/main.c#L128", clock_id=1, clock_resolution_ns=1, time_ns=1610174802709422)
           
       *      (393359338:6553624:24:4, DEINIT @ MONITOR, tags=[monitor], src="c-example/src/monitor/main.c#L82")
           
     
     
  8. If you've noticed a problem, Modality has given you the exact line of code at which to start fixing your bug. If the bug's cause is not yet apparent, move on to the next step: Trace Suspicious System Patterns.

# Step 3: Trace Suspicious System Patterns

If there's no obvious cause for a bug, Modality lets you zoom out and look at the system as a whole. From this altitude, Modality analyzes all of your generative tests and identifies statistically-suspicious patterns. Modality can then zoom in to show the relationship between a suspicious pattern of events and your bug's failing measurement. Modality can point you to the exact code that needs addressing, and in the final steps lets you easily improve your regression tests.

In this step, you'll combine your Modality queries from prior steps to search in a powerful new way. Modality will give you a list of suspicious events, ranked by their correlation to a failing measurement, like your < failing measurement query > from Step 1. Modality can show you the exact path between a failing measurement and a suspicious event, pointing to specific lines of code.

  1. Call modality metrics fault-localization spectrum-based 'session = "< session name >"'. Behind the scenes, Modality will search all of the test cases in this session, comparing failing and passing cases, and finding suspicious events that are correlated with failing cases.

    • If you have multiple sessions from bug discovery, such as part of a CI testing suite, add those sessions into the region expression at the end of this command, like this: modality metrics fault-localization spectrum-based 'session = "< session name >" OR session = "< second session name >" OR session = "< more session names >"'
     
       $ modality metrics fault-localization spectrum-based 'session="long-running-experiment"'
       Scope Failure Summary:
         Scope: StimulusScriptRunning
         Failing scope instances: [long-running-experiment: [5, 12, ..., 86, 78]]
         Passing scope instances: [long-running-experiment: [1, 2, ..., 101, 100]]
         Failures:
           Expectation: CONSUMER_TIMEOUT @ MONITOR
       Suspicious Unigram Program Elements:
         
         Score: 0.594
         Occurred in failing scope instances: [long-running-experiment: [5, 12, ..., 86, 78]]
         Occurred in passing scope instances: [long-running-experiment: [3, 4, ..., 100, 99]]
         Unigram:
            *  (SENT_MEASUREMENT @ PRODUCER AND payload >= 50)
         
         Score: 0.493
         Occurred in failing scope instances: [long-running-experiment: [5, 12, ..., 86, 78]]
         Occurred in passing scope instances: [long-running-experiment: [1, 3, ..., 100, 99]]
         Unigram:
           *  (RECVD_MSG @ CONSUMER AND payload >= 4)
     
     
  2. You will receive a sorted list of suspicious program elements, with the most suspicious elements first. Pick one of these suspicious elements, and under Unigram:, note the < suspicious pattern > in parentheses. They normally follow the pattern (< EVENT NAME > @ < PROBE NAME > AND < OTHER CONDITIONS >).

    • In the example above, the < suspicious pattern > is (SENT_MEASUREMENT @ PRODUCER AND payload >= 50)
  3. Create a new blank text file with the .tq extension to write your query in. For example, correlate-events-query.tq

  4. Start with MATCH and then supply the < suspicious pattern >. Use AS to give this suspicious pattern a label.

    • The example suspicious pattern (SENT_MEASUREMENT @ PRODUCER AND payload >= 50) would look like this:
    MATCH
        (SENT_MEASUREMENT @ PRODUCER AND payload >= 50) AS SuspiciousSentMeasurement,
      
    
  5. Next, insert your < counter-query > from Step 2: Investigate Failing Measurements.

    • Taking your counter-query.tq text file from Step 2:
    MATCH
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    
    • The correlate-events-query.tq file would now look like this:
    MATCH
        (SENT_MEASUREMENT @ PRODUCER AND payload >= 50) AS SuspiciousSentMeasurement,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    
  6. Underneath WHEN, insert an additional line with the label for the suspicious event, followed by an arrow ->, followed by the label for the failing behavior. If there was already a line under WHEN, add AND to the end of your inserted line.

    • This line indicates that you want to match sequences where the suspicious event is followed by the failing behavior.

    • The example correlate-events-query.tq text file would now look like this:

    MATCH
        (SENT_MEASUREMENT @ PRODUCER AND payload >= 50) AS SuspiciousSentMeasurement,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        SuspiciousSentMeasurement -> RecvdMsgB AND
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    
  7. Call modality query --file < correlate-events-query.tq >. This will show you how many times the suspicious event was followed by your failing behavior in your session.

  8. If there aren't enough results to suggest a relationship between the suspicious pattern and your counter-query, call modality metrics fault-localization spectrum-based 'session = "< session name >"' again and choose a different suspicious program element pattern.

    • Adjust your < correlate-events-query.tq > file by replacing the first suspicious pattern with this second suspicious pattern, as in the sketch below.
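    • For example, swapping in the second unigram from the sample output above would give a file like this (SuspiciousRecvdMsg is just an illustrative label):
    MATCH
        (RECVD_MSG @ CONSUMER AND payload >= 4) AS SuspiciousRecvdMsg,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        SuspiciousRecvdMsg -> RecvdMsgB AND
        RecvdMsgA -> RecvdMsgB AFTER 600ms
    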
    • Then work back through the rest of Step 3: Trace Suspicious System Patterns with the new pattern.
  9. If there are enough results to suggest a relationship, call modality query --file < correlate-events-query.tq > --expand. Adding --expand will map out the full path of events between the suspicious event and the failing/buggy behavior, so you can see every potentially-disruptive step along the way.

  10. Once you've found the problem in the path of events, call modality query --file < correlate-events-query.tq > --expand -v. Adding -v shows you where each event lives in your source code, so you can move directly to fixing the bug.
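    • For example, with the file name used in this guide, those last two calls would be:
     
     $ modality query --file correlate-events-query.tq --expand
     $ modality query --file correlate-events-query.tq --expand -v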

# Follow up - If the bug's cause is not yet apparent:

  • If you have the benefit of more than one session, such as from prior field tests or CI tests, call modality metrics fault-localization spectrum-based 'session = "< other session name >"' and compare to see if there has been a noticeable recent change in the suspiciousness rating for any events between the two sessions.

  • Call modality query --file < counter-query.tq > -vv --expand and look for signs that your embedded code might need additional Modality instrumentation to capture the full flow of events.

  • Run modality metrics coverage 'session = "< session name >"' to see if your test scripts are exercising enough of your system to produce meaningful results in your generative tests. See how to write better tests with Modality.

  • Run a broader session of generative testing for a longer time to replicate the issue and collect more statistically-significant data for Modality to crunch. See how to find an undiscovered bug with Modality generative testing for how to expand the parameters of your generative testing objective.

# When you've found the bug's cause

  • First, congrats on finding the cause! Modality is designed so your efforts in one area are easily forwarded to other parts of development.

  • If you are using Modality for your regression testing, we strongly recommend you add the query from your < correlate-events query > as an additional measurement in your regression test objectives. This will be extremely helpful to you in several ways.

    • Looping < correlate-events query > into a Modality objective validates your system against this bug throughout further development.

    • If something similar to this type of bug happens in the future, even out in production, the next round of troubleshooting will have a huge head-start, because the < correlate-events query > will point to the issue with much greater precision than the original measurement query did.
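    • As a rough sketch using the example queries from this guide, the MATCH in your < correlate-events query > can be flipped back into a NEVER check, mirroring the form of the original measurement:
    NEVER
        (SENT_MEASUREMENT @ PRODUCER AND payload >= 50) AS SuspiciousSentMeasurement,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgA,
        (RECVD_MSG@MONITOR AND (payload = 2)) AS RecvdMsgB
    WHEN
        SuspiciousSentMeasurement -> RecvdMsgB AND
        RecvdMsgA -> RecvdMsgB AFTER 600ms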

  • If you followed any of the follow-up steps, there's a good chance you've also permanently improved the instrumentation of your system, broadened your generative tests, or strengthened the robustness of your test scripts.