Using RR to Precisely Locate Linked List Corruption
When debugging complex programs, linked list corruption is a common and tricky problem. When you discover that a linked list has been corrupted, but don’t know exactly when or where it happened, traditional debugging methods often fall short.This is especially challenging in production environments where the corruption might occur long before any symptoms appear. Traditional debugging approaches like adding print statements or using breakpoints are ineffective because they require you to know where to look in advance. This article will introduce how to use RR time-travel debugging capabilities, combined with a binary search strategy, to precisely locate the exact instruction that corrupted the linked list.
Problem Background
LuaJIT’s GC objects are linked in a list. We discovered that a GC object that should have been on the list couldn’t be found there. By the time we detected this problem, we were far from the actual memory corruption event. In this situation, we need to use recording and replay technology to analyze such issues. Memory corruption issues like this are particularly insidious because:
- They often manifest far from the actual point of corruption
- They can cause intermittent failures that are difficult to reproduce
- The symptoms might appear in seemingly unrelated parts of the program
- By the time an error is detected, the program state has changed significantly
In our specific case, a LuaJIT garbage collection object was mysteriously missing from its linked list. This caused crashes during garbage collection cycles, but the actual corruption happened much earlier in the execution flow.
Recording the Target Program
using rr record /path/to/my/program --args
to record the target program.
For example,
1
rr record /usr/local/openresty/nginx/sbin/nginx
Replaying the Record Snapshot
Replay the process with command rr replay
in the recording directory.
1
rr replay
After running the above command, you’ll be in the GDB terminal where you can execute standard GDB commands.
The RR provides a GDB-compatible interface with additional time-travel capabilities. This means you can use familiar GDB commands along with powerful time-travel commands like:
reverse-continue
: Run the program backward until a breakpoint is hitreverse-step
: Step backward one source linereverse-next
: Step backward, skipping over function callsreverse-nexti
: Step backward one machine instruction
These commands allow you to navigate backward through the program’s execution history, which is crucial for tracking down the exact moment when memory corruption occurs.
Debugging Strategy: Instruction-level Binary Search
Our methodology comprises two principal components:
- Utilize gdb to execute the program until reaching the nearest known breakpoint to the problematic area
- Implement a binary search algorithm that leverages the
nexti
andreverse-nexti
commands to identify with precision the specific instruction responsible for the linked list corruption
Implementation Steps
1. Count the number of lua_resume calls
To use the binary search method, we first need to count how many times lua_resume() is called. We use the following Python script to count the total number of calls
The lua_resume function is a key entry point in the LuaJIT execution flow, making it an ideal landmark for our binary search. By counting these calls, we establish reference points that help narrow down our search space efficiently.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import gdb
# Register GDB command to start the counter
class LuaResumeCounter(gdb.Command):
def __init__(self):
super(LuaResumeCounter, self).__init__("count-lua-resume", gdb.COMMAND_USER)
def invoke(self, arg, from_tty):
gdb.execute("delete")
# Go to the beiginning of the record
gdb.execute("reverse-continue")
# Create a new breakpoint
bp = gdb.Breakpoint("lua_resume")
# Make it silent
bp.silent = True
# Set up a GDB variable to track the count
gdb.execute("set $lua_resume_count = 0")
# Define what to do when the breakpoint is hit
command_script = """
set $lua_resume_count = $lua_resume_count + 1
continue
"""
# Set the commands for the breakpoint
bp.commands = command_script
print("lua_resume counter started, calls will be counted silently")
# Automatically start execution
gdb.execute("continue")
count = int(gdb.parse_and_eval("$lua_resume_count"))
print(f"lua_resume has been called {count} times")
# Initialize commands
LuaResumeCounter()
2. Implement the check function
After counting the number of lua_resume() calls, we need to determine whether the target node is in the linked list when the breakpoint is triggered. Therefore, we use the following function for detection.
This check function traverses the linked list of GC objects to determine if our target object is present. We’ll use this as our “oracle” during the binary search to determine whether we’re examining a point in time before or after the corruption occurred.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
def check_address_in_list(start_addr, target_addr):
"""
Check if a target address is present in the linked list of GCobj objects.
Args:
start_addr: The starting address as an integer
target_addr: The address to look for as an integer
Returns:
True if the target address is found in the list, False otherwise
"""
current_addr = start_addr
print(f"Searching for 0x{target_addr:x} in GC list starting from 0x{start_addr:x}")
# Set a reasonable limit to prevent infinite loops
max_iterations = 100000
iterations = 0
while current_addr != 0 and iterations < max_iterations:
# Check if this is our target
if current_addr == target_addr:
print(f"Found target address 0x{target_addr:x} in GC list!")
return True
try:
# Cast the address to a GCobj pointer
gcobj = gdb.parse_and_eval(f"(GCobj *)0x{current_addr:x}")
# Get the next pointer as an integer
next_ptr = int(gcobj["gch"]["nextgc"]["gcptr64"])
# If we've reached the end or a cycle, break
if next_ptr == 0:
break
#print("End of list reached")
# Move to next object
current_addr = next_ptr
except Exception as e:
print(f"Error accessing object at 0x{current_addr:x}: {e}")
break
iterations += 1
if iterations >= max_iterations:
print(f"Reached maximum iterations ({max_iterations}), search stopped")
#print(f"Target address 0x{target_addr:x} not found in GC list")
return False
3. Implement the binary search algorithm
Next, we implement binary search to locate the problematic instruction:
The binary search algorithm is particularly effective for this debugging scenario because:
- It reduces the search space by half with each iteration
- It has logarithmic time complexity (O(log n)), making it efficient even for large execution traces
- It systematically narrows down the exact point of corruption without requiring manual inspection of the entire execution
Our implementation leverages GDB’s Python API to automate the search process:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
class BinarySearch(gdb.Command):
def __init__(self):
super(BinarySearch, self).__init__("binary-search", gdb.COMMAND_USER)
def invoke(self, arg, from_tty):
# Set up GDB variables to track the count and target
total_count = int(gdb.parse_and_eval("$lua_resume_count"))
low = 0
high = total_count
gdb.execute("set pagination off")
gdb.execute("delete")
while low < high:
mid = (low + high) // 2
gdb.execute("reverse-continue")
gdb.execute("set $lua_resume_count = 0")
gdb.execute(f"set $target_resume_count = {mid}")
print(f"check on {mid} hits")
# Create a new breakpoint with proper GDB commands
bp = gdb.Breakpoint("lua_resume")
# Define the breakpoint commands using proper GDB syntax
commands = """
silent
set $lua_resume_count = $lua_resume_count + 1
if $lua_resume_count < $target_resume_count
continue
else
printf "Stopped at lua_resume call #%d\\n", $lua_resume_count
end
"""
# Set the commands for the breakpoint
bp.commands = commands
# The following addresses is just for show
exist = check_address_in_list(0x7ffff7bccaf8, 0x12345678)
if exist:
low = mid + 1
else:
high = mid - 1
gdb.execute("delete")
gdb.execute("continue")
# Initialize commands
BinarySearch()
Locating the Specific Instruction
We have already located the lua_resume call closest to where the problem occurs, but this is not sufficient to troubleshoot the issue. We need to pinpoint the specific machine instruction. We can use the same binary search method as above to find the exact instruction.
At this point, we need to combine the nexti
command to achieve instruction-level precision. Since the algorithm is similar to the steps above, we will not provide the specific code here, leaving it for interested individuals to implement.
It’s important to note that at this stage, we need to limit the nexti
search range between two consecutive lua_resume calls.
To implement the instruction-level binary search:
- Identify the two consecutive lua_resume calls that bracket the corruption point
- Count the number of instructions between these calls (this can be done with a simple script)
- Apply the binary search algorithm at the instruction level using
nexti
andreverse-nexti
- At each step, check if the target node is in the linked list
- Narrow down until you find the exact instruction that causes the corruption
This approach can pinpoint the exact assembly instruction responsible for the corruption, even in complex codebases with millions of instructions between relevant points.
The power of this approach is that it works regardless of the complexity of the code path between the two lua_resume calls. Even if there are function calls, loops, or conditional branches, the binary search will methodically identify the exact instruction that causes the corruption.
Real Debugging Example
Here is an example of our analysis using RR:
- First, we load the record file with RR.
- Use the source command to load the Python script.
- Execute count-lua-resume to count the number of calls.
- Execute binary-search to find the nearest lua_resume function call where the problem occurs.
This example demonstrates how we applied our methodology to a real-world problem in a production environment. The issue had been evading traditional debugging approaches for weeks, but we were able to identify the root cause within hours using this technique.
$ rr replay
start 1> source luajit-debug.py
lua_resume counter script loaded
Use 'count-lua-resume' command to start counting and continue execution
Use 'binary-search' command to search the nearest position
start 1> count-lua-resume
0x00007ffff730de1a in epoll_wait (epfd=9, events=0x6fc260, maxevents=512, timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 in ../sysdeps/unix/sysv/linux/epoll_wait.c
Breakpoint 1 at 0x7ffff7f03630: file lj_api.c, line 1276.
lua_resume counter started, calls will be counted silently
Program received signal SIGSTOP, Stopped (signal).
Have reached end of recorded history.
0x00007ffff730de1a in epoll_wait (epfd=9, events=0x6fc260, maxevents=512, timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 in ../sysdeps/unix/sysv/linux/epoll_wait.c
lua_resume has been called 31 times
binary-search
Have reached start of recorded history.
0x00007ffff730de1a in epoll_wait (epfd=9, events=0x6fc260, maxevents=512, timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 in ../sysdeps/unix/sysv/linux/epoll_wait.c
check on 15 hits
Breakpoint 2 at 0x7ffff7f03630: file lj_api.c, line 1276.
Searching for 0x12345678 in GC list starting from 0x7ffff7bccaf8
...
After identifying the specific lua_resume call where the corruption occurs, we proceeded with the instruction-level binary search. The instruction-level search process is similar to the search for the nearest lua_resume call described above, so we have not provided the specific implementation.
Conclusion
RR’s time-travel debugging capabilities, when integrated with an instruction-level binary search methodology, constitute a sophisticated toolset for resolving complex issues such as linked list corruption. This analytical framework is not exclusively applicable to linked lists but can be extended to diagnose corruption across diverse data structures.
By identifying with precision the exact instruction at which corruption manifests, developers can ascertain the fundamental cause of the issue rather than addressing only its symptomatic manifestations. While this debugging approach may present initial complexity, it substantially diminishes the time and resources required when confronting problems that conventional debugging methodologies cannot effectively resolve.
Broader Applications
The methodology described in this article extends beyond linked list corruption and can be applied to various challenging debugging scenarios:
- Race conditions: By recording multiple executions and comparing them, you can identify non-deterministic behavior
- Memory leaks: Trace object allocations and deallocations over time to identify patterns
- Stack corruption: Pinpoint exactly when stack memory is overwritten
- Heap corruption: Identify buffer overflows or use-after-free issues with precision
- API misuse: Detect incorrect sequences of function calls that lead to undefined behavior
Performance Considerations
While recording does introduce some overhead, modern implementations like RR are optimized to minimize impact:
- Typical recording overhead is 2-5x normal execution speed
- Storage requirements vary based on program complexity and recording duration
- For targeted debugging sessions, the benefits far outweigh the temporary performance impact
Best Practices
To get the most out of this debugging approach:
- Start with a focused recording that captures just enough context around the suspected issue
- Use application knowledge to select appropriate landmarks (like lua_resume in our example)
- Combine binary search with other debugging techniques like memory watchpoints
- Document your findings to help others understand similar issues in the future
- Consider implementing automated tests that can detect similar corruptions early
This article aims to provide valuable insights into advanced debugging techniques. Should you require additional clarification or further debugging strategies, please submit your inquiries in the discussion section below.