I was using LLVM to do some analysis. Today a weird segmentation fault bug drove me nuts and changed my impression that segfault is relatively straightforward.

Essentially the gadget that’s causing problem is trying to process the subprogram(function) debug information inside a module, put into vector and then sort(because the original one is not always sorted, and we need to do a lot of search later) based on the tuple(directory, filename, line number) ordering. Here’s an outline of the snippet:

It already “works well” for some time. The largest bc file it processes is 130M+. But today when testing on a smaller input(84M), it segfaults.

Fair enough. Got the backtrace:

#0 0x0855be14 in llvm::MDNode::getNumOperands (this=0×40009) at /home/ryan/Projects/llvm/src/include/llvm/Metadata.h:142
#1 0x086ad22a in llvm::DIDescriptor::getUInt64Field (this=0x21adb79c, Elt=0) at DebugInfo.cpp:70
#2 0xb7fd5caa in llvm::DIDescriptor::getUnsignedField (this=0x21adb79c, Elt=0) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:69
#3 0xb7fd5ced in llvm::DIDescriptor::getVersion (this=0x21adb79c) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:99
#4 0xb7fd63ca in llvm::DISubprogram::getDirectory (this=0x21adb798) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:540
#5 0xb7fd34e5 in cmpDISP (SP1=…, SP2=…) at Matcher.cpp:8
#6 0xb7fd980e in std::__unguarded_partition<__gnu_cxx::__normal_iterator<llvm::DISubprogram*, std::vector<llvm::DISubprogram> >, llvm::DISubprogram, bool (*)(llvm::DISubprogram const&, llvm::DISubprogram const&)> (__first=…, __last=..
., __pivot=…, __comp=0xb7fd34c0 <cmpDISP(llvm::DISubprogram const&, llvm::DISubprogram const&)>) at /usr/include/c++/4.5/bits/stl_algo.h:2232

 

The definition of getNumOperands

Wat? How could this cause seg fault?  The class instance? No:

After adding some more guarding check of NULL MDNode in the code but the problem still persists, I started to consult my friend Google. But unfortunately, not much related info. Go back to the trace. One of the call is getDirectory  from the compare function. Could it be DISubprogram? Then I vaguely remember a clue that in the class definition there’s some warning in comment about using container for DIxx. And indeed in DISubprogram’s base class there’s such a comment:

/// DIDescriptor – A thin wraper around MDNode to access encoded debug info.
/// This should not be stored in a container, because the underlying MDNode
/// may change in certain situations.

I thought I finally nailed the reason. So the fix is to use a new class to store some of the DISubprogram’s fields so that even if the MDNode gets changed, it won’t affect the later sort and search. Compare function is also changed accordingly. The result is disappointing, still didn’t go away. The bt is a bit different though. Because the compare function now accesses directory(NULL) field instead of getDirectory, which is more typical and “understandable”. Adding a check? No use. It segfaults on accessing SP2.directory which is not NULL, but is invalid to access.

Calm down and do more retrospection. Could it be I used StringRef in class fields could have some dangling pointer problem(as stated in the programmer’s manual). So I change them to std::string. Again, ooops…

At this point. I almost run out of tries. One thing is, if we don’t sort, then no segfault. But skip sorting here will bring a lot of trouble here, and we cannot dump it to disk using some external program to do the job. Because the DISubprogram also contains the Function class pointer which is essential to our use later. The last clue is on sorting.

Could STL has bug inside sort?

Threw some keyword in Google and one of the result in SO(http://stackoverflow.com/questions/1541817/sort-function-c-segmentation-fault) makes a point in one issue of sort: the compare function needs to have strict weak ordering(http://www.sgi.com/tech/stl/StrictWeakOrdering.html). Going back to check the compare function(in the beginning). Indeed the stupid mistake is that  the boundary case when the last cmp is 0, false should be returned instead of true. After fixing this, it’s finally back to normal! Double checking the data, there are several tuples with the same three elements(empty, empty, 0) because the debug info cannot be obtained.

Now, it’s surprising that sort will segfault on buggy compare function instead of just giving bogus result.

So Let’s see why. Here is part of   sort implementation in GNU libstdc++v3:

Three key variables’ values are:

(gdb) p __pivot
$1 = (const DISPCopy &) @0xb7c58008: {directory = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc “”}}, filename = {stat
ic npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc “”}}, name = {static npos = <optimized out>, _M_dataplus = {<std::allocator<c
har>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc “”}}, linenumber = 0, lastline = 0, function = 0×0}

(gdb) p *__first
$4 = (DISPCopy &) @0xb7c58050: {directory = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a72c “/home/ryan/Projects/MySQL/m
ysql-server/5.1/mysys”}}, filename = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a774 “array.c”}}, name = {static npos =
<optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a78c “pop_dynamic”}}, linenumber = 179, lastline = 0, function = 0x90ff8d8}

(gdb) p *__last
$7 = (DISPCopy &) @0xb7c57ff0: <error reading variable>

As we see here: the pivot’s value is the very value(empty) with multiple occurrence.  The partition code will slides forward the __first iterator till the first value that’s not less than pivot and then slides  the __last iterator backwards till the first value that pivot’s not less than. The underlying assumption is compare function should have strict weak ordering property, which cmpDISP fails to have for the cases of exactly same values, so that  sliding won’t over step. I guess the name unguarded_partition mean that unless compare function provide the guarantee, the sliding is not guarded to check boundary crossing.

Note that it will not always segfault on input with several same values. The bigger file that worked before actually also have a few same empty values. Only when that value is picked as pivot and some values lie in the boundaries of the container.

In hindsight, the mistake is really due to careless coding practice, and the diagnosis could be sped up by looking into the backtrace a bit further(#6 is the actual scene). But a few other lessons to me are:

  1. Segfault sometimes is not as straightforward as I thought it was.
  2. It might *happen* in standard library that we tend to ignore.
  3. Be rigorous in the first place.