Connectivity Issue?


While playing with Azure, I tried to deploy a simple ASP.NET website from Microsoft Visual Studio. The database connection part got me stuck. I had created a SQL server in Azure and added a firewall rule, as suggested, to allow access from my client. But with the server's manage URL (as shown in the dashboard) entered as the server name and the right credentials, I kept getting the following error message when testing the connection:

 

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connection. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server)

 

The message is clear and contains suggestions. I was carried away with the impression that it must be an incorrect firewall setting, as I'm on an intranet. I messed with both the Azure-side settings and my Windows Firewall rules, but nothing really helped. Following the troubleshooting guide at http://social.technet.microsoft.com/wiki/contents/articles/1719.windows-azure-sql-database-connectivity-troubleshooting-guide.aspx, I turned to SQLCMD to test.

The good thing about SQLCMD is that it tells you explicitly whether the client IP (and what it is) is allowed to access the server. Indeed, the IP automatically detected by the Azure web portal, which I had added to the allow-list rule in Azure, was not correct. After fixing that, I could use SQLCMD to connect to the SQL server. But in Visual Studio the issue was still not resolved, which was really strange.

 

In the end, recalling one small mistake that I hadn't paid attention to saved the day. When I first used SQLCMD, I typed the server's full URL, scheme included, as the server name, and SQLCMD complained:

Sqlcmd: Error: ‘-’ or ‘/’ does not have an associated argument.

This is because Windows command-line tools treat /x as option x, so the slashes in the URL confuse the argument parsing. After I removed the scheme part, everything worked.

I finally realized I should supply the host name instead of the URL, in Visual Studio as well. So it was essentially a format error rather than a connectivity issue. Well, in the strict sense, it is still connectivity. But connectivity is a broad, ambiguous issue that can have multiple causes. It would be nice to have a “Principle of Most Superficial” for reporting errors: check the superficial, simple causes first to make the troubleshooting more targeted. Reading the error message from Visual Studio again, I found it actually said “verify that the instance name is correct”. But as it was mixed in with the connectivity issue, I got too obsessed with the “advanced” causes :)…
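
To make the "host name, not URL" point concrete, here is a minimal C++ sketch of the kind of normalization a connection dialog could apply before trying to connect. The helper name host_from_url and the example host are made up for illustration; nothing here is taken from SQLCMD or Visual Studio.

```cpp
#include <string>

// Hypothetical helper: reduce a URL such as "https://host/path" to "host".
// A plain host name is returned unchanged.
std::string host_from_url(const std::string& input) {
    std::string s = input;
    // Drop the scheme part ("https://", "http://", ...), if present.
    const std::string::size_type scheme = s.find("://");
    if (scheme != std::string::npos) {
        s.erase(0, scheme + 3);
    }
    // Drop everything from the first remaining slash (the path), if any.
    const std::string::size_type slash = s.find('/');
    if (slash != std::string::npos) {
        s.erase(slash);
    }
    return s;
}
```

With this, host_from_url("https://myserver.example.net/") gives "myserver.example.net", which is the form a server-name field expects.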

 

 

Runaway cron jobs


Today I finally nailed the (unusual) root cause of the “runaway” cron jobs that had puzzled me for a while…

The story:

In TAing CSE 120 this quarter, we provide an autograder that runs periodically (every 10 minutes) to grade students' trial submissions and give them initial feedback before the final turn-in. The job is scheduled using crontab. The grading machine is from the department. I SSHed to the machine from my desktop to set it up. Also, all my terminals usually live in a GNU Screen session.

Occasionally I need to reboot my desktop, so the terminals that held the SSH session get closed. This shouldn't be a problem for crontab, as it's a background job. Afterwards, I re-SSHed to the machine from my desktop. When I listed the cron jobs with crontab -l, it said `no crontab for xxx’. WTF… This shouldn't happen with crontab. Now, of course, I couldn't edit the job with crontab -e or delete it with crontab -r either.

But the job was actually still running on schedule (based on the log from the grading script). So it appeared to me that the cron job had run away and I couldn't control it. My immediate thought was that crontab was probably doing some detection to bind a job to the SSH session that created it, and that the job runs away if the session is closed. But the problem was not deterministic. I tested several times, and sometimes I could still control the job after restarting. Also weird: even when I kept the SSH session that started the job and SSHed in from my laptop, I sometimes couldn't view it. I even started to suspect GNU Screen, but that possibility was eliminated when I tested without Screen.

As I don't have root privilege on the department machine, each time this happened I had to ask the administrator to remove the runaway job. It was a bit annoying. I was really curious what special mechanism crontab used to decide which terminals a job is visible to, and planned to read the source code some time.

Today in diagnosing an interesting problem  in the autograder, I accidentally discovered the secret. It turns out the root cause actually has nothing to do with crontab…

The root cause:

Basically, the grading machine I used from the department is behind a load balancer. When I SSHed to the address xxx.ucsd.edu, which is actually a portal, it dynamically redirected me to one of the sub-machines (e.g., S1, S2, S3) without my knowledge (I found the clue in the shell prompt [user@xxx-S1]). And a cron job is local to a sub-machine. So when I set up the job in one SSH session, say on sub-machine S1, the job lives only on S1. Later my SSH session may be redirected to S2, and of course I can't see the cron job from S2…

A load balancer is a great thing. As a user, I don't need to worry about which sub-machine to choose or remember their addresses. But this transparency also applies to failures. It can create the illusion that a problem local to one sub-machine is a global one, especially when the user is unaware of the redirection.
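
One cheap safeguard, sketched here in C++ under the assumption of a POSIX system, is to tag every log line of the periodic job with the name of the host it actually ran on, so the sub-machine shows up in the log. The helper name tag_with_host is made up for illustration; gethostname is the standard POSIX call.

```cpp
#include <string>
#include <unistd.h>  // gethostname (POSIX)

// Hypothetical helper: prefix a log message with the local host name,
// producing something like "[xxx-S1] graded".
std::string tag_with_host(const std::string& message) {
    char name[256] = {0};
    // Leave the last byte untouched so the buffer stays NUL-terminated
    // even if the host name is truncated.
    if (gethostname(name, sizeof(name) - 1) != 0) {
        return "[unknown-host] " + message;
    }
    return "[" + std::string(name) + "] " + message;
}
```

Had the grading script logged through something like this, the log alone would have shown that the job only ever ran on one sub-machine.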

Fun, right? :)

 

 

 

 

__cxa_demangle doesn’t update buffer length


Compiler/library bugs are rare and are often among the last things you would suspect. I was lucky to hit one.

In C++, source code identifiers are “mangled” when compiled into object code so that the names carry additional information such as argument types. For instance, the function int add(int a, int b) will be referred to as _Z3addii in the object code. The reverse process, “demangling”, restores the encoded identifier to its original form. We can use the command-line tool c++filt to do so: c++filt _Z3addii will output add(int, int).

GCC provides an API to demangle an identifier programmatically: abi::__cxa_demangle. It takes a mangled name and returns the demangled string in an output buffer. If the output buffer is NULL, the API allocates a buffer long enough to hold the result. Otherwise, if the output buffer is not long enough, the API expands the buffer and reports the new length through the length pointer.

In my program, I wrote a simple wrapper for the API to reduce some allocation overhead as demangling will be used many times in my program:

But debug information showed that, after a while, the buffer was reallocated on every call, even when the previous buffer was already large enough. One of the triggering inputs is

The demangled output is  store_trigger(THD*, st_table*, st_mysql_lex_string*, st_mysql_lex_string*, st_mysql_lex_string*, trg_event_type, trg_action_time_type, st_mysql_lex_string*, unsigned long, st_mysql_lex_string*, st_mysql_lex_string*, st_mysql_lex_string*, st_mysql_lex_string*) (OMG…, 260 characters).

The reason turns out to be a problem in the API's implementation. In the buggy version of libstdc++, when the buffer is reallocated, the value behind the non-null length pointer is not updated to the new buffer length but set to zero. So after the pathological input, MBUF_LEN is always zero, which the API takes as the length of the passed-in buffer, causing the buffer to be reallocated every time. This bug is fixed in GCC 4.5.
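
For illustration, here is a minimal sketch of a buffer-reusing wrapper that tolerates the zero-length behavior described above. It is not my original wrapper: the class name Demangler and its members are made up for this example, and the workaround (never letting the remembered length drop below the size of the string just returned) is just one possible approach.

```cpp
#include <cxxabi.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical wrapper that reuses one malloc'ed buffer across calls.
class Demangler {
public:
    Demangler() : buf_(nullptr), len_(0) {}
    ~Demangler() { std::free(buf_); }
    Demangler(const Demangler&) = delete;
    Demangler& operator=(const Demangler&) = delete;

    // Returns the demangled name, or the input unchanged on failure.
    std::string demangle(const char* mangled) {
        int status = 0;
        std::size_t len = len_;
        char* out = abi::__cxa_demangle(mangled, buf_, &len, &status);
        if (status != 0 || out == nullptr) {
            return mangled;  // buf_ and len_ are left as they were
        }
        buf_ = out;  // may be a new, larger buffer
        // Guard against a bogus (e.g. zero) reported length: the buffer
        // holds at least the string that was just written into it.
        const std::size_t needed = std::strlen(out) + 1;
        len_ = (len < needed) ? needed : len;
        return out;
    }

    std::size_t capacity() const { return len_; }

private:
    char* buf_;        // buffer owned by this object (malloc/free)
    std::size_t len_;  // lower bound on the size of buf_
};
```

Calling demangle("_Z3addii") on such an object yields add(int, int), and later calls with short names reuse the same buffer instead of reallocating.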

 

Joke of the day


I was trying to install an IM application called Fetion, which is mainly used in China for sending short messages, in my virtual machine, but the installer kept complaining that it had detected another running Fetion and that I needed to close that process. But that VM was a clean installation with only a few apps installed. It shouldn't have had Fetion installed. In the task manager, there was no process related to Fetion except the installer itself. It confused me for a while. Then I realized, damn, after downloading, I had renamed the installer to fetion.exe. The check is based on whether there's a process called fetion.exe. But, oooops, it never expected that it could itself be the target… Bazinga!
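
I don't know how the installer actually performs the check, so the following is only a C++ sketch of the idea under that caveat: a name-based "already running?" test that skips the checking process itself. The types and names (ProcessInfo, another_instance_running) are invented for the example, and the process list is passed in as plain data instead of being read from the OS.

```cpp
#include <string>
#include <vector>

// Hypothetical snapshot entry of a running process.
struct ProcessInfo {
    int pid;
    std::string image_name;
};

// True if some process OTHER than the caller has the target image name.
// Comparing names exactly is a simplification; Windows file names are
// case-insensitive.
bool another_instance_running(const std::vector<ProcessInfo>& processes,
                              int self_pid,
                              const std::string& target_name) {
    for (const ProcessInfo& p : processes) {
        if (p.pid != self_pid && p.image_name == target_name) {
            return true;
        }
    }
    return false;
}
```

With the self_pid test in place, an installer renamed to fetion.exe would no longer trip over itself.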

What you wish to know if you could start over?


Today, when I was serving on the graduate student panel for the new student orientation, someone threw out an interesting what-if question:

What would you wish to know if you could start your graduate school life over?

A senior panelist quickly gave a very good answer: fail early and fail fast (a Silicon Valley motto). Graduate life can be stressful; you may screw up, and things may not work out as you planned. But it will turn out that's not the end of the world. Got tangled up? Pick up the lessons and just tango on! It's not only OK but also a good thing for you to fail early, greatly and fast. A related quote from Michael Jordan:

I’ve missed more than 9000 shots in my career. I’ve lost almost 300 games. 26 times, I’ve been trusted to take the game winning shot and missed. I’ve failed over and over and over again in my life. And that is why I succeed.

From personal experience, I found this absolutely true. There was a time when I felt very discouraged and overwhelmed by several consecutive rejections of projects that I believed in. My advisor told me that this is a process of accumulation and growth, and that I shouldn't let external random factors affect my self-confidence. As things gradually became smoother, I figured out, in hindsight, that the skills gained and lessons learned from these failures added up to valuable experience, and what's more, they made me more thoughtful and better prepared for the future. Failures are also signs that I didn't dream away the time but acted and tried. If I had chosen an easy way in the beginning, I would probably have encountered fewer failures and less “pain” in the short term. But in the long run, there could be many more obstacles awaiting.

The topic shifted to another question before I could organize my reply. Three major (perhaps vague) points that I wanted to add to the “wish list” are:

  • Be more introspective.
  • Communicate more with advisor and fellow graduate students.
  • Be more courageous and bold.

Weeks ago I finished reading a recent popular memoir of PhD life, The PhD Grind, by Philip Guo. The book is not a cliché about how to excel in graduate school. It's a real story detailing the trajectory of what it takes to get a PhD, including the “dark side” and struggles that are seldom revealed in publications. The experience in this book is just one perspective and may be too specific for you to relate to. You may be very lucky to have a caring advisor and to work on projects you're excited about. But the level of introspection and the relentless drive reflected in the book are qualities worth possessing, no matter what situation you are in.

The last item is an unsubstantiated thought. I realized I had gradually become less audacious about thinking boldly, due to pragmatism or personality. Whichever it is, this is bad, especially given that I'm doing research, which is supposed to be cutting-edge. Without the edges, how could it sharpen the boundaries? In particular, I wish I had taken more charge of the exploration. After all, a PhD is not just about blindly turning what the advisor said into publishable papers.

Of course, this is the beginning of my third year. I still have time to keep grinding to prevent these wishes from being just wishes (lessons).

 

Do they assume too much


On a machine where you don't have root access and have to build lots of software, maybe even essential tools (compiler, linker, etc.), from the ground up, many bizarre problems emerge, which occasionally make people want to kick the machine:

  • Dependent packages missing;
  • Environment variables (CC, xxPATH) not set properly (or set properly but not respected);
  • Versions not compatible with others;
  • UNKNOWN.

With the error message plus a bit of searching, many of the glitches can be solved. Some remain mysterious and unsolved because they are too specific to your machine and environment, which the generalist Google can't handle well. E.g., in compiling LLVM with binutils and GCC, it took me a lot of time to work out the golden combination that works on my *particular* desktop but not on my laptop, and I still can't Google out why, even after providing many possibly related keywords.

I have to admit that having a package management system is indeed a bullet that saves you from most of these troubles (although it's not silver and has a lot of room for improvement).

This time, on a machine where the GNU Autotools (autoconf, automake, libtool) are not installed or are outdated, I once again had to go through this building process. Usually I put each piece of software's source in ~/software/younameit/src and set its install prefix to ~/software/younameit/build. Fortunately, this time there wasn't any issue in building and installing the Autotools. After I put them into PATH, everything seemed fully set up and ready to go. But when invoking the build script for a project, the following error occurred:

It seems libtool is not recognized. But exporting LIBTOOL doesn't help. Some Google results suggest it's a version problem, but I used the same combination of the three tools' versions as on a machine that has Autotools installed and works like a charm. A few workarounds suggest modifying configure.in or passing the definition when building, but that's not what I want. I also tried invoking the bootstrap executable in their sources, which didn't make a difference. I checked the README and INSTALL of each one: nothing special about configuration.

For this kind of problem, apart from trying this and that, there isn't really a systematic “diagnosis” procedure. Somehow, I started to think about the layout. The only difference I could think of between the working machine and this machine was that which libtool gives /usr/bin/libtool on the former and ~/software/younameit/build/bin/libtool on the latter. On the working machine the other two also live in /usr/bin, but here autoconf and automake didn't live in the same bin as libtool. This indeed turns out to be the caveat:

I have to install them to the same location!

The reason? They share the same *share* directory (containing macros), so if they are installed in separate locations the share can't really be shared among them. Finally I found this is actually explained in a book (http://sources.redhat.com/autobook/autobook/autobook_244.html#SEC244). But it's a single sentence in a “How to install Autotools in Cygwin” section. How could it expect people to notice?

Software and documents sometimes really assume too much. I don't know whether I'll be lucky enough to hit the right direction again next time. Configuration and compatibility are a pain in the ass, especially when developers or document writers constantly rely on users to stumble onto the solutions.

 

 

Sometimes, segmentation fault is not that straightforward


I was using LLVM to do some analysis. Today a weird segmentation fault drove me nuts and changed my impression that a segfault is relatively straightforward.

Essentially, the gadget that's causing the problem processes the subprogram (function) debug information inside a module, puts it into a vector and then sorts it (because the original is not always sorted, and we need to do a lot of searching later) based on the tuple (directory, filename, line number) ordering. Here's an outline of the snippet:

It had already “worked well” for some time. The largest bc file it had processed was 130M+. But today, when testing on a smaller input (84M), it segfaulted.

Fair enough. Got the backtrace:

#0 0x0855be14 in llvm::MDNode::getNumOperands (this=0x40009) at /home/ryan/Projects/llvm/src/include/llvm/Metadata.h:142
#1 0x086ad22a in llvm::DIDescriptor::getUInt64Field (this=0x21adb79c, Elt=0) at DebugInfo.cpp:70
#2 0xb7fd5caa in llvm::DIDescriptor::getUnsignedField (this=0x21adb79c, Elt=0) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:69
#3 0xb7fd5ced in llvm::DIDescriptor::getVersion (this=0x21adb79c) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:99
#4 0xb7fd63ca in llvm::DISubprogram::getDirectory (this=0x21adb798) at /home/ryan/Projects/llvm/src/include/llvm/Analysis/DebugInfo.h:540
#5 0xb7fd34e5 in cmpDISP (SP1=…, SP2=…) at Matcher.cpp:8
#6 0xb7fd980e in std::__unguarded_partition<__gnu_cxx::__normal_iterator<llvm::DISubprogram*, std::vector<llvm::DISubprogram> >, llvm::DISubprogram, bool (*)(llvm::DISubprogram const&, llvm::DISubprogram const&)> (__first=…, __last=…, __pivot=…, __comp=0xb7fd34c0 <cmpDISP(llvm::DISubprogram const&, llvm::DISubprogram const&)>) at /usr/include/c++/4.5/bits/stl_algo.h:2232

 

The definition of getNumOperands

Wat? How could this cause a segfault? The class instance? No:

After I added some more guarding checks for a NULL MDNode in the code, the problem still persisted, so I started to consult my friend Google. Unfortunately, there was not much related info. Back to the trace. One of the calls is getDirectory from the compare function. Could it be DISubprogram? Then I vaguely remembered a clue: in the class definition there's a warning in a comment about using containers for DIxx. And indeed, in DISubprogram's base class there's such a comment:

/// DIDescriptor – A thin wraper around MDNode to access encoded debug info.
/// This should not be stored in a container, because the underlying MDNode
/// may change in certain situations.

I thought I had finally nailed the reason. So the fix was to use a new class to store some of the DISubprogram's fields so that even if the MDNode changes, it won't affect the later sort and search. The compare function was changed accordingly. The result was disappointing: the segfault still didn't go away. The backtrace was a bit different though, because the compare function now accessed the directory field (NULL) instead of calling getDirectory, which is more typical and “understandable”. Adding a check? No use. It segfaulted on accessing SP2.directory, which was not NULL but was invalid to access.

Calm down and do more retrospection. Could the StringRef I used in the class fields have a dangling pointer problem (as warned in the programmer's manual)? So I changed them to std::string. Again, ooops…

At this point, I had almost run out of things to try. One observation: if we don't sort, there's no segfault. But skipping the sort would bring a lot of trouble, and we cannot dump the data to disk and let some external program do the job, because the DISubprogram also contains the Function class pointer, which is essential to our later use. The last clue was the sorting.

Could STL have a bug inside sort?

I threw some keywords at Google, and one of the results on SO (http://stackoverflow.com/questions/1541817/sort-function-c-segmentation-fault) makes a point about one requirement of sort: the compare function needs to be a strict weak ordering (http://www.sgi.com/tech/stl/StrictWeakOrdering.html). Going back to check the compare function (at the beginning): indeed, the stupid mistake is in the boundary case where the last cmp is 0; false should be returned instead of true. After fixing this, it's finally back to normal! Double-checking the data, there are several tuples with the same three elements (empty, empty, 0) because the debug info cannot be obtained.
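
As an illustration of a comparator that satisfies strict weak ordering for this kind of key, here is a minimal sketch. It is not my actual cmpDISP: the struct SubprogramKey and the function key_less are stand-ins invented for the example, and std::tie (C++11) is used so that the "all fields equal" case naturally returns false.

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in for the copied debug-info record.
struct SubprogramKey {
    std::string directory;
    std::string filename;
    unsigned linenumber;
};

// Lexicographic comparison on (directory, filename, linenumber).
// Tuple operator< is a strict weak ordering: key_less(x, x) is false.
bool key_less(const SubprogramKey& a, const SubprogramKey& b) {
    return std::tie(a.directory, a.filename, a.linenumber) <
           std::tie(b.directory, b.filename, b.linenumber);
}
```

A quick sanity check for any hand-written comparator is that comparing an element with itself returns false, and that key_less(a, b) and key_less(b, a) are never both true.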

Now, it's surprising that sort segfaults on a buggy compare function instead of just giving a bogus result.

So let's see why. Here is part of the sort implementation in GNU libstdc++-v3:

Three key variables’ values are:

(gdb) p __pivot
$1 = (const DISPCopy &) @0xb7c58008: {directory = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc ""}}, filename = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc ""}}, name = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7fa47bc ""}}, linenumber = 0, lastline = 0, function = 0x0}

(gdb) p *__first
$4 = (DISPCopy &) @0xb7c58050: {directory = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a72c "/home/ryan/Projects/MySQL/mysql-server/5.1/mysys"}}, filename = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a774 "array.c"}}, name = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2760a78c "pop_dynamic"}}, linenumber = 179, lastline = 0, function = 0x90ff8d8}

(gdb) p *__last
$7 = (DISPCopy &) @0xb7c57ff0: <error reading variable>

As we see here, the pivot's value is the very value (empty) with multiple occurrences. The partition code slides the __first iterator forward to the first value that's not less than the pivot and then slides the __last iterator backward to the first value that the pivot is not less than. The underlying assumption is that the compare function is a strict weak ordering, so that the sliding won't overstep; cmpDISP fails to have this property for exactly equal values. I guess the name unguarded_partition means that the sliding itself is not guarded by a boundary check, so it relies on the compare function to provide the guarantee.
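
To see the overstepping without crashing, here is a small bounds-checked imitation of that partition scan. It is a sketch written for this post, not the libstdc++ code: the names ScanResult and checked_partition are made up, it works on indices into a vector, and it reports an out-of-range scan instead of performing it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Outcome of one partition pass over the whole vector.
struct ScanResult {
    bool out_of_bounds;  // a scan would have left the valid index range
    std::size_t cut;     // partition point when the pass completed normally
};

template <typename T, typename Compare>
ScanResult checked_partition(std::vector<T>& v, const T& pivot, Compare comp) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    std::ptrdiff_t first = 0;
    std::ptrdiff_t last = n;
    while (true) {
        // Slide first forward while elements compare "less than" the pivot.
        while (true) {
            if (first >= n) return {true, 0};  // unguarded code reads past the end
            if (!comp(v[static_cast<std::size_t>(first)], pivot)) break;
            ++first;
        }
        --last;
        // Slide last backward while the pivot compares "less than" elements.
        while (true) {
            if (last < 0) return {true, 0};  // unguarded code reads before the start
            if (!comp(pivot, v[static_cast<std::size_t>(last)])) break;
            --last;
        }
        if (!(first < last)) return {false, static_cast<std::size_t>(first)};
        std::swap(v[static_cast<std::size_t>(first)],
                  v[static_cast<std::size_t>(last)]);
        ++first;
    }
}
```

With a vector of equal values and a comparator that returns true for equal elements (like <= on ints), the forward scan never stops and out_of_bounds comes back true. With a proper < the same input finishes normally.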

Note that it will not always segfault on input with several equal values. The bigger file that worked before actually also has a few of the same empty values. It segfaults only when that value is picked as the pivot and some equal values lie at the boundaries of the container.

In hindsight, the mistake was really due to careless coding practice, and the diagnosis could have been sped up by looking a bit further into the backtrace (#6 is the actual scene). But a few other lessons for me are:

  1. A segfault is sometimes not as straightforward as I thought it was.
  2. It might *happen* inside the standard library, which we tend to ignore.
  3. Be rigorous in the first place.

 

MySQL debug notes


Here are some notes from when I was playing with MySQL to diagnose some of its bugs.

I. Logistics

  • Recent alpha, beta, dev release:
    • Most are not available in the archive site above. Google is your friend.
    • Search using the query “mysql-X.Y.Z.tar.gz” or “mysql-X.Y.Z-alpha.tar.gz” usually has higher hit rate.

II. Get it running

  • Compile and install:
    • $ path_to_install=/path/to/install/mysql/X.Y.Z
    • $ {optional CC=gcc CFLAGS="-O0" CXX=gcc CXXFLAGS="-O0 -felide-constructors -fno-exceptions -fno-rtti"} ./configure --prefix=$path_to_install --with-debug=full --with-extra-charsets=complex
      • need to add --with-plugins=innobase and --with-partition for 5.1+, otherwise install_db will complain
      • need to add --with-pthread --with-named-thread-libs=-lpthread for old versions (e.g. 4.0)
        • add the following at the beginning of sql_class.cc if make still complains:
    • $ make
    • $ make install
  • Copy the default configuration file (e.g. /etc/my.cnf) to a place where you have write access (e.g. /home/ryan/mysql/5.0.89/my.cnf).
  • Modify the following entries (adapt them to your own needs, e.g. the port):
    [client]
    port            = 3309
    socket          = /home/ryan/mysql/5.0.89/mysqld.sock
    [mysqld_safe]
    socket          = /home/ryan/mysql/5.0.89/mysqld.sock
    nice            = 0
    [mysqld]
    user            = ryan
    socket          = /home/ryan/mysql/5.0.89/mysqld.sock
    port            = 3309
    basedir         = /home/ryan/mysql/5.0.89
    datadir         = /home/ryan/mysql/5.0.89/data
  • Initialize the default db with install_db:
    • $ ./bin/mysql_install_db --defaults-file=my.cnf
  • Start mysqld_safe with the cnf:
    • $ ./bin/mysqld_safe --defaults-file=my.cnf --one-thread &
  • Connect using the socket file:
    • $ mysql -S /home/ryan/mysql/5.0.89/mysqld.sock
  • Experiment as you wish:
    • mysql> select version();  // to make sure it's connected to the right server.
  • Stop mysqld_safe (remember!!):
    • $ ./bin/mysqladmin -S mysqld.sock -u root shutdown

III. Debugging

  • Debug with gdb (http://dev.mysql.com/doc/refman/5.0/en/using-gdb-on-mysqld.html):
    1. $ gdb libexec/mysqld `pidof mysqld`
    2. Start the client and execute the query you want to test.
  • Debug with mtr (with this option, there's no need to start the MySQL server using the steps above) (http://forge.mysql.com/wiki/How_to_Run_MySQL_With_a_Debugger):
    1. $ cd /home/ryan/mysql/5.0.89/mysql-test
    2. $ ./mtr (or mysql-test-run) --gdb t/xxx.test
    3. Write the input query as a test case and put it in the t directory. Optionally, you can write the expected result in the r directory.
    4. For an existing test case, you may need to change the delimiter: delimiter |; (note this changes the default delimiter to |; the ending ; has to be the current delimiter for this statement to be valid.)
    5. If it's not possible to spawn the xterm window, replace --gdb with --manual-gdb.
    6. Sometimes the var dir path may be too long; in that case, make a new var dir and run mtr with it specified: --vardir=path