Multiple Access Gateway http processes going into high utilisation after period of time and not dropping back to normal

  • 7016929
  • 21-Oct-2015
  • 21-Oct-2015

Environment

NetIQ Access Manager 4.1.1
NetIQ Access Gateway Service on RHEL 6.6

Situation

Working Access Manager environment recently upgraded from 3.1 to 4.1.1. Everything working fine and no users complaining about any issues, but monitoring the server shows that multiple httpd process (running 10 in total) go into high utilisation eg. all 10 httpd processes start running with amost 0% util. After a while, one of the httpd processes will climb towards 100%, then another and so on. On occasions, we had multiple httpd process running at 300%, another few at 100%, and the remaining ones at between 1-3%. The server-status output shows that the server is not doing much work, with most threads waiting for work. The netstat output shows we have about 300 established connections and we have a lot of memory available. No users seem to be impacted.

To reduce concerns, the AG servers are restarted daily to temporarely drop utilisation back to normal.

Resolution

Moved all character rewriter profiles into word based profiles. Many search and replace strings were in the character based rewriter profile setup with NAM 3.1. With the AG changes in 3.2 onwards, the same changes can be merged into the word based profile with added performance gains.

Cause

One particular Application seemed to be sending back binary data. The character based profile would parse this binary data and go into high utilisation. As the thread never finished, the httpd process would simply stay in that state. The word based rewriter profile can handle this much better and should be used as a best practice.

Additional Information

Forced a coredump on one of the servers (gcore $httpd_pid) for three processes (one at 300% util, and two at 100% util). In all cores, we seem to be doing almost nothing yet the AG code on the stack is always the same ie. we are in the char_rewriter profile for one proxy service. This is probably more than a coincidence and I have asked for more details on profile. The data being processed also looks like it is binary data, which one would not expect with text/html or most content-type returned.

Here's what we see in the core (versus core2/core3 dumps that show very similar info) - can see binary data in response (frame 2 below -> pos)

 
// there are two threads in MAG code (similar to other cores), and both in rewriter code ...
 
Thread 966 (LWP 18602):
#0  0x00007f511c71f57e in doPartial (chRb=0x7f4de8066c48) at ../rewriter.cpp:1723
        bytesLeft = 0
        bytesInString = 7
#1  0x00007f511c729c20 in rewriter_char_filter (f=<optimized out>, bb=0x7f4de8066bc8)
    at ../rewriter_filter.cpp:522
        chRb = 0x7f4de8066c48
        endofBucket = {bucket = 0x7f4e4c10cf28,
          pos = 0x7f4e4c10d164 "ry:\r\n * Date       Author          CR/Bug ID", ' ' <repeats 11 times>, "Change Description\r\n * ----       ", '-' <repeats 12 times>, " ", '-' <repeats 19 times>, " ", '-' <repeats 24 times>, "\r\n * 24-03-2004 Neil Samanta     --", ' ' <repeats 16 times>...}
        pData = 0x7f4de804dbf8 "\264Y\357S\333\070\023\376\316\f\377\203\316\067\020\270\222\330JH(\275$7\201pW\336)\264C\350t\356\375r#l%qql\237$\227\346\375\353\337\325\017+\016v\002\264\306CG\312J\332}v\367\221\274r\373\277\064\233\273;\356ou>\273;\350\067t\236\244K\026\316\346\002\035\370\207\250\355y\035\364\071\016\005\r\320(dQ\030S~\204Fi\032\205>\021a\022s4\246\337h\224\244\v\032\v\324D\023\301\210\240\263eK)ð\036]\264\320(\232%\361\277Y\030\243\233\240\205.\242{\364\027K\276Qt\233<\304|\036\246G\225\262\313\b\354%!\030\354y\236wr\204>OFZ\357(\212\220\002\311\021\243\234\262o"...
        len = 0
        t1 = 1443622209046644
        wscore = 0x7f511a62f2a0
#2  0x00007f511c720a30 in rewrite (f=0x7f4dd401d3e8, bb=0x7f4dd401d530) at ../rewriter.cpp:2186
        wpRb = 0x7f4de8062278
        endofBucket = {bucket = 0x7f4eb643e630,
          pos = 0x7fff0c7788bc <gettimeofday+44> "H\213K\bH\272\317\367S㥛\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a1\300[A\\\311\303H\213\005\216\375\377\377\213@\030A\211\004$H\213\005\200\375\377\377\213@\034A\211D$\004\353ڸ`"}
        pData = 0x7f4de802fdf8 "orkRo\177"
        len = 5
:
:

Thread 546 (LWP 15462):
#0  0x00007f511c71f57e in doPartial (chRb=0x7f4e4406a1f8) at ../rewriter.cpp:1723
        bytesLeft = 0
        bytesInString = 7
#1  0x00007f511c729c20 in rewriter_char_filter (f=<optimized out>, bb=0x7f4e4406a178)
    at ../rewriter_filter.cpp:522
        chRb = 0x7f4e4406a1f8
        endofBucket = {bucket = 0x7f4e4401b1a8,
          pos = 0x7f4e4401b3e4 "ry:\r\n * Date       Author          CR/Bug ID", ' ' <repeats 11 times>, "Change D
escription\r\n * ----       ", '-' <repeats 12 times>, " ", '-' <repeats 19 times>, " ", '-' <repeats 24 times>, "
\r\n * 24-03-2004 Neil Samanta     --", ' ' <repeats 16 times>...}
        pData = 0x7f4e5407b198 "\264YkS\333\070\027\376\316\f\377A\353\035\blIl\347J\372&\331I\b<U+0772>\003\264C\350\356\364\375\262#l%\361\342\330^I)\315\376\372=\272Xv\260\023\240\004\017\035)\222\374\234\347\234\363\350\346\366~\252V\367\367\354_v\371\354\357\241_\320Y\234\254h0\233st\344\035\243\272\343\064З(\340\304GÀ\206AD\330\t\032&I\030x\230\aq\304И|#a\234,H\304Q\025M8Ŝ\314V5\t\346\302\373輆\206\341,\216\376Y\006\021\272\361k\350<\274G\277\321\370\033A\267\361C\304\346ArR\332v\021\202\275\070"
        len = 0
        t1 = 1443658713332470
        wscore = 0x7f511a600680
#2  0x00007f511c720a30 in rewrite (f=0x7f4e0804bf00, bb=0x7f4e140acfb0) at ../rewriter.cpp:2186
        wpRb = 0x7f4e44065828
        endofBucket = {bucket = 0x7f4fbcde2630,
          pos = 0x7fff0c7788bc <gettimeofday+44> "H\213K\bH\272\317\367S㥛\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a1\300[A\\\311\303H\213\005\216\375\377\377\213@\030A\211\004$H\213\005\200\375\377\377\213@\034A\211D$\004\353ڸ`"}
        pData = 0x7f4e08035b18 "ixedW"
        len = 5
In both cases, we can see that there is binary data in the buffer, which could be causing us issues as we are trying to rewrite this.  If we start analysing the first thread above, we can gather the exact URL, etc
(gdb) thread 966
[Switching to thread 966 (LWP 18602)]
#0  0x00007f511c71f57e in doPartial (chRb=0x7f4de8066c48) at ../rewriter.cpp:1723
1723    ../rewriter.cpp: No such file or directory.
(gdb) bt full
#0  0x00007f511c71f57e in doPartial (chRb=0x7f4de8066c48) at ../rewriter.cpp:1723
        bytesLeft = 0
        bytesInString = 7
#1  0x00007f511c729c20 in rewriter_char_filter (f=<optimized out>, bb=0x7f4de8066bc8)
    at ../rewriter_filter.cpp:522
        chRb = 0x7f4de8066c48
        endofBucket = {bucket = 0x7f4e4c10cf28,
          pos = 0x7f4e4c10d164 "ry:\r\n * Date       Author          CR/Bug ID", ' ' <repeats 11 times>, "Change Description\r\n * ----       ", '-' <repeats 12 times>, " ", '-' <repeats 19 times>, " ", '-' <repeats 24 times>, "\r\n * 24-03-2004 Arindam Samanta     --", ' ' <repeats 16 times>...}
        pData = 0x7f4de804dbf8 "\264Y\357S\333\070\023\376\316\f\377\203\316\067\020\270\222\330JH(\275$7\201pW\336)\264C\350t\356\375r#l%qql\237$\227\346\375\353\337\325\017+\016v\002\264\306CG\312J\332}v\367\221\274r\373\277\064\233\273;\356ou>\273;\350\067t\236\244K\026\316\346\002\035\370\207\250\355y\035\364\071\016\005\r\320(dQ\030S~\204Fi\032\205>\021a\022s4\246\337h\224\244\v\032\v\324D\023\301\210\240\263eK)ð\036]\264\320(\232%\361\277Y\030\243\233\240\205.\242{\364\027K\276Qt\233<\304|\036\246G\225\262\313\b\354%!\030\354y\236wr\204>OFZ\357(\212\220\002\311\021\243\234\262o"...
        len = 0
        t1 = 1443622209046644
        wscore = 0x7f511a62f2a0
#2  0x00007f511c720a30 in rewrite (f=0x7f4dd401d3e8, bb=0x7f4dd401d530) at ../rewriter.cpp:2186
        wpRb = 0x7f4de8062278
        endofBucket = {bucket = 0x7f4eb643e630,
          pos = 0x7fff0c7788bc <gettimeofday+44> "H\213K\bH\272\317\367S㥛\304 H\211\310H\301\371?H\367\352H\301\372\aH)\312H\211S\bM\205\344u\a1\300[A\\\311\303H\213\005\216\375\377\377\213@\030A\211\004$H\213\005\200\375\377\377\213@\034A\211D$\004\353ڸ`"}
        pData = 0x7f4de802fdf8 "orkRo\177"
        len = 5
#3  0x00007f511c7298d8 in rewriter_word_filter (f=0x7f4dd401d3e8, bb=0x7f4dd401d530)
    at ../rewriter_filter.cpp:702
        t1 = 1443622209046642
        status = <optimized out>
        t2 = <optimized out>
        wscore = <optimized out>
#4  0x00007f5123b53d7a in inflate_out_filter (f=0x7f4dd401d3c0, bb=0x7f4df0037358) at mod_deflate.c:1169
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) f 1
#1  0x00007f511c729c20 in rewriter_char_filter (f=<optimized out>, bb=0x7f4de8066bc8)
    at ../rewriter_filter.cpp:522
522     ../rewriter_filter.cpp: No such file or directory.
(gdb) p chRb
$1 = (CHrbType *) 0x7f4de8066c48
(gdb) p *chRb
$2 = {r = 0x7f4de8069318, reqContext = 0x7f4de806d510, c = 0x7f51287cf310, d = 0x7f5128972110,
  filter = 0x7f4dd401d448, pPendingList = 0x0, pOutputList = 0x7f4de8066d08,
  BufferEndPos = 0x7f4de804dbf8 "\264Y\357S\333\070\023\376\316\f\377\203\316\067\020\270\222\330JH(\275$7\201pW\336)\264C\350t\356\375r#l%qql\237$\227\346\375\353\337\325\017+\016v\002\264\306CG\312J\332}v\367\221\274r\373\277\064\233\273;\356ou>\273;\350\067t\236\244K\026\316\346\002\035\370\207\250\355y\035\364\071\016\005\r\320(dQ\030S~\204Fi\032\205>\021a\022s4\246\337h\224\244\v\032\v\324D\023\301\210\240\263eK)ð\036]\264\320(\232%\361\277Y\030\243\233\240\205.\242{\364\027K\276Qt\233<\304|\036\246G\225\262\313\b\354%!\030\354y\236wr\204>OFZ\357(\212\220\002\311\021\243\234\262o"..., wpBD = 0x7f5127e161d8, current_bkt_start = 0x7f4de8031158, currentPos = {
    bucket = 0x7f4de8031158,
    pos = 0x7f4de804dbf8 "\264Y\357S\333\070\023\376\316\f\377\203\316\067\020\270\222\330JH(\275$7\201pW\336)\264C\350t\356\375r#l%qql\237$\227\346\375\353\337\325\017+\016v\002\264\306CG\312J\332}v\367\221\274r\373\277\064\233\273;\356ou>\273;\350\067t\236\244K\026\316\346\002\035\370\207\250\355y\035\364\071\016\005\r\320(dQ\030S~\204Fi\032\205>\021a\022s4\246\337h\224\244\v\032\v\324D\023\301\210\240\263eK)ð\036]\264\320(\232%\361\277Y\030\243\233\240\205.\242{\364\027K\276Qt\233<\304|\036\246G\225\262\313\b\354%!\030\354y\236wr\204>OFZ\357(\212\220\002\311\021\243\234\262o"...}, flush_from = {bucket = 0x7f4de8031298, pos = 0x7f5127e17310 "'/mars';"}, matchStart = {
    bucket = 0x0, pos = 0x0}, pendingBucket = 0x7f4de8031298, subpool = 0x7f4de804bbc8, bb = 0x7f4de8066bc8,
  currentSR = 0x7f5127e17280, currentSRPos = 0x7f5127e17311 "/mars';", matchLength = 1, parseFlags = 2,
  state = 2, nextState = 2, rc = 0}
(gdb) p *chRb->r->server
$4 = {process = 0x7f51278ec270, next = 0x7f512894c848,
  defn_name = 0x7f5127b18538 "/etc/opt/novell/apache2/conf/vhosts.d/mars.conf", defn_line_number = 39,
  server_admin = 0x7f5127de87f8 "you@example.com", server_hostname = 0x7f512896cda0 "mars.novell.com", port = 0,
  error_fname = 0x0, error_log = 0x0, loglevel = 6, is_virtnovell = 1, module_config = 0x7f5128829b98,
  lookup_defaults = 0x7f51287cfc90, addrs = 0x7f512896c4f8, timeout = 120000000, keep_alive_timeout = 300000000,
  keep_alive_max = 100, keep_alive = 1, path = 0x0, pathlen = 0, names = 0x7f5128831f28,
  wild_names = 0x7f5128831f68, limit_req_line = 8190, limit_req_fieldsize = 8190, limit_req_fields = 100,
  server_scheme = 0x0}
(gdb) p chRb->r->hostname
$6 = 0x7f4de806b078 "mars.novell.com"
(gdb) p chRb->r->uri
$9 = 0x7f4de806a990 "/jsp/reports/rptFixedWorkRosterDetail.jsp"
(gdb) p chRb->r->method
$15 = 0x7f4de806a928 "POST"

Dump2 was rewriter code in /jsp/eportal/mainEmplCalendar.jsp (GET request).