F5 Unexpected behaviors.
Working on support, there are many times end customer report us that something unexpected has happened. Regarding F5 ltm world the most common issues regarding unexpected behavior are two, unexpected reboots and unexpected failover, let´s go to see how to diagnose.
Regarding Unexpected reboots, what we need to check mandatory are both, host and SCCP/AOM logs, meanwhile for HA unexpected failover we just need to check host logs.
Here you are some of the most important LTM’s logs:
/var/log/ltm – Messages from daemons (Including TMM) and iRules.
/var/log/tmm – Messages from TMM process only.
/var/log/kern.log – Linux kernel messages (ie. HDD errors).
/var/log/secure – Login attempts (View of dictionary attacks).
/var/log/httpd/* - Connections to the GUI.
/var/log/tomcat(4)/* - Java errors.
/var/log/messages on SCCP/AOM subsystems (SCCP/Hardware Watchdog, Interface errors).
/var/core/ - The directory where core files are being dumped. On a healthy system it should be empty.
So once here, we need to search the root cause, and the best way to begin troubleshoot is establishing the sequence of events.
SEARCH FOR BOOT MARKER IN LOGS (various logs): Check in various files – [/var/log/ltm, /var/log/secure, /var/log/tmm]. The message is added AFTER the device comes back up Jul 14 08:28:49 boot_marker: ---===[ HD1.1 - BIG-IP 10.2.4 Build 591.0 ]===--- NOTE: On VIPRION, you will see it only on the blade that actually rebooted.
SEARCH FOR FAILOVER STATE TRANSITIONS (Active, Standby) in /var/log/ltm Search “Active” and “Standby” words in /var/log/ltm Nov 26 21:11:36 local/tus1-lpi-dns-1 notice sod[3183]: 010c0019:5: Active Nov 13 21:54:49 local/lb02b-01-snc4 notice sod[4823]: 010c0018:5: Standby
SEARCH FOR SIG*** in /var/log/tmm TMM process is the main process responsible for processing the client’s traffic. If TMM dies or gets killed it causes a Failover (if the units are in a HA pair). There are several types of TMM crashes, here you are the most important: SIGSEGV (Mostly bugs): Core file necessary to troubleshoot. SIGSEGV with EIP Feb 2 16:45:07 local/f5a notice ** SIGSEGV ** Feb 2 16:45:07 local/f5a notice fault addr: 0x512fe2 Feb 2 16:45:07 local/f5a notice fault code: 0x1 Feb 2 16:45:07 local/f5a notice fault time: Wed Feb 02 16:45:07 CET 2011 Feb 2 16:45:07 local/f5a notice version: default TMM Version 10.1.0.3341.0 Feb 2 16:45:07 local/f5a notice ticks since last clock update: 0 Feb 2 16:45:07 local/f5a notice ticks since start of poll: 0 Feb 2 16:45:07 local/f5a notice EAX=0 EBX=0x1dff844 E/CX=0x1610a08 EDX=0 Feb 2 16:45:07 local/f5a notice ESI=0 EDI=0x1dff950 EBP=0x1dff838, ESP=0x1dff820 Feb 2 16:45:07 local/f5a notice EIP=0x512fe2 SIGFPE – (Float. Point. Exception) Sometimes SIGSEGV-like root cause. May log Panic or Assertion string – Important for troubleshooting. SIGFPE with Panic string Jun 17 10:50:12 local/bigip01 notice panic: Invalidated saved PE: 0xa006b7d Jun 17 10:50:12 local/bigip01 notice cmp state: 0x0Jun 17 10:50:12 local/bigip01 notice ** SIGFPE ** Jun 17 10:50:12 local/bigip01 notice fault addr: 0x512fe2 Jun 17 10:50:12 local/bigip01 notice fault code: 0x1 Jun 17 10:50:12 local/bigip01 notice fault time: Thu Jun 17 10:50:12 BST 2010 Jun 17 10:50:12 local/bigip01 notice version: default TMM Version 10.1.0.3341.0 Jun 17 10:50:12 local/bigip01 notice ticks since last clock update: 0 Jun 17 10:50:12 local/bigip01 notice ticks since start of poll: 0 Jun 17 10:50:12 local/bigip01 notice EAX=0 EBX=0x1dfe5c4 ECX=0x1610a08 EDX=0 Jun 17 10:50:12 local/bigip01 notice ESI=0xa1a1ddc EDI=0x9f8eaa4 EBP=0x1dfe5b8, SIGFPE with Assertion string Dec 1 10:18:55 f5auto notice panic: ../net/packet.c:983: Assertion "valid xfrag pointer" failed. Dec 1 10:18:55 f5auto notice cmp state: 0x0 Dec 1 10:18:55 f5auto notice ** SIGFPE ** Dec 1 10:18:55 f5auto notice fault addr: 0x57c59f Dec 1 10:18:55 f5auto notice fault code: 0x1 Dec 1 10:18:55 f5auto notice fault time: Sun Dec 01 10:18:55 IST 2013 Dec 1 10:18:55 f5auto notice version: default x86_64 padc TMM Version 11.2.1.797.21 Dec 1 10:18:55 f5auto notice ticks since last clock update: 1 Dec 1 10:18:55 f5auto notice ticks since start of poll: 0 Dec 1 10:18:55 f5auto notice RAX=0 RBX=0x1479740 RCX=0 RDX=0 Dec 1 10:18:55 f5auto notice RSI=0x24cf7d0 RDI=0 RBP=0x46, RSP=0x5572b1bfe410 Dec 1 10:18:55 f5auto notice R8=0x5572b1bfe330 R9=0 R10=0, R11=0x3246 Dec 1 10:18:55 f5auto notice R12=0x2e135159 R13=0x5572b7ef1a40 R14=0, R15=0x5572b1bfe560 SIGABRT – TMM killed by sod proc. or a user (sig 15) Rarely useful core. In most cases it is a result and not a cause of a problem). SIGABRT since v11.3 may also contain Panic/Assertion string. But possibly in a different log file - (multithreaded TMM). SIGABRT Panic/Assertion in var/log/tmm.1_transformed Mar 26 21:33:49 balhbedstc01 notice 4: lib/c/xbuf.c:522: xfrag_prealloc_get: Assertion `valid xf' failed. Stacktrace in var/log/tmm2.1_transformed Mar 26 21:33:49 balhbedstc01 notice ** SIGABRT ** Mar 26 21:33:49 balhbedstc01 notice sender pid: 30324 (tmm.0 -T 4 --tmid 0 --npus 4 --platform C106 -m -s 6428 ) Mar 26 21:33:49 balhbedstc01 notice fault code: 0xfffffffa Mar 26 21:33:49 balhbedstc01 notice version: default x86_64 padc TMM Version 11.3.0.2806.0 Mar 26 21:33:49 balhbedstc01 notice TMM clock is 0 seconds from system time Always that TMM Crashed, it logs a log called “stack trace” (what you are seeing in the examples above) and there is important to send to F5 support team with a qkview and core files in order to let them diagnose what had happen.
View the graphs.
Another points to take in mind are graphs. They let us know not only values, but also the trends. Maybe, most of times, graphs do not give us relevant information, but we must check it anyway cause maybe, they can point us to somethings that is wrong (CPU / Memory / SSL connections / Throughput – Spikes) as a key point in order to begin troubleshoot, or maybe parsing our diagnostic with the graphs we can be sure of it or not…