Platform Automated Monitoring

Table 1. Feature History

Feature Name

Release Information

Feature Description

Support For Platform Automated Monitoring

Cisco IOS XE Dublin 17.12.1y

With this release, cBR-8 supports Platform Automated Monitoring (PAM), a system monitoring tool integrated with the Cisco IOS XE software image that monitors the following issues:

  • Process crashes

  • Standby SUP boot failures

PAM is an IOSd process that runs on the Supervisor Card (SUP) and periodically checks the system for crashes. When an RP, FP, or CC crashinfo or core file is detected, a syslog message is displayed on the active SUP’s IOSd console.

The benefit of PAM is that you can use a script (for example, EEM) to monitor PAM and, when a crash event is detected, automatically submit a TAC case and share the core or crashinfo file with TAC.

PAM Process

PAM runs as an IOSd process on the SUP. Use the show process | in PAM command to check whether the PAM process is running:

router#show process | in PAM
 314 Mwe 633F12E936BA	142300	563398	252 15808/24000 0 CBR PAM Process

The preceding sample output shows the PAM process running on a cBR-8.

A hidden *.pam file is created in the /harddisk/core/ path. This empty file records the last monitored timestamp of the PAM process. PAM processes only the core files and crashinfo files whose timestamps are newer than the *.pam file timestamp.

Use the following command to view the *.pam file:

router#dir harddisk:core
Directory of harddisk:/core/

4751365  -rw-                1  Feb 20 2024 13:56:27 +08:00  .pam 
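The timestamp comparison described above can be illustrated with a short Python sketch. This is not PAM's implementation, which this document does not describe. The function name files_newer_than_marker and the use of file modification time as the "last monitored" timestamp are assumptions made for illustration only.

```python
import os
from pathlib import Path


def files_newer_than_marker(directory, marker_name=".pam"):
    """Return the names of files modified after the marker file.

    Illustrative assumption: the marker file's modification time stands in
    for the last monitored timestamp.
    """
    directory = Path(directory)
    marker_mtime = (directory / marker_name).stat().st_mtime
    newer = []
    for entry in directory.iterdir():
        # Skip the marker itself and anything that is not a regular file.
        if entry.name == marker_name or not entry.is_file():
            continue
        if entry.stat().st_mtime > marker_mtime:
            newer.append(entry.name)
    return sorted(newer)
```

In this sketch, a core file written before the marker was last touched is ignored, while one written afterwards is returned for processing.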

The PAM process handles two timers:

  • 5-Minute Periodic Timer: PAM starts a 5-minute timer to check for new crashinfo or core files on both the active and standby SUP. The following messages can be displayed on the SUP’s IOSd console:

    • Initial Message

      %PAM-4-TEMP_CORE: PAM detects a new core file %s start to dump at %-27s. 
      Need to wait for several minutes to get the full core file.
    • Message for a core file that is dumped successfully:

      %PAM-3-CRASH: PAM detects crash <crashinfo or corefile path>
    • Message for a core file that is not dumped completely:

      %PAM-3-CORE_UNCOMPLETE: PAM detects core file <uncomplete core file path> doesn't generate successfully.
  • Here is a sample output that is displayed on the console:

    router#dir harddisk:core
    Directory of harddisk:/core/
    
    2981892  -rw-         11010048  Feb 20 2024 23:23:22 +08:00  router_SIP_1_vidman_7014_1704986383.core.gz.TEMP_IN_PROGRESS
    3080199  -rw-                1  Feb 20 2024 23:19:51 +08:00  .pam
    2981891  -rw-          7884800  Feb 20 2024 23:19:46 +08:00  router_SIP_1_vidman_7014_1704986383.core
    2981889  -rw-                0  Feb 20 2024 23:19:46 +08:00  router_SIP_1_vidman%cc_1_0%0.TEMP_IN_PROGRESS
    
    
    Feb 20 23:19:51.179 CST: %PAM-4-TEMP_CORE: PAM detects a new core file harddisk:core/router_SIP_1_vidman_7014_1704986383.core 
    start to dump at Feb 20 2024 23:19:47 +08:00. Need to wait for several minutes to get the full core file.
    
    Feb 20 23:29:51.397 CST: %PAM-3-CRASH: PAM detects crash for process vidman on fru CC slot 1, 
    path:  harddisk:core/router_SIP_1_vidman_7014_1704986383.core.gz
    
    router#
  • 30-Minute One-Time Timer: This timer starts when the standby SUP initializes with a bootup image. If the boot fails and the timer expires, the following error message about the standby SUP bootup failure is displayed:

    %PAM-3-FAILURE: StandbySUP stucks at booting state for 30 minutes.
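A monitoring script that consumes these syslog messages could extract the relevant fields with regular expressions. The following Python sketch is built only from the sample messages shown in this section. The function name parse_pam_message, the event labels, and the assumption that a wrapped console message can be rejoined by collapsing whitespace are all hypothetical and not part of any Cisco tool.

```python
import re

# Patterns derived from the sample messages in this section. The exact
# layout of real messages (for example, line wrapping) is an assumption.
CRASH_RE = re.compile(
    r"%PAM-3-CRASH: PAM detects crash for process (?P<process>\S+) "
    r"on fru (?P<fru>\S+) slot (?P<slot>\d+),\s*path:\s*(?P<path>\S+)"
)
TEMP_CORE_RE = re.compile(
    r"%PAM-4-TEMP_CORE: PAM detects a new core file (?P<path>\S+) start to dump"
)
INCOMPLETE_RE = re.compile(
    r"%PAM-3-CORE_UNCOMPLETE: PAM detects core file (?P<path>\S+)"
)
FAILURE_RE = re.compile(r"%PAM-3-FAILURE:")


def parse_pam_message(text):
    """Return a dict describing a PAM syslog message, or None if not PAM."""
    # Collapse newlines and repeated spaces so wrapped messages match.
    text = " ".join(text.split())
    match = CRASH_RE.search(text)
    if match:
        return {"event": "crash", **match.groupdict()}
    match = TEMP_CORE_RE.search(text)
    if match:
        return {"event": "core_in_progress", **match.groupdict()}
    match = INCOMPLETE_RE.search(text)
    if match:
        return {"event": "core_incomplete", **match.groupdict()}
    if FAILURE_RE.search(text):
        return {"event": "standby_boot_failure"}
    return None
```

A script could act only on the "crash" event, because the TEMP_CORE message indicates that the core file is still being written.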

Location and Format of the Crashinfo Or Corefile

The following tables show the location and format of the crashinfo and core files, with examples:

Table 2. Crashinfo

Type

Location and Format With Example

cdman crashinfo

harddisk:<hostname>_SIP_<slot>_cdman_crashinfo_xxx.log

Example:

harddisk:L08_SIP_6_cdman_crashinfo_7437_09152023155523.log

iosd-clc crashinfo

harddisk:Slot-<slot>-0_crashinfo_SIP_<slot>_xxx.log

Example:

harddisk:Slot-0-0_crashinfo_SIP_00_00_20230905-155430-CST

sup-iosd crashinfo

bootflash:<hostname>_crashinfo_RP_<slot>_xxx

Example:

bootflash:L08_crashinfo_RP_01_00_20221201-191258-EDT

Sample Console Message:

%PAM-3-CRASH: PAM detects crash for process linux_iosd-image on fru RP slot 0, 
path: bootflash:L08_crashinfo_RP_00_00_20240129-153213-CST

Table 3. Corefile

Type

Location and Format With Example

Linecard process core

harddisk:core/<hostname>_SIP_<slot>_<process_name>_<pid>_xxx.core.gz

Examples:

harddisk:core/L08_SIP_1_ubrclc-k9lc-ms_8030_1698622516.core.gz
harddisk:core/RPCC01_SIP_2_CDM_PKTIO_7481_1700073874.core.gz

RP process core

harddisk:core/<hostname>_<process_name>_<pid>_xxx.core.gz

Examples:

harddisk:core/L08_dbm_19549_20231122-100257-CST.core.gz
harddisk:core/L08_fman_fp_image_28299_20231019-200127-MST.core.gz
harddisk:core/L08_cpp_cp_svr_27953_20231019-200045-MST.core.gz

Sample Console Messages:

Jan 11 23:19:51.179 CST: %PAM-4-TEMP_CORE: PAM detects a new core file 
harddisk:core/L08_SIP_1_vidman_7014_1704986383.core start to dump at Jan 11 2024 23:19:47 +08:00. 
Need to wait for several minutes to get the full core file. 
Jan 11 23:29:51.397 CST: %PAM-3-CRASH: PAM detects crash for process vidman on fru CC slot 1, 
path: harddisk:core/L08_SIP_1_vidman_7014_1704986383.core.gz

Incomplete core file

harddisk:core/<hostname>_<process_name>_<pid>_xxx.core

Examples:

harddisk:core/L08_SIP_1_cdman_7797_1701140043.core
harddisk:core/L08_SIP_1_cdman%cc_1_0%0.TEMP_IN_PROGRESS

Sample Console Messages:

Jan 12 20:44:14.315 CST: %PAM-3-CORE_UNCOMPLETE: PAM detects core file 
harddisk:core/L08_SIP_1_CDM_RT_7802_1705062593.core doesn't generate successfully.

Table 4. Kernel Core

Type

Location and Format With Example

Kernel Core

harddisk:core/kernel.CC_CYLONS_<slot>_<timestamp>.core.flat.gz

harddisk:core/kernel.RP_CBR_<slot>_<timestamp>.core.flat.gz

Examples:

harddisk:core/kernel.CC_CYLONS_6_20231026003938.txt
harddisk:core/kernel.CC_CYLONS_6_20231026003938.core.flat.gz
harddisk:core/kernel.RP_CBR_1_20230126014834.core.flat.gz
harddisk:core/kernel.RP_CBR_1_20230126014834.txt

Sample Console Messages:

Jan 11 23:44:52.451 CST: %PAM-4-TEMP_CORE: PAM detects a new core file 
harddisk:core/kernel.CC_CYLONS_1.core.TEMP_IN_PROGRESS start to dump at Jan 11 2024 23:44:49 +08:00. 
Need to wait for several minutes to get the full core file.

Jan 11 23:49:52.684 CST: %PAM-3-CRASH: PAM detects crash for process kernel on fru CC slot 1, 
path: harddisk:core/kernel.CC_CYLONS_1_20240111154328.core.flat.gz

Table 5. StandbySUP Crash or Core

StandbySUP Crash or Core

Location and Format With Example

Kernel Core

stby-harddisk:core/<hostname>_<process_name>_<pid>_xxx.core.gz

Examples:

stby-harddisk:core/L08_dbm_19549_20231122-100257-CST.core.gz

Sample Console Messages:


Jan 12 14:44:30.949 CST: %PAM-3-CRASH: PAM detects crash for process cpp on fru RP slot 1,
path: stby-harddisk:core/L08_cpp_cp_svr_21886_20240112-142857-CST.core.gz

Note

 

If the active SUP crashes after a SUPSO event, the new active SUP can also detect and report the old active SUP’s crashinfo or core file.
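The naming formats in the preceding tables can be matched mechanically, for example when a script needs to decide which kind of file a reported path refers to. The following Python sketch is based only on the formats and examples listed in the tables. The function name classify_path, the type labels, and the pattern details (such as treating xxx as any text) are assumptions for illustration. Incomplete core files and TEMP_IN_PROGRESS files are deliberately left unmatched.

```python
import re

# Ordered list: more specific formats are checked before general ones.
PATTERNS = [
    ("kernel core", re.compile(
        r"^harddisk:core/kernel\.(?:CC_CYLONS|RP_CBR)_(?P<slot>\d+)_\d+\.core\.flat\.gz$")),
    ("linecard process core", re.compile(
        r"^harddisk:core/.+?_SIP_(?P<slot>\d+)_.+_\d+_[^_]+\.core\.gz$")),
    ("RP process core", re.compile(
        r"^harddisk:core/.+_\d+_[^_]+\.core\.gz$")),
    ("standby SUP core", re.compile(
        r"^stby-harddisk:core/.+_\d+_[^_]+\.core\.gz$")),
    ("cdman crashinfo", re.compile(
        r"^harddisk:.+?_SIP_(?P<slot>\d+)_cdman_crashinfo_.+\.log$")),
    ("iosd-clc crashinfo", re.compile(
        r"^harddisk:Slot-(?P<slot>\d+)-0_crashinfo_SIP_.+$")),
    ("sup-iosd crashinfo", re.compile(
        r"^bootflash:.+?_crashinfo_RP_(?P<slot>\d+)_.+$")),
]


def classify_path(path):
    """Return (file type, slot) for a path; slot is None when not encoded."""
    for kind, pattern in PATTERNS:
        match = pattern.match(path)
        if match:
            slot = match.groupdict().get("slot")
            return kind, (int(slot) if slot is not None else None)
    return None, None
```

Because process names can contain underscores (for example, fman_fp_image), the sketch does not try to separate the hostname from the process name in RP process core files.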

Limitations of PAM

  • If you configure the exception crashinfo file command, then this feature does not work.

    The exception crashinfo file command lets you define a custom prefix for the crashinfo file. PAM cannot detect a crashinfo file with a custom prefix because it cannot determine which process, FRU, or slot crashed.

  • If the standby SUP cannot boot up, PAM does not cover the following cases:

    • The standby SUP is removed intentionally.

    • The standby SUP is inserted and is in ROMMON state without a bootup image. This can occur when the config-register is set to 0x0.

    • The standby SUP is inserted but stops responding and does not have a bootup image. This can occur due to a hardware issue.

Releases before Cisco IOS XE Dublin 17.12.1y do not provide a unified syslog that covers crashes of all modules and processes. In those releases, you must manually filter several syslogs to obtain the relevant log information and manually submit the log files to TAC.