Process Documentation: Troubleshooting and Restarting Stuck Job Machine Processes
1. Purpose
This document provides a step-by-step guide for Operations (Ops) personnel to diagnose, terminate, and restart job machine processes that are identified as "stuck" or non-responsive. The goal is to quickly restore processing flow and prevent data backlog.
2. Scope
This procedure applies to specific application processes running on "job machines" (servers dedicated to processing tasks), particularly those identified by the Zabbix monitoring system.
3. Prerequisites
Before starting this procedure, ensure you have:
- Access: SSH access to the relevant server with `sudo` privileges.
- Access: Database access.
4. Shared Responsibilities
- Ops (Primary): Initial detection, triage, process identification, execution of kill/restart commands, internal communication.
- Dev (Secondary/Escalation): Deeper diagnosis of application-level issues, code-related problems, assistance with specific application restart procedures if complex.
- Infra (Secondary/Escalation): Diagnosis of underlying infrastructure issues (OS, network, disk, VM), assistance with server-level restarts if required.
- BP (Information Only): To be informed of critical incidents and potential service impact.
5. Process Steps
Step 1: Verify the "Stuck" Condition
Trigger: You have received the Zabbix alert `Check Jobmachin PID file age {HOST.NAME} Check /apps/log/midgard/jobmachine.sh-message.log if jobmachine is running` (or an alert from another monitoring system) indicating that the JobMachine may be stuck.
- Check Jobmachine Logs:
  - SSH to the server indicated in the alert.
  - Examine the current day's log file in the log directory (e.g., `/apps/log/midgard/jobmachine.sh-message.log.[yyyy-mm-dd-hh]`).
  - If log entries are being written, inform Ops that the standby check alerted falsely.
- Check the database table `qcjob` for jobs with status "IPR".
- If no log entries are being written and no job has status "IPR", the jobmachine is stuck.
- Identify the Corresponding Process:
  - Determine the process IDs (PIDs) that need to be killed. On the server, run:
  - Command Example:

    ```bash
    ps fx
    ```
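Instead of scanning the full `ps fx` output by eye, the relevant PIDs can be listed with `pgrep`. This is a minimal sketch, assuming the jobmachine processes are identifiable by the pattern `startjobmachine` (taken from the PID-file name later in this runbook); verify the pattern against `ps fx` before relying on it:

```shell
#!/bin/sh
# jobmachine_pids PATTERN: print PIDs whose full command line matches PATTERN.
# The example pattern "startjobmachine" is an assumption taken from the
# PID-file name in this runbook; confirm it against `ps fx` output first.
jobmachine_pids() {
  pgrep -f "$1" || true   # an empty result just means nothing matched
}

jobmachine_pids 'startjobmachine'
```

An empty result here, combined with a stale PID file, is consistent with the stuck condition described above.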
Step 2: Terminate Identified Processes
WARNING: Terminating processes can cause data loss or service disruption if not handled carefully. Always confirm the correct PID before proceeding. If unsure, escalate to senior Ops or Dev.
- Attempt Graceful Termination (Preferred):
  - Send a `SIGTERM` signal, which allows the process to shut down gracefully.
  - Command Example (list all PIDs related to the jobmachine process, separated by spaces):

    ```bash
    sudo kill <PID>
    ```

  - If a process is still running after a short wait, escalate to `SIGKILL`:

    ```bash
    sudo kill -9 <PID>
    ```

  - Verify Termination: Immediately run `ps fx` again and confirm the PIDs are gone.
- Repeat if Multiple Instances: If multiple instances of the same stuck process were identified, repeat step 2.1 for each identified PID.
- Once the PIDs have been killed, remove the process PID file:
  - Command Example:

    ```bash
    rm /home/qcuser/apps/qc/midgard/scripts/sh/startjobmachine.sh.pid
    ```
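The terminate-then-verify sequence above can be sketched as a small helper that tries `SIGTERM` first and escalates to `SIGKILL` only if the process survives. The ten-second grace period is an assumption, not a documented requirement; tune it to your environment:

```shell
#!/bin/sh
# terminate_pid PID: send SIGTERM, wait up to ~10s for the process to exit,
# then fall back to SIGKILL. The 10 x 1s grace period is an assumption.
terminate_pid() {
  pid=$1
  kill "$pid" 2>/dev/null || return 0        # already gone
  for _ in 1 2 3 4 5 6 7 8 9 10; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
  done
  kill -9 "$pid" 2>/dev/null || true         # still running: force it
}
```

On a job machine this would be run with `sudo` once per PID identified in Step 1, followed by removal of the PID file as described above.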
Step 3: Verify Restoration of Service
- Check Jobmachine Logs (Again):
  - Navigate to the log directory (e.g., `/apps/log/midgard/`) and confirm that new log entries are being written.
- Confirm Jobmachine Resolution:
  - Wait for the next Jobmachine run (typically every few minutes).
  - You should receive a "Resolved" notification for the previously stuck process. If you do not, revisit Step 1 and re-evaluate.
- Check Application Logs:
  - Review the application's own logs for any errors or warnings during startup and during message processing.
  - Command Example:

    ```bash
    tail -f /apps/log/midgard/jobmachine.sh-message.log.[yyyy-mm-dd-hh]
    ```

  - Check the database table `qcjob` for jobs with status "IPR". If no "IPR" jobs exist, verify that there are also no entries with status "NEW" and an outdated start time.
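The "are logs being written" check from Steps 1 and 3 can be approximated by testing the log files' modification time. A minimal sketch; the 10-minute freshness threshold is an assumption to tune per environment:

```shell
#!/bin/sh
# log_is_fresh DIR: succeed if any jobmachine log file in DIR was modified
# within the last 10 minutes (the threshold is an assumption; adjust it).
log_is_fresh() {
  [ -n "$(find "$1" -name 'jobmachine.sh-message.log.*' -mmin -10 2>/dev/null)" ]
}

# Usage on a job machine (log path from this runbook):
if log_is_fresh /apps/log/midgard; then
  echo "jobmachine logs are being written"
else
  echo "no recent log activity; jobmachine may be stuck"
fi
```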
Step 4: Document and Review
- Update Incident Ticket:
- Record all actions taken: timestamps, commands executed, PIDs killed, and observations.
- State the final resolution and service restoration time.
- Internal Communication:
- Update relevant internal channels (e.g., incident chat) with the resolution status.
- Consider Post-Mortem / RCA:
- If the process was stuck due to an unknown reason, or if this is a recurring issue, propose or initiate a Root Cause Analysis (RCA) meeting with relevant teams (Ops, Dev, Infra) to identify and address the underlying cause.