Process Documentation: Troubleshooting and Restarting Stuck Job Machine Processes
1. Purpose
This document provides a step-by-step guide for Operations (Ops) personnel to diagnose, terminate, and restart job machine processes that are identified as "stuck" or non-responsive. The goal is to quickly restore processing flow and prevent data backlog.
2. Scope
This procedure applies to specific application processes running on "job machines" (servers dedicated to processing tasks), particularly those identified by the Zabbix monitoring system.
3. Prerequisites
Before starting this procedure, ensure you have:
- Access: SSH access to the relevant server with `sudo` privileges.
- Access: Database access.
4. Shared Responsibilities
- Ops (Primary): Initial detection, triage, process identification, execution of kill/restart commands, internal communication.
- Dev (Secondary/Escalation): Deeper diagnosis of application-level issues, code-related problems, assistance with specific application restart procedures if complex.
- Infra (Secondary/Escalation): Diagnosis of underlying infrastructure issues (OS, network, disk, VM), assistance with server-level restarts if required.
- BP (Information Only): To be informed of critical incidents and potential service impact.
5. Process Steps
Step 1: Verify the "Stuck" Condition
Trigger: You have received the Zabbix alert `Check Jobmachin PID file age {HOST.NAME} Check /apps/log/midgard/jobmachine.sh-message.log if jobmachine is running` (or an alert from another monitoring system) indicating that the JobMachine may be stuck.
- Check Jobmachine Logs:
  - SSH to the server indicated in the alert.
  - Examine the current day's log file in the log directory (e.g., `/apps/log/midgard/jobmachine.sh-message.log.[yyyy-mm-dd-hh]`).
  - If log entries are being written, inform Ops that the standby check alerted falsely.
- Check the database table `qcjob` for jobs with status "IPR".
- If no log entries are being written and no job has status "IPR", the jobmachine is stuck.
- Identify the Corresponding Process:
  - Determine the process IDs (PIDs) that need to be killed. On the server, run:
  - Command Example:

    ```bash
    ps fx
    ```
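Instead of scanning the full `ps fx` output by eye, the relevant PIDs can be listed with `pgrep`. This is a minimal sketch, assuming the jobmachine processes are identifiable by the pattern `startjobmachine` (taken from the PID-file name later in this runbook); verify the pattern against `ps fx` before relying on it:

```shell
#!/bin/sh
# jobmachine_pids PATTERN: print PIDs whose full command line matches PATTERN.
# The example pattern "startjobmachine" is an assumption taken from the
# PID-file name in this runbook; confirm it against `ps fx` output first.
jobmachine_pids() {
  pgrep -f "$1" || true   # an empty result just means nothing matched
}

jobmachine_pids 'startjobmachine'
```

An empty result here, combined with a stale PID file, is consistent with the stuck condition described above.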
Step 2: Terminate Identified Processes
WARNING: Terminating processes can cause data loss or service disruption if not handled carefully. Always confirm the correct PID before proceeding. If unsure, escalate to senior Ops or Dev.
- Attempt Graceful Termination (Preferred):
  - Send a `SIGTERM` signal, which allows the process to shut down gracefully.
  - Command Example (list all PIDs related to the jobmachine process, separated by spaces):

    ```bash
    sudo kill <PID>
    ```

  - If a process is still running after a short wait, escalate to `SIGKILL`:

    ```bash
    sudo kill -9 <PID>
    ```

  - Verify Termination: Immediately run `ps fx` again and confirm the PIDs are gone.
- Repeat if Multiple Instances: If multiple instances of the same stuck process were identified, repeat step 2.1 for each identified PID.
- Once the PIDs have been killed, remove the process PID file:
  - Command Example:

    ```bash
    rm /home/qcuser/apps/qc/midgard/scripts/sh/startjobmachine.sh.pid
    ```
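The terminate-then-verify sequence above can be sketched as a small helper that tries `SIGTERM` first and escalates to `SIGKILL` only if the process survives. The ten-second grace period is an assumption, not a documented requirement; tune it to your environment:

```shell
#!/bin/sh
# terminate_pid PID: send SIGTERM, wait up to ~10s for the process to exit,
# then fall back to SIGKILL. The 10 x 1s grace period is an assumption.
terminate_pid() {
  pid=$1
  kill "$pid" 2>/dev/null || return 0        # already gone
  for _ in 1 2 3 4 5 6 7 8 9 10; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
  done
  kill -9 "$pid" 2>/dev/null || true         # still running: force it
}
```

On a job machine this would be run with `sudo` once per PID identified in Step 1, followed by removal of the PID file as described above.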
Step 3: Verify Restoration of Service
- Check Jobmachine Logs (Again):
  - Navigate to the log directory (e.g., `/apps/log/midgard/`) and confirm that new log entries are being written.
- Confirm Jobmachine Resolution:
  - Wait for the next Jobmachine run (typically every few minutes).
  - You should receive a "Resolved" notification for the previously stuck process. If you do not, revisit Step 1 and re-evaluate.
- Check Application Logs:
  - Review the application's own logs for any errors or warnings during startup and during message processing.
  - Command Example:

    ```bash
    tail -f /apps/log/midgard/jobmachine.sh-message.log.[yyyy-mm-dd-hh]
    ```

  - Check the database table `qcjob` for jobs with status "IPR". If no "IPR" jobs exist, verify that there are also no entries with status "NEW" and an outdated start time.
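The "are logs being written" check from Steps 1 and 3 can be approximated by testing the log files' modification time. A minimal sketch; the 10-minute freshness threshold is an assumption to tune per environment:

```shell
#!/bin/sh
# log_is_fresh DIR: succeed if any jobmachine log file in DIR was modified
# within the last 10 minutes (the threshold is an assumption; adjust it).
log_is_fresh() {
  [ -n "$(find "$1" -name 'jobmachine.sh-message.log.*' -mmin -10 2>/dev/null)" ]
}

# Usage on a job machine (log path from this runbook):
if log_is_fresh /apps/log/midgard; then
  echo "jobmachine logs are being written"
else
  echo "no recent log activity; jobmachine may be stuck"
fi
```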
Step 4: Document and Review
- Update Incident Ticket:
- Record all actions taken: timestamps, commands executed, PIDs killed, and observations.
- State the final resolution and service restoration time.
- Internal Communication:
- Update relevant internal channels (e.g., incident chat) with the resolution status.
- Consider Post-Mortem / RCA:
- If the process was stuck due to an unknown reason, or if this is a recurring issue, propose or initiate a Root Cause Analysis (RCA) meeting with relevant teams (Ops, Dev, Infra) to identify and address the underlying cause.