Functional Documentation: ActiveMQ (AMQ) Stuck Queue Monitoring and Response System
1. Executive Summary
This document outlines the purpose, functionality, and operational protocols for the automated system that monitors our ActiveMQ (AMQ) message queues. Its primary goal is to prevent and mitigate business process interruptions by detecting and alerting when message queues become "stuck" (i.e., stop processing messages despite a backlog). This system ensures continuous data flow, reduces manual oversight, and provides a clear incident response pathway.
2. System Capabilities & Value Proposition
This system delivers the following core capabilities to ensure business continuity:
- Automated Anomaly Detection: Continuously monitors AMQ message processing activity, identifying when queues cease to dequeue messages for a defined period.
- Tiered Alerting: Supports different sensitivity levels (thresholds) for various queues, allowing critical or low-traffic queues to have tailored alert timings.
- Problem Notification: Delivers immediate email alerts to relevant operational teams when a queue is detected as stuck, enabling rapid response.
- Resolution Confirmation: Notifies teams once a previously stuck queue resumes processing, confirming restoration of service and aiding in incident closure.
- Reduced MTTR (Mean Time To Recovery): Early warnings shorten the time needed to detect, and therefore to resolve, message-processing issues.
- Minimized Data Backlogs: Limits the accumulation of unprocessed messages that could lead to system performance degradation or data consistency issues.
3. Operational Workflow
This section describes the lifecycle of an AMQ stuck queue incident as managed by this system.
A. Detection Phase
- Trigger: The system runs periodically (e.g., every 5 minutes, configured by Ops).
- Mechanism: It analyzes recent AMQ application logs, specifically looking for messages enqueued and dequeued over time.
- Stuck Condition: A queue is flagged as "stuck" if:
- It has `pendingMessages > 0` (there's work to do).
- Its `messagesDequeued` count has not increased for a duration exceeding its configured threshold.
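The stuck condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the monitoring script itself: the `QueueSample` record and the `is_stuck` function are hypothetical names, and the sketch assumes the log analysis has already produced timestamped readings of the pending and dequeued counters for a single queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class QueueSample:
    """One reading for a single queue (hypothetical shape, parsed from the AMQ logs)."""
    taken_at: datetime
    pending_messages: int
    messages_dequeued: int


def is_stuck(samples: list[QueueSample], threshold: timedelta) -> bool:
    """Return True if the newest sample shows a backlog and the dequeue
    counter has been unchanged for longer than `threshold`."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.taken_at)
    latest = ordered[-1]
    if latest.pending_messages <= 0:
        return False  # nothing is waiting, so an idle queue is not a problem
    # Walk backwards to the oldest sample in the unbroken run that has
    # the same dequeue count as the newest sample.
    idle_since = latest.taken_at
    for sample in reversed(ordered):
        if sample.messages_dequeued != latest.messages_dequeued:
            break
        idle_since = sample.taken_at
    return latest.taken_at - idle_since > threshold
```

Because the idle time is measured between observed samples, it is a lower bound on the real idle time; a queue is only flagged once the logs actually cover more than the threshold.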
B. Notification Phase (Problem)
- Action: An email notification is sent to pre-configured recipients.
- Frequency: Only the first detection of a "stuck" state for a specific queue triggers a notification. The system uses a "stamp" file to prevent repetitive alerts for an ongoing issue.
- Notification Content:
- Subject: `Problem: [Customer Name] AMQ [Channel]/[Process] hasn't processed anything in [X] minutes`
- Body: Provides details of the affected queue, the threshold exceeded, and the latest log data that triggered the alert. This is crucial for initial diagnosis.
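The "notify once" behaviour and the subject line above could look roughly like the following Python sketch. The stamp file naming scheme, the directory argument, and the function names are assumptions made for illustration; the real script may store its stamps differently.

```python
from pathlib import Path


def stamp_path(stamp_dir: Path, channel: str, process: str) -> Path:
    # Hypothetical naming scheme: one stamp file per channel/process pair.
    return stamp_dir / f"{channel}__{process}.stamp"


def should_send_problem(stamp_dir: Path, channel: str, process: str) -> bool:
    """Return True only the first time a queue is reported as stuck."""
    stamp_dir.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails if the file already exists, so checking for the
        # stamp and creating it happen in a single step.
        with open(stamp_path(stamp_dir, channel, process), "x"):
            return True
    except FileExistsError:
        return False  # already alerted for this ongoing incident


def problem_subject(customer: str, channel: str, process: str, minutes: int) -> str:
    """Fill in the documented subject template."""
    return (
        f"Problem: {customer} AMQ {channel}/{process} "
        f"hasn't processed anything in {minutes} minutes"
    )
```

Creating the stamp with exclusive mode avoids a separate "exists?" check, which keeps two overlapping runs from both sending the first alert.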
C. Incident Response Protocol (Shared Responsibility)
Upon receiving a "Problem" notification:
- Acknowledge Incident (Ops):
- Action: The Operations team (or on-call engineer) immediately acknowledges the alert and logs it in the incident management system.
- Responsibility: Ops
- Initial Triage & Diagnosis (Ops / DBA):
- Action: Determine the scope of impact (single queue, multiple queues, related services). Review monitoring dashboards for the affected AMQ broker, the queue itself, and the associated processing application. Check for obvious errors (e.g., application crashes, full disks, network issues).
- Shared Responsibility: Ops (system-level checks), DBA (if queue processing involves database operations, e.g., deadlocks, slow queries).
- Root Cause Analysis & Remediation (Ops / Dev / Infra):
- Action: Based on triage, identify the underlying cause (e.g., application bug, resource exhaustion, network partition, misconfiguration).
- Responsibility:
- Dev: If the issue is within the application code processing the queue.
- Infra: If the issue is with the underlying server, virtualization, storage, or network. This includes communicating with the hosting/cloud provider if it's a platform issue (Infra leads, Ops provides context).
- Ops: For operational issues, configuration changes, or restarting services.
- Objective: Restore processing for the affected queue. This may involve:
- Restarting the processing application.
- Scaling up resources.
- Fixing configuration errors.
- Working with the hosting provider to resolve platform-level outages.
- Internal Communication (Ops / Biz):
- Action: Provide timely updates to internal stakeholders (e.g., other Ops teams, relevant business unit leads, management) on the status and estimated recovery time.
- Responsibility: Ops
- External/Customer Communication (Biz / Ops):
- Action: If the incident impacts customer-facing services or data processing, the Business/Management team decides on the communication strategy and messaging. Ops then distributes pre-approved messages via appropriate channels (status page, email, etc.).
- Shared Responsibility: Biz (Messaging Approval), Ops (Distribution).
D. Notification Phase (Resolved)
- Trigger: The system detects that the `messagesDequeued` count for a previously "stuck" queue has increased since the last check, indicating processing has resumed.
- Action: An email notification is sent to pre-configured recipients. The "stamp" file for that queue is deleted.
- Notification Content:
- Subject: `Resolved: [Customer Name] AMQ [Channel]/[Process] hasn't processed anything in [X] minutes` (Subject reflects the original problem context for clarity.)
- Body: Confirms that the queue is now actively processing messages.
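A minimal Python sketch of the resolution check described above, assuming the script keeps the dequeue count from the previous run and knows where the queue's stamp file lives. The function names and arguments are hypothetical.

```python
from pathlib import Path


def check_resolved(stamp: Path, previous_dequeued: int, current_dequeued: int) -> bool:
    """Return True, and delete the stamp, when a previously stuck queue
    has dequeued at least one message since the last check."""
    if not stamp.exists():
        return False  # the queue was never reported as stuck
    if current_dequeued <= previous_dequeued:
        return False  # still stuck; keep the stamp to suppress repeat alerts
    stamp.unlink()  # the next stuck episode will trigger a fresh Problem email
    return True


def resolved_subject(customer: str, channel: str, process: str, minutes: int) -> str:
    """Reuse the original problem text, prefixed with 'Resolved:'."""
    return (
        f"Resolved: {customer} AMQ {channel}/{process} "
        f"hasn't processed anything in {minutes} minutes"
    )
```

Deleting the stamp only after progress is observed means a queue that stays stuck across many runs produces one "Problem" email and, later, one "Resolved" email.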
E. Post-Incident Review (All Teams)
- Action: After the incident is fully resolved and services are stable, a blameless post-mortem is conducted.
- Objective:
- Identify the root cause of the stuck queue.
- Analyze the effectiveness of the detection and response process.
- Document lessons learned.
- Propose and assign action items for preventing recurrence or improving response (e.g., updating thresholds, improving monitoring, application code changes, infrastructure resilience improvements).
- Responsibility: All involved teams (Ops, Dev, Infra, DBA, Biz) participate. Ops often facilitates, while Biz ensures the business impact is fully understood and addressed.
4. Key Business Configuration Parameters
These parameters directly influence the sensitivity and recipients of the monitoring system. They should be reviewed and approved by relevant business and operational stakeholders.
| Parameter | Description | Primary Owner for Definition | Impact of Changes (Business) |
|---|---|---|---|
| `Normal Queue Threshold` | The maximum number of minutes a standard AMQ queue (with pending messages) can remain idle (no dequeued messages) before a "Problem" notification is triggered. | Ops / Biz | Defines sensitivity for most queues. Lower values mean faster alerts but potentially more noise. |
| `Special Queue Threshold` | The maximum number of minutes a queue on the Special Queue List can remain idle before a "Problem" notification. This is used for queues with inherently low traffic, or those deemed extremely critical where immediate intervention is required for even short delays. | Biz / Ops | Allows for fine-tuned alerting for specific, high-priority, or low-activity queues. |
| `Special Queue List` | A comma-separated list of specific AMQ queue process names (e.g., `invoice_processor`, `payment_gateway`) that should use the Special Queue Threshold instead of the Normal Queue Threshold. | Biz / Ops | Directly dictates which business processes receive specialized (more or less stringent) monitoring. |
| `Notification Recipients` | A comma-separated list of email addresses for internal operational teams (Ops, Dev, Infra) who will receive all "Problem" and "Resolved" notifications. This is the primary alert channel. | Ops | Ensures the right technical personnel are immediately informed. |
| `External Recipients` (Optional) | A separate list of email addresses for external parties or client-facing teams. This list is only used if the monitoring script is explicitly configured to send external notifications (typically for specific client-facing incidents). | Biz / Ops | Enables controlled communication of issues that directly affect external partners or customers. |
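To illustrate how the two thresholds and the Special Queue List interact, here is a small Python sketch. The function names are hypothetical and the numbers in the usage note are made-up examples, not recommended or actual settings.

```python
def parse_csv(value: str) -> list[str]:
    """Split a comma-separated setting, ignoring blanks and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def threshold_minutes(process: str, normal: int, special: int, special_list: str) -> int:
    """Pick the Special Queue Threshold for listed processes and the
    Normal Queue Threshold for everything else."""
    return special if process in parse_csv(special_list) else normal
```

With illustrative values, `threshold_minutes("invoice_processor", 30, 120, "invoice_processor, payment_gateway")` selects the special threshold, while any process name not on the list falls back to the normal one.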
5. System Dependencies
- AMQ Application Logging: Accurate, consistent, and verbose logging of AMQ queue activity is paramount. Any changes to AMQ logging configuration could impact detection.
- Mail Service: A reliable email sending service is required for all notifications.
- Configuration Management: The `ewreportingproperties.yml` configuration file must be maintained and accessible in its defined location.
6. Ownership & Governance
- Service Owner: [Designate Team/Role, e.g., "Operations Team" or "Platform Engineering"]
- Technical Support: [Designate Team/Role, e.g., "DevOps Team" for script maintenance, "Infra Team" for underlying server support]
- Business Review Cadence: All business-impacting configuration parameters (`Normal Queue Threshold`, `Special Queue Threshold`, `Special Queue List`, `Notification Recipients`) should be formally reviewed and approved by relevant business and technical stakeholders at least annually, or whenever major changes to AMQ usage or business processes occur. This ensures the system remains aligned with current operational needs and business priorities.