Analysis – eGaming Application Support @ BCLC

Paola Flores Aguirre T00651883

Analysis report

Purpose

The purpose of this project is to assist the eGaming Application Support (eGAS) team, which is responsible for maintaining and supporting PlayNow.com, a platform that requires high availability and performance for applications such as eCasino, online sports, payment services, and others. The project aims to improve visibility into the system and its existing monitoring by refining the current tag rules and alerting in Dynatrace, thereby reducing the time spent figuring out which parts of the system are impacted by an issue and allowing the team to act before something breaks. This will help support teams become more efficient and proactive, since they will be able to identify root causes and provide a solution quickly, rather than mapping the system on the spot and trying to locate relevant alerts or incidents created by the system that are not useful.

Data Collection:

  1. Conducting Interviews with Key Stakeholders

The goal of the interviews was to gather insights from the different teams affected by the monitoring and alerting system. Understanding their perspectives helps identify strengths, weaknesses, and opportunities for improvement based on each team's experience.

  • Teams Involved:
    • eGaming Application Support Team: Interviewed to understand how the current monitoring system affects day-to-day operations, including any issues with false positives or missed alerts.
    • Agile Subdivisions: Conducted interviews with members of teams such as eCasino, eLottery, eSports, Customer Identity & Access Management (CIAM), Player Account Management Teams (PAM1 & PAM2), and Payments (PAM2). These teams provided insights into specific application behaviors, critical components, and their expectations from the monitoring system.
    • Application Performance Management (APM) Team: Interviewed to gain a technical understanding of the current Dynatrace setup, including how it monitors the system and the rationale behind existing alert configurations.
    • ServiceNow Team: Engaged to understand how incident records are managed and reported, and how the integration between Dynatrace and ServiceNow supports issue tracking and resolution documentation.
  2. Exploring the System Infrastructure
  • Dynatrace Analysis:
    • System Infrastructure Review: Used Dynatrace to map out the system infrastructure and network layout. This exploration provided a clear view of how different components and services are interconnected, which is crucial for identifying potential monitoring gaps.
    • Performance Metrics Review: Analyzed key performance metrics such as response times, error rates, and user behavior patterns within Dynatrace. This step was essential for defining what constitutes normal and abnormal behavior across different applications and services.
  • ServiceNow Incident Analysis:
    • Incident Report Analysis: Ran reports in ServiceNow to observe the frequency and types of incidents created by Dynatrace alerts. The focus was on incidents related to error rates and user behavior, which are often cited as frequent and potentially clearable without team intervention.
    • Impact Assessment: Evaluated the impact of these incidents on the eGAS team’s workload and identified patterns or trends that could suggest opportunities for optimizing alert thresholds or auto-remediation.
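
As a rough illustration of how the incident reports were reviewed, the sketch below groups a ServiceNow incident export by alert description to surface the most frequent alert types and how quickly they close. The file name and column layout (number, short_description, opened_at, closed_at) are assumptions, not the actual export format.

```python
# Minimal sketch: rank Dynatrace-generated incidents from a ServiceNow export
# by frequency and typical time-to-close. The CSV layout (incidents.csv with
# columns number, short_description, opened_at, closed_at) is hypothetical.
import pandas as pd

incidents = pd.read_csv("incidents.csv", parse_dates=["opened_at", "closed_at"])

# Time each incident stayed open, in minutes.
incidents["minutes_open"] = (
    incidents["closed_at"] - incidents["opened_at"]
).dt.total_seconds() / 60

summary = (
    incidents.groupby("short_description")
    .agg(count=("number", "size"), median_minutes_open=("minutes_open", "median"))
    .sort_values("count", ascending=False)
)

# The most frequent, fastest-closing alert types are the best candidates for
# threshold tuning or auto-remediation.
print(summary.head(5))
```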

Key Findings:

Recurrent False Positives:

  • Observation: The current alerting system frequently generates false positive alerts and incidents. These alerts often correspond to issues that resolve themselves within a short period, even before the support teams begin investigating the potential cause. For instance, the system triggers alerts for routine traffic increases that are not indicative of issues, or raises an error-rate alert when a player tries to check the results of a draw while it is still being processed; the error increase is expected because results are not yet posted. The result is alert fatigue among team members.
  • Impact: These self-resolving issues do not affect system performance or user experience on the PlayNow site. Consequently, they do not lead to revenue loss or customer dissatisfaction. However, the incidents remain open in the support team’s backlog, adding unnecessary tasks that require attention and updates.
  • Operational Consequences: Support teams typically use a default message to close these incidents, consuming valuable time that could be better spent on addressing issues that genuinely impact players or completing routine tasks critical to the team’s success.

The top five most frequently created alerts are:

  1. Failure rate increase on Web service BcLmaCouponsService
  2. JavaScript error rate increase for Web application www.playnow.com
  3. Failure rate increase on Web request service SG IAM – Identity and Access Management / Services (/iam-service/v1)
  4. Unexpected high traffic for Web application www.playnow.com
  5. Response time degradation on Web request service playnow.com:80

Communication and Investigation Challenges:

  • Observation: When a genuine issue requires investigation, the support teams must engage in constant communication with other teams to determine which system components—such as hosts, services, process groups, or requests—have been impacted and to understand their functionality.
  • Impact: This need for continuous coordination complicates the identification of abnormal application behavior and hinders the ability to detect patterns that could facilitate quicker issue resolution.

Other findings relate to the new payment system launched at BCLC earlier this year:

Overview of Requests for Change (RFCs):

  • Total RFCs: A total of 13 Requests for Change (RFCs) were recorded between June 15th, when the new system was launched, and July 24th, the date of this report. Most issues were resolved through restarts or reboots.

Incident Management:

  • Average Time Spent per Incident: 2 hours.
  • Overtime Hours: 24 overtime hours were recorded in June and 18 in July, for a total of 42 hours, indicating a significant time investment in resolving issues outside of regular working hours. Overall, 83% of issues occurred outside of standard business hours, with only 17% occurring during business hours.
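
The business-hours split can be reproduced from the same kind of incident export. The sketch below, again using a hypothetical export layout, counts incidents opened inside versus outside an assumed Monday-to-Friday, 08:00–17:00 window.

```python
# Sketch: classify incidents as opened inside or outside standard business
# hours (assumed here to be Mon-Fri, 08:00-17:00); the export layout is
# hypothetical.
import pandas as pd

incidents = pd.read_csv("incidents.csv", parse_dates=["opened_at"])

def in_business_hours(ts: pd.Timestamp) -> bool:
    return ts.weekday() < 5 and 8 <= ts.hour < 17

incidents["business_hours"] = incidents["opened_at"].apply(in_business_hours)
share = incidents["business_hours"].value_counts(normalize=True) * 100
print(share.rename({True: "business hours %", False: "off hours %"}))
```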

System Performance Issues:

  • Disk and CPU Usage: The /var/log partition on KamLeap010 and KamLeap011 servers consistently experiences over 80% disk and CPU usage. This high usage could be contributing to the system’s instability and warrants close monitoring or optimization.
  • PaymentController Issues: The getPaymentByUuid request has shown a constant increase in failure rates, indicating a growing issue that needs addressing.
  • Multiple API Failures: Several API endpoints on KamLeap010 and KamLeap011 have been experiencing increased failure rates, further contributing to system instability.
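
As a stopgap alongside Dynatrace, the high /var/log usage noted above could be watched with a simple local check along the following lines. The 80% threshold mirrors the observed usage level; everything else, such as running it via cron on each host, is an assumption rather than part of the current setup.

```python
# Sketch: warn when the /var/log partition exceeds the observed 80% usage
# level. Intended to run locally on each host (e.g. via cron); not part of
# the existing Dynatrace configuration.
import shutil

THRESHOLD_PCT = 80
usage = shutil.disk_usage("/var/log")
used_pct = usage.used / usage.total * 100

if used_pct >= THRESHOLD_PCT:
    print(f"WARNING: /var/log at {used_pct:.1f}% (threshold {THRESHOLD_PCT}%)")
else:
    print(f"OK: /var/log at {used_pct:.1f}%")
```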


Service Restart Efficiency:

  • First-Attempt Resolutions: 73% of the issues were resolved on the first attempt at restarting the service, supporting the case for automation to improve efficiency and reduce manual interventions.
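
Given that most restarts succeed on the first attempt, a first step toward auto-remediation could be a small webhook receiver that restarts the affected service when Dynatrace posts a problem notification. The sketch below is illustrative only: the service-to-host mapping, unit names, payload fields, and the use of SSH with systemctl are assumptions, and any real implementation would need authentication, retries, and change control around the restart.

```python
# Sketch of an auto-remediation hook: receive a Dynatrace problem notification
# and restart the mapped service on its host. Service names, host names, and
# the webhook payload fields used here are assumptions for illustration.
import subprocess
from flask import Flask, request

app = Flask(__name__)

# Hypothetical mapping from impacted entity name to (host, systemd unit).
RESTART_TARGETS = {
    "payments-service": ("kamleap010.example.internal", "payments.service"),
}

@app.route("/dynatrace/problem", methods=["POST"])
def handle_problem():
    payload = request.get_json(force=True)
    entity = payload.get("impactedEntity", "")
    target = RESTART_TARGETS.get(entity)
    if target is None:
        return {"action": "none", "reason": "no restart target configured"}, 200

    host, unit = target
    # First-attempt restart over SSH; real use would add auth, retries, audit.
    result = subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", unit],
        capture_output=True, text=True,
    )
    status = "restarted" if result.returncode == 0 else "restart failed"
    return {"action": status, "entity": entity}, 200

if __name__ == "__main__":
    app.run(port=8080)
```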

Reviewing Current Alert Configurations

  • Configuration Review: Examined the current alert configurations in Dynatrace to understand the existing thresholds and conditions that trigger alerts. This review was done in collaboration with the APM team to ensure a thorough understanding of why certain thresholds were set and how they might be adjusted to reduce false positives. It was concluded that most thresholds were left at their defaults and could be modified, as the support team confirmed that removing these false positives would not affect the system or its alerting later on.
  • Comparison with Normal Behavior: Compared the alert configurations against the normal behavior patterns identified during the Dynatrace analysis and through discussions with the teams in charge of troubleshooting these errors. This comparison helped pinpoint alerts that are unnecessarily triggered by routine system activities, such as the expected traffic spikes and other behaviors mentioned earlier.
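
For reference during this review, the current service-level thresholds can be pulled out of Dynatrace and compared against observed behavior. The sketch below assumes the Configuration API v1 anomaly-detection endpoint for services and a token with read access; the environment URL and token are placeholders.

```python
# Sketch: read the current service anomaly-detection settings from Dynatrace
# so they can be compared against observed normal behavior. The environment
# URL and token are placeholders.
import json
import requests

DT_ENV = "https://YOUR_ENVIRONMENT.live.dynatrace.com"   # placeholder
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                        # placeholder

resp = requests.get(
    f"{DT_ENV}/api/config/v1/anomalyDetection/services",
    headers={"Authorization": f"Api-Token {DT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```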

Recommendations:

To reduce investigation time and the noise created by some alerts, the most common alerts must be identified so that the adjustable default thresholds in Dynatrace can be changed. Implementing auto-remediation where possible is also advised to save time, reduce costs, and optimize resource use. Finally, creating tags that identify each service, host, or entity in the system and what it serves is key to reducing troubleshooting time.

Improving Alerts:

  • Adjust the threshold for traffic-related alerts to reduce false positives, focusing instead on error rates and response time anomalies.
  • Create maintenance windows for expected behavior or known slowdowns, such as players checking tickets at draw times, to avoid those alerts (see the sketch after this list).
  • Adjust thresholds for alerts related to payment failures, since some failures originate on the end-user side and a known issue is currently being worked on.
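
A minimal sketch of the maintenance-window recommendation follows. The payload shape approximates the Dynatrace Configuration API v1 maintenance-window schema, so field names and values should be verified against the environment's API documentation; the schedule, time zone, and suppression mode are illustrative only.

```python
# Sketch: create a recurring maintenance window around draw-result processing
# so expected error-rate increases do not page the team. The payload shape
# approximates the Configuration API v1 schema; the times, zone, and
# suppression mode below are illustrative.
import requests

DT_ENV = "https://YOUR_ENVIRONMENT.live.dynatrace.com"   # placeholder
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                        # placeholder

window = {
    "name": "Draw processing - expected lookup errors",
    "description": "Suppress alerting while draw results are being processed.",
    "type": "PLANNED",
    "suppression": "DETECT_PROBLEMS_DONT_ALERT",
    "schedule": {
        "recurrenceType": "DAILY",
        "recurrence": {"startTime": "19:30", "durationMinutes": 45},
        "start": "2024-08-01 00:00",
        "end": "2025-08-01 00:00",
        "zoneId": "America/Vancouver",
    },
}

resp = requests.post(
    f"{DT_ENV}/api/config/v1/maintenanceWindows",
    headers={"Authorization": f"Api-Token {DT_TOKEN}"},
    json=window,
    timeout=30,
)
resp.raise_for_status()
print("Created maintenance window:", resp.json())
```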

Tagging & Classification:

  • Implement a tagging system that categorizes services by their function, for instance payments, registration, eCasino, and eLottery, along with any other tags that help teams identify which components of the system belong to them and which support team a problem should be escalated to (a sketch follows below).
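
A minimal sketch of such an auto-tag rule is shown below. The rule structure approximates the Dynatrace Configuration API v1 auto-tags schema and should be verified against the environment's API documentation; the tag name and match value are illustrative.

```python
# Sketch: create an automatically applied tag that marks payment-related
# services with their owning team. The rule structure approximates the
# Configuration API v1 auto-tags schema; the tag name and match value are
# illustrative only.
import requests

DT_ENV = "https://YOUR_ENVIRONMENT.live.dynatrace.com"   # placeholder
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                        # placeholder

auto_tag = {
    "name": "owner:payments",
    "rules": [
        {
            "type": "SERVICE",
            "enabled": True,
            "conditions": [
                {
                    "key": {"attribute": "SERVICE_NAME"},
                    "comparisonInfo": {
                        "type": "STRING",
                        "operator": "CONTAINS",
                        "value": "payment",
                        "negate": False,
                        "caseSensitive": False,
                    },
                }
            ],
        }
    ],
}

resp = requests.post(
    f"{DT_ENV}/api/config/v1/autoTags",
    headers={"Authorization": f"Api-Token {DT_TOKEN}"},
    json=auto_tag,
    timeout=30,
)
resp.raise_for_status()
print("Created auto-tag rule:", resp.json())
```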