Design – eGaming Application Support @ BCLC

September 25, 2024

Design Report

Student: Paola Flores Aguirre

ID: T00651883

Incident Management for eGaming Application Support team at BCLC

Date: September 2024

Introduction

Purpose of the Report:

This report presents a detailed design plan and the steps that will be followed to enhance the current monitoring setup for PlayNow.com, which is supported by the eGaming Application Support team. The setup covers the alerts created by Dynatrace, the observability tool for the system, which are forwarded to ServiceNow, the application used to track the team's work and pending tasks.

Background:

The project focuses on optimizing system monitoring and incident management using Dynatrace and ServiceNow, collaborating with multiple stakeholders and platform administrators to make the appropriate changes and help the eGaming Application Support team improve its observability of, and response time to, issues on PlayNow.com. Reviewing the system’s infrastructure through Dynatrace helped identify the interconnected services behind each business function, potential monitoring gaps, threshold adjustment options for alerts, and possible auto-remediation opportunities. Key performance metrics, such as response times and error rates, were also analyzed to define normal and abnormal system behavior.

The findings include the frequent creation of false-positive alerts, often linked to routine traffic increases or expected user behavior such as lottery draw times. These alerts create tasks that do not help the support teams and only add noise and work to their backlog. Managing incidents raised by unnecessary alerts is time-consuming and takes away valuable time and resources that could be spent on actual issues that impact players or affect platform performance. Additionally, incidents related to system performance problems, such as high CPU and disk usage on critical servers and payment system failures, have resulted in significant overtime hours and, if left unresolved, could cause revenue loss. Challenges in communication between teams during genuine incidents have also been noted. A review of the current alert configuration showed that thresholds were set to their defaults; adjusting them, along with implementing auto-remediation and a tagging system for better classification of services and hosts, could reduce alert noise, improve efficiency, and lower costs.

Design Objectives

Goals:

The main goal of the design is to reduce the noise created by false-positive alerts that saturate the support teams’ backlogs with items that are not useful because they report expected behavior. With this noise removed, critical issues can be identified and addressed directly.

To accomplish this, the thresholds that trigger these alerts in Dynatrace must be changed, which will in turn stop the creation of the corresponding incident records in ServiceNow. It is also important to create tags that allow the different agile teams to identify the hosts, services, processes, and other infrastructure items that belong to their team and scope of work, so that when part of the system has issues they can quickly determine whether they support it and which function it serves. Finally, auto-remediation recommendations should be made and implemented where possible.

Constraints:

  • The design must be implemented within the existing infrastructure and not disrupt or break ongoing operations.
  • The design requires approval from the Application Performance Management team (APM) to adjust the alert thresholds set in Dynatrace without decreasing the visibility and alerting processes for critical services or processes.
  • The design requires the collaboration of the multiple agile teams within the eGaming Application Support team (eGAS) to identify the crucial items and objects being monitored by Dynatrace so they can be tagged appropriately.
  • The auto-remediation recommendations and possible implementation must be approved by the eGaming Technical Services team (ETS) as access to the F5 is required to implement server and/or service restarts.

System Architecture

Overview of the Current System:

The current system relies on Dynatrace to monitor various applications and the system infrastructure. The eGAS team uses Dynatrace primarily to monitor PlayNow.com and ensure it is available and performing as expected. This is done through alerts configured for each part of the system that supports the application, such as services and hosts, as well as alerts configured specifically for performance metrics such as response times, traffic spikes, and error rates. At present, these alerts use default values, which causes problems when alerts are created for normal behavior that some teams do not yet recognize or identify. It is also difficult to identify the root cause of problems when alerts are created: not all teams know other teams’ services or processes, and they find it challenging to determine which team is responsible for what. If a component is down, it is hard to understand what is being affected and which other services are linked to the troubled part. Because investigating the possible impact and the responsible support team takes time, these alerts get escalated into incidents in ServiceNow, adding more workload to the support team queue. All of these manual and tedious processes consume valuable time and resources.

Design

Proposed Design:

The proposed design includes adjusting the alert thresholds based on support team feedback. For each alert, the teams determine whether it is triggered by normal behavior; if so, the alert is removed, and if the condition does impact PlayNow.com and players, the threshold is raised to a point where the alert only fires when a major issue has arisen, effectively reducing the sensitivity of the current Dynatrace setup. In the same way, with the assistance of the support teams, the services, hosts, and processes they support can be identified, examined within the system, and tagged to show the function they serve. This will prevent unnecessary incidents in ServiceNow and therefore clear the noise for the support teams. Lastly, the design provides recommendations on auto-remediation opportunities and synthetic monitoring of key activities performed by users on PlayNow.com.

Alert Configuration:

As part of the incident management initiative to support eGAS and the Data Centre Operations team (DCO) by reducing the noise created by some Dynatrace alerts, a report was created in ServiceNow to identify the five most frequent incidents created by Dynatrace alerts over the last three months. These issues were discussed with the support teams that work on them to determine how helpful the alerts are and to identify possible player impact, if any. In parallel, working with members of the APM team, all possible changes and adjustment options were gathered to determine the best approach for implementing them in Dynatrace so that noise is reduced without impacting other monitoring areas or the parts of the current setup that work properly. The possible solutions that can be implemented are listed below, followed by a configuration sketch:

  • Create a maintenance window for a specific time period during which a service should not alert because “abnormal” behavior is expected.
  • Mute or silence a service or request; if the request is not crucial to system performance and teams do not need to supervise it, it can be silenced.
  • Manually raise the thresholds from the current defaults if a service or request handles large volumes of data and the current sensitivity is too high, creating unnecessary alerts and thus incidents.
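As an illustration of the first option, the sketch below shows how a daily maintenance window could be created through the Dynatrace Settings 2.0 API. This is a minimal sketch only: the environment URL, API token, time window, and the field names inside the payload are assumptions for illustration and would need to be verified against the environment’s maintenance-window settings schema before use.

```python
# Hedged sketch: creating a daily "expected draw-time behavior" maintenance window
# via the Dynatrace Settings 2.0 objects endpoint (POST /api/v2/settings/objects).
# Environment URL, token, and payload field names are placeholders/assumptions.
import requests

DT_ENV = "https://example.live.dynatrace.com"  # placeholder environment URL
API_TOKEN = "dt0c01.EXAMPLE"                   # placeholder token with settings write scope

payload = [{
    "schemaId": "builtin:alerting.maintenance-window",  # assumed schema id
    "scope": "environment",
    "value": {
        "generalProperties": {
            "name": "Draw-time suppression window",
            "description": "Suppress alerting for expected behavior around lottery draw times",
            "maintenanceType": "UNPLANNED",
            "suppression": "DETECT_PROBLEMS_DONT_ALERT",  # keep detection, stop alerting
        },
        "schedule": {
            "scheduleType": "DAILY",
            "dailyRecurrence": {
                "timeWindow": {
                    "startTime": "19:00",
                    "endTime": "23:59",
                    "timeZone": "America/Vancouver",
                },
                "recurrenceRange": {
                    "scheduleStartDate": "2024-10-01",
                    "scheduleEndDate": "2025-10-01",
                },
            },
        },
    },
}]

resp = requests.post(
    f"{DT_ENV}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```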

After discussing the top 5 alerts and the possible solutions with the different agile teams, the following changes were made in the Dynatrace alerting system:

  1. Failure rate increase on web service BcLmaCouponsService – MarkCouponAsSeen: This service creates most of the Dynatrace-generated incidents in ServiceNow (69 incidents). The error occurs when a draw is happening and players want to check their lottery tickets, but the results are not posted yet, as posting takes a few hours. This behavior may be expected around that time (7 pm to 12 am).

Update: A maintenance window will be created from 7 pm to 12 am every day.

  2. JavaScript error rate increase for web application www.playnow.com – These are various JavaScript errors, most on the OpenBet side, the main vendor used by BCLC, which means no changes can be made on our end to fix them, but an increase in thresholds can be implemented. No impact has been found yet, but it is important to be notified if the errors increase significantly, as in past incidents where a release caused issues and had to be rolled back.

Update: Thresholds will be modified from an absolute value of 0%, a relative value of 50%, and a time of 1 min to an absolute value of 100%, a relative value of 150%, and a failure rate response time of 15 mins (the time it takes to create a problem).

  3. Failure rate increase on web request service SG IAM – Identity and Access Management / Services (/iam-service/v1): This is an Identity and Access Management endpoint used for account verification during login. Some of these failures are expected, as users may forget their passwords or have difficulties while logging in.

Update: Thresholds will be modified from an absolute value of 0%, a relative value of 50%, and a time of 1 min to an absolute value of 8.5%, a relative value of 100%, and a failure rate response time of 10 mins (the time it takes to create a problem).

  4. Unexpected high traffic for web application www.playnow.com – High traffic is expected and observed when big jackpot draws happen. However, as past cyber incidents have shown, it is still valuable to alert on unexpected behavior. The alert cannot necessarily be context-aware, but changes can be made.

Update: Thresholds will be modified from traffic spikes of 200%, traffic drops of 50%, and a time of 1 min to traffic spikes of 250%, traffic drops of 75%, and a failure rate response time of 5 mins (the time it takes to create a problem).

  5. Response time degradation on web request service playnow.com:80 – A slowdown was found on different services including sessions, payments, and /playnow, the last one being very broad. Since it is too broad and impacts multiple services, the only threshold that can be changed is the failure rate response time, as it is quite sensitive at the moment.

Update: The failure rate response time threshold will be modified from 1 min to 10 mins (the time it takes to create a problem).
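To keep the agreed adjustments reviewable in one place, the summary below records the before/after values for items 2–5 as plain data. This is not a Dynatrace payload; the structure and key names are purely illustrative documentation of the changes listed above.

```python
# Illustrative documentation of the agreed threshold changes (items 2-5 above).
# Key names are descriptive only and do not correspond to a Dynatrace API schema.
THRESHOLD_CHANGES = {
    "www.playnow.com JavaScript error rate": {
        "before": {"absolute_pct": 0,   "relative_pct": 50,  "eval_minutes": 1},
        "after":  {"absolute_pct": 100, "relative_pct": 150, "eval_minutes": 15},
    },
    "SG IAM /iam-service/v1 failure rate": {
        "before": {"absolute_pct": 0,   "relative_pct": 50,  "eval_minutes": 1},
        "after":  {"absolute_pct": 8.5, "relative_pct": 100, "eval_minutes": 10},
    },
    "www.playnow.com traffic anomalies": {
        "before": {"spike_pct": 200, "drop_pct": 50, "eval_minutes": 1},
        "after":  {"spike_pct": 250, "drop_pct": 75, "eval_minutes": 5},
    },
    "playnow.com:80 response time degradation": {
        "before": {"eval_minutes": 1},
        "after":  {"eval_minutes": 10},
    },
}

# Simple review printout of each change.
for alert, change in THRESHOLD_CHANGES.items():
    print(f"{alert}: {change['before']} -> {change['after']}")
```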

Tagging and Classification:

After researching the agile teams’ documentation stored in Confluence/Jira, the overall PlayNow.com setup and functionality, and the current tags being used in Dynatrace, it has been determined that a new “Function” tag can complement the “Support.team” tag already used in Dynatrace. This will help the support teams identify which functionality each part of the system serves and which components could be affected by it. The following subcategories will be created for the “Function” tag:

  • Function: Describes the functionality a service, process or technology serves: iCasino, iLottery, Payments, Registration, Login, Promotions, Geolocation, Digital Signage, PlayNow Encore.

These tags will help teams identify each part of the system more quickly and easily, especially when components share the same or a similar name but serve different functions. For example, Tomcat/localhost appears twice, but each instance serves a different function and would require the assistance of a different agile team. To complete this, a deep dive into the currently monitored components is needed to identify their purpose and their function relative to other parts of the system. See the example below:
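As a minimal sketch of how the “Function” tag could be applied in bulk, the snippet below uses the Dynatrace custom tags endpoint (POST /api/v2/tags) with an entity selector. The environment URL, token, entity selector, and the Payments value assigned to the example service are placeholders and assumptions; the actual mapping of components to functions would be confirmed with the owning agile team.

```python
# Hedged sketch: attaching a "Function" tag to monitored entities through the
# Dynatrace custom tags API. URL, token, selector, and tag value are placeholders.
import requests

DT_ENV = "https://example.live.dynatrace.com"
API_TOKEN = "dt0c01.EXAMPLE"  # placeholder token with entity-write permission

def tag_function(entity_selector: str, function_value: str) -> None:
    """Attach a Function:<value> tag to every entity matched by the selector."""
    resp = requests.post(
        f"{DT_ENV}/api/v2/tags",
        params={"entitySelector": entity_selector},
        headers={"Authorization": f"Api-Token {API_TOKEN}"},
        json={"tags": [{"key": "Function", "value": function_value}]},
        timeout=30,
    )
    resp.raise_for_status()

# Illustrative example: distinguish one of the two Tomcat/localhost services by the
# function it serves (the "Payments" value here is an assumption for illustration).
tag_function('type("SERVICE"),entityName.equals("Tomcat/localhost")', "Payments")
```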

Auto-Remediation:

Recommendations can be made based on the recurring issues found on multiple payment-related servers and services. At this time, further understanding of the system and the current setup is needed before implementing automatic restarts of services and servers when they break after a failure-rate increase or CPU saturation. Any such change has to be discussed with and approved by the eGaming Technical Services team (ETS), which manages the infrastructure and its maintenance. In addition, the APM team can provide input on other possibilities to mitigate the impact, as these issues consume BCLC resources and cause overtime.
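Purely as a conceptual sketch of what an approved remediation step might look like, the snippet below reads host CPU usage through the Dynatrace metrics API and restarts a service when a threshold is exceeded. The host name, service unit, threshold, and restart command are hypothetical; any real implementation would go through ETS-approved tooling and the F5 rather than a raw remote command.

```python
# Conceptual auto-remediation sketch only: metric source is the Dynatrace metrics
# API (GET /api/v2/metrics/query); host, service unit, threshold, and the ssh/systemctl
# restart are placeholders pending ETS approval and proper tooling.
import subprocess
import requests

DT_ENV = "https://example.live.dynatrace.com"
API_TOKEN = "dt0c01.EXAMPLE"
CPU_THRESHOLD = 90.0  # percent, illustrative value

def host_cpu_usage(host_name: str) -> float:
    """Return the most recent CPU usage datapoint for a host from Dynatrace."""
    resp = requests.get(
        f"{DT_ENV}/api/v2/metrics/query",
        params={
            "metricSelector": "builtin:host.cpu.usage",
            "entitySelector": f'type("HOST"),entityName.equals("{host_name}")',
            "resolution": "1m",
        },
        headers={"Authorization": f"Api-Token {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    values = resp.json()["result"][0]["data"][0]["values"]
    recent = [v for v in values if v is not None]  # skip empty datapoints
    return recent[-1]

def remediate(host_name: str, service_unit: str) -> None:
    """Restart the given service unit if the host is CPU-saturated (placeholder action)."""
    if host_cpu_usage(host_name) > CPU_THRESHOLD:
        subprocess.run(
            ["ssh", host_name, "sudo", "systemctl", "restart", service_unit],
            check=True,
        )

# Hypothetical payment host and service names for illustration only.
remediate("payments-app-01.example.internal", "payments-service")
```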

Conclusion

The design addresses the current challenges by optimizing alert configurations, improving monitoring coverage through tagging, and recommending auto-remediation where it can later be implemented. Together, these changes will make troubleshooting more efficient for the support teams and allow them to spend their time on problems that have an actual impact on the system and on players’ gameplay.
