AurigaSentry handbook
Background
As the unified data platform team, we set up and maintain a number of monitors in ODINSentry against the scheduled Cosmos jobs. These monitors probe and guard the SLAs and the availability of those jobs. Because the ODIN team now follows a stricter compliance policy with the IDEAs team, we need to decouple from ODIN tools, including Sentry.
AurigaSentry is a lightweight platform that inherits part of the data pipeline DRI monitoring functions from ODINSentry. It covers these functions through multiple applications and frameworks. In the long term, AurigaSentry is expected to handle more customized data engineering functionality.
For any requirements and questions, please contact msaidataplatformdri@microsoft.com.
Infrastructure
MSAI Data Platform Team Monitors
AurigaSentry hosts multiple monitors for the data cooking pipelines across data platforms such as Cosmos and Phoenix. For now, it supports several Cosmos and Blueshift job monitors for specific scheduled Cosmos jobs.
JobSLAMonitor
Azure TimerTrigger Function v1
JobSubmitSLAMonitor
Azure TimerTrigger Function v1
CosmosJobFailMonitor
Azure TimerTrigger Function v1
AurigaSentry Portal
.NET Framework MVC
AurigaSentry includes a web application portal published on the Azure subscription. All AAD accounts can access the portal. It supports functions such as Check Cosmos Job Status and Modify Configurations.
CosmosJobLoader Application
Console Application
AurigaSentry Monitors (Todo)
AurigaSentry implements availability and performance monitors for its own components.
CosmosJobLoader availability monitor
Azure TimerTrigger Function v1
The availability of the CosmosJobLoader application affects the accuracy of the AurigaSentry monitors. The CosmosJobLoaderAvailability monitor checks the SLA for the application's execution health to ensure that the CosmosJobInfo table keeps being updated.
AurigaSentry DB
Until the AurigaSentry Portal is ready, DRIs can use the following tables directly.
Table | Description |
---|---|
SLAMonitorConfigs | The configuration table for Cosmos JobSubmitSLA and JobSLA monitors |
AvailabilityMonitorConfigs | The configuration table for Cosmos Job Fail monitor |
SLAMonitorInstances | The instance table for Cosmos JobSubmitSLA and JobSLA monitors |
AvailabilityMonitorInstances | The instance table for Cosmos Job Fail monitor |
CreatedICMs | The table records the IcM tickets that were created |
HistoricalJobInfoUpdateTimes | The table records the last time CosmosJobLoader, which runs on the VM stcac-420, checked the OBD.prod Cosmos job |
DRIs can connect to the SQL Server DB with the following configuration, using either SQL Server Management Studio or the VS Code extension "SQL Server (mssql)".
- Server type: Database Engine
- Server name: auriga-phoenix.database.windows.net
- Database name: Sentry
- Authentication: SQL Server Authentication
- username: *** (check with Zhao Liu)
- password: *** (check with Zhao Liu)
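Once connected, it can help to look at the columns of each table before querying or editing it, since this handbook does not list them. The query below is a minimal sketch that uses the standard INFORMATION_SCHEMA.COLUMNS view; it assumes all six tables live in the dbo schema, which is how the queries in this handbook reference them.

```sql
-- List the columns and data types of the AurigaSentry tables.
-- Assumption: every table is in the dbo schema.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME IN (
      'SLAMonitorConfigs',
      'AvailabilityMonitorConfigs',
      'SLAMonitorInstances',
      'AvailabilityMonitorInstances',
      'CreatedICMs',
      'HistoricalJobInfoUpdateTimes'
  )
ORDER BY TABLE_NAME, ORDINAL_POSITION;
```

The query is read-only, so it is safe to run on the production DB.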
SLAMonitorConfigs and AvailabilityMonitorConfigs tables
You can check the configuration of the SLA and availability monitors with queries such as:
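```sql
-- Enabled SLA monitor configurations (JobSubmitSLA and JobSLA)
SELECT *
FROM dbo.SLAMonitorConfigs
WHERE IsEnabled = 1;

-- Enabled availability monitor configurations (Cosmos Job Fail)
SELECT *
FROM dbo.AvailabilityMonitorConfigs
WHERE IsEnabled = 1;
```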
If a monitor keeps firing false alarms that trigger phone calls, the DRI can find the alert in these tables and change its Severity from Sev 2 to Sev 3, which stops the calls.
Whenever you update the severity, send a notification to MSAI Data Platform DRI (msaidataplatformdri@microsoft.com).
Please be VERY CAREFUL when updating the severity of an alert: DO NOT change MonitorInformation or any other field.
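A severity change that touches only one field of one row could look roughly like the sketch below. It is not a verified procedure: `MonitorConfigID` is a hypothetical key column used only for illustration, and the sketch assumes that Severity is stored as an integer (2 for Sev 2, 3 for Sev 3). Check the real column names and types in the table before adapting it.

```sql
-- Assumption: MonitorConfigID is a hypothetical key column; use the real key
-- of dbo.SLAMonitorConfigs (or dbo.AvailabilityMonitorConfigs) instead.
-- Assumption: Severity is an integer column (2 = Sev 2, 3 = Sev 3).
DECLARE @MonitorConfigID INT = 0;  -- replace with the ID of the noisy monitor

BEGIN TRANSACTION;

-- Only the Severity column is written; all other fields stay untouched.
UPDATE dbo.SLAMonitorConfigs
SET Severity = 3
WHERE MonitorConfigID = @MonitorConfigID
  AND Severity = 2;

-- Commit only if exactly one row changed; otherwise undo everything.
IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;
```

Wrapping the update in a transaction and checking the affected row count guards against a WHERE clause that matches no rows or more rows than intended.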
SLAMonitorInstances and AvailabilityMonitorInstances tables
These two instance tables record the check snapshots for the configured monitors.
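```sql
-- Latest 500 check snapshots of the SLA monitors
SELECT TOP(500) *
FROM dbo.SLAMonitorInstances
ORDER BY MonitorInstanceID DESC;

-- Latest 500 check snapshots of the availability monitors
SELECT TOP(500) *
FROM dbo.AvailabilityMonitorInstances
ORDER BY MonitorInstanceID DESC;
```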
Each row in the table is a check snapshot taken at a specific timestamp (usually on the hour):
- If AurigaSentry finds that a monitor's status is fine, it appends an instance with CheckStatus==Success and does not re-check that instance again.
- If AurigaSentry finds that a monitor's status breaks the monitor configuration rule, it appends an instance with CheckStatus==Fail and fires an IcM ticket accordingly.
- AurigaSentry then regularly re-checks the status of the failed instance, e.g. every 15 minutes. If the status turns fine on a re-check, CheckStatus is marked as Success and the IsResolved field is set. However, the re-check only affects the AurigaSentry DB, not the status of the IcM ticket. So if the DRI finds that the instance in the DB is marked as Success and IsResolved, the DRI can mitigate the IcM ticket, as the issue is actually resolved.
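To find instances that failed earlier but have since recovered, and whose IcM tickets can therefore be mitigated, a filter on the two fields described above is usually enough. The sketch below assumes that CheckStatus is stored as text and IsResolved as a bit; adjust the literals if the columns use other types.

```sql
-- Assumption: CheckStatus holds the text 'Success' / 'Fail' and IsResolved
-- is a bit column; verify the column types before relying on these filters.
SELECT TOP(100) *
FROM dbo.SLAMonitorInstances
WHERE CheckStatus = 'Success'
  AND IsResolved = 1
ORDER BY MonitorInstanceID DESC;

-- Same check for the Cosmos Job Fail monitor instances.
SELECT TOP(100) *
FROM dbo.AvailabilityMonitorInstances
WHERE CheckStatus = 'Success'
  AND IsResolved = 1
ORDER BY MonitorInstanceID DESC;
```

Instances that succeeded on the first check never fire an IcM ticket, so the IsResolved filter is what narrows the result to recovered failures.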