[DRI][TSG][OLM][Pipeline FVL MSIT NAM under O365CCOIVICPRODNAM is unhealthy][How to monitor the job status]
The M365 ADF pipeline health monitor raises an alert when an ADF pipeline run fails: it marks the pipeline as unhealthy and sends a severity 3 incident to IcM.
1. Find the pipeline name and environment in the title of the incident.
In this sample, the pipeline name is FVL MSIT NAM and the ADF is OIVIC PROD NAM.
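As a rough illustration of step 1, the pipeline name and the ADF identifier can be pulled out of an incident title with a regular expression. This is only a sketch: the title shape "Pipeline <name> under <ADF identifier> is unhealthy" is assumed from the single sample in this guide, and `parse_incident_title` is a hypothetical helper, not an existing tool. The raw identifier it returns (for example O365CCOIVICPRODNAM) still has to be matched by hand to a row in the table in step 2.

```python
import re

# Assumed title shape, inferred from the one sample title in this guide:
#   "Pipeline <pipeline name> under <ADF identifier> is unhealthy"
TITLE_PATTERN = re.compile(
    r"Pipeline\s+(?P<pipeline>.+?)\s+under\s+(?P<adf>\S+)\s+is unhealthy"
)


def parse_incident_title(title: str) -> dict:
    """Return the pipeline name and raw ADF identifier found in an incident title."""
    # Collapse line breaks and repeated spaces so wrapped titles still match.
    normalized = " ".join(title.split())
    match = TITLE_PATTERN.search(normalized)
    if match is None:
        raise ValueError("Title does not look like a pipeline health alert: " + normalized)
    return {"pipeline": match.group("pipeline"), "adf": match.group("adf")}


if __name__ == "__main__":
    sample = "Pipeline FVL MSIT NAM under O365CCOIVICPRODNAM is unhealthy"
    print(parse_incident_title(sample))
```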
2. Enter the ADF Portal
Name | Subscription Id | Data Factory | ADF Portal link |
---|---|---|---|
ODIN DEV | 9ce40ff0-cb61-4fd0-8a84-63a847f44520 | adfwqjdcihnbcbrm | DEV |
OIVIC PPE NAM | 4622f018-2e49-490f-b462-1b990f549058 | adfb4b03ae1nam | PPE |
OIVIC PROD NAM | 4622f018-2e49-490f-b462-1b990f549058 | adf6db9e51dnam | PROD |
Before opening the link for the PPE or PROD environment, first request read-only permission with the following commands, then open the corresponding ADF Portal link above.
Set-MyTeam oivic
Request-AzureActiveDirectoryElevation.ps1 -GroupName oivic-JIT-M365DataEngineerDebugger
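For scripting around step 2, the environment table can be kept as a small lookup. The sketch below copies the subscription IDs and data factory names from the table above. The `AdfEnvironment` type, the `lookup_adf` helper, and the `needs_read_elevation` flag are illustrative assumptions. The flag is set only for PPE and PROD because those are the only environments this guide mentions elevation for.

```python
from typing import NamedTuple


class AdfEnvironment(NamedTuple):
    subscription_id: str
    data_factory: str
    # Assumption: this guide only mentions read-only elevation for PPE and PROD.
    needs_read_elevation: bool


# Values copied from the environment table in step 2.
ADF_ENVIRONMENTS = {
    "ODIN DEV": AdfEnvironment("9ce40ff0-cb61-4fd0-8a84-63a847f44520", "adfwqjdcihnbcbrm", False),
    "OIVIC PPE NAM": AdfEnvironment("4622f018-2e49-490f-b462-1b990f549058", "adfb4b03ae1nam", True),
    "OIVIC PROD NAM": AdfEnvironment("4622f018-2e49-490f-b462-1b990f549058", "adf6db9e51dnam", True),
}


def lookup_adf(name: str) -> AdfEnvironment:
    """Look up an environment by name, ignoring case and extra whitespace."""
    key = " ".join(name.split()).upper()
    if key not in ADF_ENVIRONMENTS:
        raise KeyError("Unknown ADF environment: " + name)
    return ADF_ENVIRONMENTS[key]


if __name__ == "__main__":
    env = lookup_adf("OIVIC PROD NAM")
    print(env.data_factory, env.needs_read_elevation)
```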
3. Find the failed pipeline run that triggered the alert.
After entering the ADF, click Monitor (the third icon in the left bar), click Trigger runs, then click the number of pipelines in the relevant trigger run. Select the pipeline run that matches the incident.
4. Click the output icon on the failed activity, then click through to the component pipeline.
Repeat this until the error log appears. Note the application ID (red line) and the cluster name (blue line).
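If the error output is pasted into a script, the application ID can be picked out automatically. YARN application IDs have the form `application_<cluster timestamp>_<sequence number>`, which the sketch below relies on. The sample error text in it is made up purely for illustration, since the real error format is not shown in this guide, and `find_application_ids` is a hypothetical helper.

```python
import re
from typing import List

# YARN application IDs look like application_<cluster timestamp>_<sequence number>.
APPLICATION_ID = re.compile(r"application_\d+_\d+")


def find_application_ids(error_text: str) -> List[str]:
    """Return the distinct YARN application IDs in the text, in order of first appearance."""
    found: List[str] = []
    for app_id in APPLICATION_ID.findall(error_text):
        if app_id not in found:
            found.append(app_id)
    return found


if __name__ == "__main__":
    # Hypothetical error text; the real ADF activity output may look different.
    sample = "Job application_1642070000000_0123 failed; see application_1642070000000_0123"
    print(find_application_ids(sample))
```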
5. Go to the YARN UI at prod.lighthouse.office.net/yarnui/?clusterName=hdi7d009cb7nam to find the application. (Replace the cluster name hdi7d009cb7nam in the link with your own HDI cluster name.)
Open the application, then scroll down and click Logs.
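The link substitution in step 5 can be sketched as a small helper that fills the cluster name into the YARN UI link. The link template is copied from this guide as written, with no scheme added. `yarn_ui_link` is a hypothetical helper name. The cluster name is percent-encoded as a precaution, which leaves a plain name such as hdi7d009cb7nam unchanged.

```python
from urllib.parse import quote

# Link template copied from step 5; only the cluster name varies.
YARN_UI_TEMPLATE = "prod.lighthouse.office.net/yarnui/?clusterName={cluster}"


def yarn_ui_link(cluster_name: str) -> str:
    """Build the YARN UI link for the given HDI cluster name."""
    cluster = cluster_name.strip()
    if not cluster:
        raise ValueError("Cluster name must not be empty")
    # quote() leaves plain alphanumeric names unchanged.
    return YARN_UI_TEMPLATE.format(cluster=quote(cluster, safe=""))


if __name__ == "__main__":
    print(yarn_ui_link("hdi7d009cb7nam"))
```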
6. Find the cause of the error in the log. Then try to identify the root cause.
In this sample, the SIGS data was not copied correctly, so the script could not find the path of data store. For other common errors, please refer to the wiki.
7. Mitigate the incident after figuring out the root cause:
a) If the Data Copy activity failed, check whether the source data payload is ready by following this wiki. If it is not ready, contact the source data owner. If it is ready, check whether there is an announcement in this Teams channel; if there is none, create an IcM ticket for the Substrate Intelligence - OfflineAIBLR team (see Sample Ticket).
b) If payload registration failed, contact Vinit (vinittiwari) and Bhava (bhavatarini.mp).
c) If it is a false alarm or a transient issue, such as an input file that is not ready yet or an unstable linked service, request operator permission with the following commands and then re-trigger the job:
Set-MyTeam oivic
Request-AzureActiveDirectoryElevation.ps1 -GroupName oivic-JIT-M365DataEngineerOperator
d) If this does not mitigate the problem, go to Manage -> Linked services -> Global parameters to check the build version.
Then go to the pipeline to find the latest version and the previous version, and identify the commits between them.
If the code changes in those commits match the error log, roll back to the previous version: go to the build pipeline link and get the version and drop URL.
Run the following commands in DMS to deploy:
Set-MyTeam oivic
$vstsDropUrl = "{drop url}"
$ev2RootPath = "/target/distrib/deploy/PRODnam/OIVICCuration/eastus2"
$version = "{version}"
$serviceName = 'oivicEuclidPartner'
$environmentName = 'Production'
Start-GriffinAzureEv2Deployment.ps1 -TenantName "MARS" -ServiceName $serviceName -Ev2ArtifactsRootPath $ev2RootPath -Ev2RolloutSpec "RolloutSpec.json" -VstsDropUrl $vstsDropUrl -EnvironmentName $environmentName -DeploymentVersion $version
e) Otherwise, contact the OLM developers (msaidataolmcrew@microsoft.com) to fix the problem.
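The mitigation options a) to e) in step 7 can be read as a decision tree. The sketch below encodes one possible reading of it, to make the order of checks explicit. The `FailureKind` enum, the `next_action` function, and its flags are hypothetical names. The branch for a data copy failure that has already been announced in the Teams channel is an assumption, since this guide does not say what to do in that case.

```python
from enum import Enum


class FailureKind(Enum):
    DATA_COPY = "data copy activity failed"
    PAYLOAD_REGISTRATION = "payload registration failed"
    TRANSIENT = "false alarm or transient issue"
    OTHER = "anything else"


def next_action(
    kind: FailureKind,
    source_payload_ready: bool = False,
    announced_in_teams: bool = False,
    retrigger_failed: bool = False,
    commits_match_error: bool = False,
) -> str:
    """Suggest the next mitigation step, following options a) to e) in step 7."""
    if kind is FailureKind.DATA_COPY:  # option a)
        if not source_payload_ready:
            return "Contact the source data owner."
        if announced_in_teams:
            # Assumption: the guide does not cover this case explicitly.
            return "Follow the announcement in the Teams channel."
        return "Create an IcM ticket for the Substrate Intelligence - OfflineAIBLR team."
    if kind is FailureKind.PAYLOAD_REGISTRATION:  # option b)
        return "Contact Vinit (vinittiwari) and Bhava (bhavatarini.mp)."
    if kind is FailureKind.TRANSIENT and not retrigger_failed:  # option c)
        return "Request operator permission and re-trigger the job."
    if commits_match_error:  # option d)
        return "Roll back to the previous build version."
    return "Contact the OLM developers (msaidataolmcrew@microsoft.com)."  # option e)


if __name__ == "__main__":
    print(next_action(FailureKind.DATA_COPY, source_payload_ready=True))
```

Treating d) as the fallback after a failed re-trigger follows the wording "If this does not mitigate the problem"; other readings of that step are possible.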
Sample Incident
Incident ID | Severity | State | Start Time | Cause Analysis |
---|---|---|---|---|
283032442 | 3 | MITIGATED | 2022-01-13 18:52 CST |