[DRI][TSG][OLM][Pipeline `FVL MSIT NAM` under `O365CCOIVICPRODNAM` is unhealthy][How to monitor the job status]

Last updated by Siyao Jiang on 7/22/2022 3:39:51 AM

The M365 ADF pipeline health monitor raises an alert when an ADF pipeline run fails: it marks the pipeline as unhealthy and sends a severity 3 incident to IcM.

1. Find the pipeline name and environment in the title of the incident.

In this sample, the pipeline name is FVL MSIT NAM and the ADF is OIVIC PROD NAM (shown as O365CCOIVICPRODNAM in the title).

ADF incident
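The title lookup in step 1 can be sketched as a small parser. This is a minimal sketch, assuming every incident title follows the shape of the sample title in this TSG; the function name `parse_incident_title` and the regular expression are illustrative and not part of any existing tooling.

```python
import re

# Assumed title shape, based only on the sample incident in this TSG:
#   Pipeline `<pipeline name>` under `<ADF name>` is unhealthy
# The backticks are optional because some views drop them.
TITLE_PATTERN = re.compile(
    r"Pipeline `?(?P<pipeline>[^`\]]+?)`? under `?(?P<adf>[^`\s\]]+)`? is unhealthy"
)


def parse_incident_title(title: str) -> dict:
    """Return the pipeline name and ADF name found in an incident title."""
    match = TITLE_PATTERN.search(title)
    if match is None:
        raise ValueError(f"Unrecognized incident title: {title!r}")
    return {
        "pipeline": match.group("pipeline").strip(),
        "adf": match.group("adf"),
    }
```

The ADF name comes back exactly as written in the title (for example O365CCOIVICPRODNAM); matching it to a row of the environment table in step 2 is still a manual step.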

2. Enter the ADF Portal

Name	Subscription Id	Data Factory	ADF Portal link
ODIN DEV	9ce40ff0-cb61-4fd0-8a84-63a847f44520	adfwqjdcihnbcbrm	DEV
OIVIC PPE NAM	4622f018-2e49-490f-b462-1b990f549058	adfb4b03ae1nam	PPE
OIVIC PROD NAM	4622f018-2e49-490f-b462-1b990f549058	adf6db9e51dnam	PROD

Before opening the link for the PPE or PROD environment, first request read-only permission with the following commands, then open the ADF Portal link above.

Set-MyTeam oivic
Request-AzureActiveDirectoryElevation.ps1 -GroupName oivic-JIT-M365DataEngineerDebugger
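The environment table and the permission rule in step 2 can be captured as a lookup. This is a minimal sketch: the values are copied from the table above, while the `needs_elevation` flag for ODIN DEV is an assumption drawn from the fact that this TSG only mentions elevation for PPE and PROD. The names `AdfEnvironment` and `access_checklist` are hypothetical.

```python
from typing import List, NamedTuple


class AdfEnvironment(NamedTuple):
    subscription_id: str
    data_factory: str
    needs_elevation: bool  # assumption: only PPE and PROD need read-only elevation


# Values copied from the table in step 2.
ENVIRONMENTS = {
    "ODIN DEV": AdfEnvironment(
        "9ce40ff0-cb61-4fd0-8a84-63a847f44520", "adfwqjdcihnbcbrm", False),
    "OIVIC PPE NAM": AdfEnvironment(
        "4622f018-2e49-490f-b462-1b990f549058", "adfb4b03ae1nam", True),
    "OIVIC PROD NAM": AdfEnvironment(
        "4622f018-2e49-490f-b462-1b990f549058", "adf6db9e51dnam", True),
}

READ_ONLY_GROUP = "oivic-JIT-M365DataEngineerDebugger"


def access_checklist(environment_name: str) -> List[str]:
    """List the commands to run before opening the ADF Portal for an environment."""
    env = ENVIRONMENTS[environment_name]
    if not env.needs_elevation:
        return []
    return [
        "Set-MyTeam oivic",
        f"Request-AzureActiveDirectoryElevation.ps1 -GroupName {READ_ONLY_GROUP}",
    ]
```

The function only returns the command strings for the DRI to run; it does not request the elevation itself.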

3. Find the failed pipeline run that triggered the alert.

In ADF, click Monitor (the third icon in the left bar), click Trigger runs, then click the pipeline count on the trigger run that matches the incident. Select the failed pipeline run.

ADF trigger runs
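If the pipeline runs have been exported as records instead of being browsed in the Monitor view, the selection in step 3 can be sketched as a filter. This is a sketch under assumptions: it expects dictionaries with the `pipelineName`, `status`, `runId` and `runStart` fields used by ADF pipeline run records, and how those records are obtained is outside the scope of this TSG. The function name `failed_runs` is hypothetical.

```python
def failed_runs(runs, pipeline_name):
    """Return the failed runs of one pipeline, newest first.

    `runs` is a list of dicts with `pipelineName`, `status` and `runStart`
    keys. `runStart` is assumed to be an ISO 8601 string in a single time
    zone, so plain string ordering matches chronological ordering.
    """
    matches = [
        run for run in runs
        if run.get("pipelineName") == pipeline_name and run.get("status") == "Failed"
    ]
    return sorted(matches, key=lambda run: run.get("runStart", ""), reverse=True)
```

The first element of the result is the most recent failure, which is usually the run that raised the incident.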

4. Click the output icon on the failed activity, then click through to the component pipeline.

Repeat this until you reach the error log. Note the application ID (red line) and the cluster name (blue line).

ADF failed activity ADF failed id
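When the activity output is long, the application ID from step 4 can be pulled out of the copied text with a pattern match. This is a minimal sketch that relies only on the standard YARN ID format `application_<timestamp>_<sequence>`; the layout of the surrounding error text is not assumed, and the function name `find_application_ids` is hypothetical.

```python
import re

# YARN application IDs have the form application_<timestamp>_<sequence number>.
APPLICATION_ID = re.compile(r"application_\d+_\d+")


def find_application_ids(error_text):
    """Return the distinct YARN application IDs in the text, in order of appearance."""
    seen = []
    for app_id in APPLICATION_ID.findall(error_text):
        if app_id not in seen:
            seen.append(app_id)
    return seen
```

The cluster name is not extracted here because its format is only known from the single example in step 5.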

5. Go to the YARN UI at prod.lighthouse.office.net/yarnui/?clusterName=hdi7d009cb7nam to find the application. (Replace the cluster name hdi7d009cb7nam in the link with your own HDI cluster name.)

Open the application, then scroll down and click Logs.

ADF yarn ui ADF yarn log
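Substituting the cluster name into the YARN UI link from step 5 can be sketched as follows. The host and path are taken from the link above; the `https` scheme is an assumption, since the link in this TSG is written without one, and the function name `yarn_ui_url` is hypothetical.

```python
from urllib.parse import urlencode

# Host and path as written in step 5.
YARN_UI_BASE = "prod.lighthouse.office.net/yarnui/"


def yarn_ui_url(cluster_name, scheme="https"):
    """Build the YARN UI link for one HDI cluster (scheme is an assumption)."""
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    query = urlencode({"clusterName": cluster_name})
    return f"{scheme}://{YARN_UI_BASE}?{query}"
```

For the sample cluster, `yarn_ui_url("hdi7d009cb7nam")` returns `https://prod.lighthouse.office.net/yarnui/?clusterName=hdi7d009cb7nam`.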

6. Find the cause of the error in the log. Then try to identify the root cause.

In this sample, the SIGS data was not copied correctly, so the script could not find the data store path. For other common errors, please refer to the wiki.

ADF yarn error
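For a long YARN log, the search in step 6 can be narrowed with a keyword scan over the downloaded text. This is a rough sketch: the default keywords are assumptions and not a list taken from the wiki of common errors, so adjust them to the failure being investigated. The function name `find_error_lines` is hypothetical.

```python
def find_error_lines(log_text, keywords=("ERROR", "Exception", "Traceback"), context=2):
    """Return (line_number, snippet) pairs for lines that contain any keyword.

    Line numbers are 1-based. Each snippet includes up to `context` lines
    before and after the matching line.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if any(keyword in line for keyword in keywords):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits
```

The scan only locates candidate lines; deciding which one is the root cause still requires reading the surrounding log.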

7. Mitigate the incident after figuring out the root cause:

a) If the Data Copy activity failed, check whether the source data payload is ready by following this wiki. If it is not ready, contact the source data owner. If it is ready, check whether there is an announcement in this Teams channel; if there is none, create an IcM ticket for the Substrate Intelligence - OfflineAIBLR team (see the Sample Ticket).
b) If payload registration failed, contact Vinit (vinittiwari) and Bhava (bhavatarini.mp).
c) If it is a false alarm or a transient issue, such as an input file that was not ready or an unstable linked service, request operator permission with the following commands, then re-trigger the job:

Set-MyTeam oivic
Request-AzureActiveDirectoryElevation.ps1 -GroupName oivic-JIT-M365DataEngineerOperator

d) If this does not mitigate the problem, go to Manage -> Linked services -> Global parameters to check the build version:

ADF manage

Then go to the build pipeline to find the latest version and the previous version, and identify the commits between them.

ADF build pipeline ADF commit

If the code changes in those commits match the error log, roll back to the previous version. Go to the build pipeline link and get the version and drop URL.

oivic version drop

Run the following commands in DMS to deploy:

Set-MyTeam oivic
# Drop URL and version of the previous build, taken from the build pipeline link
$vstsDropUrl = "{drop url}"
$ev2RootPath = "/target/distrib/deploy/PRODnam/OIVICCuration/eastus2"
$version = "{version}"
$serviceName = 'oivicEuclidPartner'
$environmentName = 'Production'
# Start the Ev2 deployment of that version
Start-GriffinAzureEv2Deployment.ps1 -TenantName "MARS" -ServiceName $serviceName -Ev2ArtifactsRootPath $ev2RootPath -Ev2RolloutSpec "RolloutSpec.json" -VstsDropUrl $vstsDropUrl -EnvironmentName $environmentName -DeploymentVersion $version

e) Otherwise, please contact OLM developers (msaidataolmcrew@microsoft.com) to fix the problem.
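The mitigation choices a) to e) can be summarized as a small decision function. This is a sketch of one reading of step 7, not an official escalation policy: the `Cause` names and the `mitigation` function are hypothetical, and the branch for an existing Teams announcement assumes that an announcement means the issue is already known.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    DATA_COPY_FAILED = "a"
    PAYLOAD_REGISTRATION_FAILED = "b"
    TRANSIENT = "c"
    BAD_DEPLOYMENT = "d"
    UNKNOWN = "e"


def mitigation(cause: Cause, payload_ready: Optional[bool] = None,
               announcement_posted: bool = False) -> str:
    """Return the action from step 7 that matches the diagnosed cause."""
    if cause is Cause.DATA_COPY_FAILED:
        if payload_ready is None:
            raise ValueError("Check the source data payload first.")
        if not payload_ready:
            return "Contact the source data owner."
        if announcement_posted:
            # Assumption: an announcement means the issue is already known.
            return "Follow the announcement in the Teams channel."
        return "Create an IcM ticket for the Substrate Intelligence - OfflineAIBLR team."
    if cause is Cause.PAYLOAD_REGISTRATION_FAILED:
        return "Contact Vinit (vinittiwari) and Bhava (bhavatarini.mp)."
    if cause is Cause.TRANSIENT:
        return ("Request operator permission (oivic-JIT-M365DataEngineerOperator) "
                "and re-trigger the job.")
    if cause is Cause.BAD_DEPLOYMENT:
        return "Roll back to the previous version with Start-GriffinAzureEv2Deployment.ps1."
    return "Contact OLM developers (msaidataolmcrew@microsoft.com)."
```

Keeping the Data Copy branch explicit about `payload_ready` forces the payload check to happen before anyone is contacted.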

Sample Incident

Incident ID	Severity	State	Start Time	Cause Analysis
283032442	3	MITIGATED	2022-01-13 18:52 CST
