Update Search Dataset Usage Metrics Data
Background
Same with COGS Metrics Data,We leverage SOTEL insights data as the source data to compute the Data Usage metrics values for MSAI Data Platform. There are two approaches for compute the Data Usage metrics values into our SQL database.
- (Live) Execute Azure Data Factory pipeline and hook the SOTEL insights data from kusto.
- (Manual) Execute SQL script against the Cosmos Resource Management database.
Data Usage Metadata Schema
Column Name | Data Type | Description | Sample Value |
---|---|---|---|
ID | int | Auto-increment field | 1 |
StoragePlatform | varchar(10) | Source data classification | Cosmos |
ClusterName | varchar(200) | The cluster name of the streamset. | cosmos14-prod-cy2 |
VCName | varchar(200) | The VC name of the streamset. | exchange.storage.prod |
StreamSetNormalizedPath | varchar(500) | The normalized path of the streamset. | /local/Aggregated/Datasets/Public/Usage/Impression/LogicalSearchImpressionHourly/Mso/Search/%n/%n/%n/MsoLogicalSearchImpressions_%n-%n-%n.ss |
StreamFolderPath | varchar(500) | The normalized path prefix of the streamset. | /local/Aggregated/Datasets/Public/Usage/Impression/LogicalSearchImpressionHourly/Mso/Search/ |
JobClusterName | varchar(200) | The cluster name of the job, may be different from the StreamSetClusterName. | cosmos14-prod-cy2 |
JobVCName | varchar(200) | The VC name of the job, may be different from the StreamSetVCName. | exchange.storage.prod |
JobNormalizedName | varchar(200) | The normalized name of the job. | AnaExp_Others_Job%n_Run%n__Pri_%n.%n_lsrep_TrigFilt_OneOffJob_%n_%n_%nn%n_sssUsageQualityOfServiceMetric.SearchQualityOfService.AB.Standard_%n-%n-%nT%n_%n-%n-%nT%n |
SubmittedBy | varchar(200) | The submittor of the job. | CO4AEAP000079D2 |
JobCount | int | How many job instances this record contains. | 5 |
LatestAccessTime | datetime | The date on which the latest job instance accessed the streamset. | 2022-02-01 13:46:13.000 |
SnapshotTime | datetime | The snapshot time when capture this data usage metadata. | 2022-02-28 00:00:00.000 |
ConsumerEMAlias | varchar(200) | The alias of engineer manager under which this jobset is registered | yupan |
ConsumerJobContact | varchar(200) | The alias of contacts | 3sexpdri@microsoft.com |
ConsumerServiceId | varchar(500) | The id of service tree node under which this jobset is registered | a3374a5c-dd8c-4da4-99de-bdaeedbcdbd2 |
ConsumerServiceName | varchar(500) | The name of the service. | ExP Classic |
DataCategory | varchar(200) | DataCategory by StreamSetNormalizedPath | Impression |
DataSubCategory | varchar(200) | Data sub Category by StreamSetNormalizedPath | /LogicalSearchImpressionHourly |
Update Data Usage Metrics Data on ADF
Access Permission
Request eligibility CosmanDBReader on OSP Portal . We use AAD authentication for external users, when the eligibility is granted, your debug account will be added into corresponding AAD group.
Request join SOTELS Data Insight Users group on IDWeb.
Request Contributor on DMS-DatacenterManagement.
Torus Command
Request-AzureResourceRoleElevation -Role Contributor -SubscriptionId 8b8b2cf1-f7e3-4d86-8581-a15cd35463a4 -Reason "update ADF" -Duration 4
Configure Details Of Pipeline
Source Data | SOTELS Insights Data on Kusto interface: cluster('resourcemanagement.westus2.kusto.windows.net ').database('prod') |
Target Data Source | msaidataplatformmetrics |
ADF Pipeline | DataUsageMetricsIngestion |
ADF Trigger | Monthly on 5th |
Steps in DataUsageMetricsIngestion Pipeline
1: Extact Data Usage from SOTELS Insight to SQL
a: Clear Unified Data Usage Metadata defined on temporary Table,Freshness of data when ensure doing aggregate calculations with Cosman Resources.
DELETE FROM SOTELInsightDataUsage
b: Extract Streamset usage metadata from SOTELs Insights source data. Transform the output data schema to the unified data usage metadata structure defined. (SOTELs Insights data just hosts the metadata in the last month window)
let msaidataServiceId = "c377de9c-e737-4f3f-bd9b-f2024bb57e74";
let ImpressionStreamSets = materialize(cluster('resourcemanagement.westus2.kusto.windows.net').database('prod').vw_NormalizedStreamMetaData()
| where VcNameLower =~ "exchange.storage.prod" and FullClusterName =~ "cosmos14-prod-cy2"
| where NormalizedLocalPath startswith "/local/Aggregated/Datasets/Public/Usage/Impression/"
| where ServiceId =~ msaidataServiceId
| project-rename StreamSetOwnerService = ServiceId
| order by LatestCreateTime desc);
let SessionStreamSets = materialize(cluster('resourcemanagement.westus2.kusto.windows.net').database('prod').vw_NormalizedStreamMetaData()
| where VcNameLower =~ "exchange.storage.prod" and FullClusterName =~ "cosmos14-prod-cy2"
| where NormalizedLocalPath startswith "/local/Aggregated/Datasets/Public/Usage/Session/"
| where ServiceId =~ msaidataServiceId
| project-rename StreamSetOwnerService = ServiceId
| order by LatestCreateTime desc);
let ImpressionUsage = materialize(cluster('resourcemanagement.westus2.kusto.windows.net').database('prod').vw_LineageMetaData()
| where StreamClusterName =~ "cosmos14-prod-cy2" and StreamVcName =~ "exchange.storage.prod"
| where AccessType =~ "I" // Just filter out the input streamset info
| join kind = inner ImpressionStreamSets on $left.StreamName == $right.NormalizedURI);
//| project SubmittedBy,NormalizedNames
let SessionUsage = materialize(cluster('resourcemanagement.westus2.kusto.windows.net').database('prod').vw_LineageMetaData()
| where StreamClusterName =~ "cosmos14-prod-cy2" and StreamVcName =~ "exchange.storage.prod"
| where AccessType =~ "I" // Just filter out the input streamset info
| join kind = inner SessionStreamSets on $left.StreamName == $right.NormalizedURI);
ImpressionUsage
| union SessionUsage
| extend JobCount = toint(SuccessJobCount) + toint(iff(isempty(CancelledJobCount), 0, CancelledJobCount)) + toint(iff(isempty(FailedJobCount), 0, FailedJobCount))
| extend LatestAccessMonth = startofmonth(LatestInstanceEndTime)
| extend StreamFolderPath = substring(NormalizedLocalPath,0,indexof(NormalizedLocalPath,"#"))
| extend DataCategory = substring(NormalizedLocalPath,40,indexof(substring(NormalizedLocalPath,40),"/"))
//| extend DataSubCategory1 = split(NormalizedLocalPath,"/",7)
| extend DataSubCategory = strcat("/",substring(substring(substring(substring(NormalizedLocalPath,45),1,indexof_regex(substring(NormalizedLocalPath,45),"#")),indexof(substring(substring(NormalizedLocalPath,45),1,indexof_regex(substring(NormalizedLocalPath,45),"#")),"/")+1),
0,indexof(substring(substring(substring(NormalizedLocalPath,45),1,indexof_regex(substring(NormalizedLocalPath,45),"#")),indexof(substring(substring(NormalizedLocalPath,45),1,indexof_regex(substring(NormalizedLocalPath,45),"#")),"/")+1),"/")))
| extend JobMainNormalizedName = split(NormalizedNames, "\t", 0)
| extend StreamSetNormalizedPath = replace_regex(NormalizedLocalPath, @'#', @'%n')
| project StoragePlatform = "Cosmos",StreamSetClusterName = FullClusterName1,StreamSetVcName = VcNameLower1,StreamSetNormalizedPath,StreamFolderPath,JobClusterName = FullClusterName,
JobVcNameLower = VcNameLower,JobNormalizedName = tostring(JobMainNormalizedName[0]),SubmittedBy,JobCount,LatestAccessDate = LatestInstanceEndTime, LatestAccessMonth,ConsumerEMOwner = DevOwner,
ConsumerJobContact = Contact,ConsumerServiceId = ServiceId,ConsumerServiceName = ServiceName,DataCategory,DataSubCategory
| where JobVcNameLower == "exchange.storage.prod"
| where LatestAccessMonth == datetime_add('month',-1,startofmonth(now()))
| order by JobCount desc
| distinct StoragePlatform,StreamSetClusterName,StreamSetVcName,StreamSetNormalizedPath,StreamFolderPath,JobClusterName,JobVcNameLower,JobNormalizedName,SubmittedBy,JobCount,
LatestAccessDate,LatestAccessMonth,ConsumerEMOwner,ConsumerJobContact,ConsumerServiceId,ConsumerServiceName,DataCategory,DataSubCategory
2: JOIN Cosman and SOTELS Insight Data Usage
Calculation and cache the Data Usage metadata from SOTELs Insights data in Kusto and the Data Usage Metadata from Cosman Resources,this will serve as our new data to support the future dashboard.
WITH SOTELAndCosmanConjoint AS (
SELECT DISTINCT
A.StoragePlatform,
A.ClusterName,
A.VCName,
A.StreamSetNormalizedPath,
A.StreamFolderPath,
A.JobClusterName,
A.JobVCName,
A.JobNormalizedName,
A.SubmittedBy,
A.JobCount + B.JobCount AS JobCount,
A.LatestAccessTime,
A.SnapshotTime,
A.ConsumerEMAlias,
A.ConsumerJobContact,
A.ConsumerServiceId,
A.ConsumerServiceName,
A.DataCategory,
A.DataSubCategory
FROM SOTELInsightDataUsage A
INNER JOIN (
SELECT StreamSetNormalizedPath,StreamFolderPath,JobNormalizedName,SubmittedBy,JobCount,
ConsumerEMAlias,ConsumerJobContact,ConsumerServiceId,ConsumerServiceName,SnapshotTime
FROM DataUsageMetadata
WHERE DATEDIFF(DAY,Dateadd(mm, Datediff(mm, 0, GETDATE())-1, 0),SnapshotTime) = 0
) B
ON A.JobNormalizedName = B.JobNormalizedName
AND A.StreamSetNormalizedPath = B.StreamSetNormalizedPath
AND A.ConsumerServiceId = B.ConsumerServiceId
AND A.StreamFolderPath = B.StreamFolderPath
),
CosmanUnique AS (
SELECT DISTINCT
B.StoragePlatform,
B.ClusterName,
B.VCName,
B.StreamSetNormalizedPath,
B.StreamFolderPath,
B.JobClusterName,
B.JobVCName,
B.JobNormalizedName,
B.SubmittedBy,
B.JobCount ,
B.LatestAccessTime,
B.SnapshotTime,
B.ConsumerEMAlias,
B.ConsumerJobContact,
B.ConsumerServiceId,
B.ConsumerServiceName,
B.DataCategory,
B.DataSubCategory
FROM SOTELInsightDataUsage A
RIGHT JOIN (
SELECT *
FROM DataUsageMetadata
WHERE DATEDIFF(DAY,Dateadd(mm, Datediff(mm, 0, GETDATE())-1, 0),SnapshotTime) = 0
) B
ON A.JobNormalizedName = B.JobNormalizedName
AND A.StreamSetNormalizedPath = B.StreamSetNormalizedPath
AND A.ConsumerServiceId = B.ConsumerServiceId
AND A.StreamFolderPath = B.StreamFolderPath
WHERE A.JobCount IS NULL
),
SOTELUnique AS (
SELECT DISTINCT
A.StoragePlatform,
A.ClusterName,
A.VCName,
A.StreamSetNormalizedPath,
A.StreamFolderPath,
A.JobClusterName,
A.JobVCName,
A.JobNormalizedName,
A.SubmittedBy,
A.JobCount,
A.LatestAccessTime,
A.SnapshotTime,
A.ConsumerEMAlias,
A.ConsumerJobContact,
A.ConsumerServiceId,
A.ConsumerServiceName,
A.DataCategory,
A.DataSubCategory
FROM SOTELInsightDataUsage A
LEFT JOIN (
SELECT StreamSetNormalizedPath,StreamFolderPath,JobNormalizedName,SubmittedBy,JobCount,
ConsumerEMAlias,ConsumerJobContact,ConsumerServiceId,ConsumerServiceName,SnapshotTime
FROM DataUsageMetadata
WHERE DATEDIFF(DAY,Dateadd(mm, Datediff(mm, 0, GETDATE())-1, 0),SnapshotTime) = 0
) B
ON A.JobNormalizedName = B.JobNormalizedName
AND A.StreamSetNormalizedPath = B.StreamSetNormalizedPath
AND A.ConsumerServiceId = B.ConsumerServiceId
AND A.StreamFolderPath = B.StreamFolderPath
WHERE B.JobCount IS NULL
)
SELECT * FROM SOTELAndCosmanConjoint
UNION
SELECT * FROM CosmanUnique
UNION
SELECT * FROM SOTELUnique
ORDER BY JobCount DESC
Update Data Usage Metrics Data with COSMAN Resource Data
This is Legacy's calculation and data input method.
Access Permission
- Request eligibility CosmanDBReader on OSP Portal . We use AAD authentication for external users, when the eligibility is granted, your debug account will be added into corresponding AAD group.
- Once the request is approved, use torus debug account to connect Cosman DB:
Example to connect with Microsoft SQL Server Management Studio (SSMS):
Sign in : | Options: |
---|---|
Query the latest data of Data Usage
Execute this SQL Query against the ResourceManagementDatabase. The output is a aggregated-value since the streamset registered.
--MSAI Unified Data Platform Service Tree Id
DECLARE @TargetServiceId CHAR(36);
SET @TargetServiceId = 'c377de9c-e737-4f3f-bd9b-f2024bb57e74';
WITH MSAIDataPlatformStreamSetRegistration AS (
SELECT DISTINCT
A.ClusterName,
A.VCName,
A.NormalizedPath,
B.NormalizedPath AS StreamFolderPath
FROM dbo.CosmosStorageInfo A
INNER JOIN dbo.CosmosStreamSetRegistration B
ON A.RegistrationId = B.Id
WHERE B.ServiceId = @TargetServiceId and StreamSetType = 2)
SELECT
'Cosmos' AS StoragePlatform,
A.ClusterName,
A.VCName,
A.NormalizedPath AS StreamSetNormalizedPath,
A.StreamFolderPath,
B.JobClusterName,
B.JobVCName,
B.NormalizedName AS JobNormalizedName,
B.SubmittedBy,
B.JobCount,
B.LatestAccessTime,
Dateadd(mm, Datediff(mm, 0, GETDATE()), 0) AS SnapshotTime,
C.EMAlias AS ConsumerEMAlias,
C.Contact AS ConsumerJobContact,
C.ServiceId AS ConsumerServiceId,
D.ServiceName AS ConsumerServiceName,
CASE
WHEN A.NormalizedPath LIKE '%Session%' THEN 'Session'
ELSE 'Impression'
END
AS DataCategory,
'/'+LEFT(SUBSTRING(RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44),CHARINDEX('/',RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44))+1,LEN(RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44))),
CHARINDEX('/',SUBSTRING(RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44),CHARINDEX('/',RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44))+1,LEN(RIGHT(A.NormalizedPath ,LEN(A.NormalizedPath) -44))))-1) AS DataSubCategory
FROM MSAIDataPlatformStreamSetRegistration A
INNER JOIN dbo.CosmosAccessInfo B
ON A.NormalizedPath = B.NormalizedPath AND A.VCName = B.StreamSetVCName
INNER JOIN dbo.CosmosJobInfo C
ON B.NormalizedName = C.NormalizedName AND B.JobVCName = C.VCName
INNER JOIN dbo.ServiceTreeSnapshot D
ON C.ServiceId = D.ServiceId
WHERE B.AccessType = 'I' and A.NormalizedPath like '/local/Aggregated/Datasets/%'
and FORMAT(B.LatestAccessTime, 'yyyy-MM') = '2022-04'
ORDER BY LatestAccessTime DESC
Manually insert new data for DataUsageMetadata
Example to connect with Microsoft SQL Server Management Studio (SSMS):
Attribute | Information |
---|---|
Server Type | Database Engine |
Server Name | msai-datavolumepipeline.database.windows.net |
Authentication | SQL server authentication |
User Name | datavolumepipeline |
Database | MSAIDataPlatformMetrics |
Delete the data of the current month contained in the original DataUsageMetadata
DELETE FROM DataUsageMetadata WHERE MONTH(SnapshotTime) = 04
Copy the queried data to CSV file and import new data through Navicat.
Note: Columns LatestAccessTime and SnapshotTime in the CSV file need to be formatted as "yyyy-mm-dd hh:mm:ss"