Adding Metrics section to capabilities in understanding domain #1068

Open · wants to merge 9 commits into base: features/mslearn
16 changes: 16 additions & 0 deletions docs-mslearn/framework/understand/allocation.md
@@ -125,6 +125,22 @@ At this point, you have an allocation strategy with detailed cloud management an

<br>

## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness and completeness of your allocation strategy.

| KPI | Definition | Formula |
|--------------|----------------|---------|
| Cost allocated | Evaluates the extent to which cloud costs are allocated among organizational units. | Percentage of cloud cost allocated. |

Collaborator:

Make sure the formulas are actual formulas. This applies to all KPIs.

Also, avoid the use of "cloud". We need to remove that everywhere in the next Framework update, so removing it here now will make that update easier.

Suggested change:
- | Cost allocated | Evaluates the extent to which cloud costs are allocated among organizational units. | Percentage of cloud cost allocated. |
+ | Cost allocated | Evaluates the extent to which costs are allocated among organizational units. | {Allocated cost amount} / {Total cost} * 100 |

| Allocation granularity | Assesses the level of detail in cost allocation, from department to project scope. | Percentage of cost allocation defined across various scope levels (department, subscription, resource group, project, application). |

Collaborator:

What value is this supposed to be? Granularity sounds like an attribute and not a metric.

| Unallocated cloud costs | Measures the percentage of cloud costs that are not allocated to any specific project, team, or department. | Percentage of unallocated cloud costs. |

Collaborator:

nit: Do we need both this and allocated %? I'm not opposed to having both, but is it useful to have 83% next to 17%? Seeing one makes it obvious what the other is. Not sure which is more important, though. Maybe we want to describe both to let them choose. If that's the case, we might also want to include a justification column for why each metric is important. Thoughts?

| Allocation tagging strategy | Evaluates the implementation of a tagging strategy for cost allocation for each workload or business unit. | Percentage of cost allocation tagging strategy defined and implemented for each workload or business unit, and the percentage of untagged resources and associated costs. |

Collaborator:

Is this already implicitly covered by the allocated %?

| Tagging policy compliance | Measures compliance with the organizational tagging policy for cloud resources. | Percentage of cloud resources that are compliant with the organization's allocation tagging strategy. |

Collaborator:

Would it be useful to have a goal defined for each metric?

| Ownership coverage | Measures the extent to which ownership is defined for all resources. | Percentage of resources with resource owners defined. |

Collaborator:

Is this covered by tag compliance already? I suppose ownership could be defined externally and measured independently. Is this useful when it is covered by tags? Maybe it is. Just thinking out loud.

| Shared resources | Measures the identification of shared resources and the allotted cost distribution. | Percentage of shared resources identified and allocation distribution defined. |

Collaborator:

What are they doing with this number? How does it help them?
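
To ground the formulas above, here is a minimal Python sketch of how the cost-allocated and unallocated-cost KPIs could be computed from tagged cost records. The record shape and the `team` tag are illustrative assumptions, not anything defined in these docs:

```python
# Minimal sketch of the "Cost allocated" KPI:
# {Allocated cost amount} / {Total cost} * 100.
# The record shape and the "team" tag are illustrative assumptions.
cost_records = [
    {"resource": "vm-app-01", "cost": 120.0, "tags": {"team": "retail"}},
    {"resource": "sql-shared", "cost": 300.0, "tags": {}},  # no owner tag
    {"resource": "vm-app-02", "cost": 80.0, "tags": {"team": "logistics"}},
]

total_cost = sum(r["cost"] for r in cost_records)
allocated_cost = sum(r["cost"] for r in cost_records if r["tags"].get("team"))

pct_allocated = allocated_cost / total_cost * 100 if total_cost else 0.0
pct_unallocated = 100 - pct_allocated  # the complementary "Unallocated costs" KPI

print(f"Cost allocated: {pct_allocated:.1f}% (unallocated: {pct_unallocated:.1f}%)")
```

Because the two percentages are complements, computing either one yields the other, which is the crux of the reviewer's question about keeping both.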


<br>

## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Allocation capability](https://www.finops.org/framework/capabilities/allocation/) article in the FinOps Framework documentation.
16 changes: 16 additions & 0 deletions docs-mslearn/framework/understand/anomalies.md
@@ -67,6 +67,22 @@ At this point, you have automated alerts configured and ideally views and report

<br>

## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness and completeness of your anomaly management approach.

| KPI | Definition | Formula |
|--------------|----------------|---------|
| Anomaly alert coverage | Measures the extent to which anomaly alerts are enabled across all workloads. | Percentage of workloads/subscriptions with anomaly alerts enabled. |

Collaborator:

Given we have a bicep module for scheduled actions, should we call that out as a way to drive this number up? Not sure if we want that in the KPI section or elsewhere. It might be nice to point out when we have tools to facilitate improving a KPI. Of course, if we do that, it probably wouldn't work as a table.

| Time to alert awareness | Measures the average time taken from the occurrence of an anomaly to the alert being raised and the resource owner being made aware. | Average length of time from anomaly detection to alert/resource owner awareness. |
| Time to anomaly remediation | Measures the average time taken from the occurrence of an anomaly to its remediation. | Average length of time from anomaly detection to remediation. |
| Unresolved anomalies | Measures the number and duration of unresolved anomalies. | Quantity and duration of unresolved anomalies. |

Collaborator:

Should this be in a specified time period?

This reminds me of SMART goals. Specific, measurable, achievable, relevant, and time-based. We should probably make sure each metric factors in these principles.

| Forecasted unnecessary cloud spend | Measures the amount of forecasted unnecessary cloud spend if the anomaly was not detected for the billing period. | Amount of forecasted unnecessary cloud spend if anomaly was not detected for the billing period. |

Collaborator:

This isn't completely clear. Is this about quantifying cost avoidance or avoided waste?

| Proactive anomaly alerts | Measures the number of planned anomalies that were not proactively alerted to all core personas involved over a period. | Number of planned anomalies not proactively alerted to all core personas involved over a period. |

Collaborator:

I'm not sure what this is either.

| Anomaly detection accuracy | Measures the number of false positive and false negative anomaly alerts. | Number of false positives and false negatives. |

Collaborator:

Should this be a number or a percentage? Although maybe false negatives should be separate from false positives since you'll never know all the missed anomalies, so it's not possible to calculate a percentage there. Maybe those should be split into separate KPIs.
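
As a concrete illustration of the time-based KPIs in this table, here is a minimal Python sketch. The anomaly record shape, with detection, alert, and resolution timestamps, is an assumption for illustration; real values would come from your alerting tooling:

```python
# Minimal sketch of "Time to alert awareness", "Time to anomaly remediation",
# and "Unresolved anomalies". The record shape is an illustrative assumption.
from datetime import datetime

anomalies = [
    {"detected": datetime(2024, 12, 1, 8), "alerted": datetime(2024, 12, 1, 9), "resolved": datetime(2024, 12, 2, 10)},
    {"detected": datetime(2024, 12, 3, 14), "alerted": datetime(2024, 12, 3, 16), "resolved": None},  # still open
]

def avg_hours(deltas):
    """Average a list of timedeltas, expressed in hours."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600 if deltas else 0.0

time_to_alert = avg_hours([a["alerted"] - a["detected"] for a in anomalies if a["alerted"]])
time_to_remediate = avg_hours([a["resolved"] - a["detected"] for a in anomalies if a["resolved"]])
unresolved = [a for a in anomalies if a["resolved"] is None]

print(f"Avg time to alert: {time_to_alert:.1f} h")
print(f"Avg time to remediation: {time_to_remediate:.1f} h")
print(f"Unresolved anomalies: {len(unresolved)}")
```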


<br>

## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Anomaly management capability](https://www.finops.org/framework/capabilities/anomaly-management) article in the FinOps Framework documentation.
17 changes: 17 additions & 0 deletions docs-mslearn/framework/understand/ingestion.md
@@ -108,6 +108,23 @@ At this point, you have a data pipeline and are ingesting data into a central da

<br>

## KPIs and metrics

flanakin marked this conversation as resolved.
Consider the following key performance indicators (KPIs) to measure the health, effectiveness, and completeness of your FinOps data estate.

| KPI | Definition | Formula |
|----------|-----------|-----|
| Data completeness | Measures the extent to which all required data fields are present in the dataset and tracks the overall data completeness trend over a specified period. | Percentage of data fields that are complete and the overall data completeness over time. |

Collaborator:

How do they know what "complete" is?

Contributor (author):

Two aspects come to mind. The first is ensuring all rows and columns of data have values (no empty cells). The second is connecting with stakeholders to determine what is required for analysis and cross-checking that against what is currently ingested.

Suggested change:
- | Data completeness | Measures the extent to which all required data fields are present in the dataset and tracks the overall data completeness trend over a specified period. | Percentage of data fields that are complete and the overall data completeness over time. |
+ | Data completeness | Measures the extent to which all required data fields are present in the dataset. Ensures no empty cells in rows and columns, and aligns with stakeholder requirements for analysis. | Percentage of data fields that are nonempty and overall data ingestion against stakeholder requirements. |

| Data quality | Measures the percentage of successful data quality checks and the total number of data quality checks conducted within a specified period. | Number of data quality checks conducted and the percentage of successful data quality checks. |

Collaborator:

What are data quality checks? Not sure how they would measure this.

Contributor (author) @KevDLR, Dec 11, 2024:

After ingestion, checking the data to make sure all data fields are populated, values are in the expected range and consistent, the correct standards are followed, and there are no unnecessary duplicate records.

Suggested change:
- | Data quality | Measures the percentage of successful data quality checks and the total number of data quality checks conducted within a specified period. | Number of data quality checks conducted and the percentage of successful data quality checks. |
+ | Data quality | Measures the percentage of successful data quality checks by ensuring data accuracy, consistency, validity, and uniqueness. | Number of data quality checks conducted and the percentage of successful data quality checks. |

| Investigation time to resolution | Measures the time taken to investigate and resolve data quality or availability issues and tracks the trend of this resolution time over a specified period. | Mean time to investigate and resolve data quality or availability issues, and the trend over time. |
| Data ingestion frequency | Measures how often data is ingested into the system. | Number of data ingestion events per unit of time (daily, weekly, monthly, quarterly, annually). |

Collaborator:

How is this measured? What do they do with this information?

Contributor (author):

This would be based on the frequency that is selected for the cost exports to ensure it aligns with report refresh requirements.

Suggested change:
- | Data ingestion frequency | Measures how often data is ingested into the system. | Number of data ingestion events per unit of time (daily, weekly, monthly, quarterly, annually). |
+ | Data ingestion alignment | Measures how often cost and usage data is exported and ingested into the repository and its alignment with reporting refresh requirements. | Percentage of cost exports/ingestions that have the correct export frequency to ensure reports are up to date. |

| Data size | Measures the total volume of data ingested into the repository. | Total volume of data ingested into the repository. |

Collaborator:

This makes me wonder how they can do this using storage, PBI, and ADX. We should add those to the backlog if you don't have them already.

| Growth rate | Measures the rate at which the volume of data ingested is increasing over time. | Percentage increase of total data volume in repository per unit of time. |

Collaborator:

This is another good example of where we should add details about why it's important and what to do with it. This may not be obvious to everyone. It may also be useful to include the storage, compute, and networking costs associated with the ingestion process πŸ€” Compute would also be applicable in reporting and I'm not sure if we can differentiate, but something to consider.

| Ingestion latency | Measures the average time taken for data to be ingested into the repository. | Average time of data ingestion latency per dataset. |

Collaborator:

Some (or maybe all) of these seem like they are per dataset or data source. Should we call that out in any way? Should we have 2 lists? One for overall KPIs and one per dataset/source? πŸ€”

| Historical data availability | Measures the lookback period of data that is ingested and available for analysis. | Span of historical data ingested. |

Collaborator:

Is this really a KPI? This seems more like a configuration setting per data source. I like calling it out, but not sure it's an indication of "performance".

Contributor (author):

This may need to be adjusted to ensure the amount of historical data they have is aligned with business needs.

Suggested change:
- | Historical data availability | Measures the lookback period of data that is ingested and available for analysis. | Span of historical data ingested. |
+ | Historical data availability | Measures the lookback period of data that is ingested and ensures it meets analysis requirements. | Percentage of historical data ingested that meets the required lookback period for business needs. |
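
For the data completeness KPI discussed above, a minimal sketch could count nonempty cells across the columns stakeholders require. The row shape and required-column list are illustrative assumptions:

```python
# Minimal sketch of the "Data completeness" KPI: share of nonempty cells
# across required columns. Column names are illustrative assumptions.
rows = [
    {"date": "2024-12-01", "resource_id": "vm-01", "cost": 12.5, "team": "retail"},
    {"date": "2024-12-01", "resource_id": "sql-01", "cost": None, "team": ""},
]
required_columns = ["date", "resource_id", "cost", "team"]

total_cells = len(rows) * len(required_columns)
filled_cells = sum(
    1 for row in rows for col in required_columns
    if row.get(col) not in (None, "")
)

print(f"Data completeness: {filled_cells / total_cells * 100:.1f}%")
```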


<br>

## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Data ingestion capability](https://www.finops.org/framework/capabilities/data-ingestion/) article in the FinOps Framework documentation.
20 changes: 20 additions & 0 deletions docs-mslearn/framework/understand/reporting.md
@@ -129,6 +129,26 @@ At this point, you're likely utilizing the native reporting and analysis solutio

<br>

## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness, timeliness, and completeness of your FinOps reporting.

| KPI | Definition | Formula |
|--------------|----------------|---------|
| Reporting needs | Measures the identification of stakeholders and their reporting needs. | Percentage of stakeholders with defined reporting needs. |

Collaborator:

Is this achievable? For a large organization, it seems unlikely that a FinOps team would talk to everyone.

| Report coverage | Measures the number of teams with comprehensive reports available. | Number of teams with reports for all personas. |

Collaborator:

Not sure what this is measuring exactly.

| Report distribution | Measures the frequency and reach of distributed reports. | Frequency and reach of distributed reports. |

Collaborator:

This is making me question the term "report". Is this referring to an email? Microsoft generally refers to a preconfigured collection of visuals as a report, not an email. Should this be about alerts or email notifications?

| Investigative time | Measures the time required to analyze cloud usage and cost questions. | Average time to report on required cloud usage and costs details. |

Collaborator:

Is this achievable? How would they measure it?

| Tagging compliance | Measures the resource tagging compliance to facilitate accurate reporting and analytics. | Percentage of resources tagged and the compliance. |

Collaborator:

This doesn't belong here. Tagging is part of allocation.

| Spend awareness | Measures the awareness and accountability of cloud spend across all workloads and personas. | Percentage of personas receiving cloud usage and cost reports. |

Collaborator:

Is this achievable?

| Feedback pipelines | Evaluates feedback processes for stakeholders and core personas. | Automation capability to provide feedback on reports to the FinOps team. |

Collaborator:

I like what I think this is trying to cover: feedback on reports. If that's what it is, it probably needs to be reworded to be a little clearer. The KPI name should be a metric and the formula should be a formula. I'm wondering if there's a difference between new feedback in a period, resolved feedback in a period, and total active/unresolved feedback. We should also add this to the backlog as something for us to add to our reports.

| Adoption rate | Measures usage of the reporting and analytics tools. | Percentage of teams utilizing provided reports. |

Collaborator:

How would this be measured? Would DAU, WAU, and MAU be more achievable? I like the idea of tracking adoption, but that's just the first use and not the ongoing use. I wonder how many FinOps teams would really care to get into adoption, engagement, and retention. Although, the same could be asked about MAU, since it doesn't indicate success or give context on growth πŸ€”

| Data update frequency | Tracks how often report data is updated. | Time between data refreshes. |

Collaborator:

Wouldn't this belong in data ingestion?

| Data accuracy | Evaluates the accuracy of available reports. | Accuracy percentage of the reports. |

Collaborator:

Data ingestion? How is this measured?

| Report development | Measures the time to develop requested reports. | Average time to generate and provide a new or updated report to stakeholders. |

Collaborator:

How useful is this? Every report will have a different amount of complexity and, in theory, teams shouldn't be churning out new reports endlessly. I like the idea of it, but I suspect new reports would be rare once a team is established. They'll likely update existing ones more.
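
As a sketch of how the adoption-rate KPI in this table could be measured, assuming a usage log of report views per team (the log shape and team names are illustrative assumptions):

```python
# Minimal sketch of the "Adoption rate" KPI: share of teams that opened
# a report in the period. The usage log is an illustrative assumption.
from datetime import date

all_teams = {"retail", "logistics", "platform", "data"}
report_views = [
    {"team": "retail", "viewed": date(2024, 12, 2)},
    {"team": "platform", "viewed": date(2024, 12, 5)},
    {"team": "retail", "viewed": date(2024, 12, 9)},
]

active_teams = {v["team"] for v in report_views}
adoption_rate = len(active_teams) / len(all_teams) * 100

print(f"Adoption rate: {adoption_rate:.0f}% of teams used a report this period")
```

A per-period variant of the same count would give DAU/WAU/MAU-style figures, which the reviewer suggests may be more achievable than one-time adoption.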


<br>

## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Reporting and analytics capability](https://www.finops.org/framework/capabilities/reporting-analytics/) article in the FinOps Framework documentation.