Adding Metrics section to capabilities in understanding domain #1068
base: features/mslearn
Changes from all commits: 1633104, bf91fba, 98c0bc6, 7cfa67d, 2306b40, e778243, 18a84eb, c42e502, 4063db1
@@ -125,6 +125,22 @@ At this point, you have an allocation strategy with detailed cloud management an

<br>
## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness and completeness of your allocation strategy. A sketch of how the first two percentages might be computed follows the table.

| KPI | Definition | Formula |
|-----|------------|---------|
| Cost allocated | Evaluates the extent to which cloud costs are allocated among organizational units. | Percentage of cloud cost allocated. |
| Allocation granularity | Assesses the level of detail in cost allocation, from department to project scope. | Percentage of cost allocation defined across various scope levels (department, subscription, resource group, project, application). |
> **Review comment:** What value is this supposed to be? Granularity sounds like an attribute and not a metric.
| Unallocated cloud costs | Measures the percentage of cloud costs that are not allocated to any specific project, team, or department. | Percentage of unallocated cloud costs. |

> **Review comment:** Nit: do we need both this and the allocated percentage? I'm not opposed to having both, but is it useful to show 83% next to 17%? Seeing one makes the other obvious, and I'm not sure which is more important. Maybe we describe both and let readers choose. If that's the case, we might also want to include a justification column for why each metric is important. Thoughts?
| Allocation tagging strategy | Evaluates the implementation of a tagging strategy for cost allocation for each workload or business unit. | Percentage of cost allocation tagging strategy defined and implemented for each workload or business unit, and the percentage of untagged resources and associated costs. |

> **Review comment:** Is this already implicitly covered by the allocated percentage?
| Tagging policy compliance | Measures compliance with the organizational tagging policy for cloud resources. | Percentage of cloud resources that are compliant with the organization's allocation tagging strategy. |

> **Review comment:** Would it be useful to have a goal defined for each metric?
| Ownership coverage | Measures the extent to which ownership is defined for all resources. | Percentage of resources with resource owners defined. |

> **Review comment:** Is this covered by tag compliance already? I suppose ownership could be defined externally and measured independently. Is this useful when it's covered by tags? Maybe it is. Just thinking out loud.
| Shared resources | Measures the identification of shared resources and the allotted cost distribution. | Percentage of shared resources identified and allocation distribution defined. |

> **Review comment:** What are they doing with this number? How does it help them?
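A table alone can leave the arithmetic ambiguous, so here is a minimal sketch of the first two percentage KPIs, assuming a cost export loaded into pandas. The `Cost` and `CostCenter` columns are hypothetical; substitute whatever tag or column drives your allocation. It also shows why the allocated and unallocated percentages are complements, per the comment above.

```python
# Minimal sketch, not the toolkit's implementation. Column names (Cost,
# CostCenter) are hypothetical; use the tag/column your strategy relies on.
import pandas as pd

costs = pd.DataFrame({
    "ResourceId": ["vm-1", "vm-2", "sql-1", "app-1"],
    "Cost": [120.0, 80.0, 250.0, 50.0],
    "CostCenter": ["engineering", None, "finance", None],  # allocation tag
})

total = costs["Cost"].sum()
allocated = costs.loc[costs["CostCenter"].notna(), "Cost"].sum()

pct_allocated = allocated / total * 100  # "Cost allocated" KPI
pct_unallocated = 100 - pct_allocated    # complement: "Unallocated cloud costs"

print(f"Cost allocated: {pct_allocated:.1f}%")    # 74.0%
print(f"Unallocated:    {pct_unallocated:.1f}%")  # 26.0%
```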
<br>
## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Allocation capability](https://www.finops.org/framework/capabilities/allocation/) article in the FinOps Framework documentation.
@@ -67,6 +67,22 @@ At this point, you have automated alerts configured and ideally views and report

<br>
## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness and completeness of your anomaly management approach. A sketch of how the time-based KPIs might be computed follows the table.

| KPI | Definition | Formula |
|-----|------------|---------|
| Anomaly alert coverage | Measures the extent to which anomaly alerts are enabled across all workloads. | Percentage of workloads/subscriptions with anomaly alerts enabled. |
> **Review comment:** Given we have a Bicep module for scheduled actions, should we call that out as a way to drive this number up? Not sure if we want that in the KPI section or elsewhere. It might be nice to point out when we have tools to facilitate improving a KPI. Of course, if we do that, it probably wouldn't work as a table.
| Time to alert awareness | Measures the average time taken from the occurrence of an anomaly to the alert being raised and the resource owner being made aware. | Average length of time from anomaly detection to alert/resource owner awareness. |
| Time to anomaly remediation | Measures the average time taken from the occurrence of an anomaly to its remediation. | Average length of time from anomaly detection to remediation. |
| Unresolved anomalies | Measures the number and duration of unresolved anomalies. | Quantity and duration of unresolved anomalies. |
> **Review comment:** Should this be in a specified time period? This reminds me of SMART goals: specific, measurable, achievable, relevant, and time-based. We should probably make sure each metric factors in these principles.
| Forecasted unnecessary cloud spend | Measures the amount of forecasted unnecessary cloud spend if the anomaly was not detected for the billing period. | Amount of forecasted unnecessary cloud spend if the anomaly was not detected for the billing period. |

> **Review comment:** This isn't completely clear. Is this about quantifying cost avoidance or avoided waste?
| Proactive anomaly alerts | Measures the number of planned anomalies that were not proactively alerted to all core personas involved over a period. | Number of planned anomalies not proactively alerted to all core personas involved over a period. |

> **Review comment:** I'm not sure what this is either.
| Anomaly detection accuracy | Measures the number of false positive and false negative anomaly alerts. | Number of false positives and false negatives. |

> **Review comment:** Should this be a number or a percentage? Although maybe false negatives should be separate from false positives, since you'll never know all the missed anomalies, so it's not possible to calculate a percentage there. Maybe those should be split into separate KPIs.
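As a rough illustration of the time-based KPIs above, here is a minimal sketch computing mean time to remediation and counting unresolved anomalies from a hypothetical anomaly log; the field names are invented for the example.

```python
# Minimal sketch over a hypothetical anomaly log; field names are illustrative.
from datetime import datetime, timedelta

anomalies = [
    {"detected": datetime(2024, 5, 1), "remediated": datetime(2024, 5, 3)},
    {"detected": datetime(2024, 5, 10), "remediated": datetime(2024, 5, 11)},
    {"detected": datetime(2024, 5, 20), "remediated": None},  # still open
]

# "Time to anomaly remediation": mean detection-to-remediation duration
resolved = [a for a in anomalies if a["remediated"] is not None]
mean_ttr = sum(
    (a["remediated"] - a["detected"] for a in resolved), timedelta()
) / len(resolved)

# "Unresolved anomalies": count of anomalies with no remediation yet
unresolved = [a for a in anomalies if a["remediated"] is None]

print(f"Mean time to remediation: {mean_ttr}")     # 1 day, 12:00:00
print(f"Unresolved anomalies: {len(unresolved)}")  # 1
```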
<br>
## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Anomaly management capability](https://www.finops.org/framework/capabilities/anomaly-management) article in the FinOps Framework documentation.
@@ -108,6 +108,23 @@ At this point, you have a data pipeline and are ingesting data into a central da

<br>
## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the health, effectiveness, and completeness of your FinOps data estate. A sketch of one way to compute data completeness follows the table.
| KPI | Definition | Formula |
|-----|------------|---------|
| Data completeness | Measures the extent to which all required data fields are present in the dataset and tracks the overall data completeness trend over a specified period. | Percentage of data fields that are complete and the overall data completeness over time. |
> **Review comment:** How do they know what "complete" is?
>
> **Reply:** Two aspects come to mind: first, ensuring all rows and columns of data have values (no empty cells); second, connecting with stakeholders to determine what is required for analysis and crosschecking that against what is currently ingested.
| Data quality | Measures the percentage of successful data quality checks and the total number of data quality checks conducted within a specified period. | Number of data quality checks conducted and the percentage of successful data quality checks. |

> **Review comment:** What are data quality checks? Not sure how they would measure this.
>
> **Reply:** After ingestion, checking the data to make sure all data fields are populated, within the expected range, consistent, following the correct standards, and free of unnecessary duplicate records.
| Investigation time to resolution | Measures the time taken to investigate and resolve data quality or availability issues and tracks the trend of this resolution time over a specified period. | Mean time to investigate and resolve data quality or availability issues, and the trend over time. |
| Data ingestion frequency | Measures how often data is ingested into the system. | Number of data ingestion events per unit of time (daily, weekly, monthly, quarterly, annually). |
> **Review comment:** How is this measured? What do they do with this information?
>
> **Reply:** This would be based on the frequency selected for the cost exports, to ensure it aligns with report refresh requirements.
| Data size | Measures the total volume of data ingested into the repository. | Total volume of data ingested into the repository. |

> **Review comment:** This makes me wonder how they can do this using storage, PBI, and ADX. We should add those to the backlog if you don't have them already.
| Growth rate | Measures the rate at which the volume of data ingested is increasing over time. | Percentage increase of total data volume in the repository per unit of time. |

> **Review comment:** This is another good example of where we should add details about why it's important and what to do with it. This may not be obvious to everyone. It may also be useful to include the storage, compute, and networking costs associated with the ingestion process 🤔 Compute would also be applicable in reporting, and I'm not sure if we can differentiate, but something to consider.
| Ingestion latency | Measures the average time taken for data to be ingested into the repository. | Average time of data ingestion latency per dataset. |

> **Review comment:** Some (or maybe all) of these seem like they are per dataset or data source. Should we call that out in any way? Should we have two lists, one for overall KPIs and one per dataset/source? 🤔
| Historical data availability | Measures the lookback period of data that is ingested and available for analysis. | Span of historical data ingested. |

> **Review comment:** Is this really a KPI? This seems more like a configuration setting per data source. I like calling it out, but I'm not sure it's an indication of "performance".
>
> **Reply:** This may need to be adjusted so that it ensures the amount of historical data available is aligned to the business needs.
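Building on the reply above (completeness as no empty cells across the fields stakeholders require), here is a minimal sketch of the data completeness percentage. The column names and required-field list are hypothetical.

```python
# Minimal sketch: completeness as the share of non-empty cells across the
# fields stakeholders require. Columns and the required list are hypothetical.
import pandas as pd

rows = pd.DataFrame({
    "ResourceId": ["vm-1", "vm-2", "sql-1"],
    "Cost": [120.0, None, 250.0],
    "BillingPeriod": ["2024-05", "2024-05", None],
})

required_fields = ["ResourceId", "Cost", "BillingPeriod"]  # agreed with stakeholders
cells = rows[required_fields]
completeness = cells.notna().sum().sum() / cells.size * 100

print(f"Data completeness: {completeness:.1f}%")  # 7 of 9 cells -> 77.8%
```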
<br>
## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Data ingestion capability](https://www.finops.org/framework/capabilities/data-ingestion/) article in the FinOps Framework documentation.
@@ -129,6 +129,26 @@ At this point, you're likely utilizing the native reporting and analysis solutio

<br>
## KPIs and metrics

Consider the following key performance indicators (KPIs) to measure the effectiveness, timeliness, and completeness of your FinOps reporting. A sketch of one way to compute adoption follows the table.

| KPI | Definition | Formula |
|-----|------------|---------|
| Reporting needs | Measures the identification of stakeholders and their reporting needs. | Percentage of stakeholders with defined reporting needs. |
> **Review comment:** Is this achievable? For a large organization, it seems unlikely that a FinOps team would talk to everyone.
| Report coverage | Measures the number of teams with comprehensive reports available. | Number of teams with reports for all personas. |

> **Review comment:** Not sure what this is measuring exactly.
| Report distribution | Measures the frequency and reach of distributed reports. | Frequency and reach of distributed reports. |

> **Review comment:** This is making me question the term "report". Is this referring to an email? Microsoft generally refers to a preconfigured collection of visuals as a report, not an email. Should this be about alerts or email notifications?
| Investigative time | Measures the time required to analyze cloud usage and cost questions. | Average time to report on required cloud usage and cost details. |

> **Review comment:** Is this achievable? How would they measure it?
| Tagging compliance | Measures resource tagging compliance to facilitate accurate reporting and analytics. | Percentage of resources tagged and their compliance. |

> **Review comment:** This doesn't belong here. Tagging is part of allocation.
| Spend awareness | Measures the awareness and accountability of cloud spend across all workloads and personas. | Percentage of personas receiving cloud usage and cost reports. |

> **Review comment:** Is this achievable?
| Feedback pipelines | Evaluates feedback processes for stakeholders and core personas. | Automation capability to provide feedback on reports to the FinOps team. |

> **Review comment:** I like what I think this is trying to cover: feedback on reports. If that's what it is, it probably needs to be reworded to be a little clearer. The KPI name should be a metric and the formula should be a formula. I'm wondering if there's a difference between new feedback in a period, resolved feedback in a period, and total active/unresolved feedback. We should also add this to the backlog as something for us to add to our reports.
| Adoption rate | Measures usage of the reporting and analytics tools. | Percentage of teams utilizing provided reports. |

> **Review comment:** How would this be measured? Would DAU, WAU, and MAU be more achievable? I like the idea of tracking adoption, but that's just the first use and not the ongoing use. I wonder how many FinOps teams would really care to get into adoption, engagement, and retention. Although, the same could be asked about MAU, since it doesn't indicate success or give context on growth 🤔
| Data update frequency | Tracks how often report data is updated. | Time between data refreshes. |

> **Review comment:** Wouldn't this belong in data ingestion?
| Data accuracy | Evaluates the accuracy of available reports. | Accuracy percentage of the reports. |

> **Review comment:** Data ingestion? How is this measured?
| Report development | Measures the time to develop requested reports. | Average time to generate and provide a new or updated report to stakeholders. |

> **Review comment:** How useful is this? Every report will have a different amount of complexity and, in theory, teams shouldn't be churning out new reports endlessly. I like the idea of it, but I suspect new reports would be rare once a team is established; they'll likely update existing ones more.
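To make "Adoption rate" concrete, here is a minimal sketch counting teams that opened a provided report in the period. The team names and view counts are invented; in practice they would come from your reporting tool's usage or audit logs, which (per the DAU/WAU/MAU comment above) could support those measures too.

```python
# Minimal sketch; team names and view counts are illustrative stand-ins for
# usage/audit data from your reporting tool.
teams = ["platform", "data", "web", "mobile"]
report_views_last_30d = {"platform": 14, "data": 3, "web": 0, "mobile": 6}

# "Adoption rate": share of teams that used a provided report this period
active = [t for t in teams if report_views_last_30d.get(t, 0) > 0]
adoption_rate = len(active) / len(teams) * 100

print(f"Adoption rate: {adoption_rate:.0f}% of teams used reports")  # 75%
```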
<br>
## Learn more at the FinOps Foundation

This capability is a part of the FinOps Framework by the FinOps Foundation, a non-profit organization dedicated to advancing cloud cost management and optimization. For more information about FinOps, including useful playbooks, training and certification programs, and more, see the [Reporting and analytics capability](https://www.finops.org/framework/capabilities/reporting-analytics/) article in the FinOps Framework documentation.
> **Review comment:** Make sure the formulas are actual formulas. This applies to all KPIs. Also, avoid the use of "cloud". We need to remove that everywhere in the next Framework update, so that will make the next update easier.
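As one possible reading of "actual formulas", a percentage KPI could be written out explicitly. For example, a possible rendering of the "Cost allocated" formula (the wording below is a suggestion, not taken from the PR):

```latex
\text{Cost allocated (\%)} =
  \frac{\text{sum of costs mapped to an organizational unit}}
       {\text{total cost for the period}} \times 100
```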