Impact measurement was very complicated — before AI

One of impact measurement’s most persistent problems is comparability. Thousands of organizations track social, environmental and economic outcomes using different frameworks and different languages. That diversity is healthy for mission fit, but it makes aggregation and cross-portfolio analysis difficult.

For decades, we have chased standardization through taxonomies and frameworks. That work is useful, but differences in context challenge the uniformity we seek to impose.

Organizations should measure what matters for their missions and markets. For example, a job created in Lagos is not the same as a job created in Lisbon. The job in Lagos may be informal yet transformative for a household, while in Lisbon it may be formal yet marginal in effect. Artificial intelligence lets us model that nuance rather than treat it as noise. 

There is precedent for aggregation that does not erase context. Research on the Common Impact Data Standard shows how dissimilar indicators can be “messily” rolled up and mapped to multiple frameworks while preserving meaning, offering a practical path for portfolio-level comparisons.

Today, impact data is usually fragmented, collected from multiple sources, and measured in different ways. A recent AI pilot offers an alternative path forward that could transform the impact investment ecosystem if adopted at scale. 

From theory to practice

A few years ago, our team at Agile Impacts took on a challenge from the Inter-American Development Bank: compare impact across 500 investment operations, each with different indicators, frameworks and time horizons. Some operations had also shifted methodology midstream or used terminology that evolved over time.

The core question was whether we could train a system to decide when indicators are equivalent and can be aggregated, when they are complementary and should be compared side by side, and when they reflect different dimensions that should remain distinct.
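The three-way decision can be pictured as a small rule applied to each pair of indicators. What follows is only a minimal sketch under stated assumptions, not the system used in the pilot: the function name, the idea of scoring a pair with a single similarity value between 0 and 1, the unit check and both thresholds are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    EQUIVALENT = "aggregate"
    COMPLEMENTARY = "compare side by side"
    DISTINCT = "keep separate"


def classify_pair(
    similarity: float,
    same_unit: bool,
    equivalent_at: float = 0.85,    # hypothetical threshold
    complementary_at: float = 0.60,  # hypothetical threshold
) -> Relation:
    """Decide how two indicators relate, given a similarity score in [0, 1]."""
    # Aggregation only makes sense when meaning AND unit of measure agree.
    if similarity >= equivalent_at and same_unit:
        return Relation.EQUIVALENT
    # Related but not interchangeable: show next to each other, never summed.
    if similarity >= complementary_at:
        return Relation.COMPLEMENTARY
    return Relation.DISTINCT


print(classify_pair(0.91, same_unit=True).value)
print(classify_pair(0.70, same_unit=True).value)
print(classify_pair(0.20, same_unit=False).value)
```

In practice the thresholds would need to be calibrated and published alongside the results, so that a reader can see why two indicators were or were not combined.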

Our pilot showed that artificial intelligence can bridge frameworks without erasing meaning. It recognized statements such as “microcredit borrowers reached” and “low-income entrepreneurs financed” as describing the same progress despite different wording. Because cross-project indicator matching was not feasible manually, no baseline existed. With artificial intelligence, we made the task possible and reached 90% precision in internal tests. 

Women TechEU, a two-year project funded by the European Union to support women leading deep tech startup companies, recently recognized and funded this approach as a deep tech innovation. Agile Impacts has signed a formal agreement with the Polytechnic University of Valencia to ensure the academic rigor of the solution.

Borrowing a page from astrophysics

Comparability is a clustering challenge: The goal is to group indicators that measure the same underlying concept, even if their language, structure or data source differs. 

Traditional clustering methods depend on numeric similarity. Yet impact indicators are largely text-based and context-dependent, so the goal is conceptual proximity instead.

We borrowed a method from astrophysics: the torque balance algorithm. In astronomy, torque balance helps identify natural clusters of stars based on mass, which represents gravitational influence, and distance, which represents the strength of interaction. Clusters form where the torques reach equilibrium and groups become stable.

Translated to impact measurement, a star becomes an indicator. Mass becomes indicator weight or relevance, such as its frequency across datasets or its significance in a framework like IRIS+. Distance becomes the semantic similarity between indicators. Torque becomes the conceptual pull indicators exert on each other. And finally, clusters are the groups of indicators that measure the same underlying concept.
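To make the analogy concrete, here is a deliberately simplified sketch. It is not the torque balance algorithm and not our production code: it swaps in a gravity-style pull (mass times mass, divided by squared distance) and links any pair whose pull clears a threshold, without modeling equilibrium. Word overlap stands in for real semantic similarity, and the masses, the `eps` softening term and the `min_pull` threshold are all hypothetical values chosen for illustration.

```python
from itertools import combinations


def distance(a: str, b: str) -> float:
    """1 minus Jaccard word overlap: a crude stand-in for semantic distance."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def pull(mass_a: float, mass_b: float, dist: float, eps: float = 0.1) -> float:
    """Gravity-style attraction; eps avoids division by zero for identical text."""
    return mass_a * mass_b / (dist + eps) ** 2


def cluster(indicators: dict[str, float], min_pull: float = 2.0) -> list[set[str]]:
    """Group indicators whose mutual pull reaches min_pull (connected components)."""
    parent = {name: name for name in indicators}

    def find(name: str) -> str:
        # Follow parent links to the representative of this group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(indicators, 2):
        if pull(indicators[a], indicators[b], distance(a, b)) >= min_pull:
            parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, set[str]] = {}
    for name in indicators:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical masses, e.g. how often each indicator appears across datasets.
indicators = {
    "people trained for employment": 2.0,
    "people trained in marketing": 1.0,
    "jobs created": 1.0,
}
for group in cluster(indicators):
    print(sorted(group))
```

With these illustrative inputs, the two training indicators end up in one group and “jobs created” stays on its own. A real implementation would replace the word-overlap distance with a language-model similarity and would need to guard against heavy indicators pulling in unrelated ones.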

What investors actually see

Consider two familiar metrics: “people trained for employment” and “people trained in marketing.” They share meaning and units, so they cluster together and can roll up to a higher-level measure such as “total people trained.” By contrast, “jobs created” belongs in a different cluster, since it reflects a later stage in the employment outcome chain and should be compared with placements and retention, not aggregated with training.
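Once clusters exist, the roll-up itself is simple arithmetic. The sketch below assumes the cluster assignments are already known; the mapping, the cluster labels other than “total people trained,” and the reported values are hypothetical.

```python
from collections import defaultdict

# Hypothetical cluster assignments produced by an earlier matching step.
CLUSTER_OF = {
    "people trained for employment": "total people trained",
    "people trained in marketing": "total people trained",
    "jobs created": "employment outcomes",
}


def roll_up(reports: list[tuple[str, int]]) -> dict[str, int]:
    """Sum reported values within each cluster; clusters are never mixed."""
    totals: dict[str, int] = defaultdict(int)
    for indicator, value in reports:
        totals[CLUSTER_OF[indicator]] += value
    return dict(totals)


# Hypothetical reported figures.
reports = [
    ("people trained for employment", 120),
    ("people trained in marketing", 80),
    ("jobs created", 45),
]
print(roll_up(reports))
```

A plain sum assumes the same person is not counted under both training indicators. Where programs overlap, the roll-up would need a deduplication rule, and that rule should be disclosed.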

Now imagine you manage a $200 million fund comparing two workforce programs, one in Nairobi and one in Lima. Normally, your impact reporting would include a jumble of training hours, completion rates, certifications, job placements and six-month retention. With torque balance scaled by artificial intelligence, indicators self-organize. Training measures roll up to a comparable “people trained” metric, while employment outcomes sit in a separate cluster. 

Perhaps the system would show an 82% semantic match between “placed within 90 days” and “formal employment at six months,” allowing the two to be benchmarked side by side against a published confidence threshold. You’d see that Program A delivers 140 comparable placements per $1 million versus 95 for Program B, while Program B outperforms on retention. You could then reweight capital and add a performance-linked tranche tied to retention.
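The per-$1 million figure is a straightforward normalization. In the sketch below, the placement counts and the amounts invested are hypothetical inputs chosen only so that the results match the illustrative 140 and 95 above.

```python
def per_million(placements: int, invested_usd: float) -> float:
    """Comparable placements per $1 million invested."""
    return placements / (invested_usd / 1_000_000)


# Hypothetical inputs: each program receives $15 million.
program_a = per_million(placements=2_100, invested_usd=15_000_000)
program_b = per_million(placements=1_425, invested_usd=15_000_000)

print(f"Program A: {program_a:.0f} placements per $1M")
print(f"Program B: {program_b:.0f} placements per $1M")
```

The ratio is only meaningful once the placements being counted have been judged comparable, which is the step the clustering is meant to support.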

Sifting through complexity

I am not challenging impact standards in principle. They were, for a long time, the only practical mechanism for achieving comparability, and a top-down structure was simply unavoidable. 

In practice, however, many companies and consultants express frustration that these standards do little to help them solve their underlying challenges.

Artificial intelligence fundamentally changes this equation. The problem can now be approached from the bottom up: Companies are free to measure impact in the ways that create the most value for them, while AI and algorithmic models take on the task of making those diverse metrics comparable. 

This approach avoids rigid standard metrics that struggle to keep pace with evolving language. It allows data to self-organize into coherent groups guided by meaning and weight, and it can continuously re-cluster as new information arrives. It also reuses historical data by interpreting it through a new framework. And it makes the assumptions behind aggregation transparent by explaining the algorithm’s decision logic.

Financial statements explicitly describe how each indicator is calculated. Likewise, what ultimately matters here is not the specific method but the transparency that underpins it. Together, these features open the door to a new architecture for impact data.

From comparability to competitiveness

When data becomes comparable, markets transform. This has happened before: Credit ratings shifted lending from relationship-based to risk-based, and financial indices gave investors a common language that reshaped global capital flows in the 1970s. Impact investing may now be approaching a similar inflection point, one in which technology finally enables what the ecosystem increasingly demands: genuine comparability across diverse interventions.

Comparability turns fragmented reporting into decision-ready comparisons that travel across programs, geographies and time. Development finance institutions (DFIs) gain like-for-like views to set performance-linked terms, compare pipelines and justify co-financing. Funds obtain verifiable evidence of outperformance against true peers, with shared rules for what to aggregate and what to keep separate. Everyone spends less time harmonizing and more time allocating. That is how comparability becomes competitiveness and investability.

The invitation is open: If you manage a portfolio and want decision-ready, like-for-like comparisons across indicators, or if you would like to validate torque balance on your data, let’s talk.


Adriana Mata is co-founder of Agile Impacts.

Guest posts on ImpactAlpha represent the opinions of their authors and do not necessarily reflect the views of ImpactAlpha.