Making Sense of the Entity-Resolution (ER) Black Box

Hundreds of solutions at J.P. Morgan sit on top of an entity-resolution process that outputs a dataset meant to represent the global corporate universe.

After much turnover and a complete absence of documentation, the process had turned into a black box fed hundreds of millions of data points. Our only artifact was the source code itself.

I considered this process to be the most crucial function within the team. The models that sit on top of this data can only be as good as the data they consume.

Define success - Minimize False Positives & False Negatives

In this example, ER detected two separate entities; however, the records from entity two belong with entity one, making them false negatives. Meanwhile, “Mc Donald’s Foundation” is a false positive because it doesn’t belong with the rest of the records in entity one.

A false positive occurs when the algorithm incorrectly identifies two distinct entities as being the same. This misclassification results in erroneously linking separate records, which can lead to inaccurate data consolidation and potential operational issues.

A false negative refers to a situation where the algorithm fails to recognize two entities that actually represent the same real-world object. This oversight prevents the effective merging of data, which can lead to double counting of entities.

Conceptualize Entity-Resolution as a Graph

Using the “graph visualizer”, another product I developed solo, the team and I were able to understand the current process’s weaknesses, along with why and how it was generating False Positives and False Negatives.

ER produces pairwise record matches. Records are represented as nodes, and the strength of each match is represented by a weighted edge connecting them. Joining all of these matches forms a graphical entity.
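
As a minimal sketch of that construction (the record IDs and scores are hypothetical, and this uses the open-source networkx library rather than our internal tooling):

```python
import networkx as nx

# Hypothetical pairwise matches: (record A, record B, connection strength).
matches = [
    ("rec_001", "rec_002", 0.97),
    ("rec_002", "rec_003", 0.91),
    ("rec_004", "rec_005", 0.88),
]

# Records become nodes; each match becomes a weighted edge.
G = nx.Graph()
for left, right, strength in matches:
    G.add_edge(left, right, weight=strength)

# Each connected component of the match graph is one graphical entity.
entities = [sorted(c) for c in nx.connected_components(G)]
print(entities)  # [['rec_001', 'rec_002', 'rec_003'], ['rec_004', 'rec_005']]
```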

In this example, the highly connected component is Mc Donald’s Corp, while its branches are separate entities (Mc Donald’s Foundation, Mc Donald’s Finance Co., Mc Donald’s Service Co., and Mc Donald’s Logistics Co.).

With this visualization we can see that most edges are good, but a few are bad actors that pull separate entities together. Most troubling, some records end up matched only transitively (A matches B and B matches C, so A and C land in the same entity despite no direct linkage). These transitive chains span up to six edges!
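
One way to quantify those chains (a self-contained sketch with hypothetical match pairs, again using networkx) is the diameter of each entity’s subgraph, i.e. the longest transitive chain holding the entity together:

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([  # hypothetical match pairs forming a 6-edge chain
    ("A", "B", 0.9), ("B", "C", 0.9), ("C", "D", 0.9),
    ("D", "E", 0.9), ("E", "F", 0.9), ("F", "G", 0.9),
])

for component in nx.connected_components(G):
    sub = G.subgraph(component)
    # Diameter = longest shortest-path between any two records in the entity.
    print(f"{len(component)} records, longest transitive chain: {nx.diameter(sub)} edges")
    # -> 7 records, longest transitive chain: 6 edges
```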

Add in Community Detection

To create more intelligent entities, we decided to implement the Leiden algorithm on top of our graphically connected entities. We chose Leiden over alternatives such as Louvain because of its more stable outcomes: Leiden guarantees well-connected communities, whereas Louvain can leave communities arbitrarily badly connected.
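
Our production implementation is internal, but the step is straightforward to sketch with the open-source leidenalg package on a hypothetical entity: a dense core with a cluster attached by a single weak edge.

```python
import igraph as ig
import leidenalg

# Hypothetical entity: a dense "Mc Donald's Corp" core (nodes 0-3) and a
# dense "Mc Donald's Foundation" cluster (nodes 4-6), glued by one weak edge.
core = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
branch = [(4, 5), (4, 6), (5, 6)]
bridge = [(3, 4)]  # the "bad actor" edge joining two separate entities

g = ig.Graph(core + branch + bridge)
g.es["weight"] = [0.9] * (len(core) + len(branch)) + [0.3]

partition = leidenalg.find_partition(
    g,
    leidenalg.ModularityVertexPartition,
    weights=g.es["weight"],  # connection strengths inform the communities
    seed=42,                 # fixed seed for reproducible partitions
)
print(list(partition))  # expected: [[0, 1, 2, 3], [4, 5, 6]] -- weak bridge cut
```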

This implementation led to a massive reduction in False Positive occurrences. It also allowed us to loosen our matching-strength conditions, which led to more edges in our graphical representation. This increase in connectivity worked to reduce the occurrence of False Negatives.

You can find more information on the Leiden algorithm on Wikipedia: https://en.wikipedia.org/wiki/Leiden_algorithm

Results and Impact of Changes

  • Reduction of 2MM+ False Negative entities

  • Reduction of the incidence of False Positive entities by over 50%

  • General increase in accuracy for downstream models that consume the ER output

  • Reduction in negative consumer feedback

Other Enhancements

  • Validation of graphical connections that are pulled forward from prior model runs, which boosted recall and stability of outcomes

  • Revisitation of standardizations across different matching fields

  • Assessment of “graphical connections” that are always honored

  • Development of KPIs to measure the accuracy of matching outcomes (homogeneity of records, LSP, etc.)

Next steps - Modernized Model

  • Generate semantic embeddings for the attributes used in the entity-resolution process (name, address, etc.)

  • Transform the newly created attribute embeddings into a record-level embedding

  • Map these embeddings into vector space and find nearest neighbors as candidate pairs (sketched below)

  • Train agents to determine whether candidate pairs are the same entity
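
As a rough sketch of the first three steps with off-the-shelf components (the records, the embedding model, and the use of sentence-transformers and scikit-learn here are illustrative assumptions, not committed design choices):

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Hypothetical records: concatenating attributes into one string is the
# simplest way to produce a record-level embedding input.
records = [
    "Mc Donald's Corp | 110 N Carpenter St, Chicago, IL",
    "McDonald's Corporation | 110 North Carpenter Street, Chicago, IL",
    "Mc Donald's Foundation | Oak Brook, IL",
]

# Steps 1-2: embed each record (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(records, normalize_embeddings=True)

# Step 3: nearest neighbors in vector space become candidate pairs.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings)
for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
    print(f"candidate pair ({i}, {j}), cosine distance {dist:.3f}")

# Step 4 (not shown): each candidate pair would then go to a trained
# agent/classifier for the final same-entity decision.
```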