Deduplication is a deceptively complex problem to handle at scale. There's a simple, brute-force way to do it, where you compare two records and check whether they are exactly the same, but this breaks down as soon as the records differ even slightly. A missing comma is enough to render that algorithm useless.
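To make the missing-comma problem concrete, here is a minimal sketch in Python. The two tender titles are invented for illustration, and the similarity measure comes from the standard library's `difflib`, used here as a stand-in rather than as a description of our production scoring.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Brute-force check: the records must be character-for-character identical."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0; 1.0 means the (lowercased) texts are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two invented titles that differ only by a single comma.
record_a = "Supply of office furniture, desks and chairs"
record_b = "Supply of office furniture desks and chairs"

print(exact_match(record_a, record_b))       # False: the comma breaks the exact comparison
print(similarity(record_a, record_b) > 0.9)  # True: the texts are still almost identical
```

A similarity score tolerates small differences, but it also forces a decision about where to set the threshold, which is part of what makes the problem harder than it first looks.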
Doing deduplication properly means more than just mashing texts together. It's about building the infrastructure and referencing that allow you to narrow the body of potential duplicates, whilst also providing an accurate and intuitive analysis of what constitutes a duplicate.
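One common way to narrow the body of potential duplicates is to group records by a shared key first and only compare records within the same group. The sketch below assumes that approach. The field names (`buyer`, `deadline`, `title`), the choice of key, the 0.9 threshold and the sample records are all hypothetical, chosen for illustration rather than taken from our pipeline.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    # Hypothetical key: only records from the same buyer with the same deadline are compared.
    return (record["buyer"].strip().lower(), record["deadline"])


def find_candidate_duplicates(records: list, threshold: float = 0.9) -> list:
    """Return (id_a, id_b, score) for pairs in the same block whose titles are similar."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    pairs = []
    for block in blocks.values():
        # Pairwise comparison happens inside each block, not across the whole dataset.
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], score))
    return pairs


# Invented sample records.
records = [
    {"id": 1, "buyer": "Example City Council", "deadline": "2024-05-01",
     "title": "Supply of office furniture, desks and chairs"},
    {"id": 2, "buyer": "example city council ", "deadline": "2024-05-01",
     "title": "Supply of office furniture desks and chairs"},
    {"id": 3, "buyer": "Example City Council", "deadline": "2024-05-01",
     "title": "Road resurfacing works"},
    {"id": 4, "buyer": "Sample County Council", "deadline": "2024-05-01",
     "title": "Supply of office furniture, desks and chairs"},
]

for id_a, id_b, score in find_candidate_duplicates(records):
    print(id_a, id_b, round(score, 3))
```

In this sample, records 1 and 2 are flagged as a likely duplicate pair, while record 4 is never compared against them because it belongs to a different buyer. The trade-off is that a key that is too strict can hide genuine duplicates, so the choice of key matters as much as the similarity score.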
We augment texts with machine learning and then use the results to refine the matching process and similarity scoring. As a result, our deduplication algorithm is 99% accurate across a body of over 50,000 live tender opportunities each day.
At Spend Network, we have government procurement data for more than 150 countries globally. We take the hard work out of collecting, organising and analysing procurement data at scale. Get in touch to find out more about our data products and services.