40-year old trick to clean data efficiently, and perform fuzzy matching

While much of data cleaning is performed before loading data in a database (especially for one-time, ad hoc analyses), there is a way to do it, continuously (like once a week or once a day), once the data is in its final database. It consists of adding look-up tables to help with the messy fields.

When I was working at eBay, there was a database of clients from around the world. Some clients had names in a foreign language, containing accents and special characters. Somehow, it made some SQL joins very tricky. We created a lookup table of names, matching different spelling of a company name, to a standardized name and client ID. Think about names such as M.I.T and MIT that represent the same entity but can be spelled differently. It also helps dealing with duplicate records.This old trick allows you to do fuzzy matching, and the size of the lookup tables (updated daily) was manageable.

What do you think of this idea? Of course the best solution is to use this system, together with traditional cleaning techniques, if possible. But in systems where data is automatically uploaded and updated on a daily basis, lookup tables are very helpful.

Posted by Vincent Granville

Post HERE

NoeliaGorod | Inteligencia Artificial aplicada a negocio

Estrategia, casos reales y gobernanza de IA para empresas y líderes.

40-year old trick to clean data efficiently, and perform fuzzy matching

Deja un comentario Cancelar la respuesta

40-year old trick to clean data efficiently, and perform fuzzy matching

Comparte esto:

Relacionado

Deja un comentario Cancelar la respuesta