Semantic Search, Classification and Data Migration: The Winning Team

Seven years ago, I was tasked with migrating data from one system to another. One of the major challenges was importing parts into a hierarchy of categories that had been designed for the new system. The legacy data had a hierarchy of its own, but it contained more than 900 entries, so users had mostly picked the wrong category, which made this information totally unreliable.

So we decided to try classifying the objects from the description field of the parts, on the assumption that users had chosen meaningful words to describe them. A method still had to be found.

So I devised an algorithm to do this. It analyzed the words of the description and compared them to a dictionary, applying a multiplication factor to each word depending on its position in the description. In parallel, I built the technical dictionary by analyzing the descriptions of roughly 500,000 parts and finding the most used words.
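To make the idea concrete, here is a minimal sketch in Python, not the code that was actually used. The dictionary entries, the category names, the function names and above all the weighting rule (earlier words count more) are assumptions chosen for illustration; the original factors are not described here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical dictionary: word -> category in the new system.
DICTIONARY = {
    "screw": "Fasteners",
    "bolt": "Fasteners",
    "valve": "Fluid components",
    "pump": "Fluid components",
    "cable": "Electrical",
}


def tokenize(description):
    """Lowercase the description and keep alphabetic words only."""
    return re.findall(r"[a-z]+", description.lower())


def position_factor(index):
    """Assumed weighting: the first word counts most, later words count less."""
    return 1.0 / (index + 1)


def classify(description, dictionary=DICTIONARY):
    """Return the best-scoring category, or None if no word is known."""
    scores = defaultdict(float)
    for index, word in enumerate(tokenize(description)):
        category = dictionary.get(word)
        if category is not None:
            scores[category] += position_factor(index)
    if not scores:
        return None
    return max(scores, key=scores.get)


def most_used_words(descriptions, top_n=10):
    """Count words across all descriptions to find dictionary candidates."""
    counts = Counter(word for d in descriptions for word in tokenize(d))
    return counts.most_common(top_n)
```

With this sketch, a description such as "Pump cable support" scores "Fluid components" higher than "Electrical" because "pump" comes first, and a description with no known word returns None so it can be handled separately.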

I showed that more than 75% of the parts could be migrated automatically using this algorithm. For the remaining 25%, I built an application that listed the parts still to classify alongside the categories available in the new system, and we asked experts to classify those parts manually. Once that was done, I enriched my dictionary with new words whose meaning I had not been able to guess (including some funny ones…). With the new dictionary, we were able to automatically classify more than 90% of the parts.
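The loop described above (classify what you can, send the rest to experts, feed their answers back into the dictionary) can be sketched as follows. This is only an illustration under assumptions: the part identifiers, the helper names and the simple first-match classifier are hypothetical stand-ins, not the original application.

```python
def first_match_classifier(dictionary):
    """Build a simple classifier: category of the first known word, else None."""
    def classify(description):
        for word in description.lower().split():
            if word in dictionary:
                return dictionary[word]
        return None
    return classify


def triage(parts, classify):
    """Split parts into automatically classified ones and a manual review queue."""
    automatic, manual = {}, []
    for part_id, description in parts.items():
        category = classify(description)
        if category is None:
            manual.append(part_id)
        else:
            automatic[part_id] = category
    return automatic, manual


def enrich(dictionary, expert_words):
    """Return a new dictionary extended with expert-provided words.

    Existing entries win if a word is already known.
    """
    enriched = dict(expert_words)
    enriched.update(dictionary)
    return enriched


def coverage(automatic, parts):
    """Share of parts that were classified without manual work."""
    return len(automatic) / len(parts)
```

A first pass with the initial dictionary produces the manual queue; once the experts have mapped the unknown words to categories, calling enrich and running triage again shows how much the coverage improves.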

Then we set up an automatic procedure using this algorithm to migrate data every night from the legacy system to the new one, since it had been decided that both systems would run in parallel for a given period. This procedure ran for one year, until all project data had been migrated to the new system. The migration system was then stopped and archived. I had created a semantic search engine without knowing it.

Years later, I now have to implement a search engine based on the Exalead search engine. This technology offers semantic options, and I hope to reuse the dictionary I built seven years ago to bring more value to this new technology.

My conclusion today is that I learnt several lessons from this experience:

  • semantic search can help migrate data
  • semantic search can help classify data
  • data migration activity can bring value for future activities
  • companies should pay attention to building technical dictionaries that compile the words users use every day