DellSpell, the automated query correction component, is a key part of the DellNLP algorithm that detects and auto-corrects spelling errors. Several exploratory analyses helped identify the key issues found in textual data within Dell Technologies.
The analysis revealed that the following types of issues are prominent in Dell Technologies text:
Addressing these issues meant investigating readily available libraries that perform automated spelling correction. The SymSpell library could handle issues such as spelling mistakes, compounds, and decompounds, but was not expected to work on regular Dell Technologies text because of proprietary and internal terminology that does not exist in SymSpell’s language resources. This led to the creation of custom language resources from Dell Technologies data that are compatible with SymSpell. The key language resources developed include a unigram database and a bigram database.
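As an illustration of what such resources can look like, the following minimal sketch counts unigrams and bigrams in a small corpus and writes them out in the plain-text layout used by SymSpell frequency dictionaries ("term count" for unigrams, "term1 term2 count" for bigrams). The tokenizer, the function names, and the sample sentences are assumptions made for this example, not the resources or code actually built from Dell Technologies data.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens are good enough for a sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_frequency_resources(documents):
    """Count single terms and adjacent term pairs across all documents."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def to_symspell_lines(unigrams, bigrams):
    """Format counts as 'term count' and 'term1 term2 count' lines."""
    unigram_lines = [f"{term} {count}" for term, count in unigrams.most_common()]
    bigram_lines = [f"{a} {b} {count}" for (a, b), count in bigrams.most_common()]
    return unigram_lines, bigram_lines


# Invented sample text, for illustration only.
corpus = [
    "PowerEdge server will not boot",
    "PowerEdge server fan error",
    "replace iDRAC license",
]
unigrams, bigrams = build_frequency_resources(corpus)
unigram_lines, bigram_lines = to_symspell_lines(unigrams, bigrams)
```

Writing these lines to two text files would give a unigram and a bigram resource that a SymSpell implementation can load alongside, or instead of, its default English dictionaries.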
Still, even after constructing SymSpell-compatible dictionaries, empirical analyses showed a substantial number of false positives, that is, cases where the algorithm unnecessarily changes a correct term into an incorrect one. The SymSpell word segmentation feature also showed low accuracy on Dell Technologies data.
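One way to quantify false positives in the sense defined above is to measure how often tokens that were already correct get changed by the corrector. The sketch below assumes a hypothetical evaluation set of (original, corrected, expected) triples. The choice of denominator (all already-correct tokens) and the sample records are assumptions for illustration, not the evaluation protocol used in the analyses.

```python
def false_positive_rate(records):
    """Share of already-correct tokens that the corrector changed.

    records: iterable of (original, corrected, expected) token triples.
    """
    # Keep only tokens that needed no correction (original equals expected).
    correct_inputs = [(orig, corr) for orig, corr, exp in records if orig == exp]
    if not correct_inputs:
        return 0.0
    changed = sum(1 for orig, corr in correct_inputs if corr != orig)
    return changed / len(correct_inputs)


# Invented records: a product name wrongly "corrected" counts as a false positive.
records = [
    ("latitude", "latitude", "latitude"),  # correct term left alone
    ("latitude", "altitude", "latitude"),  # correct term changed: false positive
    ("servr", "server", "server"),         # genuine misspelling, not counted
    ("boot", "boot", "boot"),              # correct term left alone
]
rate = false_positive_rate(records)
```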
In response to these issues, the language resources were leveraged to devise custom algorithms for the compounding and decompounding tasks.
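The custom algorithms themselves are not detailed here, so the following is only a minimal sketch of how frequency resources can drive both tasks. Decompounding is shown as a dynamic-programming split that maximizes the sum of log unigram probabilities. Compounding is shown as merging two adjacent tokens when the joined form is more frequent as a unigram than the pair is as a bigram. Both rules, the function names, and the toy counts are assumptions for illustration.

```python
import math


def decompound(token, unigrams, max_word_len=20):
    """Split a run-together token into known words, or return [token]."""
    total = sum(unigrams.values())
    n = len(token)
    # best[i] holds (log-probability score, words) for token[:i], or None.
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            if best[start] is None or piece not in unigrams:
                continue
            score = best[start][0] + math.log(unigrams[piece] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1] if best[n] else [token]


def compound(tokens, unigrams, bigrams):
    """Merge adjacent tokens when the joined form is the more frequent one."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            merged = tokens[i] + tokens[i + 1]
            pair = (tokens[i], tokens[i + 1])
            if unigrams.get(merged, 0) > bigrams.get(pair, 0):
                out.append(merged)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


# Toy counts, invented for illustration.
toy_unigrams = {"power": 50, "supply": 40, "laptop": 30, "lap": 2, "top": 5, "fan": 10}
toy_bigrams = {("power", "supply"): 25}

split_result = decompound("powersupply", toy_unigrams)
merge_result = compound(["lap", "top", "fan"], toy_unigrams, toy_bigrams)
```

Because every log probability is negative, a token that is itself in the unigram resource (such as "laptop" here) scores higher as a single word than as a split into rarer pieces, so known terms are left intact.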
Minimizing false positives required a mechanism to semantically validate the syntactic corrections proposed by the SymSpell module. For this purpose, a word embedding model was trained on Dell Technologies data and used to build an algorithm that predicts spelling corrections from those embeddings. This novel algorithm is a key component in reducing false positives.
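The idea of semantic validation can be sketched as follows: average the embeddings of the surrounding words into a context vector, score each proposed correction by cosine similarity to that vector, and accept a correction only if it clearly fits the context better than the original term. This is a simplified stand-in under stated assumptions, not the actual DellSpell algorithm. The three-dimensional vectors, the threshold of 0.5, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_vector(words, embeddings):
    """Average the vectors of the context words that are in the vocabulary."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def validate_correction(original, candidates, context, embeddings, min_similarity=0.5):
    """Return the candidate that best fits the context, else the original term."""
    ctx = context_vector(context, embeddings)
    if ctx is None:
        return original  # no semantic evidence, so do not change anything
    # A known original term sets the bar that a candidate must beat.
    best, best_sim = original, min_similarity
    if original in embeddings:
        best_sim = max(best_sim, cosine(embeddings[original], ctx))
    for cand in candidates:
        if cand in embeddings:
            sim = cosine(embeddings[cand], ctx)
            if sim > best_sim:
                best, best_sim = cand, sim
    return best


# Toy embeddings, invented for illustration.
toy_embeddings = {
    "server": [0.9, 0.1, 0.0],
    "boot": [0.8, 0.2, 0.1],
    "sever": [0.0, 0.1, 0.9],
    "battery": [0.1, 0.9, 0.1],
    "charger": [0.2, 0.8, 0.0],
}

fixed = validate_correction("servr", ["sever", "server"], ["boot"], toy_embeddings)
kept = validate_correction("battery", ["server"], ["charger"], toy_embeddings)
```

In the first call the misspelling "servr" has no vector, and "server" is chosen over "sever" because it is closer to the context word "boot". In the second call the already-correct "battery" is kept, which is the false-positive case the validation step is meant to prevent.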
The development of these functionalities produced several useful Python functions that diverse teams can leverage in various NLP applications. This led to the decision to expose them through a package and an API, later named DellNLP.
The following figure shows the main components of our algorithm, some of which are discussed in detail within this document: