The main EDA tasks were defining what counts as noise in the case logs and analyzing the class and text distributions.
In addition to n-gram (unigram, bigram, and trigram) analysis of the case logs, we analyzed average word count, character count, and language distribution. These analyses were instrumental in building the models (see Model Experimentation) and in understanding how to identify and handle very long or extremely short text logs.
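The n-gram and length analyses can be illustrated with a minimal sketch using only the Python standard library. The tokenizer, the function names, and the sample logs are assumptions made for illustration and are not taken from the project; language detection is left out because it would require a third-party detector.

```python
import re
from collections import Counter
from statistics import mean


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(texts, n):
    # Count every run of n consecutive tokens across all texts.
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def length_stats(texts):
    # Summarize log length in words and characters.
    word_counts = [len(tokenize(t)) for t in texts]
    char_counts = [len(t) for t in texts]
    return {
        "avg_words": mean(word_counts),
        "avg_chars": mean(char_counts),
        "min_words": min(word_counts),
        "max_words": max(word_counts),
    }


# Hypothetical sample logs, not drawn from the dataset.
sample_logs = ["system will not boot", "system will not charge"]
top_bigrams = ngram_counts(sample_logs, 2).most_common(3)
stats = length_stats(sample_logs)
```

The minimum and maximum word counts are the values one would inspect when deciding on thresholds for extremely short or very long logs.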
Noise analysis helped define the necessary preprocessing steps. For example, we noticed that case logs contain many acronyms and abbreviations introduced by support agents, and misspellings were also common. DellNP, an NLP product previously developed in-house, effectively addressed these challenges. We also noticed that some case logs contain machine-generated content, such as output from different diagnostic tools. We identified strategies to extract the relevant information from such logs and remove redundant content that could confuse the model. Unigram analysis was also useful for identifying terms for a custom stop-word list to minimize noise.
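One way unigram statistics can surface stop-word candidates is to flag tokens that occur in a large share of the logs. The sketch below assumes a document-frequency criterion and a hypothetical threshold parameter, `min_doc_fraction`; the text does not state which criterion was actually used, and the candidates would still need manual review before being added to a custom list.

```python
import re
from collections import Counter


def stopword_candidates(texts, min_doc_fraction=0.8):
    # Document frequency: in how many logs does each token appear at least once?
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    # Assumed rule: keep tokens present in at least min_doc_fraction of the logs.
    cutoff = min_doc_fraction * len(texts)
    return sorted(tok for tok, df in doc_freq.items() if df >= cutoff)


# Hypothetical sample logs, not drawn from the dataset.
sample_logs = [
    "customer reports no power",
    "customer reports fan noise",
    "customer called about battery",
]
candidates = stopword_candidates(sample_logs)
```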
We also performed an in-depth analysis of the target label distribution. As anticipated, we found a severe class imbalance: the most frequent label had 480 instances, whereas the least frequent label had only nine instances in the dataset. The analysis also revealed that the problem was one of multiclass classification.
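The degree of imbalance can be summarized as the ratio between the largest and smallest class. The sketch below uses placeholder label names; only the two counts, 480 and nine, come from the text, and the rest of the label distribution is not reproduced here.

```python
from collections import Counter


def class_distribution(labels):
    # Per-label counts, relative shares, and the max-to-min imbalance ratio.
    counts = Counter(labels)
    total = sum(counts.values())
    most_n = max(counts.values())
    least_n = min(counts.values())
    return {
        "counts": counts,
        "shares": {label: n / total for label, n in counts.items()},
        "imbalance_ratio": most_n / least_n,
    }


# Placeholder labels standing in for the most and least frequent classes.
labels = ["most_frequent_label"] * 480 + ["least_frequent_label"] * 9
summary = class_distribution(labels)
```

For these two counts the ratio is 480 / 9, roughly 53 to 1.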
EDA also helped in proposing new labels. One label we added is “Not Sufficient Information”, which applies when a case log presents minimal to no information about a technical issue. We also noticed that, depending on the level of technical detail present in a case log, the target label needs to be carefully drawn from the T1-T2-T3 label hierarchy: T1 when minimal information is present, T1-T2 when some technical or issue details are available, and T1-T2-T3 when all necessary details are present so that the issue can be identified precisely.
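The idea of drawing a label at the depth the log supports can be sketched as truncating a full T1-T2-T3 path. Everything in this sketch is an assumption made for illustration: the `detail_level` scale from 0 to 3, the example tier names, the " > " separator, and the choice to treat “Not Sufficient Information” as a standalone label outside the hierarchy.

```python
NOT_SUFFICIENT = "Not Sufficient Information"


def select_label(full_path, detail_level):
    # full_path: (T1, T2, T3) tuple for the issue.
    # detail_level: 0 = no usable information, 1 = T1 only, 2 = T1-T2, 3 = T1-T2-T3.
    if not 0 <= detail_level <= 3:
        raise ValueError("detail_level must be between 0 and 3")
    if detail_level == 0 or not full_path:
        return NOT_SUFFICIENT
    # Keep only as many tiers as the log's level of detail supports.
    return " > ".join(full_path[:detail_level])


# Hypothetical tier names, not taken from the actual label set.
path = ("Hardware", "Battery", "Not charging")
label_partial = select_label(path, 2)
label_full = select_label(path, 3)
label_empty = select_label(path, 0)
```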