An ML model must see sufficient examples of each class to "learn" how to make inferences for that class on unseen text. We therefore set a minimum threshold of 10 examples per label, which eliminated some rarely occurring issues from our dataset and left a total of 234 classes.
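To make the thresholding step concrete, here is a minimal sketch of such a label-frequency filter, assuming the tickets sit in a pandas DataFrame with a hypothetical `label` column (the names and structure are illustrative, not our exact pipeline):

```python
import pandas as pd

MIN_EXAMPLES = 10  # minimum examples required per label, as described above

def filter_rare_labels(df: pd.DataFrame, label_col: str = "label",
                       min_count: int = MIN_EXAMPLES) -> pd.DataFrame:
    """Drop rows whose label occurs fewer than min_count times."""
    counts = df[label_col].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[label_col].isin(frequent)].reset_index(drop=True)

# Example usage (tickets_df is a hypothetical input DataFrame):
# filtered = filter_rare_labels(tickets_df)
# filtered["label"].nunique()  # in our case, 234 classes remained
```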
We conducted a time-bound literature review to identify the techniques best suited to our use case: multiclass text classification of documents with variable character lengths and class imbalance. In addition to the academic literature, we analyzed leaderboards of NLP benchmarks such as SuperGLUE to identify applicable techniques. During this analysis, we found that XLNet outperformed other NLP techniques on most tasks. Classical ML techniques and deep learning (DL) text classification approaches, including hierarchical classification techniques, were also identified as candidates for handling text features, a large number of classes, and class imbalance.
Next, we designed key experiments covering different ML/DL techniques, text preprocessing steps, featurization of text, and handling of class imbalance. The techniques we experimented with include tree-based methods (for example, CatBoost and Random Forest), a Support Vector Classifier (SVC), XLNet, and DL techniques for hierarchical and imbalanced classification. Additional experiments applied pseudo-labeling strategies to address the small amount of training data for certain classes, and the Synthetic Minority Oversampling Technique (SMOTE) to treat the class imbalance; however, neither pseudo-labeling nor SMOTE led to considerable performance improvements (a sketch of the SMOTE setup follows below). To represent the text, we tried word/character count-based vectors, word/character Term Frequency-Inverse Document Frequency (TF-IDF) vectors, and word/document embeddings.
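For illustration, this is a minimal sketch of applying SMOTE on top of TF-IDF features using the imbalanced-learn library; the toy corpus and parameter values are assumptions for demonstration, not our actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Tiny illustrative corpus; every class needs more samples than k_neighbors.
texts = ["printer offline", "printer jams", "paper stuck in printer",
         "cannot log in", "password reset needed",
         "vpn keeps dropping", "wifi unstable"]
labels = ["hardware", "hardware", "hardware",
          "access", "access",
          "network", "network"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse TF-IDF matrix

# SMOTE synthesizes minority-class points by interpolating between neighbors.
smote = SMOTE(k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, labels)
print(sorted(y_resampled))  # classes are now balanced
```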
The following table summarizes the performance of the key techniques attempted for the English model:
Table 2. Model experiment results

| Experiment | Macro F-score |
|---|---|
| TextCNN | 0.136225 |
| TextRNN | 0.106093 |
| TextRCNN | 0.098848 |
| FastText | 0.009733 |
| AttentionConv | 0.110873 |
| Transformer | 0.059808 |
| TextVDCNN | 0.003865 |
| DRNN | 0.022878 |
| XLNet | 0.1442 |
| SVC | 0.25 |
Similar models were developed for Mandarin and Portuguese, with custom preprocessing for each language. Along with the macro F-score, we also calculated the Normalized Discounted Cumulative Gain (NDCG) for three models, known as the T1, T1-T2, and T1-T2-T3 models. Based on the experimentation results, and considering the resource requirements for model deployment, SVC was chosen as the final model. The SVC model was trained on TF-IDF features; TF-IDF down-weights frequent, common words that are not useful for classification. The combination of character n-gram TF-IDF features and SVC provided the best overall results for this use case compared to other model-feature combinations. The NDCG scores achieved for the top three labels were 0.9, 0.8, and 0.6 for the T1, T1-T2, and T1-T2-T3 models, respectively. The NDCG metric is particularly applicable to this use case because the top three model recommendations are shown to the support agents.
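To show the shape of the chosen approach, here is a hedged sketch of a character n-gram TF-IDF plus linear SVC pipeline in scikit-learn, with top-three label extraction as served to agents; the hyperparameters, helper names, and use of LinearSVC are illustrative assumptions, not our production settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
import numpy as np

# Character n-gram TF-IDF features feeding a linear SVC; the n-gram range
# and C value here are illustrative choices.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                              sublinear_tf=True)),
    ("svc", LinearSVC(C=1.0)),
])
# pipeline.fit(train_texts, train_labels)

def top_three_labels(texts):
    """Return the three highest-scoring labels per document, mirroring the
    top-three recommendations shown to support agents."""
    scores = pipeline.decision_function(texts)  # one score per class
    classes = pipeline.named_steps["svc"].classes_
    top = np.argsort(scores, axis=1)[:, ::-1][:, :3]
    return [[classes[i] for i in row] for row in top]

# NDCG@3 can then be computed with sklearn.metrics.ndcg_score, given one-hot
# true labels and the per-class decision scores: ndcg_score(Y_true, scores, k=3).
```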