The architecture includes several key stages:
- Data Ingest and Filtering─This step obtains the latest data at a three-hour cadence. It then runs a language-detection model on the raw data and stores the detected language against each record. Finally, it filters the data, removing records in unsupported languages, records with missing text, and records containing only one or two words.
- Data Preprocessing and Normalization─This step minimizes noise for the model, including removal of irrelevant tool-generated content and machine-generated prompts from chat records. Next, it expands acronyms and abbreviations commonly used by support agents, with the aid of DellNLP. The final step lemmatizes the text.
- Model Inference─This step converts the filtered, cleaned text into ML features and feeds them to the model to obtain the classification result. As discussed in Model Experimentation, character-level TF-IDF features are used, and the final model is an augmented Linear Support Vector Machine that outputs a model confidence for each individual prediction. The final output is a T1-T2-T3 label for a given textual case log.
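The ingest-and-filtering stage can be sketched as below. This is a minimal, dependency-free illustration: the `detect_language` placeholder, the `SUPPORTED_LANGUAGES` set, and the record fields are assumptions, not the actual production components (a real pipeline would call a trained language-identification model here).

```python
# Hypothetical supported-language set; an assumption for illustration.
SUPPORTED_LANGUAGES = {"en"}

def detect_language(text: str) -> str:
    # Placeholder for the language-detection model; a real pipeline would
    # invoke a trained detector rather than this crude ASCII heuristic.
    return "en" if text.isascii() else "unknown"

def filter_records(records):
    """Drop records with unsupported language, missing text, or <= 2 words."""
    kept = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text or len(text.split()) <= 2:
            continue  # missing text or only one/two words
        lang = detect_language(text)
        rec["language"] = lang  # store detected language against the record
        if lang in SUPPORTED_LANGUAGES:
            kept.append(rec)
    return kept

records = [
    {"id": 1, "text": "Laptop fails to boot after BIOS update"},
    {"id": 2, "text": "no"},   # too short -> dropped
    {"id": 3, "text": None},   # missing text -> dropped
]
print([r["id"] for r in filter_records(records)])  # [1]
```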
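The normalization stage might look roughly like the following sketch. The acronym table and the noise pattern are invented examples; the production pipeline performs abbreviation expansion via DellNLP and applies full lemmatization, which is omitted here to keep the sketch dependency-free.

```python
import re

# Hypothetical agent shorthand -> expansion table (illustrative only).
ACRONYMS = {"cx": "customer", "rma": "return merchandise authorization",
            "os": "operating system"}

# Hypothetical marker for tool-generated content embedded in chat records.
TOOL_NOISE = re.compile(r"\[auto-generated\].*?\[/auto-generated\]", re.S)

def normalize(text: str) -> str:
    text = TOOL_NOISE.sub(" ", text)  # strip tool-generated content
    tokens = [ACRONYMS.get(tok.lower(), tok.lower())
              for tok in re.findall(r"[a-zA-Z']+", text)]
    # Lemmatization would follow here (e.g. via an NLP library's lemmatizer);
    # omitted so the sketch stays self-contained.
    return " ".join(tokens)

print(normalize("Cx reports OS crash [auto-generated]ticket #42[/auto-generated]"))
# -> customer reports operating system crash
```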
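To make the featurization concrete, here is a hand-rolled sketch of character-level TF-IDF over a toy corpus, using only the standard library. A production pipeline would instead use a mature implementation (e.g. scikit-learn's `TfidfVectorizer` with a character analyzer) feeding the trained SVM, whose decision scores supply the per-prediction confidence; the corpus and n-gram size below are illustrative assumptions.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Slide an n-character window over the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_features(corpus, n=3):
    docs = [Counter(char_ngrams(doc, n)) for doc in corpus]
    df = Counter(g for doc in docs for g in doc)  # document frequency per n-gram
    num_docs = len(corpus)
    features = []
    for doc in docs:
        total = sum(doc.values())
        # term frequency * inverse document frequency
        features.append({g: (c / total) * math.log(num_docs / df[g])
                         for g, c in doc.items()})
    return features

corpus = ["battery drain issue", "display flicker issue"]
vecs = tfidf_features(corpus)
# n-grams shared by both documents (e.g. "iss") score 0;
# distinctive n-grams (e.g. "bat") receive positive weight.
```

Character-level n-grams are robust to the spelling variation and typos common in support logs, which is one reason they often outperform word-level features on this kind of text.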
The preceding stages are implemented as separate Python packages following the single-responsibility principle. Developing independent packages enables effective update management, version tracking, improved code reuse, and easier automation through continuous integration/continuous deployment (CI/CD) pipelines.