For a good user experience in interactive applications built on an LLM, the latency of the next token (the second token) is a key performance indicator (KPI). We measured next-token latency with input sizes ranging from small to large, represented by 32 and 2,048 (2K) tokens. We also ran a few experiments varying the output token length, but it did not significantly affect next-token latency, so we fixed the output length at 32 tokens. We averaged many runs, each consisting of 100 individual requests at batch size 1. We measured the second-token latency for Llama 2 7B inference in the VM, with and without TDX protection, across fp32, bf16, and int8 precisions. FP32 uses AVX-512 instructions, while bf16 and int8 are accelerated by Intel® AMX on Intel® Xeon® processors.
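The sketch below illustrates one way such a measurement could be set up; it is not the exact harness used for these results. It times each generated token for Llama 2 7B with Hugging Face Transformers and records the second-token latency, averaged over 100 batch-size-1 requests. The model identifier, prompt construction, and bf16 dtype are illustrative assumptions; the paper's runs additionally cover fp32, int8, the 2K-token input case, and execution inside a TDX-protected VM.

```python
# Minimal sketch (assumption: Hugging Face Transformers on a Xeon CPU) of
# measuring second-token latency for Llama 2 7B. Not the paper's harness.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"   # assumed model identifier
INPUT_TOKENS = 32                        # small-prompt case; use 2048 for the large case
OUTPUT_TOKENS = 32                       # fixed output length, as in the methodology
NUM_REQUESTS = 100                       # one run = 100 individual requests, batch size 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

# Build a prompt of exactly INPUT_TOKENS token IDs (prompt content is irrelevant here).
input_ids = torch.full((1, INPUT_TOKENS), tokenizer.eos_token_id, dtype=torch.long)

second_token_latencies = []
with torch.inference_mode():
    for _ in range(NUM_REQUESTS):
        past = None
        ids = input_ids
        token_times = []
        for _step in range(OUTPUT_TOKENS):
            start = time.perf_counter()
            out = model(input_ids=ids, past_key_values=past, use_cache=True)
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            token_times.append(time.perf_counter() - start)
            past = out.past_key_values
            ids = next_id
        # token_times[0] is the first-token (prefill) latency;
        # token_times[1] is the second-token latency tracked as the KPI.
        second_token_latencies.append(token_times[1])

mean_ms = 1000 * sum(second_token_latencies) / NUM_REQUESTS
print(f"mean 2nd-token latency: {mean_ms:.2f} ms")
```

Averaging the second-token time across all requests in a run, and then across many runs, yields the reported next-token latency for each input size and precision.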