AMD EPYC Milan Processors and FFC Mode Accelerate Oracle 244%
Thu, 06 May 2021 18:54:59 -0000
Intriguing, right? The Oracle team at Dell Technologies recently configured a Dell EMC PowerEdge server equipped with the new AMD EPYC processors to test the performance of an Oracle 19c database feature called Force Full Cache (FFC) mode. When using FFC mode, the Oracle server attempts to cache the full database footprint in buffer cache memory. This effectively reduces the read latency from what would have been experienced with the storage system to memory access speed. Writes are still sent to storage to ensure durability and recovery of the database. What’s fascinating is that by using Oracle’s FFC mode, the AMD EPYC processors can accelerate database operations while bypassing most storage I/O wait times.
For this performance test our PowerEdge R7525 server was populated with two AMD EPYC 7543 processors with a speed of 2.8 GHz, each with 32 cores. There are several layers of cache in these processors:
- Each Zen3 processor core includes an L1 write-back cache
- Each core has a private 512 KB L2 cache
- Up to eight Zen3 cores share a 32 MB L3 cache
Each processor also supports 8 memory channels, and each memory channel supports up to 2 DIMMs. With all these cache levels and memory channels, our hypothesis was that the AMD EPYC processors were going to deliver amazing performance. Although we listed the processor features we believe will most impact performance, in truth there is much more to these new processors that we haven’t covered. The AMD webpage on modern data workloads is an excellent overview. For a deep dive into RDBMS tuning, this white paper provides more great technical detail.
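The channel and DIMM counts above translate directly into a memory ceiling. The sketch below is a minimal illustration of that arithmetic, not a sizing tool: the function name is invented for this example, and the 128 GB DIMM size is an assumption borrowed from the tested configuration in Table 4.

```python
def max_memory_tb(sockets: int, channels_per_socket: int,
                  dimms_per_channel: int, dimm_size_gb: int) -> float:
    """Return the memory capacity in TB when every DIMM slot is populated."""
    dimm_slots = sockets * channels_per_socket * dimms_per_channel
    return dimm_slots * dimm_size_gb / 1024


# Two sockets, 8 channels per socket, 2 DIMMs per channel, 128 GB DIMMs (assumed size).
print(max_memory_tb(2, 8, 2, 128))  # 4.0
```

With these assumptions a fully populated two-socket server reaches 4 TB, and a server with one 128 GB DIMM per channel reaches 2 TB.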
For the sake of comparison, we also ran an Oracle database without FFC mode on the same server. Both database modes used the exact same technology stacks:
- Oracle Enterprise Edition 19c (188.8.131.52.200414)
- Red Hat Enterprise Linux 8.2
- VMware vSphere ESXi 7.0 Update 1
We virtualized the Oracle database instance since that is the most common deployment model in use today. AMD and VMware are continuously working to optimize the performance of high value workloads like Oracle. In the paper, “Performance Optimizations in VMware vSphere 7.0 U2 CPU Scheduler for AMD EPYC Processors” VMware shows how their CPU scheduler achieves up to 50% better performance than vSphere 7.0 U1. As the performance gap narrows between bare metal and virtualized applications, the gains in agility with virtualization outweigh the minor performance overhead of a hypervisor. The engineering team performing the testing used VMware vSphere virtualization to configure a primary Oracle virtual machine that was cloned to a common starting configuration.
HammerDB is a leading benchmarking tool used with databases like Oracle, Microsoft SQL Server and others. The engineering team used HammerDB to generate a TPROC-C workload on the database VM. The TPROC-C benchmark is referred to as an Online Transaction Processing (OLTP) workload because it simulates terminal operators executing transactions. When running the TPROC-C workload the storage system must support thousands of small read and write requests per minute. With a traditional configuration, Oracle’s buffer cache would only be able to accelerate a portion of the reads and writes. The average latency of the system will increase when more reads and writes go to the storage system, as the wait times are greater for physical storage operations than for memory. This is what the team expects to observe with the Oracle database that is not configured for FFC mode. Storage I/O is continually getting faster but not nearly as fast as I/O served from memory.
Once the test tool warms the cache, most of the reads will be serviced from memory rather than from storage, providing what we hope will be a significant boost in performance. We will not be able to separate out the individual performance benefits of using AMD EPYC processors combined with Oracle’s FFC mode. However, the efficiencies gained via AMD caches, memory channels, and VMware vSphere optimizations will make this performance test fun!
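The effect of the cache hit ratio on read latency can be sketched with a simple weighted average. This is a simplified model, not how Oracle accounts for wait time: the function name is hypothetical, the 0.0001 ms memory figure is an assumed order of magnitude, and 0.24 ms is the baseline storage response time reported in this post.

```python
def effective_read_latency_ms(hit_ratio: float, memory_ms: float,
                              storage_ms: float) -> float:
    """Average read latency when hit_ratio of the reads are served from memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * storage_ms


MEMORY_MS = 0.0001   # assumption: roughly 100 ns per memory access
STORAGE_MS = 0.24    # baseline storage response time from this post

for ratio in (0.50, 0.90, 0.99, 1.00):
    latency = effective_read_latency_ms(ratio, MEMORY_MS, STORAGE_MS)
    print(f"hit ratio {ratio:.2f}: {latency:.4f} ms")
```

The model makes the expectation concrete: the closer the hit ratio gets to 1, the more the average read latency approaches memory speed.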
Before reviewing the performance results, it is important that we review the virtual machine, storage, and TPROC-C workload configurations. One important difference between the baseline virtual machine (no FFC mode) and the database configuration with FFC mode enabled is the memory allocated to the SGA. A key consideration is that the logical database size is smaller than the individual buffer cache. See the Oracle 19c Database Performance Tuning Guide for a complete list of considerations. In this case the SGA size is 784 GB to accommodate caching the entire database in the Oracle buffer cache. All other configuration parameters like vCPU, memory, and disk storage were identically configured.
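The sizing rule above can be expressed as a quick check. This is only a sketch: the function names are hypothetical, the roughly 500 GB database size is the figure reported for the TPROC-C schema in this post, and the 600 GB buffer cache value is a made-up example, because the post gives the SGA size but not the buffer cache size within it.

```python
def sga_headroom_gb(sga_gb: float, database_gb: float) -> float:
    """SGA memory left for other pools and growth once the database is cached."""
    return sga_gb - database_gb


def can_cache_fully(buffer_cache_gb: float, database_gb: float) -> bool:
    """FFC mode expects the logical database to be smaller than the buffer cache."""
    return database_gb < buffer_cache_gb


print(sga_headroom_gb(784, 500))   # 284
print(can_cache_fully(600, 500))   # True (600 GB buffer cache is hypothetical)
```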
Using memory technologies like Force Full Cache mode should be a key consideration for the Enterprise as the AMD EPYC processors enable the PowerEdge R7525 servers to support up to 4 TB of LRDIMM. Depending upon the database and its growth rate, this could support many small to medium-sized systems. The advantage for the business is the capability to accelerate the database by configuring a Dell EMC PowerEdge R7525 server and AMD processors with enough memory to cache the entire database.
Table 1: Virtual Machine configuration and SGA size
Oracle Force Full Cache Mode
This database storage configuration includes using VMware vSphere’s Virtual Machine File System (VMFS) and Oracle’s Automatic Storage Management (ASM) on Direct Attached Storage (DAS). The storage configuration is detailed in the table below. ASM Normal redundancy mirrors each extent providing the capability to protect against one disk failure.
Table 2: Storage and ASM configuration
We used HammerDB to create a TPROC-C database with 5,000 simulated warehouses, which generated approximately 500 GB of data, which was small enough to be loaded entirely in the buffer cache. Other HammerDB settings we used included those shown in this table:
Table 3: HammerDB: TPROC-C test configuration
Time Driver Script
Total Transactions per user
Minutes of Ramp Up Time
Minutes of Test Duration
Use All Warehouses
Number of Virtual Users
New Orders Per Minute (NOPM) is a metric that indicates the number of orders that were fully processed in one minute. This performance metric provides insight into the performance of the database system and can be used to compare two different systems running the same TPROC-C workload. The AMD EPYC processors combined with FFC mode delivered an outstanding 244% more NOPM than the baseline system. This is a powerful finding because it shows how tuning the hardware and software stack can accelerate database performance without adding more resources. In this case the optimal technology stack included AMD EPYC processors which, when combined with Oracle’s FFC mode, accelerated NOPM by 2.4 times the baseline.
Figure 1: New Orders Per Minute Comparison
What factors played a role in boosting performance? The Average Storage Response Time chart for the baseline test shows that the system’s average storage response time was 0.24 milliseconds. The goal of OLTP production systems is that most storage response times should be less than 1 millisecond, as this is an indication of healthy storage performance. Thus, the baseline system was demonstrating good performance; however, even with the minimal storage response times the system only achieved 169,481 NOPM.
With FFC mode enabled, the entire database resided in the database buffer cache. This resulted in fewer physical reads and faster average storage response times. Results show the average storage response time with FFC was less than half the baseline numbers at just 0.11 milliseconds, or 2.2 times faster than the baseline. With most of the I/O activity working in memory, the AMD EPYC processor cache and memory channel features provided a big boost in accelerating the database workload!
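The NOPM gain is described in this post both as 244% more than the baseline and as 2.4 times the baseline. The sketch below converts each phrasing into an absolute figure from the 169,481 NOPM baseline so the two readings can be compared. The function names are hypothetical, and the measured FFC value appears only in Figure 1, so neither output should be read as the reported result.

```python
BASELINE_NOPM = 169_481  # baseline result quoted in this post


def nopm_if_percent_more(baseline: float, percent_more: float) -> float:
    """Interpret the gain as 'baseline plus percent_more percent of baseline'."""
    return baseline * (1 + percent_more / 100)


def nopm_if_multiple(baseline: float, multiple: float) -> float:
    """Interpret the gain as 'multiple times the baseline'."""
    return baseline * multiple


print(round(nopm_if_percent_more(BASELINE_NOPM, 244)))  # 583015
print(round(nopm_if_multiple(BASELINE_NOPM, 2.4)))      # 406754
```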
Figure 2: Average Storage Response Time
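The response-time comparison comes down to a ratio of the two averages quoted in this post. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def speedup_factor(baseline_ms: float, improved_ms: float) -> float:
    """How many times faster the improved response time is than the baseline."""
    return baseline_ms / improved_ms


def reduction_percent(baseline_ms: float, improved_ms: float) -> float:
    """Percentage by which the response time dropped relative to the baseline."""
    return (1 - improved_ms / baseline_ms) * 100


print(round(speedup_factor(0.24, 0.11), 1))     # 2.2
print(round(reduction_percent(0.24, 0.11), 1))  # 54.2
```

Both figures describe the same change: a response time that is about 2.2 times faster is also about 54% lower than the baseline.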
The combination of AMD EPYC processors with Oracle’s Force Full Cache mode should provide extremely good performance for databases that are smaller than 4 TB. Our test results show an increase of 244% in New Orders per Minute and faster response time, meaning that this solution stack built on the PowerEdge R7525 can accelerate an Oracle database that fits the requirements of FFC mode. Every database system is different, and results will vary. But in our tests this AMD-based solution provided substantial performance.
Table 4: PowerEdge R7525 Configuration
2 x AMD EPYC 7543 32-Core processors @ 2800 MHz
16 x 128GB @ 3200 MHz (for a total of 2TB)
8 x Dell Express Flash NVMe P4610 1.6TB SFF (Intel 2.5 inch 8GT/s)
1 x Broadcom Gigabit Ethernet BC5720
1 x Broadcom Advanced Dual Port 25 Gb Ethernet
Related Blog Posts
The Dell PowerEdge C6615: Maximizing Value and Minimizing TCO for Dense Compute and Scale-out Workloads
Tue, 19 Sep 2023 18:00:49 -0000
In the ever-evolving landscape of data centers and IT infrastructure, meeting the demands of scale-out workloads is a continuous challenge. Organizations seek solutions that not only provide superior performance but also optimize Total Cost of Ownership (TCO).
Enter the new Dell PowerEdge C6615, a modular node designed to address these challenges with innovative solutions. Let's delve into the key features and benefits of this groundbreaking addition to the Dell PowerEdge portfolio.
- Maximizing Rack utilization: One of the primary challenges in the data center world is maximizing rack utilization. The Dell PowerEdge C6615 addresses this by offering dense compute options.
- Cutting-edge processors: High-performance processors are crucial for scalability and security. The C6615 is powered by a 4th Generation AMD EPYC 8004 series processor, ensuring top-tier performance.
- Total Cost of Ownership (TCO): TCO is a critical consideration that encompasses power and cooling efficiency, licensing costs, and seamless integration with existing data center infrastructure. The C6615 is designed to reduce TCO significantly.
Introducing the Dell PowerEdge C6615
The Dell PowerEdge C6615 is a modular node designed to revolutionize data center infrastructure. Here are some key highlights:
- Price-performance ratio: The C6615 offers outstanding price per watt for scale-out workloads, with up to a 315% improvement compared to a one-socket (1S) server with AMD EPYC 9004 Series server processor.
- Optimized thermal solution: It features an optimized thermal solution that allows for air-cooling configurations with up to 53% improved cooling performance compared to the previous generation chassis.
- Density-optimized compute: The C6615's architecture is tailored for scale-out WebTech workloads, offering exceptional performance with reduced TCO.
- High-speed NVMe storage: It provides high-speed NVMe storage for applications with intensive IOPS requirements, ensuring efficient performance.
- Efficient scalability: With 40% more cores per rack compared to the AMD EPYC 9004 Series server processors, the C6615 allows for quicker and more efficient scalability.
- SmartNIC: It includes a SmartNIC with hardware-accelerated networking and storage, saving CPU cycles and enhancing security.
To maximize efficiency and reduce environmental impact, the PowerEdge C6615 incorporates several key features:
- Power and thermal efficiency: The 2U chassis with four nodes enhances power and thermal efficiency, eliminating the need for liquid cooling.
- Flexible I/O options: It supports up to two PCIe Gen5 slots and one x16 PCIe Gen5 OCP 3.0 slot for network cards, ensuring versatile connectivity.
- Security: Security is integrated at every phase of the PowerEdge lifecycle, from supply chain protection to Multi-Factor Authentication (MFA) and role-based access controls.
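The density benefit of a 2U chassis holding four nodes can be illustrated with simple rack arithmetic. This is a rough sketch: the function name is hypothetical, and the 42U rack height is an assumption that ignores space reserved for switches and power distribution.

```python
def nodes_per_rack(rack_units: int, chassis_units: int,
                   nodes_per_chassis: int) -> int:
    """Number of server nodes that fit when the rack is filled with identical chassis."""
    return (rack_units // chassis_units) * nodes_per_chassis


# Assumed 42U rack filled with 2U chassis that hold four nodes each.
print(nodes_per_rack(42, 2, 4))  # 84
```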
In benchmark testing, the C6615 outperforms the competition:
- HPL Benchmark: It showcases up to a 335% improvement in performance per watt per dollar and a 210% increase in performance per CPU dollar compared to other 1S systems with the AMD EPYC 9004 Series server processor.
Figure 1. HPL benchmark performance
- SPEC_CPU2017 Benchmark: Results demonstrate up to a 205% improvement in performance per watt per dollar and a remarkable 128% increase in performance per CPU dollar compared to similar systems.
Figure 2. SPEC_CPU2017 benchmark performance
The seamless integration of the Dell PowerEdge C6615 into existing processes and toolsets is facilitated by comprehensive iDRAC9 support for all components. This ensures a smooth transition while leveraging the full potential of your server infrastructure.
Dell's commitment to environmental sustainability is evident in its use of recycled materials and energy-efficient options, helping to reduce carbon footprints and operational costs.
In conclusion, the Dell PowerEdge C6615 emerges as a leading dense compute solution, delivering exceptional value and unmatched performance. For more information, visit the PowerEdge Servers Powered by AMD site and explore how this innovative solution can transform your data center operations.
Note: Performance results may vary based on specific configurations and workloads. It's recommended to consult with Dell or an authorized partner for tailored solutions.
Author: David Dam
Dell Reinforces its TPCx-AI Benchmark Leadership using the 16G PowerEdge R6625 Hardware Platform at SF1000
Wed, 12 Jul 2023 18:52:17 -0000
On 06-13-2023, Dell Technologies published a TPCx-AI SF1000 result that was based on an 11 x Dell PowerEdge R6625 hardware platform powered by AMD Genoa processors. As of the publication date, Dell results held number one slots on the Top Performance and Price/Performance tables for TPCx-AI on SF3, SF100, SF300, and SF1000. These results reinforce Dell Technologies’ TPCx-AI benchmark leadership position; a testament to the great performance provided by its AI, ML, and DL solutions.
This blog presents the hardware platform that was tested, what was measured and what the results mean.
What TPCx-AI tests measure
TPCx-AI measures the end-to-end machine learning or data science platform using a diverse representative dataset scaling from 1 GB to 10 TB. The TPCx-AI benchmark assesses various aspects of AI training and inference performance, including data generation, model training, serving, scoring, and system scalability. The benchmark can be used across a wide range of different systems from edge to data center. It aims to provide a standardized and objective measure of AI performance across different platforms and configurations.
By using TPCx-AI, organizations and vendors can make informed decisions about the AI infrastructure that best suits their needs. The benchmark helps in understanding the system's capability to handle large-scale AI training workloads and can help optimize performance and resource allocation for AI tasks.
The TPCx-AI standard defines 10 use cases based on data science pipelines modeled on a retail business data center to evaluate the performance of artificial intelligence systems. The workload trains deep neural networks on large datasets using prominent machine learning frameworks such as TensorFlow. The benchmark measures:
- The total time taken to train a model for each use case to a specific level of accuracy
- The time taken for that model to be used for inference or serving
The blog, Interpreting the results of the TPCx-AI Benchmark, outlines the ten use cases, their data science models, and the benchmark phases.
System under test (SUT)
Figure 1 System Under Test (SUT).
Table 1 Software versions
Cloudera Data Platform (CDP)
Red Hat Enterprise Linux
8.7 (Master node)
Table 2 Primary metric scores
June 13, 2023
The three primary metrics in Table 2 are required for all TPC results. The top ten results, based on performance or price/performance at a particular SF category, are displayed in the tables of the respective benchmark standard categorized by the metric and SF. To compare any results, all three metrics must be disclosed in the body of the message. The TPC does not allow comparing TPCx-AI results from different SF categories. The blog, Interpreting the results of the TPCx-AI Benchmark, goes into the details of how the performance and price/performance metrics are calculated. The availability date is the date all the priced line items (SKUs) are available to customers and must be within 185 days of the submission date. For the performance metric, the higher the score the better. For price/performance, the lower the better.
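The 185-day availability rule can be checked with standard date arithmetic. The sketch below is illustrative only: the function names are hypothetical, and June 13, 2023 is used as the submission date on the assumption that it matches the publication date mentioned in this blog.

```python
from datetime import date, timedelta

MAX_AVAILABILITY_DAYS = 185  # limit stated in this blog


def latest_availability(submission: date) -> date:
    """Last availability date that still falls within the allowed window."""
    return submission + timedelta(days=MAX_AVAILABILITY_DAYS)


def availability_ok(submission: date, availability: date) -> bool:
    """True when all priced items are available within the window."""
    return availability <= latest_availability(submission)


submission_date = date(2023, 6, 13)          # assumed submission date
print(latest_availability(submission_date))  # 2023-12-15
```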
Table 3 Other metrics
Total system cost
Cloudera SEL Data Platform Private Cloud Base Edition
Red Hat Enterprise Linux 8.6/8.7
Physical storage divided by scale factor
Scale factor divided by physical memory
Main data redundancy mode
Replication 3, RAID 1
Number of servers
Total processors, cores, and threads
Number of streams
The metrics in Table 3 are required to be reported and disclosed in the Full Disclosure Report (FDR) and Executive Summary (ES). Except for the total system cost, these other metrics are not used in the calculation of the primary metrics but provide additional information about the system that was tested. For instance, the total system cost is the total cost of ownership (TCO) for one year. The redundancy modes provide the data protection mechanisms that were used in the configuration as required by the benchmark standard. The number of streams refers to the number of concurrent serving tests during the Throughput phase.
Benchmark run times
Table 4 Benchmark run times
06-07-2023 9:35:25 PM
06-08-2023 3:20:10 AM
Benchmark phase times
Table 5 Benchmark phase metrics
Power Serving 1
Power Serving 2
The seven benchmark phases and their metrics are explained in Interpreting the results of the TPCx-AI Benchmark, and are performed sequentially from data generation to throughput tests. In power training, models are generated and trained for each use case sequentially from UC1 to UC10. In power serving, the models obtained during the training phase are used to conduct the serving phase sequentially, one use case at a time. There are two power serving tests. The test that registers the longer time provides the TPST metric. The throughput phase runs multiple streams of serving tests concurrently. The greater the number of streams, the more the system resources are taxed. Typically, the number of streams is increased until TTTn+1 > TTTn (where n+1 refers to the next throughput test). The duration of the longest running stream (TTPUT) is used to calculate the throughput test metric TTT.
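Two of the selection rules described above can be sketched in a few lines. This is a simplified illustration rather than the benchmark kit's logic: the function names and all timing values are hypothetical, and the TTT values are taken as given instead of being derived from TTPUT.

```python
def power_serving_metric(serving_1_s: float, serving_2_s: float) -> float:
    """TPST comes from the power serving test that registers the longer time."""
    return max(serving_1_s, serving_2_s)


def choose_stream_count(ttt_by_streams: dict[int, float]) -> int:
    """Increase the stream count until the next throughput test metric gets worse."""
    counts = sorted(ttt_by_streams)
    best = counts[0]
    for nxt in counts[1:]:
        if ttt_by_streams[nxt] > ttt_by_streams[best]:
            break  # TTT(n+1) > TTT(n): stop adding streams
        best = nxt
    return best


# Hypothetical measurements, in seconds.
print(power_serving_metric(12.3, 12.9))                          # 12.9
print(choose_stream_count({1: 10.0, 2: 8.0, 3: 7.5, 4: 7.9}))    # 3
```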
Use case times and accuracy
Table 6 Use case times and accuracy
-1.0 >= -1
word_error rate <= 0.5
mean_squared_log_error <= 5.4
f1_score >= 0.65
mean_squared_log_error <= 0.5
matthews_corrcoef >= 0.19
median_absolute_error <= 1.8
accuracy_score >= 0.65
accuracy_score >= 0.9
accuracy_score >= 0.7
Table 6 shows the use case run times (in seconds) for each benchmark phase and the accuracy of the model that was used. For instance, the RNN model that was generated and trained for UC2 had a word_error rate of 0.4383 which was less (better) than the threshold error_rate of 0.5. The XGBoost model trained for UC8 was 74.99% accurate which was above and better than the 65% minimum accuracy threshold requirement.
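The pass/fail logic behind these accuracy criteria depends on the direction of each metric: error metrics must stay at or below the threshold, while score metrics must meet or exceed it. The sketch below covers only the two use cases discussed above; the dictionary layout and function name are hypothetical.

```python
import operator

# (metric name, comparison, threshold) for the two use cases discussed in the text.
THRESHOLDS = {
    "UC2": ("word_error_rate", operator.le, 0.5),   # lower is better
    "UC8": ("accuracy_score", operator.ge, 0.65),   # higher is better
}


def meets_threshold(use_case: str, measured: float) -> bool:
    """Return True when the measured value satisfies the use case criterion."""
    _metric, compare, limit = THRESHOLDS[use_case]
    return compare(measured, limit)


print(meets_threshold("UC2", 0.4383))  # True
print(meets_threshold("UC8", 0.7499))  # True
```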
Figure 2 Use case time by benchmark phase
TPCx-AI SF1000 results tables
Table 7 and Table 8 display the top TPCx-AI SF1000 results as of the publication of this blog.
Table 7 SF1000 top performance table
Table 8 Top price/performance table
Table 7 and Table 8 are similar. Of the four published results at SF1000, Dell Technologies’ hardware platforms hold the number 1, number 2, and number 3 positions on both the performance and price/performance tables. The main difference between the three top results is the processor generations:
- The number 1 result used 4th generation AMD Genoa processors
- The number 2 result used 3rd generation Intel Ice Lake processors
- The number 3 result used 2nd generation Intel Cascade Lake processors
- Dell dominates TPCx-AI top performance and price/performance tables at SF3, SF100, SF300, and SF1000.
- TPCx-AI performance improved greatly on newer generation Dell hardware platforms that have newer generation processors:
- There was a 60.71% performance improvement between hardware platforms powered by (14G) 2nd generation and (15G) 3rd generation processors.
- There was a 37.13% improvement between 3rd generation and (16G) 4th generation processors.
- TPCx-AI price/performance improved greatly between processor generations of the Dell 14G, 15G, and 16G hardware platforms:
- There was a 14.80% price/performance drop from hardware platforms powered by 2nd generation to 3rd generation processors.
- There was a 27.08% price/performance drop from 3rd generation to 4th generation processors.
- The form factor of the hardware platforms has reduced:
- The Dell 14G TPCx-AI SF1000 result used 2U servers
- The 15G and 16G results used 1U servers and scored better performance and price/performance
- Using NVMe data storage scored better price/performance metrics:
- The 14G result used hard drives
- The 15G and 16G results used more expensive NVMe data drives, and yet scored better price/performance metrics
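The generation-to-generation percentages above can be combined into an overall 14G to 16G figure. The sketch below assumes each percentage is measured relative to the immediately preceding generation, so the steps compound multiplicatively; the blog does not state the combined figures, and the function names are hypothetical.

```python
from math import prod


def compound_gain_pct(step_gains_pct: list[float]) -> float:
    """Overall improvement after applying successive percentage gains."""
    return (prod(1 + g / 100 for g in step_gains_pct) - 1) * 100


def compound_drop_pct(step_drops_pct: list[float]) -> float:
    """Overall reduction after applying successive percentage drops."""
    return (1 - prod(1 - d / 100 for d in step_drops_pct)) * 100


print(f"{compound_gain_pct([60.71, 37.13]):.2f}")  # 120.38
print(f"{compound_drop_pct([14.80, 27.08]):.2f}")  # 37.87
```

Under that assumption, performance slightly more than doubles across the two generations while price/performance falls by more than a third.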
This blog examined in detail the TPCx-AI performance result of the Dell 16G PE R6625 hardware platform. The result cemented Dell Technologies’ leadership positions on TPCx-AI performance and price/performance tables at SF1000, in addition to the leadership positions at SF3, SF100, and SF300. These results prove Dell Technologies’ leadership as a provider of high-performance AI, ML, and DL solutions based on verifiable performance data backed by a reputable, industry-standards performance consortium.
Nicholas Wakou, Nirmala Sundararajan; Interpreting the results of the TPCx-AI Benchmark; infohub.delltechnologies.com (February 2023).