AMD EPYC Milan Processors and FFC Mode Accelerate Oracle 244%
Thu, 06 May 2021 18:54:59 -0000|
Read Time: 0 minutes
Intriguing, right? The Oracle team at Dell Technologies recently configured a Dell EMC PowerEdge server equipped with the new AMD EPYC processors to test the performance of an Oracle 19c database feature called Force Full Cache (FFC) mode. When using FFC mode, the Oracle server attempts to cache the full database footprint in buffer cache memory. This effectively reduces the read latency from what would have been experienced with the storage system to memory access speed. Writes are still sent to storage to ensure durability and recovery of the database. What’s fascinating is that by using Oracle’s FFC mode, the AMD EPYC processors can accelerate database operations while bypassing most storage I/O wait times.
For this performance test our PowerEdge R7525 server was populated with two AMD EPYC 7543 processors with a speed of 2.8 GHz, each with 32 cores. There are several layers of cache in these processors:
- Zen3 processor core, includes an L1 write-back cache
- Each core has a private 512 KB L2 cache
- Up to eight Zen3 cores share a 32 MB L3 cache
Each processor also supports 8 memory channels and each memory channel supports up to 2 DIMMS. With all these cache levels and memory channels our hypothesis was that the AMD EPYC processors were going to deliver amazing performance. Although we listed the processor features we believe will most impact performance, in truth there is much more to these new processors that we haven’t covered. The AMD webpage on modern data workloads is an excellent overview. For a deep dive into RDBMS tuning, this white paper provides more great technical detail.
For the sake of comparison, we also ran an Oracle database without FFC mode on the same server. Both database modes used the exact same technology stacks:
- Oracle Enterprise Edition 19c (22.214.171.124.200414)
- Red Hat Enterprise Linux 8.2
- VMware vSphere ESXi 7.0 Update 1
We virtualized the Oracle database instance since that is the most common deployment model in use today. AMD and VMware are continuously working to optimize the performance of high value workloads like Oracle. In the paper, “Performance Optimizations in VMware vSphere 7.0 U2 CPU Scheduler for AMD EPYC Processors” VMware shows how their CPU scheduler achieves up to 50% better performance than vSphere 7.0 U1. As the performance gap narrows between bare metal and virtualized applications, the gains in agility with virtualization outweigh the minor performance overhead of a hypervisor. The engineering team performing the testing used VMware vSphere virtualization to configure a primary Oracle virtual machine that was cloned to a common starting configuration.
HammerDB is a leading benchmarking tool used with databases like Oracle, Microsoft SQL Server and others. The engineering team used HammerDB to generate a TPROC-C workload on the database VM. The TPROC-C benchmark is referred to as an Online Transaction Processing (OLTP) workload because it simulates terminal operators executing transactions. When running the TPROC-C workload the storage system must support thousands of small read and write request per minute. With a traditional configuration, Oracle’s buffer cache would only be able to accelerate a portion of the reads and writes. The average latency of the system will increase when more reads and writes go to storage system as the wait times are greater for physical storage operations than memory. This is what the team expects to observe with the Oracle database that is not configured for FFC mode. Storage I/O is continually getting faster but not nearly as fast as I/O served from memory.
Once the test tool warms the cache, most of the reads will be serviced from memory rather than from storage, providing what we hope will be a significant boost in performance. We will not be able to separate out the individual performance benefits of using AMD EPYC processors combined with Oracle’s FFC mode, however, the efficiencies gained via AMD caches, memory channels, and a VMware vSphere optimizations will make this performance test fun!
Before reviewing the performance results, it is important that we review the virtual machine, storage, and TPROC-C workload configurations. One important difference between the baseline virtual machine (no FFC mode) and the database configuration with FFC mode enabled is the memory allocated to the SGA. A key consideration is that the logical database size is smaller than the individual buffer cache. See the Oracle 19c Database Performance Tuning Guide for a complete list of considerations. In this case the SGA size is 784 GB to accommodate caching the entire database in the Oracle buffer cache. All other configuration parameters like vCPU, memory, and disk storage were identically configured.
Using memory technologies like Force Full Cache mode should be a key consideration for the Enterprise as the AMD EPYC processors enable the PowerEdge R7525 servers to support up to 4 TB of LRDIMM. Depending upon the database and its growth rate, this could support many small to medium- sized systems. The advantage for the business is the capability to accelerate the database by configuring a Dell EMC PowerEdge R7525 server and AMD processors with enough memory to cache the entire database.
Table 1: Virtual Machine configuration and SGA size
Oracle Force Full Cache Mode
This database storage configuration includes using VMware vSphere’s Virtual Machine File System (VMFS) and Oracle’s Automatic Storage Management (ASM) on Direct Attached Storage (DAS). The storage configuration is detailed in the table below. ASM Normal redundancy mirrors each extent providing the capability to protect against one disk failure.
Table 2: Storage and ASM configuration
We used HammerDB to create a TPROC-C database with 5,000 simulated warehouses which generated approximately 500 GB of data, whichwas small enough to be loaded entirely in the buffer cache. Other HammerDB settings we used included those shown in this table:
Table 3: HammerDB: TPROC-C test configuration
Time Driver Script
Total Transactions per user
Minutes of Ramp Up Time
Minutes of Test Duration
Use All Warehouses
Number of Virtual Users
New Orders Per Minute (NOPM) is a metric that indicates the number of orders that were fully processed in one minute. This performance metric provides insight into the performance of the database system and can be used to compare two different systems running the same TPROC-C workload. The AMD EYPC processors combined with FFC mode delivered an outstanding 244% more NOPM than the baseline system. This is a powerful finding because it shows how tuning the hardware and software stack can accelerate database performance without adding more resources. In this case the optimal technology stack included AMD EYPC processors which, when combined with Oracle’s FFC mode, accelerated NOPM by 2.4 times the baseline.
Figure 1: New Orders Per Minute Comparison
What factors played a role in boosting performance? The Average Storage Response Time chart for the baseline test shows that the system’s average storage response time was .24 milliseconds. The goal of OLTP production systems is that most storage response times should be less than 1 millisecond as this is an indication of healthy storage performance. Thus, the baseline system was demonstrating good performance; however, even with the minimal storage response times the system only achieved 169,481 NOPM.
With FFC mode enabled, the entire database resided in the database buffer cache. This resulted in fewer physical reads and faster average storage response times. Results show the average storage response time with FFC was less than half the baseline numbers at just .11 milliseconds, or 2.2 times faster than the baseline. With most of the I/O activity working in memory the AMD EYPC processor cache and memory channel features provided a big boost in accelerating the database workload!
Figure 2: Average Storage Response Time
The combination of AMD EYPC processors with Oracle’s Force Full Cache mode should provide extremely good performance for databases that are smaller than 4 TBs. Our test results show an increase in 244% in New Orders per Minute and faster response time, meaning that this solution stack built on the PowerEdge R7525 can accelerate an Oracle database that fits the requirements of FFC mode. Every database system is different, and results will vary. But in our tests this AMD-based solution provided substantial performance.
Table 4: PowerEdge R7525 Configuration
2 x AMD EPYC 7543 32-Core processors @ 2800 MHz
16 x 128GB @ 3200 MHz (for a total of 2TB)
8 x Dell Express Flash NVMe P4610 1.6TB SFF (Intel 2.5 inch 8GT/s)
1 x Broadcom Gigabit Ethernet BC5720
1 x Broadcom Advanced Dual Port 25 Gb Ethernet
Related Blog Posts
Joint engineering with AMD for SQL Server 2022
Mon, 14 Nov 2022 13:49:15 -0000|
Read Time: 0 minutes
In preparation for the PASS Data Community Summit, Dell Technologies has been heads down in our engineering labs testing some of the new features of SQL Server 2022. In this blog we will highlight a number of use cases, leveraging AMD EPYC™ 7003 Series Processors with 3D V-Cache. 3D V-Cache processors utilize AMD’s ground-breaking 3D Chiplet architecture with up to 768MBs of L3 cache per socket while providing socket compatibility with existing AMD EPYC™ 7003 platforms. AMD EPYC is optimized for performance with up to 64 cores and 4TB of memory per CPU.
SQL Server 2022 is the most Azure-enabled release of SQL Server, with continued innovation across performance, security, and availability. SQL Server 2022 is part of the Microsoft Intelligent Data Platform, which unifies operational databases, analytics, and data governance. The reference architecture below shows how an Azure Arc-enabled Azure Stack HCI platform helps to consolidate virtualized SQL Server workloads and get up and running quickly on SQL Server 2022.
SQL Server 2022 on Azure Stack HCI
Dell Azure Stack HCI provides a flexible, highly available, and cost-effective platform to host Online Transaction Processing (OLTP) workloads such as SQL Server 2022. This document provides customers with best practice recommendations on how to deploy SQL Server 2022 on a Dell Integrated System (DIS) for Azure Stack HCI. These best practices consider both performance and high availability.
We also evaluated some performance results and design best practices for a cluster of AMD EPYC™ 7003 Series Processors with 3D V-Cache based DIS for Azure Stack HCI. This cost-effective solution offers strong performance results, a low data center footprint, and a highly flexible configuration in terms of compute and storage.
Only Azure Stack HCI from Dell Technologies leverages Dell OpenManage Integration for Windows Admin Center to orchestrate full stack lifecycle management, enabling complex tasks to be completed in a fraction of the time.
Read the full paper here.
If you are joining Dell Technologies and AMD at the PASS Data Community Summit (November 15th-18th, please join us for breakfast session and get all the details on this solution and ask the experts all of the questions you may have.
Attending the PASS Community Data Summit? Register for the breakfast here.
SQL Server 2022 Data Analytics with Dell PowerEdge and Dell ECS
Data virtualization has become popular among large enterprises because unstructured and semi-structured data is everywhere and using this data is challenging. SQL Server 2022 PolyBase makes data virtualization possible for data scientists to use T-SQL for analytic workloads by querying data directly from S3-compatible object storage without separately installing client connection software.
Data analytic workloads can be CPU intensive and selecting the optimal CPUs for the data analytic servers can be challenging, time consuming, and expensive. Because running T-SQL queries for data analytics require quick response time, as in the previous document, here we leveraged the new Milan-X AMD EPYC processor with 3D V-Cache technology. The AMD 3D V-Cache technology is the first implementation of the AMD 3D Chiplet Architecture. 3D V-Cache product offers three times the L3 cache compared to standard 3rd Gen EPYC processors and keeps memory intensive compute closer to the core speeding up performance for database and analytics workloads. AMD 3rd Gen EPYC™ Processors with 3D V-Cache helps customers optimize core usage, license costs and total cost-of-ownership.
ECS, a modern object storage platform designed for both traditional and next-generation workloads, provides organizations with an on-premises alternative to public cloud solutions. Dell Technologies has been named a Leader in the 2022 Gartner® Magic Quadrant™ for Distributed File Systems and Object Storage for the seventh year in a row – a Leader every year since the commencement of this report. According to the Gartner report, Dell Technologies has also once again received the highest overall position for its Ability to Execute in the Leaders quadrant of the report (based on Bellcore component reliability modeling across all AX-nodes).
As organizations, both large and small, seek to gain an edge with their intelligent data estate, access to all types of datasets must be made available. SQL Server 2022 and Dell ECS is the preferred technology for querying the Data Lake using a T-SQL surface area. This combination of products and tools yields modern opportunities to store and manage different types of data on-premises and at public cloud scale.
Read the full paper here.
SQL Server 2022 Backup & Restore with Dell ECS S3 Object Storage
In the same paper above, we leveraged the same environment to showcase the new SQL Server 2022 capabilities to run backup and restore of databases leveraging object storage with Dell ECS. One of the benefits of being able to leverage object storage is the ability to move larger read-only tables outside of the SQL database. This reduced the footprint of the database and decreases the time it takes to backup and restore.
In this document we provide the configuration steps for creating the required credential inside the SQL Server Instance for connection to ECS as well as show the correct syntax for backing up and restoring a database.
Take the next step…
Microsoft Azure Stack HCI | Dell USA
Dell Technologies Solutions for Microsoft Data Platform | Dell USA
Dell Technologies Solutions for Microsoft Azure Arc | Dell USA
Have you sized your Oracle database lately? Understanding server core counts and Oracle performance.
Mon, 24 Oct 2022 14:08:36 -0000|
Read Time: 0 minutes
Oracle specialists at Dell Technologies are frequently engaged by customers to help with database (DB) server sizing. That is, sizing DB servers that are stand-alone or part of a VxRail or PowerFlex system.
Server sizing is important in the fact that it helps ensure the CPU core count matches your workload needs, provides optimal performance, and helps manage Oracle DB software licensing costs.
Understanding CPU frequency
When sizing a server for an Oracle DB, CPU frequency matters. By CPU frequency, we are referring to the CPU cycle time, which can basically be thought of as the time it takes for the CPU to execute a simple instruction, for example, an Addition instruction.
In this blog post, we will use the base CPU frequency metric, which is the frequency the CPU would run at if turbo-boost and power management options were disabled in the system BIOS.
Focus during the sizing process is therefore on:
- Single-threaded transactions
- The best host-based performance per DB session
Packing as many CPU cores as possible into a server will increase the aggregate workload capability of the server more than it will improve per-DB session performance. Moreover, in most cases, a server that runs Oracle does not require 40+ CPU cores. Using the highest base-frequency CPU will provide the best performance per DB session.
Why use a 3.x GHz frequency CPU?
CPU cycle time is essentially the time it takes to perform a simple instruction, for example, an Addition instruction. A 3.x GHz CPU cycle time is lower on the higher frequency CPU:
- A 2.3 GHz CPU cycle time is .43 ns
- A 3.4 GHz CPU cycle time is .29 ns
A 3.x GHz CPU achieves an approximate 33% reduction in cycle time with the higher base frequency CPU.
Although a nanosecond may seem like an extremely short period of time, in a typical DB working environment, those nanoseconds add up. For example, if a DB is driving 3,000,000+ logical reads per second while also doing all the other host-based work, speed may suffer.
An Oracle DB performs an extreme amount of host-based processing that eats up CPU cycles. For example:
- Traversing the Cache Buffer Chain (CBC)
- The CBC is a collection of hash buckets and linked lists located in server memory.
- They are used to locate a DB block in the server buffer cache and then find the needed row and column the query needs.
- SQL Order by processing
- Sorting and/or grouping the SQL results set helps the server memory create the results set in the order the user intended to see.
- Parsing SQL
- Before a SQL statement can be started, it must be parsed.
- CPU is used for soft and hard parsing.
- PLSQL is the procedural logic the application uses to run IF-THEN-ELSE, and DO-WHILE logic. It is the application logic on the DB server.
- Chaining a DB block just read from storage
- Once a new DB block is read from storage, it must be placed on the CBC and a memory buffer is “allocated” for its content.
- DB session logons and logoffs
- Session logons and logoffs allocate and de-allocate memory areas for the DB session to run.
- ACO – Advanced Compression Option
- ACO is eliminating identical column data in a DB block – once again, more CPU and RAM work to walk through server memory to eliminate redundant data.
- TDE – Transparent Data Encryption
- TDE encrypts and decrypts data in server memory when accessed by the user – on large data Inserts/Updates extra CPU is needed to encrypt the data during the transaction.
All of this host-based processing drives CPU instruction execution and memory access (remote/local DIMMs and L1, L2, and L3 caches). When there are millions of these operations occurring per second, it supports the need for the fastest CPU to complete CPU instruction execution in the shortest amount of time. Nanosecond execution time adds up.
Keep in mind, the host O/S and the CPU itself are:
- Running lots of background memory processes supporting the SGA and PGA accesses
- Using CPU cycles to traverse memory addresses to find and move the data the DB session needs
- Working behind the scenes to transfer cache lines from NUMA node to NUMA node, DIMM to the CPU core, and Snoopy/MESI protocol work to keep cache lines in-sync
In the end, these processes consume some additional CPU cycles on top of what the DB is consuming, adding to the need for the fastest CPU for our DB workload.
SLOB Logical Read testing:
The 3.x GHz recommendation is based on various Oracle logical read tests that were run with a customer using various Intel CPU models of the same family. The customer goal was to find the CPU that performs the highest number of logical read tests per second to support a critical application that predominantly runs single-threaded queries (such as no PQ – Parallel Query).
Each of the tests showed a higher base-frequency CPU outperformed a lower base-frequency CPU on a per-CPU core basis running SLOB logical read tests (i.e., no physical I/O was done, all host-based processing – only using CPU, Cache, DIMMs, and remote NUMA accesses).
In the tests performed, the comparison point across the various CPUs and their tests was the number of logical reads per second per CPU core. Moreover, the host was 100% dedicated to our testing and no physical I/O was done for the transactions to avoid introducing physical I/O response times in the results.
The following graphic depicts a few examples from the many tests run using a single SLOB thread (a DB session), and the captured logical read-per-second counts per test.
The CPU baseline was an E5-4627 v2, which has a 3.3GHz base frequency, but a 7.2GTs QPI and 4 x DDR3-1866 DIMM channels, versus the Skylake CPUs having a 10.2GTs UPI and 6 x DDR4-2666 DIMMs.
Why is there a variance in the number of logical reads per second between the test run of the same CPU model on the same server? Because SLOB runs a random read DB block profile which leads to more or less DB blocks being found on the local NUMA node the DB session was running on in the various tests. Hence, where fewer DB blocks were present, the remote NUMA node had to transfer DB blocks over the UPI to the local NUMA node, which takes time to complete.
If you are still wondering if Dell Technologies can help you find the right CPU and PowerEdge server for your Oracle workloads, consider the following:
- Dell has many PowerEdge server models to choose from, all of which provide the ability to support specific workload needs.
- Dell PowerEdge servers let you choose from several Intel and AMD CPU models and RAM sizes.
- The Oracle specialists at Dell Technologies can perform AWR workload assessments to ensure you get the right PowerEdge server, CPU model, and RAM configurations to achieve optimal Oracle performance.
If per-DB session, host-based performance is important to you and your business, then CPU matters. Always use the highest base-frequency CPU you can, to meet CPU power needs and maintain faster cycle times for Oracle DB performance.
If you need more information or want to discuss this topic further, a Dell Oracle specialist is ready to help analyze your application and database environment and to recommend the appropriate CPU and PowerEdge server models to meet your needs. Contact your Dell representative for more information or email us at firstname.lastname@example.org