We designed our use case to parallel the steps that customers take to deploy a SQL Server Big Data Cluster. Throughout the use case discussion, we describe important steps, design considerations, and outcomes. The discussion is not intended to outline step-by-step deployment actions but, instead, to provide guidance to make your Big Data Cluster solution successful.
Designing and configuring a resilient PowerFlex rack was our initial step. Because Big Data Clusters are business-critical, we designed a PowerFlex rack with multiple controller nodes and hyperconverged nodes so that there was no single point of failure. Performance was not a key consideration because PowerFlex enables massive scale-out with the addition of nodes. One of the many advantages of a PowerFlex system is that customers can choose between using a bare-metal infrastructure or virtualization. In this use case, we used VMware vSphere virtualization to increase manageability and security.
We created vSphere VMs to host Red Hat Enterprise Linux. The Red Hat Enterprise Linux operating system is certified for SQL Server 2019 Big Data Clusters, Docker Enterprise Edition, and Kubernetes. In this use case, we show the automation that we achieved using Kubernetes with the CSI plug-in for PowerFlex systems.
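To illustrate the kind of storage automation that a CSI plug-in makes possible, the following Python sketch builds a Kubernetes PersistentVolumeClaim manifest that requests a dynamically provisioned volume. It is illustrative only and is not the configuration from this use case: the claim name, the storage class name `powerflex-sc`, and the size are hypothetical placeholders, and the storage class is assumed to exist already and to be backed by the PowerFlex CSI driver.

```python
import json


def build_pvc(name: str, storage_class: str, size_gib: int) -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dictionary."""
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            # Hypothetical storage class; it must map to the CSI driver in your cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


if __name__ == "__main__":
    # Placeholder values; kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_pvc("bdc-data-0", "powerflex-sc", 100), indent=2))
```

When a claim like this is applied, Kubernetes asks the driver named by the storage class to create and attach the volume, so no manual volume provisioning is needed on the storage side.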
Deploying SQL Server 2019 Big Data Clusters in containers was our next step. For deployment steps, see How to deploy SQL Server Big Data Clusters on Kubernetes in Microsoft SQL Docs. We created a local private registry to manage our Big Data Cluster images, giving the database team the capability to update images.
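As a rough sketch of how a deployment can be pointed at a private registry, the following Python function rewrites the image settings in a deployment configuration dictionary. The layout (`spec.docker` with `registry`, `repository`, and `imageTag` keys) is an assumption modeled on the control.json file that the Big Data Cluster tooling generates; verify it against your own generated file. The registry address and tag shown are hypothetical placeholders.

```python
import copy


def use_private_registry(config: dict, registry: str, repository: str, image_tag: str) -> dict:
    """Return a copy of the config whose image settings point at a private registry."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    docker = updated.setdefault("spec", {}).setdefault("docker", {})
    docker.update(
        {"registry": registry, "repository": repository, "imageTag": image_tag}
    )
    return updated


if __name__ == "__main__":
    # Assumed shape of the image section; the tag is a placeholder, not a real release.
    base = {
        "spec": {
            "docker": {
                "registry": "mcr.microsoft.com",
                "repository": "mssql/bdc",
                "imageTag": "example-tag",
                "imagePullPolicy": "Always",
            }
        }
    }
    private = use_private_registry(
        base, "registry.example.local:5000", "mssql/bdc", "example-tag"
    )
    print(private["spec"]["docker"])
```

Working on a copy keeps the original configuration available for comparison, which helps when the database team updates image tags over time.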
When our Big Data Cluster was running, we populated the cluster with data from the TPC-H benchmark. The TPC-H benchmark is a decision-support benchmark. We chose it because it provided the capability to create 1 TB and 10 TB datasets. We used the 1 TB TPC-H dataset in our tests, and we used the 10 TB dataset at the end of the use case to validate a larger dataset in the Big Data Cluster.
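TPC-H sizes a dataset by a scale factor (SF), where SF 1 corresponds to roughly 1 GB of raw data. Assuming that the 1 TB and 10 TB datasets correspond to SF 1,000 and SF 10,000, the following Python sketch estimates nominal row counts per table from the cardinalities in the TPC-H specification. The lineitem figure is approximate because the specification defines its row count only approximately.

```python
# Nominal base-table cardinalities per unit of scale factor (SF), per the TPC-H specification.
ROWS_PER_SF = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,  # approximate; the exact count varies slightly
}

# These two tables have a fixed size regardless of scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def nominal_rows(scale_factor: int) -> dict:
    """Return the nominal row count of each TPC-H table at the given scale factor."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    rows = {table: per_sf * scale_factor for table, per_sf in ROWS_PER_SF.items()}
    rows.update(FIXED_ROWS)
    return rows


if __name__ == "__main__":
    for sf in (1_000, 10_000):  # assumed to match the 1 TB and 10 TB datasets
        print(f"SF {sf}")
        for table, count in sorted(nominal_rows(sf).items()):
            print(f"  {table:<9} {count:>15,}")
```

The spread in table sizes is what makes it natural to keep the large tables in the Big Data Cluster and place small tables such as nation and region in other databases.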
The next step was to use PolyBase to connect three disparate data sources. We ingested most of the TPC-H data into the Big Data Cluster and imported smaller tables from the same dataset into a SQL Server 2019 database and an Oracle 19c database. We could then use the TPC-H queries to show that PolyBase enables data virtualization: it allowed us to query three different data sources with T‑SQL.
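To make the data virtualization step more concrete, the following Python sketch generates the two kinds of T-SQL statements that PolyBase typically needs for a relational source: an external data source and an external table that maps to a remote table. All object names, the host, the Oracle service and schema names, and the column list are hypothetical, and the sketch assumes that a database scoped credential already exists. Column types must match what PolyBase expects for the remote schema, so treat the column list as a placeholder.

```python
from typing import Sequence, Tuple


def external_data_source_sql(name: str, location: str, credential: str) -> str:
    """Build a CREATE EXTERNAL DATA SOURCE statement for a relational source."""
    return (
        f"CREATE EXTERNAL DATA SOURCE {name}\n"
        f"WITH (LOCATION = '{location}', CREDENTIAL = {credential});"
    )


def external_table_sql(
    table: str,
    columns: Sequence[Tuple[str, str]],
    remote_name: str,
    data_source: str,
) -> str:
    """Build a CREATE EXTERNAL TABLE statement that maps to a remote table."""
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n".join(f"    {col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        f"WITH (LOCATION = '{remote_name}', DATA_SOURCE = {data_source});"
    )


if __name__ == "__main__":
    # Values are interpolated without escaping: use trusted identifiers only.
    print(external_data_source_sql(
        "OracleTpch",                              # hypothetical data source name
        "oracle://oracle-host.example.local:1521",  # hypothetical host and port
        "OracleCredential",                         # assumed to exist already
    ))
    print(external_table_sql(
        "dbo.nation_oracle",
        [("N_NATIONKEY", "INT NOT NULL"), ("N_NAME", "CHAR(25) NOT NULL")],
        "XE.TPCH.NATION",  # hypothetical service.schema.table name on the Oracle side
        "OracleTpch",
    ))
```

After such objects exist, an external table can be joined to local tables in an ordinary T‑SQL query, which is what allows one query to span several data sources.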
The outcome of setting up this Microsoft Big Data Cluster was the ability to report on data across our three databases. We tested our reporting capabilities in two ways. First, we ran our TPC-H queries without any acceleration. Then we created a Big Data Cluster data pool and cached the data from the Oracle database.
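A comparison of this kind, with and without caching, can be measured using a small timing harness such as the following Python sketch. It is not the tooling used in this use case. The `run_query` callable is a hypothetical stand-in for whatever client code submits a TPC-H query, and taking the median of a few repeats is an assumed choice to damp run-to-run noise.

```python
import statistics
import time
from typing import Callable


def median_seconds(run_query: Callable[[], object], repeats: int = 3) -> float:
    """Run the query several times and return the median elapsed wall-clock time."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical callable that executes one query end to end
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


if __name__ == "__main__":
    # Stand-in workloads; replace with calls that run the same query
    # against the uncached source and against the data pool cache.
    baseline = median_seconds(lambda: time.sleep(0.02))
    cached = median_seconds(lambda: time.sleep(0.01))
    print(f"speedup: {speedup(baseline, cached):.2f}x")
```

Running the same query text in both configurations keeps the comparison focused on the effect of the data pool cache rather than on differences between queries.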