The use case in this white paper shows how container technologies, combined with SQL Server 2019 Big Data Cluster, can simplify management and enable mining of large data volumes with minimal operational overhead. The use case demonstrates how you can access your organization’s data sphere for improved analytics on an agile platform that is designed for flexibility, automation, and orchestration.
We designed our use case to demonstrate the advantages of using SQL Server 2019 Big Data Cluster for analytic application development. The use case also demonstrates how Docker, Kubernetes, and the vSphere CSI driver accelerate the application development life cycle when they are used with the VxRail HCI platform. The lab environment for development and testing used four VxRail E560F nodes supported by the vSphere CSI driver. With this solution, developers can provision SQL Server Big Data Cluster in containerized environments without the complexities of traditional methods for installing databases and provisioning storage.
We chose Red Hat Enterprise Linux as the operating system for the vSphere VMs. Red Hat Enterprise Linux is certified for SQL Server 2019 Big Data Cluster, Docker Enterprise Edition, and Kubernetes. In our use case, we show how to automate the setup of Big Data Cluster using Kubernetes with the vSphere CSI driver on the VxRail system.
Before we tested our use case workloads, we deployed a SQL Server 2019 Big Data Cluster on Kubernetes. We first created a local private registry to manage our Big Data Cluster Docker images. The database team could then update and manage images for use in Kubernetes. For detailed deployment documentation from Microsoft, see How to deploy SQL Server Big Data Clusters on Kubernetes in Microsoft SQL Docs.
To complete the setup after our Big Data Cluster was running, we populated the cluster with data that follows the schema used in the TPC-H benchmark. The official TPC-H benchmark is a decision-support benchmark whose results require certification before publication. We chose to replicate only the data generation specification because it defines a schema and relationships that scale readily from 1 TB to 10 TB and beyond. We used a 1 TB TPC-H-like dataset in our tests.
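To give a sense of how the TPC-H data generation specification scales, the following Python sketch estimates table row counts for a given scale factor. It assumes the cardinalities published in the TPC-H specification (the lineitem count is approximate) and the common rule of thumb that scale factor 1 corresponds to roughly 1 GB of raw data, so scale factors 1000 and 10000 approximate the 1 TB and 10 TB datasets. The function and constant names are hypothetical and are not part of our test tooling.

```python
# Rows per table at scale factor 1, per the TPC-H specification.
# These tables grow linearly with the scale factor.
BASE_ROWS = {
    "lineitem": 6_000_000,  # approximate; the exact count varies slightly
    "orders": 1_500_000,
    "partsupp": 800_000,
    "part": 200_000,
    "customer": 150_000,
    "supplier": 10_000,
}

# These tables have a fixed size regardless of scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def tpch_row_counts(scale_factor):
    """Return estimated row counts per table for an integer scale factor."""
    counts = {table: rows * scale_factor for table, rows in BASE_ROWS.items()}
    counts.update(FIXED_ROWS)
    return counts


if __name__ == "__main__":
    # Assumption: scale factor 1 is about 1 GB of raw data.
    for label, sf in (("1 TB", 1000), ("10 TB", 10000)):
        print(f"Approximate row counts for a {label} dataset (SF={sf}):")
        for table, rows in tpch_row_counts(sf).items():
            print(f"  {table:<10} {rows:>18,}")
```

The lineitem table dominates the dataset at every scale factor, which is why it serves as the large "fact" table in a workload of this kind.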
We also configured PolyBase to query three external data sources using a join across the schemas. We loaded the largest “fact” table from the TPC-H data into the Big Data Cluster HDFS store. From the same dataset, we loaded smaller “dimension” tables, which were distributed across a SQL Server 2019 database, a MongoDB database instance, and an Oracle 19c database instance. We then showed how typical TPC-H summary queries can use PolyBase and data virtualization to query these different data sources, all with T-SQL.
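The following Python sketch illustrates the idea of data virtualization described above: once each source is exposed to the SQL Server master instance as a table, a single T-SQL statement can join them regardless of where the data lives. The schema names (hdfs_ext, mongo_ext, oracle_ext), the placement of each dimension table, and the query itself are assumptions made for illustration; they are not the objects or queries used in our tests. Only the column names come from the TPC-H schema.

```python
# Hypothetical mapping of TPC-H tables to the schema-qualified names under
# which they are visible in SQL Server. The placement is an assumption.
TABLE_LOCATIONS = {
    "lineitem": "hdfs_ext.lineitem",     # fact table in the HDFS store
    "orders": "dbo.orders",              # SQL Server 2019 database
    "customer": "mongo_ext.customer",    # MongoDB database instance
    "supplier": "oracle_ext.supplier",   # Oracle 19c database instance
}


def build_revenue_query(locations):
    """Build a TPC-H-style summary query that joins all four tables."""
    return (
        "SELECT c.c_mktsegment, s.s_nationkey,\n"
        "       SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue\n"
        f"FROM {locations['lineitem']} AS l\n"
        f"JOIN {locations['orders']} AS o ON l.l_orderkey = o.o_orderkey\n"
        f"JOIN {locations['customer']} AS c ON o.o_custkey = c.c_custkey\n"
        f"JOIN {locations['supplier']} AS s ON l.l_suppkey = s.s_suppkey\n"
        "GROUP BY c.c_mktsegment, s.s_nationkey;"
    )


if __name__ == "__main__":
    # Print the statement; sending it to the server is left out of this sketch.
    print(build_revenue_query(TABLE_LOCATIONS))
```

The generated statement contains nothing specific to HDFS, MongoDB, or Oracle. From the report author's point of view it is ordinary T-SQL, which is the practical benefit of querying through PolyBase.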
This setup allowed us to show how SQL Server 2019 Big Data Cluster technology with PolyBase provides the capability to report on data across different database technologies, using skills that the database team already has. We tested our reporting capabilities in two ways: first, we ran our TPC-H-like queries without any acceleration; second, we created a Big Data Cluster data pool and cached the data from the MongoDB database instance and the Oracle database instance. We also loaded a 10 TB version of the data to validate the ingestion of a larger dataset in Big Data Cluster.