This solution demonstrates a use case incorporating SQL Server, SQL Server Big Data Clusters, and an Oracle RDBMS to perform data analytics using data virtualization. Dell EMC PowerFlex systems provide the compute and storage resources for the entire data architecture supporting the use case.
This white paper describes the key steps of designing and building a SQL Server Big Data Cluster, with a focus on the recommended practices that help make your implementations successful. Our solution demonstrates a reliable and repeatable method to deploy our application services. It relies on automation and orchestration provided by VMware vSphere, Docker, and Kubernetes, although the virtualization layer is optional.
A recently released white paper from the SQL Server Solutions team at Dell Technologies explains the foundational concepts of this solution. That paper, SQL Server Containers on Linux, discusses the advantages of hosting databases in Docker containers to support application development. Building on the container momentum, this solution shows how PowerFlex systems with SQL Server Big Data Clusters and PolyBase can accelerate your big data programs through integration with our implementation of Kubernetes.
The Dell EMC PowerFlex system, the software foundation of our software-defined PowerFlex family, integrates with Kubernetes through the Container Storage Interface (CSI) plug-in. This integration automates the provisioning and management of containers that require persistent storage. It improves the productivity of developers and others by supporting quick and easy provisioning of a container with storage. Automation through Kubernetes and the CSI plug-in for PowerFlex has replaced many of the time-consuming manual processes of allocating compute and storage resources.
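As a rough illustration of what CSI-based provisioning looks like from the developer's side, the following Python sketch builds a Kubernetes PersistentVolumeClaim manifest. The storage class name `powerflex-sc`, the claim name, and the helper `build_pvc` are hypothetical and not taken from this paper; the actual storage class name depends on how the CSI plug-in was installed in your cluster.

```python
import json


def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dictionary.

    The storage class is what routes the request to a CSI driver;
    the name passed in here is an assumption, not a documented default.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    pvc = build_pvc("sql-data-claim", "powerflex-sc", "8Gi")
    print(json.dumps(pvc, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, because Kubernetes accepts manifests in JSON as well as YAML. When a pod references the claim, the CSI driver behind the storage class is expected to create and attach the volume without a separate storage request.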
In addition to documenting our approach for building and running a SQL Server Big Data Cluster implementation in our lab, we describe how we ingest tabular data, based on the schema definition developed for the TPC-H test suite, into our SQL Server Big Data Cluster. The large tables are loaded into the HDFS store on the cluster while the smaller tables are ingested into SQL Server and Oracle databases, so our tests can use queries that span all three data sources.
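To make the idea of a query that spans all three data sources more concrete, the following Python sketch generates the kind of T-SQL involved. It is a minimal sketch under stated assumptions, not the statements used in our lab: the table placement (`lineitem` in HDFS, `supplier` in Oracle, `nation` in SQL Server), the object names, the host, and both helper functions are hypothetical. The column names follow the TPC-H schema, and the external tables are assumed to exist already.

```python
def oracle_data_source_sql(name: str, host: str, port: int, credential: str) -> str:
    """Build a CREATE EXTERNAL DATA SOURCE statement for an Oracle source.

    All argument values are placeholders chosen by the caller.
    """
    return (
        f"CREATE EXTERNAL DATA SOURCE {name}\n"
        f"WITH (LOCATION = 'oracle://{host}:{port}', CREDENTIAL = {credential});"
    )


def cross_source_query(hdfs_table: str, oracle_table: str, sql_table: str) -> str:
    """Build one SELECT that joins tables backed by three different stores.

    Once external tables are defined, PolyBase lets the query reference
    them like local tables, so the join itself is ordinary T-SQL.
    """
    return (
        "SELECT n.n_name, SUM(l.l_extendedprice) AS revenue\n"
        f"FROM {hdfs_table} AS l\n"
        f"JOIN {oracle_table} AS s ON l.l_suppkey = s.s_suppkey\n"
        f"JOIN {sql_table} AS n ON s.s_nationkey = n.n_nationkey\n"
        "GROUP BY n.n_name;"
    )


if __name__ == "__main__":
    # Hypothetical object names, for illustration only.
    print(oracle_data_source_sql("OracleSource", "oracle-host", 1521, "OracleCredential"))
    print()
    print(cross_source_query("dbo.lineitem_hdfs", "dbo.supplier_ora", "dbo.nation"))
```

The point of the sketch is that no data is copied before the query runs: the join is written once against table names, and the location of each table is resolved through its external data source.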
Our use case demonstrates the capability of PolyBase data virtualization to query all three data sources without the need for ETL. We also implement a test use case that creates a Big Data Cluster data pool, which we use for data persistence and caching. Finally, we show how to cache data from the Oracle database to simulate how a developer or data scientist can improve analytics performance.