Strong evidence indicates that data analytics innovation drives business transformation. The pioneers on this journey have had to adopt new organizational structures and invest in new technologies. SQL Server Big Data Clusters is a new technology option that unites large data initiatives with easy access to the existing data sources in your datasphere. In this white paper, we showed that Big Data Clusters managed TPC-H datasets of two different sizes, 1 TB and 10 TB, in our tests. The platform handled the complete pipeline for these decision-support datasets, from ingestion through analysis, without difficulty.
Our test goals were aggressive so that we could show numerous capabilities of the decision-support data platform. The eight tables that comprise the TPC-H data were split across three sources: two stand-alone databases (Oracle and SQL Server) and the storage pool of the Big Data Cluster. To demonstrate PolyBase data virtualization, a single query had to access data from all three sources. Our SQL Server Solutions engineering team modified one of the TPC-H queries so that it selected tables from both databases and from the Big Data Cluster. The data virtualization test was successful: the query ran without error and returned results that joined all three data sources.
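To illustrate the shape of a query that spans three sources, the following minimal Python sketch assembles a simplified join. It is not the modified TPC-H query from our test. The schema names (`ext_oracle`, `ext_sqlserver`, `ext_hdfs`) and the assignment of tables to sources are hypothetical, and the sketch assumes that external tables over each source have already been defined.

```python
# Hypothetical mapping of three TPC-H tables to the three sources.
# The schema names and the table split are assumptions for illustration only.
TABLE_SOURCES = {
    "customer": "ext_oracle",     # external table over the stand-alone Oracle database
    "orders": "ext_sqlserver",    # external table over the stand-alone SQL Server database
    "lineitem": "ext_hdfs",       # external table over files in the storage pool
}


def qualified(table: str) -> str:
    """Return the schema-qualified name of a table, for example ext_oracle.customer."""
    return f"{TABLE_SOURCES[table]}.{table}"


def build_federated_query() -> str:
    """Build a simplified join across all three sources using TPC-H column names."""
    return (
        "SELECT c.c_name, SUM(l.l_extendedprice) AS revenue\n"
        f"FROM {qualified('customer')} AS c\n"
        f"JOIN {qualified('orders')} AS o ON o.o_custkey = c.c_custkey\n"
        f"JOIN {qualified('lineitem')} AS l ON l.l_orderkey = o.o_orderkey\n"
        "GROUP BY c.c_name;"
    )


def sources_used(query: str) -> set:
    """Return the set of source schemas that the query text references."""
    return {schema for schema in TABLE_SOURCES.values() if f"{schema}." in query}


if __name__ == "__main__":
    print(build_federated_query())
```

The point of the sketch is that the query author writes one ordinary join; which engine or file system holds each table is hidden behind the external table definitions.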
Further, we showed how a Big Data Cluster data pool can accelerate data access for demanding reporting and data analytics use cases. To test the capability of Big Data Cluster data pools, we ingested the tables from the stand-alone SQL Server instance into the data pool and modified a subset of TPC-H queries to use the tables in the data pool. More than half of the modified queries had lower execution times, with no specific performance tuning.
Success in ingesting Big Data and using data virtualization with Big Data Cluster brings new value to customers looking for better tools for data analysis. Business analysts and developers gravitate to a platform that simplifies and centralizes access to and analysis of the organization’s datasphere. Big Data Clusters enable IT to simplify management by consolidating Big Data and data virtualization on one platform with a proven set of tools. For organizations that are familiar with SQL Server, using Big Data Clusters is a natural extension into Big Data and data virtualization. Organizations that are new to Big Data analytics will find that SQL Server 2019 provides a single data management and analysis platform for all the organization’s needs, from structured to semi-structured to unstructured data.
To help ensure an organization’s success on the digital transformation journey, IT needs a highly scalable infrastructure and service automation. We showed how Docker containers, Kubernetes, and the VxFlex CSI plug-in, in combination, enable fast and easy provisioning of the Big Data Cluster services. The initial installation of a Big Data Cluster took approximately 3 hours. With automation, subsequent refreshes took less than 30 minutes.
Key to this increase in deployment speed was the capability to seamlessly provision persistent storage on the VxFlex array. Without automated storage provisioning, manual storage administration would have been required for each of the Big Data Cluster services: the SQL Server master instance, data pool, compute pool, and storage pool.
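The general mechanism behind automated provisioning can be sketched as follows: each service requests storage through a Kubernetes PersistentVolumeClaim that names a storage class, and the CSI driver behind that class creates the volume on demand. The minimal Python sketch below builds one such claim per service. The storage class name `vxflex-sc`, the claim names, and the 100Gi size are hypothetical, and a real Big Data Cluster deployment would typically generate its own claims from its deployment configuration rather than from hand-written manifests.

```python
import json

# The four Big Data Cluster services named above; these short labels are illustrative.
SERVICES = ["master", "data-pool", "compute-pool", "storage-pool"]


def make_claim(service: str, storage_class: str = "vxflex-sc", size: str = "100Gi") -> dict:
    """Build a PersistentVolumeClaim manifest for one service.

    The storage class name and size are assumed values, not the ones used in our test.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{service}-data"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def make_claims() -> list:
    """Build one claim per service so that no volume has to be created by hand."""
    return [make_claim(service) for service in SERVICES]


if __name__ == "__main__":
    # Kubernetes accepts JSON manifests as well as YAML.
    print(json.dumps(make_claims(), indent=2))
```

Because every claim refers to the same storage class, adding or refreshing a service requires no separate storage request to an administrator.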
Our testing concluded that customers can virtualize Docker containers and achieve some important benefits, including the capability to securely isolate a Big Data Cluster instance on a shared VxFlex solution.
Organizations that want to become more agile in developing business solutions using insights from data analytics should assess how their use cases and IT transformation goals align with SQL Server 2019 and the Big Data Cluster services on the Dell EMC VxFlex array. Microsoft’s design for Big Data Cluster services that are hosted in containers is aligned with the direction that most organizations see as critical for increasing the speed of innovation. The integration and automation benefits that come from the combined use of containers, Kubernetes, and the VxFlex array provide value both immediately and into the foreseeable future.