SQL Server and similar databases were designed primarily for online transaction processing (OLTP) and are scale-up systems. In a scale-up system, performance gains come from adding compute and memory resources to the host server or migrating to a larger server. In contrast, a scale-out database system uses multiple networked servers with storage to distribute data and data processing across a cluster. Scale-out systems are designed to handle big data workloads that the traditional scale-up approach does not manage well.
With SQL Server 2019, Microsoft offers an option for hosting scale-out database services. SQL Server 2019 Big Data Cluster is a new scale-out big data solution that combines SQL Server, Spark, and HDFS across a cluster of servers. The SQL Server master instance coordinates connectivity, scale-out query management, metadata, and machine learning. The master instance is a fully functional SQL Server instance that exposes a Tabular Data Stream (TDS) endpoint; TDS is the application-level protocol used by tools such as Azure Data Studio and SQL Server Management Studio.
The master instance contains the scale-out query engine that pushes computations to the compute pool. In Big Data Cluster, the compute pool processes a query across its SQL Server instances. Distributing the work in this way enables large datasets to be read in parallel, reducing the time required to return results. Management of scale-out queries and other metadata is maintained in the master instance.
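The idea behind pushing computation to a compute pool can be illustrated with a conceptual sketch (this is plain Python standing in for the cluster, not Big Data Cluster code): each simulated compute instance aggregates only its own partition of the data, and the partial results are then merged, much as the master instance merges results from the compute pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset standing in for a large table: (customer_id, order_total) rows.
rows = [(i % 10, float(i)) for i in range(1_000)]

def partial_sum(partition):
    """Each simulated compute instance aggregates only its own partition."""
    total = 0.0
    for _, amount in partition:
        total += amount
    return total

# Split the table into 4 partitions, one per simulated compute instance.
partitions = [rows[i::4] for i in range(4)]

# The partitions are scanned in parallel rather than one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

# The coordinating instance merges the partial aggregates into the final result.
grand_total = sum(partials)
print(grand_total)  # same answer as a single-node scan: 499500.0
```

The key property is that the merged result is identical to a single-node scan; only the elapsed time changes as partitions are processed concurrently.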
Metadata in the master instance includes:
Beyond the SQL engine and Big Data Cluster management, SQL Server 2019 integrates additional services such as machine learning. Machine learning is the practice of using data to train predictive models. For example, a model can be trained to distinguish legitimate email from spam, and that model can then be used to improve an email service by filtering most spam out of users' inboxes.
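To make the spam example concrete, the following is a minimal, self-contained sketch of the training-then-classifying pattern, using a tiny naive Bayes classifier over a toy corpus (the corpus and word counts are illustrative assumptions; production systems train on far larger datasets):

```python
from collections import Counter
import math

# Tiny labeled corpus; real systems train on millions of messages.
training = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow with the team", "ham"),
]

# Training step: count word frequencies per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def classify(text):
    """Naive Bayes with add-one smoothing over the training vocabulary."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free money"))   # -> spam
print(classify("team meeting tomorrow"))   # -> ham
```

The trained model (the word counts) is what a service would deploy to score incoming mail.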
Machine Learning Services is an add-on feature to the master instance. The feature is installed by default and enables developers to use Java, R, and Python code in SQL Server. This capability provides a pathway for data scientists to use Spark and HDFS tools with the SQL Server master instance, as shown in the following figure:
Figure 3. Machine Learning Services installed on SQL Server master instance
Apache Spark services are also available in SQL Server 2019. Spark is a general-purpose cluster computing system designed for scale-out data processing performance. Through its APIs, Spark supports a broad set of programming languages, including Java, Scala, Python, and R. Spark uses Hadoop's client libraries to access HDFS data nodes, so it is fully compatible with HDFS scale-out storage. Within Big Data Cluster, Spark provides machine learning (MLlib), graph processing (GraphX), stream processing (Spark Streaming), and a full range of traditional ETL-style workloads. As an option, customers can create a dedicated pool for Spark analytics, which can be beneficial if they plan to use Spark services extensively.
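Spark's programming model is built around transformations such as map and flatMap and actions such as reduce. As a rough, stdlib-only illustration of that style (this does not use Spark itself; in Spark the input would come from HDFS and the steps would run across the cluster), a word count looks like:

```python
from functools import reduce
from collections import Counter

# In Spark, these lines would be read from an HDFS file; here, an in-memory stand-in.
lines = ["spark runs on a cluster", "hdfs stores data on a cluster"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]

# map: turn each word into a one-element count.
pairs = map(lambda w: Counter({w: 1}), words)

# reduceByKey-style merge: combine the per-word counts into totals.
totals = reduce(lambda a, b: a + b, pairs, Counter())

print(totals["cluster"])  # -> 2
```

In real Spark, each of these stages is distributed across executors and the data never needs to fit on one machine; the shape of the pipeline is the same.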
In addition to the master instance, SQL Server 2019 Big Data Cluster includes several scale-out services. For example, compute pools enable offloading distributed query processing to a dedicated scale-out service. In our lab setup, the Big Data Cluster compute pool has one or more SQL Server compute instances running in containers across the VxRail nodes. Other examples of how the compute pool accelerates data processing include:
The following figure shows the compute pool configuration:
Figure 4. Compute pool configuration
A SQL Server 2019 Big Data Cluster storage pool is a group of pods, each hosting the following services: the SQL Server engine, an HDFS data node, and Spark. An HDFS data node stores data and replicates it with other nodes in the storage pool. Because HDFS replicates data in most architectures, traditional disk-based RAID is not needed. The HDFS data nodes service all read and write operations to the pool.
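The reason RAID becomes unnecessary is that HDFS places copies of each block on multiple data nodes (the HDFS default replication factor is 3), so losing a node loses no data. A conceptual sketch of that placement behavior (the round-robin policy and node names here are illustrative, not HDFS's actual placement algorithm):

```python
# Simulate HDFS-style block replication: each block is copied to
# `replication` distinct data nodes, so losing one node loses no data.
REPLICATION = 3  # HDFS default replication factor
nodes = ["datanode-0", "datanode-1", "datanode-2", "datanode-3"]

def place_block(block_id, nodes, replication=REPLICATION):
    """Illustrative round-robin placement of a block's replicas on distinct nodes."""
    return [nodes[(block_id + r) % len(nodes)] for r in range(replication)]

placement = {b: place_block(b, nodes) for b in range(8)}

# Every block survives the loss of any single node: at least two replicas remain.
failed = "datanode-2"
survivors = {b: [n for n in replicas if n != failed]
             for b, replicas in placement.items()}
assert all(len(r) >= 2 for r in survivors.values())
print(placement[0])  # -> ['datanode-0', 'datanode-1', 'datanode-2']
```

Replication trades raw capacity for fault tolerance at the file-system level, which is why per-disk RAID protection is redundant in this architecture.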
The SQL Server master instance can use the local SQL engine to access the HDFS data nodes, providing the capability to use T-SQL syntax with Big Data Cluster. Organizations can thus reuse existing T-SQL reports and their developers' T-SQL experience; developers do not have to learn a new programming language. The following figure shows the storage pool architecture as provisioned by Kubernetes in pods:
Figure 5. Storage pool architecture
A SQL Server 2019 Big Data Cluster data pool consists of one or more SQL Server instances. The data pool can be used to ingest data from external sources or to cache data from the storage pool. The data pool is controlled by the SQL Server master instance: Data Definition Language (DDL) is used to manage the data pool and its objects, and Data Manipulation Language (DML) is used to manage the data itself. The primary use of the data pool is to offload data processing from the SQL Server master instance. The data pool accelerates processing by sharding (partitioning) data across the SQL Server instances in the pool. A shard is a horizontal partition of the data, held in a separate container on one of the database instances in the pool; data is generally not duplicated or replicated, so each shard holds the only copy of its portion of the data. Sharding improves performance by enabling parallel reads across multiple database instances, reducing the time required to complete a query.
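The mechanics of sharding can be sketched in a few lines (a conceptual illustration, assuming simple hash-based routing; the actual distribution scheme inside the data pool is managed by Big Data Cluster):

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3  # simulated SQL Server instances in the data pool

def shard_for(key, num_shards=NUM_SHARDS):
    """Deterministic routing: the same key always maps to the same shard."""
    return zlib.crc32(str(key).encode()) % num_shards

# Distribute rows; each row lives on exactly one shard (no duplication).
rows = [f"customer-{i}" for i in range(9)]
shards = defaultdict(list)
for row in rows:
    shards[shard_for(row)].append(row)

# Every row is stored exactly once across the pool...
assert sum(len(s) for s in shards.values()) == len(rows)
# ...and routing is stable, so lookups go straight to the right instance.
assert shard_for("customer-4") == shard_for("customer-4")
```

Because each instance holds a disjoint slice of the data, a query can fan out to all shards at once and merge the partial results, which is where the parallel-read speedup comes from.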