The Docker administrator downloads the latest Big Data Cluster image from the Microsoft Container Registry to the local private registry. The Docker administrator collaborates with the SQL Server administrator to ensure that all stated requirements for this Big Data Cluster image are met.
Note: Access to high-quality container images from a trusted source can save many hours of labor that are typically required to create and manage images that are built locally from Docker files. Always check requirements before attempting to deploy a container image.
To download the image, the Docker administrator runs the following Docker pull command:
$ docker pull mcr.microsoft.com/mssql/bdc/<SOURCE_IMAGE_NAME>:<SOURCE_DOCKER_TAG>
This Docker pull command shows how to manually pull images from trusted public registries. Because we were performing an offline installation from our local private registry, we used a Python script to automate pulling all the required images.
For details about how to use a Python script to automate this work, see Perform an offline deployment of a SQL Server big data cluster in Microsoft SQL Docs. The key step is to download the script, using the Bash shell, with Curl:
curl -o push-bdc-images-to-custom-private-repo.py "https://raw.githubusercontent.com/Microsoft/sql-server-samples/master/samples/features/sql-big-data-cluster/deployment/offline/push-bdc-images-to-custom-private-repo.py"
To manually populate the local private Docker registry with the Big Data Cluster, the Docker administrator runs the following Docker push command:
$ docker push localhost:5000/<SOURCE_IMAGE_NAME>:<SOURCE_DOCKER_TAG>
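Before the push, an image that was pulled from the Microsoft Container Registry must be re-tagged so that its name refers to the local private registry:
$ docker tag mcr.microsoft.com/mssql/bdc/<SOURCE_IMAGE_NAME>:<SOURCE_DOCKER_TAG> localhost:5000/<SOURCE_IMAGE_NAME>:<SOURCE_DOCKER_TAG>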
To automate populating the local registry, run the Microsoft Python script with the Linux command:
sudo python push-bdc-images-to-custom-private-repo.py
Note: If you customize the Big Data Cluster container image (we did not), save the base image or images and any customization to the local private registry with appropriate annotations, if required for your use case.
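We did not customize images, but as an illustration only, a customized container could be committed as a new image with a label annotation and pushed to the local registry. The container ID, image name, tag, and label text here are placeholders:
$ docker commit --change 'LABEL customization="<description>"' <container_ID> localhost:5000/<CUSTOM_IMAGE_NAME>:<CUSTOM_DOCKER_TAG>
$ docker push localhost:5000/<CUSTOM_IMAGE_NAME>:<CUSTOM_DOCKER_TAG>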
A key question related to the implementation of this Big Data Cluster environment concerns storage provisioning. In a bare-metal infrastructure, Kubernetes provides the control plane for storage provisioning. When vSphere is part of the application stack, does that control plane remain with Kubernetes, or does it move to vSphere?
vSphere Storage Policy Based Management (SPBM) provides a single, unified control plane for storage that consumers such as Kubernetes can use. SPBM is an abstraction layer that unifies disparate storage services under a universal framework for vSphere. SPBM enables vSphere administrators to manage storage in a container environment, as shown in the following figure:
Figure 9. vSphere SPBM
The management process includes creating StorageClass definitions for dynamic storage provisioning. The StorageClass parameters are contained in a YAML file; for VMFS and NFS datastores, these parameters include the vSphere storage policy name and the file system type, as shown in the following example. The use of StorageClass parameters enables granular control of persistent storage provisioning.
# cat SQLBDC-storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: sqlbdc-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "Space-Efficient"
  fstype: ext4
# kubectl get storageclass sqlbdc-sc
NAME        PROVISIONER              AGE
sqlbdc-sc   csi.vsphere.vmware.com   5s
After the storage classes have been defined, Kubernetes can create PersistentVolumes for Big Data Cluster.
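Big Data Cluster creates its own PersistentVolumeClaims during deployment, based on the storage class that is named in the deployment configuration. As a quick check that dynamic provisioning works, a test claim can be submitted against the new storage class; the claim name and size in this sketch are illustrative only:
# Submit a test claim against the sqlbdc-sc storage class.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sqlbdc-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: sqlbdc-sc
  resources:
    requests:
      storage: 10Gi
EOF

# Verify that a PersistentVolume was dynamically provisioned and bound.
kubectl get pvc sqlbdc-test-pvc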
Note: For more information about using vSphere to dynamically provision storage in a Kubernetes environment, see Dynamic Provisioning on GitHub and Storage Policy Based Management in VMware Docs.
Big Data Cluster installation prerequisites include installing tools such as Python, azdata, and kubectl. For detailed installation steps, see How to deploy SQL Server Big Data Clusters on Kubernetes in Microsoft SQL Docs.
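As a sketch of installing the prerequisite tools on an Ubuntu administration host, assuming that the Microsoft (packages.microsoft.com) and Kubernetes package repositories have already been registered:
# Install Python, the azdata CLI, and kubectl from the registered repositories.
sudo apt-get update
sudo apt-get install -y python3 python3-pip azdata-cli kubectl

# Verify the installed tools.
azdata --version
kubectl version --client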
To deploy the cluster, we initialized a custom deployment configuration profile and then ran the create command:
azdata bdc config init --source bdc-test-profile --target custom
azdata bdc create --config-profile custom --accept-eula yes
In approximately 15 to 30 minutes, the following notification is displayed in the window where the installation was started, indicating that the controller pod is up and running:
Cluster controller endpoint is available at <IP_address>:<port_number>
Cluster control plane is ready.
When the entire installation is complete, the following message is displayed:
Cluster deployed successfully.
The use of a local private registry requires more preparation but significantly accelerates the deployment of Big Data Cluster. In contrast, an installation using the Microsoft Container Registry requires pulling all the image files over the Internet, which might take hours for every Big Data Cluster installation.
Useful commands for checking the health of the cluster after installation are shown below.
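A minimal set of health checks, assuming the default cluster namespace of mssql-cluster (adjust the namespace to match your deployment), is:
# Log in to the cluster if a session is not already active from the deployment.
azdata login

# Report the overall health of the controller, master instance, and pools.
azdata bdc status show

# List the service endpoints that the cluster exposes.
azdata bdc endpoint list

# Show the Kubernetes pods and services that back the cluster.
kubectl get pods -n mssql-cluster
kubectl get svc -n mssql-cluster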
Figure 10. Big Data Cluster services and corresponding endpoints
The following figure shows the status of SQL Server 2019 Big Data Cluster from Azure Data Studio:
Figure 11. Big Data Cluster status in Azure Data Studio
The simplicity of the automated installation of Big Data Cluster belies the significant value that the automation provides. The application architecture is complex, consisting of many tools and software stacks that might take days, if not weeks, to install manually. In this step, coordination of azdata, Kubernetes, and Docker enables automated installation of Big Data Cluster in minutes or hours. This time savings reflects one of the core benefits of containers: the ability to provision Big Data Cluster and other applications quickly.
Kubernetes and azdata combined are the automation and orchestration engines. When the Big Data Cluster installation is complete, the containers, configuration, and storage are provisioned, and the cluster is in a ready state. This automation and orchestration move Big Data Cluster into a self-serve, on-demand platform. The IT organization can then easily stand up multiple Big Data Clusters for different teams within the organization.
Some Big Data Clusters, including dev/test Big Data Clusters that are regularly refreshed, might only be needed for a few months. Docker containers are designed to be ephemeral—that is, they can be stopped and destroyed easily, and new containers can be built from the same image and put in place with minimal setup and configuration. The ephemeral nature of Docker containers is ideal for dev/test scenarios, where most developers and other team members use the containers for only a short time.
You can ingest data into Big Data Cluster through any of the following methods, each of which is described later in this section: the Curl command line, SQL Server and T-SQL, and Spark jobs.
The optimal method for ingesting data depends on the amount of data, the type of data, and the target location of the data. For example, if the goal is to ingest files into the storage pool (HDFS data node), the Curl command line works well. In contrast, SQL Server and Spark jobs work well if the goal is to ingest data into the data pool. In this use case, the goal was to ingest approximately 1 TB of TPC-H data into the Big Data Cluster storage pool, with the smaller tables going into stand-alone SQL Server, Oracle, and MongoDB databases.
TPC-H data uses eight distinct tables. The larger tables with more rows were ingested into the Big Data Cluster storage pool. The medium-size tables were ingested into a stand-alone SQL Server database unrelated to the Big Data Cluster configuration. Finally, the smallest tables were ingested into the MongoDB and Oracle databases. The goal was to place the largest tables in Big Data Cluster.
The following table summarizes where the TPC‑H data was ingested:
Table 2. TPC-H data locations after ingestion
Data location | Table | Scale factor (SF) * base rows | Table size
Big Data Cluster storage pool (HDFS) | PARTSUPP | SF*800,000 | Large
Big Data Cluster storage pool (HDFS) | CUSTOMER | SF*150,000 | Large
Big Data Cluster storage pool (HDFS) | LINEITEM | SF*6,000,000 | Large
Big Data Cluster storage pool (HDFS) | ORDERS | SF*1,500,000 | Large
Big Data Cluster data pool (SQL Server) | PART | SF*200,000 | Medium
Big Data Cluster data pool (SQL Server) | SUPPLIER | SF*10,000 | Medium
External data source—MongoDB | NATION | 25 | Small
External data source—Oracle | REGION | 5 | Small
The scale factor (SF) in the table refers to the size of the TPC-H database in gigabytes. In our use case, the database size was 1,000 GB (1 TB); thus, the scale factor is 1,000. To calculate the number of rows in a table, multiply the SF by the base row count. For example, for the first table, PARTSUPP:
1,000 (SF) * 800,000 = 800,000,000 rows
For more information, see TPC Benchmark H.
A key benefit of Big Data Cluster is the ability to easily ingest big data. In our use case, we used Curl to ingest approximately 1 TB of decision-support data. The tutorials in Microsoft SQL Docs enabled our SQL Server experts to quickly start ingesting data. In testing, we took one extra step to optimize performance by moving from comma-separated value (CSV) files in HDFS to Parquet files, as described in Using Curl to ingest data.
By placing the medium-size tables in a stand-alone SQL Server database and the smaller tables in the MongoDB and Oracle databases, our data virtualization tests validate that PolyBase can query data in place across all of these external systems without moving or transforming it.
Curl is a command-line tool for transferring data over a broad set of protocols. With Big Data Cluster, Curl can put local files directly into HDFS through the cluster's Knox gateway, which is the approach that we used in our lab testing.
We cleansed the TPC-H data and saved it to a CSV file. We then used a PowerShell script to dynamically create and run Curl to ingest the CSV file into HDFS. Part of the script creates an HDFS directory:
curl -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/webhdfs/v1/%DIR_NAME%/csv/PART?OP=MKDIRS"
where %KNOX_ENDPOINT% is the IP address and port of the Big Data Cluster Knox gateway, %KNOX_PASSWORD% is the password of the gateway root user, and %DIR_NAME% is the name of the target directory in HDFS.
We then put the CSV files into the HDFS directory:
curl -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/webhdfs/v1/%DIR_NAME%/csv/part/final_F00001_part.csv?op=create&overwrite=true" -H "Content-type: application/octet-stream" -T "C:\stage\ps\v5\final\part\final_F00001_part.csv"
In the last Curl command, the file name is final_F00001_part.csv because the CSV file was split into eight parts and placed in the HDFS storage pool. For each of the eight files, the number in the file name was incremented (00001–00008). Splitting the file into eight parts enabled distribution of the data across the nodes in the storage pool, which optimizes performance.
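We drove these Curl calls from a PowerShell script. A rough Bash equivalent of the upload loop, assuming that the Knox variables are exported as shell variables and that the eight CSV parts are in the current directory, is:
# Put the eight CSV parts (final_F00001_part.csv through final_F00008_part.csv)
# into the part directory in HDFS through the Knox gateway.
for i in 1 2 3 4 5 6 7 8; do
  curl -L -k -u "root:${KNOX_PASSWORD}" -X PUT \
    "https://${KNOX_ENDPOINT}/gateway/webhdfs/v1/${DIR_NAME}/csv/part/final_F0000${i}_part.csv?op=create&overwrite=true" \
    -H "Content-type: application/octet-stream" \
    -T "final_F0000${i}_part.csv"
done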
In addition, to improve performance and save capacity, we converted the CSV files in HDFS to Parquet files. Parquet is an Apache Software Foundation columnar storage format. Because Parquet stores data in columnar form, it can compress and encode the data efficiently.
A second way to ingest data is to bring it into the data pool by using SQL Server and T‑SQL. The first step in this process is to create an external table in the data pool. An external table is an object that enables PolyBase to access data that is stored outside the database. Using an external table, PolyBase can access data in an Oracle database, Hadoop cluster, Azure blob storage, and, as in our lab tests, a text file with TPC-H data.
Create and use an external table, and then ingest data, as follows:
IF NOT EXISTS(SELECT * FROM sys.external_data_sources
WHERE name = 'SqlDataPool')
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');
GO
USE [TPCH1TB]
GO
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_HDFS_PARTSUPP')
CREATE EXTERNAL TABLE EXT_HDFS_PARTSUPP (
"PS_PARTKEY" BIGINT NOT NULL,
"PS_SUPPKEY" bigint,
"PS_AVAILQTY" INT,
"PS_SUPPLYCOST" float,
"PS_COMMENT" VARCHAR(199)
)
WITH
(
DATA_SOURCE = SqlStoragePool,
LOCATION = '/TPCH1GB/parquet/partsupp',
FILE_FORMAT = parquet_file
);
GO
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_DP_PARTSUPP')
CREATE EXTERNAL TABLE [EXT_DP_PARTSUPP] (
"PS_PARTKEY" BIGINT NOT NULL,
"PS_SUPPKEY" bigint,
"PS_AVAILQTY" INT,
"PS_SUPPLYCOST" float,
"PS_COMMENT" VARCHAR(199)
)
WITH (
DATA_SOURCE = SqlDataPool,
DISTRIBUTION = ROUND_ROBIN
);
GO
INSERT INTO EXT_DP_PARTSUPP(
"PS_PARTKEY",
"PS_SUPPKEY",
"PS_AVAILQTY",
"PS_SUPPLYCOST",
"PS_COMMENT"
)
SELECT
"PS_PARTKEY",
"PS_SUPPKEY",
"PS_AVAILQTY",
"PS_SUPPLYCOST",
"PS_COMMENT"
FROM EXT_HDFS_PARTSUPP
GO
To verify the ingestion, count the rows in the external table:
Select count(1) from [owner_name].[external table name]
For example:
Select count(1) from [dbo].[EXT_DP_PARTSUPP]
The third way to ingest data is with Spark jobs. The following steps provide a broad outline of how to ingest data using a Spark job.
Note: For detailed steps, see Tutorial: Ingest data into a SQL Server data pool with Spark jobs in Microsoft SQL Docs.
See the example in Using SQL Server and T-SQL to ingest data.
After ingesting the decision-support data into Big Data Cluster, SQL Server, MongoDB, and Oracle, test data virtualization by creating external data sources, creating external tables that map to them, and then running a query that joins data across the sources, as follows:
An external data source is represented as a row in the sys.external_data_sources catalog view in the Big Data Cluster master SQL Server instance. The row contains the name, source, and location of the data source. Creating data sources enables PolyBase in Big Data Cluster to access the other database systems.
Note: The stand-alone SQL Server database requires no data source because Big Data Cluster can natively access the data.
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');
To create the data source for the Oracle database:
i Create a master key to protect the database scoped credentials:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '@StrongPassword!!!';
ii Create the credentials to connect to the Oracle database in the master SQL Server instance.
The following T-SQL command assigns a name (oracle_cred) to the credential and saves the Oracle user account (oracle_user) and its password, protected by the master key that you created in the preceding step:
CREATE DATABASE SCOPED CREDENTIAL oracle_cred
WITH IDENTITY = 'oracle_user', SECRET = 'StrongOraclePassword!!!!';
iii Run the following T-SQL command to finish creating the data source for the Oracle database:
CREATE EXTERNAL DATA SOURCE Oracle_Data_Source
WITH
( LOCATION = 'oracle://<Oracle_instance_IP>:<Oracle_Instance_Port>'
, CREDENTIAL = oracle_cred
);
To create the data source for the MongoDB database:
i Create the MongoDB database credentials:
USE [TPCH1TB]
GO
CREATE DATABASE SCOPED CREDENTIAL mongodb_poly
WITH IDENTITY = 'superuser', Secret = 'test123';
ii Create the data source:
CREATE EXTERNAL DATA SOURCE source_mongodb_poly
WITH (
LOCATION = 'mongodb://mongoserver:27017',
CONNECTION_OPTIONS='SSL=0',
PUSHDOWN = OFF,
CREDENTIAL = mongodb_poly
);
After the data sources are defined, create the external tables. The following script creates an external table that maps to the ORDERS data in HDFS:
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_HDFS_ORDERS')
CREATE EXTERNAL TABLE EXT_HDFS_ORDERS (
"O_ORDERKEY" bigint NOT NULL, "O_CUSTKEY" bigint NOT NULL,
"O_ORDERSTATUS" char(1) NOT NULL,
"O_TOTALPRICE" decimal(12, 2) NOT NULL,
"O_ORDERDATE" date NOT NULL,
"O_ORDERPRIORITY" char(15) NOT NULL,
"O_CLERK" char(15) NOT NULL,
"O_SHIPPRIORITY" int NOT NULL,
"O_COMMENT" varchar(79) NOT NULL
)
WITH
(
DATA_SOURCE = SqlStoragePool,
LOCATION = '/TPCH1GB/parquet/orders',
FILE_FORMAT = parquet_file
);
GO
This script creates an external table, EXT_ORA_REGION, that maps to the REGION table in the Oracle database through the Oracle data source:
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_ORA_REGION')
CREATE EXTERNAL TABLE EXT_ORA_REGION (
"R_REGIONKEY" int,
"R_NAME" Varchar(25) COLLATE Latin1_General_100_BIN2_UTF8,
"R_COMMENT" VARCHAR(152) COLLATE Latin1_General_100_BIN2_UTF8)
WITH (DATA_SOURCE=Oracle_Data_Source,
LOCATION='<oracle-database>.<oracle-schema>.REGION');
GO
Similarly, this script creates an external table that maps to the NATION collection in the MongoDB database:
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_MONGO_NATION')
CREATE EXTERNAL TABLE EXT_MONGO_NATION (
N_NATIONKEY int,
N_NAME NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AS,
N_REGIONKEY int,
N_COMMENT NVARCHAR(152) COLLATE SQL_Latin1_General_CP1_CI_AS)
WITH(
LOCATION='TPCH1TB.nation',
DATA_SOURCE=source_mongodb_poly
);
GO
For this use case, we created additional external tables for relevant tables in HDFS, Oracle, and MongoDB.
When the external data sources and external tables have been defined, the transition to data virtualization is seamless. At this point, consider whether to use custom images or scripting for container automation.
If you will need to refresh the Big Data Cluster environment, save the state of the existing containers as custom images in the registry. Saving these custom images enables you to deploy a Big Data Cluster environment with the data virtualization configuration completed. Developers and data scientists can then start working without having to wait for additional setup.
Another approach is container automation through scripting. In testing Big Data Cluster, our experts created scripts to automate manual steps such as creating connections to external data sources. As part of the validation work, the Big Data Cluster was refreshed to ensure that the steps were repeatable. The initial container rollout took approximately 3 hours. Any refresh of the Big Data Cluster system took approximately 30 minutes. Scripting the post-installation steps worked well and enabled teams to start using the Big Data Cluster system quickly.
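A minimal sketch of such post-installation scripting, assuming that sqlcmd is installed on the administration host and that the T-SQL from the previous sections is saved in .sql files (the file names here are illustrative), is:
# Replay the T-SQL that creates the external data sources and external tables
# against the master instance of a freshly deployed Big Data Cluster.
azdata login
azdata bdc endpoint list    # note the SQL Server master instance endpoint

MASTER_ENDPOINT="<IP_address>,<port_number>"   # SQL Server master instance endpoint
ADMIN_USER="admin"                             # cluster administrator login

# sqlcmd prompts for the password because -P is omitted.
for script in create_data_sources.sql create_external_tables.sql; do
  sqlcmd -S "$MASTER_ENDPOINT" -U "$ADMIN_USER" -d TPCH1TB -i "$script"
done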
The ability to access all four data sources without any data transformation proves the value of data virtualization. The TPC-H benchmark includes several queries that run against decision-support data. We selected TPC-H query 8, National Market Share, to test our data virtualization configuration:
/* TPC_H Query 8 - National Market Share */
SELECT O_YEAR,
SUM(CASE WHEN NATION = 'CHINA'
THEN VOLUME
ELSE 0
END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
N2.N_NAME AS NATION
FROM EXT_SQL_PART, -- Points to SQL Server
EXT_SQL_SUPPLIER, -- Points to SQL Server
EXT_HDFS_LINEITEM, -- Points to HDFS
EXT_HDFS_ORDERS, -- Points to HDFS
EXT_HDFS_CUSTOMER, -- Points to HDFS
EXT_MONGO_NATION N1, -- Points to Mongo
EXT_MONGO_NATION N2, -- Points to Mongo
EXT_ORA_REGION -- Points to Oracle
WHERE P_PARTKEY = L_PARTKEY AND
S_SUPPKEY = L_SUPPKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_CUSTKEY = C_CUSTKEY AND
C_NATIONKEY = N1.N_NATIONKEY AND
N1.N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'ASIA' AND
S_NATIONKEY = N2.N_NATIONKEY AND
O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
P_TYPE = 'SMALL BRUSHED COPPER'
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR
The body of the query (between FROM and WHERE) lists all the selected tables. The query uses seven of the eight tables in the decision-support benchmark (PARTSUPP was not selected). As the comments show, the seven tables span all four data sources: SQL Server, HDFS, Oracle, and MongoDB.
The query completed quickly with no errors. PolyBase with Big Data Cluster enabled data virtualization.
Data scientists and developers can now access external databases without having to transform the data or migrate it into a data warehouse or data lake. In this use case, data virtualization simplified accessing data and eliminated the complexity that is associated with ETL processes. Less complexity can translate into time savings. Further, in our testing of Big Data Cluster with PolyBase data virtualization, we experienced impressively fast connection to external data sources.
Big Data Cluster unifies and centralizes big data and connects to external data sources. Combining big data and data virtualization gives data scientists one place to access information. Centralized data access, a common set of tools, and the ability to access information with less complexity provides the foundation for increased productivity.