Data scientists and others might at times need to move data from its original source into the Big Data Cluster data pool, for example, to work on a subset of the data for analytics or to accelerate performance. The primary use of the data pool is to offload data processing from the SQL Server master instance and, in doing so, accelerate performance.
In this use case, we ingested all the data from the stand-alone SQL Server instance into the data pool. We selected the stand-alone SQL Server instance because its TPC-H tables were larger than the corresponding tables in the Oracle database.
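The INSERT statement in the listing below selects from an external table named EXT_SQL_SUPPLIER, which must already exist in the master instance and point at the stand-alone SQL Server instance. If it has not yet been created, a PolyBase definition similar to the following sketch can be used; the data-source name, host, credential, and source database names here are assumptions for illustration, not values from this environment:

```sql
-- Assumed names: SqlStandalone, SqlCredential, standalone-host, and the
-- tpch source database are placeholders; substitute your own values.
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStandalone')
    CREATE EXTERNAL DATA SOURCE SqlStandalone
    WITH (LOCATION = 'sqlserver://standalone-host:1433',
          CREDENTIAL = SqlCredential);
GO

IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_SQL_SUPPLIER')
    CREATE EXTERNAL TABLE [EXT_SQL_SUPPLIER] (
        "S_SUPPKEY"   BIGINT,
        "S_NAME"      CHAR(25),
        "S_ADDRESS"   VARCHAR(40),
        "S_NATIONKEY" INT,
        "S_PHONE"     CHAR(15),
        "S_ACCTBAL"   DECIMAL(12,2),
        "S_COMMENT"   VARCHAR(101)
    )
    WITH (
        LOCATION = 'tpch.dbo.SUPPLIER',  -- assumed source database and schema
        DATA_SOURCE = SqlStandalone
    );
GO
```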
Ingest the data as follows:
-- Create a data source that points to the data pool, if one does not already exist
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://controller-svc/default');
GO

-- Create an external table whose rows are distributed round-robin
-- across the data pool instances
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'BDC_DP_SUPPLIER')
    CREATE EXTERNAL TABLE [BDC_DP_SUPPLIER] (
        "S_SUPPKEY"   BIGINT,
        "S_NAME"      CHAR(25),
        "S_ADDRESS"   VARCHAR(40),
        "S_NATIONKEY" INT,
        "S_PHONE"     CHAR(15),
        "S_ACCTBAL"   DECIMAL(12,2),
        "S_COMMENT"   VARCHAR(101)
    )
    WITH (
        DATA_SOURCE = SqlDataPool,
        DISTRIBUTION = ROUND_ROBIN
    );
GO

-- Copy the rows from the external table over the stand-alone
-- SQL Server instance into the data pool
INSERT INTO BDC_DP_SUPPLIER (
    "S_SUPPKEY",
    "S_NAME",
    "S_ADDRESS",
    "S_NATIONKEY",
    "S_PHONE",
    "S_ACCTBAL",
    "S_COMMENT"
)
SELECT "S_SUPPKEY",
       "S_NAME",
       "S_ADDRESS",
       "S_NATIONKEY",
       "S_PHONE",
       "S_ACCTBAL",
       "S_COMMENT"
FROM EXT_SQL_SUPPLIER;
GO
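After the ingest completes, a quick sanity check is to compare row counts between the source external table and the new data pool table. This is a minimal sketch using the table names from the listing above:

```sql
-- The two counts should match if the ingest copied every supplier row
SELECT (SELECT COUNT(*) FROM EXT_SQL_SUPPLIER) AS source_rows,
       (SELECT COUNT(*) FROM BDC_DP_SUPPLIER)  AS data_pool_rows;
GO
```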
After ingesting the data into the data pool, you can use the following script as a test.
Note: The following script selects data only from tables in the data pool; it does not reference the Oracle or HDFS external tables.
/* TPC_H Query 8 - National Market Share */
SELECT O_YEAR,
SUM(CASE WHEN NATION = 'CHINA'
THEN VOLUME
ELSE 0
END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
N2.N_NAME AS NATION
FROM BDC_DP_PART, -- Points to Data Pool
BDC_DP_SUPPLIER, -- Points to Data Pool
BDC_DP_LINEITEM, -- Points to Data Pool
BDC_DP_ORDERS, -- Points to Data Pool
BDC_DP_CUSTOMER, -- Points to Data Pool
BDC_DP_NATION N1, -- Points to Data Pool
BDC_DP_NATION N2, -- Points to Data Pool
BDC_DP_REGION -- Points to Data Pool
WHERE P_PARTKEY = L_PARTKEY AND
S_SUPPKEY = L_SUPPKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_CUSTKEY = C_CUSTKEY AND
C_NATIONKEY = N1.N_NATIONKEY AND
N1.N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'ASIA' AND
S_NATIONKEY = N2.N_NATIONKEY AND
O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
P_TYPE = 'SMALL BRUSHED COPPER'
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR;
Reviewing performance was not the goal of this white paper; however, our SQL Server experts did observe it. They first modified the 22 queries that the TPC-H benchmark uses, ran them against the data in the data pool, and compared the run times with those of the same queries run against SQL Server, Oracle, MongoDB, and HDFS. Before reviewing the results, note that this test was not designed as a performance benchmark and the amount of data migrated to the data pool was not large; any performance findings are therefore indicative rather than significant.
All 22 queries ran faster against the data in the Big Data Cluster data pool than against the virtualized dataset outside the Big Data Cluster. Using the data pool accelerated 78 percent of the TPC-H queries. Depending on how your organization plans to use Big Data Cluster, you can configure your VxRail system to meet your performance requirements.