Data scientists and others at times might need to move data from the original source to the Big Data Cluster data pool—to work on a subset of all data for data analytics or to accelerate performance, for example. The primary use of the data pool is to offload data processing from the SQL Server master instance and, in doing so, accelerate performance. The data pool can accelerate performance in the following ways:
In this use case, we ingested all the data from the stand-alone SQL Server instance into the data pool. The stand-alone SQL Server instance was selected because it contained TPC-H tables that were larger than those tables in the Oracle database.
Ingest the data as follows:
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_DP_SUPPLIER')
CREATE EXTERNAL TABLE [EXT_DP_SUPPLIER] (
"S_SUPPKEY" BIGINT ,
"S_COMMENT" VARCHAR(101) )
DATA_SOURCE = SqlDataPool,
DISTRIBUTION = ROUND_ROBIN
INSERT INTO EXT_DP_SUPPLIER(
After migrating the database to the data pool, you can use the following script as a test.
/* TPC_H Query 8 - National Market Share */
SUM(CASE WHEN NATION = 'CHINA'
END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
N2.N_NAME AS NATION
FROM EXT_DP_PART, -- Points to Data Pool
EXT_DP_SUPPLIER, -- Points to Data Pool
EXT_DP_LINEITEM, -- Points to Data Pool
EXT_DP_ORDERS, -- Points to Data Pool
EXT_DP_CUSTOMER, -- Points to Data Pool
EXT_DP_NATION N1, -- Points to Data Pool
EXT_DP_NATION N2, -- Points to Data Pool
EXT_DP_REGION -- Points to Data Pool
WHERE P_PARTKEY = L_PARTKEY AND
S_SUPPKEY = L_SUPPKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_CUSTKEY = C_CUSTKEY AND
C_NATIONKEY = N1.N_NATIONKEY AND
N1.N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'ASIA' AND
S_NATIONKEY = N2.N_NATIONKEY AND
O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
P_TYPE = 'SMALL BRUSHED COPPER'
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR
Reviewing performance was not the goal of this white paper; however, our SQL Server experts did look at performance. They first modified the 22 queries that the TPC‑H benchmark uses. They then ran queries against the data in the data pool, Oracle, and HDFS and compared the time it took to run those queries to the time it took to run the queries against SQL Server, Oracle, and HDFS. Before reviewing the result, note that the data pool was not designed for performance, and the amount of data that was migrated to the data pool was not large. Thus, any performance findings are not significant.
Of the 22 queries, 13 had a faster execution time running with the data pool than running with the SQL Server instance. Execution time is the amount of CPU time that is used for a query. A faster execution time uses less CPU time.
Using the data pool accelerated 59 percent of the TPC-H queries. The PowerFlex system supports performance tiers, providing the ability to allocate faster flash drives to the data pool. Depending on how your organization plans to use Big Data Clusters, you can configure the PowerFlex system to address your performance requirements.