Step 3: Migrate data to the data pool

Thank you for your feedback!

Data scientists and others at times might need to move data from the original source to the Big Data Cluster data pool—to work on a subset of all data for data analytics or to accelerate performance, for example. The primary use of the data pool is to offload data processing from the SQL Server master instance and, in doing so, accelerate performance. The data pool can accelerate performance in the following ways:
- Offloading of data processing to servers in the data pool—The data pool can consist of one or more PowerFlex nodes running SQL Server. The PowerFlex rack in our use case had three SQL Server data pool instances, each on a separate node. The capability to distribute processing can accelerate analytics.
- Parallelization of data processing—Beyond the advantage of the physical configuration enabling offloading of data processing to separate nodes, the SQL Server master instance can parse and delegate processing. Distribution of the workload across the SQL Server instances further accelerates data processing.
- Use of shards—Data that is ingested into the data pool is distributed across the SQL Server instances. This benefits performance because each SQL Server instance in the data pool reads a small subset of information, thus more quickly returning the result sets back to the master SQL Server instance.
In this use case, we ingested all the data from the stand-alone SQL Server instance into the data pool. The stand-alone SQL Server instance was selected because it contained TPC-H tables that were larger than those tables in the Oracle database.
Ingest the data as follows:
1. Create the external data source:
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');
GO
1. Create an external table named EXT_DP_SUPPLIER in the data pool:
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'EXT_DP_SUPPLIER')
   CREATE EXTERNAL TABLE [EXT_DP_SUPPLIER] (
"S_SUPPKEY" BIGINT ,
"S_NAME" CHAR(25),
"S_ADDRESS" VARCHAR(40),
"S_NATIONKEY" INT,
"S_PHONE" CHAR(15),
"S_ACCTBAL" Decimal(12,2),
"S_COMMENT" VARCHAR(101) )
   WITH (
      DATA_SOURCE = SqlDataPool,
      DISTRIBUTION = ROUND_ROBIN
   );
GO
1. Ingest the data into the data pool:
INSERT INTO EXT_DP_SUPPLIER(
    "S_SUPPKEY",
"S_NAME",
"S_ADDRESS",
"S_NATIONKEY",
"S_PHONE",
"S_ACCTBAL",
"S_COMMENT"
)
SELECT "S_SUPPKEY",
"S_NAME",
"S_ADDRESS",
"S_NATIONKEY",
"S_PHONE",
"S_ACCTBAL",
"S_COMMENT"
FROM SUPPLIER
GO
After migrating the database to the data pool, you can use the following script as a test.
Note: The following script selects data only from tables in the data pool, Oracle, and HDFS.
/* TPC_H Query 8 - National Market Share */
SELECT O_YEAR,
SUM(CASE WHEN NATION = 'CHINA'
   THEN VOLUME
   ELSE 0
   END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE)  AS O_YEAR,
   L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
   N2.N_NAME    AS NATION
  FROM EXT_DP_PART,  -- Points to Data Pool
   EXT_DP_SUPPLIER, -- Points to Data Pool
   EXT_DP_LINEITEM, -- Points to Data Pool
   EXT_DP_ORDERS, -- Points to Data Pool
   EXT_DP_CUSTOMER, -- Points to Data Pool
   EXT_DP_NATION N1, -- Points to Data Pool
   EXT_DP_NATION N2, -- Points to Data Pool
   EXT_DP_REGION  -- Points to Data Pool
  WHERE P_PARTKEY = L_PARTKEY AND
   S_SUPPKEY = L_SUPPKEY AND
   L_ORDERKEY = O_ORDERKEY AND
   O_CUSTKEY = C_CUSTKEY AND
   C_NATIONKEY = N1.N_NATIONKEY AND
   N1.N_REGIONKEY = R_REGIONKEY AND
   R_NAME  = 'ASIA' AND
   S_NATIONKEY = N2.N_NATIONKEY AND
   O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
   P_TYPE  = 'SMALL BRUSHED COPPER'
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR
Reviewing performance was not the goal of this white paper; however, our SQL Server experts did look at performance. They first modified the 22 queries that the TPC‑H benchmark uses. They then ran queries against the data in the data pool, Oracle, and HDFS and compared the time it took to run those queries to the time it took to run the queries against SQL Server, Oracle, and HDFS. Before reviewing the result, note that the data pool was not designed for performance, and the amount of data that was migrated to the data pool was not large. Thus, any performance findings are not significant.
Of the 22 queries, 13 had a faster execution time running with the data pool than running with the SQL Server instance. Execution time is the amount of CPU time that is used for a query. A faster execution time uses less CPU time.
Using the data pool accelerated 59 percent of the TPC-H queries. The PowerFlex system supports performance tiers, providing the ability to allocate faster flash drives to the data pool. Depending on how your organization plans to use Big Data Clusters, you can configure the PowerFlex system to address your performance requirements.

Your Browser is Out of Date

Step 3: Migrate data to the data pool

Step 3: Migrate data to the data pool