Step 3: Migrate data to the data pool

Thank you for your feedback!

Data scientists and others at times might need to move data from the original source to the Big Data Cluster data pool. For example, they might need to work on a subset of the data for data analytics, or they might need to accelerate performance. The primary use of the data pool is to offload data processing from the SQL Server master instance and, in doing so, accelerate performance. The data pool can accelerate performance in the following ways:
- Offloading of data processing to servers in the data pool—The data pool can consist of one or more VxRail nodes running SQL Server. The VxRail system in our use case had three SQL Server data pool instances, each on a separate node. The capability to distribute processing can accelerate analytics.
- Parallelization of data processing—Beyond the advantage of the physical configuration enabling offloading of data processing to separate nodes, the SQL Server master instance can parse and delegate processing. Distribution of the workload across the SQL Server instances further accelerates data processing.
- Use of shards—Data that is ingested into the data pool is distributed across the SQL Server instances. Distribution across the instances benefits performance. Because each SQL Server instance in the data pool reads a small subset of information, the master SQL Server instance receives results more quickly.
In this use case, we ingested all the data from the stand-alone SQL Server instance into the data pool. The stand-alone SQL Server instance was selected because it contained TPC-H tables that were larger than those tables in the Oracle database.
Ingest the data as follows:
1. Create the external data source:
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');
GO
1. In the data pool, create an external table named BDC_DP_SUPPLIER:
IF NOT EXISTS(SELECT * FROM sys.external_tables WHERE name = 'BDC_DP_SUPPLIER')
   CREATE EXTERNAL TABLE [BDC_DP_SUPPLIER] (
"S_SUPPKEY" BIGINT ,
"S_NAME" CHAR(25),
"S_ADDRESS" VARCHAR(40),
"S_NATIONKEY" INT,
"S_PHONE" CHAR(15),
"S_ACCTBAL" Decimal(12,2),
"S_COMMENT" VARCHAR(101) )
   WITH (
      DATA_SOURCE = SqlDataPool,
      DISTRIBUTION = ROUND_ROBIN
   );
GO
1. Ingest the data into the data pool:
INSERT INTO BDC_DP_SUPPLIER(
    "S_SUPPKEY",
"S_NAME",
"S_ADDRESS",
"S_NATIONKEY",
"S_PHONE",
"S_ACCTBAL",
"S_COMMENT"
)
SELECT "S_SUPPKEY",
"S_NAME",
"S_ADDRESS",
"S_NATIONKEY",
"S_PHONE",
"S_ACCTBAL",
"S_COMMENT"
FROM EXT_SQL_SUPPLIER
GO
After migrating the database to the data pool, you can use the following script as a test.
Note: The following script selects data only from tables in the data pool, Oracle, and HDFS.
/* TPC_H Query 8 - National Market Share */
SELECT O_YEAR,
SUM(CASE WHEN NATION = 'CHINA'
   THEN VOLUME
   ELSE 0
   END) / SUM(VOLUME) AS MKT_SHARE
FROM ( SELECT datepart(yy,O_ORDERDATE)  AS O_YEAR,
   L_EXTENDEDPRICE * (1-L_DISCOUNT) AS VOLUME,
   N2.N_NAME    AS NATION
  FROM BDC_DP_PART,  -- Points to Data Pool
   BDC_DP_SUPPLIER, -- Points to Data Pool
   BDC_DP_LINEITEM, -- Points to Data Pool
   BDC_DP_ORDERS, -- Points to Data Pool
   BDC_DP_CUSTOMER, -- Points to Data Pool
   BDC_DP_NATION N1, -- Points to Data Pool
   BDC_DP_NATION N2, -- Points to Data Pool
   BDC_DP_REGION -- Points to Data Pool
  WHERE P_PARTKEY = L_PARTKEY AND
   S_SUPPKEY = L_SUPPKEY AND
   L_ORDERKEY = O_ORDERKEY AND
   O_CUSTKEY = C_CUSTKEY AND
   C_NATIONKEY = N1.N_NATIONKEY AND
   N1.N_REGIONKEY = R_REGIONKEY AND
   R_NAME  = 'ASIA' AND
   S_NATIONKEY = N2.N_NATIONKEY AND
   O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND
   P_TYPE  = 'SMALL BRUSHED COPPER'
) AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR
Reviewing performance was not the goal of this white paper; however, our SQL Server experts did observe performance. They first modified the 22 queries that the TPC‑H benchmark uses. They then ran queries against the data in the data pool and compared the time it took to run those queries to the time it took to run the queries against SQL Server, Oracle, MongoDB, and HDFS. Before reviewing the result, note that the data pool was not designed for performance, and the amount of data that was migrated to the data pool was not large. Thus, any performance findings are not significant.
All 22 queries ran faster on the data in the Big Data Cluster data pool than when they ran on the virtualized dataset outside of Big Data Cluster.
Using the data pool accelerated 78 percent of the TPC-H queries. Depending on how your organization plans to use Big Data Cluster, you can configure your VxRail system to address your performance requirements.

Your Browser is Out of Date

Step 3: Migrate data to the data pool

Step 3: Migrate data to the data pool