Home > Workload Solutions > SQL Server > Guides > Design Guide—Data Analytics with SQL Server 2022 on Red Hat OpenShift and Dell ObjectScale > Create TPC-DS table schema in Pyspark
Use the following steps to create a TPC-DS table schema in Pyspark.
All the .dat files for each of the 25 tables must be converted in the TPC-DS data set and then ingest them into the ObjectScale bucket.
To establish an external data source, it is necessary to specify the S3 storage URL and the credentials associated with the database.
The utilization of SQL Server external table streamlines the procedure of handling data beyond the confines of SQL Server. By leveraging PolyBase, the external table can effortlessly retrieve data that is stored externally to SQL Server, such as on ObjectScale object storage in our specific scenario.
Before creating the external table, perform the following steps:
The following figures demonstrate the creation of the external file in Delta Lake file format.
After creating the external file in Delta Lake format (DeltaTableFormat), the next step is to create the SQL Server External table. This can be achieved by utilizing the "LOCATION" and "DATA_SOURCE" parameters.
After the successful creation of the external table, a standard SQL Server "select" statement can be executed like any other tables within the SQL Server database. An illustration of running a "select" statement on the "customer_address" table is depicted in Figure 36
All 25 TPC-DS tables have been established as External tables within SQL Server. Consequently, the TPC-DS data is now stored externally to SQL Server, with solely the metadata for these tables remaining within the SQL Server instance. The next step is to run all 99 standard TPC-DS queries and conducting the necessary data analytics in accordance with the TPC-DS specifications.
Figure 37 illustrates one of the standard TPC-DS queries, among the 99 queries, along with its corresponding result. The query in question utilizes the date_dim (consisting of 73,049 records), store_sales (comprising 28,79,97,024 records), and store (containing 402 records) tables, all of which are located on ObjectScale.
The execution plan for this query demonstrates that plan includes a "remote scan" operator showcasing actual data coming from outside SQL Server at run time, as indicated in Figure 38.