Our lives are being transformed by the accelerating rate of technology innovation and adoption. With smartphone applications, we can manage our bank accounts, stay connected through a vast array of social media channels, and have near-instant access to news from around the world. This era of continuous technology advancement has created an explosion of data. IDC predicts an increase in the global datasphere from 33 zettabytes in 2018 to 175 zettabytes by 2025 (Data Age 2025: The Digitization of the World From Edge to Core, IDC, November 2018).
Data analysis can solve real-world problems and improve many aspects of our lives; however, the data-management requirements necessary to support analytics are the most limiting factor for most organizations. Traditional structured databases were not designed to support data on the scale of petabytes, exabytes, or greater. Innovations in distributed database technology address the Big Data challenge by distributing data across interconnected computer clusters. We believe that the use of distributed databases in universities, governments, and businesses holds great potential to produce additional insights that will lead to new life-improving transformations.
Microsoft SQL Server Big Data Clusters are designed to solve the Big Data challenge facing most organizations today. With SQL Server Big Data Clusters, data professionals can distribute petabytes to exabytes of data across scale-out pools of compute and storage resources by using the Apache Spark and Hadoop Distributed File System (HDFS) open-source cluster computing frameworks. Spark parallelizes data analysis across an entire cluster of computers, while HDFS provides persistent storage and high-performance data access for large-scale datasets. Data scientists can choose from multiple tools, including Spark APIs and Transact-SQL (T-SQL), to develop new analytics insights.
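To make the Spark side of this concrete, the following is a minimal PySpark sketch of a job that aggregates a large dataset stored in HDFS. It assumes a running Spark environment (such as the one provided by a Big Data Cluster); the application name, the HDFS path, and the `region` and `amount` column names are illustrative, not taken from this text.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Obtain a session against the environment's Spark endpoint; in a
# Big Data Cluster, connection details come from the environment.
spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Read a large CSV dataset directly from HDFS (hypothetical path).
sales = spark.read.csv("hdfs:///data/sales/*.csv",
                       header=True, inferSchema=True)

# The aggregation runs in parallel across the cluster's worker nodes;
# Spark partitions the data and combines the partial sums.
summary = (sales.groupBy("region")
                .agg(F.sum("amount").alias("total_amount"))
                .orderBy("region"))

summary.show()
```

Because the same DataFrame code runs unchanged whether the cluster has two nodes or two hundred, data scientists can prototype on a small sample and scale out without rewriting the analysis.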
SQL Server technologies also address the challenge of accessing data silos that are too dynamic or too large to bring into a single distributed database instance. The ability to connect to other data sources in the datasphere increases data variety, which broadens the range of possible analytics and improves the accuracy of predictions. Among the more recent SQL Server innovations is data virtualization, which enables analytics across disparate data sources without copying data between systems. Data virtualization can access a wide variety of source systems without first extracting, transforming, and loading (ETL) the data into a common data warehouse. Instead, it queries the data directly in each source's native language and returns the results to the requesting application.
Data virtualization is made possible by PolyBase technology, which is integrated into SQL Server. Using SQL Server with PolyBase, data analysts can access data in Hadoop, Oracle, Teradata, and MongoDB by using familiar T-SQL queries. Together with SQL Server Big Data Clusters, PolyBase builds bridges across the datasphere by connecting Big Data technology with traditional relational data sources. Data scientists can use the most appropriate programming languages, including T-SQL, Java, Python, R, and Scala, to perform data analytics across a broad spectrum of data sources.
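As an illustration of what a PolyBase-based query can look like, the following T-SQL sketch exposes a table in a remote Oracle database and joins it with local SQL Server data, all without copying the remote rows. The server name, credential, table definitions, and column names are hypothetical placeholders, not taken from this text.

```sql
-- Define an external data source pointing at a remote Oracle server
-- (server name and credential are hypothetical).
CREATE EXTERNAL DATA SOURCE OracleSales
WITH (
    LOCATION = 'oracle://oraclesrv01:1521',
    CREDENTIAL = OracleCredential
);

-- Expose a remote Oracle table to T-SQL; no data is copied,
-- the table definition only describes the remote schema.
CREATE EXTERNAL TABLE dbo.SalesOrders
(
    OrderID    INT,
    CustomerID INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    DATA_SOURCE = OracleSales,
    LOCATION = 'ORCL.SALES.ORDERS'
);

-- Join remote and local data with ordinary T-SQL; PolyBase pushes
-- work to the remote source and returns only the results.
SELECT c.CustomerName, SUM(o.Amount) AS TotalSales
FROM dbo.Customers AS c
JOIN dbo.SalesOrders AS o
    ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerName;
```

To the analyst, the external table behaves like any other table in the database, which is exactly the point: one familiar query language spanning relational and Big Data sources.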