Dell PowerProtect Hadoop Data Protection – a Modern and Cost-effective Solution for Apache Hadoop Big Data
Thu, 27 Jan 2022 19:10:36 -0000
Big Data is no longer a buzzword. Over the past decade, big data and analytics have become the most reliable source of insight for businesses as they harness collective data to align strategy, help teams collaborate, uncover new opportunities, and compete in the global marketplace. The market size for big data analytics (BDA) was USD 206.95 billion in 2020 and USD 231.43 billion in 2021, according to Fortune Business Insights™ in its report “Big Data Analytics Market, 2021-2028”.
With big data analytics on course to become the most mission-critical enterprise application, enterprises are demanding robust backup, recovery, and disaster recovery solutions for their big data environments, in particular Apache Hadoop®. Apache Hadoop is a clustered, distributed system used to process massive amounts of data. It provides software and a framework for distributed storage and processing of big data, using various distributed processing models.
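To make the storage side concrete, here is a minimal sketch of writing to and listing HDFS from Python. It assumes the pyarrow library with a working libhdfs client, and the NameNode host and paths are hypothetical; it is illustrative only, not part of the Dell solution.

```python
# A minimal sketch of interacting with HDFS from Python.
# Assumptions: pyarrow with libhdfs available on the client, and a
# hypothetical NameNode at namenode.example.com.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020)

# Write a small file. HDFS splits large files into blocks and, by
# default, replicates each block three times across DataNodes.
with hdfs.open_output_stream("/data/landing/events.csv") as out:
    out.write(b"event_id,timestamp,payload\n")

# List the directory to confirm the write.
for info in hdfs.get_file_info(fs.FileSelector("/data/landing")):
    print(info.path, info.size)
```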
Some common misconceptions and challenges that make data protection difficult in Apache Hadoop:
- Because Apache Hadoop replicates data three times (3x) by default, many administrators do not plan any separate backup of the data.
- A very common misconception is that Apache Hadoop is just a repository for data that resides in existing data warehouses or transactional systems, so the data can be reloaded.
- Data is unstructured and often huge, so backups take a very long time to complete.
- Administrators mistakenly believe that built-in technologies, such as replication and mirroring with Apache Falcon, distcp, and HBase and Hive replication, are sufficient substitutes for backup.
Why do we need a dedicated data protection solution for Apache Hadoop, beyond replication and built-in snapshots?
- Apache Hadoop infrastructure with 3x replication is costly, and long-term retention is not possible for huge volumes of data, so we need a cost-effective solution that can save multiple copies of important data.
- Social media data, ML models, logs, third-party feeds, open APIs, IoT data, banking transactions, and other sources of data may not be reloadable or easily available, or may not exist anywhere else in the enterprise. This critical single-source data must be backed up and retained to meet internal and legal compliance requirements for RPO and RTO.
- Synchronous replication negatively impacts application performance because it intercepts all writes to the file system. This can destabilize the production system and requires extensive testing before it goes into production.
- Replication cannot protect against a natural disaster that takes out an entire data center, an extended power outage that makes the Apache Hadoop platform unavailable, a DBA accidentally dropping an entire database, an application bug corrupting data stored in HDFS, or worse: a cyberattack such as ransomware.
Overview of Dell PowerProtect Hadoop Data Protection
Dell Technologies has developed a Hadoop File System driver for PowerProtect DD series appliances that allows Hadoop data management functions to transparently use DD series appliances for the efficient storage and retrieval of data. This driver is called DD Hadoop Compatible File System (DDHCFS).
DDHCFS is a Hadoop-compatible file system for DD series appliances. It provides versioned backups and recoveries of the Hadoop Distributed File System (HDFS) using DD snapshots, along with source-level deduplication and direct backups to DD series appliances. DDHCFS can also distribute backup and restore streams across all HDFS DataNodes, making backups and restores faster, and it integrates with Kerberos and Hadoop local credentials.
PowerProtect Hadoop Data Protection requires minimal configuration and installs on only one node of the Hadoop cluster. It is tightly integrated into the Hadoop file system and leverages Hadoop's scale-out distributed processing architecture to parallelize data transfer from Hadoop to the PowerProtect DD series appliance. DD Boost provides network-efficient data transfer with client-side deduplication, while PowerProtect DD provides storage efficiency through deduplication and compression. Together, this makes it the most efficient method of moving large amounts of data from a Hadoop cluster to a PowerProtect DD series appliance. Internally, standard Hadoop constructs such as distributed file copy (distcp) and HDFS/HBase snapshots are used to accomplish backup tasks, as the sketch below illustrates.
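The following sketch shows the standard Hadoop constructs this approach builds on: an HDFS snapshot for a point-in-time view, then distcp to copy that consistent view in parallel. The directory, snapshot name, and target URI are hypothetical, and the DDHCFS-specific invocation is product-specific and not shown here.

```python
# A minimal sketch of a snapshot-then-copy backup using standard Hadoop
# tooling. All paths and names are hypothetical.
import subprocess

SOURCE_DIR = "/data/warehouse"             # hypothetical HDFS directory
SNAPSHOT = "backup-2022-01-27"             # hypothetical snapshot name
TARGET = "hdfs://backup-cluster/backups"   # hypothetical backup target

# Enable snapshots on the directory, then create a point-in-time snapshot.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", SOURCE_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-createSnapshot", SOURCE_DIR, SNAPSHOT], check=True)

# distcp runs as a MapReduce job, so the copy is parallelized across the
# cluster's worker nodes -- the same scale-out pattern described above.
subprocess.run(
    ["hadoop", "distcp",
     f"{SOURCE_DIR}/.snapshot/{SNAPSHOT}", f"{TARGET}/{SNAPSHOT}"],
    check=True,
)
```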
Some highlights of the PowerProtect Hadoop Data Protection solution for Hadoop environments are:
- True point-in-time backup and recovery of HDFS data to the PowerProtect DD series appliance
- Full backups at the low cost of incremental (synthetic full) – incremental forever
- Flexible Licensing (included in the PowerProtect and DP Suite license)
- Bandwidth efficiency: DD Boost sends only unique data over the network
- Capacity savings of up to 70% by means of global deduplication and compression on DD series appliances
- Source-level deduplication and load balancing via a DD Boost interface group
- HDFS integration that transparently works through the three-way storage redundancy to back up one consistent copy of the data
- Uses standard Hadoop constructs (such as MapReduce and distcp) to spawn distributed DD Boost agents to parallelize data transfer to PowerProtect DD series appliances
- Self-service, script-based backups, restores, and listings. Every Hadoop administrator can readily use these commands and incorporate them into other workflows
- Backup operations can be further scheduled and automated by Oozie
- Audit log of configuration changes and backup monitoring
Conclusion
PowerProtect Hadoop Data Protection, which is part of the Dell Data Protection Suite Family, provides complete Hadoop data protection. Apache Hadoop customers further benefit from PowerProtect DD series appliances when using the power of DD Boost, with strong backup performance, reduced bandwidth requirements, and improved load balancing and reliability.
The PowerProtect DD series appliance's Data Invulnerability Architecture provides outstanding data protection, ensuring that data from your Hadoop cluster can be recovered when needed and can be trusted. It provides storage efficiency through variable-length deduplication and compression, typically reducing storage requirements by 10-30x[1].
Throughput with DD Boost technology can scale from 7 TB/hr on the DD3300 to 94 TB/hr on the DD9900. What makes DD Boost special is that deduplication takes place at the source client: only unique data is sent over the network to the PowerProtect DD systems. According to IDC, PowerProtect DD is the #1 product in the Purpose-Built Backup Appliance (PBBA) market[2].
If you are already using PowerProtect DD series appliances for other data protection needs, you can leverage the same processes and expertise to protect your big data Hadoop environment.
Author: Sonali Dwivedi
[1] Based on Dell internal testing, January 2020. https://www.delltechnologies.com/en-ca/data-protection/powerprotect-backup-appliances.htm#overlay=//www.dellemc.com/en-ca/collaterals/unauth/analyst-reports/products/data-protection/esg-techreview-next-generation-performance-with-powerprotect.pdf.
[2] Based on IDC WW Purpose-Built Backup Appliance Systems Tracker, 1Q21 (revenue), June 2021.
Related Blog Posts
Navigating the modern data landscape: the need for an all-in-one solution
Mon, 18 Mar 2024 19:56:59 -0000
There are two revolutions brewing inside every enterprise. We are all very familiar with the first one - the frenzied rush to expand an organization's AI capabilities, which leads to exponential growth in data creation, a rise in the availability of high-performance computing systems with multi-threaded GPUs, and the rapid advancement of AI models. This situation creates a perfect storm that is reshaping the way enterprises operate. Then there is a second revolution that makes the first one a reality – the ability to harness this awesome power and gain a competitive advantage to drive innovation. Enterprises are racing towards a modern data architecture that seeks to bring order to their chaotic data environments.
The Need For An All-In-One Solution
Data platforms are constantly evolving. Despite a plethora of options, such as data lakes, data warehouses, cloud data warehouses, and even cloud data lakehouses, enterprises are still struggling. This is because the choices available today are suboptimal.
Cloud-native solutions offer simplicity and scalability, but migrating all data to the cloud can be a daunting task and can end up being significantly more expensive over the long term. Moreover, the loss of control over proprietary data, particularly in the realm of AI, is a major concern as well. On the other hand, traditional on-premises solutions require significantly more expertise and resources to build and maintain. Many organizations simply lack the skills and capabilities needed to construct a robust data platform in-house.
A customer once told me – “We’ve heard from so many vendors but ultimately there is no easy button for us.”
When Dell Technologies set out to build that easy button, we started with what enterprises needed most: infrastructure, software, and services, all seamlessly integrated. We created a tailor-made solution with right-sized compute and a highly performant query engine that is pre-integrated and pre-optimized to streamline IT operations. We incorporated built-in enterprise-grade security that can also integrate seamlessly with third-party security tools. To enable rapid support, we staffed a bench of experts offering end-to-end maintenance and deployment services. We also knew the solution needed to be future-proof – not only anticipating future innovations but also accommodating the diverse needs of users today. To support this idea, we chose to use open data formats, which means an organization's data is no longer locked in to a proprietary format or vendor. To make the transition easier, the solution uses built-in enterprise-ready connectors that ensure business continuity. Ultimately, our goal was to deliver an experience that is easy to install, easy to use, easy to manage, easy to scale, and easy to future-proof.
Dell Data Lakehouse’s Core Capabilities
Let’s dig into each component of the solution.
- Data Analytics Engine, powered by Starburst: A high-performance distributed SQL query engine, built on top of Starburst and based on Trino, which can run fast analytic queries against data lakes, lakehouses, and distributed data sources at internet scale. It integrates global security with fine-grained access controls, supports ad-hoc and long-running ELT workloads, and is a gateway to building high-quality data products and powering AI and analytics workloads (see the query sketch after this list). Dell's Data Analytics Engine also includes exclusive features that help dramatically improve performance when querying data lakes. Stay tuned for more info!
- Data Lakehouse System Software: This new system software is the central nervous system of the Dell Data Lakehouse. It simplifies lifecycle management of the entire stack, drives down IT OpEx with pre-built automation and integrated user management, provides visibility into cluster health, ensures high availability, enables easy upgrades and patches, and lets admins control all aspects of the cluster from one convenient control center. Based on Kubernetes, it is what converts a data lakehouse into an easy button for enterprises of all sizes.
- Scale-out Lakehouse Compute: Purpose-built Dell compute and networking hardware, perfectly matched for compute-intensive data lakehouse workloads, comes pre-integrated into the solution. Scale compute independently from storage by seamlessly adding nodes as needs grow.
- Scale-out Object Storage: Dell ECS, ObjectScale, and PowerScale deliver cyber-secure, multi-protocol, resilient, scale-out storage for storing and processing massive amounts of data. Native support for Delta Lake and Iceberg ensures read/write consistency within and across sites for handling concurrent, atomic transactions.
- Dell Services: Accelerate AI outcomes with help at every stage from trusted experts. Align a winning strategy, validate data sets, quickly implement your data platform and maintain secure, optimized operations.
- ProSupport: Comprehensive, enterprise-class support on the entire Dell Data Lakehouse stack from hardware to software delivered by highly trained experts around the clock and around the globe.
- ProDeploy: Expert delivery and configuration assure that you are getting the most from the Dell Data Lakehouse on day one. With 35 years of experience building best-in-class deployment practices and tools, backed by elite professionals, we can deploy 3x faster[1] than in-house administrators.
- Advisory Services Subscription for Data Analytics Engine: Receive a pro-active, dedicated expert to maximize value of your Dell Data Analytics Engine environment, guiding your team through design and rollout of new use cases to optimize and scale your environment.
- Accelerator Services for Dell Data Lakehouse: Fast track ROI with guided implementation of the Dell Data Lakehouse platform to accelerate AI and data analytics.
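As a concrete taste of the Data Analytics Engine described above, here is a minimal sketch of running a query from Python. It assumes the open-source trino client package, and the coordinator endpoint, catalog, schema, and table names are hypothetical; treat it as an illustration under those assumptions rather than the product's exact interface.

```python
# A minimal sketch of querying a Trino-based engine from Python.
# Host, catalog, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="lakehouse.example.com",  # hypothetical coordinator endpoint
    port=8080,
    user="analyst",
    catalog="iceberg",             # e.g., an Iceberg catalog on object storage
    schema="sales",
)

cur = conn.cursor()
# The same SQL interface serves ad-hoc analytics across data lakes,
# lakehouses, and other connected sources.
cur.execute("""
    SELECT region, sum(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
""")
for region, total in cur.fetchall():
    print(region, total)
```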
Learn More
With the combination of these capabilities, Dell continues to innovate alongside our customers to help them exceed their goals in the face of data challenges. We aim to help our customers take advantage of the revolution that is AI and this rapid change in the market, so they can harness the power of their data, gain a competitive advantage, and drive innovation. Enterprises are racing towards a modern data architecture – it's critical they don't get stuck at the starting line.
For detailed information on this exciting product, refer to our technical guide. For other information, visit Dell.com/datamanagement.
Source
[1] Based on a May 2023 Principled Technologies study, “Using Dell ProDeploy Plus Infrastructure can improve deployment times for Dell Technology.”
KubeCon NA23, Google Cloud Anthos on Dell PowerFlex and More
Sun, 05 Nov 2023 23:26:43 -0000
KubeCon will be here before you know it, and there is so much to see and do. While you are making your plans, be sure to add a few items that will make your time at the conference, and afterwards, easier.
Before we get into those items, did you know that the Google Cloud team and the Dell PowerFlex team have been collaborating? Recently, Dell and Google Cloud published a reference architecture: Google Cloud Anthos and GDC Virtual on Dell PowerFlex. It illustrates how both teams are working together to enable consistency between cloud and on-premises environments like PowerFlex. You will see this collaboration at KubeCon this year.
On Tuesday at KubeCon, after breakfast and the keynote, make your way to the Solutions Showcase in Hall F on Level 3 of the West building. Once there, head over to the Google Cloud booth and visit with the team! They are eager to answer your questions about PowerFlex and to share how Google Distributed Cloud (GDC) Virtual with PowerFlex provides a powerful on-premises container solution.
Also, be sure to catch the lightning sessions in the Google Cloud booth. You'll get to hear from Dell PowerFlex engineer Praphul Krottapalli, who will dig into leveraging GDC Virtual on PowerFlex. That's not all, though: he will also look at running a Postgres database distributed across on-premises PowerFlex nodes using GDC Virtual, and at how to protect these containerized database workloads. He'll show you how to use Dell PowerProtect Data Manager to create application-consistent backups of a containerized Postgres database instance.
We all know backups are only good if you can restore them. So, Praphul will show you how to recover the Postgres database and have it running again in no time.
Application consistency is an important thing to keep in mind with backups. Would you rather have a database backup taken as though someone had just pulled the plug on the database (crash consistent), or one taken as though someone had gracefully shut down the system (application consistent)? For all kinds of reasons (time, cost, sanity), the latter is highly preferable!
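To make the distinction concrete, here is a minimal sketch using standard PostgreSQL tooling: pg_dump produces a transactionally consistent (application-consistent) backup of a running database, whereas copying raw data files from a live instance is only crash consistent. This is a generic illustration, not the PowerProtect Data Manager workflow shown in the session; host, user, database, and paths are hypothetical.

```python
# A minimal sketch of an application-consistent Postgres backup and a
# restore drill. All connection details and paths are hypothetical.
import subprocess

# pg_dump takes a transactionally consistent snapshot of the database
# while it keeps serving traffic -- application consistent by design.
subprocess.run(
    ["pg_dump",
     "--host", "postgres.example.com",
     "--username", "backup_user",
     "--format", "custom",              # compressed; restorable with pg_restore
     "--file", "/backups/appdb.dump",
     "appdb"],
    check=True,
)

# A backup is only good if you can restore it: load the dump into a
# separate database (assumes appdb_restored already exists) to verify it.
subprocess.run(
    ["pg_restore",
     "--host", "postgres.example.com",
     "--username", "backup_user",
     "--dbname", "appdb_restored",
     "/backups/appdb.dump"],
    check=True,
)
```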
We talk about this more in a blog that covers the demo environment we used for KubeCon.
This highlights Dell and Google’s joint commitment to modern apps by ensuring that they can be run everywhere and that organizations can easily develop and deploy modern workloads.
If you are at KubeCon and would like to learn more about how containers work on Dell solutions, be sure to stop by both the Dell and Google Cloud booths. If it’s after KubeCon, be sure to reach out to your Dell representative for more details.
Author: Tony Foster