What's a Data Hoarder to Do?
Tue, 10 May 2022 19:18:45 -0000
So you're buried in data, you can't afford to expand, your performance is bad and getting worse, and your users can't find what they need. Yes, it's a tsunami of data that's the root cause of your problems. You ask your Mom for advice and she says, "Why don't you watch that TV show called Hoarders?" You watch a few episodes and can relate to the problem, but they offer no workable solutions for your excess data. Then you talk to Mike King over at Dell and he says, "That problem has been around since the ENIAC."
The bottom line is that almost all systems are designed to store certain kinds of data for a predetermined amount of time (retention). If you don't have retention rules, then you failed as an architect. The solution for data hoarding is much more recent, evolving over the last 40 years or so. It was first called data archiving, a term some still use today. The concept is really simple: take the data that is no longer needed and remove it from the system of record. If the data is still needed, just far less frequently, move it to a cheaper form of storage.
The discipline that evolved around this practice was first called data lifecycle management (DLM) and later information lifecycle management (ILM). ILM takes a more holistic view of the archiving process, including policies, governance, classification, access, compliance, retention, redaction, privacy, recall, query and more. We won't get into all the ILM stuff in this post.
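The basic archiving rule described above (remove what is no longer needed, move what is rarely needed to cheaper storage) can be sketched in a few lines of Python. This is only an illustration: the function name `disposition` and the two policy values are hypothetical, and real retention rules come from the business, not from code.

```python
from datetime import date, timedelta

# Hypothetical policy values for illustration only.
RETENTION = timedelta(days=6 * 365)   # how long data must be kept at all
ACTIVE_WINDOW = timedelta(days=365)   # how long data stays in the system of record

def disposition(record_date: date, today: date) -> str:
    """Return 'purge', 'archive' or 'keep' for a record with the given date."""
    age = today - record_date
    if age > RETENTION:
        return "purge"      # no longer needed: remove it entirely
    if age > ACTIVE_WINDOW:
        return "archive"    # rarely needed: move to cheaper storage
    return "keep"           # active: leave it in the system of record

today = date(2022, 5, 10)
for d in (date(2022, 1, 15), date(2019, 6, 1), date(2014, 3, 1)):
    print(d, disposition(d, today))
```

With these assumed values, a record from January 2022 is kept, one from 2019 is archived, and one from 2014 is purged.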
Let's take a concrete example to get started. We have a regional bank called Happy Piggy Bank. They do business in 30 states and run ERP applications like Oracle EBS, databases such as Greenplum and SingleStore for analytics, and Hadoop for an integrated data warehouse and AI platform. The EBS database holds six years of data and weighs in at a stout 600TB. The Greenplum database is around 1PB and stores just 90 days of data. SingleStore is new, but they have big plans: it's at 200TB today and is expected to grow to 3PB in a year. The Hadoop cluster is the largest of all, with detailed transaction and account statement data going back 10 years and 10PB of raw data. Only the Greenplum database has a formal purge program that was actually written and put into production. Neither the Hadoop nor the EBS environment has one.
The first order of business is to determine how much data they should or need to retain. This is mostly a business activity. The next step is to determine the access patterns. In order to do data archiving, one needs to determine the active portion of the data. In most systems perhaps 99% of the access is confined to a small portion of the retention continuum. Let's consider that EBS database and its six years of data. We might run some reports and do some analysis, and it's highly likely that 90% of the access is to data less than 6 months old and, let's say, 99% is to data less than 1 year old. In this case we should target the 5 oldest years of retention (about 83% of the data, or roughly 500TB of the database) to migrate to a more cost-effective platform. In a similar fashion we determine that 60% of the Hadoop data is accessed less than 1% of the time, so that's a 6PB chunk we can lop off the Hadoop system. So for Happy Piggy Bank we have determined we can remove about 6.5PB of data from two of the systems, which will yield the following benefits:
- Room for future growth will be created in the source systems
- Performance should improve in these systems
- Overall data storage costs will go down
- The source systems will be easier to manage
- We will likely avoid increased software licensing charges for Oracle and Hadoop as compared to doing nothing
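The sizing arithmetic behind the 6.5PB figure can be sketched as follows. The numbers are the ones from the Happy Piggy Bank example. The helper name `archivable_by_age` is hypothetical, and the sketch assumes that data volume is spread evenly across the years and that 1PB = 1000TB.

```python
# Figures from the Happy Piggy Bank example, in TB (assuming 1PB = 1000TB).
EBS_TB, EBS_YEARS, EBS_ACTIVE_YEARS = 600, 6, 1
HADOOP_TB, HADOOP_COLD_FRACTION = 10_000, 0.60

def archivable_by_age(total_tb: float, years_kept: int, active_years: int) -> float:
    """Archivable TB, assuming data volume is spread evenly across the years."""
    return total_tb * (years_kept - active_years) / years_kept

# EBS: everything older than the active year is an archive candidate.
ebs_archive = archivable_by_age(EBS_TB, EBS_YEARS, EBS_ACTIVE_YEARS)
# Hadoop: the rarely accessed fraction is an archive candidate.
hadoop_archive = HADOOP_TB * HADOOP_COLD_FRACTION
total_pb = (ebs_archive + hadoop_archive) / 1000

print(f"EBS: {ebs_archive:.0f}TB, Hadoop: {hadoop_archive:.0f}TB, total: {total_pb:.1f}PB")
```

Under those assumptions the EBS candidate comes out at roughly 500TB (the 83% figure is 5/6 rounded), Hadoop at 6PB, and the total at about 6.5PB.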
So, you ask, what might the solution be? Enter Versity, a partner of Dell Technologies enabled through our OEM channel. Versity is a full-featured archiving solution that offers:
- High-performance parallel archiving
- Coverage of a wide variety of applications, databases and such
- Storage of data in three successive tiers (local, NAS and object)
- Selective recall
The infrastructure includes:
- Versity software
- PowerEdge 15G servers such as the R750
- PowerVault locally attached arrays
- PowerScale NAS appliances
- ECS object appliances
A future post will cover more details on what this solution could look like for Happy Piggy Bank.
Versity targets customers that have 5PB of data or more that can be archived.