Wed, 24 Apr 2024 15:27:10 -0000
|Read Time: 0 minutes
Do you have a Big Data mess? Do you have separate infrastructure for NoSQL databases like Cassandra, MongoDB, Neo4j, and Riak? I’ll bet that Kafka, Spark, and Elasticsearch are on separate gear too. Let’s throw in PostgreSQL, MariaDB, MySQL, Greenplum, and another database or two. We don’t want to forget machine learning with scikit-learn and Dask, nor deep learning with TensorFlow and PyTorch.
What if I told you that you could run all of them, including test/dev, QA, and prod, perhaps with multiple instances and different versions, all on the same multi-tenant, containerized platform?
Enter Robin Systems and their cloud native platform. Some of the features I find useful include:
As for use cases, here are some ideas:
Contact info for Mike King, Advisory System Engineer for DA / AI / Big Data, Dell Technologies | NA Data Center Workload Solutions
Links
Mon, 24 Apr 2023 14:12:49 -0000
Robin Systems SymWorld Cloud, previously known as Cloud Native Platform (CNP), is a killer upstream K8s distribution that you should consider for many of your workloads. I’ve been working with this platform for several years and continue to be impressed with what it can do. Some of the things I see value in for SymWorld Cloud include, but are not limited to, the following:
When it comes to workloads, there’s an extensive existing catalog, and if you need something that isn’t there, it can be added. Workloads that could be deployed include:
With respect to use cases some options might be:
Interested in learning more? We have an upcoming event on May 19th at 12 noon EST that is just what the doctor ordered. I’ll be a panelist for this webinar.
Topic: Solving the Challenges of deployment and management of Complex Data Analytics Pipeline
Register in advance for this webinar:
https://symphony.rakuten.com/dell-webinar-data-analytics-pipeline
After registering, you will receive a confirmation email containing information about joining the webinar.
If you just can’t wait until then, feel free to reach out to me at Mike.King2@dell.com to discuss your challenge further.
Thu, 09 Feb 2023 20:47:00 -0000
Cassandra is a popular NoSQL database in a crowded field of perhaps 225+ different NoSQL databases. Backing up a bit, there is a taxonomy for NoSQL with four types:
Cassandra is an excellent replacement for HBase when migrating away from Hadoop to something like our Data Lakehouse solution here and here. More in a future post on this solution. What does wide column actually mean? It’s simply a key-value pair with an amorphous, typically large payload (the value). One of the cool things I learned when designing my first HBase database about nine years back was that the payload can vary from record to record, which blew my mind at the time. All I could think of was garbage data, low-quality data, no schema… what a mess. But for some strange reason, folks don’t seem to care much about those items and are more concerned with handling growth, scale-out, and performance.
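To make “wide column” concrete, here is a minimal toy sketch in Python (my own illustration, not Cassandra’s or HBase’s actual storage engine) showing rows addressed by a row key, where each row can carry a completely different set of columns:

```python
# Toy wide-column store: each row key maps to its own set of
# column/value pairs, so the effective schema varies record to record.
class WideColumnTable:
    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def put(self, row_key, **columns):
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return row if column is None else row.get(column)

table = WideColumnTable()
# Two records with entirely different column sets -- legal here,
# just as varying payloads are legal in a wide-column database.
table.put("user:1", name="Ada", city="London")
table.put("user:2", name="Linus", loans=3, last_login="2023-01-31")

print(table.get("user:1"))           # {'name': 'Ada', 'city': 'London'}
print(table.get("user:2", "loans"))  # 3
```

The row keys and column names above are made up for illustration; the point is simply that nothing forces record two to look like record one.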
Cassandra comes in two versions: the first is the community edition and the second is the DataStax edition (DSE). DataStax offers support for both and has excellent services capability after their purchase of The Last Pickle. From my experience across my customer base, I see about a 50/50 split. I think DSE is well worth the cost for most customers, but then again that’s a choice, and the voices against paying for it seem to be getting stronger.
Cassandra clusters should have a number of nodes evenly divisible by three. I like to start with six myself. As for storage, one can probably get by with value SAS (vSAS) read-intensive (RI) SSDs. More, smaller-capacity SSDs will give you more IOPS. 10GbE NICs should suffice, but I favor 25GbE these days due to economics, value, and future-proofing: one can get 150% more throughput for about a 25% price uplift. Sorry Cisco, but 40GbE is dead and will go the way of the dodo bird. The cores you need can vary but tend to be in the range of 12-16 cores per socket. Most of the time I’m looking for value here; I avoid top-end processors due to cost, and generally they’re not needed. If I needed lots of cores I would look at some of our AMD servers, but for this exercise we will consider Intel as it’s far more prevalent. For us at Dell this means an R650 Ice Lake server, where we can squeeze a lot into 1U.
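The nodes-divisible-by-three guidance lines up with Cassandra’s common replication factor of three. A back-of-the-envelope capacity sketch (my own rule of thumb, with an assumed 50% headroom for compaction and repairs; tune both numbers for your cluster) might look like this:

```python
# Back-of-the-envelope Cassandra sizing. The replication factor of 3
# and the 50% headroom are assumptions, not fixed requirements.
def usable_capacity_tb(nodes, drives_per_node, drive_tb,
                       replication_factor=3, headroom=0.5):
    # Raw capacity, divided by replicas, with free space reserved
    # for compaction and repair activity.
    raw = nodes * drives_per_node * drive_tb
    return raw / replication_factor * headroom

# Six nodes, each with eight hypothetical 1.92TB RI SSDs
print(usable_capacity_tb(6, 8, 1.92))  # ~15.36 TB usable
```

This is why adding capacity to Cassandra usually means adding nodes in multiples that keep the replica placement balanced.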
The specs for a six node cluster could look like this per node:
For your Cassandra needs contact me @ Mike.King2@dell.com to discuss your challenge further.
Mon, 06 Feb 2023 19:07:45 -0000
By far the most popular pub/sub messaging software is Kafka. Producers send data and messages to a broker for later use by consumers. Data is published to one or more topics, which are queues. Consumers read messages from a topic and track which ones they have read. A topic may have multiple consumers. Topics may be partitioned to enable parallel processing by brokers. Messages are retained per the topic’s retention settings and eventually deleted. Replicas create additional copies of your data to help prevent data loss.
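The mechanics above can be sketched with a tiny in-memory model (my own toy, not the Kafka client API): a topic is a set of partitioned logs, keys hash to partitions, and each consumer tracks its own read offset per partition.

```python
from collections import defaultdict

# Toy pub/sub topic: a partitioned log with per-consumer read offsets.
# This models the mechanics described above; it is not the Kafka API.
class Topic:
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]
        # consumer name -> list of next-to-read offsets, one per partition
        self.offsets = defaultdict(lambda: [0] * partitions)

    def publish(self, key, message):
        # Keyed messages hash to a partition, enabling parallelism
        # while keeping per-key ordering.
        self.partitions[hash(key) % len(self.partitions)].append(message)

    def consume(self, consumer, partition):
        offset = self.offsets[consumer][partition]
        log = self.partitions[partition]
        if offset >= len(log):
            return None  # caught up
        self.offsets[consumer][partition] = offset + 1  # advance offset
        return log[offset]

topic = Topic()
topic.publish("loan-42", "application received")
topic.publish("loan-42", "application scored")
p = hash("loan-42") % 2  # both messages landed in this partition

print(topic.consume("risk-service", p))   # application received
print(topic.consume("audit-service", p))  # each consumer reads independently
```

Note how two consumers each see every message: offsets belong to the consumer, not the message, which is the heart of the pub/sub pattern.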
Regarding your platform choice there are many options including:
Some tips:
What might this look like on some PE servers? For 15G Ice Lake servers, the most attractive choice would be the R650. It’s a 1U server with 10 drive bays, decent memory, and a wide selection of processors. A middle-of-the-road configuration might look something like the following:
For your Kafka needs, feel free to contact me at Mike.King2@dell.com to discuss your challenge further.
Mon, 06 Feb 2023 18:44:06 -0000
In the NoSQL Database Taxonomy there are four basic categories:
Although graph is arguably the smallest category by several measures, it is the richest when it comes to use cases. Here is a sampling of what I’ve seen to date:
We’re closely partnered with TigerGraph and can cover the above use cases and many more.
If you’d like to hear more and work on solutions to your problem please do drop me an email at Mike.King2@Dell.com
Mon, 06 Feb 2023 18:42:50 -0000
I’m amazed at how many companies I talk to that don’t have a discernible database strategy. Aggregate spending on database technology for software, services, servers, storage, networking, and people runs six figures for most medium-sized companies, and into the tens of millions per annum for large companies. So any way you slice it, it’s a large investment that warrants a strategy.
First, let’s consider the different kinds of database technologies out there: relational, time series, geo-spatial, GPU, OLAP, OLTP, HTAP, NewSQL, and NoSQL, including key-value, document, wide-column, and graph. All together there are probably 400-ish different choices, and many large companies have 10-20 of them floating around.
How does one get started?
If you’d like a free consultation on your particular dilemma, please do contact me at Mike.King2@Dell.com.
Fri, 02 Dec 2022 04:58:29 -0000
How about SingleStore for your database on 15G Dell PE Servers?
SingleStore is a distributed relational database that was previously called MemSQL. It is well suited to analytics workloads. There are two data structure constructs available: the first is the column store, which lives on disk (typically SSDs); the second is a row store that lives in memory and is essentially a key-value structure. Yes, you can have both types in the same database and join across the two different table types. Data is stored on leaf nodes, which hold the low-level detail, while aggregator nodes coordinate and summarize queries across the leaves. Clients connect to the aggregators and query via SQL.
SingleStore speaks the MySQL wire protocol, which makes it compatible with anything that can connect to MySQL.
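To illustrate the two table types, here is a toy Python model (my own sketch, not SingleStore’s engine): an in-memory keyed rowstore, an on-disk-style columnar store laid out one list per column, and a join across the two. In the real database you would express this as SQL over the MySQL protocol; the sample customers and amounts are invented.

```python
# Toy illustration of joining a key-value rowstore with a columnar
# store. SingleStore does this in SQL; this just mirrors the concept.
rowstore = {  # in-memory, keyed rows (rowstore analogue)
    1: {"name": "Happy Piggy Bank", "region": "NE"},
    2: {"name": "First Example Credit", "region": "SW"},
}
columnstore = {  # columnar layout: one list per column (columnstore analogue)
    "customer_id": [1, 1, 2],
    "amount":      [250.0, 80.0, 410.0],
}

# "Join" columnstore facts to rowstore attributes by key,
# aggregating amounts per customer along the way.
totals = {}
for cid, amt in zip(columnstore["customer_id"], columnstore["amount"]):
    totals[cid] = totals.get(cid, 0.0) + amt
report = {rowstore[cid]["name"]: total for cid, total in totals.items()}
print(report)  # {'Happy Piggy Bank': 330.0, 'First Example Credit': 410.0}
```

The columnar layout is what makes scans and aggregations over a single column fast; the keyed rowstore is what makes point lookups fast. Having both in one database is the pitch.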
Customers choose this database when they have demanding high performance analytics needs. We have many large financial customers that are very happy with it.
So what does it look like on the latest 15G Ice Lake servers from Dell?
Although it could run on almost any server, the leading candidate would be a Dell PowerEdge R650 for database sizes up to 400TB usable. Environments with larger database needs would use a Dell PowerEdge R750.
ROTs
Other items
5TB Env
100TB Env
400TB Env
If you need your SingleStore database on Dell PE servers, do let us know.
Thu, 15 Sep 2022 14:22:23 -0000
How often do we hear that a project has failed? Projected benefits were not achieved, ROI is less than expected, prediction results are degrading, and the list goes on.
Data scientists blame it on not having enough data engineers. Data engineers blame it on poor source data. DBAs blame it on data ingest, streaming, software, and such. Scapegoats are easy to come by.
Have you ever thought about why? Yes, there are many reasons, but one I run across constantly is data quality. Poor data quality is rampant throughout the vast majority of enterprises, and it remains largely hidden. From what I see, most companies reason: we’re a world-class organization with top-notch talent, we make lots of money, and we have happy customers; therefore we must have world-class data with high data quality. This is a pipe dream. If you’re not measuring it, it’s almost certainly bad, leading to inefficiencies, costly mistakes, bad decisions, high error rates, rework, lost customers, and many other maladies.
When I’ve built systems and databases in past lives, I’ve looked into the data, mostly with a battery of SQL queries, and found many a data horror: poor quality, defective items, wrong data, and more.
So if you want to know where you stand, you must measure your data quality and have a plan to measure the impact of defects and repair them as justified. I think most folks who start down this path quit because they attempt to boil the ocean and fix every problem they find. The best approach is to rank your data items in terms of importance and then measure perhaps the top 1-3% of them. In that way one can make the most impactful improvements with the least effort.
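A minimal sketch of what measuring those top-ranked fields could look like, using two common quality dimensions (completeness and validity). The records, field names, and the email rule here are all illustrative assumptions, not a recommended tool:

```python
import re

# Minimal data-quality measurement over a few high-priority fields.
# The sample records and the email validity rule are made up.
records = [
    {"customer_id": 1, "email": "ada@example.com", "balance": 120.5},
    {"customer_id": 2, "email": "not-an-email",    "balance": None},
    {"customer_id": 3, "email": None,              "balance": 88.0},
]

def completeness(rows, field):
    # Share of records where the field is populated at all.
    return sum(r[field] is not None for r in rows) / len(rows)

def validity(rows, field, rule):
    # Of the populated values, the share that pass the business rule.
    present = [r[field] for r in rows if r[field] is not None]
    return sum(bool(rule(v)) for v in present) / len(present)

email_rule = lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v)
print(f"email completeness:   {completeness(records, 'email'):.0%}")
print(f"email validity:       {validity(records, 'email', email_rule):.0%}")
print(f"balance completeness: {completeness(records, 'balance'):.0%}")
```

Running checks like these on only your few most important fields is exactly the top 1-3% approach: small effort, and the numbers make the hidden problem visible.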
The dimensions of data are varied and can be complex but from a data quality perspective they fall into six or more categories:
Using a tool is highly recommended. Yes, you probably have to pay for one. I won’t get into all the players here.
So, if you don’t have a data quality program then you should get started today because you do have poor data quality.
In a future post I’ll go into more about data quality measures.
If you’d like a free consultation on your particular situation, please do contact me at Mike.King2@Dell.com.
Mon, 08 Aug 2022 17:00:07 -0000
I consult with various customers on their AI/ML/DL needs while coming up with architectures, designs and solutions that are durable, scalable, flexible, efficient, performant, sensible and cost effective. After having seen perhaps 100 different opportunities I have some observations and yes suggestions on how to do things better.
Firstly, there’s an overwhelming desire for DIY. On the surface the appeal is that it’s easy to download the likes of TensorFlow with pip, add some Python code to point at four GPUs on four different servers, collect some data, train your model, and put it into production. This thinking rarely considers concurrency, multi-tenancy, scheduling, management, sharing, and more. What I can safely state is that this path is the hardest, takes the longest, costs more, creates confusion, fosters low asset utilization, and leads to high project failure rates.
Secondly, most customers are not concerned with infrastructure, scalability, multi-tenancy, architecture, and the like at the outset. These are afterthoughts; their approach to building a house is “let’s get started so we can finish sooner. We don’t need a plan, do we?”
Thirdly, most customers are struggling, so when they reach out to their friends down the road, who are all moving slowly, doing it themselves, and struggling too, it seems okay, right?
I think there’s a much better way, and it all has to do with the software stack. A cultivated software stack that can manage jobs, configure the environment, share resources, scale out easily, schedule jobs based on priorities and resource availability, support multi-tenancy, and record and share results is just what the doctor ordered. It can be cheap and efficient, speed up projects, and improve the success rate. Enter cnvrg.io, now owned by Intel, and you have a best-of-breed solution to these items and much more.
Recently I collected some of the reasons why I think cnvrg.io is the cultivated AI stack you need for all your AI/ML/DL projects:
Cnvrg.io is a Dell Technologies partner and we have a variety of solutions and platform options. If you’d like to hear more please do drop me an email at Mike.King2@Dell.com
Tue, 10 May 2022 19:18:45 -0000
So you're buried in data, you can't afford to expand, your performance is bad and getting worse, and your users can't find what they need. Yes, it's a tsunami of data that's the root cause of your problems. You ask your Mom for advice and she says, "Why don't you watch that TV show called Hoarders?" You watch a few episodes and can relate to the problem, but they offer no workable solutions for our excess data. Then you talk to Mike King over at Dell and he says, "That problem has been around since the ENIAC."
The bottom line is that almost all systems are designed to store certain kinds of data for a pre-determined amount of time (retention). If you don't have retention rules, then you failed as an architect. The solution for data hoarding is much more recent, evolving over the last 40 years or so. It was first called data archiving, a term still used today by some. The concept is really simple: one takes the data that is no longer needed and removes it from the system of record. If the data is still needed, but far less frequently, then it is moved to a cheaper form of storage.
The discipline that evolved around this practice was first called data lifecycle management (DLM) and later information lifecycle management (ILM). ILM considers many more aspects of the archiving process in a more holistic sense, including policies, governance, classification, access, compliance, retention, redaction, privacy, recall, query, and more. We won't get into all the ILM stuff in this post.
Let's take a concrete example to get started. We have a regional bank called Happy Piggy Bank. They do business in 30 states and have supporting ERP applications like Oracle EBS, databases such as Greenplum and SingleStore for analytics, and Hadoop for an integrated data warehouse and AI platform. The EBS database has six years of data and a stout 600TB. The Greenplum database is around 1PB and stores just 90 days of data. SingleStore is new, but they have big plans; it's at 200TB today and will grow to 3PB in a year. The Hadoop environment is the largest of all, with detail transactions and account statements going back 10 years, storing 10PB of raw data. Only the Greenplum database has a formal purge program that was actually written and put into production; the Hadoop and EBS environments have none.
The first order of business is to determine how much data they should or need to retain. This is mostly a business activity. The next step is to determine the access patterns: to do data archiving, one needs to determine the active portion of the data. In most systems, perhaps 99% of the access is constrained to a smaller portion of the retention continuum. Consider that EBS database and its six years of data. We might run some reports and do some analysis, and it's highly likely that 90% of the data is less than six months old; let's say 99% is less than one year old. In this case we should target the five oldest years of retention (83% of the data, or 498TB of the database) to migrate to a more cost-effective platform. In a similar fashion, we determine that 60% of the Hadoop data is accessed less than 1% of the time, so that's a 6PB chunk we can lop off the Hadoop system. So for Happy Piggy Bank we have determined we can remove 6.5PB of data from two of the systems, which will yield the following benefits:
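The Happy Piggy Bank arithmetic above can be sketched in a few lines (the percentages and sizes come straight from the example; only the helper function is my own):

```python
# Archiving arithmetic for the Happy Piggy Bank example.
def archivable_tb(total_tb, cold_fraction):
    # Data in the rarely-accessed portion of the retention continuum.
    return total_tb * cold_fraction

ebs_cold = archivable_tb(600, 0.83)        # oldest 5 of 6 years, ~83%
hadoop_cold = archivable_tb(10_000, 0.60)  # 60% accessed <1% of the time
total_pb = (ebs_cold + hadoop_cold) / 1000

print(f"EBS archivable:    {ebs_cold:.0f} TB")     # 498 TB
print(f"Hadoop archivable: {hadoop_cold:.0f} TB")  # 6000 TB
print(f"Total:             {total_pb:.1f} PB")     # 6.5 PB
```

The exercise generalizes: estimate the cold fraction per system from access patterns, multiply by the system size, and the sum is your archiving target.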
So you ask, what might the solution be? Enter Versity, a partner of Dell Technologies enabled through our OEM channel. Versity is a full-featured archiving solution that enables:
The infrastructure includes:
A future post will cover more details on what this solution could look like for Happy Piggy Bank.
Versity targets customers that have 5PB of data or more that can be archived.
Tue, 05 Oct 2021 19:01:57 -0000
It’s a rare day that a free tool exists that can help profile customer workloads to the mutual benefit of all. Live Optics (previously DPACK) is a diamond in the rough that is truly a win-win proposition for customers and vendors such as Dell. I’ve been using it for years, and I rarely run it without learning something useful.
The tool is like SAR on steroids. Data is collected for each host; hosts can be VMs, and servers can be from any manufacturer. The data collected covers IOPS (size and amount), memory usage, CPU usage, and network activity. It can be run in local mode, where the data doesn’t go anywhere else, or the data can be stored in a Dell private cloud. The latter is more beneficial, as the data may then be accessed by folks in many roles for various assessments. The data may also be mined to help Dell make better decisions about current and future products based on actual observed user profiles.
I use Live Optics to profile database workloads like Greenplum and Vertica, Hadoop, and NoSQL databases like MongoDB, Cassandra, MarkLogic, and more.
Upon inspection of the workload, the data collected helps facilitate more meaningful discussions with various SMEs and helps right-size future designs. In one case I found a customer that was using less than half their memory during peak periods, so we suggested new server BOMs with much less memory, as they didn’t need what they had.
Can we help you with assessing your workloads of interest on our servers or those of our competitors?
Some links of interest
Fri, 06 Aug 2021 21:31:26 -0000
Robin Systems has a most excellent platform that is well suited to simultaneously running a mix of workloads in a containerized environment. Containers offer isolation of varied software stacks. Kubernetes is the control plane that deploys the workloads across nodes and allows for scale-out, adaptive processing. Robin adds customizable templates and lifecycle management to the mix to create a killer platform.
AI, which includes machine learning with the likes of scikit-learn with Dask, H2O.ai, Spark MLlib, and PySpark, along with deep learning with TensorFlow, PyTorch, MXNet, Keras, and Caffe2, can all run simultaneously in Robin. Nodes are identified by their resources during provisioning: cores, memory, GPUs, and storage.
Cultivated data pipelines can be constructed with a mix of components. Consider a use case with ingest from Kafka, storage to Cassandra, and then a Spark MLlib job to find loans submitted last week that will be denied. All of that can be automated with Robin.
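The shape of that pipeline can be sketched as three plain Python stages (stand-ins I wrote for illustration; in practice each stage would be a real Kafka topic, a Cassandra table, and a Spark MLlib model, all deployed and wired together by Robin; the loan data and the scoring rule are made up):

```python
# Toy version of the pipeline stages Robin would orchestrate:
# ingest (Kafka stand-in) -> store (Cassandra stand-in) -> score
# (MLlib stand-in).
def ingest():
    # Messages as they might arrive from a Kafka topic.
    return [
        {"loan_id": 1, "income": 95_000, "requested": 20_000},
        {"loan_id": 2, "income": 30_000, "requested": 45_000},
    ]

def store(messages, table):
    for m in messages:
        table[m["loan_id"]] = m  # keyed rows, as in a Cassandra table
    return table

def score(table):
    # Stand-in model: deny when the request exceeds half of income.
    return [lid for lid, m in table.items()
            if m["requested"] > m["income"] / 2]

table = store(ingest(), {})
denied = score(table)
print(f"loans predicted to be denied: {denied}")  # [2]
```

The value Robin adds is that each of these stages becomes a managed, templated service rather than hand-wired glue code.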
The as-a-service aspect for things like MLOps and AutoML can be implemented with a combination of Robin capabilities and other software to deliver a true AI-as-a-Service experience.
Nodes to run these workloads can support disaggregated compute and storage. Some sample servers might be a combination of Dell PowerEdge C6520s for compute and R750s for storage. The compute servers are very dense and can run four server hosts in 2U, offering a full range of Intel Ice Lake processors. The storage-node R750s can have onboard NVMe or SSDs (up to 28). For the OS image, a hot-swappable M.2 BOSS card with self-contained RAID1 can be used for Linux on all 15G servers.
Wed, 02 Jun 2021 22:01:28 -0000
First you might ask what a GPU database actually is. In a nutshell, it's typically a relational database that can offload certain operations to a GPU so that queries run faster. There are three players in the space: Kinetica, SQream, and OmniSci. By all measures Kinetica is the leader, which is one of the key reasons we've chosen to partner with them through our OEM channel.
The next question is what kinds of things a GPGPU database can do for you. Some ideas for your consideration might be:
One of the coolest things I've found to date with Kinetica is that it only runs a query on the GPU where it can be accelerated: essentially joins, computations, and math operations. Queries involving a string search are run on the CPUs. In this manner, the entire workload can collectively be accelerated.
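The routing idea can be illustrated with a toy dispatcher (my own sketch of the concept, not Kinetica's actual query planner; the operation names are invented labels):

```python
# Toy query planner illustrating the routing idea: operations a GPU
# accelerates well (joins, math) go one way, string searches the other.
GPU_FRIENDLY = {"join", "sum", "avg", "multiply"}

def plan(operations):
    placement = {"gpu": [], "cpu": []}
    for op in operations:
        target = "gpu" if op in GPU_FRIENDLY else "cpu"
        placement[target].append(op)
    return placement

query_ops = ["join", "sum", "like_search", "avg"]
print(plan(query_ops))
# {'gpu': ['join', 'sum', 'avg'], 'cpu': ['like_search']}
```

The point is that the split happens per operation within a single query, so a query mixing math and string predicates still gets partial acceleration.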
These databases run on servers with direct attach storage capable of running NVidia GPUs. In the Dell 14G product family the most common servers are R740, R740XD and R940XA servers. For 15G the most appealing are R750, R750XA and XE8545 servers. Other models are certainly possible but less common. For purposes of this article we will focus on the R750XA. This brand new server is based on Ice Lake processors and sports two sockets with up to 40 cores per socket for a maximal possible number of cores per server of 80. A pair of top end A100 GPUs can configured with an NVLink bridge to enable interlinks of 600GB/s. Systems can be configured with up to 6TB of memory including the latest 200 series optane modules. Local storage is most common and this server can house up to eight 2.5" drives which can be either NVMe or SSD. I know you're thinking what if my database can't fit on a single server. Luckily the answer is simply to use more servers. Kinetica can shard the db across n nodes.
If you want to learn more about Kinetica on Dell PE servers drop me a line at Mike.King2@Dell.com