Dell ObjectScale Data Path Overview Part I
Wed, 10 Jan 2024 16:13:45 -0000
Dell ObjectScale is the next evolution of object storage from Dell Technologies: scale-out, high-performance, containerized object storage built for the toughest applications and workloads, from generative AI to analytics and beyond. Its architecture is built on microservices principles to promote resiliency and flexibility. ObjectScale branches from the Dell ECS codebase and has been re-platformed to use the native orchestration capabilities of Kubernetes: scheduling, load balancing, self-healing, resource optimization, and so on.
This blog focuses solely on the data path of ObjectScale. To learn more about ObjectScale overall, see the ObjectScale overview and architecture documentation.
Before we introduce the data path, let’s first review chunks, metadata, and the B+ tree. If you are already familiar with these concepts, skip ahead to Dell ObjectScale Data Path Overview Part II, which introduces data protection and dataflow.
Chunks
A chunk is a logical container that ObjectScale uses to store all types of data, including object data, custom client-provided metadata, and system metadata. Each chunk holds 128 MB of data from one or more objects, as shown in the following figure.
Figure 1. Composition of a chunk
Creating an object in ObjectScale involves writing both data and metadata. The metadata is held in journal chunks and B+ tree chunks. Data chunks are protected by either inline erasure coding or triple mirroring plus in-place erasure coding, depending on the size of the object data, while the B+ tree and journal chunks that store the metadata are triple mirrored. For more information, see Dell ObjectScale Data Path Overview Part II.
Metadata
Metadata provides information about one or more aspects of the data. It includes system metadata, such as owner, creation date, and size, as well as custom metadata. An example of custom metadata could be client=Dell, event=DellWorld, and ID=123.
Both system metadata and custom metadata are stored in B+ tree and journal chunks.
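Because ObjectScale exposes an S3-compatible API, custom metadata can be attached when an object is written. The following is a minimal sketch using boto3; the endpoint, credentials, bucket, and key are placeholders, not real ObjectScale defaults.

```python
import boto3

# Placeholder connection details; substitute your ObjectScale S3 endpoint
# and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectscale.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Custom metadata travels as x-amz-meta-* headers alongside the object data.
s3.put_object(
    Bucket="demo-bucket",
    Key="ImgA",
    Body=b"...object payload...",
    Metadata={"client": "Dell", "event": "DellWorld", "id": "123"},
)

# On read, a HEAD request returns the stored custom metadata.
print(s3.head_object(Bucket="demo-bucket", Key="ImgA")["Metadata"])
```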
Data management
ObjectScale maintains the metadata that tracks data locations and transaction history in logical tables and journals.
The tables hold key-value pairs that store information about the objects; a hash function enables fast lookup of the value associated with a key. These key-value pairs are stored in a B+ tree for fast indexing of data locations. To further enhance query performance of these logical tables, ObjectScale implements a two-level log-structured merge (LSM) tree: a smaller tree kept in memory (the memory table) and the main B+ tree residing on disk. A lookup of a key-value pair first queries the memory table; if the value is not in memory, the lookup falls through to the main B+ tree on disk.
Transaction history is recorded in journal logs, which are written to disk. The journals track index transactions that have not yet been committed to the B+ tree. After a transaction is logged in the journal, the memory table is updated. After a set period of time, or when the memory table becomes full, its contents are merged, sorted, and dumped to the B+ tree on disk, and a checkpoint is then recorded in the journal. A minimal sketch of this path follows the figure below.
Figure 2. Metadata management process in ObjectScale
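To make the flow concrete, here is a minimal single-process sketch of this two-level path; the class and field names are illustrative and are not ObjectScale internals.

```python
class MetadataStore:
    """Toy model: journal first, then memory table, flush to B+ tree."""

    def __init__(self, memtable_limit=4):
        self.journal = []      # append-only log (stands in for journal chunks)
        self.memtable = {}     # small in-memory table
        self.btree = {}        # stands in for the on-disk B+ tree
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.journal.append((key, value))   # 1. log the transaction
        self.memtable[key] = value          # 2. update the memory table
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                   # 3. table full: dump to B+ tree

    def get(self, key):
        # Lookups check the memory table first, then fall back to the B+ tree.
        if key in self.memtable:
            return self.memtable[key]
        return self.btree.get(key)

    def _flush(self):
        # Merge-sort the memory table into the on-disk tree, then checkpoint
        # the journal so replay can skip already-committed transactions.
        for key in sorted(self.memtable):
            self.btree[key] = self.memtable[key]
        self.memtable.clear()
        self.journal.append(("CHECKPOINT", None))

store = MetadataStore()
for i in range(5):
    store.put(f"obj{i}", f"C{i}:offset:length")
print(store.get("obj0"))   # served from the B+ tree after the flush
print(store.get("obj4"))   # still in the memory table
```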
ObjectScale uses several different tables, which can get quite large. To optimize the performance of table lookups, each table is divided into partitions that are distributed across the nodes. The node where the partition is written then becomes the owner or authority of that partition or section of the table.
One such table is the chunk table, which tracks the physical location of chunk fragments and replica copies on disk. The following table shows a sample partition of the chunk table. Each entry identifies a chunk’s physical location by listing the disk within the node, the file within the disk, the offset within that file, and the length of the data. Chunk C1 is erasure coded, and chunk C2 is triple mirrored, as shown in Table 1.
Table 1. Sample chunk table partition
Chunk ID | Chunk location |
C1 | Node1:Disk1:File1:offset1:length Node2:Disk1:File1:offset1:length Node3:Disk1:File1:offset1:length Node4:Disk1:File1:offset1:length Node5:Disk1:File1:offset1:length Node6:Disk1:File1:offset1:length Node7:Disk1:File1:offset1:length Node8:Disk1:File1:offset1:length Node1:Disk2:File1:offset1:length Node2:Disk2:File1:offset1:length Node3:Disk2:File1:offset1:length Node4:Disk2:File1:offset1:length Node5:Disk2:File1:offset1:length Node6:Disk2:File1:offset1:length Node7:Disk2:File1:offset1:length Node8:Disk2:File1:offset1:length |
C2 | Node1:Disk3:File1:offset1:length Node2:Disk3:File1:offset1:length Node3:Disk3:File1:offset1:length |
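For illustration, the location strings in Table 1 can be modeled as simple records. The field order below mirrors the table, though the actual on-disk encoding is internal to ObjectScale.

```python
from typing import NamedTuple

class ChunkLocation(NamedTuple):
    node: str
    disk: str
    file: str
    offset: str
    length: str

def parse_location(entry: str) -> ChunkLocation:
    # Split a Node:Disk:File:offset:length string into its five fields.
    return ChunkLocation(*entry.split(":"))

loc = parse_location("Node1:Disk1:File1:offset1:length")
print(loc.node, loc.disk)   # Node1 Disk1
```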
Another example is the object table, which maps object names to chunks. The following table shows an example of an object table partition, which lists, for each object, the chunk or chunks in which the object resides.
Table 2. Sample object table
Object name | Chunk ID |
ImgA | C1:offset:length |
FileA | C4:offset:length C6:offset:length |
A service called Atlas, which runs on all nodes, maintains the mapping of table partition owners. The following table shows an example of a portion of an Atlas mapping table.
Table 3. Sample Atlas mapping table
Table ID | Table partition owner |
Table 0 P1 | Node 1 |
Table 0 P2 | Node 2 |
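As a rough sketch of how any node could route a request to the right partition owner, the following assumes hash-modulo placement over the Atlas mapping; the hash scheme and partition count are illustrative only, not ObjectScale’s actual placement logic.

```python
import zlib

NUM_PARTITIONS = 2
ATLAS = {                        # stands in for Table 3
    ("Table 0", "P1"): "Node 1",
    ("Table 0", "P2"): "Node 2",
}

def partition_owner(table_id: str, key: str) -> str:
    # Hash the key to a partition, then look up the owning node in Atlas.
    partition = f"P{zlib.crc32(key.encode()) % NUM_PARTITIONS + 1}"
    return ATLAS[(table_id, partition)]

# Any node can compute which node to ask about a given object's metadata.
print(partition_owner("Table 0", "ImgA"))
```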
We will cover ObjectScale data protection and dataflow in Dell ObjectScale Data Path Overview Part II.
Resources
The following Dell Technologies documentation provides additional information related to this blog. Access to these documents depends on an individual’s login credentials. For access to a document, contact a Dell Technologies representative.
Author: Jarvis Zhu
Related Blog Posts
Dell ObjectScale Data Path Overview Part II
Wed, 10 Jan 2024 16:14:05 -0000
This blog is a continuation of Dell ObjectScale Data Path Overview Part I. Here, I cover data protection and dataflow. If you haven’t already, check out Part I for background on chunks, metadata, and the B+ tree.
Data protection methods
Creating an object in ObjectScale involves writing both data and metadata; the metadata comprises journal chunks and B+ tree chunks. Each is written to a separate logical chunk that contains ~128 MB of data from one or more objects. ObjectScale uses a combination of triple mirroring and erasure coding to protect the data.
- Triple mirroring ensures that three copies of data are written, protecting against two node failures.
- Erasure coding provides enhanced protection against disk and node failures, using the Reed-Solomon erasure coding scheme, which breaks chunks into data and coding fragments that are distributed evenly across nodes.
- ObjectScale uses 12+4 erasure coding: a chunk is broken into 12 data fragments plus 4 coding (parity) fragments, as sketched below.
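The arithmetic behind 12+4 is easy to check, assuming fragments are equal-sized slices of a 128 MB chunk (a simplification of the real layout):

```python
CHUNK_MB = 128
DATA_FRAGMENTS = 12
CODING_FRAGMENTS = 4

fragment_mb = CHUNK_MB / DATA_FRAGMENTS                          # ~10.67 MB each
overhead = (DATA_FRAGMENTS + CODING_FRAGMENTS) / DATA_FRAGMENTS  # raw vs. usable

print(f"fragment size: {fragment_mb:.2f} MB")            # 10.67 MB
print(f"storage overhead: {overhead:.2f}x")              # 1.33x
print(f"tolerates loss of {CODING_FRAGMENTS} fragments")
```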
Depending on the size and type of data, data is written using one of the data protection methods shown in the following table.
Table 1. Data protection methods based on data type and size
Type of data | Data protection method used |
Journal chunks / B+ tree chunks | Triple mirroring |
Object data <128 MB | Triple mirroring plus in-place erasure coding |
Object data >=128 MB | Inline erasure coding |
Note: On the ObjectScale appliance (XF960) with its NVMe architecture, object data uses inline erasure coding when it is 44 MB or larger. This blog assumes the default 128 MB chunk size.
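The selection rule in Table 1 can be expressed as a small function. This is a sketch only; the inline threshold parameter captures the XF960 difference noted above.

```python
def protection_method(data_type: str, size_mb: float,
                      inline_threshold_mb: int = 128) -> str:
    """Pick a protection scheme per Table 1 (sketch, not ObjectScale code)."""
    if data_type in ("journal", "btree"):
        return "triple mirroring"
    if size_mb >= inline_threshold_mb:
        return "inline erasure coding"
    return "triple mirroring plus in-place erasure coding"

print(protection_method("object", 150))       # inline erasure coding
print(protection_method("object", 22))        # triple mirroring plus in-place EC
print(protection_method("object", 100, 44))   # XF960: inline erasure coding
```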
Triple mirroring
The triple-mirror write method applies to ObjectScale journal and B+ tree chunks, of which ObjectScale creates three replica copies. Each replica copy is written to a single disk on a different node. This method protects the chunk data against two node or two disk failures.
Triple mirroring plus in-place erasure coding
This write method applies to data from any object that is less than 128 MB in size.
As an object is created, it is written as follows (a condensed sketch follows Figure 1):
- One copy is written in fragments that are spread across different nodes and disks.
- A second replica copy of the chunk is written to a single disk on a node.
- A third replica copy of the chunk is written to a single disk on a different node.
- Other objects are written to the same chunk until it contains ~128 MB of data. The erasure coding scheme calculates coding (parity) fragments for the chunk and writes these fragments to different disks.
- The second and third replica copies are deleted from disk.
- After this sequence is complete, the chunk is protected by erasure coding.
Figure 1. Process of triple mirroring plus in-place erasure coding
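The following is a condensed simulation of that lifecycle, with the protection state reduced to a few counters; the field names are illustrative only.

```python
def in_place_ec_lifecycle():
    # An open chunk: one copy striped in fragments across nodes, plus two
    # full replica copies on single disks of different nodes (three in total).
    chunk = {"data_mb": 0, "copies": 3, "parity_fragments": 0}

    # Objects smaller than 128 MB accumulate into the chunk while mirrored.
    for object_mb in (40, 60, 28):
        chunk["data_mb"] += object_mb

    # The chunk is full (~128 MB): compute parity fragments in place...
    assert chunk["data_mb"] >= 128
    chunk["parity_fragments"] = 4   # 12+4 scheme

    # ...then delete the second and third replica copies.
    chunk["copies"] = 1

    return chunk                    # protected by erasure coding alone

print(in_place_ec_lifecycle())
```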
Inline erasure coding
This write method applies to data from any object that is 128 MB or larger. Such objects are broken up into 128 MB chunks, the Reed-Solomon erasure coding scheme calculates coding (parity) fragments for each chunk, and each fragment is written to a different disk across the nodes.
Any remaining portion of an object that is less than 128 MB is written using the triple mirroring plus in-place erasure coding scheme. For example, if an object is 150 MB, the first 128 MB is written using inline erasure coding, and the remaining 22 MB is written using triple mirroring plus in-place erasure coding.
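A minimal sketch of this splitting logic, reproducing the 150 MB example:

```python
def plan_chunks(object_mb: float, chunk_mb: int = 128):
    """Split an object into chunk-sized pieces and assign each a scheme."""
    plans = []
    while object_mb >= chunk_mb:
        plans.append((chunk_mb, "inline erasure coding"))
        object_mb -= chunk_mb
    if object_mb > 0:
        plans.append((object_mb, "triple mirroring plus in-place erasure coding"))
    return plans

print(plan_chunks(150))
# [(128, 'inline erasure coding'),
#  (22, 'triple mirroring plus in-place erasure coding')]
```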
Checksums
Checksums are calculated per write unit, up to 2 MB. During write operations, the checksum is calculated in memory and then written to disk. On reads, the data is read along with its checksum; the checksum is recalculated in memory from the data that was read and compared with the checksum stored on disk to verify data integrity. Additionally, the storage engine periodically runs a background consistency checker that performs checksum verification over the entire dataset.
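Here is a minimal sketch of per-write-unit checksumming. CRC32 is used purely as a stand-in, since the blog does not name the checksum algorithm ObjectScale uses internally.

```python
import zlib

WRITE_UNIT = 2 * 1024 * 1024   # 2 MB write units

def write_with_checksums(data: bytes):
    # Checksum each write unit in memory, then persist unit + checksum together.
    return [(unit, zlib.crc32(unit))
            for unit in (data[i:i + WRITE_UNIT]
                         for i in range(0, len(data), WRITE_UNIT))]

def verify_on_read(stored):
    # Recompute each unit's checksum and compare with the stored value.
    for unit, checksum in stored:
        if zlib.crc32(unit) != checksum:
            raise IOError("checksum mismatch: data corruption detected")
    return b"".join(unit for unit, _ in stored)

payload = bytes(5 * 1024 * 1024)   # 5 MB -> three write units (2 + 2 + 1 MB)
assert verify_on_read(write_with_checksums(payload)) == payload
```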
Data flow
ObjectScale has a distributed architecture and includes a built-in load balancer (MetalLB by default) that chooses which node in the cluster responds to a read or write request.
The following figure and steps provide a high-level overview of a write dataflow; a condensed sketch follows the steps.
Figure 2. High-level overview of a write dataflow
- A write object request is received. In this example, Node 1 processes the request.
- Depending on the size of the object, the data is written to one or more chunks. Each chunk is protected using a data protection scheme such as triple mirroring plus in-place erasure coding or inline erasure coding. Before writing the data to disk, ObjectScale runs a checksum function and stores the result.
- In this example, the object is 150 MB and is divided into a 128 MB chunk (chunk 1) and a 22 MB chunk (chunk 2).
- Chunk 1 (128 MB) uses the inline erasure coding scheme (12+4).
- Chunk 2 (22 MB) uses the triple mirroring plus in-place erasure coding scheme.
- After the object data is written successfully, the object metadata is stored. In this example, Node 3 owns the partition of the object table to which this object belongs. As owner, Node 3 writes the object name and chunk IDs to this partition of the object table’s journal logs. Journal logs are triple mirrored, so Node 3 sends replica copies to three different nodes in parallel: Node 2, Node 3, and Node 4 in this example.
- Acknowledgment is sent to the client.
- In a background process, the memory table is updated.
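The following self-contained simulation condenses this write path, with in-process dictionaries standing in for distributed services; only the ordering of operations mirrors the steps above.

```python
CHUNK_MB = 128
journals = {f"Node {n}": [] for n in (2, 3, 4)}   # triple-mirrored journal replicas
memtable = {}                                     # partition owner's memory table

def write_object(name: str, size_mb: int) -> str:
    # Split into chunks: full 128 MB pieces get inline erasure coding,
    # any remainder gets triple mirroring plus in-place erasure coding.
    chunk_ids = [f"C{i + 1}" for i in range(-(-size_mb // CHUNK_MB))]

    # The object-table partition owner (Node 3 in the example) logs the
    # name -> chunk mapping to three journal replicas in parallel.
    for node in journals:
        journals[node].append((name, chunk_ids))

    ack = f"200 OK: {name} -> {chunk_ids}"        # acknowledge the client

    memtable[name] = chunk_ids                    # background memory-table update
    return ack

print(write_object("FileA", 150))   # two chunks: C1 (128 MB) and C2 (22 MB)
```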
The following figure and steps provide a high-level overview of a read dataflow; a condensed sketch follows the note below.
Figure 3. High-level overview of a read dataflow
- A read request is received for ImgA. In this example, Node 1 processes the request.
- Node 1 requests the chunk information from Node 2 (object table partition owner for ImgA).
- Knowing that ImgA is in C1 at a particular offset and length, Node 1 requests the chunk’s physical location from Node 3 (chunk table partition owner for C1).
- Now that Node 1 knows the physical location of ImgA, it requests the data from the node or nodes that contain the data fragment or fragments of that file. In this example, the location is Node 4, Disk 1. Node 4 performs a byte-offset read and returns the data to Node 1.
- Node 1 validates the checksum against the data payload and returns the data to the requesting client.
Note: In step 4, in an NVMe architecture such as the ObjectScale appliance, each node can read data directly from the other nodes. This contrasts with a hard-drive architecture, in which each node can read only its own data store and must then transfer the data to the requesting node.
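A matching simulation of the read path, with the Part I tables reduced to dictionaries; the real lookups are remote calls to the partition owners, shown here as local lookups.

```python
OBJECT_TABLE = {"ImgA": ("C1", 0, 1024)}           # partition owned by Node 2
CHUNK_TABLE = {"C1": "Node4:Disk1:File1:0:2048"}   # partition owned by Node 3
DISKS = {"Node4:Disk1:File1": bytes(2048)}         # raw chunk storage on Node 4

def read_object(name: str) -> bytes:
    chunk_id, offset, length = OBJECT_TABLE[name]  # ask the object-table owner
    node, disk, fname, *_ = CHUNK_TABLE[chunk_id].split(":")  # ask the chunk-table owner
    chunk_bytes = DISKS[f"{node}:{disk}:{fname}"]  # byte-offset read on Node 4
    data = chunk_bytes[offset:offset + length]
    # The requesting node verifies the checksum before returning (omitted here).
    return data

print(len(read_object("ImgA")))   # 1024
```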
Resources
The following Dell Technologies documentation provides additional information related to this blog. Access to these documents depends on an individual’s login credentials. For access to a document, contact a Dell Technologies representative.
Author: Jarvis Zhu
Announcing the Dell ObjectScale Appliance!
Tue, 21 Nov 2023 20:14:23 -0000
“A major element fueling the ascent of object storage is the emergence of all-flash object storage and organizations’ growing recognition of the value ‘fast object’ can provide them. All-flash object storage is clearly playing a key role in providing the modern infrastructure necessary to support these new initiatives.”
This observation comes from a report by Scott Sinclair of ESG Research.
Our latest ObjectScale release features the XF960: an enterprise-class, scale-out, high-performance, all-flash storage system built for the toughest applications and workloads. The XF960 hardware stack includes the servers, switches, rack-mount equipment, and appropriate power cables, all optimized to run the ObjectScale software. It is built on Dell Technologies’ 16th-generation PowerEdge R760 server.
We’re taking the ObjectScale experience to the next level by providing the latest software innovation as a fully integrated, turnkey solution.
Built with NVMe disks on Dell PowerEdge R760 servers, the ObjectScale appliance starts at 2.95 PB raw capacity and scales up to 11.79 PB raw capacity per rack to support the most data-intensive workloads, such as AI, machine learning, IoT, and real-time analytics applications.
ObjectScale appliance configurations include:
- 2U form factor
- Dual Intel processors with 32 cores per processor
- 256 GB of memory
- 24 x 30 TB NVMe TLC drives
- Dual mirrored boot drives
- Network:
- 100 Gb back-end switches (Dell-provided)
- 25 Gb front-end switches (customer-provided and customer-supported)
Note: The ObjectScale appliance supports a minimum of 4 nodes and a maximum of 16 nodes in a single-rack cluster; the raw capacity range quoted above follows directly from the node count, as shown below.
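The quoted raw capacities line up with this configuration if each “30TB-class” drive contributes 30.72 TB raw, which is an assumption on our part that reproduces the published figures.

```python
DRIVES_PER_NODE = 24
DRIVE_TB = 30.72   # assumed raw capacity of each "30TB-class" NVMe drive

for nodes in (4, 16):
    raw_pb = nodes * DRIVES_PER_NODE * DRIVE_TB / 1000
    print(f"{nodes:>2} nodes: {raw_pb:.3f} PB raw")
# Output:
#  4 nodes: 2.949 PB raw   (quoted as 2.95 PB)
# 16 nodes: 11.796 PB raw  (quoted as 11.79 PB)
```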
Two major performance enhancements with ObjectScale 1.3 and XF960 enable application development teams and data science teams to innovate even faster:
- 16th-generation PowerEdge server performance with full-stack NVMe connectivity. The XF960 embraces the NVMe over TCP protocol for its blazing-fast 100 Gb back-end network, accelerating node-to-node communication and unlocking the true throughput potential of the all-flash system, especially in large-scale deployments.
- Shared memory in the NVMe I/O path of the datahead service, introduced in the ObjectScale appliance running software version 1.3 to increase S3 throughput performance.
The ObjectScale appliance offers self-encrypting drives (SEDs): NVMe SSDs that transparently encrypt all on-disk data using an internal key and a drive access password. This encryption protects the system from data theft when a drive is removed. For key management, ObjectScale supports only the Secure Enterprise Key Manager (SEKM) in iDRAC.
The ObjectScale appliance supports encryption for:
- Boot drives
- Data drives (SED-based encryption provides better performance than software-based D@RE)
- Software-based D@RE (enabled at the system or bucket level)
The ObjectScale appliance supports hardware monitoring by leveraging the servers’ iDRAC. Monitoring can be enabled and disabled directly from the UI or API. Alerts include a detailed description of the issue, with the severity type, reason, and impacted resource.
The following workflow summarizes how the hardware monitoring works:
- UI or REST clients issue calls to manage hardware alerts.
- ObjectScale enables hardware alerts.
- iDRAC publishes alerts to the event listener and as Kubernetes events.
- The events are delivered to SNMP and Dell Support.
- ObjectScale polls for alerts every 60 seconds and displays them in the UI.
The ObjectScale appliance is deployed with the latest ObjectScale 1.3 software bundle package. For more information about Dell ObjectScale 1.3, see the latest Dell ObjectScale: Overview and Architecture white paper.
Conclusion
The Dell ObjectScale appliance gives organizations flexibility in deploying and managing enterprise-grade object storage. ObjectScale is a distributed system that has a layered architecture: every function is built as an independent layer, which provides horizontal scalability across all nodes and enables high availability. The ObjectScale software provides optimum protection, geo-replication, greater availability, and secure data access.
Resources
The following Dell Technologies documentation provides additional information about Dell ObjectScale. Access to these documents depends on an individual’s login credentials. For access to a document, contact a Dell Technologies representative.
- ObjectScale and ECS white papers:
- Dell Support resources:
Author: Jarvis Zhu