The use of pipeline analogies for describing data work is popular. However, generic discussions only go so far in developing strategies for choosing tools and processes appropriate to any specific use case. The first step in assessing the potential value of adopting a data platform for your organization is to develop as complete a library of your data pipelines as possible. Remember that some data sources will be an important part of many pipelines, while other sources may be specific to a single analysis task.
Tracking these details is important because they drive the scalability and reliability requirements that you weigh against the features of a data platform. It may also be useful to:
You may find that one platform is not adequate for all the needs of the organization, but in most situations there are many commonalities.
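As an illustration only, the sketch below shows one way such a pipeline library could be captured; the pipeline and source names are hypothetical. Recording which sources feed which pipelines makes the heavily shared, high-impact sources easy to spot.

```python
from collections import Counter

# Hypothetical inventory: each pipeline lists the data sources it depends on.
pipelines = {
    "daily_sales_report": ["pos_transactions", "product_master", "store_master"],
    "churn_prediction": ["crm_events", "pos_transactions", "support_tickets"],
    "inventory_forecast": ["pos_transactions", "product_master", "warehouse_scans"],
}

# Count how many pipelines depend on each source; heavily shared sources
# carry the strongest scalability and reliability requirements.
usage = Counter(src for sources in pipelines.values() for src in sources)
for source, count in usage.most_common():
    print(f"{source}: used by {count} pipeline(s)")
```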
Figure 1: Generic data pipeline
Figure 1 shows a typical, generic data analytics pipeline with the end-to-end functional categories that are required for many types of data work. A high-level view like this is not enough for evaluating a data platform investment. The task details within a category like Collect (for example, how many data sources and of what types) significantly impact the features that you need from a data platform. The potential variety and complexity of the Enrich category is often underestimated when assessing tools and storage performance.
Each of the pipeline processing categories in Figure 1 is also a market for specialty software that applies only to that category. Different platforms and specialty applications may use different terminology than the Collect, Enrich, Report, Serve, and Predict terms shown here. However, the concepts and functional requirements are generally the same.
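To make these categories concrete, here is a minimal Python sketch that treats each category from Figure 1 as a stage in a simple pipeline. The stage bodies are placeholders, not a reference implementation; in practice each would be backed by whatever collection, transformation, serving, and modeling tooling the platform provides.

```python
# Minimal sketch: each functional category from Figure 1 as a pipeline stage.
# The bodies are placeholders standing in for real platform tooling.

def collect(sources):
    """Pull raw records from each configured data source."""
    return [record for source in sources for record in source()]

def enrich(records):
    """Clean, join, and augment raw records; often the most complex stage."""
    return [dict(r, enriched=True) for r in records]

def report(records):
    """Summarize enriched data for human consumption."""
    print(f"{len(records)} records ready for reporting")

def serve(records):
    """Expose enriched data to downstream applications."""
    return {i: r for i, r in enumerate(records)}

def predict(records):
    """Apply a (placeholder) model to the enriched data."""
    return [0.5 for _ in records]

# Wire the stages together end to end.
raw = collect([lambda: [{"id": 1}, {"id": 2}]])
curated = enrich(raw)
report(curated)
api = serve(curated)
scores = predict(curated)
```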
Data platforms that meet all or most of the needs of your data pipelines simplify the process of getting from raw source data to insights. Any time data in the pipeline must move between platforms, there is a real possibility of introducing complexity, both during development and in sustaining operations.
The value of implementing a robust data platform lies in combining a broad spectrum of data sources and types, which can contain hidden or latent information, with a common framework for applying a full suite of data analytics techniques. While there are common analytic applications that almost every organization knows about, there are probably as many or more yet to be discovered and developed. Many organizations acknowledge that the backlog of proposed applications based in part on analytics insight is overwhelming. Many sources of data in large organizations have yet to be profiled, let alone enhanced and merged into an analytics pipeline that feeds value into a software application or report.
All digital data has structure when it is committed to a storage medium. Some examples include:
These characteristics impact the requirements for a data platform. Some file systems are better suited to handling many small files, while others are better at fewer, larger files. For audio and other stream-based data, data engineers have a choice of buffer size and file creation characteristics that must be matched to the capabilities of the platform and may also affect the complexity of using the data for analysis.
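As a rough illustration of that choice, the sketch below (with hypothetical chunk sizes and file names) writes the same stream twice: once with a small buffer that produces many small files, and once with a large buffer that produces a few larger ones, which is exactly the trade-off that must be matched to the file system.

```python
import io

def write_stream_to_files(stream, buffer_size, prefix="capture"):
    """Flush a stream to disk in fixed-size chunks; the buffer size decides
    whether the result is many small files or a few larger ones."""
    files, index = [], 0
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        name = f"{prefix}_{index:05d}.bin"
        with open(name, "wb") as f:
            f.write(chunk)
        files.append(name)
        index += 1
    return files

# Hypothetical 1 MiB captured "audio" stream written two ways.
data = io.BytesIO(b"\x00" * 1_048_576)
small = write_stream_to_files(data, buffer_size=16 * 1024)                 # 64 small files
data.seek(0)
large = write_stream_to_files(data, buffer_size=256 * 1024, prefix="big")  # 4 larger files
```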
The more you know about the final stages of your analysis pipelines, the more intelligence you can build into the early stages of data management. One practice to resist, if possible, is down sampling the data to suit the capabilities or preferences of the reporting and modeling tools. Although storing high-fidelity data that is not required for analysis may seem wasteful, think of it as an insurance policy against changing analysis requirements. Storing data in a form that matches the data generation process as closely as possible can provide many clues should questions about data reliability or quality arise later. Down sampling or other lossy forms of compression can always be applied later to create archive copies.
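As a simple, hypothetical illustration of that ordering, the sketch below keeps a per-second sensor feed as the system of record and derives a lossy, per-minute average only for archival or lightweight reporting use.

```python
import random
from statistics import mean

# Hypothetical high-fidelity feed: one temperature reading per second for an hour.
raw = [20.0 + random.gauss(0, 0.5) for _ in range(3600)]

def downsample(samples, factor):
    """Lossy compression for archive: average each block of `factor` samples."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]

# Keep `raw` as the system of record; derive a one-reading-per-minute copy
# only for long-term archive or lightweight reporting.
archive = downsample(raw, factor=60)
print(len(raw), "raw samples ->", len(archive), "archived samples")
```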
Another aspect of data management that surprises IT professionals is the storage required to manage multiple copies of data being used for analysis. Even the most seasoned data science professionals consume many copies of data that appear, on the surface, to be identical. There are several important reasons why this situation is necessary:
This list is not exhaustive, but it should provide some ways to assess the sizing of a data platform. More importantly, it can help you assess the flexibility that candidate platforms provide for expanding and tiering storage. Another requirement that derives in part from the data copy management challenge is tracking the metadata associated with transformation logic and history. Creating many copies of the same data may seem reasonable in the heat of shipping a project, but six months later it will be difficult to ascertain why each copy exists.
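A minimal sketch of recording that metadata alongside each copy might look like the following; the catalog structure and field names are illustrative assumptions, not any particular tool's schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def register_copy(catalog, name, parent, transformation, rows):
    """Record why a copy exists: its parent dataset, the transformation applied,
    when it was created, and a fingerprint of the transformation logic."""
    catalog[name] = {
        "parent": parent,
        "transformation": transformation,
        "logic_fingerprint": hashlib.sha256(transformation.encode()).hexdigest()[:12],
        "created_at": datetime.now(timezone.utc).isoformat(),
        "row_count": rows,
    }

catalog = {}
register_copy(catalog, "sales_clean_v2", parent="sales_raw",
              transformation="drop rows with null store_id; cast amount to decimal",
              rows=1_204_331)
print(json.dumps(catalog, indent=2))
```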
There is growing interest in platforms that include feature stores. The concept is both to better track logic and metadata and to promote a more disaggregated approach to data management. If the only difference between two datasets is how the customer dimension is managed, then you should keep two copies of that feature rather than two copies of the entire dataset. This is a simple example to explain the basic idea. Reusing transformation logic to manage frequently used dimensions, such as customers and products, independently from all the other features, and from all the other analysis datasets in which they are used, could greatly simplify data management.
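The sketch below illustrates the idea with hypothetical customer records: the customer dimension is maintained once and joined into each analysis dataset at read time, rather than being copied into every dataset.

```python
# Hypothetical customer dimension, maintained once by shared transformation logic.
customer_dim = {
    101: {"segment": "enterprise", "region": "EMEA"},
    102: {"segment": "smb", "region": "AMER"},
}

# Two analysis datasets store only facts plus a customer key...
churn_facts = [{"customer_id": 101, "tickets": 4}, {"customer_id": 102, "tickets": 0}]
sales_facts = [{"customer_id": 102, "amount": 1250.0}]

def with_customer(facts):
    """Join the shared customer dimension in at read time instead of copying it."""
    return [dict(f, **customer_dim[f["customer_id"]]) for f in facts]

# ...and both pick up the same, single copy of the dimension when needed.
churn_dataset = with_customer(churn_facts)
sales_dataset = with_customer(sales_facts)
```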