The value of implementing a robust data platform lies in its ability to bring a broad spectrum of data sources and types into a common framework, where a full suite of data analytics techniques can uncover hidden or latent information. While there are common analytic applications that almost every organization knows about, there are probably as many or more yet to be discovered and developed. Many organizations acknowledge that the backlog of proposed applications based in part on analytics insight is overwhelming. Many sources of data in large organizations have yet to be profiled, let alone enhanced and merged into an analytics pipeline that feeds value into a software application or report.
All digital data has structure when it is committed to a storage medium. Some examples include:
- Files have a size property and a filetype (application, text, binary).
- Text files have an encoding scheme.
- Images have dimensional size and color depth encoding.
- Audio has bit rate and frequency range.
These characteristics impact the requirements for a data platform. Some file systems are better suited to handling many small files, while others handle fewer, larger files better. For audio and other stream-based data, data engineers must choose buffer sizes and file creation characteristics that match the capabilities of the platform; these choices may also affect the complexity of using the data for analysis.
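Profiling these structural characteristics is often the first step in sizing a platform. The sketch below collects a few basic properties of a file; the `profile_file` helper and its text-versus-binary heuristic (checking a sample for NUL bytes) are illustrative assumptions, not a standard API:

```python
from pathlib import Path

def profile_file(path):
    """Collect basic structural properties of a file: size, suffix,
    and a crude text/binary classification (hypothetical heuristic)."""
    p = Path(path)
    info = {"name": p.name, "size_bytes": p.stat().st_size, "suffix": p.suffix}
    with open(p, "rb") as f:
        sample = f.read(1024)
    # Heuristic: NUL bytes in the first 1 KB suggest a binary format.
    info["kind"] = "binary" if b"\x00" in sample else "text"
    return info
```

A real profiling pass would also capture encoding, image dimensions, or audio bit rate, depending on file type.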
The more you know about the final stages of your analysis pipelines, the more intelligence you can build into the early stages of data management. One temptation to resist, if possible, is down-sampling the data to match the capabilities or preferences of the reporting and modeling layers. Although storing high-fidelity data that is not required for analysis may seem wasteful, think of it as an insurance policy against changing analysis requirements. Storing data in a form that matches the data generation process as closely as possible can provide many clues, should questions regarding data reliability or quality arise later. For archival, you can always apply down-sampling or other lossy forms of compression.
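The point about down-sampling being a one-way door can be made concrete. In this sketch (an assumed `downsample_mean` helper, not from the text), a reduced view is derived from the raw samples while the raw copy is kept intact; the raw data can always regenerate the view, but not the reverse:

```python
def downsample_mean(samples, factor):
    """Lossy reduction: average each block of `factor` samples.
    The derived view cannot reconstruct the raw series it came from."""
    blocks = (samples[i:i + factor] for i in range(0, len(samples), factor))
    return [sum(block) / len(block) for block in blocks]

raw = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0]   # high-fidelity copy, kept as-is
view = downsample_mean(raw, 2)           # reduced copy for reporting
```

Keeping `raw` means a future model that needs full resolution is not blocked by an earlier reporting decision.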
Another aspect of data management that surprises IT professionals is the storage required to manage multiple copies of data being used for analysis. Even the most seasoned data science professionals maintain many copies of data that by some appearances are identical. There are several important reasons why this situation is necessary:
- Isolation from uncontrolled change. Both report and model development must be isolated from uncontrolled change, so the initial copy is typically a direct copy of the source with little or no transformation. This measure ensures that developers can always return to a ground-truth version of the data, which can be used to compare alternate transformation schemes with repeatability.
- Managing those alternate transformations. One common pattern is grouping and counting events by various factors such as time, geography, market segment, and so on.
- Efficiency. Complex data transformation pipelines should be developed in stages. Going all the way back to source data to test an incremental set of tasks late in the pipeline may be too inefficient, so data scientists often stage intermediate results to reduce the complexity and time investment of running the pipeline from the absolute beginning.
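The efficiency point above is the logic behind stage caching. This minimal sketch (the `run_stage` helper is hypothetical; real pipelines use orchestration tools) caches each stage's output so a downstream stage can be iterated on without recomputing everything from source:

```python
import json, os

def run_stage(name, fn, inputs, cache_dir="stage_cache"):
    """Run a pipeline stage, caching its JSON-serializable output.
    On a repeat run the cached result is returned instead of
    recomputing from the beginning of the pipeline."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = fn(inputs)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

Each cached file is, of course, another copy of the data, which is exactly the storage pressure described above.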
This list is not exhaustive, but it should suggest some ways to assess the sizing of a data platform. More importantly, it can help you assess the flexibility that candidate platforms provide for expanding and tiering storage. Another requirement, deriving in part from the data-copy management challenge, is tracking the metadata associated with transformation logic and history. Creating many copies of the same data may seem reasonable in the heat of shipping a project, but six months later it will be difficult to ascertain why each copy exists.
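Even a very small lineage record answers the "why does this copy exist?" question. This sketch (the manifest layout and `record_lineage` helper are assumptions for illustration; real platforms use catalog tooling) records the source, the transformation applied, a timestamp, and a content hash for each derived copy:

```python
import hashlib, time

def record_lineage(manifest, source_name, derived_name, transform_desc, data_bytes):
    """Append a lineage entry for a derived copy: what it came from,
    what was done, when, and a content hash for later verification."""
    manifest.append({
        "source": source_name,
        "derived": derived_name,
        "transform": transform_desc,
        "created_at": time.time(),
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
    })
    return manifest
```

The hash lets you verify months later that a copy on disk still matches what the recorded transformation produced.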
There is a growing interest in platforms that include feature stores. The concept is both to better track logic and metadata and to promote a more disaggregated approach to data management. If the only difference between two datasets is how the customer dimension is managed, then you should keep two copies of that feature rather than two copies of the entire dataset. This is a simple example of the basic idea: reusing transformation logic to manage frequently used dimensions, such as customers and products, independently of all the other features and the analysis datasets in which they are used could greatly simplify data management.
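The disaggregation idea can be sketched in a few lines. Here a single shared customer feature is joined onto each analysis dataset at read time, instead of baking a per-dataset copy of the dimension into every dataset (the names `customer_segment` and `attach_segment` are illustrative, not a feature-store API):

```python
# One shared copy of the customer dimension, maintained in a single place.
customer_segment = {"c1": "retail", "c2": "enterprise"}  # hypothetical feature

def attach_segment(rows, feature):
    """Join the shared customer feature onto a dataset, so each
    analysis dataset reuses one managed copy of the dimension."""
    return [{**row, "segment": feature.get(row["customer_id"])} for row in rows]

sales = [{"customer_id": "c1", "amount": 10}]
support = [{"customer_id": "c2", "tickets": 3}]
```

When the segmentation logic changes, only `customer_segment` is updated; every dataset that joins it picks up the change.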