IT departments support running applications and software on servers, virtual machines, and containers. Changes such as updates, upgrades, and patching are planned so that they do not interrupt the users of an application. Data scientists, by contrast, need to run code in an environment that can accommodate dynamic changes and updates to the underlying libraries and other binaries. Customers and internal groups who support data science efforts at Dell EMC report that competing interests exist:
This conflict of interest generates friction and can lead to “shadow IT,” where knowledgeable users manage equipment running in a closet or under a desk. To deter shadow IT, IT leaders must provide data science teams with tools and processes that enable work to be done quickly inside a managed, secure, and stable environment into which IT has visibility.
The Domino Data Science Platform is built for open data science. Its execution model uses containers to encapsulate the specific requirements of each data science project, as shown in the following figure:
Figure 3. Domino Data Science Platform for Administrators
Each of these container images is called an “environment.” A container image can contain specific library and software versions, plus items that are typically managed at the IT level, such as certificate keys, proxy settings, application configurations, and proprietary runtimes. With this approach, IT can be confident that any stability or performance issue caused by a changing library version is isolated and does not affect the experience of other Domino users.
Domino Data Science Platform simplifies collaboration by allowing projects to be shared with individuals or teams, or opened to the entire organization. Its search functionality enables users to quickly find examples of data processing or model-building code used in other projects. Each project also has a discussion board, which keeps feedback for individual efforts clearly separated.
A benefit of having this functionality inside the Domino application is that it persists. When new members join the team, they instantly have access to the entire history of the project and can view any discussions, revisions, and results. This functionality accelerates data science in the organization by centralizing the institutional knowledge that is developed by data science teams in a single location.
Domino Data Science Platform accelerates data science by giving data science teams a self-service tool. Implementing this method effectively presents challenges: access to certain features and settings must be restricted to users with detailed knowledge or appropriate authority. Domino Data Science Platform provides a flexible, role-based privilege delegation scheme to limit the scope of individual users or groups of users. Commonly, IT enables some power users within the data science team to manage environments, with support from IT on best practices and complex configurations. For example, during the evaluation of Domino Data Science Platform by the Ready Solutions engineering team, we allowed any user to create environment definitions but limited full administrative privileges so that our data scientists could not modify global configuration parameters.
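The idea behind role-based privilege delegation can be sketched in a few lines. This is an illustrative model only, not Domino's actual API; the role names and actions are hypothetical. The point is that a power-user role can manage environments without holding the global administrative rights that would let it change platform-wide configuration.

```python
# Illustrative sketch of role-based privilege delegation (hypothetical
# roles and actions, not Domino's API): each role grants a limited set
# of actions, so power users can manage environments without holding
# global administrative rights.
ROLE_PERMISSIONS = {
    "practitioner": {"run_job", "view_project"},
    "environment_manager": {"run_job", "view_project", "create_environment"},
    "admin": {"run_job", "view_project", "create_environment",
              "edit_global_config"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role is permitted to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Under this scheme, granting a data scientist the `environment_manager` role allows environment creation while global configuration remains reserved for `admin`.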
Also, Domino Data Science Platform uses Keycloak for authentication services, which can federate authentication to LDAP and Active Directory. This method ensures that IT has insight into who has access without the need to directly manage the granular details of roles within the data science team. The following figure shows an example of the authentication structure:
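Keycloak exposes standard OpenID Connect endpoints, so a client can authenticate against it (and, transitively, against a federated LDAP or Active Directory backend) with an ordinary token request. The sketch below only constructs such a request without sending it; the server URL, realm name, client ID, and credentials are hypothetical, and the endpoint path can vary between Keycloak versions (older releases prefix it with `/auth`).

```python
# Build (but do not send) a token request against Keycloak's OpenID
# Connect token endpoint. All names and credentials are illustrative.
from urllib.parse import urlencode
from urllib.request import Request

def build_token_request(base_url: str, realm: str, client_id: str,
                        username: str, password: str) -> Request:
    """Construct a resource-owner password credentials grant request."""
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }).encode()
    return Request(url, data=body, headers={
        "Content-Type": "application/x-www-form-urlencoded"})

req = build_token_request("https://keycloak.example.com", "domino",
                          "demo-client", "jdoe", "s3cret")
```

Because Keycloak handles the federation, the client code stays the same whether the user record ultimately lives in Keycloak, LDAP, or Active Directory.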
Figure 4. Keycloak Users window
Domino Data Science Platform tracks the consumption of resources by each team, as shown in the following figure. This data can be aggregated to show total resource utilization across teams for demand forecasting or cost analysis. Tracking the use of specialized compute resources such as GPU accelerators helps administrators plan when to expand capacity.
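The kind of aggregation described above can be sketched in a few lines. The per-run records and team names here are hypothetical; this is not Domino's reporting API, only an illustration of rolling run-level usage up to team-level totals for one resource type.

```python
# Hypothetical per-run usage records: aggregate compute hours by team
# for a given resource type (e.g., GPU) for cost or capacity analysis.
from collections import defaultdict

runs = [
    {"team": "risk",  "resource": "gpu", "hours": 12.0},
    {"team": "risk",  "resource": "cpu", "hours": 40.0},
    {"team": "fraud", "resource": "gpu", "hours": 8.5},
    {"team": "fraud", "resource": "gpu", "hours": 3.5},
]

def usage_by_team(records, resource):
    """Total hours of the given resource type consumed by each team."""
    totals = defaultdict(float)
    for record in records:
        if record["resource"] == resource:
            totals[record["team"]] += record["hours"]
    return dict(totals)

gpu_hours = usage_by_team(runs, "gpu")
```

A report like this makes it easy to see which teams are driving GPU demand before deciding to add accelerators.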
Figure 5. Domino Data Science Platform resource tracking
Cutting-edge, open-source libraries and frameworks enable data science but create difficulties for system administrators, who prefer uniform, validated distributions that reduce compatibility issues. As data science projects grow and mature, they often develop different dependency chains. While working with Intel to grow the adoption of Analytics Zoo and BigDL, frameworks for machine learning and deep learning on Apache Spark, we often ran demos on stable releases while also running multiple development builds in parallel for experiments in the Dell EMC HPC and AI Innovation Lab. Many organizations address this issue with packaged environments such as conda and virtualenv, which are widely used in software development. However, these packaged environments focus only on code and libraries: they cannot track metadata or reproduce training jobs, and they generally fall short of being a good solution for data scientists.
The Domino Data Science Platform solves this problem with a containerized approach. Each environment starts from a base Domino Data Analytics Distribution image. An environment image is then built with the collection of libraries, frameworks, configuration settings, and artifacts defined in the environment definition, which is similar to a Dockerfile. This approach creates a “snapshot in time” that ensures that future runs of the same code, whether training a model or returning a prediction, behave consistently on the executor instance powering them. New executor images can be built as new top-level images, and Domino Data Science Platform tracks revisions of each image so that you can safely test changes without fear of losing the ability to run code that works.
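Because an environment definition takes the form of Dockerfile-style instructions layered on a base image, it can pin the exact dependency chain a project needs. The following fragment is a minimal, illustrative sketch; the library names, versions, and proxy address are hypothetical, not part of any Domino distribution:

```dockerfile
# Illustrative environment definition fragment: pin exact library
# versions so every future run sees the same dependency chain.
RUN pip install --no-cache-dir \
        numpy==1.16.4 \
        pandas==0.24.2 \
        tensorflow==1.12.0

# IT-managed settings, such as a corporate proxy, can live in the
# same definition (address shown is hypothetical).
ENV HTTPS_PROXY=http://proxy.example.com:3128
```

Pinning versions in the definition, rather than installing packages interactively, is what makes the "snapshot in time" behavior possible.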
The following figure shows the data science environment:
Figure 6. Flexible, extensible, and collaborative data science environments
Launchers and scheduled reports are easy to set up and can use scripts or an existing Jupyter notebook. With a launcher, you can quickly create a portal that lets subject matter experts in the business process select appropriate variables. Using the model API or hosting a web application requires more configuration but gives the data scientist the most flexibility in delivering predictions and insight.
During the evaluation of Domino Data Science Platform, the team at Dell EMC deployed several of our AI use case demos, which are powered by Flask web applications. Running the use cases on Domino Data Science Platform was simple and required minimal changes to the web application code. For more information about the changes that are necessary to make Flask applications work in Domino Data Science Platform, see this blog post.
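A minimal Flask prediction app shows the shape of such a deployment. This sketch is not one of the Dell EMC demos; the route, the stand-in model, and the host/port values are illustrative. Typically, the main hosting-related change is binding the server to the interface and port that the platform forwards, rather than Flask's localhost default.

```python
# Minimal Flask prediction app (illustrative; the model is a stand-in
# that sums the input features).
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    """Stand-in for a real model: the score is the sum of the inputs."""
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_route():
    features = request.get_json()["features"]
    return jsonify({"score": predict(features)})

if __name__ == "__main__":
    # Bind to all interfaces on a port the hosting platform can forward
    # (values shown are illustrative).
    app.run(host="0.0.0.0", port=8888)
```

Because the route handlers themselves are unchanged, the same application code runs locally and on the hosting platform.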
The ability to reproduce results is a hot topic in data science. As models move from experiments into production AI, they are used to make decisions that are subject to regulatory authority and scrutiny. For example, when evaluating applicants for an apartment rental, certain demographic information cannot be considered. Organizations looking to develop AI in this space must be able to show regulators the training data and the decision-making logic in their models. For this reason, Domino Data Science Platform includes several features that enable each job to be run again with the same code, the same executor environment, and the same result.
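Reproducing a result requires determinism at the code level as well as a pinned environment. The platform handles code revisions and environment images; the sketch below covers only the code-level piece, using a seeded random generator so that a re-run draws the same training sample. The function name and sizes are illustrative.

```python
# Code-level reproducibility: a fixed seed makes the "random" sample
# of training rows identical on every re-run. (Pinning the code
# revision and environment image is handled by the platform.)
import random

def sample_training_rows(n_rows, n_samples, seed=42):
    """Draw a repeatable sample of row indices from a dataset."""
    rng = random.Random(seed)  # isolated, seeded generator
    return rng.sample(range(n_rows), n_samples)

first = sample_training_rows(1000, 5)
second = sample_training_rows(1000, 5)
assert first == second  # identical seed, identical sample
```

Using an isolated `random.Random` instance, rather than the module-level functions, also keeps the experiment's determinism independent of any other code that touches the global generator.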
Domino Data Science Platform combines several features to ensure that each experiment is reproducible: