Build or Buy Data Science Solutions

Should You Build In-House or Buy Ready-Made Data Science Solutions?

Introduction

Astasia Myers, Introducing Redpoint’s ML Workflow Landscape (https://medium.com/memory-leak/introducing-redpoints-ml-workflow-landscape-312ca3c91b2f)

Questions To Consider

  • Is it a long-term or one-off data science project? If the project is intended to be long-term, acquiring an outsourced solution that can keep up with fast-evolving data technologies may be worthwhile; for a short-term project, it may be feasible to build something in-house on top of open-source frameworks.
  • Is it a large-scale or small-scale data science project? On a small-scale project, you can build in-house without much need for overarching cohesion or coordination. If the broader vision is to expand data science to augment a significant part of the organization’s activity, it is smarter to buy an off-the-shelf solution, because such a project will most likely require many different skills and processes to design and operationalize.
  • Is data science your core competency? This is probably the most important question. Does your organization have a team of data scientists and machine learning engineers dedicated to the project alone? Is your company willing to pull engineering time away from improving your product features, understanding your customers’ behavior, or building core infrastructure? If the answer is yes, then using the myriad open-source machine learning toolkits that exist and building and maintaining a solution in-house makes sense. If the answer is no, you are better off buying the data science solution that best fits your needs.
  • The first reason to consider buying rather than building is that building machine learning in-house takes considerable time, effort, and upfront investment. You need a team of data scientists and machine learning engineers, the right infrastructure, tons of data, and a lot of time to build a production-ready machine learning solution. In addition to the time spent actually building and optimizing the solution, there is also considerable time involved in hiring the right people and getting the budgets approved.
The hidden technical debt in machine learning systems (https://www.researchgate.net/figure/The-hidden-technical-debt-in-machine-learning-systems-adapted-from-6_fig5_318496497)
  • The second reason is that it’s challenging to maintain and pay off the technical debt related to machine learning. This famous paper from Google Research states that machine learning systems can easily incur technical debt because they have all of the maintenance problems of traditional code plus an additional set of machine-learning-specific issues. This debt may be difficult to detect because it exists at the system level rather than the code level.
  • The third reason is that it’s smart to leverage the shared infrastructure of a data science solution. For organizations without an in-house data science team, an outsourced data science platform can provide a full end-to-end solution that enables your engineering team to take advantage of machine learning with a simple API. For organizations with an in-house data science team, such a platform can allow the team to focus on the fun parts of machine learning (predictive modeling) without having to worry about the hassles of DevOps (maintaining infrastructure).

Build In-House

  • The backbone of your platform is an orchestration engine, which can optimize resource usage, predict need, handle queuing, and more. A common practice is to base the engine on virtual machines, on a platform such as Kubernetes, or on direct orchestration of your own hardware (a job-submission sketch follows this list).
  • For your data scientists to be able to reproduce experiments easily, you need to version and log the code, data, parameters, hardware and software environment, run logs, and more. A good starting point is Git for code versioning and Docker for the environment (see the run-snapshot sketch after this list).
  • You also want to sandbox experiments into projects and manage privileges for users and teams. You should define access controls: which resources each user can use, how jobs from different users are queued, which data is available to whom, and so on (a small access-check example appears below). This helps remove silos and enables transparency so that your data scientists can learn from each other.
  • You likely want to build a robust machine learning pipeline, where the workflow is split into stages, starting from exploration and continuing through batch processing, normalization, training, deployment, and many steps in between. You need support for handing the results of one step over to the next, plus triggers that automatically re-run parts of the pipeline when data in your data lake changes significantly (sketched in the pipeline example below).
  • Data scientists should be able to keep using the tools and frameworks they are used to, so there should be different ways of interacting with the infrastructure depending on the situation. A solid practice is to build everything on top of an open API that can then be accessed from a command line, a user interface, or even directly from a Jupyter notebook (see the client sketch below).
  • As data scientists work with models, they need to understand how individual steps are progressing, from training accuracies to hyper-parameter configurations. For this purpose, you need a way to present real-time training metrics meaningfully, in other words, a customized visualization library (a minimal logging-and-plotting sketch follows this list).
  • The final piece of your solution is model deployment. Here, you want your data scientists to be self-sufficient when deploying new models for the software teams that will integrate the predictive models into your business applications. A common solution is a Kubernetes cluster running Docker containers, which is the setup most widely used in the industry for inference deployment in the cloud (an inference-service sketch closes the examples below).
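
To make the orchestration point concrete, here is a minimal sketch that submits a containerized training run as a Kubernetes Job using the official kubernetes Python client. The image, namespace, resource requests, and command are hypothetical placeholders; a real engine would add queuing, resource prediction, and monitoring on top of calls like this.

```python
# Minimal sketch: submit a containerized training run as a Kubernetes Job.
# Assumes a reachable cluster and a valid kubeconfig; the image name,
# namespace, and command below are placeholders, not real artifacts.
from kubernetes import client, config

def submit_training_job(name, image, command):
    config.load_kube_config()  # or config.load_incluster_config() inside a pod

    container = client.V1Container(
        name=name,
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "4Gi"},
            limits={"nvidia.com/gpu": "1"},  # request a GPU if the cluster has them
        ),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
            backoff_limit=2,  # retry a failed training run up to twice
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="data-science", body=job)

if __name__ == "__main__":
    submit_training_job("churn-train-001",
                        "registry.example.com/churn:latest",
                        ["python", "train.py", "--epochs", "20"])
```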
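For the reproducibility bullet, here is a minimal sketch of what an experiment snapshot might capture, using only the standard library and the git CLI. Dedicated tracking tools (MLflow, DVC, or a platform's own tracker) would normally replace this hand-rolled JSON record; the parameter and dataset names are made up.

```python
# Minimal sketch: record what is needed to reproduce a training run.
import json, platform, subprocess, sys, time
from pathlib import Path

def snapshot_run(params, data_version, out_dir="runs"):
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": commit,             # exact code version
        "data_version": data_version,     # e.g. a dataset hash or snapshot tag
        "params": params,                 # hyper-parameters for this run
        "python": sys.version,
        "platform": platform.platform(),  # OS / hardware hint
        "packages": subprocess.run(       # frozen environment, pip-style
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True
        ).stdout.splitlines(),
    }
    path = Path(out_dir) / f"run-{int(time.time())}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path

# Example: snapshot_run({"lr": 3e-4, "epochs": 20}, data_version="sales-2020-06")
```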
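For project sandboxing and privileges, a toy sketch of an access check: which users belong to a project, what GPU quota the project has, and which datasets its members may see. The project, user, and dataset names are invented; a real system would plug into your identity provider.

```python
# Minimal sketch of project-level access control.
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    members: set = field(default_factory=set)   # users allowed in the sandbox
    gpu_quota: int = 0                           # max concurrent GPUs
    datasets: set = field(default_factory=set)   # data visible to members

def can_run(project, user, gpus_requested, running):
    """Allow a job only for project members within the project's GPU quota."""
    return user in project.members and running + gpus_requested <= project.gpu_quota

churn = Project("churn-model", members={"alice", "bob"},
                gpu_quota=4, datasets={"s3://lake/churn/2020"})

print(can_run(churn, "alice", gpus_requested=2, running=1))  # True
print(can_run(churn, "eve", gpus_requested=1, running=0))    # False: not a member
```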
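For the pipeline bullet, a minimal sketch of staged execution with hand-over between steps and a data-change trigger based on a fingerprint of the raw files. The stages are stubs standing in for real batch processing, normalization, and training code.

```python
# Minimal sketch of a staged pipeline: each stage takes the previous stage's
# output, and the pipeline re-runs when the data lake's fingerprint changes.
import hashlib
from pathlib import Path
from typing import Optional

def ingest(raw_dir):
    return sorted(Path(raw_dir).glob("*.csv"))              # exploration / ingestion

def normalize(files):
    return [f"normalized::{f.name}" for f in files]         # batch processing stub

def train(normalized):
    return {"model": "stub", "n_inputs": len(normalized)}   # training stub

def deploy(model):
    print(f"deploying model trained on {model['n_inputs']} inputs")

def data_fingerprint(raw_dir):
    """Hash file names and sizes; a significant data change changes the hash."""
    h = hashlib.sha256()
    for f in sorted(Path(raw_dir).glob("*.csv")):
        h.update(f.name.encode())
        h.update(str(f.stat().st_size).encode())
    return h.hexdigest()

def run_pipeline(raw_dir, last_fingerprint: Optional[str] = None):
    current = data_fingerprint(raw_dir)
    if current != last_fingerprint:                 # trigger: only re-run on change
        deploy(train(normalize(ingest(raw_dir))))   # hand results stage to stage
    return current
```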
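For the open-API bullet, a small client sketch showing how the same HTTP endpoints could be called from a CLI wrapper or directly from a notebook cell. The base URL, token handling, and endpoint paths are hypothetical, not a real service.

```python
# Minimal sketch of calling a platform's open API from any interface.
import requests

BASE_URL = "https://ml-platform.internal/api/v1"   # placeholder
TOKEN = "..."                                       # fetched from your auth system

def submit_experiment(project, params):
    """Queue a training run and return its run id."""
    resp = requests.post(
        f"{BASE_URL}/projects/{project}/runs",
        json={"params": params},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def get_metrics(project, run_id):
    resp = requests.get(
        f"{BASE_URL}/projects/{project}/runs/{run_id}/metrics",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# In a notebook cell: run_id = submit_experiment("churn-model", {"lr": 3e-4})
```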
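For real-time training visibility, a minimal sketch that appends per-epoch metrics to a log file and plots them with matplotlib; the training loop is a stand-in that emits fake accuracy numbers.

```python
# Minimal sketch of surfacing training progress: append per-epoch metrics to a
# file that a dashboard (or a notebook cell) can poll and plot.
import json, random, time
from pathlib import Path
import matplotlib.pyplot as plt

METRICS_FILE = Path("metrics.jsonl")

def log_metric(epoch, name, value):
    with METRICS_FILE.open("a") as f:
        f.write(json.dumps({"epoch": epoch, "name": name, "value": value}) + "\n")

def plot_metric(name):
    """Re-read the log and plot one metric; call repeatedly to 'tail' a run."""
    rows = [json.loads(line) for line in METRICS_FILE.read_text().splitlines()]
    points = [(r["epoch"], r["value"]) for r in rows if r["name"] == name]
    plt.plot([p[0] for p in points], [p[1] for p in points], marker="o")
    plt.xlabel("epoch"); plt.ylabel(name); plt.title(f"{name} over training")
    plt.show()

# Stand-in training loop emitting metrics:
for epoch in range(5):
    log_metric(epoch, "val_accuracy", 0.6 + 0.07 * epoch + random.uniform(-0.01, 0.01))
    time.sleep(0.1)

plot_metric("val_accuracy")
```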
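Finally, for deployment, a minimal sketch of an inference service built with Flask that could be packaged into a Docker image and run on a Kubernetes cluster. The model here is a stub; in practice you would load a serialized model artifact at startup and run the app behind a production WSGI server.

```python
# Minimal sketch of an inference service suitable for containerized deployment.
from flask import Flask, jsonify, request

app = Flask(__name__)

def load_model():
    """Stand-in for loading a trained model artifact (e.g. from object storage)."""
    return lambda features: sum(features) / max(len(features), 1)

model = load_model()

@app.route("/healthz")
def healthz():
    # Kubernetes liveness/readiness probes can hit this endpoint.
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json(force=True)["features"]
    return jsonify(prediction=model(features))

if __name__ == "__main__":
    # In a container you would typically run this behind gunicorn instead.
    app.run(host="0.0.0.0", port=8080)
```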

Buy Ready-Made

  • A ready-made platform such as Saturn puts data scientists first by removing the time sink of infrastructure management and by standardizing environments across the entire team.
  • It enables reproducibility by automating version control so that data scientists never forget to commit.
  • For enterprise organizations, Saturn reduces costs by maximizing the efficiency of your data science teams and hardware resources. Teams can set cloud cost quotas for individuals or teams, along with alerts and spending limits, to keep resources and budget under control.

Conclusion