Build or Buy Data Science Solutions
Should You Build In-House or Buy Ready-Made Data Science Solutions?
Most executives know that machine learning has the power to change almost everything about the way they do business. But what many business leaders don’t know is how to deploy machine learning models, not just in a pilot here or there, but throughout the organization, where they can create maximum value. If you’re considering machine learning for your business, it’s time to scale up or give up.
Unfortunately, making an informed decision is difficult because there are so many ways to do data science in the enterprise. The traditional route is to build in-house by hiring data experts and using open-source software. However, hiring data experts is expensive, and the best ones are usually unavailable (most are taken by the Googles and Facebooks of the world). Using open-source software is critical, but keeping up with its rapid pace of innovation is difficult for enterprise-sized organizations, where only a small number of people are proficient enough to use the tools.
That’s why the space of data science and machine learning platforms has been exploding in recent years. These platforms allow teams to start analyzing data pretty much from day one and promise large gains in productivity and efficiency. I find this ecosystem as fascinating as ever, as evidenced by the image below from Redpoint Ventures.
Questions To Consider
Here are the key questions to consider when deciding between building in-house and buying an outsourced data science solution:
- Is it a long-term or one-off data science project? If the project is intended to be long-term, it may make sense to acquire an outsourced solution that can accommodate fast-evolving data technologies. For a short-term project, it may be possible to build something in-house based on open-source frameworks.
- Is it a large-scale or small-scale data science project? On small-scale data science projects, you can build in-house with little overarching cohesion or coherence. If the broader vision is to expand data science to augment a significant part of the organization’s activity, it would be smart to buy an off-the-shelf solution, because designing and operationalizing projects at that scale most likely requires many different skills and processes.
- Is data science your core competency? This is probably the most important question. Does your organization have a team of data scientists and machine learning engineers dedicated to the project alone? Is your company willing to dedicate engineering time away from improving your product features, understanding your consumers’ behavior, or building core infrastructure? If the answer is yes, then using the many open-source machine learning toolkits that exist and building and maintaining a solution in-house makes sense. If the answer is no, then you are better off buying the data science solution that best fits your needs.
There are three reasons for this.
- The first reason is that building machine learning in-house takes considerable time, effort, and upfront investment. You need a team of data scientists and machine learning engineers, the right infrastructure, tons of data, and a lot of time to build a production-ready machine learning solution. In addition to the time spent actually building and optimizing the machine learning solution, there’s obviously a considerable amount of time involved in hiring the right people and getting the budgets approved.
- The second reason is that it’s challenging to maintain and pay off the technical debt related to machine learning. This famous paper from Google Research states that machine learning systems can easily incur technical debt because they have all of the maintenance problems of traditional code plus an additional set of machine-learning-specific issues. This debt may be difficult to detect because it exists at the system level rather than the code level.
- The third reason is that it’s smart to leverage the shared infrastructure of a data science solution. For organizations without an in-house data science team, an outsourced data science platform can provide a full end-to-end solution that enables your engineering team to take advantage of machine learning with a simple API. For organizations with an in-house data science team, such a platform can allow the team to focus on the fun parts of machine learning (predictive modeling) without having to worry about the hassles of DevOps (maintaining infrastructure).
Next, I’ll look at the parts a data science platform consists of and compare building your own solution in-house to buying a ready-made service that does everything for you. Again, the end goal in both buying and building a solution is that your data scientists spend as much of their time as possible on understanding their data, building models, and moving them to production, and as little time as possible on infrastructure and boilerplate code.
To build an in-house solution, here are the key components of the data science workflow that your organization needs to be aware of:
- The backbone of your platform is an orchestration engine, which can optimize resource usage, predict need, handle queuing and more. A simple practice is to base your engine on virtual machines, to use a platform such as Kubernetes, or to orchestrate your own hardware directly.
- For your data scientists to be able to reproduce experiments easily, you need to version-control and record the data, parameters, hardware and software environment, run logs and more. A good starting point for this is using Git for code versioning and Docker for the environment.
- You also want to sandbox experiments into projects and manage privileges for users and teams. You should define access control, like which resources users are able to use, how to queue jobs by different users, which data is available to whom and more. This would help remove any silos and enable transparency so that your data scientists can learn from each other.
- You likely want to build a robust machine learning pipeline, where the workflow can be split up into different stages, starting from exploration and continuing with batch processing, normalization, training, deployment and many other steps in between. You need to build in support for handing over the results of one step to the next. You should also add triggers that automatically re-run parts of the pipeline when data in your data lake changes significantly.
- Data scientists should be able to continue using the tools and frameworks they are used to, so there should be different ways of interacting with the infrastructure depending on the situation. A solid practice is to build everything on top of an open API that can then be accessed from a command line/user interface, or even directly from a Jupyter notebook.
- As data scientists are working with the models, they need to understand how the individual steps are progressing — from training accuracies to hyper-parameter configurations. For this purpose, you need to build a way to show real-time training data in a meaningful way, in other words, a customized visualization library.
- The final piece of your solution is model deployment. Here, you want to make sure your data scientists are self-sufficient when it comes to deploying new models for the software teams that will be integrating the predictive models into your business applications. A good solution is to serve models from Docker containers on Kubernetes clusters, a combination that is widely used in the industry for inference deployment in the cloud.
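To make the reproducibility item in the list above more concrete, here is a minimal sketch of the kind of record an in-house platform could store for every experiment. It assumes that code lives in Git and that the training data is a single local file. The function names (`current_git_commit`, `file_sha256`, `build_run_record`) and the record fields are hypothetical, chosen only for illustration, and a real platform would capture far more (for example, the Docker image in use).

```python
import hashlib
import platform
import subprocess
import sys
from pathlib import Path
from typing import Any, Dict, Optional


def current_git_commit() -> Optional[str]:
    """Return the current commit hash, or None if Git or a repository is missing."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def file_sha256(path: Path) -> str:
    """Hash a data file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(params: Dict[str, Any], data_path: Path) -> Dict[str, Any]:
    """Collect what is needed to re-run an experiment later (hypothetical schema)."""
    return {
        "git_commit": current_git_commit(),      # which code was used
        "data_sha256": file_sha256(data_path),   # which data was used
        "params": params,                        # which hyper-parameters were used
        "python": sys.version.split()[0],        # software environment
        "platform": platform.platform(),         # operating system and hardware hints
    }
```

Storing one such record next to each trained model means two runs can be compared field by field when their results disagree.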
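The pipeline item in the list above (stages that hand results to one another, plus triggers that re-run parts of the pipeline when data changes) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: the `Stage` and `Pipeline` classes and the `normalize` and `train` functions are hypothetical, data is assumed to be JSON-serializable, and the trigger fires on any change to a stage's input rather than only on a "significant" one.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


def fingerprint(value: Any) -> str:
    """Stable hash of a JSON-serializable value, used to detect changed inputs."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]  # receives the previous stage's output


@dataclass
class Pipeline:
    stages: List[Stage]
    # stage name -> (fingerprint of the input it last saw, output it produced)
    _cache: Dict[str, Tuple[str, Any]] = field(
        default_factory=dict, init=False, repr=False
    )
    # names of the stages that actually executed during the last call to run()
    last_executed: List[str] = field(default_factory=list, init=False)

    def run(self, data: Any) -> Any:
        self.last_executed = []
        current = data
        for stage in self.stages:
            key = fingerprint(current)
            cached = self._cache.get(stage.name)
            if cached is not None and cached[0] == key:
                current = cached[1]  # input unchanged: reuse the stored result
            else:
                current = stage.run(current)  # input changed: re-run this stage
                self._cache[stage.name] = (key, current)
                self.last_executed.append(stage.name)
        return current


def normalize(values: List[float]) -> List[float]:
    """Scale values so that the largest one becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


def train(values: List[float]) -> Dict[str, float]:
    """Stand-in for model training: the 'model' is just the mean."""
    return {"mean": sum(values) / len(values)}


if __name__ == "__main__":
    pipeline = Pipeline([Stage("normalize", normalize), Stage("train", train)])
    pipeline.run([2, 4, 8])
    print(pipeline.last_executed)  # both stages run on the first call
    pipeline.run([2, 4, 8])
    print(pipeline.last_executed)  # nothing re-runs when the data is unchanged
```

Because each stage is keyed on the fingerprint of its own input, a change in the raw data only re-runs the downstream stages whose inputs actually differ. For instance, feeding `[4, 8, 16]` after `[2, 4, 8]` re-runs `normalize` but not `train`, since the normalized values are identical.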
“When I purchase a platform I am responsible for integrating the solution into our products. When we build the solution ourselves we have to own the full stack. We get the burden of building and maintaining the solution (technical debt, library compatibility, compute and storage resources etc). When you look at the total cost of ownership and associated headaches the decision quickly swings in favor of buying a solution.”
- Matt Knoop, Director of Integration and Analytics @ AP Intego Insurance
Understandably, enterprise organizations are reluctant to commit to data science platforms that might lock them into a certain system or specific tools. The key here is to choose a platform that is not only built to handle the various parts of the data science workflow but is also flexible and open to integrating with other technologies.
Saturn Cloud is a data science platform that makes it simple and fast to run machine learning workloads of any scale and complexity on any infrastructure.
- It puts data scientists first by removing the time sink of infrastructure management and standardizing environments across the entire team.
- It enables reproducibility by automating version control so that data scientists never forget to commit.
- For enterprise organizations, Saturn helps control costs by maximizing the efficiency of your data science teams and hardware resources. Teams can specify cloud cost quotas for individuals or teams, as well as alerts and spending limits, to ensure that resources and budget stay under control.
“When I was at Anaconda, we were very aware of how challenging it can be to manage data science projects while also maintaining the underlying software stack, which are two fundamentally different things. One of the reasons I’ve started Saturn Cloud is to be able to provide data scientists with next-level tools that are unlikely to be built inside their organization. Every day we strive to make a data scientist’s life easier.”
- Hugo Shi, Founder of Saturn Cloud
I hope that this article helps you understand the benefits of adopting data science solutions within enterprise organizations, which allow for the scalability, flexibility, and control of the technology. In detail, these solutions provide a framework for: (1) collaboration as a way for non-technical folks to contribute to data projects along with data scientists and data engineers, (2) data governance as a way for team leaders to monitor the machine learning workflows, (3) efficiency as a way to save time throughout the data-to-insights process, (4) automation as a potential way to automate certain parts of the data pipeline to alleviate inefficiencies, and (5) operationalization as a way to deploy data projects into production quickly and safely.
In their most basic form, data science solutions enable people within an enterprise organization to (1) use the data to produce machine learning solutions, (2) scale their products by providing transparency and reproducibility throughout the team and within a project, and (3) access all the data and collaborate on data projects in a central hub. Ultimately, data science solutions save organizations valuable time in all parts of the machine learning process and ease the burden of getting started in data science while providing a framework that allows the organizations to learn as they go. Your organization can thus focus on your core competencies while leveraging all of the best things data science can let you do.
If you would like to follow my work on Recommendation Systems, Deep Learning, and Data Science Journalism, you can check out my Medium and GitHub, as well as other projects at https://jameskle.com/.