Moving From Prototype to Production in R

a look inside the machine learning infrastructure at Netflix
Bryan Galvin | useR! 2018

This talk

  • after the prototype

  • an ideal machine learning system

  • introducing Metaflow

  • some lessons learned

30% use R as their primary data analysis tool

  • RStudio Server (open source) running in containers

  • RStudio Connect cluster


A/B Testing
  • highly optimized Rcpp for analyzing test results
  • models simulate counterfactuals to estimate treatment effects

    Team:

         - dedicated engineers and UX designers building the test platform

         - statisticians building the analytics engine
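The counterfactual idea can be illustrated with a toy randomized test in Python — a hypothetical sketch with made-up numbers, not Netflix's analytics engine. Users are randomized into treatment or control; the arm each user is *not* observed under is the counterfactual the models must estimate, and under randomization a simple difference in means recovers the average treatment effect:

```python
import random

random.seed(7)
TRUE_EFFECT = 2.0  # assumed ground-truth lift, known only because we simulate

# Randomize users into treatment or control and observe one outcome each;
# the unobserved arm is the counterfactual being estimated.
treated, control = [], []
for _ in range(200_000):
    baseline = random.gauss(10.0, 3.0)
    if random.random() < 0.5:
        treated.append(baseline + TRUE_EFFECT)
    else:
        control.append(baseline)

# Difference in means recovers the average treatment effect under randomization.
ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
```

With 200,000 simulated users, `ate_hat` lands very close to the assumed true effect of 2.0.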

Prototyping Machine Learning Models

Now What?
Sources of Friction

Scaling beyond a laptop

     -   specialized knowledge required to manage cloud infrastructure and parallelize R


Reproducing results

     -   model & data versioning


Going to "production"

     -   model hosting?

     -   job scheduling?

What would an ideal machine learning system for R look like?

Low overhead

  • R with no limitations




Productive

  • easy to debug at scale

  • resume computation anywhere


Reproducible

  • "git for data"

  • recreate any past run easily

  • enable sharing and collaboration


Scalable

  • flexibility to select compute required

  • same user experience on a laptop or on thousands of servers


Production ready

  • eliminate barriers to model scheduling and hosting












Metaflow







A microframework written in Python for making data scientists at Netflix more productive by enabling them to deliver on their own.





Metaflow at 10,000 ft


A flow is a directed graph of operations:
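The idea can be sketched in plain Python — a toy runner, not the actual Metaflow API: each step names its successor, and a driver walks the graph from a start step to an end step, threading state through:

```python
# Toy flow-as-directed-graph sketch (illustrative, not the real Metaflow API):
# each step transforms shared state and names the next step to run.
def start(state):
    state["data"] = list(range(5))
    return "transform"

def transform(state):
    state["data"] = [x * 2 for x in state["data"]]
    return "end"

def end(state):
    state["total"] = sum(state["data"])
    return None                      # no successor: the flow is done

STEPS = {"start": start, "transform": transform, "end": end}

def run_flow():
    state, step = {}, "start"
    while step is not None:          # walk the graph from start to end
        step = STEPS[step](state)
    return state

print(run_flow()["total"])  # prints 20
```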





Each step in the flow is checkpointed

    resume and debug without rerunning the entire graph
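A toy checkpointing sketch (not Metaflow's internals) shows why this matters: each step's output is persisted after it runs, so a resumed run loads completed steps from disk instead of re-executing them:

```python
import json
import os
import tempfile

# Toy checkpointing sketch: persist each step's output; on resume,
# load completed steps from disk instead of re-executing them.
def run(steps, ckpt_dir, executed):
    data = None
    for name, fn in steps:
        path = os.path.join(ckpt_dir, name + ".json")
        if os.path.exists(path):          # checkpoint hit: skip the step
            with open(path) as f:
                data = json.load(f)
            continue
        data = fn(data)
        executed.append(name)             # record that the step really ran
        with open(path, "w") as f:
            json.dump(data, f)
    return data

steps = [("load",   lambda _: [1, 2, 3]),
         ("double", lambda d: [x * 2 for x in d]),
         ("total",  lambda d: sum(d))]

with tempfile.TemporaryDirectory() as ckpt:
    first, second = [], []
    run(steps, ckpt, first)              # cold run: every step executes
    result = run(steps, ckpt, second)    # "resume": every checkpoint hits
```

On the second run nothing executes — the whole graph is reconstructed from checkpoints.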




Data is persisted to S3 at every step

    easily compare objects across model runs
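A toy artifact store conveys the idea — keying every persisted object by run and step (the store, names, and metric values below are all made up for illustration) lets you pull the "same" object out of two runs and diff them:

```python
# Toy artifact store: persist every step's objects keyed by (run, step, name)
# so later runs can be compared object-by-object.
store = {}

def persist(run_id, step, **artifacts):
    for name, value in artifacts.items():
        store[(run_id, step, name)] = value

def load(run_id, step, name):
    return store[(run_id, step, name)]

# Two hypothetical model runs persisting a metric at the "train" step.
persist("run_1", "train", auc=0.81)
persist("run_2", "train", auc=0.84)

improvement = load("run_2", "train", "auc") - load("run_1", "train", "auc")
```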


Allows for different graph structures
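One such structure is fan-out/fan-in: a step splits over a list of parameters, the branches run independently, and a join step merges their results. A hypothetical stand-in (the "training" and scores below are invented):

```python
# Toy fan-out/fan-in graph structure: one branch per parameter value,
# branches run independently, a join step merges their results.
def split(_):
    return [0.1, 0.5, 1.0]                      # e.g. candidate learning rates

def branch(lr):
    return {"lr": lr, "score": 1 - lr / 2}      # pretend "score" from training

def join(results):
    return max(results, key=lambda r: r["score"])

best = join([branch(lr) for lr in split(None)])
```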


Vertical scalability


Horizontal scalability





Sharing & Collaboration



Production deployments




Model Hosting




Steps as R functions
  • packages loaded within function
  • persisted data assigned to self






By default, a flow runs locally, and everything assigned to `self` is persisted to S3.







Steps can selectively be run in containers with chosen resources (CPU, GPU, memory, disk, and network are all configurable).


*Titus is a recently open sourced container platform from Netflix
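The per-step resource idea can be sketched as a decorator — a hypothetical stand-in, not the real Metaflow or Titus API: it attaches a resource spec that a scheduler could use to place the step in an appropriately sized container, while the step body runs unchanged:

```python
# Hypothetical sketch of per-step resource requests (not a real API):
# the decorator records a resource spec for a scheduler to consume.
def resources(cpu=1, gpu=0, memory_gb=4):
    def wrap(fn):
        fn.resources = {"cpu": cpu, "gpu": gpu, "memory_gb": memory_gb}
        return fn
    return wrap

@resources(cpu=16, memory_gb=64)
def train_step(data):
    return sum(data)                 # the step body itself is unchanged

print(train_step.resources)  # {'cpu': 16, 'gpu': 0, 'memory_gb': 64}
```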







Flows can be scheduled to run at regular intervals or based on data triggers, such as a table updating.


*Meson is an internal Netflix workflow orchestration and scheduling framework
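A data trigger can be sketched as polling a table's version and launching the flow only when it has advanced — a hypothetical illustration of the concept, since Meson itself is internal to Netflix:

```python
# Hypothetical data-trigger sketch: launch the flow only when the watched
# table's version has advanced since the last run.
class Trigger:
    def __init__(self):
        self.last_seen = 0
        self.runs = []

    def poll(self, table_version):
        if table_version > self.last_seen:   # table updated -> launch flow
            self.last_seen = table_version
            self.runs.append(table_version)  # stand-in for starting a run

trigger = Trigger()
for version in [1, 1, 2, 2, 3]:              # repeated polls, some no-ops
    trigger.poll(version)
```

Only the three version changes launch runs; the repeated polls are no-ops.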

Consuming Results



bcgalvin@gmail.com

quiltdata.com
pachyderm.io
databricks.com/mlflow
ropensci.github.io/drake/

slides: bryangalvin.com/useR-2018/