程序代写代做代考 flex android database cache distributed system file system cuda compiler python Excel c++ x86 Hive algorithm GPU deep learning scheme This paper is included in the Proceedings of the

This paper is included in the Proceedings of the
12th USENIX Symposium on Operating Systems Design

and Implementation (OSDI ’16).
November 2–4, 2016 • Savannah, GA, USA

ISBN 978-1-931971-33-1

Open access to the Proceedings of the
12th USENIX Symposium on Operating Systems

Design and Implementation
is sponsored by USENIX.

TensorFlow: A System for Large-Scale
Machine Learning

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,

Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

TensorFlow: A system for large-scale machine learning

Martı́n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,

Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,

Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,

Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Google Brain

Abstract

TensorFlow is a machine learning system that operates at

large scale and in heterogeneous environments. Tensor-

Flow uses dataflow graphs to represent computation,

shared state, and the operations that mutate that state. It

maps the nodes of a dataflow graph across many machines

in a cluster, and within a machine across multiple com-

putational devices, including multicore CPUs, general-

purpose GPUs, and custom-designed ASICs known as

Tensor Processing Units (TPUs). This architecture gives

flexibility to the application developer: whereas in previ-

ous “parameter server” designs the management of shared

state is built into the system, TensorFlow enables develop-

ers to experiment with novel optimizations and training al-

gorithms. TensorFlow supports a variety of applications,

with a focus on training and inference on deep neural net-

works. Several Google services use TensorFlow in pro-

duction, we have released it as an open-source project, and

it has become widely used for machine learning research.

In this paper, we describe the TensorFlow dataflow model

and demonstrate the compelling performance that Tensor-

Flow achieves for several real-world applications.

1 Introduction

In recent years, machine learning has driven advances in

many different fields [3, 5, 24, 25, 29, 31, 42, 47, 50,

52, 57, 67, 68, 72, 76]. We attribute this success to the

invention of more sophisticated machine learning mod-

els [44, 54], the availability of large datasets for tack-

ling problems in these fields [9, 64], and the develop-

ment of software platforms that enable the easy use of

large amounts of computational resources for training

such models on these large datasets [14, 20].

We have developed the TensorFlow system for ex-

perimenting with new models, training them on large

datasets, and moving them into production. We have

based TensorFlow on many years of experience with our

first-generation system, DistBelief [20], both simplify-

ing and generalizing it to enable researchers to explore

a wider variety of ideas with relative ease. TensorFlow

supports both large-scale training and inference: it effi-

ciently uses hundreds of powerful (GPU-enabled) servers

for fast training, and it runs trained models for inference in

production on various platforms, ranging from large dis-

tributed clusters in a datacenter, down to running locally

on mobile devices. At the same time, it is flexible enough

to support experimentation and research into new machine

learning models and system-level optimizations.

TensorFlow uses a unified dataflow graph to repre-

sent both the computation in an algorithm and the state

on which the algorithm operates. We draw inspiration

from the high-level programming models of dataflow sys-

tems [2, 21, 34] and the low-level efficiency of parame-

ter servers [14, 20, 49]. Unlike traditional dataflow sys-

tems, in which graph vertices represent functional compu-

tation on immutable data, TensorFlow allows vertices to

represent computations that own or update mutable state.

Edges carry tensors (multi-dimensional arrays) between

nodes, and TensorFlow transparently inserts the appropri-

ate communication between distributed subcomputations.

By unifying the computation and state management in a

single programming model, TensorFlow allows program-

mers to experiment with different parallelization schemes

that, for example, offload computation onto the servers

that hold the shared state to reduce the amount of network

traffic. We have also built various coordination protocols,

and achieved encouraging results with synchronous repli-

cation, echoing recent results [10, 18] that contradict the

commonly held belief that asynchronous replication is re-

quired for scalable learning [14, 20, 49].

Over the past year, more than 150 teams at Google have

used TensorFlow, and we have released the system as an

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 265

open-source project.1 Thanks to our large community of

users we have gained experience with many different ma-

chine learning applications. In this paper, we focus on

neural network training as a challenging systems problem,

and select two representative applications from this space:

image classification and language modeling. These ap-

plications stress computational throughput and aggregate

model size respectively, and we use them both to demon-

strate the extensibility of TensorFlow, and to evaluate the

efficiency and scalability of our present implementation.

2 Background & motivation

We begin by describing the limitations of our previous

system (§2.1) and outlining the design principles that we
used in the development of TensorFlow (§2.2).

2.1 Previous system: DistBelief

TensorFlow is the successor to DistBelief, which is

the distributed system for training neural networks that

Google has used since 2011 [20]. DistBelief uses the pa-

rameter server architecture, and here we criticize its lim-

itations, but other systems based on this architecture have

addressed these limitations in other ways [11, 14, 49]; we

discuss those systems in Subsection 2.3.

In the parameter server architecture, a job comprises

two disjoint sets of processes: stateless worker processes

that perform the bulk of the computation when training a

model, and stateful parameter server processes that main-

tain the current version of the model parameters. Dist-

Belief’s programming model is similar to Caffe’s [38]: the

user defines a neural network as a directed acyclic graph

of layers that terminates with a loss function. A layer is

a composition of mathematical operators: for example, a

fully connected layer multiplies its input by a weight ma-

trix, adds a bias vector, and applies a non-linear function

(such as a sigmoid) to the result. A loss function is a scalar

function that quantifies the difference between the pre-

dicted value (for a given input data point) and the ground

truth. In a fully connected layer, the weight matrix and

bias vector are parameters, which a learning algorithm

will update in order to minimize the value of the loss func-

tion. DistBelief uses the DAG structure and knowledge

of the layers’ semantics to compute gradients for each

of the model parameters, via backpropagation [63]. Be-

cause the parameter updates in many algorithms are com-

mutative and have weak consistency requirements [61],

the worker processes can compute updates independently

1Software available from https://tensorflow.org.

and write back “delta” updates to each parameter server,

which combines the updates with its current state.

Although DistBelief has enabled many Google prod-

ucts to use deep neural networks and formed the basis of

many machine learning research projects, we soon began

to feel its limitations. Its Python-based scripting interface

for composing pre-defined layers was adequate for users

with simple requirements, but our more advanced users

sought three further kinds of flexibility:

Defining new layers For efficiency, we implemented

DistBelief layers as C++ classes. Using a separate, less

familiar programming language for implementing layers

is a barrier for machine learning researchers who seek to

experiment with new layer architectures, such as sampled

softmax classifiers [37] and attention modules [53].

Refining the training algorithms Many neural net-

works are trained using stochastic gradient descent

(SGD), which iteratively refines the parameters of the net-

work by moving them in the direction that maximally de-

creases the value of the loss function. Several refinements

to SGD accelerate convergence by changing the update

rule [23, 66]. Researchers often want to experiment with

new optimization methods, but doing that in DistBelief

involves modifying the parameter server implementation.

Moreover, the get() and put() interface for the pa-

rameter server is not ideal for all optimization methods:

sometimes a set of related parameters must be updated

atomically, and in many cases it would be more efficient

to offload computation onto the parameter server, and

thereby reduce the amount of network traffic.

Defining new training algorithms DistBelief workers

follow a fixed execution pattern: read a batch of input data

and the current parameter values, compute the loss func-

tion (a forward pass through the network), compute gra-

dients for each of the parameter (a backward pass), and

write the gradients back to the parameter server. This pat-

tern works for training simple feed-forward neural net-

works, but fails for more advanced models, such as recur-

rent neural networks, which contain loops [39]; adversar-

ial networks, in which two related networks are trained al-

ternately [26]; and reinforcement learning models, where

the loss function is computed by some agent in a separate

system, such as a video game emulator [54]. Moreover,

there are many other machine learning algorithms—such

as expectation maximization, decision forest training, and

latent Dirichlet allocation—that do not fit the same mold

as neural network training, but could also benefit from a

common, well-optimized distributed runtime.

In addition, we designed DistBelief with a single plat-

form in mind: a large distributed cluster of multicore

266 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

https://tensorflow.org

# 1. Construct a graph representing the model.

x = tf.placeholder(tf.float32, [BATCH_SIZE, 784]) # Placeholder for input.

y = tf.placeholder(tf.float32, [BATCH_SIZE, 10]) # Placeholder for labels.

W_1 = tf.Variable(tf.random_uniform([784, 100])) # 784×100 weight matrix.

b_1 = tf.Variable(tf.zeros([100])) # 100-element bias vector.

layer_1 = tf.nn.relu(tf.matmul(x, W_1) + b_2) # Output of hidden layer.

W_2 = tf.Variable(tf.random_uniform([100, 10])) # 100×10 weight matrix.

b_2 = tf.Variable(tf.zeros([10])) # 10-element bias vector.

layer_2 = tf.matmul(layer_1, W_2) + b_2 # Output of linear layer.

# 2. Add nodes that represent the optimization algorithm.

loss = tf.nn.softmax_cross_entropy_with_logits(layer_2, y)

train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)

# 3. Execute the graph on batches of input data.

with tf.Session() as sess: # Connect to the TF runtime.

sess.run(tf.initialize_all_variables()) # Randomly initialize weights.

for step in range(NUM_STEPS): # Train iteratively for NUM_STEPS.

x_data, y_data = … # Load one batch of input data.

sess.run(train_op, {x: x_data, y: y_data}) # Perform one training step.

Figure 1: An image classifier written using TensorFlow’s Python API. This program is a simple solution to the MNIST

digit classification problem [48], with 784-pixel images and 10 output classes.

servers [20]. We were able to add support for GPU ac-

celeration, when it became clear that this acceleration

would be crucial for executing convolutional kernels effi-

ciently [44], but DistBelief remains a heavyweight system

that is geared for training deep neural networks on huge

datasets, and is difficult to scale down to other environ-

ments. In particular, many users want to hone their model

locally on a GPU-powered workstation, before scaling the

same code to train on a much larger dataset. After train-

ing a model on a cluster, the next step is to push the

model into production, which might involve integrating

the model into an online service, or deploying it onto a

mobile device for offline execution. Each of these tasks

has some common computational structure, but our col-

leagues found it necessary to use or create separate sys-

tems that satisfy the different performance and resource

requirements of each platform. TensorFlow provides a

single programming model and runtime system for all of

these environments.

2.2 Design principles

We designed TensorFlow to be much more flexible than

DistBelief, while retaining its ability to satisfy the de-

mands of Google’s production machine learning work-

loads. TensorFlow provides a simple dataflow-based pro-

gramming abstraction that allows users to deploy appli-

cations on distributed clusters, local workstations, mo-

bile devices, and custom-designed accelerators. A high-

level scripting interface (Figure 1) wraps the construction

of dataflow graphs and enables users to experiment with

different model architectures and optimization algorithms

without modifying the core system. In this subsection, we

briefly highlight TensorFlow’s core design principles:

Dataflow graphs of primitive operators Both Tensor-

Flow and DistBelief use a dataflow representation for their

models, but the most striking difference is that a Dist-

Belief model comprises relatively few complex “layers”,

whereas the corresponding TensorFlow model represents

individual mathematical operators (such as matrix mul-

tiplication, convolution, etc.) as nodes in the dataflow

graph. This approach makes it easier for users to com-

pose novel layers using a high-level scripting interface.

Many optimization algorithms require each layer to have

defined gradients, and building layers out of simple oper-

ators makes it easy to differentiate these models automat-

ically (§4.1). In addition to the functional operators, we
represent mutable state, and the operations that update it,

as nodes in the dataflow graph, thus enabling experimen-

tation with different update rules.

Deferred execution A typical TensorFlow application

has two distinct phases: the first phase defines the pro-

gram (e.g., a neural network to be trained and the update

rules) as a symbolic dataflow graph with placeholders for

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 267

the input data and variables that represent the state; and

the second phase executes an optimized version of the

program on the set of available devices. By deferring the

execution until the entire program is available, Tensor-

Flow can optimize the execution phase by using global

information about the computation. For example, Tensor-

Flow achieves high GPU utilization by using the graph’s

dependency structure to issue a sequence of kernels to the

GPU without waiting for intermediate results. While this

design choice makes execution more efficient, we have

had to push more complex features—such as dynamic

control flow (§3.4)—into the dataflow graph, so that mod-
els using these features enjoy the same optimizations.

Common abstraction for heterogeneous accelerators

In addition to general-purpose devices such as multicore

CPUs and GPUs, special-purpose accelerators for deep

learning can achieve significant performance improve-

ments and power savings. At Google, our colleagues

have built the Tensor Processing Unit (TPU) specifically

for machine learning; TPUs yield an order of magnitude

improvement in performance-per-watt compared to alter-

native state-of-the-art technology [40]. To support these

accelerators in TensorFlow, we define a common abstrac-

tion for devices. At a minimum, a device must implement

methods for (i) issuing a kernel for execution, (ii) allocat-

ing memory for inputs and outputs, and (iii) transferring

buffers to and from host memory. Each operator (e.g.,

matrix multiplication) can have multiple specialized im-

plementations for different devices. As a result, the same

program can easily target GPUs, TPUs, or mobile CPUs

as required for training, serving, and offline inference.

TensorFlow uses tensors of primitive values as a com-

mon interchange format that all devices understand. At

the lowest level, all tensors in TensorFlow are dense;

sparse tensors can be represented in terms of dense ones

(§3.1). This decision ensures that the lowest levels of the
system have simple implementations for memory alloca-

tion and serialization, thus reducing the framework over-

head. Tensors also enable other optimizations for memory

management and communication, such as RDMA and di-

rect GPU-to-GPU transfer.

The main consequence of these principles is that in

TensorFlow there is no such thing as a parameter server.

On a cluster, we deploy TensorFlow as a set of tasks

(named processes that can communicate over a network)

that each export the same graph execution API and con-

tain one or more devices. Typically a subset of those tasks

assumes the role that a parameter server plays in other

systems [11, 14, 20, 49], and we therefore call them PS

tasks; the others are worker tasks. However, since a PS

task is capable of running arbitrary TensorFlow graphs,

it is more flexible than a conventional parameter server:

users can program it with the same scripting interface that

they use to define models. This flexibility is the key dif-

ference between TensorFlow and contemporary systems,

and in the rest of the paper we will discuss some of the

applications that this flexibility enables.

2.3 Related work

Single-machine frameworks Many machine learning

researchers carry out their work on a single—often GPU-

equipped—computer [43, 44], and several single-machine

frameworks support this scenario. Caffe [38] is a high-

performance framework for training declaratively speci-

fied neural networks on multicore CPUs and GPUs. As

discussed above, its programming model is similar to

DistBelief (§2.1), so it is easy to compose models from
existing layers, but relatively difficult to add new layers

or optimizers. Theano [2] allows programmers to express

a model as a dataflow graph of primitive operators, and

generates efficient compiled code for training that model.

Its programming model is closest to TensorFlow, and it

provides much of the same flexibility in a single machine.

Unlike Caffe, Theano, and TensorFlow, Torch [17] of-

fers a powerful imperative programming model for sci-

entific computation and machine learning. It allows fine-

grained control over the execution order and memory uti-

lization, which enables power users to optimize the per-

formance of their programs. While this flexibility is use-

ful for research, Torch lacks the advantages of a dataflow

graph as a portable representation across small-scale ex-

perimentation, production training, and deployment.

Batch dataflow systems Starting with MapRe-

duce [21], batch dataflow systems have been applied

to a large number of machine learning algorithms [70],

and more recent systems have focused on increasing

expressivity and performance. DryadLINQ [74] adds a

high-level query language that supports more sophisti-

cated algorithms than MapReduce. Spark [75] extends

DryadLINQ with the ability to cache previously com-

puted datasets in memory, and is therefore better suited to

iterative machine learning algorithms (such as k-means
clustering and logistic regression) when the input data fit

in memory. Dandelion extends DryadLINQ with code

generation for GPUs [62] and FPGAs [16].

The principal limitation of a batch dataflow system is

that it requires the input data to be immutable, and all

of the subcomputations to be deterministic, so that the

system can re-execute subcomputations when machines

in the cluster fail. This feature—which is beneficial for

many conventional workloads—makes updating a ma-

268 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Preprocessing

Parameters

Mod

Shard 0

Shard 1

SumPart Stitch

Gather

Gather

Input

data
Reader

Read params Apply grads

Training

Fwd Back
Dist. FS

Periodic

checkpoint

True branch

False branch

Switch Merge

True

Training libraries Inference libs

Python client C++ client

C API

Networking layer

RPC …

Device layer

CPU …GPURDMA

Distributed master Dataflow executor

Kernel implementations

Const Var MatMul Conv2D ReLU Queue …

Shuffle queue Queue

Figure 2: A schematic TensorFlow dataflow graph for a training pipeline, containing subgraphs for reading input data,

preprocessing, training, and checkpointing state.

chine learning model an expensive operation. For ex-

ample, the SparkNet system for training deep neural net-

works on Spark takes 20 seconds to broadcast weights and

collect updates from five workers [55]. As a result, in

these systems, each model update step must process larger

batches, slowing convergence [8]. We show in Subsec-

tion 6.3 that TensorFlow can train larger models on larger

clusters with step times as short as 2 seconds.

Parameter servers As we discuss in Subsection 2.1, a

parameter server architecture uses a set of servers to man-

age shared state that is updated by a set of parallel work-

ers. This architecture emerged in work on scalable topic

modeling [65], and DistBelief showed how it can apply

to deep neural network training. Project Adam [14] fur-

ther applied this architecture for the efficient training of

convolutional neural networks; and Li et al.’s “Parame-

ter Server” [49] added innovations in consistency mod-

els, fault tolerance, and elastic rescaling. Despite earlier

skepticism that parameter servers would be compatible

with GPU acceleration [14], Cui et al. recently showed

that a parameter server specialized for use with GPUs can

achieve speedups on small clusters [18].

MXNet [11] is perhaps the closest system in design

to TensorFlow. It uses a dataflow graph to represent the

computation at each worker, and uses a parameter server

to scale training across multiple machines. The MXNet

parameter server exports a key-value store interface that

supports aggregating updates sent from multiple devices

in each worker, and using an arbitrary user-provided func-

tion to combine incoming updates with the current value.

The MXNet key-value store interface [22] does not cur-

rently allow sparse gradient updates within a single value,

which are crucial for the distributed training of large mod-

els (§4.2), and adding this feature would require modifi-
cations to the core system.

The parameter server architecture meets many of our

requirements, and with sufficient engineering effort it

would be possible to build most of the features that we

describe in this paper into a parameter server. For Tensor-

Flow we sought a high-level programming model that al-

lows users to customize the code that runs in all parts of

the system, so that the cost of experimentation with new

optimization algorithms and model architectures is lower.

In the next section, we describe the building blocks of a

TensorFlow program in more detail.

3 TensorFlow execution model

TensorFlow uses a single dataflow graph to represent

all computation and state in a machine learning algo-

rithm, including the individual mathematical operations,

the parameters and their update rules, and the input pre-

processing (Figure 2). The dataflow graph expresses the

communication between subcomputations explicitly, thus

making it easy to execute independent computations in

parallel and to partition computations across multiple de-

vices. TensorFlow differs from batch dataflow systems

(§2.3) in two respects:

• The model supports multiple concurrent executions
on overlapping subgraphs of the overall graph.

• Individual vertices may have mutable state that can
be shared between different executions of the graph.

The key observation in the parameter server architec-

ture [14, 20, 49] is that mutable state is crucial when

training very large models, because it becomes possible to

make in-place updates to very large parameters, and prop-

agate those updates to parallel training steps as quickly

as possible. Dataflow with mutable state enables Tensor-

Flow to mimic the functionality of a parameter server,

but with additional flexibility, because it becomes pos-

sible to execute arbitrary dataflow subgraphs on the ma-

chines that host the shared model parameters. As a re-

sult, our users have been able to experiment with different

optimization algorithms, consistency schemes, and paral-

lelization strategies.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 269

3.1 Dataflow graph elements

In a TensorFlow graph, each vertex represents a unit of

local computation, and each edge represents the output

from, or input to, a vertex. We refer to the computation

at vertices as operations, and the values that flow along

edges as tensors. In this subsection, we describe the com-

mon types of operations and tensors.

Tensors In TensorFlow, we model all data as tensors

(n-dimensional arrays) with the elements having one
of a small number of primitive types, such as int32,

float32, or string (where string can represent ar-

bitrary binary data). Tensors naturally represent the inputs

to and results of the common mathematical operations in

many machine learning algorithms: for example, a matrix

multiplication takes two 2-D tensors and produces a 2-D

tensor; and a batch 2-D convolution takes two 4-D tensors

and produces another 4-D tensor.

At the lowest level, all TensorFlow tensors are dense,

for the reasons we discuss in Subsection 2.2. TensorFlow

offers two alternatives for representing sparse data: either

encode the data into variable-length string elements of

a dense tensor, or use a tuple of dense tensors (e.g., an

n-D sparse tensor with m non-zero elements can be rep-
resented in coordinate-list format as an m × n matrix of
coordinates and a length-m vector of values). The shape
of a tensor can vary in one or more of its dimensions,

which makes it possible to represent sparse tensors with

differing numbers of elements.

Operations An operation takes m ≥ 0 tensors as input
and produces n ≥ 0 tensors as output. An operation has
a named “type” (such as Const, MatMul, or Assign)

and may have zero or more compile-time attributes that

determine its behavior. An operation can be polymorphic

and variadic at compile-time: its attributes determine both

the expected types and arity of its inputs and outputs.

For example, the simplest operation Const has no in-

puts and a single output; its value is a compile-time at-

tribute. For example, AddN sums multiple tensors of the

same element type, and it has a type attribute T and an

integer attribute N that define its type signature.

Stateful operations: variables An operation can con-

tain mutable state that is read and/or written each time

it executes. A Variable operation owns a mutable

buffer that may be used to store the shared parameters

of a model as it is trained. A Variable has no inputs,

and produces a reference handle, which acts as a typed

capability for reading and writing the buffer. A Read

operation takes a reference handle r as input, and out-
puts the value of the variable (State[r]) as a dense ten-
sor. Other operations modify the underlying buffer: for

example, AssignAdd takes a reference handle r and a
tensor value x, and when executed performs the update
State′[r] ← State[r] + x. Subsequent Read(r) opera-
tions produce the value State′[r].

Stateful operations: queues TensorFlow includes sev-

eral queue implementations, which support more ad-

vanced forms of coordination. The simplest queue is

FIFOQueue, which owns an internal queue of tensors,

and allows concurrent access in first-in-first-out order.

Other types of queues dequeue tensors in random and pri-

ority orders, which ensure that input data are sampled ap-

propriately. Like a Variable, the FIFOQueue opera-

tion produces a reference handle that can be consumed by

one of the standard queue operations, such as Enqueue

and Dequeue. These operations push their input onto the

tail of the queue and, respectively, pop the head element

and output it. Enqueue will block if its given queue is

full, and Dequeue will block if its given queue is empty.

When queues are used in an input preprocessing pipeline,

this blocking provides backpressure; it also supports syn-

chronization (§4.4). The combination of queues and dy-
namic control flow (§3.4) can also implement a form of
streaming computation between subgraphs.

3.2 Partial and concurrent execution

TensorFlow uses a dataflow graph to represent all possible

computations in a particular application. The API for ex-

ecuting a graph allows the client to specify declaratively

the subgraph that should be executed. The client selects

zero or more edges to feed input tensors into the dataflow,

and one or more edges to fetch output tensors from the

dataflow; the runtime then prunes the graph to contain the

necessary set of operations. Each invocation of the API is

called a step, and TensorFlow supports multiple concur-

rent steps on the same graph. Stateful operations allow

steps to share data and synchronize when necessary.

Figure 2 shows a typical training application, with

multiple subgraphs that execute concurrently and interact

through shared variables and queues. The core training

subgraph depends on a set of model parameters and on in-

put batches from a queue. Many concurrent steps of the

training subgraph update the model based on different in-

put batches, to implement data-parallel training. To fill

the input queue, concurrent preprocessing steps transform

individual input records (e.g., decoding images and apply-

ing random distortions), and a separate I/O subgraph reads

records from a distributed file system. A checkpointing

subgraph runs periodically for fault tolerance (§4.3).

Partial and concurrent execution is responsible for

much of TensorFlow’s flexibility. Adding mutable state

270 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

and coordination via queues makes it possible to spec-

ify a wide variety of model architectures in user-level

code, which enables advanced users to experiment with-

out modifying the internals of the TensorFlow runtime.

By default, concurrent executions of a TensorFlow sub-

graph run asynchronously with respect to one another.

This asynchrony makes it straightforward to implement

machine learning algorithms with weak consistency re-

quirements [61], which include many neural network

training algorithms [20]. As we discuss later, TensorFlow

also provides the primitives needed to synchronize work-

ers during training (§4.4), which has led to promising re-
sults on some learning tasks (§6.3).

3.3 Distributed execution

Dataflow simplifies distributed execution, because it

makes communication between subcomputations explicit.

It enables the same TensorFlow program to be deployed

to a cluster of GPUs for training, a cluster of TPUs for

serving, and a cellphone for mobile inference.

Each operation resides on a particular device, such as a

CPU or GPU in a particular task. A device is responsible

for executing a kernel for each operation assigned to it.

TensorFlow allows multiple kernels to be registered for

a single operation, with specialized implementations for

a particular device or data type (see §5 for details). For
many operations, such as element-wise operators (Add,

Sub, etc.), we can compile a single kernel implementation

for CPU and GPU using different compilers.

The TensorFlow runtime places operations on devices,

subject to implicit or explicit constraints in the graph.

The placement algorithm computes a feasible set of de-

vices for each operation, calculates the sets of operations

that must be colocated, and selects a satisfying device for

each colocation group. It respects implicit colocation con-

straints that arise because each stateful operation and its

state must be placed on the same device. In addition,

the user may specify partial device preferences such as

“any device in a particular task”, or “a GPU in any task”,

and the runtime will respect these constraints. A typical

training application will use client-side programming con-

structs to add constraints such that, for example, parame-

ters are distributed among a set of “PS” tasks (§4.2).

TensorFlow thus permits great flexibility in how opera-

tions in the dataflow graph are mapped to devices. While

simple heuristics yield adequate performance for novice

users, expert users can optimize performance by manually

placing operations to balance the computation, memory,

and network requirements across multiple tasks and mul-

tiple devices within those tasks. An open question is how

input = … # A sequence of tensors

state = 0 # Initial state

w = … # Trainable weights

for i in range(len(input)):

state, out[i] = f(state, w, input[i])

Figure 3: Pseudocode for an abstract RNN (§3.4). The
function f typically comprises differentiable operations

such as matrix multiplications and convolutions [32].

TensorFlow implements the loop in its dataflow graph.

TensorFlow can automatically determine placements that

achieve close to optimal performance on a given set of de-

vices, thus freeing users from this concern. Even without

such automation, it may be worthwhile to separate place-

ment directives from other aspects of model definitions,

so that, for example, it would be trivial to modify place-

ments after a model has been trained.

Once the operations in a graph have been placed, and

the partial subgraph has been computed for a step (§3.2),
TensorFlow partitions the operations into per-device sub-

graphs. A per-device subgraph for device d contains all
of the operations that were assigned to d, with additional
Send and Recv operations that replace edges across de-

vice boundaries. Send transmits its single input to a spec-

ified device as soon as the tensor is available, using a ren-

dezvous key to name the value. Recv has a single output,

and blocks until the value for a specified rendezvous key

is available locally, before producing that value. Send

and Recv have specialized implementations for several

device-type pairs; we describe some of these in Section 5.

We optimized TensorFlow for executing large sub-

graphs repeatedly with low latency. Once the graph for

a step has been pruned, placed, and partitioned, its sub-

graphs are cached in their respective devices. A client

session maintains the mapping from step definitions to

cached subgraphs, so that a distributed step on a large

graph can be initiated with one small message to each par-

ticipating task. This model favors static, reusable graphs,

but it can support dynamic computations using dynamic

control flow, as the next subsection describes.

3.4 Dynamic control flow

TensorFlow supports advanced machine learning algo-

rithms that contain conditional and iterative control flow.

For example, a recurrent neural network (RNN) [39] such

as an LSTM [32] can generate predictions from sequential

data. Google’s Neural Machine Translation system uses

TensorFlow to train a deep LSTM that achieves state-of-

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 271

the-art performance on many translation tasks [73]. The

core of an RNN is a recurrence relation, where the output

for sequence element i is a function of some state that ac-
cumulates across the sequence (Figure 3). In this case, dy-

namic control flow enables iteration over sequences that

have variable lengths, without unrolling the computation

to the length of the longest sequence.

As we discussed in Subsection 2.2, TensorFlow uses

deferred execution via the dataflow graph to offload larger

chunks of work to accelerators. Therefore, to imple-

ment RNNs and other advanced algorithms, we add con-

ditional (if statement) and iterative (while loop) program-

ming constructs in the dataflow graph itself. We use

these primitives to build higher-order constructs, such as

map(), fold(), and scan() [2].

For this purpose, we borrow the Switch and

Merge primitives from classic dynamic dataflow archi-

tectures [4]. Switch is a demultiplexer: it takes a data

input and a control input, and uses the control input to

select which of its two outputs should produce a value.

The Switch output not taken receives a special dead

value, which propagates recursively through the rest of

the graph until it reaches a Merge operation. Merge is

a multiplexer: it forwards at most one non-dead input to

its output, or produces a dead output if both of its inputs

are dead. The conditional operator uses Switch to ex-

ecute one of two branches based on the runtime value of

a boolean tensor, and Merge to combine the outputs of

the branches. The while loop is more complicated, and

uses Enter, Exit, and NextIteration operators to

ensure that the loop is well-formed [56].

The execution of iterations can overlap, and Tensor-

Flow can also partition conditional branches and loop

bodies across multiple devices and processes. The par-

titioning step adds logic to coordinate the start and ter-

mination of each iteration on each device, and to decide

the termination of the loop. As we will see in Subsec-

tion 4.1, TensorFlow also supports automatic differenti-

ation of control flow constructs. Automatic differentia-

tion adds the subgraphs for computing gradients to the

dataflow graph, which TensorFlow partitions across po-

tentially distributed devices to compute the gradients in

parallel.

4 Extensibility case studies

By choosing a unified representation for all computation

in TensorFlow, we enable users to experiment with fea-

tures that were hard-coded into the DistBelief runtime. In

this section, we discuss four extensions that we have built

using dataflow primitives and “user-level” code.

4.1 Differentiation and optimization

Many learning algorithms train a set of parameters using

some variant of SGD, which entails computing the gradi-

ents of a loss function with respect to those parameters,

then updating the parameters based on those gradients.

TensorFlow includes a user-level library that differentiates

a symbolic expression for a loss function and produces a

new symbolic expression representing the gradients. For

example, given a neural network as a composition of lay-

ers and a loss function, the library will automatically de-

rive the backpropagation code.

The differentiation algorithm performs breadth-first

search to identify all of the backwards paths from the tar-

get operation (e.g., a loss function) to a set of parameters,

and sums the partial gradients that each path contributes.

Our users frequently specialize the gradients for some op-

erations, and they have implemented optimizations like

batch normalization [33] and gradient clipping [60] to ac-

celerate training and make it more robust. We have ex-

tended the algorithm to differentiate conditional and it-

erative subcomputations (§3.4) by adding nodes to the
graph that record the control flow decisions in the for-

ward pass, and replaying those decisions in reverse during

the backward pass. Differentiating iterative computations

over long sequences can lead to a large amount of inter-

mediate state being accumulated in memory, and we have

developed techniques for managing limited GPU memory

on these computations.

TensorFlow users can also experiment with a wide

range of optimization algorithms, which compute new

values for the parameters in each training step. SGD is

easy to implement in a parameter server: for each param-

eter W , gradient ∂L/∂W , and learning rate α, the update
rule is W ′ ←W − α× ∂L/∂W . A parameter server can
implement SGD by using -= as the write operation, and

writing α× ∂L/∂W to each W after a training step.

However, there are many more advanced optimization

schemes that are difficult to express as a single write op-

eration. For example, the Momentum algorithm accumu-

lates a “velocity” for each parameter based on its gradi-

ent over multiple iterations, then computes the parameter

update from that accumulation; and many refinements to

this algorithm have been proposed [66]. Implementing

Momentum in DistBelief [20], required modifications to

the parameter server implementation to change the rep-

resentation of parameter data, and execute complex logic

in the write operation; such modifications are challeng-

ing for many users. Optimization algorithms are the topic

of active research, and researchers have implemented sev-

eral on top of TensorFlow, including Momentum, Ada-

Grad, AdaDelta, RMSProp, Adam, and L-BFGS. These

272 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Preprocessing

Parameters

Mod

Shard 0

Shard 1

SumPart Stitch

Gather

Gather

Input

data
Reader

Read params Apply grads

Training

Fwd Back
Dist. FS

Periodic

checkpoint

True branch

False branch

Switch Merge

True

Training libraries Inference libs

Python client C++ client

C API

Networking layer

RPC …

Device layer

CPU …GPURDMA

Distributed master Dataflow executor

Kernel implementations

Const Var MatMul Conv2D ReLU Queue …

Shuffle queue Queue

Figure 4: Schematic dataflow for an embedding layer

(§4.2) with a two-way sharded embedding matrix.

can be built in TensorFlow using Variable operations

and primitive mathematical operations without modifying

the underlying system, so it is easy to experiment with

new algorithms as they emerge.

4.2 Training very large models

To train a model on high-dimensional data, such as words

in a corpus of text [7], it is common to use a distributed

representation, which embeds a training example as a pat-

tern of activity across several neurons, and which can be

learned by backpropagation [30]. For example, in a lan-

guage model, a training example might be a sparse vector

with non-zero entries corresponding to the IDs of words

in a vocabulary, and the distributed representation for each

word will be a lower-dimensional vector [6]. “Wide and

deep learning” creates distributed representations from

cross-product transformations on categorical features, and

the implementation on TensorFlow is used to power the

Google Play app store recommender system [12].

Inference begins by multiplying a batch of b sparse vec-
tors against an n × d embedding matrix, where n is the
number of words in the vocabulary, and d is the desired
dimensionality, to produce a much smaller b × d dense
matrix representation; for training, most optimization al-

gorithms modify only the rows of the embedding matrix

that were read by the sparse multiplication. In TensorFlow

models that process sparse data, n × d can amount to gi-
gabytes of parameters: e.g., a large language model may

use over 109 parameters with a vocabulary of 800,000
words [41], and we have experience with document mod-

els [19] where the parameters occupy several terabytes.

Such models are too large to copy to a worker on every

use, or even to store in RAM on a single host.

We implement sparse embedding layers in the Tensor-

Flow graph as a composition of primitive operations. Fig-

ure 4 shows a simplified graph for an embedding layer

that is split across two parameter server tasks. The core

operation of this subgraph is Gather, which extracts a

sparse set of rows from a tensor, and TensorFlow colo-

cates this operation with the variable on which it operates.

The dynamic partition (Part) operation divides the in-

coming indices into variable-sized tensors that contain the

indices destined for each shard, and the dynamic stitching

(Stitch) operation reassembles the partial results from

each shard into a single result tensor. Each of these op-

erations has a corresponding gradient, so it supports au-

tomatic differentiation (§4.1), and the result is a set of
sparse update operations that act on just the values that

were originally gathered from each of the shards.

Users writing a TensorFlow model typically do not con-

struct graphs like Figure 4 manually. Instead TensorFlow

includes libraries that expose the abstraction of a sharded

parameter, and build appropriate graphs of primitive op-

erations based on the desired degree of distribution.

While sparse reads and updates are possible in a pa-

rameter server [49], TensorFlow adds the flexibility to

offload arbitrary computation onto the devices that host

the shared parameters. For example, classification mod-

els typically use a softmax classifier that multiplies the

final output by a weight matrix with c columns, where c
is the number of possible classes; for a language model,

c is the size of the vocabulary, which can be large. Our
users have experimented with several schemes to accel-

erate the softmax calculation. The first is similar to an

optimization in Project Adam [14], whereby the weights

are sharded across several tasks, and the multiplication

and gradient calculation are colocated with the shards.

More efficient training is possible using a sampled soft-

max [37], which performs a sparse multiplication based

on the true class for an example and a set of randomly

sampled false classes. We compare the performance of

these two schemes in §6.4.

4.3 Fault tolerance

Training a model can take several hours or days, even us-

ing a large number of machines [14, 20]. We often need to

train a model using non-dedicated resources, for example

using the Borg cluster manager [71], which does not guar-

antee availability of the same resources for the duration of

the training process. Therefore, a long-running Tensor-

Flow job is likely to experience failure or pre-emption,

and we require some form of fault tolerance. It is un-

likely that tasks will fail so often that individual opera-

tions need fault tolerance, so a mechanism like Spark’s

RDDs [75] would impose significant overhead for little

benefit. There is no need to make every write to the pa-

rameter state durable, because we can recompute any up-

date from the input data, and many learning algorithms do

not require strong consistency [61].

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 273

(a) Asynchronous replication (b) Synchronous replication (c) Synchronous w/ backup worker

PS

Worker 1

Worker 2

Worker 3

Figure 5: Three synchronization schemes for parallel SGD. Each color represents a different starting parameter value;

a white square is a parameter update. In (c), a dashed rectangle represents a backup worker whose result is discarded.

We implement user-level checkpointing for fault tol-

erance, using two operations in the graph (Figure 2):

Save writes one or more tensors to a checkpoint file, and

Restore reads one or more tensors from a checkpoint

file. Our typical configuration connects each Variable

in a task to the same Save operation, with one Save per

task, to maximize the I/O bandwidth to a distributed file

system. The Restore operations read named tensors

from a file, and a standard Assign stores the restored

value in its respective variable. During training, a typi-

cal client runs all of the Save operations periodically to

produce a new checkpoint; when the client starts up, it

attempts to Restore the latest checkpoint.

TensorFlow includes a client library for constructing

the appropriate graph structure and for invoking Save

and Restore as necessary. This behavior is customiz-

able: the user can apply different policies to subsets of the

variables in a model, or customize the checkpoint reten-

tion scheme. For example, many users retain checkpoints

with the highest score in a custom evaluation metric. The

implementation is also reusable: it may be used for model

fine-tuning and unsupervised pre-training [45, 47], which

are forms of transfer learning, in which the parameters of

a model trained on one task (e.g., recognizing general im-

ages) are used as the starting point for another task (e.g.,

recognizing breeds of dog). Having checkpoint and pa-

rameter management as programmable operations in the

graph gives users the flexibility to implement schemes like

these and others that we have not anticipated.

The checkpointing library does not attempt to produce

consistent checkpoints: if training and checkpointing ex-

ecute concurrently, the checkpoint may include none, all,

or some of the updates from the training step. This be-

havior is compatible with the relaxed guarantees of asyn-

chronous SGD [20]. Consistent checkpoints require ad-

ditional synchronization to ensure that update operations

do not interfere with checkpointing; if desired, one can

use the scheme in the next subsection to take a checkpoint

after the synchronous update step.

4.4 Synchronous replica coordination

SGD is robust to asynchrony [61], and many systems

train deep neural networks using asynchronous parame-

ter updates [14, 20], which are believed scalable because

they maintain high throughput in the presence of strag-

glers. The increased throughput comes at the cost of us-

ing stale parameter values in training steps. Some have

recently revisited the assumption that synchronous train-

ing does not scale [10, 18]. Since GPUs enable training

with hundreds—rather than thousands [47]—of machines,

synchronous training may be faster (in terms of time to

quality) than asynchronous training on the same platform.

Though we originally designed TensorFlow for asyn-

chronous training, we have begun experimenting with

synchronous methods. The TensorFlow graph enables

users to change how parameters are read and written when

training a model, and we implement three alternatives. In

the asynchronous case (Figure 5(a)), each worker reads

the current values of parameters when each step begins,

and applies its gradient to the (possibly different) current

values at the end: this approach ensures high utilization,

but the individual steps use stale parameter values, making

each step less effective. We implement the synchronous

version using queues (§3.1) to coordinate execution: a
blocking queue acts as a barrier to ensure that all workers

read the same parameter values, and a per-variable queue

accumulates gradient updates from all workers in order to

apply them atomically. The simple synchronous version

(Figure 5(b)) accumulates updates from all workers before

applying them, but slow workers limit overall throughput.

To mitigate stragglers, we implement backup work-

ers (Figure 5(c), [10]), which are similar to MapReduce

backup tasks [21]. Whereas MapReduce starts backup

tasks reactively—after detecting a straggler—our backup

workers run proactively, and the aggregation takes the

first m of n updates produced. We exploit the fact that
SGD samples training data randomly at each step, so each

worker processes a different random batch, and it is not a

274 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

Preprocessing

Parameters

Mod

Shard 0

Shard 1

SumPart Stitch

Gather

Gather

Input

data
Reader

Read params Apply grads

Training

Fwd Back
Dist. FS

Periodic

checkpoint

True branch

False branch

Switch Merge

True

Training libraries Inference libs

Python client C++ client

C API

Networking layer

RPC …

Device layer

CPU …GPURDMA

Distributed master Dataflow executor

Kernel implementations

Const Var MatMul Conv2D ReLU Queue …

Shuffle queue Queue

Figure 6: The layered TensorFlow architecture.

problem if a particular batch is ignored. In §6.3 we show
how backup workers improve throughput by up to 10%.

5 Implementation

The TensorFlow runtime is a cross-platform library. Fig-

ure 6 illustrates its architecture: a C API separates user-

level code in different languages from the core runtime.

The core TensorFlow library is implemented in C++ for

portability and performance: it runs on several operating

systems including Linux, Mac OS X, Windows, Android,

and iOS; the x86 and various ARM-based CPU architec-

tures; and NVIDIA’s Kepler, Maxwell, and Pascal GPU

microarchitectures. The implementation is open-source,

and we have accepted several external contributions that

enable TensorFlow to run on other architectures.

The distributed master translates user requests into ex-

ecution across a set of tasks. Given a graph and a step def-

inition, it prunes (§3.2) and partitions (§3.3) the graph to
obtain subgraphs for each participating device, and caches

these subgraphs so that they may be re-used in subsequent

steps. Since the master sees the overall computation for a

step, it applies standard optimizations such as common

subexpression elimination and constant folding; pruning

is a form of dead code elimination. It then coordinates ex-

ecution of the optimized subgraphs across a set of tasks.

The dataflow executor in each task handles requests

from the master, and schedules the execution of the ker-

nels that comprise a local subgraph. We optimize the

dataflow executor for running large graphs with low over-

head. Our current implementation can execute 10,000

subgraphs per second (§6.2), which enables a large num-
ber of replicas to make rapid, fine-grained training steps.

The dataflow executor dispatches kernels to local devices

and runs kernels in parallel when possible, for example by

using multiple CPU cores or GPU streams.

The runtime contains over 200 standard operations, in-

cluding mathematical, array manipulation, control flow,

and state management operations. Many of the operation

kernels are implemented using Eigen::Tensor [36], which

uses C++ templates to generate efficient parallel code for

multicore CPUs and GPUs; however, we liberally use li-

braries like cuDNN [13] where a more efficient kernel

implementation is possible. We have also implemented

quantization, which enables faster inference in environ-

ments such as mobile devices and high-throughput data-

center applications, and use the gemmlowp low-precision

matrix library [35] to accelerate quantized computation.

We specialize Send and Recv operations for each

pair of source and destination device types. Trans-

fers between local CPU and GPU devices use the

cudaMemcpyAsync()API to overlap computation and

data transfer; transfers between two local GPUs use

DMA to relieve pressure on the host. For transfers be-

tween tasks, TensorFlow uses multiple protocols, includ-

ing gRPC over TCP, and RDMA over Converged Ether-

net. We are also investigating optimizations for GPU-to-

GPU communication that use collective operations [59].

Section 4 describes features that we implement com-

pletely above the C API, in user-level code. Typically,

users compose standard operations to build higher-level

abstractions, such as neural network layers, optimization

algorithms (§4.1), and sharded embedding computations
(§4.2). TensorFlow supports multiple client languages,
and we have prioritized Python and C++, because our in-

ternal users are most familiar with these languages. As

features become more established, we typically port them

to C++, so that users can access an optimized implemen-

tation from all client languages.

If it is difficult or inefficient to represent a subcom-

putation as a composition of operations, users can reg-

ister additional kernels that provide an efficient imple-

mentation written in C++. We have found it profitable

to hand-implement fused kernels for some performance

critical operations, such as the ReLU and Sigmoid acti-

vation functions and their corresponding gradients. We

are currently investigating automatic kernel fusion using

a compilation-based approach.

In addition to the core runtime, our colleagues have

built several tools that aid users of TensorFlow. These

include serving infrastructure for inference in produc-

tion [27], a visualization dashboard that enables users to

follow the progress of a training run, a graph visualizer

that helps users to understand the connections in a model,

and a distributed profiler that traces the execution of a

computation across multiple devices and tasks. We de-

scribe these tools in an extended whitepaper [1].

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 275

6 Evaluation

In this section, we evaluate the performance of Tensor-

Flow on several synthetic and realistic workloads. Unless

otherwise stated, we run all experiments on a shared pro-

duction cluster, and all figures plot median values with

error bars showing the 10th and 90th percentiles.

In this paper we focus on system performance met-

rics, rather than learning objectives like time to accu-

racy. TensorFlow is a system that allows machine learn-

ing practitioners and researchers to experiment with new

techniques, and this evaluation demonstrates that the sys-

tem (i) has little overhead, and (ii) can employ large

amounts of computation to accelerate real-world applica-

tions. While techniques like synchronous replication can

enable some models to converge in fewer steps overall, we

defer the analysis of such improvements to other papers.

6.1 Single-machine benchmarks

Although TensorFlow is a system for “large-scale” ma-

chine learning, it is imperative that scalability does not

mask poor performance at small scales [51]. Table 1 con-

tains results from Chintala’s benchmark of convolutional

models on TensorFlow and three single-machine frame-

works [15]. All frameworks use a six-core Intel Core i7-

5930K CPU at 3.5 GHz and an NVIDIA Titan X GPU.

Training step time (ms)

Library AlexNet Overfeat OxfordNet GoogleNet

Caffe [38] 324 823 1068 1935

Neon [58] 87 211 320 270

Torch [17] 81 268 529 470

TensorFlow 81 279 540 445

Table 1: Step times for training four convolutional models

with different libraries, using one GPU. All results are for

training with 32-bit floats. The fastest time for each model

is shown in bold.

Table 1 shows that TensorFlow achieves shorter step

times than Caffe [38], and performance within 6% of the

latest version of Torch [17]. We attribute the similar per-

formance of TensorFlow and Torch to the fact that both

use the same version of the cuDNN library [13], which

implements the convolution and pooling operations on

the critical path for training; Caffe uses open-source im-

plementations for these operations that are simpler but

less efficient than cuDNN. The Neon library [58] outper-

forms TensorFlow on three of the models, by using hand-

optimized convolutional kernels [46] implemented in as-

sembly language; in principle, we could follow the same

approach in TensorFlow, but we have not yet done so.

1 2 5 10 25 50 100

Number of workers

1

10

100

1000

10000

B
a
tc

h
e
s
/s

e
c
o
n
d Scalar

Sparse 1GB

Sparse 16GB

Dense 100M

Dense 1GB

Figure 7: Baseline throughput for synchronous replication

with a null model. Sparse accesses enable TensorFlow to

handle larger models, such as embedding matrices (§4.2).

6.2 Synchronous replica microbenchmark

The performance of our coordination implementation

(§4.4) is the main limiting factor for scaling with addi-
tional machines. Figure 7 shows that number of null train-

ing steps that TensorFlow performs per second for vary-

ing model sizes, and increasing numbers of synchronous

workers. In a null training step, a worker fetches the

shared model parameters from 16 PS tasks, performs a

trivial computation, and sends updates to the parameters.

The Scalar curve in Figure 7 shows the best perfor-

mance that we could expect for a synchronous training

step, because only a single 4-byte value is fetched from

each PS task. The median step time is 1.8 ms using a sin-

gle worker, growing to 8.8 ms with 100 workers. These

times measure the overhead of the synchronization mech-

anism, and capture some of the noise that we expect when

running on a shared cluster.

The Dense curves show the performance of a null step

when the worker fetches the entire model. We repeat the

experiment with models of size 100 MB and 1 GB, with

the parameters sharded equally over 16 PS tasks. The me-

dian step time for 100 MB increases from 147 ms with one

worker to 613 ms with 100 workers. For 1 GB, it increases

from 1.01 s with one worker to 7.16 s with 100 workers.

For large models, a typical training step accesses only

a subset of the parameters, and the Sparse curves show

the throughput of the embedding lookup operation from

Subsection 4.2. Each worker reads 32 randomly selected

entries from a large embedding matrix containing 1 GB or

16 GB of data. As expected, the step times do not vary

with the size of the embedding, and TensorFlow achieves

step times ranging from 5 to 20 ms.

276 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

1 4 8 16 32 50

Number of workers

0

5

10

15

20

25

30

Im
a
g
e
s
/s

e
c
o
n
d
/w

o
rk

e
r

(a) Baseline performance vs. MXNet

TensorFlow

MXNet

25 50 100 200

Number of workers

0

500

1000

1500

2000

2500

3000

Im
a
g
e
s
/s

e
c
o
n
d

(b) Coordination scalability

Asynchronous

Synchronous

0 1 2 3 4 5

Number of backup workers

1.9

2.0

2.1

2.2

2.3

2.4

2.5

S
te

p
ti
m

e
(s

e
c
o
n
d
s
)

Step time

1.00

1.02

1.04

1.06

1.08

1.10

N
o
rm

a
li
z
e
d

s
p
e
e
d
u
p

(c) Backup worker effectiveness

Speedup

Figure 8: Results of the performance evaluation for Inception-v3 training (§6.3). (a) TensorFlow achieves slightly
better throughput than MXNet for asynchronous training. (b) Asynchronous and synchronous training throughput

increases with up to 200 workers. (c) Adding backup workers to a 50-worker training job can reduce the overall step

time, and improve performance even when normalized for resource consumption.

6.3 Image classification

Deep neural networks have achieved breakthrough perfor-

mance on computer vision tasks such as recognizing ob-

jects in photographs [44], and these tasks are a key ap-

plication for TensorFlow at Google. Training a network

to high accuracy requires a large amount of computa-

tion, and we use TensorFlow to scale out this computation

across a cluster of GPU-enabled servers. In these experi-

ments, we focus on Google’s Inception-v3 model, which

achieves 78.8% accuracy in the ILSVRC 2012 image clas-

sification challenge [69]; the same techniques apply to

other deep convolutional models—such as ResNet [28]—

implemented on TensorFlow. We investigate the scalabil-

ity of training Inception-v3 using multiple replicas. We

configure TensorFlow with 7 PS tasks, and vary the num-

ber of worker tasks using two different clusters.

For the first experiment, we compare the performance

training Inception using asynchronous SGD on Tensor-

Flow and MXNet, a contemporary system using a pa-

rameter server architecture. For this experiment we use

Google Compute Engine virtual machines running on In-

tel Xeon E5 servers with NVIDIA K80 GPUs, config-

ured with 8 vCPUs, 16Gbps of network bandwidth, and

one GPU per VM. Both systems use 7 PS tasks running

on separate VMs with no GPU. Figure 8(a) shows that

TensorFlow achieves performance that is marginally bet-

ter than MXNet. As expected, the results are largely de-

termined by single-GPU performance, and both systems

use cuDNN version 5.1, so they have access to the same

optimized GPU kernels.

Using a larger internal cluster (with NVIDIA K40

GPUs, and a shared datacenter network), we investigate

the effect of coordination (§4.4) on training performance.
Ideally, with efficient synchronous training, a model such

as Inception-v3 will train in fewer steps, and converge to

a higher accuracy than with asynchronous training [10].

Training throughput improves to 2,300 images per sec-

ond as we increase the number of workers to 200, but

with diminishing returns (Figure 8(b)). As we add more

workers, the step time increases, because there is more

contention on the PS tasks, both at the network interface

and in the aggregation of updates. As expected, for all

configurations, synchronous steps are longer than asyn-

chronous steps, because all workers must wait for the

slowest worker to catch up before starting the next step.

While the median synchronous step is approximately 10%

longer than an asynchronous step with the same workers,

above the 90th percentile the synchronous performance

degrades sharply, because stragglers disproportionately

impact tail latency.

To mitigate tail latency, we add backup workers so that

a step completes when the first m of n tasks produce gra-
dients. Figure 8(c) shows the effect of adding backup

workers to a 50-worker Inception training job. Each addi-

tional backup worker up to and including the fourth re-

duces the median step time, because the probability of

a straggler affecting the step decreases. Adding a fifth

backup worker slightly degrades performance, because

the 51st worker (i.e., the first whose result is discarded)

is more likely to be a non-straggler that generates more

incoming traffic for the PS tasks. Figure 8(c) also plots

the normalized speedup for each configuration, defined as

t(b)/t(0) × 50/(50 + b) (where t(b) is the median step
time with b backup workers), and which discounts the
speedup by the fraction of additional resources consumed.

Although adding 4 backup workers achieves the shortest

overall step time (1.93 s), adding 3 achieves the highest

normalized speedup (9.5%), and hence uses less aggre-

gate GPU-time to reach the same quality.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 277

1 2 4 8 16 32

Number of PS tasks

10
1

10
2

10
3

10
4

10
5

W
o
rd

s
p
ro

c
e
s
s
e
d
/s

e
c
o
n
d

(a) Full softmax

256 workers

32 workers

4 workers

1 2 4 8 16 32

Number of PS tasks

10
1

10
2

10
3

10
4

10
5

W
o
rd

s
p
ro

c
e
s
s
e
d
/s

e
c
o
n
d

(b) Sampled softmax

256 workers

32 workers

4 workers

Figure 9: Increasing the number of PS tasks leads to in-

creased throughput for language model training, by par-

allelizing the softmax computation. Sampled softmax in-

creases throughput by performing less computation.

6.4 Language modeling

Given a sequence of words, a language model predicts the

most probable next word [6]. Therefore, language mod-

els are integral to predictive text, speech recognition, and

translation applications. In this experiment, we investi-

gate how TensorFlow can train a recurrent neural network

(viz. LSTM-512-512 [41]) to model the text in the One

Billion Word Benchmark [9]. The vocabulary size |V |
limits the performance of training, because the final layer

must decode the output state into probabilities for each of

|V | classes [37]. The resulting parameters can be large
(|V | × d for output state dimension d) so we use the tech-
niques for handling large models from Subsection 4.2. We

use a restricted vocabulary of the most common 40,000

words—instead of the full 800,000 words [9]—in order to

experiment with smaller configurations.

Figure 9 shows the training throughput, measured in

words per second, for varying numbers of PS and worker

tasks, and two softmax implementations. The full softmax

(Figure 9(a)) multiplies each output by a 512 × 40,000
weight matrix sharded across the PS tasks. Adding more

PS tasks increases the throughput, because TensorFlow

can exploit distributed model parallelism [20, 43] and per-

form the multiplication and gradient calculation on the PS

tasks, as in Project Adam [14]. Adding a second PS task

is more effective than increasing from 4 to 32, or 32 to

256 workers. Eventually the throughput saturates, as the

LSTM calculations dominate the training step.

The sampled softmax (Figure 9(b)) reduces the data

transferred and the computation performed on the PS

tasks [37]. Instead of a dense weight matrix, it multiplies

the output by a random sparse matrix containing weights

for the true class and a random sample of false classes.

We sample 512 classes for each batch, thus reducing the

softmax data transfer and computation by a factor of 78.

7 Conclusions

We have described the TensorFlow system and its pro-

gramming model. TensorFlow’s dataflow representation

subsumes existing work on parameter server systems, and

offers a set of uniform abstractions that allow users to

harness large-scale heterogeneous systems, both for pro-

duction tasks and for experimenting with new approaches.

We have shown several examples of how the TensorFlow

programming model facilitates experimentation (§4) and
demonstrated that the resulting implementations are per-

formant and scalable (§6).
Our initial experience with TensorFlow is encourag-

ing. A large number of groups at Google have deployed

TensorFlow in production, and TensorFlow is helping our

research colleagues to make new advances in machine

learning. Since we released TensorFlow as open-source

software, more than 14,000 people have forked the source

code repository, the binary distribution has been down-

loaded over one million times, and dozens of machine

learning models that use TensorFlow have been published.

TensorFlow is a work in progress. Its flexible dataflow

representation enables power users to achieve excellent

performance, but we have not yet determined default

policies that work well for all users. Further research

on automatic optimization should bridge this gap. On

the system level, we are actively developing algorithms

for automatic placement, kernel fusion, memory manage-

ment, and scheduling. While the current implementations

of mutable state and fault tolerance suffice for applica-

tions with weak consistency requirements, we expect that

some TensorFlow applications will require stronger con-

sistency, and we are investigating how to build such poli-

cies at user-level. Finally, some users have begun to chafe

at the limitations of a static dataflow graph, especially for

algorithms like deep reinforcement learning [54]. There-

fore, we face the intriguing problem of providing a sys-

tem that transparently and efficiently uses distributed re-

sources, even when the structure of the computation un-

folds dynamically.

Acknowledgments

We gratefully acknowledge contributions from our col-

leagues within Google, and from members of the wider

machine learning community. In particular, we appreci-

ate the feedback we have received from the rest of the

Google Brain team and the many users of DistBelief and

TensorFlow. We thank the anonymous OSDI reviewers

and our shepherd KyoungSoo Park for their suggestions,

which greatly improved the presentation of this paper.

278 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo,

Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,

M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp,

G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser,

M. Kudlur, J. Levenberg, D. Mane, R. Monga,

S. Moore, D. G. Murray, C. Olah, M. Schus-

ter, J. Shlens, B. Steiner, I. Sutskever, K. Tal-

war, P. A. Tucker, V. Vanhoucke, V. Vasudevan,

F. B. Viégas, O. Vinyals, P. Warden, M. Watten-

berg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow:

Large-scale machine learning on heterogeneous dis-

tributed systems. arXiv preprint, 1603.04467, 2016.

arxiv.org/abs/1603.04467. Software available from

tensorflow.org.

[2] R. Al-Rfou, G. Alain, A. Almahairi, C. Anger-

mueller, D. Bahdanau, N. Ballas, F. Bastien,

J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio,

A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher

Snyder, N. Bouchard, N. Boulanger-Lewandowski,

X. Bouthillier, A. de Brébisson, O. Breuleux, P.-

L. Carrier, K. Cho, J. Chorowski, P. Christiano,

T. Cooijmans, M.-A. Côté, M. Côté, A. Courville,

Y. N. Dauphin, O. Delalleau, J. Demouth, G. Des-

jardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Du-

moulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan,

O. Firat, M. Germain, X. Glorot, I. Goodfellow,

M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet,

J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean,

K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lam-

blin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois,

S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey,

C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol,

O. Mastropietro, R. T. McGibbon, R. Memisevic,

B. van Merriënboer, V. Michalski, M. Mirza, A. Or-

landi, C. Pal, R. Pascanu, M. Pezeshki, C. Raf-

fel, D. Renshaw, M. Rocklin, A. Romero, M. Roth,

P. Sadowski, J. Salvatier, F. Savard, J. Schlüter,

J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk,

S. Shabanian, E. Simon, S. Spieckermann, S. R.

Subramanyam, J. Sygnowski, J. Tanguay, G. van

Tulder, J. Turian, S. Urban, P. Vincent, F. Visin,

H. de Vries, D. Warde-Farley, D. J. Webb, M. Will-

son, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang.

Theano: A Python framework for fast computa-

tion of mathematical expressions. arXiv preprint,

1605.02688, 2016. arxiv.org/abs/1605.02688.

[3] A. Angelova, A. Krizhevsky, and V. Van-

houcke. Pedestrian detection with a large-

field-of-view deep network. In Proceed-

ings of ICRA, pages 704–711. IEEE, 2015.

www.vision.caltech.edu/anelia/publications/

Angelova15LFOV.pdf.

[4] Arvind and D. E. Culler. Dataflow architectures.

In Annual Review of Computer Science Vol. 1,

1986, pages 225–253. Annual Reviews Inc., 1986.

www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&

doc=GetTRDoc.pdf&AD=ADA166235.

[5] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple ob-

ject recognition with visual attention. arXiv preprint,

1412.7755, 2014. arxiv.org/abs/1412.7755.

[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jau-

vin. A neural probabilistic language model. Journal

of Machine Learning Research, 3:1137–1155, 2003.

jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.

[7] T. Brants and A. Franz. Web 1T 5-gram version 1,

2006. catalog.ldc.upenn.edu/LDC2006T13.

[8] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu.

Sample size selection in optimization methods for

machine learning. Mathematical Programming,

134(1):127–155, 2012. dx.doi.org/10.1007/s10107-

012-0572-5.

[9] C. Chelba, T. Mikolov, M. Schuster, Q. Ge,

T. Brants, and P. Koehn. One billion word bench-

mark for measuring progress in statistical lan-

guage modeling. arXiv preprint, 1312.3005, 2013.

arxiv.org/abs/1312.3005.

[10] J. Chen, R. Monga, S. Bengio, and R. Joze-

fowicz. Revisiting distributed synchronous SGD.

In Proceedings of ICLR Workshop Track, 2016.

arxiv.org/abs/1604.00981.

[11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang,

M. Wang, T. Xiao, B. Xu, C. Zhang, and

Z. Zhang. MXNet: A flexible and efficient ma-

chine learning library for heterogeneous distributed

systems. In Proceedings of LearningSys, 2015.

www.cs.cmu.edu/˜muli/file/mxnet-learning-sys.pdf.

[12] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked,

T. Chandra, H. Aradhye, G. Anderson, G. Corrado,

W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong,

V. Jain, X. Liu, and H. Shah. Wide & deep

learning for recommender systems. arXiv preprint,

1606.07792, 2016. arxiv.org/abs/1606.07792.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 279

http://arxiv.org/abs/1603.04467
http://tensorflow.org
http://arxiv.org/abs/1605.02688
http://www.vision.caltech.edu/anelia/publications/Angelova15LFOV.pdf
http://www.vision.caltech.edu/anelia/publications/Angelova15LFOV.pdf
http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235
http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235
http://arxiv.org/abs/1412.7755
http://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
https://catalog.ldc.upenn.edu/LDC2006T13
http://dx.doi.org/10.1007/s10107-012-0572-5
http://dx.doi.org/10.1007/s10107-012-0572-5
http://arxiv.org/abs/1312.3005
http://arxiv.org/abs/1604.00981
https://www.cs.cmu.edu/~muli/file/mxnet-learning-sys.pdf
http://arxiv.org/abs/1606.07792

[13] S. Chetlur, C. Woolley, P. Vandermersch, J. Co-

hen, J. Tran, B. Catanzaro, and E. Shelhamer.

cuDNN: Efficient primitives for deep learning. arXiv

preprint, 1410.0759, 2014. arxiv.org/abs/1410.0759.

[14] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalya-

naraman. Project Adam: Building an effi-

cient and scalable deep learning training sys-

tem. In Proceedings of OSDI, pages 571–582,

2014. www.usenix.org/system/files/conference/

osdi14/osdi14-paper-chilimbi.pdf.

[15] S. Chintala. convnet-benchmarks, 2016.

github.com/soumith/convnet-benchmarks.

[16] E. S. Chung, J. D. Davis, and J. Lee. LINQits:

Big data on little clients. In Proceedings of ISCA,

pages 261–272, 2013. www.microsoft.com/en-

us/research/wp-content/uploads/2013/06/ISCA13 –

linqits.pdf.

[17] R. Collobert, S. Bengio, and J. Mariéthoz.

Torch: A modular machine learning soft-

ware library. Technical report, IDIAP, 2002.

infoscience.epfl.ch/record/82802/files/rr02-46.pdf.

[18] H. Cui, H. Zhang, G. R. Ganger, P. B. Gib-

bons, and E. P. Xing. GeePS: Scalable deep

learning on distributed GPUs with a GPU-

specialized parameter server. In Proceedings

of EuroSys, 2016. www.pdl.cmu.edu/PDL-FTP/

CloudComputing/GeePS-cui-eurosys16.pdf.

[19] A. Dai, C. Olah, and Q. V. Le. Document embedding

with paragraph vectors. arXiv preprint, 1507.07998,

2015. arxiv.org/abs/1507.07998.

[20] J. Dean, G. S. Corrado, R. Monga, K. Chen,

M. Devin, Q. V. Le, M. Z. Mao, M. Ran-

zato, A. Senior, P. Tucker, K. Yang, and

A. Y. Ng. Large scale distributed deep net-

works. In Proceedings of NIPS, pages 1232–1240,

2012. research.google.com/archive/large deep net-

works nips2012.pdf.

[21] J. Dean and S. Ghemawat. MapReduce:

Simplified data processing on large clusters.

In Proceedings of OSDI, pages 137–149,

2004. research.google.com/archive/mapreduce-

osdi04.pdf.

[22] DMLC. MXNet for deep learning, 2016.

github.com/dmlc/mxnet.

[23] J. Duchi, E. Hazan, and Y. Singer. Adap-

tive subgradient methods for online learning

and stochastic optimization. Journal of Ma-

chine Learning Research, 12:2121–2159, 2011.

jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

[24] A. Frome, G. S. Corrado, J. Shlens, S. Ben-

gio, J. Dean, T. Mikolov, et al. DeVISE: A

deep visual-semantic embedding model. In Pro-

ceedings of NIPS, pages 2121–2129, 2013. re-

search.google.com/pubs/archive/41473.pdf.

[25] J. Gonzalez-Dominguez, I. Lopez-Moreno,

P. J. Moreno, and J. Gonzalez-Rodriguez.

Frame-by-frame language identification in

short utterances using deep neural networks.

Neural Networks, 64:49–58, 2015. re-

search.google.com/pubs/archive/42929.pdf.

[26] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza,

B. Xu, D. Warde-Farley, S. Ozair, A. C.

Courville, and Y. Bengio. Generative adversar-

ial nets. In Proceedings of NIPS, pages 2672–

2680, 2014. papers.nips.cc/paper/5423-generative-

adversarial-nets.pdf.

[27] Google Research. Tensorflow serving, 2016. tensor-

flow.github.io/serving/.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep

residual learning for image recognition. In

Proceedings of CVPR, pages 770–778, 2016.

arxiv.org/abs/1512.03385.

[29] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen,

M. Ranzato, M. Devin, and J. Dean. Mul-

tilingual acoustic models using distributed

deep neural networks. In Proceedings

of ICASSP, pages 8619–8623, 2013. re-

search.google.com/pubs/archive/40807.pdf.

[30] G. E. Hinton. Learning distributed repre-

sentations of concepts. In Proceedings of

the Eighth Annual Conference of the Cog-

nitive Science Society, pages 1–12, 1986.

www.cogsci.ucsd.edu/˜ajyu/Teaching/Cogs202 –

sp13/Readings/hinton86.pdf.

[31] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl,

A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,

P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep

neural networks for acoustic modeling in speech

recognition: The shared views of four research

280 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

https://arxiv.org/abs/1410.0759
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf
https://github.com/soumith/convnet-benchmarks
https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/ISCA13_linqits.pdf
https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/ISCA13_linqits.pdf
https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/ISCA13_linqits.pdf
http://infoscience.epfl.ch/record/82802/files/rr02-46.pdf
http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/GeePS-cui-eurosys16.pdf
http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/GeePS-cui-eurosys16.pdf
http://arxiv.org/abs/1507.07998
http://research.google.com/archive/large_deep_networks_nips2012.pdf
http://research.google.com/archive/large_deep_networks_nips2012.pdf
http://research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce-osdi04.pdf
https://github.com/dmlc/mxnet
http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
http://research.google.com/pubs/archive/41473.pdf
http://research.google.com/pubs/archive/41473.pdf
http://research.google.com/pubs/archive/42929.pdf
http://research.google.com/pubs/archive/42929.pdf
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
https://tensorflow.github.io/serving/
https://tensorflow.github.io/serving/
http://arxiv.org/abs/1512.03385
https://research.google.com/pubs/archive/40807.pdf
https://research.google.com/pubs/archive/40807.pdf
http://www.cogsci.ucsd.edu/~ajyu/Teaching/Cogs202_sp13/Readings/hinton86.pdf
http://www.cogsci.ucsd.edu/~ajyu/Teaching/Cogs202_sp13/Readings/hinton86.pdf

groups. IEEE Signal Process. Mag., 29(6):82–

97, 2012. www.cs.toronto.edu/˜gdahl/papers/

deepSpeechReviewSPM2012.pdf.

[32] S. Hochreiter and J. Schmidhuber. Long short-term

memory. Neural computation, 9(8):1735–1780,

1997. deeplearning.cs.cmu.edu/pdfs/Hochreiter97 –

lstm.pdf.

[33] S. Ioffe and C. Szegedy. Batch normaliza-

tion: Accelerating deep network training by

reducing internal covariate shift. In Pro-

ceedings of ICML, pages 448–456, 2015.

jmlr.org/proceedings/papers/v37/ioffe15.pdf.

[34] M. Isard, M. Budiu, Y. Yu, A. Birrell, and

D. Fetterly. Dryad: distributed data-parallel

programs from sequential building blocks.

In Proceedings of EuroSys, pages 59–72,

2007. www.microsoft.com/en-us/research/wp-

content/uploads/2007/03/eurosys07.pdf.

[35] B. Jacob et al. gemmlowp: a small self-

contained low-precision GEMM library, 2015.

github.com/google/gemmlowp.

[36] B. Jacob, G. Guennebaud, et al. Eigen library for

linear algebra. eigen.tuxfamily.org.

[37] S. Jean, K. Cho, R. Memisevic, and Y. Ben-

gio. On using very large target vocabulary

for neural machine translation. In Proceed-

ings of ACL-ICJNLP, pages 1–10, July 2015.

www.aclweb.org/anthology/P15-1001.

[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev,

J. Long, R. Girshick, S. Guadarrama, and T. Dar-

rell. Caffe: Convolutional architecture for fast fea-

ture embedding. In Proceedings of ACM Multime-

dia, pages 675–678, 2014. arxiv.org/abs/1408.5093.

[39] M. I. Jordan. Serial order: A parallel dis-

tributed processing approach. ICS report

8608, Institute for Cognitive Science, UCSD,

La Jolla, 1986. cseweb.ucsd.edu/˜gary/PAPER-

SUGGESTIONS/Jordan-TR-8604.pdf.

[40] N. Jouppi. Google supercharges machine

learning tasks with TPU custom chip, 2016.

cloudplatform.googleblog.com/2016/05/Google-

supercharges-machine-learning-tasks-with-custom-

chip.html.

[41] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer,

and Y. Wu. Exploring the limits of language

modeling. arXiv preprint, 1602.02410, 2016.

arxiv.org/abs/1602.02410.

[42] A. Karpathy, G. Toderici, S. Shetty, T. Leung,

R. Sukthankar, and L. Fei-Fei. Large-scale video

classification with convolutional neural networks. In

Proceedings of CVPR, pages 1725–1732, 2014. re-

search.google.com/pubs/archive/42455.pdf.

[43] A. Krizhevsky. One weird trick for paralleliz-

ing convolutional neural networks. arXiv preprint,

1404.5997, 2014. arxiv.org/abs/1404.5997.

[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton.

ImageNet classification with deep convolutional

neural networks. In Proceedings of NIPS, pages

1106–1114, 2012. papers.nips.cc/paper/4824-

imagenet-classification-with-deep-convolutional-

neural-networks.pdf.

[45] H. Larochelle, Y. Bengio, J. Louradour,

and P. Lamblin. Exploring strategies for

training deep neural networks. Journal

of Machine Learning Research, 10:1–40,

2009. jmlr.org/papers/volume10/larochelle09a/

larochelle09a.pdf.

[46] A. Lavin and S. Gray. Fast algorithms for convo-

lutional neural networks. In Proceedings of CVPR,

pages 4013–4021, 2016. arxiv.org/abs/1509.09308.

[47] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Cor-

rado, K. Chen, J. Dean, and A. Ng. Building

high-level features using large scale unsupervised

learning. In Proceedings of ICML, pages 81–88,

2012. research.google.com/archive/unsupervised –

icml2012.pdf.

[48] Y. LeCun, C. Cortes, and C. J. Burges. The

MNIST database of handwritten digits, 1998.

yann.lecun.com/exdb/mnist/.

[49] M. Li, D. G. Andersen, J. Park, A. J. Smola,

A. Ahmed, V. Josifovski, J. Long, E. J.

Shekita, and B.-Y. Su. Scaling distributed ma-

chine learning with the Parameter Server. In

Proceedings of OSDI, pages 583–598, 2014.

www.usenix.org/system/files/conference/osdi14/

osdi14-paper-li mu.pdf.

[50] C. J. Maddison, A. Huang, I. Sutskever, and D. Sil-

ver. Move evaluation in Go using deep convolutional

neural networks. arXiv preprint, 1412.6564, 2014.

arxiv.org/abs/1412.6564.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 281

http://www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf
http://www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf
http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
https://www.microsoft.com/en-us/research/wp-content/uploads/2007/03/eurosys07.pdf
https://www.microsoft.com/en-us/research/wp-content/uploads/2007/03/eurosys07.pdf
https://github.com/google/gemmlowp
http://eigen.tuxfamily.org
http://www.aclweb.org/anthology/P15-1001
http://arxiv.org/abs/1408.5093
http://cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604.pdf
http://cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604.pdf
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
http://arxiv.org/abs/1602.02410
https://research.google.com/pubs/archive/42455.pdf
https://research.google.com/pubs/archive/42455.pdf
http://arxiv.org/abs/1404.5997
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
http://jmlr.org/papers/volume10/larochelle09a/larochelle09a.pdf
http://jmlr.org/papers/volume10/larochelle09a/larochelle09a.pdf
http://arxiv.org/abs/1509.09308
http://research.google.com/archive/unsupervised_icml2012.pdf
http://research.google.com/archive/unsupervised_icml2012.pdf
http://yann.lecun.com/exdb/mnist/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
http://arxiv.org/abs/1412.6564

[51] F. McSherry, M. Isard, and D. G. Mur-

ray. Scalability! But at what COST? In

Proceedings of HotOS, HOTOS’15, 2015.

www.usenix.org/system/files/conference/hotos15/

hotos15-paper-mcsherry.pdf.

[52] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Ef-

ficient estimation of word representations in vector

space. In Proceedings of ICLR Workshops Track,

2013. arxiv.org/abs/1301.3781.

[53] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu.

Recurrent models of visual attention. In Pro-

ceedings of NIPS, pages 2204–2212, 2014.

papers.nips.cc/paper/5542-recurrent-models-of-

visual-attention.pdf.

[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu,

J. Veness, M. G. Bellemare, A. Graves, M. Ried-

miller, A. K. Fidjeland, G. Ostrovski, S. Petersen,

C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Ku-

maran, D. Wierstra, S. Legg, and D. Hassabis.

Human-level control through deep reinforcement

learning. Nature, 518(7540):529–533, 02 2015.

dx.doi.org/10.1038/nature14236.

[55] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan.

SparkNet: Training deep networks in Spark. In Pro-

ceedings of ICLR, 2016. arxiv.org/abs/1511.06051.

[56] D. G. Murray, F. McSherry, M. Isard, R. Isaacs,

P. Barham, and M. Abadi. Incremental, it-

erative data processing with timely dataflow.

Commun. ACM, 59(10):75–83, Sept. 2016.

dl.acm.org/citation.cfm?id=2983551.

[57] A. Nair, P. Srinivasan, S. Blackwell, C. Alci-

cek, R. Fearon, A. De Maria, V. Panneershel-

vam, M. Suleyman, C. Beattie, S. Petersen, et al.

Massively parallel methods for deep reinforce-

ment learning. arXiv preprint, 1507.04296, 2015.

arxiv.org/abs/1507.04296.

[58] Nervana Systems. Neon deep learning framework,

2016. github.com/NervanaSystems/neon.

[59] NVIDIA Corporation. NCCL: Optimized primitives

for collective multi-GPU communication, 2016.

github.com/NVIDIA/nccl.

[60] R. Pascanu, T. Mikolov, and Y. Bengio. On

the difficulty of training recurrent neural networks.

In Proceedings of ICML, pages 1310–1318, 2013.

jmlr.org/proceedings/papers/v28/pascanu13.pdf.

[61] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild:

A lock-free approach to parallelizing stochas-

tic gradient descent. In Proceedings of NIPS,

pages 693–701, 2011. papers.nips.cc/paper/4390-

hogwild-a-lock-free-approach-to-parallelizing-

stochastic-gradient-descent.pdf.

[62] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Mar-

tin, and D. Fetterly. Dandelion: a com-

piler and runtime for heterogeneous systems.

In Proceedings of SOSP, pages 49–68, 2013.

sigops.org/sosp/sosp13/papers/p49-rossbach.pdf.

[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams.

Learning representations by back-propagating er-

rors. In Cognitive modeling, volume 5, pages

213–220. MIT Press, 1988. www.cs.toronto.edu/

˜hinton/absps/naturebp.pdf.

[64] O. Russakovsky, J. Deng, H. Su, J. Krause,

S. Satheesh, S. Ma, Z. Huang, A. Karpathy,

A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-

Fei. ImageNet Large Scale Visual Recognition Chal-

lenge. International Journal of Computer Vision,

115(3):211–252, 2015. arxiv.org/abs/1409.0575.

[65] A. Smola and S. Narayanamurthy. An ar-

chitecture for parallel topic models. Proc.

VLDB Endow., 3(1–2):703–710, Sept. 2010.

vldb.org/pvldb/vldb2010/papers/R63.pdf.

[66] I. Sutskever, J. Martens, G. E. Dahl, and

G. E. Hinton. On the importance of initial-

ization and momentum in deep learning. In

Proceedings of ICML, pages 1139–1147, 2013.

jmlr.org/proceedings/papers/v28/sutskever13.pdf.

[67] I. Sutskever, O. Vinyals, and Q. V. Le. Se-

quence to sequence learning with neural net-

works. In Proceedings of NIPS, pages 3104–

3112, 2014. papers.nips.cc/paper/5346-sequence-to-

sequence-learning-with-neural.pdf.

[68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,

D. Anguelov, D. Erhan, V. Vanhoucke, and

A. Rabinovich. Going deeper with convolu-

tions. In Proceedings of CVPR, pages 1–9, 2015.

arxiv.org/abs/1409.4842.

[69] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and

Z. Wojna. Rethinking the Inception architecture for

computer vision. arXiv preprint, 1512.00567, 2015.

arxiv.org/abs/1512.00567.

282 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
http://arxiv.org/abs/1301.3781
http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
http://dx.doi.org/10.1038/nature14236
http://arxiv.org/abs/1511.06051
http://dl.acm.org/citation.cfm?id=2983551
http://arxiv.org/abs/1507.04296
http://github.com/NervanaSystems/neon
https://github.com/NVIDIA/nccl
http://jmlr.org/proceedings/papers/v28/pascanu13.pdf
http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
http://sigops.org/sosp/sosp13/papers/p49-rossbach.pdf
http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf
http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf
http://arxiv.org/abs/1409.0575
http://vldb.org/pvldb/vldb2010/papers/R63.pdf
http://jmlr.org/proceedings/papers/v28/sutskever13.pdf
http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.pdf
http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.pdf
http://arxiv.org/abs/1409.4842
http://arxiv.org/abs/1512.00567

[70] C. tao Chu, S. K. Kim, Y. an Lin, Y. Yu,

G. Bradski, K. Olukotun, and A. Y. Ng.

Map-reduce for machine learning on multi-

core. In Proceedings of NIPS, pages 281–288,

2007. papers.nips.cc/paper/3150-map-reduce-for-

machine-learning-on-multicore.pdf.

[71] A. Verma, L. Pedrosa, M. Korupolu, D. Op-

penheimer, E. Tune, and J. Wilkes. Large-

scale cluster management at Google with

Borg. In Proceedings of EuroSys, 2015. re-

search.google.com/pubs/archive/43438.pdf.

[72] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever,

and G. Hinton. Grammar as a foreign language.

arXiv preprint, 2014. arxiv.org/abs/1412.7449.

[73] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,

W. Macherey, M. Krikun, Y. Cao, Q. Gao,

K. Macherey, J. Klingner, A. Shah, M. Johnson,

X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo,

H. Kazawa, K. Stevens, G. Kurian, N. Patil,

W. Wang, C. Young, J. Smith, J. Riesa, A. Rud-

nick, O. Vinyals, G. Corrado, M. Hughes, and

J. Dean. Google’s Neural Machine Translation sys-

tem: Bridging the gap between human and ma-

chine translation. arXiv preprint, 1609.08144, 2016.

arxiv.org/abs/1609.08144.

[74] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlings-

son, P. K. Gunda, and J. Currey. DryadLINQ:

A system for general-purpose distributed data-

parallel computing using a high-level language.

In Proceedings of OSDI, pages 1–14, 2008.

www.usenix.org/legacy/event/osdi08/tech/full pa-

pers/yu y/yu y.pdf.

[75] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,

M. McCauley, M. J. Franklin, S. Shenker, and I. Sto-

ica. Resilient distributed datasets: A fault-tolerant

abstraction for in-memory cluster computing.

In Proceedings of NSDI, pages 15–28, 2012.

https://www.usenix.org/system/files/conference/

nsdi12/nsdi12-final138.pdf.

[76] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao,

K. Yang, Q. Le, P. Nguyen, A. Senior, V. Van-

houcke, J. Dean, and G. E. Hinton. On recti-

fied linear units for speech processing. In Pro-

ceedings of ICASSP, pages 3517–3521, 2013. re-

search.google.com/pubs/archive/40811.pdf.

USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 283

http://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf
http://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf
http://research.google.com/pubs/archive/43438.pdf
http://research.google.com/pubs/archive/43438.pdf
http://arxiv.org/abs/1412.7449
http://arxiv.org/abs/1609.08144
https://www.usenix.org/legacy/event/osdi08/tech/full_papers/yu_y/yu_y.pdf
https://www.usenix.org/legacy/event/osdi08/tech/full_papers/yu_y/yu_y.pdf
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
http://research.google.com/pubs/archive/40811.pdf
http://research.google.com/pubs/archive/40811.pdf