tl;dr: I got ~40% faster CPU-only training on a small CNN by building TensorFlow from source to use SSE/AVX/FMA instructions. Look at some example build flags. Then do it.

MNIST is the “Hello World” of machine learning, and TensorFlow’s MNIST For Beginners is a pretty user-friendly way to get started. As user-friendly as ML can get, anyway.

Installing TensorFlow from the recommended pip .whl is quick and painless. But upon testing my installation, I got these “warnings”:

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
>>> print(sess.run(hello))
b'Hello, TensorFlow!'



But wait! “Could speed up CPU computations”? That sounds pretty good. However, the only way to take advantage of these speed improvements is to build TensorFlow from source.

Funny though - TensorFlow actually warns you about doing just that. Venture into this uncharted territory at your own risk, kiddo - here be dragons:

I know people who will pay a few extra hundred dollars for a slightly-higher wattage microwave, just so they can save a few seconds every time they heat up lunch. If you’re one of these people, the feeling that you might be waiting around longer than absolutely necessary for training on a non-optimally configured install of TensorFlow might not sit well. But exactly how much of a difference will it make? Is it worth the dragons? What do these instructions even mean? In the name of science, let’s find out.
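Before going any further, it's worth checking which of these instruction sets your CPU actually supports. On Linux they show up on the `flags` line of /proc/cpuinfo; the snippet below greps a sample flags line so it's self-contained (on a real machine, grep the actual file as in the comment):

```shell
# On a real machine, run: grep -m1 '^flags' /proc/cpuinfo
# Here we use a sample "flags" line so the snippet is self-contained.
flags="flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx avx2 fma"
echo "$flags" | grep -oEw 'sse3|ssse3|sse4_1|sse4_2|avx|avx2|fma' | sort -u
```

If a flag TensorFlow warned about doesn't appear in your output, don't pass the corresponding `-m` option to the build later on.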

# Measuring training time using MNIST

I used the mnist_softmax.py given in the tutorial as my first test. Recording the training time is pretty simple - just wrap some calls to time.time() around the training loop and print the elapsed time at the end, alongside the accuracy:

    import time

    # ...

    # run training step many times!
    start = time.time()
    for _ in range(2000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    end = time.time()

    # ...

    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))
    print("Training time: {} sec".format(end - start))
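If you're timing several scripts, a small context manager (a hypothetical helper, not part of the tutorial code) keeps the start/end bookkeeping out of the training loop:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time of the enclosed block."""
    start = time.perf_counter()  # preferred over time.time() for intervals
    yield
    print("{}: {:.2f} sec".format(label, time.perf_counter() - start))

# Usage with the training loop above would look like:
#
#     with timed("Training time"):
#         for _ in range(2000):
#             batch_xs, batch_ys = mnist.train.next_batch(100)
#             sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```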


“Wait a second,” you might protest. “That’s not even a real neural net! That’s just a multinomial logistic regression. That’s not computationally intensive enough to be interesting.” And a fair point you’d be making. Let’s also test it on a small but real convolutional neural network (CNN) for MNIST.

I made my own mnist_cnn.py as an exercise, according to the instructions here. But it’s basically identical to TensorFlow’s example file at mnist_deep.py.

# What are SSE3, SSE4.1, SSE4.2, AVX, AVX2, and FMA? And why would they make TensorFlow faster?

SSE stands for Streaming SIMD Extensions, and SIMD stands for Single Instruction Multiple Data. In short, it sounds like these instructions allow your CPU to behave a little more like a GPU, doing the same operation on multiple data objects, instead of processing them one at a time. Handy for processing 3D graphics or - you guessed it - training layers of a neural net.

The AVX (Advanced Vector Extensions) and FMA (Fused Multiply Add) instructions are also extensions to SIMD operations.
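Pure Python can't emit these instructions, but the lane-wise idea can be sketched. The scalar dot product below does one multiply and one add per element; the "SIMD-style" version accumulates four lanes per chunk, which is roughly what one 256-bit AVX instruction does to four 64-bit floats (and FMA fuses each multiply-and-add pair into a single instruction). This is only a conceptual picture, not a benchmark:

```python
def dot_scalar(a, b):
    """One multiply and one add per element."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_simd_style(a, b, width=4):
    """Conceptual SIMD: accumulate `width` independent lanes per chunk.
    On real AVX hardware the inner loop is a single wide instruction;
    with FMA, the multiply and the add are fused into one instruction too."""
    lanes = [0.0] * width
    for i in range(0, len(a), width):
        for j in range(width):  # one 'wide' operation per chunk on real hardware
            lanes[j] += a[i + j] * b[i + j]
    return sum(lanes)

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.0] * 8
print(dot_scalar(a, b))      # 36.0
print(dot_simd_style(a, b))  # 36.0 - same math, ~4x fewer instructions on SIMD hardware
```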

Tech-wise, the Wikipedia article and the Intel documentation for SSE go far over my head. If anyone can ELI5, please do!

# So how do I build TensorFlow with support for SSE/AVX/FMA?

First, read all of TensorFlow’s instructions on building from source for your OS.

Depending on your specific configuration, you may want to consult this thread and pick your own particular set of flags for the build process. Here’s what I used when I got to the bazel build portion of the instructions:

    bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both -k //tensorflow/tools/pip_package:build_pip_package


I did not use --config=cuda for this particular build, as I am doing a CPU-only build. A fun next step would be to try to optimize for the fastest CUDA-enabled TensorFlow build using an NVIDIA GPU!

The build process took much longer than expected - 6718.2s or almost 2 hours on my laptop. Plan accordingly and make sure you’ll have access to power.

This will generate a build_pip_package script. Run this to build a .whl file, which you can output to /tmp/tensorflow_pkg or another folder of your choice.

    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg


Finally, cd into the folder where you output your .whl (e.g. /tmp/tensorflow_pkg), and run pip install [filename of .whl]. Everything else should look exactly the same.

# The result? 43% speed improvement on a small CNN

Here are the results on the CNN. Totally worth the effort!

File: mnist_cnn.py

| Training loops | Test accuracy | Time (sec) - source build | Time (sec) - default .whl | % speed improvement |
| --- | --- | --- | --- | --- |
| 300 | 92.33% | 32.81 | 56.90 | 42.33% |
| 400 | 94.10% | 42.62 | 75.61 | 43.62% |
| 500 | 94.37% | 53.57 | 94.40 | 43.24% |
| 1000 | 96.29% | 106.24 | 193.57 | 45.11% |

As might have been expected, mnist_softmax.py runtime didn’t show any meaningful change - it probably doesn’t do enough computation per step for the vector instructions to matter:

File: mnist_softmax.py

| Training loops | Test accuracy | Time (sec) - source build | Time (sec) - default .whl | % speed improvement |
| --- | --- | --- | --- | --- |
| 500 | 91.35% | 0.62 | 0.65 | 3.55% |
| 1000 | 92.15% | 1.41 | 1.39 | -1.24% |
| 2000 | 91.98% | 2.55 | 2.90 | 11.86% |

I have not run the recommended 10,000 training steps to achieve 99.2% accuracy on the MNIST CNN (because ain’t nobody got time for that on CPU-only TensorFlow). Given this data, however, it’s good to know that I should be able to do so in about 18 minutes using the source-built TensorFlow, vs. about 32 minutes using the default install.
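That estimate is just a linear extrapolation from the 1000-loop row of the CNN table:

```python
# Scale the 1000-loop CNN timings up to the tutorial's 10,000 steps.
steps_measured, steps_target = 1000, 10000
time_source = 106.24   # sec, source-built TensorFlow (table above)
time_default = 193.57  # sec, default .whl (table above)

scale = steps_target / steps_measured
print("source build: ~{:.1f} min".format(time_source * scale / 60))   # ~17.7 min
print("default .whl: ~{:.1f} min".format(time_default * scale / 60))  # ~32.3 min
```

This assumes per-step time stays constant as training runs, which held up well across the 300-1000 loop measurements above.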

# Specs

Here are the specs used to run the tests. YMMV:

• Ubuntu 16.04
• conda 4.5.4
• Python 3.6.5
• Intel(R) Core(TM) i5-5200U
• 8GB DDR3

.whl used in regular install: