TensorFlow: How to optimise your input pipeline with queues and multi-threading

Morgan · Published in metaflow-ai · Mar 2, 2017

TensorFlow 1.0 is out and, along with this update, some nice recommendations appeared on the TF website. One that particularly caught my attention concerns the feed_dict system used when you make a call to sess.run():

One common cause of poor performance is underutilizing GPUs, or essentially “starving” them of data by not setting up an efficient pipeline. (…) Unless for a special circumstance or for example code, do not feed data into the session from Python variables…

And so far, of course, I’ve been doing exactly that, exclusively using the feed_dict system to train my models… Let’s change this bad habit!

There is already extensive documentation on TF queues and some very nice visualisations on the TF website (I encourage you to go watch them). To avoid being redundant, we will focus on a real and basic use case with complete code.

We will explore queues, the QueueRunner and coordinators to improve training speed on a very basic example by 33%, thanks to multi-threading and optimised memory handling. We will also keep a close eye on our performance on a single GPU (an Nvidia GTX Titan X).

Let’s start from a simple neural net using the feed_dict system to train on a naive task. Then we will evolve our code to leverage the benefits of queues and remove this dependency.

Check the comments for some hints; nothing new here, this is just a starting point:
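Roughly, the starting point looks like this minimal sketch (the layer sizes, the sleep-based “filesystem simulation” and the read_fake_data helper are assumptions of mine, not necessarily the exact code behind the measurements below):

```python
import time

import numpy as np
import tensorflow as tf

batch_size, input_dim, nb_batches = 128, 10, 1000

def read_fake_data():
    # "Filesystem simulation": the sleep stands in for I/O latency
    time.sleep(0.01)
    x_batch = np.random.randn(batch_size, input_dim).astype(np.float32)
    # Labels are derived from the inputs in Python here; the queue-based
    # version later moves this computation into the graph
    y_batch = (np.sum(x_batch, axis=1, keepdims=True) > 0).astype(np.float32)
    return x_batch, y_batch

# Inputs come from Python through placeholders
x = tf.placeholder(tf.float32, shape=[None, input_dim], name="x")
y_true = tf.placeholder(tf.float32, shape=[None, 1], name="y_true")

# A tiny two-layer fully-connected network
w1 = tf.Variable(tf.truncated_normal([input_dim, 64], stddev=0.1))
b1 = tf.Variable(tf.zeros([64]))
h = tf.nn.relu(tf.matmul(x, w1) + b1)
w2 = tf.Variable(tf.truncated_normal([64, 1], stddev=0.1))
b2 = tf.Variable(tf.zeros([1]))
y_pred = tf.matmul(h, w2) + b2

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_pred))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    start = time.time()
    for i in range(nb_batches):
        x_batch, y_batch = read_fake_data()
        # feed_dict copies the NumPy arrays into the session at every step
        _, l = sess.run([train_op, loss],
                        feed_dict={x: x_batch, y_true: y_batch})
    print("Training time: %.2fs" % (time.time() - start))
```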

I will run this script on my GPU and we’ll gather some statistics:

Training phase monitoring of the first example with logs and nvidia-smi

Multiple remarks:

  • The “filesystem simulation” isn’t credible, but we’ll keep this behaviour consistent across all our tests so we can ignore its impact.
  • We feed data to our model using the feed_dict system. This forces TF to copy the Python data into the session.
  • We are only using ~31% of our GPU throughout the whole training
  • It takes ~18 seconds to train this NN

One could think that this is all we can get out of such a simple task, but one would be mistaken. Think about it:

  • Everything is synchronous and single-threaded in this script (you have to wait for each Python call to finish before moving on to the next one)
  • We keep moving back and forth, between Python and the underlying C++ wrapper.

How can we avoid all those pitfalls?

The solution lies in the queue system of TF. You can think of it as designing your input pipeline beforehand, right inside the graph, and no longer doing everything in Python! In fact, we will try to remove any Python dependency we have from the input pipeline.

This will also give us the nice properties of multi-threading, asynchronicity and memory optimisation thanks to the removal of the feed_dict system (which is very cool, because if you plan to train your model later on a distributed infrastructure, TF will shine right out of the box).

But first, let’s explore queues in TF with simple examples. Again, read the comments to follow my thoughts:
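As a minimal illustration (my own, with an arbitrary capacity of 3), here is a FIFOQueue driven by hand from the main thread; the last dequeue blocks forever because the queue is empty and nothing refills it:

```python
import tensorflow as tf

# A queue holding at most 3 scalar float values
q = tf.FIFOQueue(capacity=3, dtypes=tf.float32)

# One op pushing 3 values at once, one op pulling a single value
enqueue_op = q.enqueue_many(([0., 1., 2.],))
dequeue_op = q.dequeue()

with tf.Session() as sess:
    sess.run(enqueue_op)  # the queue now holds 3 elements

    for i in range(4):
        # The 4th call hangs forever: the queue is empty and nothing
        # else is enqueuing from another thread
        print(sess.run(dequeue_op))
```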

Hanging example of a flawed queue system

What happened here? Why are we hanging in the void like that?

Well, this is how TF is implemented: a dequeue operation makes the whole graph wait for more data if the queue is empty. This behaviour only shows up if you use the queue manually like that, and it is clearly cumbersome and pretty useless here, since we still have only one thread calling the enqueue and dequeue operations.

Note: to be asynchronous, the queue operations have to run in their own threads, not the main one. As my French grandma used to say: if many cooks have to share the same knife to make a meal, they won’t be any faster than a single cook…

To solve this, let me introduce the QueueRunner and the coordinator, whose sole purposes are to handle queues in their own threads and ensure synchronisation (starting, enqueuing, dequeuing, stopping, etc.).

The QueueRunner needs 2 things:

  • A queue
  • Some enqueue operations (you can have multiple enqueue operations for one queue)

The coordinator needs nothing: it is a handy high-level helper for handling queues under the “tf.train” namespace. You can create, as we did, a custom queue and add a QueueRunner to handle it; as long as you don’t forget to add the QueueRunner to the QUEUE_RUNNERS collection of TF, you can use this high-level API safely.

Let’s take the previous example and make the changes needed to handle the queue in its own thread:
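In sketch form, the fix looks like this (the capacity, batch size and number of iterations are my own choices and may differ from the exact example whose logs are discussed below):

```python
import tensorflow as tf

# Each call to x_input_data produces a small batch of 3 random values
x_input_data = tf.random_normal([3], mean=0, stddev=1)

q = tf.FIFOQueue(capacity=3, dtypes=tf.float32)
enqueue_op = q.enqueue_many(x_input_data)

# The QueueRunner will run enqueue_op in its own thread; registering it in
# the QUEUE_RUNNERS collection lets start_queue_runners find it later
qr = tf.train.QueueRunner(q, [enqueue_op] * 1)
tf.train.add_queue_runner(qr)

# The "training" side of the graph simply dequeues one value at a time
x = q.dequeue()
x = tf.Print(x, data=[q.size()], message="Nb elements left: ")
y = x + 1

with tf.Session() as sess:
    # The coordinator synchronises the main thread with the queue threads
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)

    for i in range(10):
        sess.run(y)

    # Ask the queue threads to stop and wait for them to actually do so
    coord.request_stop()
    coord.join(threads)
```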

Little thought exercise:

Before you look at the resulting log, can you figure out how many times tf.random_normal has been called?

Spoiler: here is a dump of the result:

Logs of the queue exercise

As you can see, x_input_data has been called only 3 times. Each time we try to push more elements than the queue’s capacity, the extra elements are not trashed as one might expect: they wait for their turn too.

So the queue only needs to be refilled around the 4th and 10th calls, when only 2 elements remain in it (we are asynchronous now, so the order of the print statements can be a little messed up).

Note: I won’t dig further into queues and their ecosystem; it’s a pretty wide and cool area and you should definitely get more familiar with it. You can check the list of references I’ve added at the end of this article.

Thanks to all this new knowledge, we can finally update our first script so it handles input data with our new queue system, and see whether we get any improvement!
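A sketch of the queue-based version of the first script could look like this (the py_func-based “filesystem simulation”, the number of enqueue threads and the layer sizes are assumptions of mine, not necessarily the exact code behind the measurements below):

```python
import time

import numpy as np
import tensorflow as tf

batch_size, input_dim, nb_batches = 128, 10, 1000

def read_fake_data():
    # Simulated filesystem read: the sleep stands in for I/O latency
    time.sleep(0.01)
    return np.random.randn(batch_size, input_dim).astype(np.float32)

with tf.device("/cpu:0"):
    # The input pipeline now lives inside the graph
    x_input = tf.py_func(read_fake_data, [], tf.float32)
    x_input.set_shape([batch_size, input_dim])

    q = tf.FIFOQueue(capacity=10, dtypes=tf.float32,
                     shapes=[[batch_size, input_dim]])
    enqueue_op = q.enqueue(x_input)
    # Two threads keep the queue filled while the GPU trains
    qr = tf.train.QueueRunner(q, [enqueue_op] * 2)
    tf.train.add_queue_runner(qr)

    x = q.dequeue()
    # y_true is computed right inside the graph, no slicing in Python
    y_true = tf.cast(tf.reduce_sum(x, axis=1, keep_dims=True) > 0, tf.float32)

# Same tiny model as in the first script
w1 = tf.Variable(tf.truncated_normal([input_dim, 64], stddev=0.1))
b1 = tf.Variable(tf.zeros([64]))
h = tf.nn.relu(tf.matmul(x, w1) + b1)
w2 = tf.Variable(tf.truncated_normal([64, 1], stddev=0.1))
b2 = tf.Variable(tf.zeros([1]))
y_pred = tf.matmul(h, w2) + b2

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_pred))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)

    start = time.time()
    for i in range(nb_batches):
        # No feed_dict: the next batch is already waiting in the queue
        _, l = sess.run([train_op, loss])
    print("Training time: %.2fs" % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```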

Training phase monitoring of the second example with logs and nvidia-smi

Final thoughts:

  • Outside of the queue system, we used the exact same code as before
  • y_true is computed right inside the graph; compare that to the usual situation where people have to slice their input data in Python to separate inputs and labels
  • No need for any feed_dict anymore and no more waste of memory
  • We are now using ~43% of our GPU, which is better than only 31%. This means our process was previously wasting at least ~11% of the GPU’s resources due to the lag in our input pipeline. In such a case, you can grow your batch_size; but be careful, batch_size impacts how you converge too.
  • It took ~11.5 seconds to train, which is a win of ~33% in terms of training duration. That’s cool!

TensorFlow best practice series

This article is part of a more complete series of articles about TensorFlow. I haven’t yet defined all the different subjects of this series, so if you want to see any area of TensorFlow explored, add a comment! So far, these are the subjects I want to explore (this list is subject to change and is in no particular order):

Note: TF is evolving fast right now; these articles are currently written for version 1.0.0.
