High speed applications – parallelism in .NET part 2

Welcome back!

This is the second part in my series of posts about parallelism in .NET. The first post in the series can be found here.

Writing the last post I forgot to mention that the System.Threading.Tasks namespace is part of the Task Parallel Library (TPL) which was introduced in .NET 4.0. Using the TPL is the preferred way to design and develop new parallel applications.

I also got feedback that I should mention that, as a rule of thumb, using the System.Threading.Thread class directly like we did in the last post should in most cases be avoided. Some exceptions to this might be:

  • Manipulating thread priority.
  • Having more than one UI (or STA, or foreground) thread handling different user interfaces.
  • Interoperating with legacy or unmanaged code.
  • Writing your own thread pool, or a custom task scheduler for the TPL.

The .NET framework thread pool (the managed thread pool)

Introduction

Today we’ll cover the .NET framework thread pool and how to use it through the System.Threading.ThreadPool static class. It has been around since .NET 1.1. Every .NET application has one thread pool, and it is started when the application starts. It is actually two pools in one: one for I/O (input/output) and one for work items. I will not cover using the I/O part of the thread pool directly all that much, for two reasons: 1. I don't know that much about it. 2. You will likely not have any direct use for it. What you do need to know is that it keeps track of I/O and can schedule delegates to be executed when an I/O event has occurred. An I/O event can be “connecting to a web service”, “opening a file” or “the requested data has returned from a database”.

The main reason for knowing how the thread pool works is that the TPL default task scheduler uses it for scheduling all tasks except long running tasks. Since you are reading this post, you are interested in designing and developing applications that execute in parallel. Knowing how the thread pool executes work items (=what a task is called when using the thread pool) is interesting, but more importantly, it can make you design your applications differently. When you see the benefits of the thread pool, you will be delighted to see how much easier the TPL makes this for us. We will focus on how the thread pool works, rather than how to use it directly.

The framework decides the number of threads in the pool by itself, and can increase or decrease that number to get optimal throughput, but we can change the upper and lower limits.

Using a thread pool

In part 1 we had it figured out: we are going to use the thread pool.

2 tasks in the thread pool

Let’s say that our system has 4 cores (C1-C4), and the thread pool has 4 threads, one for each processor core (it will most likely have more than 4 threads, but let's use that for this example). 4 threads in the pool means that we can run 4 tasks in parallel. For this example, let's also assume that each thread stays on the core where it was created and always executes there (in reality, the OS scheduler is free to move threads between cores unless we set processor affinity).

If task 1 takes 20 seconds to execute and task 2 takes 30 seconds, we are monopolizing 2 cores for 20 seconds and one of them for another 10. If the total of 50 seconds could be split up equally over all 4 cores, it would take just 12.5 seconds for both tasks to finish. If we split the tasks up into chunks small enough, we can also schedule more tasks to run simultaneously, without the overhead of (excessive) context switching. Let's have a look:

3 tasks on the thread pool

Now that’s more like it! The T1 blocks are task 1 split into multiple work items, T2 is task 2 and T3 is a third task executing on the thread pool threads. If the OS were running only our process, and everything in it was well behaved like the example above – we would have virtually no context switching.

When we schedule a work item to run, we queue it through ThreadPool.QueueUserWorkItem. Example:
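
Something along these lines – a minimal sketch, where the value count and console output are assumptions (chosen to match the “var queue = new ConcurrentQueue<i…” line and the thread/value output discussed below):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class Program
{
    static void Main()
    {
        // Fill a queue with the values the work items will process.
        var queue = new ConcurrentQueue<int>();
        for (int i = 0; i < 20; i++)
            queue.Enqueue(i);

        // Lets the main thread wait until all 20 work items have run.
        using (var allDone = new CountdownEvent(20))
        {
            for (int i = 0; i < 20; i++)
            {
                // Queue a work item on the thread pool.
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    int value;
                    if (queue.TryDequeue(out value))
                    {
                        Console.WriteLine("Thread {0} got value {1}",
                            Thread.CurrentThread.ManagedThreadId, value);
                    }
                    allDone.Signal();
                });
            }

            allDone.Wait();
        }
    }
}
```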

If the thread pool works like I’ve explained, how come the results look this weird?

Program result

The results can be explained, at least in part, by the following:

  • The thread pool has not yet created threads for all cores. If we put a breakpoint on the line starting with “var queue = new ConcurrentQueue<i…” and check the number of threads in the process, we see a value like 12 (in my case). After running the work items, just before the process terminates, it's 22. The threads available at startup will take most of the work items before all of the pool's threads have been created.
  • ConcurrentQueue<T> uses a SpinLock / SpinWait, which does not guarantee FIFO (first-in first-out) order among waiting threads. A thread waiting to get an item from a ConcurrentQueue<T> may be passed over time after time if the load is high. Still, Microsoft has determined that in most cases this is the most efficient concurrent queuing algorithm. More about locking, data consistency and concurrency in a later post in the series.
  • Also, the Console is not optimized for concurrency. If we were to get the thread number and value and post them to a queue instead, the result would be very different, but still not predictable.

Please remember that we cannot accurately predict the order in which these work items execute in parallel, even more so if we have a lot of different tasks executing on the thread pool. If we need the results of the items in the queue ordered, we must provide a way to restore the original order when all work items are done.
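
One way to do that – a minimal sketch, where the squaring job, input values and names are made up for illustration – is to tag each work item with its original index and have it write its result into the slot reserved for that index:

```csharp
using System;
using System.Threading;

class OrderedResults
{
    static void Main()
    {
        int[] input = { 5, 3, 8, 1, 9, 2 };
        int[] results = new int[input.Length];

        using (var allDone = new CountdownEvent(input.Length))
        {
            for (int i = 0; i < input.Length; i++)
            {
                int index = i; // capture a copy, not the loop variable

                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Whatever order the work items run in, each result
                    // lands in the slot matching its original position.
                    results[index] = input[index] * input[index];
                    allDone.Signal();
                });
            }

            allDone.Wait();
        }

        Console.WriteLine(string.Join(", ", results)); // 25, 9, 64, 1, 81, 4
    }
}
```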

It’s also a good idea to avoid queuing all work items at once. It's often better for implementation, performance and parallelism to initially queue only as many work items as should run concurrently, and have those work items queue further work items of their own – see the sketch below. Think about the queue at a hot dog stand: if all customers were already standing in line when you arrived, you would have to wait quite a while before getting your hot dog. As a side note, the thread pool has some implementation of its own to help reduce this, but it is still a good idea not to schedule too many work items at once.
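
Here is a sketch of that pattern, sticking with the hot dog stand (the customer count, serving time and names are made up):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class HotDogStand
{
    static readonly ConcurrentQueue<int> Customers = new ConcurrentQueue<int>();
    static CountdownEvent _allServed;

    static void Main()
    {
        for (int customer = 1; customer <= 100; customer++)
            Customers.Enqueue(customer);

        _allServed = new CountdownEvent(100);

        // Queue only as many work items as we want running concurrently,
        // instead of queuing all 100 up front.
        for (int i = 0; i < Environment.ProcessorCount; i++)
            ThreadPool.QueueUserWorkItem(ServeNext);

        _allServed.Wait();
        Console.WriteLine("Everyone got a hot dog.");
    }

    static void ServeNext(object state)
    {
        int customer;
        if (!Customers.TryDequeue(out customer))
            return; // no customers left – this work item chain ends here

        Thread.Sleep(10); // simulate serving one customer
        _allServed.Signal();

        // Each work item queues its own successor, so the thread pool's
        // queue stays short no matter how many customers there are.
        ThreadPool.QueueUserWorkItem(ServeNext);
    }
}
```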

I/O and the thread pool

Today’s applications are generally not only CPU (Central Processing Unit – another name for processor) intensive. They need to retrieve data, process it and send it along. Retrieving and sending data can happen through files, web services/web APIs, databases etc. This is I/O. I/O operations are most often the most time consuming operations, although not necessarily the most CPU intensive.

Have a look at this table of the main “tasks” (not TPL tasks) of a Web API method:

WebApi request table

CPU “tasks” use CPU / memory to complete, while I/O “tasks” primarily wait for I/O to complete.

WebApi pie chart

We are spending 67% of the total request time on I/O “tasks”, which is more or less just waiting for an external factor to complete. For a busy web server, we do not want to waste the resources of our work item thread pool on waiting. This is why it's great that the thread pool handles I/O in a separate pool. A single I/O thread pool thread can service lots of I/O events.

Implementing I/O with the thread pool is a post or two of its own, and we will not go into more detail about it here. Fortunately for us, Microsoft's developers have already implemented most I/O for us using the thread pool. For example, FileStream has two different implementations of asynchronous file operations, both using the thread pool, and the database classes SqlConnection / SqlCommand / SqlDataReader do as well. An asynchronous operation is an operation that is started and then notifies us / invokes a method when it finishes. The first implementation is the older, obsolete APM (Asynchronous Programming Model) implementation and the second is the TPL (Task Parallel Library) implementation. We should always use TPL when we can. These provide us with what we need to work efficiently with the thread pool. When the code in the example above connects to the database server/service, we can simply stop doing anything and let the thread pool handle other requests, and then resume execution when the connection attempt has finished – without having to write, let alone execute, any code for waiting.
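
As a sketch of what the TPL flavour looks like with FileStream (the file name and method name are made up; note the useAsync flag, which requests truly asynchronous I/O):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class AsyncFileRead
{
    static void Main()
    {
        ReadFirstBytesAsync("data.bin").Wait();
    }

    static async Task ReadFirstBytesAsync(string path)
    {
        var buffer = new byte[4096];

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, useAsync: true))
        {
            // The await releases the current thread instead of blocking it;
            // a thread pool thread resumes the method when the data arrives.
            int bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length);
            Console.WriteLine("Read {0} bytes without blocking a thread.", bytesRead);
        }
    }
}
```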

Serialization/deserialization is the process of converting a logical object structure, for example a model class or a collection, into a format that can be sent to another process, in another service, on another computer – and back. An example: when we return data in JSON format from our WebAPI, we serialize it from our object structure to JSON. The client then deserializes it to be able to use it. As a matter of fact, if you profile a WebAPI (tracing what a process does and how much time each step takes), you will see that a significant amount of time goes to serialization / deserialization.
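
A tiny sketch of that round trip, using Json.NET (the serializer Web API uses by default; the Order class and its values are made up):

```csharp
using System;
using Newtonsoft.Json;

public class Order
{
    public int Id { get; set; }
    public string Customer { get; set; }
}

class SerializationDemo
{
    static void Main()
    {
        var order = new Order { Id = 42, Customer = "Erik" };

        // Serialize: object structure -> JSON text that can leave the process.
        string json = JsonConvert.SerializeObject(order);
        Console.WriteLine(json); // {"Id":42,"Customer":"Erik"}

        // Deserialize: JSON text -> a usable object again on the other side.
        var roundTripped = JsonConvert.DeserializeObject<Order>(json);
        Console.WriteLine(roundTripped.Customer); // Erik
    }
}
```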

How much work do we put in a work item?

The boring answer to this is “it depends”, but let's work out some guidelines for work items. Half of the work we do when working with a thread pool is cooperating with the other parts of our application. Each thread pool work item should execute rather quickly – otherwise other work items may have to wait too long before starting. Throughput can be dramatically improved if work items that do very little before initiating an I/O request are handled quickly. Work items taking too long can cause the thread pool to instantiate a new thread to keep up the throughput, and as we know from part 1, we do not want the system to instantiate new threads unless it's absolutely necessary.

  • Every time we need to wait for I/O (or ideally, before waiting for anything), we should exit the current work item and have a new one start when the I/O request has finished.
  • If possible, split CPU/memory intensive work items that exceed 5 milliseconds. Perhaps think about how to have them execute in parallel :).
  • If possible, avoid creating too many CPU/memory intensive work items. There is a little overhead for every work item.

Profile and measure your applications to find out the work item duration that allows for the maximum throughput. Visual Studio 2015 has a built-in profiler that can be found under Debug->Profiler. Perhaps I can cover application measuring, profiling and optimization when this series is finished.

A few facts about the .NET thread pool
  • The thread pool is actually two thread pools, one for work items and one for I/O.
  • The threads are background threads, which means that they will not keep the application running like foreground threads do.
  • System.Threading.Timer and System.Timers.Timer both use the thread pool, but the Windows (UI) timers do not.
  • The framework decides the number of threads in the pool and will increase or decrease that number to get optimal throughput.
  • We can get and set the minimum and maximum thread counts in the thread pool with GetMinThreads/SetMinThreads and GetMaxThreads/SetMaxThreads – see the sketch after this list. Raising the minimum thread count can be useful if we know that the thread pool always has to grow for a while at startup before reaching the number of threads needed for optimal throughput. Remember – this affects all users of the thread pool, including the TPL (default task scheduler).
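
A minimal sketch of reading and adjusting the limits (the value 32 is just an example – profile before changing this):

```csharp
using System;
using System.Threading;

class PoolLimits
{
    static void Main()
    {
        int minWorker, minIo, maxWorker, maxIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        Console.WriteLine("Min: {0} worker / {1} I/O", minWorker, minIo);
        Console.WriteLine("Max: {0} worker / {1} I/O", maxWorker, maxIo);

        // If we know the pool always grows to around 32 worker threads
        // before reaching full throughput, we can skip the slow ramp-up.
        if (!ThreadPool.SetMinThreads(32, minIo))
            Console.WriteLine("The requested minimum was rejected.");
    }
}
```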

Global & local work item queues and work stealing

Originally, the thread pool had only a global queue for all work items. When the next thread pool thread became available, it would grab the next item from the global queue.
Since .NET 4.0, the thread pool has a global queue for all items queued from outside the thread pool AND a local queue for each thread pool thread, where the work items queued from that thread are placed. A local queue can be accessed more quickly, since it does not need the same locking mechanisms (more about locking later in the series). Data can also be cached by the processor more easily if the work items a thread queues are executed on that same thread (= the same core).

When a thread pool thread is ready to execute another work item, it first looks at its local queue; if that is empty, it looks at the global queue, and if that is also empty, it tries to “steal” work items from the other threads' local queues.

What’s next?

In the next post, we will start looking at Task Parallel Library (TPL). I will try to have the posts ready a little quicker. Maybe early / mid next week.

Cheers
Erik
