When you have a chore that takes a long time to complete then it's probably going to be faster if you can divide the amount of work.
For example, if you paint a room on your own, this job might take a couple of hours. But with a couple of friends, the same amount of work can be completed in an hour or less, depending on the number of friends that are helping and how hard everyone is working.
The same applies to programming, where these friends are called threads. When you want to work through a collection faster, a common solution is to divide the work among threads that run concurrently.
In the early days of .NET, spawning new threads was manual work and required some knowledge. If you take a look at the thread docs you can see that it takes some "orchestration code" to manage these threads.
Because we write the code on your own, there's also a probability that this code contains bugs. It even gets more complex when you spawn multiple threads to achieve the best performance.
Luckily, C# hides all of the implementation details for us with the Task Parallel Library (TPL) introduced in .NET Framework 4. It's also safe to say that the chance of bugs within this code is far less in comparison with a custom implementation.
Data parallelism refers to scenarios in which the same operation is performed concurrently (that is, in parallel) on elements in a source collection or array. In data parallel operations, the source collection is partitioned so that multiple threads can operate on different segments concurrently.
In this article, we'll take a look at the different ways to process a collection faster.
For our case, we had 60.000 items that had to be migrated from one system to another system. It took 30 minutes to process 1.000 items, which makes it 30 hours in total to process all of the items in the collection.
That's a lot of time, especially if you know that it isn't a difficult task to migrate an item. The simplified version of the initial code looked like this. A simple iteration over a list, and within the loop, the migration of an item where we:
- retrieve the details of the item
- migrate the item
- save the item into system B
When we take a closer look at where we lose time, it was clear that a lot of time was spent waiting.
Waiting while the item is retrieved and waiting on the item to be saved.
The migration in
MigrateToSystemB is fast and only took a couple of milliseconds.
Normally, I would suggest limiting the amount of I/O to make this process faster. This can be done when we retrieve and save the items in batch, instead of one-by-one. For this use case, doing that wasn't an option.
The only way to make this migration faster is to add parallelism to the code. Instead of migrating item by item in a sequential order, where the following item is migrated after the previous item is migrated, we want to migrate multiple items at the same time. Each migration will have its own thread and that makes it more efficient.
The easiest way to add parallelism to the loop is to use
Parallel.ForEach. Internally, the
Parallel.ForEach method divides the work into multiple tasks, one for each item in the collection.
The Parallel class provides library-based data parallel replacements for common operations such as for loops, for each loops, and execution of a set of statements.
To my surprise, the refactored code doesn't look much different from the initial implementation.
With a small change to wrap the iteration within the
Doing this results that we now process the same list concurrently.
Parallel.ForEach tries to use all of the available threads of the machine.
To lower the impact on the system we can use the
MaxDegreeOfParallelism option. This property limits the number of spawned concurrent tasks so we don't impact the other running processes of the application.
MaxDegreeOfParallelism option can be set to a static value, or we can use the
Environment.ProcessorCount property to make it dependant on the machine's resources. In the snippet below, we configure the
MaxDegreeOfParallelism option to use a maximum of 75% resources of the machine.
In our case, this "refactor" resulted that it now only takes 40 seconds to process 1.000 items. For the whole collection of 60.000 items, it takes 40 minutes. From 30 hours to 40 minutes with just a few lines of extra code! Because we're using the number of processors of the machine, it takes 20% longer on my machine compared to the server.
But it doesn't stop here.
Parallel solution works fine for my use case, .NET also has Parallel LINQ (PLINQ).
PLINQ brings parallelism to the well-known LINQ API.
This ensures that the code remains readable while writing more complex business logic, where you need to order, filter, or transform the data.
If you're already familiar with LINQ, I got good news for you because you'll immediately feel right at home.
A PLINQ query in many ways resembles a non-parallel LINQ to Objects query. PLINQ queries, just like sequential LINQ queries, operate on any in-memory IEnumerable or IEnumerable<T> data source, and have deferred execution, which means they do not begin executing until the query is enumerated. The primary difference is that PLINQ attempts to make full use of all the processors on the system. It does this by partitioning the data source into segments, and then executing the query on each segment on separate worker threads in parallel on multiple processors. In many cases, parallel execution means that the query runs significantly faster.
To process the collection in a parallel manner, we can use the
AsParallel() extension method followed by any of the existing LINQ extension methods. The
Parallel example rewritten with the
PLINQ syntax looks like this, where we use the
ForAll extension method to iterate over the items.
This can also be rewritten by using more of the LINQ syntax. For this example, doing this doesn't add much value but it's just to give you the idea. Each item is handled on a separate thread, sequent tasks for the same item are handled on the same thread.
Note that we can set the degree of parallelism with the
WithDegreeOfParallelism extension method.
The performance benefits between the
Parallel solution and this
PLINQ solution were the same.
While performance isn't a factor (in most cases) to choose between these two solutions, there are subtle differences between the
PLINQ and the
Parallel methods. Both solutions have a right to exist and provide a solution to different problems.
The distinct advantages of both are well-explained in "When To Use Parallel.ForEach and When to Use PLINQ".
The main differences that I will remember are:
Parallelyou set the maximum degree, which means that it's impacted based on the available resources
PLINQyou set the degree, meaning that that's the actual number of threads that are used
- the order in which the tasks are invoked within a
Paralleliteration is random. In other words, use
Parallelto execute independent tasks
- if the order is important, use
PLINQbecause the order is preserved
Paralleldoesn't return a result. The output of
ParallelLoopResult, which contains the completion information of the collection (for example if all tasks are completed) but nothing more.
- when you need a return value of the processed stream use
PLINQ. Because the tasks do run concurrently, we need a way to merge the results of all the tasks to one result object. To specify how the result of each task must be merged back to the output result, use the merge options.
Parallelprovides a way to exit early with
ParallelLoopState.Break(). Both prevent more iterations from starting but have the difference that
Stop, stops the loop immediately while
Breakstill runs previous iterations.
- to stop a
PLINQiteration, a CancellationToken is used but this doesn't guarantee that the following iterations are not started.
PLINQ methods, there's a third library called Dataflow (Task Parallel Library). By solving my performance issue, this was the first time I encountered
Dataflow could be an article on its own and my knowledge of it is very minimal. Based on what I've read these past days, I see
Dataflow as a library to build (the blocks) of a processing pipeline.
The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications. These dataflow components are collectively referred to as the TPL Dataflow Library. This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks. The dataflow components build on the types and scheduling infrastructure of the TPL and integrate with the C#, Visual Basic, and F# language support for asynchronous programming.
To be honest, I over-simplified the initial code snippet, the real use case is using the
async/await syntax to get and save an item.
Sadly, this syntax doesn't play nice with the
Parallel nor the
PLINQ API, but it can be with
After doing some research I stumbled upon the post "Parallel Foreach async in C#", in which the author iterates over an implementation of an async variant of the
The last implementation in the post uses the latest features of C#, containing the
Dataflow library to obtain the best result.
Because I like the implementation, and it still remains readable I shamelessly copy-pasted the method into our codebase.
Dataflow variant, I was able to cut another 3 minutes off of the time to process the whole collection.
Update: .NET 6 introduces a new Parallel.ForEachAsync method, meaning that you don't need to write your own implementation of the async variant.
The use case that inspired this post was one of the few times where I actually needed to use parallel programming. It's also the first time that it has such a big impact.
I like how easy .NET makes it to rewrite code that runs sequentially into code that runs in parallel. Because of it, we can focus on delivering business value without making the code difficult to write and read.
The learning curve isn't steep and it grows with the complexity of the use case:
- use the
Parallel.ForEachmethod for the simplest use case, where you just need to perform an action for each item in the collection
- use the
PLINQmethods when you need to do more, e.g. query the collection or to stream the data
- use the
DataFlowmethods for when you want complete control over the processing pipeline
Does this mean that I'll sprinkle parallel programming all over in the codebase? No, that's not what I'm recommending because it might even have a negative result, as it comes with its own potential pitfalls.
I wrote this post to get more familiar with parallel programming and to use it as a reference point in the future. I only scratched the surface, if you're also interested (like me) in learning more about this topic I can highly recommend these resources, which I used to gain insights during the process of writing this post.
- Parallelism vs. Concurrency
- Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
- ParallelOptions.MaxDegreeOfParallelism vs PLINQ’s WithDegreeOfParallelism
- Potential pitfalls with PLINQ
- Exiting from Parallel Loops Early
- Practical Parallelization in C# with MapReduce, ProducerConsumer and ActorModel
- Processing Pipelines Series - Concepts
- Data Processing Pipelines with TPL Dataflow in C# .NET Core
- 15 Years of Concurrency
I appreciate it if you would support me if have you enjoyed this post and found it useful, thank you in advance.