Trivial Solution To CPU-Bound Thread Slowdown On Multicore Systems

According to David Beazley, the CPython GIL doesn't just prevent CPU-bound applications from taking advantage of multiple cores. He says it also slows them down on multicore systems.

As an example, he shows that the following CPU-bound function runs slower on multicore systems when run in two threads than when run twice sequentially in the same thread:
def counts(n=10000000):
    while n > 0:
        n -= 1

I tested the claim by running this little program:
from time import time
from threading import Thread

def counts(n=10000000):
    while n > 0:
        n -= 1

def measure(f, comment):
    t = time()
    f()
    print comment + ":", time() - t, "seconds"

def sequ():
    counts()
    counts()

def para():
    t1 = Thread(target=counts, args=())
    t2 = Thread(target=counts, args=())
    t1.start(); t2.start()
    t1.join(); t2.join()

if __name__ == "__main__":
    measure(sequ, "Sequential Execution")
    measure(para, "Parallel Execution")

Here's the result of running the script on a Core 2 Duo laptop:
Sequential Execution: 4.81399989128 seconds
Parallel Execution: 11.2730000019 seconds

He's sort of right in this case. So why has the GIL survived? Partly because the problem is easily solved.

One way to do that is to use the sys.setcheckinterval function, which allows CPU-bound threads to do a little more work before giving up the GIL. Since the overhead of GIL passing in the example above is more than 100%, we need to increase the check interval by a factor of a hundred or so. (The default is 100.)

Just add the following after "if __name__ == '__main__'":
import sys
sys.setcheckinterval(10000)
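A side note for anyone trying this on a newer interpreter: sys.setcheckinterval exists only on Python 2-era CPython (it was deprecated in 3.2 and removed in 3.9, where thread switching is time-based instead). Here's a hedged sketch of the tweak with a guard so it loads anywhere:

```python
import sys

# sys.setcheckinterval() exists on Python 2 (it was deprecated in 3.2
# and removed in 3.9), so guard the call: on modern interpreters the
# tweak simply does not apply.
if hasattr(sys, "setcheckinterval"):
    sys.setcheckinterval(10000)  # switch threads every 10,000 bytecodes
    assert sys.getcheckinterval() == 10000
```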

Let's run it again and see what happens:
Sequential Execution: 4.63999986649 seconds
Parallel Execution: 4.63199996948 seconds

Problem solved!

Avoiding Pre-emption Altogether

Personally, I like to include "sys.setcheckinterval(sys.maxint)" in all my scripts, so I don't have to lock shared data structures when manipulating them in Python code. If you do that, your threads will never be pre-empted as long as you don't call any blocking functions that release the GIL, and you can avoid the overhead of fine-grained locking, potential deadlocks, etc. What do you think?
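Here's a hypothetical sketch of that pattern (the names are invented, and this is an illustration, not the author's script). The no-lock guarantee only holds once the check interval has actually been raised so that CPU-bound threads are never pre-empted; without that, the read-modify-write below can be split by a thread switch and lose updates:

```python
from threading import Thread

# Hypothetical illustration of the no-lock pattern described above.
# Under the author's scheme (check interval raised to sys.maxint on
# Python 2), each thread's read-modify-write on the shared dict runs
# to completion before any switch, so no Lock is needed. On an
# interpreter that does pre-empt threads, updates here can be lost.
shared = {"hits": 0}

def worker(n):
    for _ in range(n):
        shared["hits"] += 1   # no Lock around the shared dict

threads = [Thread(target=worker, args=(100000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["hits"])   # 200000 only if no thread was pre-empted mid-update
```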

10 comments :: Trivial Solution To CPU-Bound Thread Slowdown On Multicore Systems

  1. Forgive me if I'm wrong, but I'll mirror my reddit comment (since I don't want this to be google cached and lead some folks astray):

    Setting the check interval so high just means that they're effectively running sequentially... it's just running Thread1 until it's done and then Thread2 until it's done and voila... sequential time.

    With parallel execution you should get more than 0.008 seconds faster (which I would call a fluke).

  2. The above comment is correct. Also, time.time() measures wall-clock time, so your benchmark times will be incorrect. You really want to measure CPU time, using time.clock().

  3. @TheMoken: 10000 ticks is as high as you think!

    In high performance mode, my core 2 duo runs 20 million iterations of the loop above in 2 seconds. That's 10 million iterations per second.

    Each iteration involves at least 3 bytecode instructions: decrement, compare, and jump. So that's at least 30 million python bytecode instructions per second.

    If the bytecode interpreter switches threads every 100 instructions, that means it's switching threads 300,000 times per second. That's ridiculous; no wonder it's more than twice as slow!

    Switching threads every 10000 instructions means switching threads 3000 times per second. I don't see how that's not enough to preserve the illusion of concurrent execution.

    I understand that for most real-world applications, each bytecode instruction is going to do much more than decrementing an integer counter, but then the slowdown on multicore systems will be correspondingly lower for those applications, too.

    If my deductions above are correct, then I think we can boldly say the GIL is not worse than what we thought it was, as long as the checkinterval is set appropriately.

    I think the default of switching every 100 bytecode instructions is extremely unrealistic in a world where systems have multiple cores and run millions of bytecode instructions per second.
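    The commenter's arithmetic can be replayed in a few lines (taking the 20-million-iterations-in-2-seconds measurement as given):

```python
# Commenter's measurement: 20 million loop iterations in 2 seconds.
iterations_per_sec = 20000000 / 2.0    # 10 million iterations/s
bytecodes_per_iteration = 3            # decrement, compare, jump
bytecodes_per_sec = iterations_per_sec * bytecodes_per_iteration

# Thread switches per second at each check interval.
switches_at_default = bytecodes_per_sec / 100      # default interval
switches_at_10000 = bytecodes_per_sec / 10000      # raised interval

print(int(switches_at_default))  # 300000 switches/s at the default of 100
print(int(switches_at_10000))    # 3000 switches/s at 10,000
```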

  4. Correction:
    10,000 ticks is *NOT* as high as you think.

  5. It doesn't matter what you set it to; you're just closing in on sequential time by running the threads almost sequentially. Python's thread problems don't just go away because you start running larger blocks of each thread at a time. All that does is prevent the totally pointless threading overhead from choking your program. Basically, the threads are functioning in lockstep rather than in parallel, in addition to having the overhead of acquiring the GIL.

    If the system was properly threading, it would be a no-brainer to match sequential time but on your dual-core core2 the runtime should've dropped significantly because the native OS threads would've been able to take advantage of the multi-processor. That is... if Python's threading weren't fundamentally broken.

  6. Yes, CPython's threading is 'broken', but it's not as broken as the presentation suggested. It doesn't run more slowly on multi-core systems if configured appropriately.

  7. Seun, setting the interval to 10,000 (3,000 switches per second) will have the effect of limiting I/O operations per second, no?

