18 February 2015

Reassigning (Update: and Pausing/Resuming!) Tasks

Recently I did something rude and hogged a compute server of ~96 cores. Unsurprisingly, I got an email from someone in the lab asking if I could do something about it. At first I assumed I’d have to kill the tasks that were running on the cores he needed, but then I realized Linux would probably just reassign other tasks into those newly-freed cores (a feature in most cases, but in this one instance a bit of a nuisance).

For whatever reason, I didn’t immediately remember that you could set affinity for a task. Great, right? Yes!

Kinda.

The problem is that if you’ve already started the threads, and in my case if you have 200+ tasks queued and running, reassigning core affinity manually is ridiculous. A simple bash function will help though:

The way this works is pretty simple: it creates a list of all the tasks you’re running and returns just the pid’s:

ps -u $(whoami) -o pid | tail -n +2

and in a for loop calls taskset and reassigns the core affinity on that n (which refers to each line in the for loop)

taskset -pc $1 $n;

taskset is kind of cool because it lets you reassign core affinity according to a pretty open-ended syntax.

What you get is a pretty neat allocation (in this case I used 15-96 as my range):

jerk move

Edit (March 4): Later I found that I would want to pause and continue my tasks occasionally (in this case, our disk I/O was severely limited for some unknown reason, but 80 separate processes hitting the disk weren’t helping). You can easily do that with the following run-on sentence one-liner:

ps -u $(whoami) | awk '{ print $1 }' | xargs kill -STOP

will pause all of your processes. If you’re thinking ahead, you might realize that your ssh session involves a process or two, so you might want to filter for just your python scripts, for example:

ps -u $(whoami) | grep python2.7 | awk '{ print $1 }' | xargs kill -STOP

This gets all of your processes, filters for just python2.7 processes (e.g. doesn’t grab the iPython instance you have running :), and throws the first cluster of non-whitespace text into kill -STOP.

I usually find that once I’ve done that, looking at htop [-u $(whoami)] makes it easy to monitor tasks as I reenable them by using

kill -CONT pid

Hope that helps people

If you have something to say about this post, email me or tweet at me.