Friday, 16 January 2015

Low nproc limit prevents sudo'ing to another user

A while ago I posted a solution to OutOfMemoryError: unable to create new native thread. Since then I have run into another problem caused by the nproc setting, and this one wasn't fixable with a simple ulimit command: we couldn't sudo into our service account while all of our application servers were running.

Our sysadmins have set up a user escalation script, and there's nothing wrong with it. It does some prechecks, sudos a second script under the requested user, does some logging, then does an exec <configured shell>. When there are too many processes, it manages to run the second script as the user, but the exec command blocks, so it never reaches the profile script that runs the ulimit command.
I traced it to the default soft limit on the number of processes, which is 1024 for all users.
Our process count is way above that, so when the sudo script tries to create a new process for bash, it can’t, and we are unable to sign into the account.
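
To confirm which limit an already running process is actually under, something like the following can help. It is only a sketch: it assumes a Linux system with /proc mounted, and the function name proc_nproc_limits is made up for illustration.

```shell
# Hypothetical helper: print the soft and hard "Max processes" limits
# of a running process, as recorded by the kernel in /proc/<pid>/limits.
proc_nproc_limits() {
    # Default to the current shell's PID when no argument is given.
    pid="${1:-$$}"
    # The line looks like: "Max processes  <soft>  <hard>  processes"
    awk '/^Max processes/ { print "soft=" $3, "hard=" $4 }' "/proc/$pid/limits"
}

# Show the limits of the current shell.
proc_nproc_limits
```

Passing the PID of a process started by the escalation script instead of the default shows whether it is still sitting at the 1024 soft limit.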

As the profile scripts never get run, this one can't be solved at the user level.
I started searching around. In /etc/security/limits.conf we had:
*       -         nproc     31768
I found that if I added the user explicitly to that file, it worked, e.g.:
serviceaccount     -      nproc     4096

This Server Fault/Stack Exchange discussion highlights the problem. This Red Hat bug report from 2008 shows that they requested the file /etc/security/limits.d/90-nproc.conf be added with this setup:
*        soft      nproc     1024
root     soft      nproc     unlimited
That shows where the 1024 comes from, and why manually specifying the user in /etc/security/limits.conf worked (the more specific rule won).
The soft limit is 1024 for all non-root users but their hard limit is 31768, which means a process is initially limited to 1024 until a shell raises its own limit. Any process started without a raised limit is unable to create more processes, including the bash shell invoked by our sudo script.
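
For reference, this is roughly what "a shell raising its own limit" looks like. It is a minimal sketch assuming bash (the -u option of ulimit is not available in every shell), and raise_nproc_soft_limit is a hypothetical name rather than anything from our actual profile script.

```shell
# Hypothetical helper: raise the soft nproc limit of the current shell
# up to its hard limit. Must run in the current shell, not a subshell.
raise_nproc_soft_limit() {
    # Read the hard limit (the ceiling) for max user processes.
    nproc_hard="$(ulimit -Hu 2>/dev/null)" || return 1
    # Raising only the soft limit needs no extra privileges as long as
    # it stays at or below the hard limit.
    ulimit -Su "$nproc_hard"
}

echo "before: soft=$(ulimit -Su) hard=$(ulimit -Hu)"
raise_nproc_soft_limit
echo "after:  soft=$(ulimit -Su) hard=$(ulimit -Hu)"
```

The new limit only applies to this shell and the processes it starts afterwards, which is why it doesn't help when the shell itself can't be created.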

So for power users' systems you'd need to set nproc in /etc/security/limits.conf to at least 4096; Oracle apparently recommends this for Java. For a service account running multiple app servers, 8192 is probably safest.

In our profile script I output the ulimit value and the current process count so we know if we're getting close to the limit.
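
A rough sketch of that kind of profile check is below. It is not our exact script: the function name nproc_report and the output format are made up for illustration, and it assumes a procps-style ps that supports -L and -U.

```shell
# Hypothetical profile-script helper: report the nproc limits and the
# number of tasks currently owned by the user.
# The body runs in a subshell ( ... ) so its variables don't leak.
nproc_report() (
    user="${1:-$(id -un)}"
    # Soft and hard limits of the current shell; "unknown" if this
    # shell's ulimit has no -u option.
    soft="$(ulimit -Su 2>/dev/null || echo unknown)"
    hard="$(ulimit -Hu 2>/dev/null || echo unknown)"
    # On Linux the nproc limit counts threads as well as processes,
    # so list one line per thread (-L) for the user (-U).
    used="$(ps -L -U "$user" -o pid= 2>/dev/null | wc -l | tr -d ' ')"
    echo "nproc soft=$soft hard=$hard in_use=$used"
)

nproc_report
```

Comparing in_use against soft at login gives an early warning before the account becomes unreachable.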