Hastening Linux process cleanup with process_mrelease

5 hours ago by calpaterson

> That work did not address one other unfortunate characteristic of the OOM killer, though: its opinion of what is the least important process on the system tends to differ from that of the system's users.

My experience of the Linux OOM killer is not that its opinion differs from mine but that it has no opinion at all for a long, long time after the system is in deep trouble. The OOM killer simply does not act quickly enough to save systems. Sadly it's not customisable, but 'earlyoom' (packaged for Debian and probably everything else) is. I turned it on while debugging a badly behaved bit of software that went into a memory allocation loop, and have just left it on. It's saved me a few times and I now plan to leave it on forever.

Looks like oomd is an idea along the same lines but with slightly different goals. It's not in my distro so not an easy option for me.

5 hours ago by freedomben

This is my experience also. By the time the OOM killer kicks in, the system has been locked for 15 to 20 minutes already. If it was production you've already terminated the instance. If it's your laptop or desktop, you've already held the power button.

Fedora has earlyoom enabled by default but so far it hasn't saved me. I really need to look into configuring it. How did you get started? Man pages? Blog post?

3 hours ago by calpaterson

The manpage combined with some (forced) experimentation with the wonky code mentioned was enough for me. I run with this config as even the earlyoom defaults were not strict enough:

-r 30 -m 5 -s 80

I run with 16 GB of memory and do use swap. In practice, if swap is growing at all once memory is near full, I'm in trouble and action needs to be taken.
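
For readers unfamiliar with those flags: going by the earlyoom manpage, `-m` and `-s` set the minimum percentage of available memory and of free swap, earlyoom acts only once both have dropped to or below their thresholds, and `-r` is just the interval in seconds between status reports. The check can be sketched roughly as below. This is an illustration in Python, not earlyoom's actual code; `parse_meminfo` and `below_thresholds` are hypothetical names, and the handling of a swapless system is an assumption.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def below_thresholds(meminfo, mem_min_percent=5.0, swap_min_percent=80.0):
    """Return True when available memory AND free swap are both at or below
    their minimum percentages (defaults mirror `-m 5 -s 80`)."""
    mem_percent = 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
    swap_total = meminfo.get("SwapTotal", 0)
    # Assumption: with no swap configured, count free swap as 0% so that
    # the memory threshold alone decides.
    if swap_total:
        swap_percent = 100.0 * meminfo.get("SwapFree", 0) / swap_total
    else:
        swap_percent = 0.0
    return mem_percent <= mem_min_percent and swap_percent <= swap_min_percent
```

Feeding it the contents of /proc/meminfo shows why `-s 80` is strict: the swap condition is already met once more than 20% of swap is in use, so in practice the memory threshold becomes the trigger.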

4 hours ago by cmurf

earlyoom is enabled on Fedora 32 and 33 Workstation edition, and on the Fedora 33 KDE spin. On Fedora 34, all editions and spins have systemd-oomd enabled. It does take some initial configuration of systemd service units, since oomd works by cgroups v2 based accounting and kills entire cgroups, not PIDs. This is still a work in progress, with uresourced setting up the initial resource allocations (with planned obsolescence). It should be safe to run uresourced on any edition or spin, but right now it's only enabled by default on Workstation edition.

If the results you're getting aren't expected, there's still some chance it's a bug somewhere, so you should report it against systemd, attach `journalctl -b -o short-monotonic --no-hostname` or at least ~10 minutes of logging prior to the unexpected behavior you're reporting.

4 hours ago by GranPC

I used to experience this a long time ago, and I know many people who still do - but on my system (running an Ubuntu 18.04-derived distribution) the OOM killer takes at most 3-4 seconds to step in and kill whichever process is consuming the most memory. Does anyone know if Ubuntu/Debian tunes the OOM killer differently to try and stop this from happening?

3 hours ago by jeffbee

The reason Linux systems lock up like that is that the kernel will let a process fill all memory up with dirty pages that need writeback, then as soon as it needs some memory the first thing it does is drop all of the in-memory copies of file-backed pages, which includes all of your programs. Then whenever one of your programs wants to run, or continue running by branching to a far address, it has to swap in that code from disk again. Even though you thought your system does not "have swap", it does have swap in effect. The workaround for this is to copy important programs into memory and pin them there with mlock. It is particularly important that if you rely on a userspace OOM killer it gets locked into memory.
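
A minimal sketch of the pinning step, assuming a Linux system where `mlockall(2)` is reachable through the C library via `ctypes`. The function name `lock_process_memory` is hypothetical, and the flag values are the ones used on x86 and ARM; a few other architectures define them differently. A real userspace OOM killer would make the equivalent call from its own language at startup.

```python
import ctypes
import os

# Values from <sys/mman.h> on Linux x86 and ARM (assumption: some other
# architectures use different constants).
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_process_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Ask the kernel to keep this process's pages resident.

    Returns True on success and False if the kernel refuses, for example
    because RLIMIT_MEMLOCK is too low and the process lacks CAP_IPC_LOCK.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(ctypes.c_int(flags)) != 0:
        print("mlockall failed:", os.strerror(ctypes.get_errno()))
        return False
    return True
```

Calling `lock_process_memory()` early in startup covers the program text and libraries as well as the heap, which is the point here: the watchdog must not need to page its own code back in from disk at the moment memory runs out.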

25 minutes ago by scottlamb

> The workaround for this is to copy important programs into memory and pin them there with mlock.

Nit: I don't think it's necessary to copy the program into anonymous memory to use mlock. You might be thinking of huge pages (transparent or otherwise), which are supported on anonymous pages and unfortunately aren't yet supported on ext4- or btrfs-backed file pages.

an hour ago by phsau

I've also found that under these conditions kswapd will effectively consume all your CPU time. The time it spends running is probably proportional to your maximum memory too - in our case it parses through 500+GB of LRU. The blocking writeback behaviour can be managed effectively with dirty page writeback ratio tuning. You don't want to block trying to write 50GB to disk at once when you hit the high dirty page watermark.
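
To put rough numbers on that: the ratio settings most likely meant here are the `vm.dirty_background_ratio` and `vm.dirty_ratio` sysctls, which are percentages, so the amount of dirty data allowed to pile up grows with the size of the machine. The sketch below is only arithmetic under simplifying assumptions. The kernel applies the ratio to its own estimate of dirtyable memory rather than to total RAM, the 10% figure is illustrative, and `dirty_limit_bytes` and `read_vm_setting` are hypothetical helper names.

```python
from pathlib import Path

GIB = 2**30


def dirty_limit_bytes(dirtyable_bytes, ratio_percent):
    """Approximate bytes of dirty pages allowed before the ratio is hit."""
    return dirtyable_bytes * ratio_percent // 100


def read_vm_setting(name, sysctl_root="/proc/sys/vm"):
    """Read an integer sysctl such as dirty_ratio or dirty_background_ratio."""
    return int(Path(sysctl_root, name).read_text().strip())
```

With 500 GiB of memory, `dirty_limit_bytes(500 * GIB, 10)` comes to 50 GiB, while a ratio of 1 still allows 5 GiB. On machines this large the absolute `vm.dirty_bytes` and `vm.dirty_background_bytes` settings may be easier to reason about than percentages.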

an hour ago by smcameron

So, if a process is in an uninterruptible sleep, because say, it's doing i/o, say reading from disk, and furthermore, say it's using O_DIRECT, so the storage stack is going to set things up to bypass the buffer cache and DMA directly into the process's memory, and then you rip away the process's memory and give it to another process... and then the DMA completes, and kaboom? The DMA just clobbered an unsuspecting process's memory?

That is, it was my understanding that the reason a process is in an uninterruptible sleep is generally because it's waiting for a DMA to complete, and if you were permitted to interrupt it, the DMA would eventually complete and clobber who knows what. Ripping the memory away from such a process would, at first glance, seem to encounter the same problem -- how do you stop the pending DMA (which might already be in progress, but which might also take a while to complete)? Whatever the method, it would be device dependent, which makes it impractical (who's going to retrofit all the device drivers with DMA stopping APIs? I'm sure there are many devices that have no way to stop pending DMAs, and anyway, maybe the DMA is already in progress, only half completed). Maybe do it at the PCI level and unmap the DMA buffers. But many current drivers will generally assume that they never give their devices bad bus addresses, yet now the device is attempting DMA to a bad bus address (i.e. a suddenly unmapped bus address).

Well, I've been out of the Linux driver game for a while now, so perhaps I'm missing or forgetting something. Ripping memory out from under a process with pending DMA sounds pretty sketchy to me though.

But, if dumb old me can think of this, of course the kernel developers can also, and undoubtedly did. Wonder how it really works?

44 minutes ago by phendrenad2

Maybe usermode processes shouldn't have uninterruptible I/O access.

4 hours ago by rbanffy

A possible solution not to this, but for the OOM killer would be an “importance” attribute orthogonal to the task priority - it’s ok to kill the NTP server or to bounce the DNS proxy. It’s much less OK to kill Emacs or my desktop.

3 hours ago by jeffbee

That's what /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj are for.
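
A minimal sketch of using the newer of those two files (`oom_adj` is the older, deprecated interface). `oom_score_adj` accepts an integer from -1000 to 1000: higher values make the process a more likely victim, and -1000 exempts it from the OOM killer. The helper names and the `proc_root` parameter are hypothetical, the latter added only so the functions can be pointed at a test directory.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # never pick this process
OOM_SCORE_ADJ_MAX = 1000   # pick this process first


def get_oom_score_adj(pid, proc_root="/proc"):
    """Return the current OOM score adjustment for a process."""
    return int(Path(proc_root, str(pid), "oom_score_adj").read_text().strip())


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write a new OOM score adjustment for a process.

    Raising the value is allowed for your own processes; lowering it
    needs CAP_SYS_RESOURCE, so expect PermissionError otherwise.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{value}\n")
```

For the example above, an expendable DNS proxy could be given something like 500 and the desktop session a negative value, with the negative one set by a privileged launcher.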

6 hours ago by teddyh

This seems useful for systemd when stopping services.

Hastening Linux process cleanup with process_mrelease