How to stop the OOM killer from killing your Ceph OSDs
Introduction
The Out of Memory Killer, or OOM Killer, is a mechanism that the Linux kernel employs when the system is critically low on memory, which typically happens because the kernel has overcommitted memory to its processes.
An Object Storage Daemon (OSD) stores data, handles data replication, recovery, and rebalancing, and provides some monitoring information to Ceph Monitors and Managers by checking other Ceph OSD Daemons for a heartbeat. OSDs usually require about 1 GB of RAM per 1 TB of storage.
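If you want to check which memory target your OSDs are currently configured with, you can query the centralized configuration database (available since Nautilus); osd.0 below is only an example ID:
ceph config get osd osd_memory_target
ceph config show osd.0 osd_memory_target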
It has been observed in different clusters that OSDs start very slowly, that OSD memory consumption grows to 8 GB or more (despite the default osd_memory_target of 4 GB), and that individual hosts consumed so much main memory that the Linux OOM Killer began terminating processes.
In summary, entire Ceph clusters could no longer be started successfully because all of their OSDs tried to consume as much main memory as possible until they were terminated by the Linux OOM killer.
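To confirm that it really is the OOM killer terminating the OSDs, check the kernel log on an affected host; these are standard Linux commands, not Ceph-specific ones:
dmesg -T | grep -i -E 'out of memory|killed process'
journalctl -k | grep -i oom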
Root cause
The PG log contains duplicate (dup) entries with a version higher than the tail of the log. When that is the case, nothing further is trimmed from the dup list, yet new duplicates keep being added to it: every time pg_log entries are trimmed, they are appended to the end of the list of duplicates, so the list grows without bound.
You can track the issue at https://tracker.ceph.com/issues/53729#note-57
Mitigation
ceph-objectstore-tool is a utility for examining and modifying the state of an OSD. It will be our main tool for finding and eliminating the duplicate entries in the placement group (PG) logs and, ultimately, for recovering the health of the affected OSDs.
An updated ceph-objectstore-tool, or at least one built with the patches in this pull request, is required: https://github.com/ceph/ceph/pull/45529
We are going to follow the solution proposed by Cloud System Solutions (https://www.clyso.com/blog/osds-with-unlimited-ram-growth/), which we have also used on a couple of occasions to solve this problem for our clients.
With Octopus as of version 15.2.17 and with Quincy as of version 17.2.4, you can use the stock ceph-objectstore-tool, but for Pacific you definitely need to build one with the patch applied.
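To check which release your daemons are actually running, you can query the cluster; and, assuming the help text of your ceph-objectstore-tool build lists the supported operations (recent builds do), you can also check whether it already knows about trim-pg-log-dups:
ceph versions
ceph-objectstore-tool --help | grep -i trim-pg-log-dups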
Mitigation process
It is necessary to identify the affected OSDs: those that fail to start because they are killed by the OOM killer, or that take a very long time to start.
Set the noout flag before stopping the OSD:
ceph osd set noout
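You can verify that the flag is now set and get an overview of which OSDs are currently down (ceph osd tree accepts a state filter in recent releases):
ceph osd dump | grep flags
ceph osd tree down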
Stop the faulty OSD:
systemctl stop ceph-osd@{OSD-ID}.service
Activate the OSD with ceph-volume, which discovers it via LVM and mounts it at its expected location:
ceph-volume lvm activate {OSD-ID} {FSID} --no-systemd
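The {FSID} here is the OSD's own fsid, not the cluster fsid. If you do not know it, you can read it from the LVM tags that ceph-volume maintains; look for the "osd fsid" field of the OSD in question:
ceph-volume lvm list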
Get the list of placement groups from the faulty OSD:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{OSD-ID} --op list-pgs > osd.{OSD-ID}.pgs.txt
Count the regular log entries and the duplicate entries for each placement group in the list:
while read pg; do echo $pg; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{OSD-ID} --op log --pgid $pg > pglog.json; jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json; done < osd.{OSD-ID}.pgs.txt 2>&1 | tee dups.log
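The loop prints three lines per PG: the PG id, the number of regular log entries, and the number of dup entries. Healthy PGs usually show dups in the low thousands (osd_pg_log_dups_tracked defaults to 3000), while affected PGs can show millions. The output below is purely illustrative, for a hypothetical PG:
2.7f
3201
13060280
Assuming the output keeps that three-lines-per-PG pattern, a quick way to rank the worst offenders is:
paste - - - < dups.log | sort -k3 -n | tail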
Bring the PG log back within the designated size limits by doing an offline trim:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{OSD-ID} --op trim-pg-log-dups --pgid {PG-ID} --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
Note that osd_pg_log_trim_max=500000 is already quite high; increasing it further would not speed up the trim. It is recommended to start with a smaller value (100000, for example) and increase it depending on how the trim behaves.
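If many PGs on the OSD are affected, the trim can be run in a loop over the PG list collected earlier. This is only a sketch, using the same parameters as above but the more conservative trim value of 100000:
while read pg; do ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{OSD-ID} --op trim-pg-log-dups --pgid $pg --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=100000; done < osd.{OSD-ID}.pgs.txt
Once the trimming has finished, start the OSD again and remove the noout flag that was set at the beginning:
systemctl start ceph-osd@{OSD-ID}.service
ceph osd unset noout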
Conclusion
According to several reports in the Ceph community, manual or automatic (via the autoscaler) changes to the number of placement groups can unintentionally trigger the accumulation of duplicate entries in PG logs, which can make OSDs unresponsive and cause them to consume ever more main memory until they are killed by the OOM killer.
Identifying this problem early and fixing it with ceph-objectstore-tool is vital to avoid further problems and downtime in your Ceph clusters.