CRSCTL/SRVCTL throws CRS-5168 : unable to communicate with ohasd.

Recently we started getting errors when executing any srvctl or crsctl command, even though the database and ASM were running fine with no errors reported at the DB, ASM, or GI level. We mostly run 18c and 19c. Here are the errors you get.

srvctl config asm
PRCR-1070 : Failed to check if resource ora.asm is registered
CRS-5168 : unable to communicate with ohasd

crsctl stat res -t
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

Let’s verify the process status. All the processes are running. I looked for errors in the logs of all the services but did not find anything relevant.

ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

Let’s trace it using strace. The -f option follows child processes created by fork, and -o writes the output to a file.

strace -o abc.txt -f srvctl config asm

Important caveat: tracing and debugging tools such as strace, ptrace, and GDB carry risks, as Tanel Poder has shown: https://tanelpoder.com/2008/06/14/debugger-dangers/

Here is the important bit of output from strace.

27383 sendto(21, "\4", 1, MSG_NOSIGNAL, NULL, 0) = 1
27383 connect(25, {sa_family=AF_UNIX, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = -1 ENOENT (No such file or directory)
27383 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 26
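
An strace output file can run to thousands of lines, so it helps to filter for the failing call. The following is a minimal sketch, assuming the trace was written with "strace -o" as above; the function name failed_connects is hypothetical and not part of any Oracle or strace tooling.

```shell
# Print every connect() call that failed in an strace output file.
# failed_connects is a hypothetical helper name used only for illustration.
failed_connects() {
    # $1: path to the file written by "strace -o"
    # -o prints only the matching part, so this also works if several
    # syscalls ended up on the same line of the file.
    grep -oE 'connect\([^)]*\) = -1 [A-Z]+' "$1"
}
```

Running "failed_connects abc.txt" against the trace above would print the connect() call on /var/tmp/.oracle/sCRSD_UI_SOCKET together with its ENOENT error code.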

Strace reveals that the socket file “/var/tmp/.oracle/sCRSD_UI_SOCKET” is missing. Let’s verify at the OS level.

ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
ls: cannot access /var/tmp/.oracle/sCRSD_UI_SOCKET: No such file or directory
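
The same check can be scripted so that it distinguishes a real socket from a missing path or a regular file. This is only a sketch; check_socket is a hypothetical helper name, and the path passed to it is the one reported by strace above.

```shell
# Report whether a path exists and is a Unix domain socket.
# check_socket is a hypothetical helper name used only for illustration.
check_socket() {
    # $1: path to test; "[ -S ]" is true only for socket files
    if [ -S "$1" ]; then
        echo "OK: $1 is a socket"
    else
        echo "MISSING: $1 is absent or not a socket"
        return 1
    fi
}
```

For example, "check_socket /var/tmp/.oracle/sCRSD_UI_SOCKET" returns a non-zero status while the file is gone, which makes it easy to call from a monitoring script.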

CRS agents communicate with each other over socket files (used for IPC). It’s apparent that the socket file is being removed. On a few earlier occasions, before I understood the cause, I had to forcefully restart to get things working, but that is obviously not a solution.
However, since ohasd.bin runs with the reboot argument, killing the process should cause it to be restarted automatically, and the socket file should be recreated. So, let’s try to kill “ohasd.bin reboot”.

 ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

kill -9 4057
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 15 00:17 /var/tmp/.oracle/sCRSD_UI_SOCKET
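
Since the socket only reappears once ohasd.bin has come back up, a script that follows the kill with an immediate ls may look too early. Below is a small polling sketch under that assumption; wait_for_path is a hypothetical helper name, and the 60-second default timeout is an arbitrary choice, not a documented Oracle value.

```shell
# Wait until a path exists, giving up after a timeout.
# wait_for_path is a hypothetical helper name used only for illustration.
wait_for_path() {
    # $1: path to wait for
    # $2: timeout in seconds (defaults to 60, an arbitrary value)
    wfp_path=$1
    wfp_timeout=${2:-60}
    wfp_waited=0
    while [ ! -e "$wfp_path" ]; do
        if [ "$wfp_waited" -ge "$wfp_timeout" ]; then
            return 1    # timed out, path still missing
        fi
        sleep 1
        wfp_waited=$((wfp_waited + 1))
    done
    return 0            # path exists
}
```

Usage would look like "wait_for_path /var/tmp/.oracle/sCRSD_UI_SOCKET 120 && ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET".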

Now, the key question is why the socket file is deleted automatically. I had no answer to this, so I raised it with Oracle, and they quickly pointed to “Oracle Linux 7 and Redhat Linux 7: The socket files in /var/tmp/.oracle Location Get Deleted (Doc ID 2455193.1)“, which explains that the file is deleted by the systemd service systemd-tmpfiles-clean.service. The simple solution is to exclude the locations in /usr/lib/tmpfiles.d/tmp.conf, as shown in the last three lines below.

Content of "/usr/lib/tmpfiles.d/tmp.conf"
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

# See tmpfiles.d(5) for details

# Clear tmp directories separately, to make them easier to override
v /tmp 1777 root root 10d
v /var/tmp 1777 root root 30d

# Exclude namespace mountpoints created with PrivateTmp=yes
x /tmp/systemd-private-%b-*
X /tmp/systemd-private-%b-*/tmp
x /var/tmp/systemd-private-%b-*
X /var/tmp/systemd-private-%b-*/tmp
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*
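
On a fleet of servers it is handy to check which hosts still lack these three exclusion lines. The following is a minimal sketch; missing_exclusions is a hypothetical helper name, and it matches the lines exactly as written above, so a line with extra spacing would be reported as missing.

```shell
# Print the Oracle exclusion lines that are absent from a tmpfiles.d file.
# missing_exclusions is a hypothetical helper name used only for illustration.
missing_exclusions() {
    # $1: tmpfiles.d configuration file to inspect
    for me_pattern in '/tmp/.oracle*' '/var/tmp/.oracle*' '/usr/tmp/.oracle*'; do
        # -x matches the whole line, -F treats the pattern as a fixed string
        grep -qxF "x $me_pattern" "$1" || echo "x $me_pattern"
    done
}
```

Calling "missing_exclusions /usr/lib/tmpfiles.d/tmp.conf" prints nothing when all three lines are present. One assumption worth testing, which the Oracle note quoted above does not state: systemd-tmpfiles also reads files under /etc/tmpfiles.d/, so the same three lines could probably be kept in a separate file there instead of in the vendor-owned file under /usr/lib.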

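A restart of the service should not be strictly required: systemd-tmpfiles-clean.service is a one-shot unit that systemd-tmpfiles-clean.timer starts periodically, and it reads its configuration on every run. Restarting it simply runs the cleanup immediately with the new configuration.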

systemctl restart systemd-tmpfiles-clean.service

systemctl status systemd-tmpfiles-clean.service

● systemd-tmpfiles-clean.service - Cleanup of Temporary Directories
   Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-clean.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2020-08-15 16:36:50 BST; 5s ago
     Docs: man:tmpfiles.d(5)
           man:systemd-tmpfiles(8)
  Process: 5274 ExecStart=/usr/bin/systemd-tmpfiles --clean (code=exited, status=0/SUCCESS)
 Main PID: 5274 (code=exited, status=0/SUCCESS)