Recently we started getting errors when executing any command with srvctl or crsctl, while the database and ASM were running absolutely fine, with no errors reported at the DB, ASM, or GI level. We mostly run 18c and 19c. Here are the errors you get.
srvctl config asm
PRCR-1070 : Failed to check if resource ora.asm is registered
CRS-5168 : unable to communicate with ohasd

crsctl stat res -t
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.
Let’s verify the process status. All the processes are running. I did look for errors for all the services, but did not find anything relevant.
ps -ef|grep d.bin
grid 4057 1 0 May04 ? 22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid 4386 1 0 May04 ? 10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid 4523 1 0 May04 ? 11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin
Let’s trace it using strace. The -f option follows any child processes that are forked, and -o sends the output to a file.
strace -o abc.txt -f srvctl config asm
Important caveat: tracing and debugging tools like strace, ptrace, and GDB have their perils, as shown by Tanel Poder: https://tanelpoder.com/2008/06/14/debugger-dangers/
Here is the important bit of output from strace.
27383 sendto(21, "\4", 1, MSG_NOSIGNAL, NULL, 0) = 1
27383 connect(25, {sa_family=AF_UNIX, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = -1 ENOENT (No such file or directory)
27383 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 26
Strace reveals that the socket file “/var/tmp/.oracle/sCRSD_UI_SOCKET” is missing. Let’s verify at the OS level.
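On a larger trace file it can be tedious to spot the failing call by eye. The following is only a minimal sketch, assuming the trace was written with -o as above; the function name missing_sockets is made up for illustration. It lists the socket paths of AF_UNIX connect() calls that failed with ENOENT.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the AF_UNIX socket paths that connect()
# could not find (ENOENT) in an strace output file.
missing_sockets() {
    local trace_file="$1"
    # Keep only failed AF_UNIX connect() calls, then extract sun_path.
    grep -E 'connect\(.*AF_UNIX.*= -1 ENOENT' "$trace_file" \
        | sed -E 's/.*sun_path="([^"]+)".*/\1/' \
        | sort -u
}

# Usage (file name taken from the strace command above):
#   missing_sockets abc.txt
```

Sorting with -u collapses repeated attempts against the same socket into one line.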
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
ls: cannot access /var/tmp/.oracle/sCRSD_UI_SOCKET: No such file or directory
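The same check can be scripted, for example to run from a monitoring job. This is just a sketch: the function name check_socket is hypothetical, and it relies only on the shell's -S test, which is true when the path exists and is a socket.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a path exists and is a socket file.
check_socket() {
    local path="$1"
    if [ -S "$path" ]; then
        echo "OK: $path"
    else
        echo "MISSING: $path"
        return 1
    fi
}

# Usage:
#   check_socket /var/tmp/.oracle/sCRSD_UI_SOCKET
```

The non-zero return code on a missing socket makes it easy to hook into whatever alerting is already in place.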
CRS agents communicate with each other through socket files (for IPC). It is apparent that the socket file is being removed. On a few occasions, before I was aware of this, I had to forcefully restart to get things working, but that is obviously not a solution.
However, since ohasd.bin runs with the reboot argument, killing this process will cause it to be restarted, and the socket file should be recreated automatically. So, let’s try to kill “ohasd.bin reboot”.
ps -ef|grep d.bin
grid 4057 1 0 May04 ? 22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid 4386 1 0 May04 ? 10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid 4523 1 0 May04 ? 11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

kill -9 4057

ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 15 00:17 /var/tmp/.oracle/sCRSD_UI_SOCKET
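If you want to pick out the PID without reading the ps listing by eye, something like the following might do. It is a sketch only: the function name ohasd_pid is invented, it assumes the ps -ef column layout shown above (PID in the second column), and it deliberately prints the PID rather than killing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of the "ohasd.bin reboot" process.
ohasd_pid() {
    # Skip the grep/awk processes themselves if they appear in the listing.
    awk '/ohasd\.bin reboot/ && !/awk|grep/ { print $2 }'
}

# Usage:
#   ps -ef | ohasd_pid
```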
Now, the paramount question is why the socket file is deleted automatically. I had no answer to this, so I raised it with Oracle, and they quickly pointed to “Oracle Linux 7 and Redhat Linux 7: The socket files in /var/tmp/.oracle Location Get Deleted (Doc ID 2455193.1)”, which explains that the file is deleted by the systemd service systemd-tmpfiles-clean.service. The simple solution is to exclude the location in /usr/lib/tmpfiles.d/tmp.conf, as shown below.
Content of "/usr/lib/tmpfiles.d/tmp.conf":

# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.

# See tmpfiles.d(5) for details

# Clear tmp directories separately, to make them easier to override
v /tmp 1777 root root 10d
v /var/tmp 1777 root root 30d

# Exclude namespace mountpoints created with PrivateTmp=yes
x /tmp/systemd-private-%b-*
X /tmp/systemd-private-%b-*/tmp
x /var/tmp/systemd-private-%b-*
X /var/tmp/systemd-private-%b-*/tmp
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*

A restart of the service may be required, though I am not sure.

systemctl restart systemd-tmpfiles-clean.service
systemctl status systemd-tmpfiles-clean.service
● systemd-tmpfiles-clean.service - Cleanup of Temporary Directories
   Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-clean.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2020-08-15 16:36:50 BST; 5s ago
     Docs: man:tmpfiles.d(5)
           man:systemd-tmpfiles(8)
  Process: 5274 ExecStart=/usr/bin/systemd-tmpfiles --clean (code=exited, status=0/SUCCESS)
 Main PID: 5274 (code=exited, status=0/SUCCESS)
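To confirm that the exclusions are really in place on each node, a small check along these lines could be used. It is a rough sketch: the function name has_exclusion is hypothetical, and it only looks for a literal "x <pattern>" line in the given file, without interpreting tmpfiles.d syntax any further.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the tmpfiles.d config file contains an
# "x <pattern>" exclusion line for the given pattern.
has_exclusion() {
    local conf="$1" pattern="$2"
    # Collapse runs of whitespace and trim the ends, then look for an exact,
    # literal (-F) whole-line (-x) match so "*" is not treated as a wildcard.
    sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//' "$conf" \
        | grep -Fxq "x $pattern"
}

# Usage:
#   for p in '/tmp/.oracle*' '/var/tmp/.oracle*' '/usr/tmp/.oracle*'; do
#       has_exclusion /usr/lib/tmpfiles.d/tmp.conf "$p" || echo "missing: x $p"
#   done
```

Since tmp.conf under /usr/lib belongs to the systemd package, it may be worth re-running a check like this after OS patching, in case the file gets replaced.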