CRSCTL/SRVCTL throws CRS-5168 : unable to communicate with ohasd.

Recently we started getting errors while executing any command using srvctl or crsctl, even though the database and ASM were running absolutely fine, with nothing reported at the DB/ASM or GI level. We are mostly running 19c and 18c. Here is the error you get:

srvctl config asm
PRCR-1070 : Failed to check if resource ora.asm is registered
CRS-5168 : unable to communicate with ohasd

crsctl stat res -t
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

Let’s verify the process status. All the processes are running; I did look for errors across all the services but did not find anything relevant.

ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

Let’s trace it using strace: the -f option follows child processes as they are forked, and -o sends the output to a file.

strace -o abc.txt -f srvctl config asm

Important caveat: tracing and debugging tools like strace/ptrace/GDB carry real risks when used on production processes, as Tanel Poder has shown: https://tanelpoder.com/2008/06/14/debugger-dangers/

Here is the important bit of output from strace.

27383 sendto(21, "\4", 1, MSG_NOSIGNAL, NULL, 0) = 1
27383 connect(25, {sa_family=AF_UNIX, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = -1 ENOENT (No such file or directory)
27383 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 26

strace reveals that the socket file “/var/tmp/.oracle/sCRSD_UI_SOCKET” is missing. Let’s verify at the OS level.

ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
ls: cannot access /var/tmp/.oracle/sCRSD_UI_SOCKET: No such file or directory

CRS agents communicate with each other over Unix domain socket files (IPC). It’s apparent that the socket file is being removed; on a few occasions, before I understood the cause, I had to forcefully restart the stack to get things working again, which is obviously not a solution.
However, since ohasd.bin runs with the reboot argument, killing the process will make it restart itself, and the socket file should be recreated automatically. So let’s kill “ohasd.bin reboot”:

 ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

kill -9 4057
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 15 00:17 /var/tmp/.oracle/sCRSD_UI_SOCKET
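
Once the socket was back, a quick sanity check (simply reusing the commands from the start of this post) confirmed that srvctl and crsctl respond again:

ps -ef | grep d.bin          # ohasd.bin should be running again, with a new PID
srvctl config asm
crsctl stat res -t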

Now the paramount question: why is the socket file deleted automatically? I had no answer to this, so I raised it with Oracle, and they quickly pointed to “Oracle Linux 7 and Redhat Linux 7: The socket files in /var/tmp/.oracle Location Get Deleted (Doc ID 2455193.1)”. It explains that the files are deleted by the systemd service systemd-tmpfiles-clean.service, and the simple solution is to exclude the location in /usr/lib/tmpfiles.d/tmp.conf as shown below.

Content of "/usr/lib/tmpfiles.d/tmp.conf":
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

# See tmpfiles.d(5) for details

# Clear tmp directories separately, to make them easier to override
v /tmp 1777 root root 10d
v /var/tmp 1777 root root 30d

# Exclude namespace mountpoints created with PrivateTmp=yes
x /tmp/systemd-private-%b-*
X /tmp/systemd-private-%b-*/tmp
x /var/tmp/systemd-private-%b-*
X /var/tmp/systemd-private-%b-*/tmp
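# Exclude the Oracle Grid Infrastructure socket files (per Doc ID 2455193.1)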
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*

A restart of the service may be required (I have not confirmed whether it is strictly necessary):

systemctl restart systemd-tmpfiles-clean.service

systemctl status systemd-tmpfiles-clean.service

● systemd-tmpfiles-clean.service - Cleanup of Temporary Directories
   Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-clean.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2020-08-15 16:36:50 BST; 5s ago
     Docs: man:tmpfiles.d(5)
           man:systemd-tmpfiles(8)
  Process: 5274 ExecStart=/usr/bin/systemd-tmpfiles --clean (code=exited, status=0/SUCCESS)
 Main PID: 5274 (code=exited, status=0/SUCCESS)
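
To double-check that the exclusions behave as intended, you can trigger a manual clean and confirm that the Oracle socket directory survives (a hedged sketch; the paths are the ones from this environment):

# run the same binary the service uses (see ExecStart above)
/usr/bin/systemd-tmpfiles --clean
# the socket directory and its files should still be present afterwards
ls -ld /var/tmp/.oracle
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET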

How to remove an OUD instance from a replication topology?

This might sound trivial, and indeed it is, but it is certainly not well documented. We had 8 OUD (Oracle Unified Directory) instances in a replication topology, and as part of a data-centre exit two servers needed to be removed from the cluster. Here is the output of ./dsreplication status -X --advanced (for better visibility I have trimmed parts of it).

The idea was to remove the two servers completely and gracefully from the replication topology, so that they would not appear anywhere in the configuration. After searching for a while I found a few documents, such as “How to Remove a Server From a Replication Group (Doc ID 2185157.1)” and “OUD 11g/12c – How to Remove a Corrupt OUD Instance From a Replication Topology (Doc ID 2287753.1)”. Unfortunately, both procedures struck me as desperate attempts at cleanup after some kind of failure; I needed a graceful way.

It finally turned out to be very simple! Run dsreplication with the disable flag, which actually removes the server from the configuration rather than just disabling it; Oracle should probably have used the term “remove” rather than “disable”.

Trigger this on the server that you want to remove.

./dsreplication disable -X --advanced

>>>> Specify Oracle Unified Directory LDAP connection parameters

Directory server host name or IP address [ABC3uk.oracle.com]:

Directory server administration port number [4444]:

Global Administrator User ID, or bind DN if no Global Administrator is defined
[admin]:

Password for user 'admin':
Establishing connections and reading configuration ..... Done.
You can choose to disable all the replication on the server.  If you choose
'yes', the change log and the replication port (if defined) will be disabled;
if this server contains replicated data, the data will not be replicated with
other servers any more.  All the registration information will be deleted.
Do you want to disable all the replication configuration? (yes / no) [no]: yes
Disabling replication on the registration data (cn=admin data) of server
ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on base DN cn=OracleContext,DC=global,DC=tesco,DC=org of
server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the oracle context used for integration
(cn=oraclecontext) of server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the oracle schema version used for integration
(cn=oracleschemaversion) of server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the schema of server
ABC3uk.oracle.com:4444 ..... Done.
Disabling replication port 8989 of server
ABC3uk.oracle.com:4444 ........... Done.
Removing truststore information ..... Done.
Removing registration information ..... Done.

Note that --advanced is not strictly required. In my configuration the replication setup was not visible without it, for a reason I have not yet investigated. To verify, run:

./dsreplication status -X --advanced

You can also confirm that all the related entries have been removed from config.ldif; they should be gone.

grep 8989 /appl/product/middleware/asinst_1/OUD/config/config.ldif
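
As an additional hedged check (the host name and instance path are the ones from this environment), you can also grep for the removed host across the configuration:

grep -i ABC3uk /appl/product/middleware/asinst_1/OUD/config/config.ldif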

One last thing: if you have OUDSM configured, the “disable” option is also available in the console. Presumably it does the same thing, but I have not tested it, so make sure you test it if you plan to use OUDSM.

Identical names for network interfaces on RAC nodes are not mandatory

One of the prerequisites for building Oracle RAC is to have the same name for the private and public interfaces on all nodes, as the Oracle documentation dictates and the Cluster Verification Utility checks. Recently I had a situation where I needed to install RAC 18c on vendor-provided HP Superdome Flex servers for benchmarking, and the network interface names were auto-generated, like enP1s2248f0 and enP1s2408f0.

I knew this was not going to be accepted, but I chose to ignore the warning because I was not too concerned about the accuracy of the RAC configuration; I just wanted a working RAC database for benchmarking. During the installation I got the error “PRVG-11081: Unable to find the matching networks”.
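
For context, this is the same check that the Cluster Verification Utility runs before the install; a hedged example of invoking it standalone (node names here are placeholders):

# run from the unzipped grid software location as the grid user
./runcluvfy.sh stage -pre crsinst -n racnode1,racnode2 -verbose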

Finally, I forcefully ignored the error; as a consequence, the software was installed on node 1 only and node 2 was skipped. See the image below.

Obviously, it is easy to change the interface names; however, in my case the vendor was not willing to do so, as they had an auto-generated naming convention they needed to comply with.

After searching for a while I finally found MOS Doc ID 1570785.1, which clearly states that this limitation was lifted in 12c; it is therefore possible to replace the interface name with “*” in the OCR. Here is what I did.

#At first get the existing configuration 

oifcfg getif
e1000g0 global public
e1000g1 global cluster_interconnect

#Replace Interface with "*"

oifcfg setif -global "*"/10.1.1.0:public
oifcfg setif -global "*"/192.168.5.0:cluster_interconnect

#Verify again.

oifcfg getif
e1000g0 global public
e1000g1 global cluster_interconnect
* 10.1.1.0 global public
* 192.168.5.0 global cluster_interconnect

#Delete the concrete interfaces as needed

oifcfg delif -global e1000g0 -force
oifcfg delif -global e1000g1 -force

#Confirm that only the wildcard interfaces remain

oifcfg getif

After replacing the interface names with “*” in the OCR, I was able to add the second node. So if you are installing from scratch, the procedure would be:

  • Install on the first node only.
  • Modify the interface names to “*” in the OCR using the oifcfg commands above.
  • Follow the standard add-node procedure (a sketch follows below).
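
For reference, a hedged sketch of the add-node step (the node and VIP names are placeholders, and $GRID_HOME stands for your Grid Infrastructure home):

# run as the grid user from the existing node
cd $GRID_HOME/addnode
./addnode.sh -silent "CLUSTER_NEW_NODES={racnode2}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={racnode2-vip}"
# then run the root scripts on the new node when prompted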

LGWR Post/Wait vs. Polling algorithm in action.

Well, this post is about “adaptive log file sync”, a feature introduced in 11.2.0.3 that dictates how the foreground process is notified that LGWR has written the committed change vectors to persistent storage. With the adaptive method, LGWR can choose between the post/wait and polling methods.

Traditionally, LGWR posts the foreground process after it has written to disk; with the polling method, the foreground process instead sleeps for a while, wakes up, and checks whether the commit has completed.

Recently we started facing very high latency on the log file sync wait, so the first thing was to verify “log file parallel write”: more than 90% of the I/O completed in less than 1 ms, and the storage team confirmed the same from their side.
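
For reference, here is one hedged way to eyeball the redo write latency distribution; note that v$event_histogram is cumulative since instance startup, so treat it only as a rough cross-check against AWR:

sqlplus -s / as sysdba <<'EOF'
-- distribution of LGWR redo write times, bucketed in milliseconds
select wait_time_milli, wait_count
from   v$event_histogram
where  event = 'log file parallel write'
order  by wait_time_milli;
EOF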

The LGWR trace file then revealed that LGWR was stuck in polling mode and did not switch back to post/wait, and the extra time spent polling was what caused the high log file sync latency:

[oracle@hostname trace]$ grep "Log file sync switching" -B 1 *
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 15:56:49.376
FAKE_DB_lgwr_25389.trc:Log file sync switching to polling
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 21:12:57.705
FAKE_DB_lgwr_25389.trc:Log file sync switching to post/wait
FAKE_DB_lgwr_27858.trc-*** 2020-02-23 06:06:14.698
FAKE_DB_lgwr_27858.trc:Log file sync switching to polling

It was not difficult to understand the problem, as I had read the amazing post by Frits Hoogland on this long ago. The solution was to turn off the adaptive behaviour with alter system set "_use_adaptive_log_file_sync" = false, and immediately after that the log file sync waits disappeared.
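
A minimal sketch of the change as I applied it (run as SYSDBA; this sets the same underscore parameter quoted above, and as with any hidden parameter you should test it in your own environment and ideally involve Oracle Support first):

sqlplus -s / as sysdba <<'EOF'
-- disable adaptive switching so LGWR stays with the default post/wait mode
alter system set "_use_adaptive_log_file_sync" = false;
EOF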

After I published this post, Frits Hoogland was gracious enough to read it and comment on Twitter. He mentioned that the post did not contain enough information to draw a conclusion, so I decided to add some more detail. I was troubleshooting 11.2.0.3 running on RHEL 5.8. After the trace file confirmed that LGWR was switching back and forth, I compared the latency graphs before and after the switch, and it was clear that polling made the latency far too high. By this point it was clear to me that Oracle was, at the very least, not calibrating well enough to switch back. A couple of searches turned up some related bugs:

Bug 27143321 – LGWR May Not Switch Back to post/wait Mode (Doc ID 27143321.8)

Bug 13707904 – LGWR sometimes uses polling, sometimes post/wait (Doc ID 13707904.8)

Patching was not an option at all due to a change freeze and lockdown, so I decided to turn off the adaptive behaviour, since the default is post/wait and it was working better for my system.