Couchbase Analytics Service bind issue with Azure VM enabled with Accelerated Networking

Recently, I found myself diving into a technical challenge: the Couchbase Analytics service wasn’t starting up after being added to our cluster. The console was quite cryptic, flashing the message shown below.

Interestingly, the analytics service log indicated that port 8095 was already in use, only to become available moments later.

2023-11-24T12:36:26.471+00:00 ERRO CBAS.server.HttpServer [main] Failure starting an Http Server at: /127.0.0.1:8095
java.net.BindException: Address already in use

2023-11-24T12:36:29.980+00:00 DEBU CBAS.cbas ensuring bind port 127.0.0.1:8095 is available
2023-11-24T12:36:29.982+00:00 DEBU CBAS.cbas trying optional address family at [::1]:8095
2023-11-24T12:36:29.984+00:00 INFO CBAS.cbas ignoring optional family address IPv6: listen tcp6 [::1]:8095: socket: address family not supported by protocol
2023-11-24T12:36:32.691+00:00 ERRO CBAS.server.HttpServer [main] Failure starting an Http Server at: /127.0.0.1:8095

My initial approach was to determine whether port 8095 was being used by another service, using the command netstat -tanelp | grep "8095". Surprisingly, this revealed nothing. Since netstat only gives a point-in-time snapshot, I hypothesized that perhaps a process was intermittently grabbing the port. To probe deeper, I turned to the eBPF-based tcplife for more granular tracing. After halting Couchbase and starting tcplife, I restarted Couchbase, but still observed no service binding to port 8095.
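
For reference, the tracing session looked roughly like this (a minimal sketch; the tcplife path, its -D remote-port filter, and the service name are assumptions that depend on your packaging):

# stop Couchbase so nothing we control is touching the port
systemctl stop couchbase-server

# trace TCP sessions involving port 8095 (bcc-tools path assumed)
/usr/share/bcc/tools/tcplife -D 8095 &

# bring Couchbase back and watch what actually touches 8095
systemctl start couchbase-server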

PID       COMM        LADDR    LPORT RADDR RPORT TX_KB RX_KB MS
447556 prometheus 127.0.0.1 36816 127.0.0.1 8095 0 0 0.04
447556 prometheus 127.0.0.1 50428 127.0.0.1 8095 0 0 0.03
447556 prometheus 127.0.0.1 50444 127.0.0.1 8095 0 0 0.04
447556 prometheus 127.0.0.1 36684 127.0.0.1 8095 0 0 0.08
447556 prometheus 127.0.0.1 36696 127.0.0.1 8095 0 0 0.08
447556 prometheus 127.0.0.1 44300 127.0.0.1 8095 0 0 0.04
447556 prometheus 127.0.0.1 44304 127.0.0.1 8095 0 0 0.04

So nothing was ever trying to bind to port 8095; the only activity was Prometheus connecting to Analytics on 8095, which is normal.
Having exhausted all obvious avenues, I contacted Couchbase support. Their investigation brought an unexpected revelation: our Azure Virtual Machine Scale Set (VMSS) had two network interfaces, eth0 and eth1, sharing the same IP address. This was surprising as each VM in our Azure VMSS was configured with only a single NIC.
#ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:0d:3a:b8:24:c6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
link/ether 00:0d:3a:b8:24:c6 brd ff:ff:ff:ff:ff:ff
altname enP27495p0s2
altname enP27495s1

#ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.119.83.40 netmask 255.255.255.0 broadcast 10.119.83.255
ether 00:0d:3a:b8:24:c6 txqueuelen 1000 (Ethernet)
RX packets 101881077 bytes 48801167623 (45.4 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 94569580 bytes 43202558095 (40.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
inet 10.119.83.40 netmask 255.255.255.0 broadcast 10.119.83.255
ether 00:0d:3a:b8:24:c6 txqueuelen 1000 (Ethernet)
RX packets 62618582 bytes 26119847950 (24.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 61873861 bytes 21738931845 (20.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

The anomaly was traced back to Azure’s Accelerated Networking feature, which is enabled for larger VM SKUs. Accelerated Networking activates single-root I/O virtualization (SR-IOV) on supported VM types, significantly boosting networking performance. In Azure VMs a synthetic network interface is created for each virtual NIC, and in addition a VF (virtual function) interface appears in the Linux guest as a PCI device using the Mellanox mlx4 or mlx5 drivers; the two are transparently bonded together. This behavior, though unexpected, was not incorrect. However, it affected only the Analytics service; other services such as data, index, eventing, and FTS operated normally.
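
A quick way to spot this state is to look at the driver behind each interface (interface names as in my environment):

# the VF appears as a Mellanox PCI device
lspci | grep -i mellanox

# the synthetic NIC uses hv_netvsc, while the VF uses mlx4_en/mlx5_core
ethtool -i eth0 | grep ^driver
ethtool -i eth1 | grep ^driver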

Resolution

Two solutions emerged: one at the operating system level and the other at the Couchbase level. The Couchbase team identified this as a bug, fixed it in version 7.2.4 (I was using 7.1.4), and agreed to backport the fix to 7.1.6. In the meantime, a Couchbase-level workaround involved the following command:

Run the curl command below, then restart the Couchbase service; Analytics will come back up.

curl -v -X PUT http://localhost:9122/metakv/config/service/parameters -u Admin:password -d '{"configVersion":1,"config":{"bindToHost":true}}'

However, if you want to fix it at the OS level and you don’t really need Accelerated Networking, create a file named “68-azure-sriov-nm-unmanaged.rules” under /etc/udev/rules.d/ with the content below, so that NetworkManager leaves the SR-IOV interface unmanaged.

# Accelerated Networking on Azure exposes a new SRIOV interface to the VM.
# This interface is transparently bonded to the synthetic interface,
# so NetworkManager should just ignore any SRIOV interfaces.

SUBSYSTEM=="net", DRIVERS=="hv_pci", ACTION=="add", ENV{NM_UNMANAGED}="1"

After creating the udev rule, reload the rules and restart the VM; this prevents an IP address from being assigned to the SR-IOV (VF) interface.

# /sbin/udevadm control --reload-rules
# /sbin/udevadm trigger --type=devices --action=change
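
Once the VM is back up, confirm that only the synthetic interface carries the IP address:

# eth0 (synthetic) should show the address; eth1 (VF) should have none
ip -br addr show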

Either of these methods fixes the Analytics service.

Automating SSL implementation in PostgreSQL

Here is a quick set of Ansible tasks to enable SSL and create a self-signed certificate; drop them into a playbook of your own. SSL is off by default, and the certificate and key files are controlled by the parameters ssl_cert_file = ‘server.crt’ and ssl_key_file = ‘server.key’, so I create the files with those default names, but you can change them as you like. The steps are: create the self-signed certificate, set ssl = on, and modify the pg_hba.conf rules to match your requirements. In my case SSL is enabled for external users, while replication traffic stays non-SSL. The pg_hba.conf task below is only a reference, as requirements vary widely; you may need to combine rules differently or write a task that modifies existing rules. After these changes the database needs a restart, which is handled in the last task.

- name: Configuring pg_hba.conf
  blockinfile:
        insertbefore: EOF
        create: yes
        marker_begin: ANSIBLE PostgreSQL HBA START
        marker_end: ANSIBLE PostgreSQL HBA END
        backup: yes
        path: /u00/app/postgres/12/data/pg_hba.conf
        block: |
                hostssl    all             all             0.0.0.0/0          md5
                local   all             postgres                           ident
                host    replication     replication     {{ ansible_default_ipv4.address }}/32    trust
                host    replication     postgres        {{ ansible_default_ipv4.address }}/32    trust
  become: yes
  become_user: postgres

- name: Create private key for PostgreSQL SSL
  openssl_privatekey:
        path: "/u00/app/postgres/12/data/server.key"
        mode: 0400
        type: RSA
        size: 2048
  become: yes
  become_user: postgres

- name: Create certificate signing request
  openssl_csr:
        path: "/u00/app/postgres/12/data/server.req"
        privatekey_path: "/u00/app/postgres/12/data/server.key"
        country_name: "UK"
        state_or_province_name: "London"
        organization_name: "YOUR-company"
        organizational_unit_name: "ANYTHING-YOU-WANT-TO-IDENTIFY"
        mode: 0444
  become: yes
  become_user: postgres

- name: Create self-signed certificate
  openssl_certificate:
        path: "/u00/app/postgres/12/data/server.crt"
        csr_path: "/u00/app/postgres/12/data/server.req"
        privatekey_path: "/u00/app/postgres/12/data/server.key"
        provider: selfsigned
        # expiry of the self-signed certificate (ASN.1 time); adjust to your policy
        selfsigned_not_after: "20200310235959Z"
        mode: 0444
  become: yes
  become_user: postgres

- name: Setting PostgreSQL parameters
  postgresql_set:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
  loop:
        - { name: 'ssl', value: 'on' }
  become: yes
  become_user: postgres

- name: Restarting database
  shell: 'source /u00/home/postgres/.bash_profile ; /opt/rh/rh-postgresql12/root/usr/bin/pg_ctl restart -D /u00/app/postgres/12/data'
  become: yes
  become_user: postgres
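
Once the restart completes, a quick way to confirm that client connections are actually encrypted is to query pg_stat_ssl for your own session (the host name and user below are placeholders):

psql "host=<db-host> dbname=postgres user=app_user sslmode=require" \
     -c "SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();"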

CRSCTL/SRVCTL throws CRS-5168 : unable to communicate with ohasd.

Recently we started getting errors while executing any command using srvctl and crsctl, even though the database and ASM were running absolutely fine, with no errors reported at the DB, ASM, or GI level. We mostly run 18c and 19c. Here is the error you get.

srvctl config asm
PRCR-1070 : Failed to check if resource ora.asm is registered
CRS-5168 : unable to communicate with ohasd

crsctl stat res -t
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

Let’s verify the process status. All the processes are running; I looked for errors across all the services but did not find anything relevant.

ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

Let’s trace it using strace. The -f option follows forked child processes, and -o sends the output to a file.

strace -o abc.txt -f srvctl config asm

Important caveat: there are perils to using tracing and debugging tools like strace/ptrace/GDB, as Tanel Poder has shown: https://tanelpoder.com/2008/06/14/debugger-dangers/

Here is the important bit of output from strace.

27383 sendto(21, "\4", 1, MSG_NOSIGNAL, NULL, 0) = 1
27383 connect(25, {sa_family=AF_UNIX, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = -1 ENOENT (No such file or directory)
27383 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 26

Strace reveals that the socket file “/var/tmp/.oracle/sCRSD_UI_SOCKET” is missing. Let’s verify at the OS level.

ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
ls: cannot access /var/tmp/.oracle/sCRSD_UI_SOCKET: No such file or directory

CRS agents communicate with each other via socket descriptors (IPC). It was apparent that the socket file was being removed. On a few occasions, before I understood this, I had to forcefully restart the stack to get things working again, which is obviously not a solution.
However, since ohasd.bin runs with the reboot argument, killing this process makes it restart itself, and the socket file should be recreated automatically. So let’s try to kill “ohasd.bin reboot”.

 ps -ef|grep d.bin
grid      4057     1  0 May04 ?        22:06:47 /u00/app/grid/product/19.3.0/grid_1/bin/ohasd.bin reboot
grid      4386     1  0 May04 ?        10:18:59 /u00/app/grid/product/19.3.0/grid_1/bin/evmd.bin
grid      4523     1  0 May04 ?        11:50:49 /u00/app/grid/product/19.3.0/grid_1/bin/ocssd.bin

kill -9 4057
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 15 00:17 /var/tmp/.oracle/sCRSD_UI_SOCKET

Now, the paramount question was why the socket file is deleted automatically. I had no answer to this, so I raised it with Oracle, and they quickly pointed to “Oracle Linux 7 and Redhat Linux 7: The socket files in /var/tmp/.oracle Location Get Deleted (Doc ID 2455193.1)”, which explains that the files are deleted by the systemd-tmpfiles-clean.service cleanup service. The simple solution is to exclude the location, as shown below, in /usr/lib/tmpfiles.d/tmp.conf.

Content of "/usr/lib/tmpfiles.d/tmp.conf":
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

# See tmpfiles.d(5) for details

# Clear tmp directories separately, to make them easier to override
v /tmp 1777 root root 10d
v /var/tmp 1777 root root 30d

# Exclude namespace mountpoints created with PrivateTmp=yes
x /tmp/systemd-private-%b-*
X /tmp/systemd-private-%b-*/tmp
x /var/tmp/systemd-private-%b-*
X /var/tmp/systemd-private-%b-*/tmp
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*

A restart of the service may be required (I have not confirmed whether it is strictly necessary).

systemctl restart systemd-tmpfiles-clean.service

systemctl status systemd-tmpfiles-clean.service

● systemd-tmpfiles-clean.service - Cleanup of Temporary Directories
   Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-clean.service; static; vendor preset: disabled)
   Active: inactive (dead) since Sat 2020-08-15 16:36:50 BST; 5s ago
     Docs: man:tmpfiles.d(5)
           man:systemd-tmpfiles(8)
  Process: 5274 ExecStart=/usr/bin/systemd-tmpfiles --clean (code=exited, status=0/SUCCESS)
 Main PID: 5274 (code=exited, status=0/SUCCESS)
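
To be sure the exclusion works, you can trigger the cleaner once by hand and check that the socket file survives (a quick sanity check; only do this if you are comfortable running the cleanup manually on your system):

/usr/bin/systemd-tmpfiles --clean
ls -l /var/tmp/.oracle/sCRSD_UI_SOCKET    # should still be present after the exclusion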

How to remove an OUD instance from a replication topology?

This might sound trivial, and indeed it is, but it is certainly not well documented. We had 8 OUD (Oracle Unified Directory) instances in a replication topology, and as part of a data-center exit two servers needed to be removed from the cluster. The starting point was the output of ./dsreplication status -X --advanced, trimmed here for better visibility.

So, the idea was to remove the servers from the replication topology completely and gracefully, so that they would no longer appear anywhere in the configuration. After searching for a while I did encounter a few documents, such as “How to Remove a Server From a Replication Group (Doc ID 2185157.1)” and “OUD 11g/12c – How to Remove a Corrupt OUD Instance From a Replication Topology (Doc ID 2287753.1)”. Unfortunately, both procedures looked like desperate attempts to clean up after some kind of failure; I needed a graceful way.

It finally turned out to be very simple: run dsreplication with the disable flag, which actually removes the server from the configuration rather than merely disabling it. Oracle should probably have used the term “remove” rather than “disable”.

Trigger this on the server that you want to remove.

./dsreplication disable -X --advanced

>>>> Specify Oracle Unified Directory LDAP connection parameters

Directory server host name or IP address [ABC3uk.oracle.com]:

Directory server administration port number [4444]:

Global Administrator User ID, or bind DN if no Global Administrator is defined
[admin]:

Password for user 'admin':
Establishing connections and reading configuration ..... Done.
You can choose to disable all the replication on the server.  If you choose
'yes', the change log and the replication port (if defined) will be disabled;
if this server contains replicated data, the data will not be replicated with
other servers any more.  All the registration information will be deleted.
Do you want to disable all the replication configuration? (yes / no) [no]: yes
Disabling replication on the registration data (cn=admin data) of server
ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on base DN cn=OracleContext,DC=global,DC=tesco,DC=org of
server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the oracle context used for integration
(cn=oraclecontext) of server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the oracle schema version used for integration
(cn=oracleschemaversion) of server ABC3uk.oracle.com:4444 ..... Done.
Disabling replication on the schema of server
ABC3uk.oracle.com:4444 ..... Done.
Disabling replication port 8989 of server
ABC3uk.oracle.com:4444 ........... Done.
Removing truststore information ..... Done.
Removing registration information ..... Done.

Note that --advanced should not normally be required; in my configuration, however, the replication configuration was not visible without it, which I have not investigated yet. To verify, run:

./dsreplication status -X --advanced

You can also confirm that all the entries have been removed from config.ldif; they should be gone.

cat /appl/product/middleware/asinst_1/OUD/config/config.ldif |grep 8989

One last thing: if you have OUDSM configured, the “disable” option is also available in the console. Presumably it does the same thing, but I have not tested it, so make sure you test it if you plan to use OUDSM.

Identical names for network interfaces on RAC nodes are not mandatory

One of the prerequisites for building Oracle RAC is to have the same names for the private and public interfaces on all nodes, as the Oracle documentation dictates and as the Cluster Verification Utility checks. Recently I had a situation where I needed to install RAC 18c on vendor-provided HPE Superdome Flex servers for benchmarking, and the network interface names were auto-generated, such as enP1s2248f0 and enP1s2408f0.

I knew it was not going to be accepted, but I still chose to ignore the error, because I was not concerned with the accuracy of the RAC configuration; I just wanted a working RAC database for benchmarking. During installation I got the error “PRVG-11081 : Unable to find the matching networks”.

Finally I forcefully ignored the error; as a consequence, RAC was installed on node 1 only, and node 2 was skipped.

Obviously, it is easy to change the interface names; however, in my case the vendor was not willing to do so, as they had an auto-generated naming convention they needed to comply with.

After searching for a while I finally found MOS Doc ID 1570785.1, which clearly says that this limitation was lifted in 12c, so it is possible to replace the interface name with “*” in the OCR. Here is what I did.

#At first get the existing configuration 

oifcfg getif
e1000g0 global public
e1000g1 global cluster_interconnect

#Replace Interface with "*"

oifcfg setif -global "*"/10.1.1.0:public
oifcfg setif -global "*"/192.168.5.0:cluster_interconnect

#Verify again.

oifcfg getif
e1000g0 global public
e1000g1 global cluster_interconnect
* 10.1.1.0 global public
* 192.168.5.0 global cluster_interconnect

#Delete the concrete interfaces as needed

oifcfg delif -global e1000g0 -force
oifcfg delif -global e1000g1 -force

#Confirm that only the wildcard interfaces remain

After replacing the interface names with “*” in the OCR, I was able to add the node. So if you are installing fresh, the procedure would be:

  • Install on the first node only.
  • Modify the interface name to “*” in the OCR using the oifcfg command.
  • Follow the add-node procedure (a rough sketch of this step is shown below).
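
For reference, the add-node step looks roughly like this (node and VIP names are placeholders and the exact flags depend on your version; run cluvfy stage -pre nodeadd first):

# run from an existing node as the grid software owner
$ORACLE_HOME/addnode/addnode.sh -silent "CLUSTER_NEW_NODES={node2}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={node2-vip}"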

LGWR Post/Wait v/s Polling Algorithm in action.

Well, this post is about “adaptive log file sync”. This feature was introduced in 11.2.0.3 and primarily dictates how the foreground process is notified that LGWR has written the committed change vectors to persistent storage. With the adaptive method, LGWR can choose between the post/wait and polling methods.

Traditionally, LGWR posts the foreground process after it has written to disk; with the polling method, the foreground process instead sleeps for a while, wakes up, and checks whether the commit has completed.

Recently we started facing very high latency on the log file sync wait, so the first thing was to verify “log file parallel write”: more than 90% of the I/O completed in less than 1 ms, and the storage team confirmed the same from their side. The LGWR trace file further revealed that it was stuck on polling and did not switch back to post/wait, so the extra time spent polling caused the high latency.
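
A quick way to see the redo write latency distribution is the event histogram (a sketch using the standard V$ view; each row is a wait-time bucket in milliseconds):

sqlplus -s / as sysdba <<'EOF'
select wait_time_milli, wait_count
from   v$event_histogram
where  event = 'log file parallel write'
order  by wait_time_milli;
EOF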

The LGWR trace files reveal the following switches:

[oracle@hostname trace]$ grep "Log file sync switching" -B 1 *
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 15:56:49.376
FAKE_DB_lgwr_25389.trc:Log file sync switching to polling
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 21:12:57.705
FAKE_DB_lgwr_25389.trc:Log file sync switching to post/wait
FAKE_DB_lgwr_27858.trc-*** 2020-02-23 06:06:14.698
FAKE_DB_lgwr_27858.trc:Log file sync switching to polling

It was not difficult to understand the problem, as I had read the amazing post on this by Frits Hoogland long back. The solution was to turn off the adaptive feature with alter system set "_use_adaptive_log_file_sync" = false, and immediately after that the log file sync waits disappeared.
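
For completeness, the change looked roughly like this; it is an underscore parameter, so agree it with Oracle Support before touching it (the scope/sid shown are what I would typically use, not taken from the original change record):

sqlplus -s / as sysdba <<'EOF'
alter system set "_use_adaptive_log_file_sync" = false scope=both sid='*';
EOF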

After I published this post, Frits Hoogland was gracious enough to read it and comment on Twitter. He mentioned that the post did not have enough information to draw a conclusion, so I decided to add some more detail. I was troubleshooting 11.2.0.3 running on RHEL 5.8. After the trace files confirmed that LGWR was switching back and forth, I compared the latency graphs before and after the switch, and it was clear that polling drove the latency far too high. By this point it was clear to me that Oracle was, at least in this case, not calibrating well enough to switch back. A couple of searches turned up a few related bugs:

Bug 27143321 – LGWR May Not Switch Back to post/wait Mode (Doc ID 27143321.8)

Bug 13707904 – LGWR sometimes uses polling, sometimes post/wait (Doc ID 13707904.8)

Patching was not an option at all, due to a change freeze and lockdown, so I decided to turn off the adaptive feature, falling back to post/wait, which was working better for my system.