Couchbase Analytics Service bind issue with Azure VM enabled with Accelerated Networking

Recently, I found myself diving into a technical challenge: the Couchbase Analytics service wasn’t starting up after being added to our cluster. The console offered only a cryptic error message.

Interestingly, the analytics service log indicated that port 8095 was already in use, only to become available moments later.

2023-11-24T12:36:26.471+00:00 ERRO CBAS.server.HttpServer [main] Failure starting an Http Server at: /127.0.0.1:8095
java.net.BindException: Address already in use

2023-11-24T12:36:29.980+00:00 DEBU CBAS.cbas ensuring bind port 127.0.0.1:8095 is available
2023-11-24T12:36:29.982+00:00 DEBU CBAS.cbas trying optional address family at [::1]:8095
2023-11-24T12:36:29.984+00:00 INFO CBAS.cbas ignoring optional family address IPv6: listen tcp6 [::1]:8095: socket: address family not supported by protocol
2023-11-24T12:36:32.691+00:00 ERRO CBAS.server.HttpServer [main] Failure starting an Http Server at: /127.0.0.1:8095

My initial approach was to determine whether port 8095 was being used by another service, using the command netstat -tanelp | grep "8095". Surprisingly, this revealed nothing. Because netstat only captures a point-in-time snapshot of open sockets, I hypothesized that a short-lived process might be grabbing the port intermittently. To probe deeper, I turned to the eBPF-based tcplife tool, which records each TCP session together with its PID and command name. After halting Couchbase and starting tcplife, I restarted Couchbase but observed no service binding to port 8095.

PID    COMM       LADDR     LPORT RADDR     RPORT TX_KB RX_KB MS
447556 prometheus 127.0.0.1 36816 127.0.0.1 8095  0     0     0.04
447556 prometheus 127.0.0.1 50428 127.0.0.1 8095  0     0     0.03
447556 prometheus 127.0.0.1 50444 127.0.0.1 8095  0     0     0.04
447556 prometheus 127.0.0.1 36684 127.0.0.1 8095  0     0     0.08
447556 prometheus 127.0.0.1 36696 127.0.0.1 8095  0     0     0.08
447556 prometheus 127.0.0.1 44300 127.0.0.1 8095  0     0     0.04
447556 prometheus 127.0.0.1 44304 127.0.0.1 8095  0     0     0.04

So nothing was ever trying to bind to port 8095; only Prometheus was trying to connect to Analytics on 8095, which is normal.
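The reading of the tcplife output above can be automated with a small filter. This is only a sketch: the function name port_roles is hypothetical, and it assumes the default tcplife column order shown above (PID, COMM, LADDR, LPORT, RADDR, RPORT, ...). It reports, for a given port, which commands used it as their local port and which merely connected to it as a remote port.

```shell
# port_roles PORT
# Reads saved tcplife output on stdin and prints "<comm> local" when PORT is the
# session's LPORT, or "<comm> remote" when PORT is the RPORT (a client connecting to it).
# Assumption: default tcplife columns, i.e. $2=COMM, $4=LPORT, $6=RPORT.
port_roles() {
  port="$1"
  awk -v p="$port" '
    $1 ~ /^[0-9]+$/ && $4 == p { seen[$2 " local"] = 1 }   # skip the header row via the PID check
    $1 ~ /^[0-9]+$/ && $6 == p { seen[$2 " remote"] = 1 }
    END { for (k in seen) print k }' | sort
}

# Example usage (assumes the trace was first saved to a file, since tcplife runs until interrupted):
#   port_roles 8095 < /tmp/tcplife.out
```

If every line printed for 8095 is a "remote" entry, as with the Prometheus rows above, no other process owned the port during the trace.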
Having exhausted all obvious avenues, I contacted Couchbase support. Their investigation brought an unexpected revelation: our Azure Virtual Machine Scale Set (VMSS) had two network interfaces, eth0 and eth1, sharing the same IP address. This was surprising as each VM in our Azure VMSS was configured with only a single NIC.
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:0d:3a:b8:24:c6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
link/ether 00:0d:3a:b8:24:c6 brd ff:ff:ff:ff:ff:ff
altname enP27495p0s2
altname enP27495s1

# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.119.83.40 netmask 255.255.255.0 broadcast 10.119.83.255
ether 00:0d:3a:b8:24:c6 txqueuelen 1000 (Ethernet)
RX packets 101881077 bytes 48801167623 (45.4 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 94569580 bytes 43202558095 (40.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
inet 10.119.83.40 netmask 255.255.255.0 broadcast 10.119.83.255
ether 00:0d:3a:b8:24:c6 txqueuelen 1000 (Ethernet)
RX packets 62618582 bytes 26119847950 (24.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 61873861 bytes 21738931845 (20.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
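
The duplicate address is easy to miss by eye on a busy host. A minimal sketch of a check follows, assuming the one-line-per-address format printed by ip -o -4 addr show; the function name find_duplicate_ips is hypothetical. It lists every IPv4 address that appears on more than one interface.

```shell
# find_duplicate_ips
# Reads `ip -o -4 addr show` output on stdin and prints each IPv4 address that is
# assigned to more than one interface, followed by the interfaces that carry it.
# Assumption: one address per line, with $2 = interface name and $4 = address/prefix.
find_duplicate_ips() {
  awk '$3 == "inet" {
         split($4, a, "/")            # drop the /prefix length
         ip = a[1]
         if (ip in ifaces) ifaces[ip] = ifaces[ip] "," $2
         else ifaces[ip] = $2
         count[ip]++
       }
       END { for (ip in count) if (count[ip] > 1) print ip, ifaces[ip] }'
}

# Example usage:
#   ip -o -4 addr show | find_duplicate_ips
```

On a host in the state shown above, this would be expected to report 10.119.83.40 on both eth0 and eth1.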

The anomaly was traced back to Azure’s implementation of Accelerated Networking on larger VM SKUs. Accelerated Networking enables single root I/O virtualization (SR-IOV) on supported VM types, significantly boosting networking performance. In Azure VMs, a synthetic network interface is created for each virtual NIC. In addition, a virtual function (VF) interface appears in the Linux guest as a PCI device, using the Mellanox mlx4 or mlx5 driver. That VF is the second interface, eth1, which the output above shows as a slave of the synthetic eth0 with the same MAC address. This behavior, though unexpected, was not incorrect. However, it affected only the Analytics service; other services such as data, index, eventing, and FTS operated normally.

Resolution

Two solutions emerged: one at the operating system level and the other at the Couchbase level. The Couchbase team identified this as a bug, fixed it in version 7.2.4 (I was using 7.1.4), and agreed to backport the fix to 7.1.6. In the meantime, a Couchbase-level workaround involved the following command:

Run the curl command below and then restart the Couchbase service; Analytics should come back up.

curl -v -X PUT http://localhost:9122/metakv/config/service/parameters -u Admin:password -d '{"configVersion":1,"config":{"bindToHost":true}}'

However, if you would rather fix it at the OS level and don’t really need Accelerated Networking, create a file named “68-azure-sriov-nm-unmanaged.rules” under /etc/udev/rules.d/ with the content below, so that NetworkManager leaves the SR-IOV interface unmanaged.

# Accelerated Networking on Azure exposes a new SRIOV interface to the VM.
# This interface is transparently bonded to the synthetic interface,
# so NetworkManager should just ignore any SRIOV interfaces.

SUBSYSTEM=="net", DRIVERS=="hv_pci", ACTION=="add", ENV{NM_UNMANAGED}="1"

After creating the udev rule, reload the rules and restart the VM. NetworkManager will then no longer assign the IP address to the SR-IOV (VF) interface.

# /sbin/udevadm control --reload-rules
# /sbin/udevadm trigger --type=devices --action=change
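
To confirm the rule took effect after the restart, one option is to check how NetworkManager reports the VF interface. The sketch below assumes NetworkManager is the network manager in use and that nmcli -t -f DEVICE,STATE device prints terse lines such as eth1:unmanaged; the function name is_unmanaged is hypothetical.

```shell
# is_unmanaged DEVICE
# Reads `nmcli -t -f DEVICE,STATE device` output on stdin and succeeds (exit 0)
# only if DEVICE is reported with the state "unmanaged".
is_unmanaged() {
  dev="$1"
  grep -qxF "${dev}:unmanaged"
}

# Example usage (eth1 is the VF interface name from the output earlier in this post):
#   nmcli -t -f DEVICE,STATE device | is_unmanaged eth1 && echo "eth1 is unmanaged"
```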

Either of these methods should bring the Analytics service back.
