Detecting a weird packet loss issue.¶
So- I recently installed a Quad 100G Mikrotik CRS504-4XQ-IN Router into my lab recently and moved a few servers over to 100G NICs.
Around the time I did this- I also started experiencing tons of random, hard to pinpoint instances of latency across my network.
I validated flow-control was enabled for the network- and I checked various port counters to look for errors.
This- issue took me around a week or two to finally pinpoint.... this post- is going through a bit of the steps taken.
The Symptoms¶
The symptoms- were rather unique.
- Web pages would pause for up to 10 seconds before loading.
- Some mobile apps (clash of clans) would randomly disconnect with connection issues.
- Random latency and disconnects occurring throughout my network.
- Very little detectable packet-loss. This- is key.
- The symptoms almost resembled what occurs when you have a STP loop causing a port to flap up and down.
Starting at the Mikrotik¶
[admin@sw-100g] > /interface print stats
Flags: R - RUNNING; S - SLAVE
Columns: NAME, RX-BYTE, TX-BYTE, RX-PACKET, TX-PACKET, RX-DROP, TX-DROP, TX-QUEUE-DROP, RX-ERROR, TX-ERROR
# NAME RX-BYTE TX-BYTE RX-PACKET TX-PACKET RX-DROP TX-DROP TX-QUEUE-DROP RX-ERROR TX-ERROR
0 R ether1 201 163 444 3 040 455 588 2 264 797 2 543 603 0 0 1 0 0
;;; Kube01: 100G
1 RS qsfp28-1-1 2 969 461 776 402 3 316 576 813 413 1 381 885 242 1 640 058 300 26
2 qsfp28-1-2 0 0 0 0 0
3 qsfp28-1-3 0 0 0 0 0
4 qsfp28-1-4 0 0 0 0 0
;;; Kube02: 100G
5 RS qsfp28-2-1 2 740 746 241 766 3 137 569 536 280 1 741 858 272 1 520 825 302 0
6 qsfp28-2-2 0 0 0 0 0
7 qsfp28-2-3 0 0 0 0 0
8 qsfp28-2-4 0 0 0 0 0
;;; Kube05: 100G
9 RS qsfp28-3-1 4 152 472 555 949 3 509 013 890 640 2 125 952 515 2 224 145 002 0
10 qsfp28-3-2 0 0 0 0 0
11 qsfp28-3-3 0 0 0 0 0
12 qsfp28-3-4 0 0 0 0 0
;;; Uplink: Core Switch Port 26
13 RS qsfp28-4-1 128 030 532 966 173 290 502 219 122 095 521 150 966 404 486
14 qsfp28-4-2 0 0 0 0 0
15 qsfp28-4-3 0 0 0 0 0
16 qsfp28-4-4 0 0 0 0 0
17 R bridge1 0 541 473 0 3 139 0 0 0 0 0
18 R lo 1 051 258 1 051 258 6 252 6 252 0 0 0 0 0
;;; V_SERVER
19 R vlan4 0 538 162 0 3 124 0 0 0 0 0
While there are a few TX drops on the uplink port going back to my Unifi USW-24-PRO core switch, 486 out of the millions of packets- isn't very concerning. Since, I updated all of the firmware for my switches, and routers yesterday- that is likely the cause of those drops.
Having- had lots of issues with STP / RSTP in the past- I decided to check the logs for flapping interfaces.
[admin@sw-100g] > /interface print stats
07-10 06:42:50 bridge,stp qsfp28-4-1:0 learning
07-10 06:42:50 bridge,stp qsfp28-4-1:0 forwarding
07-10 06:42:50 bridge,stp qsfp28-4-1:0 TCHANGE start
07-10 06:42:50 bridge,info hardware offloading activated on bridge "bridge1" ports: qsfp28-4-1,qsfp28-3-1,qsfp28-1-1,qsfp28-2-1
07-10 06:42:53 bridge,stp qsfp28-4-1:0 TCHANGE over
07-10 06:42:53 bridge,stp qsfp28-3-1:0 learning
07-10 06:42:53 bridge,stp qsfp28-3-1:0 forwarding
07-10 06:42:53 bridge,stp qsfp28-1-1:0 learning
07-10 06:42:53 bridge,stp qsfp28-1-1:0 forwarding
07-10 06:42:53 bridge,stp qsfp28-2-1:0 learning
07-10 06:42:53 bridge,stp qsfp28-2-1:0 forwarding
07-10 13:07:40 interface,info qsfp28-4-1 link down
07-10 13:07:40 interface,info qsfp28-4-1 link up (speed 10G, full duplex)
07-10 13:07:40 bridge,stp qsfp28-4-1:0 becomes Designated
07-10 13:07:41 bridge,stp qsfp28-4-1:0 becomes Root
07-10 13:07:41 bridge,stp qsfp28-4-1:0 learning
07-10 13:07:41 bridge,stp qsfp28-4-1:0 forwarding
07-10 13:07:41 bridge,stp qsfp28-4-1:0 TCHANGE start
07-10 13:07:43 bridge,stp qsfp28-4-1:0 TCHANGE over
07-10 17:10:35 interface,info qsfp28-4-1 link down
07-10 17:10:38 interface,info qsfp28-4-1 link up (speed 10G, full duplex)
07-10 17:10:38 bridge,stp qsfp28-4-1:0 becomes Designated
07-10 17:10:38 bridge,stp qsfp28-4-1:0 becomes Root
07-10 17:10:38 bridge,stp qsfp28-4-1:0 learning
07-10 17:10:38 bridge,stp qsfp28-4-1:0 forwarding
07-10 17:10:38 bridge,stp qsfp28-4-1:0 TCHANGE start
07-10 17:10:40 bridge,stp qsfp28-4-1:0 TCHANGE over
Note- most of the events you see above- was due to me changing around parameters. I did notice a high amount of RX-Pauses, and decided to see if toggling flow-control off would do anything of value. (It didn't)
After- not finding anything really of use- I decided to try a few difference approaches....
Multi-Ping (Python)¶
After digging down quite a few rabbit holes, I started running Python- multi-ping-ext. This- is a simple python project, which pings multiple hosts at the same time, and keeps track of latency and failures.
And- it didn't take long for it to spot the issues.
For back-story, my network has 4 different routers, handling various pieces.
All of my 1G networking, is routed via the UXG-Lite (10.100.1.1, 10.255.253.1)
Most of my 10G networking, is routed by the USW-24-PRO (10.255.253.2)
10.255.253.3, is one of my internal routers for various 10G subnets. BUT- for traffic to reach it, it typically goes through the UXG-Lite, due to... Unifi's crappy layer 3 switch support.
The next question..... Where is this packet loss coming from?
Starting at the USW-24¶
Since- Unifi gives basically no details of use via the GUI / Interface, lets dig into the switch itself.
(UBNT) #show interface ethernet all
Port Bytes Tx Bytes Rx Packets Tx Packets Rx Utilization Tx (%) Utilization Rx (%)
------ -------- -------- ---------- ---------- ------------------ ------------------
0/1 45309732962 78659561832 50079930 67805477 1 12
0/2 53058446789 1779656951 47792959 9648488 0 0
0/3 38979425716 39846249001 51526480 51914192 0 0
0/4 1173970958 174831946 2073311 1073446 0 0
0/5 8343943807 109783344193 18746036 76815417 0 0
0/6 24066682019 537929824 16833100 1724027 0 0
0/7 0 0 0 0 0 0
0/8 0 0 0 0 0 0
0/9 0 0 0 0 0 0
0/10 95260623684 141818291159 102760702 128171969 1 13
0/11 0 0 0 0 0 0
0/12 0 0 0 0 0 0
0/13 4861040 1268167 57094 17508 0 0
0/14 29821762 22472576 175979 47169 0 0
0/15 0 0 0 0 0 0
0/16 4799011 1204238 56230 16791 0 0
0/17 48800564 665190853 501605 566173 0 0
0/18 50779467155 3861213785 40912085 14892259 11 0
0/19 0 0 0 0 0 0
0/20 0 0 0 0 0 0
0/21 0 0 0 0 0 0
0/22 320095833288 61377643284 284889253 183029549 3 0
0/23 6988361559 256695727474 102007331 200475378 0 2
0/24 37968718630 13602255975 57136717 41141882 0 0
0/25 23447404365 23682097617 18507312 22317363 1 0
0/26 399311466370 371416678133 331286549 322562623 0 0
3/1 98368179751 80439218783 97872889 77453965 0 6
3/2 32410625826 110321274017 35579136 78539444 0 0
Looking at the specific interface for the UXG-Lite, does not show anything of concern either.
(UBNT) #show interface 0/10
Packets Received Without Error................. 128826893
Packets Received With Error.................... 0
Broadcast Packets Received..................... 91668
Receive Packets Discarded...................... 18
Packets Transmitted Without Errors............. 103061722
Transmit Packets Discarded..................... 54
Transmit Packet Errors......................... 0
Collision Frames............................... 0
Number of link down events..................... 0
Load Interval.................................. 300
Bits Per Second Received....................... 120304808
Bits Per Second Transmitted.................... 15672624
Packets Per Second Received.................... 10643
Packets Per Second Transmitted................. 3365
Percent Utilization Received................... 12%
Percent Utilization Transmitted................ 1%
Time Since Counters Last Cleared............... 1 day 0 hr 59 min 11 sec
Really- nothing of value to see....
UXG-Lite¶
Nearly instantly, after looking at the port counters- the issue was very clear.
The primary bridge appears to be dropping tons of packets.
root@UniFiNext-GenGatewayLite:~# netstat --statistics
Ip:
Forwarding: 1
29023746 total packets received
25114294 forwarded
0 incoming packets discarded
3747325 incoming packets delivered
29195929 requests sent out
1709 outgoing packets dropped
44 dropped because of missing route
15 reassemblies required
5 packets reassembled ok
1 fragments failed
Icmp:
995633 ICMP messages received
17 input ICMP message failed
ICMP input histogram:
destination unreachable: 259
echo requests: 826570
echo replies: 168804
1080204 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 80875
redirect: 3819
echo requests: 168940
echo replies: 826570
IcmpMsg:
InType0: 168804
InType3: 259
InType8: 826570
OutType0: 826570
OutType3: 80875
OutType5: 3819
OutType8: 168940
Tcp:
78978 active connection openings
31357 passive connection openings
36080 failed connection attempts
3 connection resets received
3 connections established
2494553 segments received
2540942 segments sent out
337 segments retransmitted
0 bad segments received
37405 resets sent
Udp:
142428 packets received
116282 packets to unknown port received
370 packet receive errors
507393 packets sent
370 receive buffer errors
0 send buffer errors
IgnoredMulti: 3072
UdpLite:
TcpExt:
41567 TCP sockets finished time wait in fast timer
67157 delayed acks sent
7842 delayed acks further delayed because of locked socket
Quick ack mode was activated 244 times
4061 packets directly queued to recvmsg prequeue
1405 bytes directly in process context from backlog
TCPDirectCopyFromPrequeue: 16537
950638 packet headers predicted
4 packet headers predicted and directly queued to user
422349 acknowledgments not containing data payload received
576758 predicted acknowledgments
TCPDSACKUndo: 3
12 congestion windows recovered without slow start after partial ack
1 retransmits in slow start
TCPTimeouts: 52
TCPLossProbes: 253
TCPLossProbeRecovery: 2
TCPDSACKOldSent: 244
TCPDSACKRecv: 221
TCPDSACKIgnoredNoUndo: 115
TCPSpuriousRTOs: 1
TCPSackShiftFallback: 1
IPReversePathFilter: 3261
TCPRcvCoalesce: 44097
TCPOFOQueue: 294
TCPSpuriousRtxHostQueues: 17
TCPAutoCorking: 1185
TCPSynRetrans: 107
TCPOrigDataSent: 1469084
TCPHystartTrainDetect: 4
TCPHystartTrainCwnd: 66
TCPACKSkippedSeq: 10
IpExt:
InNoRoutes: 2524
InMcastPkts: 16772
OutMcastPkts: 34423
InBcastPkts: 21278
OutBcastPkts: 17548
InOctets: 24232842554
OutOctets: 45517272843
InMcastOctets: 1118364
OutMcastOctets: 1828988
InBcastOctets: 1251369
OutBcastOctets: 567758
InNoECTPkts: 32780595
InECT1Pkts: 9
InECT0Pkts: 2296
InCEPkts: 2
root@UniFiNext-GenGatewayLite:~# netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
br0 1500 176609 0 18411 0 238268 0 0 0 BMRU
br13 1500 4191 0 0 0 2580 0 0 0 BMRU
eth0 1500 107188671 0 88 415 135637148 0 519 0 BMRU
eth1 1500 45873230 0 3 0 16769419 0 0 0 BMRU
root@UniFiNext-GenGatewayLite:~# ip -s link show br0
13: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether d8:b3:70:8d:e8:b4 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
44763888 177339 0 18520 0 0
TX: bytes packets errors dropped carrier collsns
78755044 238774 0 0 0 0
br0 is the primary interface which receives incoming traffic. Given, its high packet loss- this certainly explains the random latency, and disconnects.
The amount of current throughput, isn't very high- under 100Mbit/s.
top - 19:45:13 up 1 day, 1:54, 1 user, load average: 2.82, 2.38, 2.30
Tasks: 144 total, 1 running, 143 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11.2 us, 4.2 sy, 0.0 ni, 73.2 id, 0.0 wa, 2.2 hi, 9.2 si, 0.0 st
MiB Mem : 974.1 total, 172.7 free, 261.1 used, 540.3 buff/cache
MiB Swap: 768.0 total, 761.1 free, 6.9 used. 530.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2165 root 5 -15 115492 23928 16488 S 18.0 2.4 127:56.77 ubios-udapi-ser
3 root 20 0 0 0 0 S 7.4 0.0 63:56.26 ksoftirqd/0
2729 root 5 -15 671156 7360 2232 S 3.2 0.7 47:30.15 utmdaemon
50759 root 5 -15 380824 35984 22152 S 1.3 3.6 11:34.95 dpi-flow-stats
2083016 root 20 0 8920 3500 2728 R 1.0 0.4 0:00.26 top
3284 root 20 0 101796 16896 9444 S 0.6 1.7 31:51.49 mcad
3380 root 20 0 169904 12240 8784 S 0.6 1.2 29:59.01 exe
1836 root 20 0 25060 9744 8272 S 0.3 1.0 5:09.99 utermd
51799 root 5 -15 43288 744 644 S 0.3 0.1 0:22.12 dpinger
2062438 root 20 0 14928 7308 6160 S 0.3 0.7 0:00.59 sshd
2064428 root 20 0 0 0 0 D 0.3 0.0 0:00.72 kworker/u4:3
Top, shows current load average, is pretty high- but, not overwhelming.
root@UniFiNext-GenGatewayLite:~# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: off
TX: off
RX negotiated: off
TX negotiated: off
Looks like flow-control is disabled by default too. So, as a test- I decided to enable it.
ethtool -A eth0 rx on tx on
While- this decreased the amount of packet drops, and the frequency- this did not correct the issue.
Looking through the logs- I notice tons of processes complaining about executions taking too long.
root@UniFiNext-GenGatewayLite:/var/log# tail -n 40 -f messages | grep execution
2024-07-10T19:50:10-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1156ms exceeds expectations in io_context_ext (process-manager-child-exit post event)
2024-07-10T19:50:13-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 557ms exceeds expectations in io_context_ext (nl-neighbors-poll timer event)
2024-07-10T19:50:29-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 554ms exceeds expectations in io_context_ext (nl-neighbors-poll timer event)
2024-07-10T19:50:33-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1047ms exceeds expectations in ubios-udapi-server (single_filter_observer<wan_failover_interfaces_iface_observer> action tunovpnc1)
2024-07-10T19:50:34-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1034ms exceeds expectations in ubios-udapi-server (single_filter_observer<wan_failover_interfaces_iface_observer> action tunovpnc1)
2024-07-10T19:50:35-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1075ms exceeds expectations in io_context_ext (process-manager-child-exit post event)
2024-07-10T19:50:45-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1029ms exceeds expectations in io_context_ext (nl-neighbors-poll timer event)
2024-07-10T19:50:58-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 910ms exceeds expectations in ubios-udapi-server (single_filter_observer<wan_failover_interfaces_iface_observer> action tunovpnc1)
2024-07-10T19:50:59-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 931ms exceeds expectations in ubios-udapi-server (single_filter_observer<wan_failover_interfaces_iface_observer> action tunovpnc1)
2024-07-10T19:51:00-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1001ms exceeds expectations in io_context_ext (process-manager-child-exit post event)
2024-07-10T19:51:01-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 634ms exceeds expectations in io_context_ext (nl-neighbors-poll timer event)
2024-07-10T19:51:17-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 752ms exceeds expectations in io_context_ext (nl-neighbors-poll timer event)
2024-07-10T19:51:23-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: utils-daemon: Routine execution time of 1094ms exceeds expectations in ubios-udapi-server (single_filter_observer<wan_
More importantly, I noticed one of my OpenVPN clients was busy connecting over and over.
2024-07-10T19:51:50-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Got process exit event for process openvpn-raw-1
2024-07-10T19:51:50-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: openvpn: tunnel tunovpnc1 wasn't being tracked
2024-07-10T19:51:50-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientDisconnected server on /vpn/openvpn/raws/1
2024-07-10T19:51:51-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientConnecting server on /vpn/openvpn/raws/1
2024-07-10T19:51:51-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Watchdog will restart process openvpn-raw-1 in 20s
2024-07-10T19:52:11-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Watchdog is restarting throttled process openvpn-raw-1
2024-07-10T19:52:15-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Got process exit event for process openvpn-raw-1
2024-07-10T19:52:15-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: openvpn: tunnel tunovpnc1 wasn't being tracked
2024-07-10T19:52:15-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientDisconnected server on /vpn/openvpn/raws/1
2024-07-10T19:52:16-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientConnecting server on /vpn/openvpn/raws/1
2024-07-10T19:52:16-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Watchdog will restart process openvpn-raw-1 in 20s
2024-07-10T19:52:37-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Watchdog is restarting throttled process openvpn-raw-1
2024-07-10T19:52:41-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Got process exit event for process openvpn-raw-1
2024-07-10T19:52:41-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: openvpn: tunnel tunovpnc1 wasn't being tracked
2024-07-10T19:52:41-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientDisconnected server on /vpn/openvpn/raws/1
2024-07-10T19:52:42-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: signal-out-notifier: Sending to mcad: EVT_VPN_ClientConnecting server on /vpn/openvpn/raws/1
2024-07-10T19:52:42-05:00 UniFiNext-GenGatewayLite ubios-udapi-server[2165]: process: Watchdog will restart process openvpn-raw-1 in 20s
The root cause¶
I decided to pause the failing connection, and to troubleshoot it later.
Interesting enough- as soon as I paused it, I noticed a bunch of interesting.... updates.
Also- while digging around, I noticed one of my DDNS clients was failing. I went ahead and fix it.
Afterwards, load dropped down to 1.5... and all noticeable packet loss stopped.
top - 20:35:47 up 1 day, 2:44, 1 user, load average: 1.52, 1.56, 1.65
Tasks: 143 total, 1 running, 142 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.7 us, 7.2 sy, 0.0 ni, 76.8 id, 0.0 wa, 2.0 hi, 4.3 si, 0.0 st
MiB Mem : 974.1 total, 144.0 free, 264.9 used, 565.2 buff/cache
MiB Swap: 768.0 total, 761.4 free, 6.6 used. 525.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3284 root 20 0 101796 16896 9444 S 6.8 1.7 33:06.78 mcad
2165 root 5 -15 117296 26396 16836 S 6.2 2.6 132:09.44 ubios-udapi-ser
3380 root 20 0 169904 12240 8784 S 3.6 1.2 31:04.44 exe
2729 root 5 -15 671156 7360 2232 S 2.9 0.7 48:56.65 utmdaemon
2122585 root 20 0 8908 3420 2660 R 1.0 0.3 0:00.14 top
3 root 20 0 0 0 0 S 0.6 0.0 65:22.61 ksoftirqd/0
13 root rt 0 0 0 0 S 0.3 0.0 0:18.67 migration/1
1085 root 20 0 0 0 0 S 0.3 0.0 0:08.40 jbd2/mmcblk0p4-
1836 root 20 0 25060 9744 8272 S 0.3 1.0 5:19.49 utermd
50759 root 5 -15 380824 35984 22152 S 0.3 3.6 11:59.51 dpi-flow-stats
2062438 root 20 0 14928 7312 6160 S 0.3 0.7 0:01.27 sshd
2063867 root 20 0 0 0 0 D 0.3 0.0 0:04.59 kworker/u4:0
2079918 root 20 0 260128 16256 13864 S 0.3 1.6 0:05.55 syslog-ng
2090876 root 20 0 0 0 0 S 0.3 0.0 0:02.57 kworker/u4:4
1 root 20 0 100268 10128 7364 S 0.0 1.0 1:12.42 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.11 kthreadd
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
So- the root cause, was due to a single OpenVPN client spamming reconnect attempts, and a DDNS client which was spamming retries.
Next steps? Enable SNMP on the UXG-Lite (Unsupported)¶
Warn
Note- installing software on the UXG-lite is unsupported. Firmware updates will likely remove any customizations you perform.
I have libreNMS already running for every OTHER network device... EXCEPT the uxg-lite.
But- you know what? There is nothing special about the UXG-Lite. Its just an embedded system running a customized verison of debian.
Don't believe me? Look for yourself.
root@UniFiNext-GenGatewayLite:/etc/snmp# cd ~
root@UniFiNext-GenGatewayLite:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
So- I'm going to enable SNMP on this thing.
Info
Yes- I realize SNMP is available for UDM models, and *PRO models.
I'm not paying 400$ for the privilege to "enable SNMP", or to handle 10G routing, when I have a perfectly serviceable 10G Layer 3 switch already.
What is interesting- there is a snmp.conf on the base image.
root@UniFiNext-GenGatewayLite:/etc/snmp# ls -al
total 5
drwxr-xr-x 2 root root 32 Aug 15 2022 ./
drwxrwxr-x 1 root root 4096 Jul 10 20:31 ../
-rw-r--r-- 1 root root 510 Aug 15 2022 snmp.conf
root@UniFiNext-GenGatewayLite:/etc/snmp# cat /etc/snmp/snmp.conf
# As the snmp packages come without MIB files due to license reasons, loading
# of MIBs is disabled by default. If you added the MIBs you can reenable
# loading them by commenting out the following line.
mibs :
# If you want to globally change where snmp libraries, commands and daemons
# look for MIBS, change the line below. Note you can set this for individual
# tools with the -M option or MIBDIRS environment variable.
#
# mibdirs /usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf
Although, there are no snmp services, or daemons. I was unable to locate snmpd binaries either. So, I installed it.
root@UniFiNext-GenGatewayLite:/etc/snmp# apt-get install snmpd
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
snmptrapd
The following NEW packages will be installed:
snmpd
0 upgraded, 1 newly installed, 0 to remove and 36 not upgraded.
Need to get 56.7 kB of archives.
After this operation, 142 kB of additional disk space will be used.
Get:1 https://deb.debian.org/debian bullseye/main arm64 snmpd arm64 5.9+dfsg-4+deb11u1 [56.7 kB]
Fetched 56.7 kB in 0s (165 kB/s)
Preconfiguring packages ...
Selecting previously unselected package snmpd.
(Reading database ... 24527 files and directories currently installed.)
Preparing to unpack .../snmpd_5.9+dfsg-4+deb11u1_arm64.deb ...
Unpacking snmpd (5.9+dfsg-4+deb11u1) ...
Setting up snmpd (5.9+dfsg-4+deb11u1) ...
adduser: Warning: The home directory `/var/lib/snmp' does not belong to the user you are currently creating.
Created symlink /etc/systemd/system/multi-user.target.wants/snmpd.service → /lib/systemd/system/snmpd.service.
After installing SNMPD, I modified the config file, and restarted the service.
root@UniFiNext-GenGatewayLite:/etc/snmp# vi snmpd.conf
root@UniFiNext-GenGatewayLite:/etc/snmp# systemctl restart snmpd.service
And, afterwards, I was able to add it directly into LibreNMS without any issues.
Now- I can centerally report on its stats, performance, and packet loss.
Sure- it might get erased during the next firmware update- but, nothing an ansible playbook can't fix automatically.
Why was this issue so hard to pinpoint?¶
Reason #1- My UXG-Lite does not "SUPPORT" snmp to export data back to a central location. Have its data, in LibreNMS, would have made it much easier to pinpoint the exact location.¶
- This- was corrected by manually installing SNMP, and configuring it.
- As well- I created an ansible playbook to handle automatically reinstalling, and reconfiguring it when I do firmware updates.
Reason #2- I have a overly complex network, with 7 switches, and 4 routers.¶
A large reason for this- is performance-based.¶
- 1G clients can easily use the UXG as a gateway.
- 10G clients, would limited to 1G MAX throughput by the UXG- so, their gateway is the 10G Layer 3 USW-Pro-24
- Unifi's layer-3 support on switches is still, a laughable joke. Still, no IPv6 support. Can only add TWO static routes on a USW "PRO" switch. (Enterprise doesn't have this weird limitation)
- 100G clients, would also be bottlenecked when routing across the 10G. These client's use the Mikrotik for routing. It can do hardware asic routing, and can handle line-speed 100G with hardware offload.
Another reason- is reliability-oriented.¶
One goal of mine- is even when the internet is completely out, and backup circuits are unavailable- my entire network should continue to function normally.
Rather then making the internet-facing UXG-Lite router the "Center" of my network, I prefer to think of it as being just another component on the edge, and not at the core.
Security Concerns.¶
While- the Unifi interface is nice, easy and simple. It does not have supported for getting intricuite with firewall rules.
For my Security, Management, and IOT vlans, I needed a robust solution which was powerful, easy to configure and monitor, and extremely reliable. So- I decided to use my 10 year old EdgeMAX.
This device- has never once failed me. Its CLI is basted on Vyatta (VyOS now), which is extremely powerful, flexible, and performs well.
So- it was chosen to host the sensitive items.
This- was also done- to assist in the event someone gains access into my network. By having all of the management interfaces, physical security, IOT, etc behind a seperate firewaell- it allows me to seperate those devices from the rest of my network, physically. (If- you pretend vlans on the core switch is enough seperation).
Farewall¶
Thats it. Just sharing a brief capture of the issue that I have been trying to track down for weeks. Hopefully- you find something handy in here.