
PROBLEM: The servers in two clusters keep losing heartbeat connectivity with each other, which causes database outages. The outages are brief but disruptive.

SETUP:

  • There are two clusters of three servers each.
  • Each server has one heartbeat NIC connected to a single Layer 2 switch (Catalyst 2950), with the switch ports hard-coded to 100Mb/full-duplex.
  • The DBAs confirm that each heartbeat NIC is hard-coded to 100Mb/full-duplex.
  • Both clusters' heartbeat NICs are in VLAN 100 and in the same subnet (10.40.60.0/24); a representative port configuration is sketched below.
  • The management IP address is on a separate subnet (10.40.1.0/24), and its switch port is in VLAN 1.
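
For reference, each heartbeat port on the 2950 is configured roughly like this (Fa0/13 as an example; the description is representative rather than exact):

    interface FastEthernet0/13
     description cluster heartbeat
     switchport mode access
     switchport access vlan 100
     speed 100
     duplex full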

SYMPTOMS:

  • I see an ever-increasing error count on the switch ports. For the three servers in one cluster, input errors (all CRC) run at about 3% of total input packets; the other cluster is at about 6%. There are no output errors. (The commands I'm pulling these counts from are listed after the log excerpt below.)
  • Transmit and receive load on the switch ports is light, under 20/255 on txload and rxload.
  • The switch log shows the switch ports bouncing:

    May 16 11:15:31 PDT: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/13, changed state to down
    May 16 11:15:32 PDT: %LINK-3-UPDOWN: Interface FastEthernet0/13, changed state to down
    May 16 11:15:34 PDT: %LINK-3-UPDOWN: Interface FastEthernet0/13, changed state to up
    May 16 11:15:35 PDT: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/13, changed state to up
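
The counts above come from the usual show commands on the 2950 (Fa0/13 being just one of the six heartbeat ports):

    ! per-port error and load counters
    show interfaces FastEthernet0/13
    ! breakdown by error type (FCS, align, etc.)
    show interfaces FastEthernet0/13 counters errors
    ! link up/down history for the port
    show logging | include FastEthernet0/13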

TROUBLESHOOTING STEPS PERFORMED:
  • I replaced the old Cat5 cabling between the server heartbeat NIC and the switch with new Cat6 -- no effect.
  • I created a new VLAN 200 in a new subnet (10.40.61.0/24) and had the DBAs re-IP their heartbeat NICs on one cluster (the switch-side change is sketched below) -- no effect.
  • We tried every combination of speed and duplex on the switch port and the NIC -- no effect, went back to 100Mb/full-duplex on both.
  • The DBAs upgraded the Broadcom drivers on both clusters to the latest version -- the error rate on the 6% cluster dropped to about 4%; the other cluster is still at about 3%.
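
For the VLAN test above, the switch-side change was roughly this (VLAN name and interface number are just examples):

    vlan 200
     name heartbeat-test
    !
    interface FastEthernet0/13
     switchport access vlan 200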

MY PROPOSED NEXT STEPS:

  • The servers also have Intel NICs. Try moving the cluster heartbeat to an Intel NIC -- maybe it's a Broadcom issue?
  • Change out the switch for a gig-capable switch. There is a Catalyst 3560X available, but taking it will delay a project. Maybe gig on the switch port and NIC will play nicer? (A sketch of the port config I'd use is below.)
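
If we do take the 3560X, my understanding is that 1000BASE-T requires autonegotiation, so I would leave the gig ports at auto rather than hard-coding them -- roughly like this (interface and VLAN numbers are placeholders):

    interface GigabitEthernet0/13
     description cluster heartbeat
     switchport mode access
     switchport access vlan 100
     speed auto
     duplex auto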

THOUGHTS?

Is there something I can configure on the existing 2950 switch to mitigate the errors? What additional troubleshooting steps should I take?

VMEricAnderson

2 Answers


CRC errors are often a cabling problem. Here is what I would check next before swapping out hardware (a quick way to verify whether each change helps is sketched after the list):

  • Are the servers connected directly to the switch, or do they go through some sort of infrastructure cabling? If the latter, get the infrastructure cabling re-certified.
  • If you have a real cable tester (not just a simple continuity tester), test the cables.
  • If the cables are hand-made, replace them with factory-made cables. I often run into these kinds of issues with hand-made cables.
  • Check whether there is any source of EMI near where the cables run. Re-path the cables, even temporarily, to make sure they are kept away from power or other EMI sources.
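
Whichever of these you change, you can tell fairly quickly whether it helped by zeroing the port counters and watching the error delta under normal heartbeat traffic (interface number as an example):

    clear counters FastEthernet0/13
    ! let it run for a while, then check the delta
    show interfaces FastEthernet0/13 | include input errors|CRC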

Beyond that, I would start with the NICs, as you already indicated. It could be that you got some from a bad run.

YLearn

I would recommend testing by moving to the Intel NIC, as you have proposed. I have run into similar issues where a small percentage of the traffic showed up as input errors. We troubleshot the problem by placing a dumb hub between the server (in my case it was cameras) and the switch. If the switch no longer sees any input errors, the problem is the server NIC.

I tried many of the same steps that you have proposed. In my case it turned out to be a bad manufacturing run, and the only thing that corrected the issue was replacing the NICs (the cameras).
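
One caveat if you try the hub test: a dumb hub is a half-duplex device, so a switch port that stays hard-coded to 100Mb/full-duplex can log errors of its own against it. For the duration of the test I would set the switch port (and the server NIC) back to autonegotiation, roughly like this (interface number as an example):

    interface FastEthernet0/13
     ! for the hub test only -- restore speed 100 / duplex full afterwards
     speed auto
     duplex auto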

henklu