13

I've got monitoring set up on several devices in our office. The ping response time to the small access switches is commonly 1-4ms... As of 3AM this morning, this has skyrocketed to 300ms on average.

Where does one start looking in a situation like this? What things can I observe in the switch to find the source of latency?

NOTE: It's not load-related. All the links' bandwidth usage is normal and unaffected; most links are very under-utilized. Also, monitoring is local to the devices reporting the latency, so there is no WAN factor here.

A L
  • 3,310
  • 9
  • 33
  • 55
  • 3
    Assuming this is a Cisco IOS switch... Please post `show proc cpu history` for the switch with the high ping-times. If that CPU is consistently high, or spiking high on a regular basis, run `show proc cpu sort` – Mike Pennington Jun 27 '13 at 21:36
  • Is the latency only towards the switch control-plane or do you get same latency when you ping something behind the switch? – ytti Jun 28 '13 at 07:08
  • @MikePennington - http://imgur.com/a/gfX9q#0 - this is very cool! Looks like it spikes pretty high up consistently though on average it's low .. – A L Jun 28 '13 at 15:52
  • @Ytti - didn't mean to post this on a separate line .. anyway - So I dug deeper into this. cp <-> cp response is actually low from distribution to access, or at least was at the time I tested. From an access level port to the devices on the access layer switches is where we're seeing the extreme latency. – A L Jun 28 '13 at 15:54
  • @user1353, thank you... that imgur that you posted is not consistently high enough to cause consistently increased ping times from CPU on that switch – Mike Pennington Jun 28 '13 at 16:09
  • @Ytti - it's only hitting stuff behind the access layer switches.. ex: DISTRIBUTION control plane <---1ms---> ACCESS control plane <---300-500+ms---> endpoint device. The sh proc cpu hist was from the distribution layer device, which our monitoring system has to go through to hit the access layer stuff. – A L Jun 28 '13 at 16:12
  • I'm sorry, it's still not crystal clear to me. So you have high latency when you ping the switch /AND/ you have high latency when you ping devices passing _through_ the switch? – ytti Jun 28 '13 at 16:23
  • No worries, I could be explaining it in an unclear way. Regardless of everything that's been said - as of now, we have identified that the latency is from the access layer switches, to their endpoints. This is measured by ping from the access layer device to a given endpoint that's directly connected to it. Yesterday we hadn't narrowed down where the latency was from, now we know where the latency is we just don't know WHAT is causing it. – A L Jun 28 '13 at 19:07
  • I should add that - the only thing between the access layer port and a given endpoint is a patch panel and drop to the users desk. I am thinking it could be an issue with the drops themselves, but I find this highly unlikely that so many drops could experience the same physical issue at a time.. Any ideas? – A L Jun 28 '13 at 19:11
  • @user1353, many things are possible... electrical noise could leak into all the systems via noisy power lines, although I find it hard to believe that this is the cause of 300ms latency if you're using Cat5 cabling... on the other hand, if someone stuck HomePlugAV in the mix, who knows what could be happening. Have you validated that the cable runs are good? – Mike Pennington Jun 29 '13 at 02:14
  • No I have not yet, that's next on the list. Thanks Mike for your suggestions, will report back when I know what's up. – A L Jul 01 '13 at 15:12
  • Did any answer help you? If so, you should accept the answer so that the question doesn't keep popping up forever, looking for an answer. Alternatively, you can post and accept your own answer. – Ron Maupin Jan 03 '21 at 02:52

4 Answers

6

First, latency isn't directly tied to bandwidth. There are many reasons why a device would delay a packet other than a congested link.

Have you attempted a traceroute? This will show you the latency between hops, if you're looking for an L3 boundary as a suspect.

You might also check to see if any of the devices in the path have a significant usage of CPU/RAM.
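
A minimal sketch of what that looks like on Cisco IOS gear (the endpoint address below is a placeholder):

```
! On the switch suspected of adding the delay:
show processes cpu history      ! CPU trend over the last 60 s / 60 min / 72 h
show processes cpu sorted       ! which processes are consuming the CPU
show processes memory sorted    ! rule out memory pressure

! Per-hop latency toward an affected endpoint:
traceroute 192.0.2.50
```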

Mierdin
  • 1,841
  • 14
  • 17
  • I would agree with Mierdin and also recommend MTR for continuously running a traceroute in this sort of situation. Wikipedia Link: http://en.m.wikipedia.org/wiki/MTR_(software) – Brett Lykins Jun 27 '13 at 22:26
  • @Mierdin - Thanks for your feedback, so there is no L3 factor here, traceroute shows an initially high response of about 500ms, then 260ms, then 76ms arriving at the device - these are for each try on the same single hop, not for multiple hops. See my comment to MikePennington for the CPU related info. – A L Jun 28 '13 at 15:57
3

If this is just on the LAN, there are a few things you can do to start tracking down what is causing it (rough command syntax is sketched after the list):

  • `show process cpu history` command: if CPU usage is very high, you need to see which process is causing it and perhaps hit Google with the name of the offending process.

  • `show debug` command: a common cause I've found is people leaving debug commands running on the switch. A common favourite was IP accounting being left on devices that were already over-utilized. Use `undebug all` to get rid of the debugs.

  • Give it a reboot: probably not possible during the day, but use the `reload in` command to time it for the night or over the weekend. You'd be surprised how many issues a quick reboot can fix.

  • Shut trunk ports: if it's an L3 switch, another common issue I've seen is too much traffic using the device for routing between VLANs. If possible, temporarily shut some of the trunk ports to see whether that reduces the latency.
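
A sketch of the commands involved, assuming Cisco IOS (the interface name and reload timer are placeholders):

```
show processes cpu history        ! 1. CPU trend
show processes cpu sorted         !    and the offending process
show debugging                    ! 2. any debugs left running?
undebug all                       !    turn them all off
reload in 240                     ! 3. schedule a reboot 4 hours out ("reload cancel" aborts it)
configure terminal
 interface GigabitEthernet1/0/48  ! 4. temporarily shut a trunk port
  shutdown
```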

It's good to be aware that your pings are low priority, both in terms of latency and when being processed by the CPU. It might also be a good idea to double-check your QoS settings and make sure there are no silly mistakes causing this, unlikely as that is.

  • Great feedback, I had already checked the show debug, and a reboot is not possible at this time. – A L Jun 28 '13 at 16:05
2

I use Cacti to monitor bandwidth and OpenNMS to monitor latency. If you are monitoring all the devices linked to this switch, you may see a correlation between usage and the latency (I know you said it is not a bandwidth issue, but you never know). I have seen lower-end switches sag under heavy usage, which causes a lot of latency. Do you have any "dumb" devices feeding this switch that may be the source of the sag, even though this switch is not passing much traffic? Also, with Cacti you may be able to poll the CPU usage, and you may see a spike at the time of the latency.

As mentioned above, MTR or NeoTrace are also useful for keeping an eye on the situation, and you may see where the latency starts, which may not be at this switch itself.
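
As a rough example (the target address is a placeholder), a report-mode MTR run from the monitoring host makes it easy to see at which hop the latency first appears:

```
mtr --report --report-cycles 100 --no-dns 192.0.2.50
```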

Blake
  • 922
  • 2
  • 10
  • 20
0

If this is not happening on the LAN, you could limit the WAN port throughput; this will force better time-division (TDM-like sharing) of the link. Try something around 80% of your maximum throughput and see if it helps. You may need to tweak this depending on the number of terminals.
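
On a Cisco IOS router, for instance, that could be a simple shaper on the WAN-facing interface (a sketch only; the policy name, interface, and the 80 Mb/s figure for a 100 Mb/s uplink are placeholders):

```
policy-map SHAPE-WAN
 class class-default
  shape average 80000000          ! ~80% of a 100 Mb/s uplink, in bits per second
!
interface GigabitEthernet0/0
 service-policy output SHAPE-WAN
```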

  • As I understand it, the OP has clearly stated in the note that this is not load-related. –  Dec 02 '17 at 13:20