Got a difficult large network issue, wireless on new ProCurve setup

YeOldeStonecat

I have this over at Ubiquiti's forums...copy 'n paste here...
I'm leaning towards...some issue with the stacking of the ProCurve switches...maybe someone who lives and breathes enterprise switch setups can chime in (calling netwizz?)

******************************************
So I've had this larger network in place for a few years; it's in a spread-out building. It used to have a bunch of those Linksys/Cisco small business series SRW switches...switches uplinked with fiber at a gig, from the wings of the building...to the central area where the "core switch" was. About 2 years ago we put in 17x Unifi APs.



It had run great for a couple of years. In some spots we have the ToughSwitches powering the UAPs...in other spots...a few solo APs just ran from PoE injectors. Had 2x SSIDs.



Network was 192.168.9.xxx



They are moving to an IP phone system, adding many more computers, and needing a faster network. So a combination of 2 things happened...

*They have wiring guys come in and add runs, relocate runs

*We come in...remove the Linksys/Cisco SRW switches...and put in HP ProCurve 2900 series (I think they were 2900 series)...a 24-porter at the top of the rack, 2x 48-port models under it...and the 10 gig switch uplinks across those 3 switches. And in the 3 wings of the building...3x more 48-port models, with the 10 gig fiber uplinks.



Within a day or so...reports of the wireless not working in many areas. We've had them redirected to our cloud controller...which we have about 40x of our other clients on...running fine. This client has been on this cloud controller of ours for ~2 years fine.



Since we switched over to the ProCurves...wired computers...desktops...run great. No problems. But the wireless system...clients just...stop passing traffic. Windows 7 clients get that yellow exclamation mark in the wireless status. Unable to browse/connect to local resources, and unable to browse/connect to the internet.



Sometimes the connection comes back fine. And then...fades back out...

If we reboot the 3x core ProCurves...everything comes back. And it can run fine for a day...or two days...sometimes 3 days. And then the symptoms creep back.



When the wireless symptoms arise...the wired computers are fine.

I set management IPs for the ProCurve switches in the 192.168.168.xxx range.

Routing done at the edge device (Untangle)...

Default VLAN for the production network



When I put the ProCurves in...I added a 3rd SSID, for "Guest"...with the intention of doing that on VLAN 6.

I hadn't gotten around to setting up VLAN tagging all the way through on the ProCurves yet by the time the symptoms started.



DHCP pool has plenty of room...but regardless, I chopped the lease down from the default 8 days...to 4 days...and then to 1 day.



Our Unifi controller was running 4.something firmware...whatever was current last fall. This morning I upgraded to the latest. No help...symptom returned this afternoon.



When the symptom comes up...if I go to a wireless client...here's the weird part. If I ping a public IP, such as an OpenDNS server (208.67.222.222), or Comcast's DNS server (75.75.75.75)...I get replies.

Pings usually reply about 75% of the time. Other times...they time out. And latency tends to be high.



If I ping the gateway, 192.168.9.1....nada.

If I ping one of the local DCs...the DNS servers for the LAN, nada.

Which explains why if I ping a public DNS name...like www.google.com or www.msn.com...it cannot be found.



I am pinging this from the standard production wireless SSID...which does not have guest policy applied.
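
(Side note for anyone following along: here's a rough Python sketch of how those same checks could be scripted on an affected wireless client, so every drop gets a timestamp to line up against the switch logs. The gateway and public DNS IPs are the ones from this post; the DC address is just a placeholder you'd swap for a real one.)

```python
# Rough connectivity logger - my own sketch, not anything from the Unifi/ProCurve tooling.
# Run it on a wireless client; it timestamps exactly when the gateway/local DNS stop
# answering while the public IPs still reply.
import platform
import socket
import subprocess
import time
from datetime import datetime

TARGETS = {
    "gateway":     "192.168.9.1",     # Untangle LAN interface
    "local DC":    "192.168.9.10",    # placeholder - use a real DC/DNS server IP
    "OpenDNS":     "208.67.222.222",
    "Comcast DNS": "75.75.75.75",
}

def ping(host):
    """One ping; True only if we saw an actual echo reply (TTL= in the output)."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", "2000", host]   # -w timeout is in milliseconds
    else:
        cmd = ["ping", "-c", "1", "-W", "2", host]      # -W timeout is in seconds
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.returncode == 0 and "ttl=" in out.stdout.lower()

def resolves(name):
    """True if the client's configured DNS servers can resolve the name."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

while True:
    stamp = datetime.now().strftime("%H:%M:%S")
    checks = [f"{label}={'ok' if ping(ip) else 'FAIL'}" for label, ip in TARGETS.items()]
    checks.append(f"resolve(www.google.com)={'ok' if resolves('www.google.com') else 'FAIL'}")
    print(stamp, "  ".join(checks), flush=True)
    time.sleep(30)
```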



In the guest network definition, blocked networks...I had added 192.168.9.0/24. I removed that today...symptoms still kept up.



Rebooting the UAPs themselves...usually doesn't do much.

However...reboot the 3x core ProCurves...symptoms go away.



The 3x ProCurves in the server room are "stacked".



I currently have 4x SSIDs set up...2x on the production LAN, and 2x with "guest" policy applied...1 of which is on VLAN 6...untagged on the top ProCurve, which has a patch cable going to a 3rd ethernet interface on the Untangle firewall...which is doing DHCP...for a separate subnet...192.168.10.xxx

But the symptoms were present before setting this up.



Rather early on...we turned off STP.



I'm ripping what's left of my hair out over this...it's acting like a loop is happening on the wireless part.

One odd thing I noticed this morning after I upgraded the Unifi controller...and then pushed out the firmware updates...2 of the UAPs gave a message about "must upgrade parent access point first". But all UAPs are connected via ethernet. It's almost like that "mesh" feature is going on...but none of the UAPs show up in the controller as wirelessly linked. All are wired.



Frustrated. This network ran great on the cheap switches. And we have lots of other clients using ProCurve models...1800 series and a few on the higher series...with Unifi APs all around. No problem with them. We've loved the Unifi APs for years.



Added note...currently (and since the setup 2 years ago)...the UAPs have been on dynamic IPs. But we haven't had a problem letting them be dynamic.
 
Stupid question - but have you double-checked that the firmware is up to date on the new switches?

I have a few UBNT APs running through some ProCurve switches without any issues for about 2 years now. Nothing as complex as what you are talking about though. No fiber stacking, VLANs, etc. Just ran 4 lines between switches as a trunk. Also, no guest network on the Unifis, but that will be set up soon.

Maybe hang an extra AP in there with one simple new SSID and no VLAN (if you can) on it and see if it's still working when the others drop?
 
Couple other thoughts on my drive this morning....

Forgot to mention - if you hang a spare AP, maybe put it in a new site and keep all the extra features disabled (except security).

I also noticed the other day Zero Handoff was enabled. I know that causes some issues elsewhere, and I'm not sure how or when that snuck in there. If that's enabled, see if disabling it helps.
 
Firmware was updated when they received the switches...since then, another release has come out. We had HP network support remote in and check a few things...as I wanted to double-check my "stacking" settings. I'll roll out the updated firmware over the weekend...but it's not a major release.

I have a thread over at Ubiquiti's forums...those more interested can follow additional details there.
https://community.ubnt.com/t5/UniFi...network-17x-APs-multiple/m-p/1523795#U1523795
One guy mentioned disabling the "Uplink Connectivity Monitor"...which checks for a wireless uplink whenever the main wired connection appears down. Since I don't have any wireless mesh going on at this site, I turned it off. BUT...it did make me go "hmmmm"...because, remember in my first post...I noted that when I rolled out the firmware upgrade, a couple of APs popped up an error to upgrade the parent access point first. There should be no parent access point unless it is the client of an AP-to-AP wireless mesh.

Tyler...yeah, the primary SSID (and 2 others) are on the default VLAN.
The Guest SSID was also on that during the initial roll-out...and symptoms started appearing. We then (as planned anyway)...moved the Guest SSID to VLAN 6...going to a 3rd interface on Untangle...on a different IP range with DHCP enabled on that Untangle interface. So with or without VLAN 6...the symptoms were present.
 
I don't work with Ubiquiti, but I have a warehouse where a vendor has 5 of them in place connected to their network. They were put in about 3 years ago and never had a problem until three weeks ago when I changed everything about the network. They changed ISPs and I replaced all the wiring and switches. For a week the APs wouldn't work unless you rebooted the modem, and would drop over and over during the day. The vendor came in and said he had to reset the APs, update the firmware and then reconfigure them. Since he did that, no problems.

I know it's nowhere near your setup, but just thought I would share.
 
I had a funky issue with an oddball managed switch, made by Luxul, and a UBNT AP... I never resolved the issue as the managed switch was unnecessarily complex for the environment (5 workstations, derp) and I just replaced it with a dumb switch. However, I remember investigating the issue on Google and found some leads relating to UBNT VLAN tagging... when the VLAN ID was "1". What's the default VLAN ID for those ProCurve switches? Might be worth doing some google-fu.
 
I am not an expert on VLAN tagging/ProCurves. But here are my thoughts:

I'd be looking into MTU size - had the MTU size gone up to 1500 on the new core switches, and was it lower on the old switches? If packets are being fragmented and retransmitted, that can account for poor and intermittent performance as the packets have to be reassembled. I would try the ping tests to see what MTU is ideal - http://www.tp-link.us/FAQ-190.html et al - or might just whack the MTU down to a valid number in the 1300s to test.
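
If you want to script that ping test from a Windows client, something like this would work - just a sketch, assuming Windows ping's -f (don't fragment) and -l (payload size) switches, with the gateway from your post as the target, but any host past the suspect link works:

```python
# Rough path-MTU probe: binary-search the largest don't-fragment ping payload that
# still gets a reply, then add 28 bytes of IP+ICMP headers to get the path MTU.
# Assumes Windows ping syntax (-f = don't fragment, -l = payload bytes).
import subprocess

TARGET = "192.168.9.1"   # gateway from the post; swap in any host past the suspect link

def df_ping_ok(size):
    out = subprocess.run(["ping", "-n", "1", "-f", "-l", str(size), TARGET],
                         capture_output=True, text=True)
    return out.returncode == 0 and "fragmented" not in out.stdout.lower()

lo, hi = 1200, 1472          # 1472 + 28 = the usual 1500-byte Ethernet MTU
if not df_ping_ok(lo):
    print("Even 1200-byte DF pings fail - something else is going on")
else:
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if df_ping_ok(mid):
            lo = mid
        else:
            hi = mid - 1
    print(f"Largest unfragmented payload: {lo} bytes -> path MTU ~ {lo + 28}")
```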

Random thought - any chance the new switches have port security enabled and are dropping traffic when a certain number of MAC addresses are seen on a port?
 
However, I remember investigating the issue on Google and found some leads relating to UBNT VLAN tagging... when the VLAN ID was "1". What's the default VLAN ID for those ProCurve switches? Might be worth doing some google-fu.

Hmm...curious what that was you found. The Default_VLAN on ProCurves has an ID of 1. The default VLAN for the UBNTs is assumed to be 1; you cannot specify it otherwise...as the custom VLAN options start at 2.
 
So much for the "disable wireless connectivity check" option...I did that this morning, and they just dropped about 30 minutes ago. A restart of the switches brought them all back on.
 
If you did suspect a loop then wouldn't turning STP back on be in order? The main purpose of STP is to ensure that you do not create loops when you have redundant paths in your network.

Is there any way you can get an SNMP switch inspector or use the web interfaces to monitor packet throughput on all the ports? If there is a loop, you might see a couple of ports with very high packet counts.
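
If SNMP is enabled on the ProCurves, even something quick and dirty like this would do it - a little Python wrapper around net-snmp's snmpwalk that dumps per-port speed, octet and error counters (the switch IP and community string below are placeholders). Run it a minute or two apart; the ports whose counters are racing way ahead of the rest are your suspects:

```python
# Quick-and-dirty per-port counter dump using net-snmp's snmpwalk.
# Switch IP and community string are placeholders - adjust for your setup.
import re
import subprocess

SWITCH = "192.168.168.1"   # ProCurve management IP (placeholder)
COMMUNITY = "public"       # read-only community string (placeholder)

OIDS = {
    "ifDescr":    "1.3.6.1.2.1.2.2.1.2",
    "ifSpeed":    "1.3.6.1.2.1.2.2.1.5",
    "ifInOctets": "1.3.6.1.2.1.2.2.1.10",
    "ifInErrors": "1.3.6.1.2.1.2.2.1.14",
}

def walk(oid):
    """Return {ifIndex: value} for one column of the standard interfaces table."""
    out = subprocess.run(["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", SWITCH, oid],
                         capture_output=True, text=True, check=True).stdout
    table = {}
    for line in out.splitlines():
        # lines look like: .1.3.6.1.2.1.2.2.1.10.5 = Counter32: 123456
        m = re.match(rf"\.{re.escape(oid)}\.(\d+) = [^:]+: (.*)", line)
        if m:
            table[int(m.group(1))] = m.group(2).strip().strip('"')
    return table

descr, speed = walk(OIDS["ifDescr"]), walk(OIDS["ifSpeed"])
octets, errors = walk(OIDS["ifInOctets"]), walk(OIDS["ifInErrors"])

for idx in sorted(descr):
    print(f"{descr[idx]:<12} speed={speed.get(idx, '?'):>11} "
          f"inOctets={octets.get(idx, '?'):>12} inErrors={errors.get(idx, '?')}")
```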
 
An update on this...we've noticed a trend which is pointing to the wiring job. Last fall we had our wiring guy come and look at the job, but this client apparently went with the lowest bidder. Twice now...when the APs fell offline, we went onsite, logged into the ProCurves' web admin, and noticed certain ports were chatty as heck with high error counts, and often we'd see those ports dropped down to 10 meg speeds. Here's an odd one...both of those times, the servers' chassis fans were screaming at 100% when we walked in. And the air conditioning in the server room was working fine. We unplug those cables...immediately the server fans drop to normal speed, and the wireless springs back to life.

Something is horribly wrong with the wiring job those guys are doing. They didn't even label the ports in the patch panel! And I asked for that many times. I also asked that each run be verified/certified once they moved it.

Added note...on most of the ports that the ProCurve reported tons of errors on...it could not pull a MAC address from the other end.

Last night (that was the second time I referenced above)...we just administratively shut off 8 chatty ports. Come Monday...I'm sure some people will call and complain of no network access...and I'll have them call the budget wiring guys.

It's turned into a pissing contest...we get egg on our face "cuz the network ain't working". We had HP enterprise networking support remote in and check our config...all good. I had help from Ubiquiti...forums and staff...config is good! We've put a lot of time and effort in due to this...seriously like 35 or 40 hours...often after hours, between 2 of us. Got the logs from the switches to prove the wiring faults. Going in next week with our wiring guy's Fluke cable tester to further back up our argument.

The delicate part is...they've been a long-time AYCE client, and I'm in the middle of renegotiating the $2,300/mo up to $4,200/mo. So I don't want to be a jerk and force an invoice on them for this overage. Gotta cool my jets and get a meeting with the CEO to work on this.

My lesson learned from this...if a wiring job isn't being done by our guy, I want proof that those network runs are certified and good before I put my equipment on it.

**Added note...odd how a bad network run plugged into a switch towards the bottom of the rack affects the fans of the server, which is plugged into the core switch at the top of the rack.
 
My lesson learned from this...if a wiring job isn't being done by our guy, I want proof that those network runs are certified and good before I put my equipment on it.

I'd say it's not worth risking losing the account by giving them an unexpected bill. Even though I think it would be fair to do so, they probably won't see it the same way. Take what you learned, and add the above to the renegotiated contract and all your new AYCE contracts.
 
Funny, every issue I have had like this recently turns out to be wiring...yet I did not think wiring when I read your OP.
 
I didn't chime in to start with, but when I read your OP, I thought that there must have been some sort of TCP flooding going on, and was going to suggest a faulty NIC. I was going to suggest a Fluke Network Analyzer, but those things require a second mortgage for most of us, and we don't get the return on investment.

Andy
 
My lesson learned from this...if a wiring job isn't being done by our guy, I want proof that those network runs are certified and good before I put my equipment on it.

Been burned way too many times with the assume thing. I know you don't do wiring, but you might think about getting a low-end tester to have in your bag. Not one of those blinky-light things, as they are useless. Personally I bought a JDSU NT1150, but that might be more than you want to spend and it's discontinued anyway. You can get an RWC1000 for around $350 which will probably give you decent results. If I suspect a premises wiring problem, I'll toss it on a couple of suspect drops and show the results to the customer. No delays, muss or fuss. Well, the customer might fuss but it won't be about me. LOL!!!
 
So...the symptom kept lingering, the wiring guys came back, redid, verified. Still playing "whack-a-mole" here...occasional interruptions, we find a noisy port, kill it, network returns...

...started noticing more random ports. They'll be at 10 megs, noisy with broadcasts. We find a relatively new PC, but in sleep mode. Wake it up, and it renegotiates at 1000.
...came across this post today...
https://supportforums.cisco.com/dis...17-lm-nic-causes-broadcast-storm-and-high-cpu

BOOM! There it is!
 