Tuesday, October 9, 2007

I do NOT heart Motorola

Alright, I'm a zombie wreck this morning and a bit bitter about it, so please excuse the rant. I try to be ok with losing sleep for the sake of my daughter, but when it's just some vendor's bug in their router code that keeps me up til 2am (well, 3am b/c I was so horked off I couldn't go to sleep right when I got home), well that's another matter entirely.

See, apparently our mid-sized cable company is the _only_ one that has discovered this recurring routing issue on their plant. And I'm the only one in my office who gets stuck working on this issue every time it comes up (mostly because the others don't know how to recognize it, so they blow it off until I get a call about it on one of my night or weekend shifts).

Let me see if I can explain (there's only one guy at work who understands what I'm talking about, but I need to get this off my chest).

In this example, we have two customers using our Internet service on two BSRs (large routers with RF capability) in different parts of town. Now, each router has different ranges of addresses available for the customers to use. Usually there is at LEAST one static range (pay extra for this), and usually a few ranges they can use dynamically. Now generally this issue presents as a subscriber trying to remotely administer a server elsewhere on our network, or maybe VPN into a business in another part of town. They are either able to do so just fine from a third location, or are able to do so until they dynamically pull an IP from a new range. So once the helpdesk makes sure the customer hasn't blocked themselves out with a firewall entry or by specifying the wrong subnet on the static assignment for the server, everyone involved thinks this is a simple routing issue (bad subnet somewhere?) or maybe we moved one of the scopes around and left a remnant on the old BSR, and now it's black-holing the traffic.
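For anyone wondering what the "wrong subnet on the static assignment" check amounts to, here is a minimal sketch in Python. It is only an illustration, not what the helpdesk actually runs: the function name is made up, and the addresses come from the reserved documentation range 192.0.2.0/24, not from any real scope.

```python
import ipaddress


def gateway_in_subnet(ip: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway sits inside the subnet implied by ip/netmask.

    A server configured with the wrong mask can end up believing its
    gateway is off-link, which looks a lot like a routing problem.
    """
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    return ipaddress.ip_address(gateway) in iface.network


# Hypothetical static assignment (documentation addresses only).
print(gateway_in_subnet("192.0.2.200", "255.255.255.128", "192.0.2.129"))
print(gateway_in_subnet("192.0.2.200", "255.255.255.192", "192.0.2.129"))
```

The first call uses a mask that puts the server and gateway in the same subnet; the second uses a mask that is too narrow, so the gateway falls outside it.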

Since routing and managing the IP address allocations are both my jobs, this eventually ends up in my lap. I've only been working in this department for about 17 months, and this is the 5th time this has come up. Four of the 5 times, when working on the issue, usually with Motorola, one of the BSRs gets rebooted, and the problem goes away. BUT I do NOT like rebooting a router with 10,000+ subscribers on it, causing a 30-minute downtime for all and requiring about 4% to manually cycle the power on their modems to regain service - all for an issue that's ostensibly affecting fewer than five subscribers. And then doing it again sometime in the next 3-4 months. We have enough outages as it is without some bug increasing the downtime.

Unfortunately, in order to PROVE to Moto that this is THE code problem, I have to do a ridiculous amount of testing - to PROVE it's not a setting in OSPF, or a piece of equipment between the BSRs with a static entry that's black-holing the traffic. The best (*sarcasm*) part of the issue is that on first look, it's fine, because you can ping from the gateway address of the initial range to the gateway address of the second. So, routing looks good. It's only when you involve the CPE (customer premises equipment) devices that respond to ping that you can get a picture of the problem. Now, usually the device with the static IP is blocking ICMP traffic (not a bad idea) even if the other devices involved in the problem aren't, so I have to randomly ping devices in the affected subnets until I can get a decent test sample. THEN, I ping from each gateway to each OTHER gateway on the affected BSRs and on a sample BSR (control group), and ping the CPEs I found responding to ping earlier. And then to seal the deal, we try to get subscribers on the affected BSRs to ping the gateways and CPE devices.
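The test plan above is basically a source-by-target ping matrix, and it can be sketched roughly like this in Python. This is an illustration under assumptions, not the actual procedure: every name here (`build_matrix`, `failed_pairs`, `to_csv`, the `probe` callback) is hypothetical, and the probe is left to the caller because sourcing a ping from a BSR gateway address has to happen on the router itself, which this sketch does not attempt.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def build_matrix(sources: List[str], targets: List[str],
                 probe: Callable[[str, str], bool]) -> Dict[Pair, bool]:
    """Probe every source -> target pair (skipping self-pings) and record the result."""
    matrix: Dict[Pair, bool] = {}
    for src in sources:
        for dst in targets:
            if src != dst:
                matrix[(src, dst)] = probe(src, dst)
    return matrix


def failed_pairs(matrix: Dict[Pair, bool]) -> List[Pair]:
    """Return the pairs that did not answer, in the order they were tested."""
    return [pair for pair, ok in matrix.items() if not ok]


def to_csv(matrix: Dict[Pair, bool]) -> str:
    """Render the matrix as CSV text that can be opened in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "target", "reachable"])
    for (src, dst), ok in matrix.items():
        writer.writerow([src, dst, "yes" if ok else "no"])
    return buf.getvalue()


# Stand-in probe with made-up labels: gateway-to-gateway works,
# but one gateway cannot reach a CPE behind the other BSR.
def fake_probe(src: str, dst: str) -> bool:
    return not (src == "gw-a" and dst == "cpe-b")


results = build_matrix(["gw-a", "gw-b"], ["gw-a", "gw-b", "cpe-b"], fake_probe)
print(failed_pairs(results))
print(to_csv(results))
```

The point of keeping the gateways and the CPEs in one table is that the gateway-to-gateway rows can all say "yes" while a gateway-to-CPE row says "no", which is exactly the pattern that makes routing look fine at first glance.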

I pop the results into an Excel spreadsheet, which now has the results from 3 different instances, so I can send it in when I open the ticket and (hopefully) cut to the chase. Although the solution they found last time didn't help clear up the issue this time, so we're back at square one.

So anyway... that's what I was doing until 2 am last night, instead of going home at midnight. And unfortunately for me, if I want to be remotely functional, I need my 8 hours of sleep. Since my husband leaves for work @ 8:30... and I don't know of many 2-year-olds who sleep past then, any time an issue like this crops up, I get no sleep. (Anyone else have trouble sleeping b/c they're not asleep and they keep themselves up worrying about not getting enough sleep? Beautiful.)

So... that's my rant. I feel better. Still stupid tired, but better.
