This is a story that took place over the past week at a Swedish company whose name is being kept secret for obvious reasons. The narrator is a friend of mine (we both dabble in SGI gear, and I helped debug (pun not intended) his VMware Irix netboot appliance so that it became an out-of-the-box solution).
This is one of those IT stories that give some people trouble sleeping.
Anyway, on to the story. Be aware: there will be a lot of IT talk.
There are also a fair number of translation typos.
[quote="DeBug posting on a geeky forum"]Hi all, here is another story from the Trenches.
Or, more correctly, a self-glorifying story in which I frequently compare myself and my colleagues to fictional characters from sci-fi TV shows while describing a recent IT system malfunction.
Some dialogue and events are slightly exaggerated for dramatic effect.
This story is probably best enjoyed by IT tech aficionados.
In Star Trek: Voyager, Janeway and her crew have a “Year of Hell” in which they fight a losing battle of malfunctions and declining resources. Every action just takes them deeper into trouble and misery. In the end the ship is so crippled that a ragged Janeway simply gives up and rams her ship into the offending timeship, destroying them both, with the last words, “...because this is one year of Hell I’d rather forget.”
We then miraculously find Janeway back on a pristine Voyager, spick and span; the whole incident never happened, as the timeline has been erased.
Well, some trusted colleagues at work and I just had our own “Week of Hell”.
It started with what I knew was going to be a pretty hectic week. I had a full schedule, and of the three guys who work in “Operations” (responsible for keeping our 80 to 100 servers around the world running), only one would be in the office; the other two were away for training or had called in sick.
The lone guy in the office was also trying to juggle a difficult situation at home, so he was working reduced hours and sometimes had to leave on short notice. The stars were not aligned in our favor to start with.
Monday:
Monday was uneventful, but a lot of snow was falling and a cold spell had rolled in, starting to shut down some of the airports in Europe.
Tuesday:
I was up early on Tuesday to allow for the longer commute. My morning would start with giving new employees an introduction to the IT department. They had been flown in from around the world, so I would need to be at work early, snow or not.
A bit later that day I had my monthly meeting with the President of the company, and I reported, “Everything is working perfectly; in fact, it has never worked so well in my 13 years at the company.”
After I left that meeting I could not shake the feeling that perhaps I had jinxed something.
Maybe the gods would take offense and demand a SQL server as an offering? Or send a plague down upon us in the form of a new computer virus that would knock us out?
Thanks to our diligent disaster recovery and prevention planning and work, it had been years since we last had a major outage.
Maybe we had become complacent?
I shook it off as a bad case of the Swedish “Jantelagen”, where we police ourselves into thinking that we are of no special importance and should not be proud of anything.
Fuck it! Everything was working great!
But had one been attuned to the universe, one might have felt a slight shift in the alignment that had started the day before.
In the log files of some of our systems, data was accumulating that indicated subtle changes.
The rest of the day progressed with lots of work, but the guy in the office and I managed to handle everything before I left for some outdoor exercise with a friend of mine. We walk 10 kilometers with 12 kg backpacks to burn some fat while we go over current events.
The fresh snow and the -10°C made the usual route take longer than normal to complete.
I returned home just in time for my 23:00 phone meeting with our US office.
For some reason they never called, but as I knew they just wanted me to perform some account changes, I did that remotely using our remote access systems.
At that predetermined time they were to sit down with their IT guy and tell him he was fired. The account changes were just a precaution to make sure all of his accounts were locked out.
At 01:00 I missed a call on my cell; it looked like it came from the US, but it had no initial 1 as the country code. Odd.
I called back but ended up with a Swedish automated voice: “...cannot connect your call right now.”
It was around 02:00 before I fell asleep.
Wednesday:
The next day brought more snow and cold. A new IT guy from the Sydney office was in for some training, and before lunch we changed the Administrator password for the domain (the master key to all systems) and proceeded to hunt down any IT systems that might be affected by the change. In a perfect world no systems would be affected by this, but there always are some, through oversight or laziness by administrators.
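(For the curious: the hunt mostly means finding services and scheduled tasks that run under the changed account. On a Windows box, something along these lines does most of the legwork from a command prompt; it is only a sketch, and the account-name filter is just the built-in Administrator we changed:)
[code]
rem List services that log on with the Administrator account
wmic service where "StartName like '%Administrator%'" get Name,StartName,State

rem List scheduled tasks and the account each one runs as
schtasks /query /v /fo list | findstr /i /c:"TaskName" /c:"Run As User"
[/code]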
The guy in the office and I managed to find the few systems that seemed to be affected, fixed them in an hour or so, and were quite proud of ourselves that it took nowhere near as much work as the last time we changed the Administrator password.
But one system didn’t work: VMware Lab Manager, used by our R&D and QA departments.
Luckily that system was not yet in “operational” status, so outages were acceptable.
I fixed a Remote Desktop problem on several servers and then worked the Lab Manager issue for most of the day, but was not able to fix it.
The guy in the office said something about one of the minor SQL servers and Dynamics AX having had a problem, and that he had fixed them, but I wasn’t really listening; one of the servers had lost its network connection in vSphere or something.
Late in the afternoon the people in the US office woke up, got to work and then promptly called me.
They had a minor problem: a new employee was not showing up in the global address book of the mail system.
His mailbox worked but they wanted him to be visible in the address book as soon as possible.
It was getting late in the afternoon when I called my Exchange guy; he was the one away for training, but I figured the training was done for the day.
“Yeah sure, I can have a look at it,” he replied on the phone. “It’s rebuilt a few times a day, but if they want it now I can fix it. Let me just get connected to the network.”
He called me back 10 minutes later. “I’m at my hotel in front of my computer, but I can’t connect to the mail server with Remote Desktop,” he said.
“Oh yeah, there seems to be some odd problem with Remote Desktop and DNS registration after the password change. I have fixed the problems on the DCs, but the mail server probably needs to be restarted to get Remote Desktop working,” I said.
We then discussed how best to proceed, but as the majority of the European staff would now be going home, a reboot of the mail server seemed OK; their Outlooks were forced into “Cached Exchange Mode”, so most of them would not even notice the five minutes of downtime.
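(The usual first aid for this kind of DNS registration hiccup, for anyone following along at home, is to re-register the host records by hand and check that the Terminal Services service is alive; roughly the following, run on the affected box. The RDP side is the part that tends to want the full reboot anyway:)
[code]
rem Flush the local resolver cache and re-register this host's records in DNS
ipconfig /flushdns
ipconfig /registerdns

rem Check that the Terminal Services service is at least running
sc query TermService
[/code]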
I then went into the server room, logged in to the server using a real keyboard and screen, and restarted it using the reboot option in the shutdown dialog.
I had just started to dig a hole for us.
My mail guy called me back 15 minutes later: “Did you really restart it? I still can’t get in.”
“Yeah sure, hang on while I go to the server room and have a look,” I said.
The screen in the server room was still showing “Shutting down applications”, and we decided to give it a few more minutes. After a few minutes it was still sitting there with the same dialog.
I checked my Outlook and, to my surprise, it was still operating and handling e-mails.
I told my mail guy, and he said, “Well, not in my Outlook. It seems to be working in the mailbox but not in calendar view, so we seem to be stuck in a semi-working mode. It has hung at that same place a few times before, and I had to kill it using the power button.” We then discussed killing it with the power button rather than leaving it hanging in a semi-working mode. The server’s RAID controllers have battery-backed caches that preserve unwritten data across reboots, and NTFS is a journaling file system, so we should be good.
I then pressed and held the power button for several seconds until it went down, then started it up again.
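(The cheap sanity check after a kill like that, by the way, is the NTFS dirty bit; if the volume took the hit, the bit is set and chkdsk will want to run at the next boot. Just a sketch; substitute your own drive letters:)
[code]
rem Query whether the volume was marked dirty by the unclean shutdown
fsutil dirty query c:
[/code]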
The server started its boot sequence, and when it passed the RAID controllers in the BIOS, the P400 RAID card that handled the small system disk reported that it had unwritten data it would soon commit.
The P800 controller handling the external disks that hold all the mailboxes did not report anything.
Windows came up, but instead of briefly passing a few status screens and presenting the login prompt, the “Applying computer settings” screen stayed.
And stayed.
I called the mail guy and he tested his Outlook: contact with the mailbox, but uncertain whether the calendar worked.
We decided to wait it out.
After 15 minutes it still had the same screen up. We then discussed what the cause could be; this is the stage where Group Policies and local policies are applied, so we went over the scenario that there was some problem there, but soon ruled out bad GPOs.
We then decided to try a second reboot. Once more I had to kill it with the power button, and I started to feel really uneasy; you definitely do not want to kill a SQL or Exchange server that way, but we had no alternative at the time.
After the second or third reboot, and after playing out all sorts of troubleshooting scenarios between us, we were out of ideas and started thinking that maybe it was time to throw in the towel and do a controlled fail-over of the entire mail system to the fail-over server.
That’s why we had built it, after all.
So we decided OK, let’s do that!
My mail guy came back on the phone after logging in to the fail-over system at the remote location, stuttered and muttered to himself while fighting some other problems with that server, and then said slowly, “It seems we have had no sync since Monday. If we fail over, we will lose all work performed in Outlook and all e-mails sent or received in the last three days, for all 450 people. Do we really want to do that?” There was some silence while we started to understand that this was no ordinary problem we could quick-fix our way out of.
We had never really been worried until now.
Now it started to dawn on us that our “magic card”, the fail-over system, would be costly to use.
We checked the backups and noticed they had not run since Monday either, so we would be even worse off trying to restore the server from backup.
We could not fathom this.
Both the backups and the fail-over system had been broken for the last few days. What were the odds of that?
It was now 23:00, and it was with heavy steps that we decided to call it a night and let the server take as much time as it wanted to get past “Applying computer settings”.
Hopefully it would get past that during the night; then we would be able to log in to the server, fix the replication issue, and fail it over, or as a second option try to get backups going. But that was a long shot; trying to restore a server from backups taken when it is already faulty is never a good option. Or, as a distant third option, we would find and fix the problem and continue running from the server.
The strange thing was that it was actually “working”; at least it sent and received mail, and people’s mailboxes were visible. We figured the Kuala Lumpur and Sydney offices would be OK when they started working soon, and when they signed off we would be in early in the morning, before people in the European offices logged on. In that small time slot we could fix the problem.
As I drove home that Wednesday I knew that the stakes had been raised and that the ante was increasing by the minute. Every new transaction made would be lost if we decided to do a fail-over. All the mail that Sydney and KL processed tonight, we would erase in one instant. I’m not sure the staff and managers would forgive me for such a thing.
Thursday:
Although I had the alarm clock set for 05:00, I was not able to get out of the house until 07:30, my body heavy and slow.
At this point no one had really understood that we had a mail problem, so if we could fix it during the morning we would be in the clear.
When I got to the office at 08:15 the server room door was already flung open, and the operations guy was hanging in the doorway with the help-desk guy, grins on their faces, the operations guy already on the phone with my mail guy in his hotel room.
Judging by their smiles I thought we were in the clear, but then I saw the server screen still sitting at the dreaded “Applying computer settings”.
I guess it was more a grin of “We are so fucked”.
I talked with the mail server guy and we started to go over options.
We could continue to run the server, and no one would know there was a problem, but we couldn’t log in at the keyboard, we couldn’t log in over Remote Desktop; we couldn’t do anything on it!
Backups were unable to run against it; the fail-over system was unable to sync against it.
So we weighed our options: continue to run the system as it was, or force our way into it and possibly break it even more.
We quickly decided we had to fix the problem; the longer we waited, the more data we would lose.
So we grabbed our shovels and started to dig some more. We would dig until the hole either swallowed us or the dirt ran out.
I went to my whiteboard and wrote down what we knew of the problem, to see if we could apply some troubleshooting à la “Dr. House”:
1. Server doesn’t accept Remote Desktop.
2. Server does run mail services.
3. Hangs at “Applying computer settings”.
The operations guy then came by and said that he had been able to log in!
“I tried rebooting it without the network plugged in, and it came up. I could then log in using cached credentials.”
“So you have desktop on it?” I asked.
“Yes”
“But it is not on the network now?” I asked.
“No, the cables are still out,” he replied.
This was both good and bad news: we had been able to log in, but at the cost that no one could work against the server. Mail at my company was now officially dead.
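(The cached-credentials trick works because Windows keeps the last handful of domain logons cached locally, precisely so you can log in when no DC is reachable. How many it keeps is a registry value; the path below is the standard one, and the default is 10:)
[code]
rem Show how many domain logons Windows caches locally
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v CachedLogonsCount
[/code]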
Help-desk calls would follow in a matter of minutes.
It was now 09:00, and all our European subsidiaries had started working, not to mention the headquarters itself with its staff of 200.
We quickly plugged the network cable back in, and to our relief the databases came online and Exchange, after some housekeeping work, started up its services. Mail worked, and we were logged in to the desktop!
This was progress, we would soon be out of the hole!
Once at the desktop we found several odd things indicating we did not have sufficient access rights; we could not start or stop all the services, and so on. We also found that Trend Micro’s antivirus process was consuming 50% of the CPU. We could not kill this process, nor stop the service, nor disable the service; when we tried, the computer hung and we had to reboot by pressing the power button.
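(None of the normal levers worked, in other words. For anyone unfamiliar, these are the levers, as run from a command prompt; the process and service names below are guesses for our Trend Micro version, not gospel:)
[code]
rem Kill the scanning process outright (process name assumed)
taskkill /f /im ntrtscan.exe

rem Stop the real-time scan service (service name assumed)
sc stop ntrtscan

rem Keep it from starting at the next boot (note the space after start=)
sc config ntrtscan start= disabled
[/code]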
In fact, even though we now had a way in to the desktop, we were no closer to a solution.
We rebooted it with the power button more times than is healthy. I knew that if we continued to press our luck we would corrupt the file system, and then we would have no server at all.
Each reboot also required us to unplug the network cable, which meant we pulled the mail system for the whole organization for 10 minutes at a time, and people had now started to complain about the frequent disruptions.
The day progressed with troubleshooting between me and the mail guy, who had by now abandoned any hope of attending the training he had traveled to Stockholm for.
We worked the problem, hung the server, rebooted, talked on the phone, and I ran to the server room to perform steps and report back to him, and so on.
Then, around noon, he was able to see the desktop for the first time, when we got remote control working using HP iLO.
But that remote control frequently hung, so at times I had to help out in front of the keyboard.
At 14:00 we decided to use our Premium Support from Microsoft and called it in as a critical case.
We gave it the highest priority, the “our company will go under if this isn’t solved” priority. This meant that we now had experts from Microsoft who would work with us around the clock to solve our problem, but I knew it also meant that we had to work around the clock too, or they would downgrade the priority.
On my whiteboard it now said:
1. Server doesn’t accept Remote Desktop.
2. Server does run mail services.
3. Hangs at “Applying computer settings” if the cable is plugged in.
4. Trend Micro consumes 50% of the CPU.
We decided to take one thing out of the equation and shut down the BlackBerry system, as the frequent mail disruptions might cause it to stop working; its log files showed lots of problems, corruption seemed to be building up in its databases, and it would probably stop working soon if we continued our actions.
As the mail server was up at times, we could still inform our users that we were taking the BlackBerry system down, and I did so by sending out a mail to our IT staff around the world. Our PS manager in Paris quickly responded that although she understood the need, a lot of people out there were relying on their BBs to communicate, as they were snowed in at the many airports that had closed or were running with long delays.
So we decided to restart the BlackBerry server; we had now raised the ante some more. We were risking damage to one more server.
The mail and BlackBerry services were now so intermittent that the head of Sales came down, spotted me coming back from the server room on my way to my desk, and asked, half joking, half serious, “Why are you walking so slowly when there is no e-mail?” Then he saw my eyes and stopped talking. I told him, “This is serious; we may not be able to fix the problem. We may lose all mail since Monday.” He said, “Wow,” the irritation in his face changing to surprise and fear.
He turned around and walked away.
Microsoft continued troubleshooting with us and ran into several baffling problems; the connection to the login servers was OK, but some simple actions were denied access.
In the Remote Desktop management console, for example, it said “You are not authorized to run this server”.
As the day went by, the server stayed up for less and less time between outages.
Soon it was not running its services at all, and mail was now completely dead.
I regretted that I hadn’t sent out a worldwide mail about our problems before it went dead. We had to put up a notice on our Intranet and hope people would read it.
When the Chilean and US subsidiaries woke up we had no mail, and they had no working BlackBerrys, which to them is a big deal. They called, and I promised to keep them updated every hour.
Microsoft leaned towards an AD problem, then towards a problem with the teaming of the network cards, then towards Trend Micro, and then towards more domain and AD security problems.
At 18:00 the Swedish guy handed the case over to an American colleague. The new American guy seemed less skilled than his Swedish counterpart, and my mail guy at one point hinted that maybe he should have a brainstorming session with some of his colleagues at Microsoft, but it seemed to go right past him.
At 17:00 the company Christmas party had started next to my room. I was not attending.
At 19:00 we managed to get the server going briefly, and for a few minutes the BlackBerry server spurted out all the backlogged e-mails that had flooded in from our customers, out to our devices.
That coincided with the people at the office leaving for the Christmas dinner; they were a bit tipsy, and some of the BlackBerry owners noticed they had received e-mails.
They were in high spirits and said a few words of encouragement as they passed my room, among them the Sales manager, who said, “BlackBerry is working again, that’s great! You will nail it, I am sure.”
I smiled and played along: “Yeah, we will nail this thing.”
In reality, the only thing I was sure about was that this would be the last time the mail or BlackBerry server was online for a long time.
An hour later our suspicion had turned to the teaming of the network cards. We have had problems with this on other servers, so I tried to start the management tool for the card teaming, but the server hung when I tried. I pressed the power button and initiated a new restart: unplug the LAN cable at the back, reboot, log in, run back and plug the LAN in again. I had done this so many times now that it should have become routine, but it still haunted me whenever I had to forcefully take down the server and make all the disks in the RAID go “kathunk” from the instant power loss.
I then forcefully un-teamed the network cards using Device Manager and deleted the team adapter.
When I then tried to pull up the network card settings, the server hung.
Rebooted and retried. There was no way to set the IP settings.
The server had now lost its original IP settings and was running on a DHCP-assigned address, and there was no way for us to change that.
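(For reference, when the network settings GUI is dead, the normal fallback is to set the address from a command prompt with netsh; the adapter name and addresses below are made-up placeholders. On this box, even that would have needed the machine to stay responsive long enough to run it:)
[code]
rem Set a static IP, netmask and gateway on the named adapter (all values are placeholders)
netsh interface ip set address name="Local Area Connection" static 192.168.1.25 255.255.255.0 192.168.1.1

rem Point it at an internal DNS server (placeholder address)
netsh interface ip set dns name="Local Area Connection" static 192.168.1.10
[/code]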
That meant the server was no longer receiving any mail from the mail gateways, and that the BlackBerry server could no longer talk to it, nor could any Outlook clients for that matter.
The hole that we had started digging 30 hours ago was now so deep that we started to question the decision to work the problem until resolved.
Every step that we had made had only worsened the problem.
We had started the day with a working mail system but now that was gone.
Time passed as our mail guy continued to work with Microsoft on the phone.
At around 21:30 exhaustion started to set in.
The mail guy and I chatted over Google Talk while he was on the phone with MS.
“They want us to remove it from the domain and then rejoin it,” he told me.
“That sounds risky,” I said. “Our only means of logging in to it is by using a cached credential from our domain, and they want us to remove it from the domain?”
“Yeah, I agree it is totally insane but right now I am too tired to think for myself” he replied.
I thought about it for a while and then replied.
“Yeah, me too, but what the fuck? It will surely be the last thing we do on the server, but I am out of options.”
As I walked over to the server room I continued to chat with him on my Android phone, when he typed, “Let’s first create a local account on the server so we have a fallback after we have removed it from the domain.”
So I created a local admin account, logged out of the server, and tried to log in with it, but I was greeted only by a grey background instead of a desktop. I tried starting a new explorer.exe (the process responsible for drawing the desktop), but it did not help.
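(Creating the fallback account itself is the easy part; from a prompt it is two lines. The account name and password here are placeholders, obviously:)
[code]
rem Create a local account (name and password are placeholders)
net user fallbackadmin S0mePassw0rd /add

rem Put it in the local Administrators group
net localgroup Administrators fallbackadmin /add
[/code]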
We had to press the power button for the zillionth time. I flinched so much now when pressing it that I nearly had to do it with my eyes closed.
We figured this fucked-up server probably could not create a new user profile.
There had to be a local account created earlier that already had a user profile.
My mail guy found another account he had used in the past and logged in with that over the remote iLO console, but he had the same problem: no desktop.
I typed on my Android, “The only working account is the cached domain account. Do we dare to remove the server from the domain, and are we really prepared to not be able to log in at all?”
He came back and typed “No, we can’t risk that.”
Although we did not fully understand it at the time, that was the moment we hit bottom. The hole was not going to get any deeper.
We had just drawn the line and said:
"I will not sacrifice the Enterprise. We've made too many compromises already, too many retreats. They invade our space, and we fall back. They assimilate entire worlds, and we fall back. Not again. The line must be drawn here! This far and no further!
We decided we would not take MS’s recommendation to remove it from the domain.
My mail guy asked me to talk with the MS guy directly, as his remote control session was frequently halting, and I, with my direct access to the keyboard, would be more productive.
When I spoke with the MS guy I could tell he was out of options. He said, “My team has identified from the logs you sent in that the Trend Micro process is taking a lot of CPU power. At this time we suspect it to be the problem and would ask you to open a case with them and try to solve the heavy CPU utilization.”
I sensed this was his exit. If I went along with it we would not make it; Trend would take several hours to get a grip on the problem, and by then we would be completely exhausted.
But, strengthened by our earlier decision, I sensed I now had to take control.
I replied, “Well, I could call them, but I don’t know what they would be able to do about it.”
“We can’t kill the process, we can’t stop the service, and we can’t even disable the service from starting up!”
“Have you been in to safe mode and tried to disable it there?” he asked.
“Well, no” I replied.
We then restarted it in safe mode, disabled both the NIC teaming and the Trend Micro service, and rebooted the computer with the cable plugged in.
It came up perfectly and did not hang but instead presented a login dialog.
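(For anyone who wants the recipe: safe mode is F8 during boot on a 2003-era box; on 2008 and later you can also force it in advance with bcdedit. The service name is again a guess for our Trend Micro version:)
[code]
rem (Server 2008 and later) Force the next boot into safe mode...
bcdedit /set {current} safeboot minimal

rem ...do the work in safe mode, e.g. keep the scanner from starting again:
sc config ntrtscan start= disabled

rem ...then remove the safe-mode flag and reboot normally
bcdedit /deletevalue {current} safeboot
[/code]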
By this time I had my mail guy on the phone, and we both cheered when we saw the server present the login dialog. After logging in, everything was working; the synchronization to the fail-over system had already started sending over the backlogged changes. We fixed the IP settings, connected the BlackBerry system, and did a lot of other housekeeping tasks.
It was now 23:00 and we were jubilant. A guy from the US office told me the intranet server was down; I was so tired but happy that this piece of news did not even break my rhythm. I found the problem; it was the same one the IT guy in the office had fixed earlier.
Both my mail guy and I were now chatting with the US office people on GTalk, and jokes were exchanged.
Both they and we were very happy.
People from the office party started to drop in to get their coats before going home. They were drunk and in a happy mood; I was in the same mood, but sober and tired.
As I walked out the door, the last one there was one of the girls who works for me. We said bye-bye and she went to close up shop; she was fairly cross-eyed and slurring when she talked. I laughed to myself.
As I drove home at 23:30 it was snowing heavily outside the city, and I reduced speed.
As I neared home and climbed the 200 m ridge before my hometown, I could not make heads or tails of the grey-white smoke or fog drifting past my windshield.
I was so tired that my sluggish brain could not work out that I was driving through a full-blown blizzard.
But I wasn’t afraid of being stuck or stranded.
In the last few days I had proved I could dig myself out of a hole, and as I sat there in the warmth of my car, driving slowly through the blizzard, I went over the events of those days, and contrary to Janeway, they are not something I would like to forget.
I would not like to relive it anytime soon, but I won’t mind remembering it.
In fact, I believe that in the years to come, far in the future when we are all retired and no one any longer depends on our skills or values our opinions, some people in the office will remember that day when the e-mail system was a bit flaky.
But I, and maybe the great guys on my team, will perhaps remember it as the Week of Hell.
Friday:
On Friday I worked from home, and the three of us who had worked the incident had a telephone conference where we went over everything. The strange thing is that we had practically no tasks left to perform; it was as if the shitty week had never happened.
After we hung up I decided to take the day off; my calendar allowed it, so why not?
I started thinking maybe I should write down our experience?
While I thought about it, I went out to the kitchen.
I noticed the accumulation of cardboard boxes in the corner from all my eBay purchases and decided to clear them out. As I lifted one up, it was not empty at all: it was a full case of Carlsberg Hof beer.
So, less than 10 hours after the shittiest week in ages, I found myself at home, with the day off, and with a free case of beer!
I must have bought the beer some time ago and then accidentally stacked empty boxes on top of it.
I don’t know; either that, or perhaps...
Perhaps someone, at some time, erased an entire time-line…
Live long and prosper![/quote]
Jesus fuck. Here I was thinking I was in trouble because I had almost a grand in Apple IIe computers that were supposed to be in the mail last week. Losing your international mail server is just plain scary.
Edit: For the TL;DR...
[quote="Birkett"]Remember kids, Safe Mode. :eng101:[/quote]
Well Fuck
Damn...
That's gotta suck.
Remember kids, Safe Mode.
Big read, can somebody sum it up please?
[QUOTE=poopsicle;26480358]Big read, can somebody sum it up please?[/QUOTE]
Huge cascading list of problems easily solved by something nobody thought to try until everything was fubar
[QUOTE=birkett;26480268]Remember kids, Safe Mode.[/QUOTE]
oh god this
Why do quotes make the text smaller? *ctrl-plus*
That was an interesting read. My dad always tells me to do these three things:
1. Reboot.
2. Restore.
3. Safe Mode.
So far this methodology works well for him, and he knows less about computers than I do.
I detected the problem within 5 minutes: Windows Server. Also, Exchange.
[QUOTE=nikomo;26487970]I detected the problem within 5 minutes: Windows Server. Also, Exchange.[/QUOTE]
Sorry to break it to you, but WinServer 2003/2008 R2 is better than any Linux server for a company like that.
Was this posted on a Swedish site before being posted on Facepunch? If so, can you link it? I find it more comfortable to read in Swedish.