Major Power Outage!! Disaster Recover Series (Part 1)

Posted: December 17, 2014 in Backup and Recovery
Tags: ,

It has been quite a while since my last post so I must start by apologizing a bit. The project I left was reluctant to let me post my experiences on line.  The good news for you is I have joined the University of Colorado and my current supervisor encourages me to talk about my experiences!

Good bye, Department of Defense community; hello academic community!!! This is a community where I thrive and this is the community I plan to stay in!!!

On to today’s post!!

Major Power Outage!!!

Now that I have been an Oracle DBA for going on six years, I decided I needed a specialty. Most Oracle DBA’s would prefer their specialty be SQL Tuning, consulting or some other high paying endeavor. I am not one to follow the sexy trends. In fact, I gravitate away from the crowd and focus on the pieces that most ignore or avoid. So what is the most often ignored piece of DBA work? If you can’t quite grasp it and it’s on the tip of your tongue… take a look at my title and it should help you get it.

I decided my specialty should be backup and recovery, or more specifically, Recovery!

On a recent project, I took over a few Oracle database systems that had, to put it nicely, been woefully neglected. This project had endured some major cuts when it converted to its recent program name. Among those cuts were the DBA’s who maintained these systems. In their place, the technical manager assigned system maintainers to “watch” the databases. Three years later, they decided to hire a DBA to take over the database work on these systems. Have you heard the cliche “Disaster waiting to happen?” Well, that is exactly what was getting ready to happen, I just didn’t know the when or the what.

The job of the “System Maintainer” on the Oracle database was to fire up OEM, observe that the database is up, check space on the tablespaces, add a datafile if necessary, and ensure that last night’s backups completed. All of these procedures were detailed in a “Software Maintenance Manual”. If it wasn’t in the manual, it was not to be done. The crux of ensuring the backups completed was simply to peruse the RMAN log and look for “Backup complete”. I was told that was now my job. I was horrified to say the least.

Disaster Recovery Scenarios

I immediately set out to create a few disaster recovery scenarios. My goal was to create scenarios that were highly likely to occur and somewhat easy to simulate. I took to my test database that was currently being backed up and corrupt a datafile, a tablespace and the database as three separate scenarios. I shall publish a post to address each of these situations and I hope you find it as interesting as it was to me. These are not new concepts but they are critical to your job retention if you plan to make a career out of database administration.

All Hallow’s Eve

I simply don’t believe in superstition or coincidence. As fate would have it, my real life disaster occurred on Halloween. On or about 10:00 on Halloween, as I was performing my morning checks, the power flickered. The next moment, the power to the facility was completely wiped out. No problem! Our facility has generator backup to the server rooms. It should be restored within seconds. What did happen within minutes was a building wide evacuation order. The generators did not kick in. It would be hours before generator power would be restored. At about 11:30, all but essential personnel were sent home. Within hours, the server rooms would heat up to temperatures nearing 120 deg. Fahrenheit. I headed for my car.

At about 4:30 that evening, I was on my way to take my kids trick or treating in my Mom’s neighborhood. I got a phone call from one of my system leads. Generator power had been restored. The servers had been rebooted. My largest database, which included Oracle Streams to three remote sites, did not come up as a part of Solaris Cluster Control. I was going to work. Mama was not happy. I was secretly thrilled because it was my first opportunity to perform a live database recovery but I didn’t let on.

When I arrived at the facility at around 6pm, facility power was fully restored via generator. I was shown a very strange error string gathered from the server logs and alert logs. I ignored it and immediately tried to start up the database.


SQL>startup

ORA-00600: internal error code, arguments:

[kcatr_nab_less_than_odr],[1],[18446],[27840],[27890]


No huge surprise here. This just so happened to be the same string the system lead pulled up from his logs. What was surprising was I had never seen this string before. Here I was heading to the internet while Rome was burning.

Within about fifteen minutes of browsing, a particular article jumped out at me. The result string included something like “Power outage”  “ORA-00600: internal error code, arguments: [kcatr_nab_less_than_odr],[1],[18446],[27840],[27890]”

What an incredible stroke of luck!! I printed out the entry and set out to try it! In my less than expert opinion, the error string was pointing to a log that needed to be restored. The following text is almost a direct quote from the web page and proper credit should be given to the author…


[oracle@blablasrvr01]$ sqlplus / as sysdba

SQL> startup mount;

SQL> show parameter control files;

the estimated purpose of this query is to make sure there is a control file that is not                                                              corrupted. As fate would have it, my control file was ok.

SQL> select a.member, a.group#, b.status from v$logfile a, v$log b

where a.group# = b.group#

and b.status = ‘CURRENT’;

oravol5/qualified/path/redo12.log

oravol6/qualified/path/redo12.log

** I wrote down both redo log paths even thought they appeared to be identical redo log files.

SQL> shutdown abort;

At this point you should take an OS backup of the controlfile (This is to ensure we have a backup of the current state of the controlfile).

SQL> startup mount;

SQL> recover database using backup controlfile until cancel;

** Here, I saw a prompt. You are prompted for the redo logfile you wrote down earlier. Since they appeared to be identical, I chose the odd one for no particular reason

oravol5/qualified/path/redo12.log — hit enter. I waited for approximately ten minutes. It felt like an eternity but I finally got a prompt.

SQL> alter database open resetlogs;

** The web page had two options and I would highly recommend you read about the other option which documents the process to recreate the controlfile using a controlfile recreation script.

Thanks for reading!!!

Jason

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s