OEM Incident: Critical Alerts Part 2: Reducing Alerts

Posted: April 12, 2016 in OEM12c, ORA Errors, Uncategorized

Happy Tuesday, Data Evangelists and disciples alike! I have made it no secret that I am blown away by the positive features in Oracle Enterprise Manager 12c. There have been groundbreaking improvements in Performance Management, ASH Analytics, Security, RMAN management, and Tablespace Management. EMCLI is also a very powerful tool in version 12c. I can’t wait to upgrade to 13c, which was just recently released!

And here comes the rant… OEM “Critical” Alerts. Do I want to know about every critical database alert at the very moment that it occurs? Of course I do. You need to rethink your career choice if you don’t.

Merriam-Webster.com defines critical (as it relates to critical alerts) as “being or relating to an illness or condition involving danger of death <critical care> <a patient listed in critical condition>.” Synonyms are crucial, decisive, indispensable, and vital. A scientific definition is “of sufficient size to sustain a chain reaction <the reactor went critical>.”

In other words, I want to know if my database is in critical condition. I want to know if my database is about to go nuclear. That’s my definition of critical. At 2:25 AM on a Monday morning, I don’t want to know that connection time to the TNS listener has exceeded 500 milliseconds. Yet if you as the DBA don’t reduce your critical alerts and you have paging set up for them, OEM will wake you up to tell you this and many other “critical” things. OEM 12c ships with monitoring templates already enabled for almost every metric you can think of. If your organization owns an Exadata, it will alert you on metrics you didn’t even know existed. At this point, you have two choices: disable the default monitoring templates or, as my title suggests, reduce the number of metrics that trigger critical alerts.

The policy of our DBA team is that all warnings and critical alerts are sent to the Oracle DBA distribution list 24 hours a day, seven days a week. All critical alerts are also sent to the Oncall DBA as a page to their cellphone, and the Oncall DBA is required to keep that cellphone charged and nearby throughout the oncall week. This was not always the case. When I started with my current company, no alert reduction had been implemented and there was no paging for the Oncall DBA. We were getting over 1500 OEM emails per day. Thank God paging was not yet implemented! It was the very first thing I noticed when I got all of my accounts and access. The rest of the team had learned to ignore OEM emails and send them to an Outlook rule folder. I didn’t do this right away, so the flood drove me bonkers. I eventually set up a rule, but I still tried to sift through the mail to find the alerts I might need. Within four weeks, I used a classic article by DBAKevlar to set up monitoring templates for our databases and Exadata servers. As you will find out, the default templates that ship with OEM can’t be modified. Although my original templates were eventually replaced, I laid the groundwork for an alert system that lets us all get more sleep and only be disturbed for truly critical situations.
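
If you manage more than a handful of targets, the template work itself can be scripted. Here is a minimal sketch using the EM12c EMCLI template verbs; the template and target names are made up, and you should confirm the exact options with emcli help export_template and emcli help apply_template on your own OMS:

    # Log in to the OMS first (you will be prompted for the password)
    emcli login -username=sysman

    # Export an existing monitoring template to a file for safekeeping or editing
    # ("PROD_DB_Template" is a placeholder name)
    emcli export_template -name="PROD_DB_Template" \
         -target_type="oracle_database" \
         -output_file="/tmp/prod_db_template.xml"

    # Apply the template to one or more targets, given as name:type pairs
    emcli apply_template -name="PROD_DB_Template" \
         -targets="PRODDB:oracle_database"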

Some of the alert metrics I adjusted were Agent metadata channel count, Page Processing Time (OMS), Number of failed logins, Checker found ** new persistent data failures, and so on. I adjusted 14 settings in all, and I’m sure you will find at least that many in your environment.

  1. Alert Reduction: Once you have set up your own monitoring templates and disabled the default templates, you are ready to reduce alerts! Step one in my case was to go to the OEM spam folder I had set up in Outlook and sort the emails by subject. You will quickly see which subjects are creating the most noise. (If you would rather ask the repository than your inbox, see the query sketched just after this list.) I was able to eliminate five or six alerts right away, which reduced email traffic by 90%. I can’t remember which ones they were, but mentioning them really wouldn’t help you; it is just good comic relief, and even less relevant than TNS listener connection time exceeding 500ms. When you have eliminated the ridiculous alerts, take a deep breath. You are down to fewer than 200 emails per day to sift through. Your coworkers and your boss will thank you. But we’re just getting started.
  2. Change Critical Alerts to Warning: In my case, there were quite a few alerts that I wanted to know about but that could wait until morning. Downgrading them means our team still gets an email, but no page is sent to the Oncall DBA. (A scripted EMCLI version of this change, and of step 3, is sketched after this list.)
    1. Navigate to Enterprise –> Monitoring –> Monitoring Templates. Find the Monitoring Template you want to edit.
    2. Click the select box next to the Monitoring Template to highlight it. Click Edit and then Metric Thresholds.
    3. Find the metric and click the pencil icon on the right side to edit it. For our example, I clicked on Failed Login Count.
    4. Adjust the threshold settings. I kept the warning threshold at 150 and blanked out the critical threshold box; if it is empty, OEM interprets that as no critical threshold. I set the number of occurrences to 1, so a warning triggers as soon as it hits 150 failed logins.
  3. Change Metric Settings and Number of Occurrences to meet organizational needs: For this example, I adjusted the metric “Total global cache blocks lost is **”.
    1. Follow steps 1-3 above. The Target Type is Database Instance and the metric is global cache blocks lost. Click on the edit pencil.
    2. In my case, I set the warning threshold to 10, left the critical threshold blank, and set the number of occurrences to 12. The collection schedule is every five minutes, so 12 occurrences means the warning won’t trigger until the threshold has been violated continuously for one hour (12 × 5 minutes). What I found was that this warning, when triggered, would usually clear within thirty minutes, so the longer window got rid of a huge number of warning emails and critical pages.
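
As promised in step 1, here is a way to let the repository, rather than Outlook, tell you which metrics are making the most noise. This is a rough sketch; MGMT$ALERT_HISTORY is a SYSMAN repository view, but I am recalling the column names from memory, so verify them against your own repository before trusting the output:

    -- Count alerts per metric over the last 7 days. Run in SQL*Plus as a
    -- user with SELECT privileges on the SYSMAN repository views.
    -- (Column names recalled from memory; verify against your repository.)
    SELECT metric_label,
           column_label,
           COUNT(*) AS alert_count
    FROM   sysman.mgmt$alert_history
    WHERE  collection_timestamp > SYSDATE - 7
    GROUP  BY metric_label, column_label
    ORDER  BY alert_count DESC;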
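
And for steps 2 and 3, the click-path above has an EMCLI equivalent. Note that modify_threshold edits thresholds directly on a target rather than in a template, and the metric and column names below are placeholders, not the real internal names; check emcli help modify_threshold and your target’s metric definitions before running anything like this. I have also left the critical threshold out entirely here; verify how your EMCLI version expects you to clear an existing critical threshold:

    # Warning at 10 and 12 consecutive occurrences required
    # (12 x 5-minute collections = 1 hour) before the warning fires.
    # Metric and column internal names below are placeholders.
    emcli modify_threshold \
          -target_name="PRODDB" \
          -target_type="oracle_database" \
          -metric="GlobalCacheBlocksLost" \
          -column="global_cache_blocks_lost" \
          -warning_threshold="10" \
          -occurrences="12"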

This has been an iterative process and will likely continue to be one. Some of the metrics I changed at the cluster level were still being triggered at the database instance level; in that case, go in and change the metric at the instance level as well. Please send me a message if you run into any issues, and I can help you work through some of the problems I’ve experienced.

Thanks for reading!

Jason
