I would like to share a real-world example to explain this phenomenon. A few days ago, one of our clients’ on-premises SCOM deployments started generating hundreds of alerts and, consequently, the same number of P1 incidents. The alerts were triggered by events on Veritas Enterprise Vault servers. We will not discuss the cause of those events here. During our investigation, we found that their SCOM has a customised event detection Rule that monitors a specific event ID on the Vault servers. When the “Veritas Enterprise Vault” log was flooded with that event ID in a short time frame, the Rule generated hundreds of alerts.
Usually, event detection Rules don’t generate an alert for every single event; the author configures alert suppression when creating the Rule in the first place. When alert suppression is enabled for a rule, only the first alert is sent and further alerts are suppressed. A suppressed alert is not displayed in the Operations console. Operations Manager suppresses only duplicate alerts as defined by the alert suppression criteria. Fields stated in the suppression criteria must be identical for the alert to be considered a duplicate and suppressed. An alert must be created by the same rule and be unresolved to be considered a duplicate. The repeat count for an alert with suppression enabled is incremented for each suppressed alert. You can also view the Repeat Count in the properties of an alert.
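To make the duplicate logic concrete, here is a minimal Python sketch of the suppression behaviour described above. It is a toy model, not the Operations Manager API: the `Alert` and `AlertStore` names, the rule name, and the server name `VAULT01` are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    rule: str
    fields: tuple          # values of the suppression-criteria fields
    resolved: bool = False
    repeat_count: int = 0


@dataclass
class AlertStore:
    """Toy model of alert suppression; not the Operations Manager API."""
    alerts: list = field(default_factory=list)

    def raise_alert(self, rule, fields):
        # A duplicate comes from the same rule, has identical suppression
        # fields, and matches an alert that is still unresolved.
        for alert in self.alerts:
            if alert.rule == rule and alert.fields == fields and not alert.resolved:
                alert.repeat_count += 1   # suppressed: only the counter moves
                return alert
        new_alert = Alert(rule=rule, fields=fields)
        self.alerts.append(new_alert)
        return new_alert


store = AlertStore()
for _ in range(300):                      # hypothetical flood of 300 events
    store.raise_alert("EV event 6702", ("VAULT01", 6702))

print(len(store.alerts), store.alerts[0].repeat_count)  # prints: 1 299
```

With suppression in place, the flood collapses into a single alert whose repeat count keeps climbing, which is exactly why no further tickets are raised until that alert is resolved.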
We enabled alert suppression to stop the P1 incidents, but enabling suppression for Rules has a downside. If we suppress the alerts and leave things as they are, no new alert will be generated until the original alert is cleared. By default, this takes 7 days. That means if the condition occurs and is promptly fixed, then occurs again within 7 days, no new alert will be generated, and potentially no new ticket will ever be raised. So there should be a reset timer to ‘re-arm’ the rule. The drawback of this approach is that if the condition still exists when we reset the rule, it will generate a new alert. So we have to find a balance: for example, if it’s a P1 and we have an SLA to fix it within 4 hours, we should configure the rule to reset after 4 hours. It would generate a new alert after 4 hours if the condition still exists, but that’s how rules work, and it is still better than hundreds of them.
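The trade-off can be put into rough numbers with a small Python sketch. It assumes a simplified model in which every event inside the re-arm window is suppressed and the first event after the window raises a new alert; the flood pattern (one event every 6 minutes for 10 hours) is invented for illustration.

```python
def count_alerts(event_minutes, rearm_after_minutes):
    """Count alerts when the rule re-arms a fixed time after each alert.

    Simplified model: events inside the window after an alert are
    suppressed; the first event at or after the window raises a new alert.
    """
    alerts = 0
    armed_at = None  # earliest time at which the next event may alert
    for t in sorted(event_minutes):
        if armed_at is None or t >= armed_at:
            alerts += 1
            armed_at = t + rearm_after_minutes
    return alerts


# Hypothetical flood: one event every 6 minutes for 10 hours (100 events).
flood = list(range(0, 600, 6))

print(count_alerts(flood, rearm_after_minutes=0))        # 100 (no suppression)
print(count_alerts(flood, rearm_after_minutes=4 * 60))   # 3   (re-arm after 4 hours)
print(count_alerts(flood, rearm_after_minutes=24 * 60))  # 1   (re-arm after 1 day)
```

A shorter window means more repeat alerts while the problem persists; a longer window means a recurrence may go unnoticed for longer.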
So, we discussed this with the stakeholders, and they advised us to configure the timer reset for 1 day. There are two ways of doing it: first, write a Rule from scratch and configure the consolidation settings; second, create a Windows Events Timer Reset Monitor for the specific event ID. The second option is the quicker one, and it is the one I followed.
Here are the step-by-step instructions:
Pre-configuration information required:
- Intermediate knowledge of SCOM
- Appropriate rights and permissions (SCOM Administrator)
- Proposed Alert
- Information about the event, i.e. Log Name, Event ID, and Source (e.g. Log Name = Veritas Enterprise Vault, Event ID = 6702, Source = Enterprise Vault)
- Wait time before triggering the auto reset state of the monitor, e.g. 30 mins, 4 hrs, 16 hrs, 1 day, etc.
Please follow these steps to create a Windows Events Timer Reset Monitor for Event ID 6702 and add it to a Notification Subscription:
- Launch the Operations Manager Console → Click the Authoring pane.
- Go to Management Pack Objects and right click on Monitors; hover over Create Monitor → Select Unit Monitor.
- Expand Windows Events → Expand Simple Event Detection → Select Timer Reset
Manual Reset: With manual reset, the monitor never returns to a healthy state automatically. The user must determine whether the problem was corrected and then select the monitor in the Health Explorer and select Reset Health.
Timer Reset: A timer reset acts the same as a manual reset except that if the user does not manually reset the monitor after a specified time, it will reset automatically.
Windows Event Reset: With event reset, the monitor is reset when a single occurrence of a specific event is detected. The event must be the same type as the event used for detecting the error condition.
- Select the destination Management Pack → Click Next.
- Name your new monitor and write a brief Description → select the Monitor Target.
Note: In reality, I didn’t choose Windows Server Operating System; I created a specific target for the Enterprise Vault servers.
- Mention the Log Name where your software writes events
Note: In my case, I logged on to the Vault servers and verified the log name.
- In the expression builder provide the Event ID and Event Source → Click Next
Note: I would recommend providing the Event Source to make sure you get the right event. I logged on to the Vault servers and verified the source from the events.
- Set the timer that determines when the monitor should reset. Check whether this monitoring must follow an SLA, then click Next to continue.
Note: Usually SLA is 4 hours. It’s fine to set 4 hrs here, but in this case, stakeholders advised us 1 day.
- Set the Health Conditions for this monitor and choose the severity. By default, the status is Warning when the event is raised and Healthy otherwise. Click Next to continue.
Note: I chose Critical Health state for event raised.
- In the last step, enable alerting for whenever the event is raised: check the box for Generate Alerts for this Monitor → Write the expression for the description → Click Create.
Note: For event description write this expression: Event description: $Data/Context/EventDescription$
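The behaviour configured in the steps above can be sketched as a small state machine in Python. This is a conceptual model only, not SCOM code: the class and field names are invented, and it assumes the monitor alerts only on the transition from Healthy to the unhealthy state and that the reset timer starts when the monitor first turns unhealthy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    log_name: str
    event_id: int
    source: str
    description: str
    time: datetime


class TimerResetMonitor:
    """Conceptual model of a timer reset event monitor; not SCOM code."""

    def __init__(self, log_name, event_id, source, reset_after):
        self.log_name, self.event_id, self.source = log_name, event_id, source
        self.reset_after = reset_after
        self.state = "Healthy"
        self.unhealthy_since = None
        self.alerts = []

    def matches(self, event):
        # Mirrors the wizard: Log Name, Event ID and Event Source must all match.
        return (event.log_name == self.log_name
                and event.event_id == self.event_id
                and event.source == self.source)

    def reset(self):
        # Also stands in for a manual "Reset Health" from Health Explorer.
        self.state, self.unhealthy_since = "Healthy", None

    def tick(self, now):
        # Timer reset: go back to Healthy once the wait time has elapsed.
        if self.state != "Healthy" and now - self.unhealthy_since >= self.reset_after:
            self.reset()

    def on_event(self, event):
        self.tick(event.time)
        if self.matches(event) and self.state == "Healthy":
            self.state, self.unhealthy_since = "Critical", event.time
            self.alerts.append(f"Event description: {event.description}")


start = datetime(2024, 1, 1, 9, 0)
monitor = TimerResetMonitor("Veritas Enterprise Vault", 6702,
                            "Enterprise Vault", timedelta(days=1))
for minutes in (0, 1, 2, 25 * 60):        # three events in a burst, one 25 h later
    monitor.on_event(Event("Veritas Enterprise Vault", 6702, "Enterprise Vault",
                           "sample text", start + timedelta(minutes=minutes)))

print(len(monitor.alerts), monitor.state)  # prints: 2 Critical
```

The burst produces one alert, and the event that arrives after the one-day timer has expired produces a second one, which is the balance discussed earlier.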
Create a Subscription
If you don’t already have a subscription, create one, or create a separate subscription just for this alert. SCOM 2019 Notification Subscriptions are far more powerful than those in previous versions.
- Launch Operations Manager Console and head to the Administration pane.
- Expand Notifications, right click Subscriptions -> New subscription…
- Give your new subscription a Name and a Description, click Next to continue.
- Select “raised by any instance of a specific class” -> Click specific
Note: It’s recommended that your scope should be specific and narrowed. You can select a specific group or class. In this example, I am selecting class.
- Search for Windows Server Operating System -> Click Add -> Click OK
Note: In this case, I know that my Enterprise Vault servers are running Windows Server Operating System; hence added that class.
- You will be back to Subscription Scope window -> Click Next
Note: SCOM 2019 Subscription Criteria is a powerful expression builder window.
- Click Insert -> Select AND group
- Click Insert again -> Select Expression
- Click on arrow sign in Criteria column -> Select Monitors
Note: As I mentioned earlier, it’s an amazing expression builder tool. You will see heaps of options in the Criteria column. As we are adding a monitor, select Monitors. If you were adding a rule instead, you would select Rules.
- Select Equals in the drop list of Operator column
- Click three dots in Value column
- Search for the monitor you want to add to this subscription
Note: If you are not sure about the exact name of the monitor, try any part of the monitor name or select the management pack and search all of its monitors.
- You will return to the Subscription Criteria window. Add another Expression as you did earlier (Click Insert -> Select Expression)
- In the Criteria column of the new expression, select Resolution State
Note: You can build the expression as per your need, but I usually need two or three criteria in my AND group, i.e. Rules, Monitors, and Resolution State
- Select Equals in the Operator column, as you did for the monitor
- Click the three dots in the Value column, as you did for the monitor
- Select New (0) -> Click OK
Note: The choice of resolution states is also very subjective, but I want to be notified when an alert is generated, so I selected New (0)
- Click Next
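Note: Once you complete your query, it will be a single AND group with two expressions: the monitor equals the one you selected, and Resolution State equals New (0)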
- Click Add… -> Search the subscriber and add it -> Click Next
Note: If you have already created the Subscribers, add them here. You can add multiple subscribers too. Here, I have already created a subscriber for my ServiceNow
- Click Add… -> Search for the channel(s) and Add
Note: Like subscriber, if you have already created the channel(s), add them here. You can add multiple channels too. You can create a new channel as well.
- Select the Enable this notification subscription checkbox -> Click Finish
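The subscription criteria built above amount to a simple AND of two conditions. The following Python sketch shows that logic only; it is not how SCOM evaluates subscriptions internally. The dictionary keys and the monitor name `EV Event 6702 Timer Reset` are hypothetical. Resolution state 0 is New and 255 is Closed.

```python
MONITOR_NAME = "EV Event 6702 Timer Reset"   # hypothetical monitor name


def matches_subscription(alert, monitor_name):
    """AND group: raised by the chosen monitor AND Resolution State is New (0)."""
    criteria = [
        lambda a: a["monitor"] == monitor_name,
        lambda a: a["resolution_state"] == 0,
    ]
    return all(check(alert) for check in criteria)


new_alert = {"monitor": MONITOR_NAME, "resolution_state": 0}
closed_alert = {"monitor": MONITOR_NAME, "resolution_state": 255}
other_alert = {"monitor": "Some other monitor", "resolution_state": 0}

print(matches_subscription(new_alert, MONITOR_NAME))     # True
print(matches_subscription(closed_alert, MONITOR_NAME))  # False
print(matches_subscription(other_alert, MONITOR_NAME))   # False
```

Only a new alert from the selected monitor reaches the subscriber, so closing or updating the alert later does not trigger another notification under these criteria.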
You have now successfully created the monitor and the subscription, and added the monitor to the subscription.
Now, if an event with the event ID 6702 appears in the Veritas Enterprise Vault event log on a Vault server, your subscriber recipient(s) should receive an alarm notification.
What would you do differently? If you have questions or require further clarification, please leave a comment.