OpsMgr Event Correlation – Branch Site Monitoring

December 11th, 2008 by Steve

I am a big fan of System Center Operations Manager 2007, but in its current version it has a limitation.  If your management server is at your primary datacenter and it monitors servers in a remote site, what happens if the WAN link between the two locations goes down?  Ideally your monitoring solution is smart enough to know that the branch site is unavailable and stops trying to monitor the end points at that branch until the site is available again.  This capability is known as event correlation.  However, I have found no way to configure this functionality in OpsMgr.  I have even posed the question on OpsMgr forums and asked OpsMgr MVPs, all to no avail.  So, necessity being the mother of invention (or creative re-utilization), I have pieced together a solution.

Assumptions:

  1. You have computer groups set up in OpsMgr for each branch site.  In my case, my branch sites are also set up in Active Directory as AD sites.  If this is true for you as well, be sure to check out Cameron Fuller’s blog about setting up dynamic computer groups based on AD sites.
  2. You know the IP address of the branch site’s router.  We will be pinging this to check the availability of the site as a whole.
  3. You have Boris Yanushpolsky’s “rock-on awesome” script to place an entire computer group in maintenance mode (get it here).  Put his script in a folder on your management server, such as “C:\Scripts\GroupMM”.
  4. Your management server has PowerShell and the OpsMgr PowerShell console component installed.

How Does It Work

The heavy lifting is done by Boris’ maintenance mode script.  All this solution does is determine whether the branch site router is reachable (via our friend “Ping”).  If it is, we record that in the results file (auto-created).  If the router is not reachable, we put the computer group in maintenance mode and create a tracking file with the site name (so that on later runs we don’t try to put it in maintenance mode again).  The condition and action taken are recorded in the results file.  Later, when the router becomes reachable again, we remove the computer group from maintenance mode, delete the tracking file and record the condition and action taken in the results file.
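To make the flow above a little more concrete, here is a minimal sketch of just the decision logic, written in Python for illustration. It is not the BranchSiteMonitoring script. The function name handle_site, the “.down” tracking file extension and the action names are all made up for this example, and the sketch only returns the action to take. Calling the maintenance mode script and writing the results file are left to the caller.

```python
from pathlib import Path


def handle_site(group: str, reachable: bool, track_dir: Path) -> str:
    """Decide what to do for one branch site and update its tracking file.

    Returns "start_maintenance", "stop_maintenance" or "none".
    All names here are hypothetical, chosen for this sketch only.
    """
    # Assumption: one tracking file per site, named after the computer group.
    tracking = track_dir / f"{group}.down"

    if not reachable and not tracking.exists():
        # Router just went down: remember that we handled it.
        tracking.write_text(group)
        return "start_maintenance"

    if reachable and tracking.exists():
        # Router is back: forget the outage.
        tracking.unlink()
        return "stop_maintenance"

    # Either still up, or still down and already in maintenance mode.
    return "none"
```

The tracking file is what keeps the check idempotent: if the router stays down across several runs, only the first run returns "start_maintenance" and the later ones return "none".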

To Implement the Solution:

  1. Create a directory on your management server for the branch availability monitoring script, such as “C:\Scripts\BranchSiteMonitoring”.
  2. Edit and save the BranchSiteMonitoring script (at the end of this blog entry) in the directory you just created.
  3. Create an input file called “input.txt” and place it in the directory you created in step 1. This input file should list the IP address of the branch site router and the OpsMgr computer group name that corresponds to that site.  Separate these two bits of information with a “~”.  Each new IP address and computer group name should be on a new line.  Ensure there is not a blank line at the end!  The input file should look like this:

192.168.1.254~Houston Branch Site
192.168.2.254~Austin Branch Site
etc…

  4. Schedule a task on the management server to run the BranchSiteMonitoring script as often as you like.  Keep in mind the idea is to “catch” the site outage before OpsMgr does, so the script can place the end points at the unavailable site in maintenance mode BEFORE you get a bunch of unneeded alerts.  I would start by running it once every 5 minutes.
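If the input file format from step 3 is unclear, this short Python sketch shows one way to read it. It is only an illustration of the “IP~Group” layout, not code taken from the BranchSiteMonitoring script, and the function name parse_input is hypothetical. Note that this sketch chooses to skip blank lines. That says nothing about how the real script behaves, so still keep your input.txt free of them.

```python
def parse_input(text: str) -> list[tuple[str, str]]:
    """Parse 'router IP~computer group' lines into (ip, group) pairs."""
    sites = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # choice made by this sketch: ignore blank lines
        # Split on the first "~" only, so the rest of the line is the group name.
        ip, sep, group = line.partition("~")
        ip, group = ip.strip(), group.strip()
        if not sep or not ip or not group:
            raise ValueError(f"line {number}: expected 'IP~Group', got {line!r}")
        sites.append((ip, group))
    return sites
```

Given the two sample lines from step 3, this returns one (IP address, computer group name) pair for Houston and one for Austin, in file order.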

Cowardly Disclaimer:

Use this solution at your own risk.  I’ll take responsibility for my actions (like running scripts on my corporate network written by people I don’t know) and you can take responsibility for yours <g>.

The Script:

Big thanks to dm_4ever for his WMI ping function (you thought I was kidding about “re-utilization”!).  If the WMI version gives you grief you can try his non-WMI version instead.

 Script Download

10 comments

  1. Steve says:

    Note the script has been updated to accomplish log rotation. The “results.txt” file will be archived into a ZIP file once it becomes larger than 2 MB and a new results.txt file will be created at the next run.

  2. Steve says:

    Another script update! Now if all routers are up we log an event ID 18040 in the OpsMgr event log. If a router is down we log an event 18041. Also, we now try to ping the router twice (5 seconds in between attempts) before designating it as unavailable and placing servers in that site in maintenance mode.

  3. Stefan Koell says:

    Very cool idea, thanks for sharing!

  4. cornasdf says:

    Hi, i like the solution and I am looking at implementing it. What do you suggest to make sure you are notified when the router goes down. I don’t want 300 alerts but I do need one! B)

    Thanks
    ej

    • Steve says:

      Hi EJ. Check out the comments above…I commented on an addition I made where we log an event in the OpsMgr event log if a router is down (18041). You can create an event monitor to watch for this event and send you an alert. Note that if a router is down then all the agents in the computer group get put into maintenance mode…hence no barrage of alerts from those agents.

      Note I have not tested this solution on R2 yet.

      Let me know how the implementation goes and if you have any questions/problems.

      • cornasdf says:

        well, i can tell you it works on R2. B)

        I assumed that the event logging was for that purpose and that is what I have set up. One thing I am trying to iron out is that I want a different alert on each down branch. right now, i am getting the first event but no others. I created a simple windows event reset monitor targeted at windows computers (but disabled). I enabled it on the computer running my tests. I am just getting my feet wet with monitor creation. is there a better monitor type to be creating?

        Thanks
        ej

        • Steve says:

          I’m glad to hear it works on R2!

          The event reset monitor is exactly what I would use. There should be an event ID 18040 if all routers are up and an 18041 if one or more is down. I like the idea of alerting based on which branch(es) are down. I will look into that.

          What I really need to do (in addition to that change) is just make this an importable management pack as well…oh if I only had more time!

  5. cornasdf says:

    Steve,
    I ended up making a few tweaks. basically, each branch spits out a different error code and I watch for those w/ different monitors. Maybe not the most elegant way but it is working for us. I posted some details here: http://cornasdf.blogspot.com/2009/10/stopping-alert-floods-when-branch.html

    Thanks again for the solution.
