OpsMgr Event Correlation – Branch Site Monitoring

December 11th, 2008

I am a big fan of System Center Operations Manager 2007, but in its current version it has a limitation.  If your management server is at your primary datacenter and it monitors servers in a remote site, what happens if the WAN link between the two locations goes down?  Ideally, your monitoring solution is smart enough to know that the branch site is unavailable and, because of that, stops trying to monitor the end points at that branch until the site is available again.  This capability is known as event correlation.  However, I have found no way to configure this functionality in OpsMgr.  I have even posed the question on OpsMgr forums and asked OpsMgr MVPs, all to no avail.  So, necessity being the mother of invention (or creative re-utilization), I have pieced together a solution.

Assumptions:

  1. You have computer groups set up in OpsMgr for each branch site.  In my case, my branch sites are also set up in Active Directory as AD sites.  If this is true for you as well, be sure to check out Cameron Fuller’s blog about setting up dynamic computer groups based on AD sites.
  2. You know the IP address of the branch site’s router.  We will be pinging this to check the availability of the site as a whole.
  3. You have Boris Yanushpolsky’s “rock-on awesome” script to place an entire computer group in maintenance mode (get it here).  Put his script in a folder on your management server, such as “C:\Scripts\GroupMM”.
  4. Your management server has PowerShell and the OpsMgr PowerShell console component installed.
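
To make the “ping the router” idea in assumption 2 concrete, here is a minimal sketch in Python.  It is not the script from this post (that one uses dm_4ever’s WMI ping function); it simply shells out to the operating system’s `ping` command.  The function names `build_ping_command` and `is_reachable` are made up for illustration, and the one-second timeout is an arbitrary starting point.

```python
import platform
import subprocess


def build_ping_command(ip, timeout_ms=1000, system=None):
    """Build a single-echo ping command for the current (or given) OS."""
    system = system or platform.system()
    if system == "Windows":
        # -n: number of echo requests, -w: timeout per reply in milliseconds
        return ["ping", "-n", "1", "-w", str(timeout_ms), ip]
    # Linux-style ping: -c: count, -W: timeout in whole seconds
    return ["ping", "-c", "1", "-W", str(max(1, timeout_ms // 1000)), ip]


def is_reachable(ip, timeout_ms=1000):
    """Return True if one echo request to ip gets a reply."""
    result = subprocess.run(
        build_ping_command(ip, timeout_ms),
        capture_output=True,
        text=True,
    )
    if platform.system() == "Windows":
        # Windows ping can exit 0 on "Destination host unreachable",
        # so also look for the TTL field that only a real echo reply carries.
        return result.returncode == 0 and "TTL=" in result.stdout
    return result.returncode == 0
```

A single dropped packet will read as “site down” with this sketch, so in practice you may want to send a few echo requests before deciding the router is really unreachable.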

How Does It Work

The heavy lifting is done by Boris’ maintenance mode script.  All this solution does is determine whether the branch site router is reachable (via our friend “Ping”).  If it is, we record that in the results file (which is created automatically).  If the router is not reachable, we put the computer group in maintenance mode and create a tracking file with the site name (so that later runs don’t try to put it in maintenance mode again).  The condition and action taken are recorded in the results file.  Later, when the router becomes reachable again, we remove the computer group from maintenance mode, delete the tracking file and record the condition and action taken in the results file.
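
The decision logic described above can be sketched roughly as follows.  This is an illustration in Python, not the actual script: the file names (`results.txt`, `<group>.down`) are my own placeholders, and `ping`, `start_mm` and `stop_mm` are hypothetical callables standing in for the ping check and for calls to Boris’ maintenance mode script.

```python
from datetime import datetime
from pathlib import Path


def check_site(ip, group, work_dir, ping, start_mm, stop_mm):
    """Run one monitoring pass for a single branch site; return the action taken."""
    work_dir = Path(work_dir)
    tracking_file = work_dir / f"{group}.down"  # exists only while the group is in maintenance mode
    results_file = work_dir / "results.txt"     # created on first write

    reachable = ping(ip)
    if reachable and tracking_file.exists():
        # Site came back: undo what an earlier run did.
        stop_mm(group)
        tracking_file.unlink()
        action = "removed from maintenance mode"
    elif reachable:
        action = "none"
    elif tracking_file.exists():
        # Still down, and an earlier run already handled it.
        action = "none (already in maintenance mode)"
    else:
        # Newly down: suppress alerts and remember that we did so.
        start_mm(group)
        tracking_file.write_text(group)
        action = "placed in maintenance mode"

    status = "reachable" if reachable else "unreachable"
    with results_file.open("a") as log:
        log.write(f"{datetime.now().isoformat()} {group} ({ip}) {status}: {action}\n")
    return action
```

The tracking file is what makes the script safe to run every few minutes: the presence or absence of that file is the only state carried from one run to the next.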

To Implement the Solution:

  1. Create a directory on your management server for the branch availability monitoring script, such as “C:\Scripts\BranchSiteMonitoring”.
  2. Edit and save the BranchSiteMonitoring script (at the end of this blog entry) in the directory you just created.
  3. Create an input file called “input.txt” and place it in the directory you created in step 1.  This input file should list the IP address of the branch site router and the OpsMgr computer group name that corresponds to that site.  Separate these two bits of information with a “~”.  Each new IP address and computer group name should be on a new line.  Ensure there is not a blank line at the end!  The input file should look like this:

192.168.1.254~Houston Branch Site
192.168.2.254~Austin Branch Site

etc…

  4. Schedule a task on the management server to run the BranchSiteMonitoring script as often as you like.  Keep in mind the idea is to “catch” the site outage before OpsMgr does, so the script can place the end points at the unavailable site in maintenance mode BEFORE you get a bunch of unneeded alerts.  I would start by running it once every 5 minutes.
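
For anyone adapting the “input.txt” format from step 3 to their own tooling, here is a small Python sketch of how the “IP~group name” lines might be parsed.  The function name `parse_sites` is hypothetical, and unlike the script in this post, this sketch deliberately skips blank lines instead of requiring that there be none.

```python
def parse_sites(text):
    """Parse 'ip~group name' lines into a list of (ip, group) tuples."""
    sites = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines rather than failing on them
        # Split on the first "~" only, so the group name is taken as-is.
        ip, sep, group = line.partition("~")
        if not sep or not ip.strip() or not group.strip():
            raise ValueError(
                f"line {line_number}: expected 'ip~group name', got {line!r}"
            )
        sites.append((ip.strip(), group.strip()))
    return sites
```

Feeding it the two example lines from step 3 gives one (IP, group name) pair per site; a line with no “~” raises an error that names the offending line number.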

Cowardly Disclaimer:

Use this solution at your own risk.  I’ll take responsibility for my actions (like running scripts on my corporate network written by people I don’t know) and you can take responsibility for yours <g>.

The Script:

Big thanks to dm_4ever for his WMI ping function (you thought I was kidding about “re-utilization”!).  If the WMI version gives you grief, you can try his non-WMI version instead.

 Script Download