Posts Tagged ‘OpsMgr’

System Center Operations Manager 2007 R2 – How to Install a Gateway Server

March 10th, 2010

Installing a gateway server in an existing OpsMgr environment is a fairly challenging task.  While doing this for a customer I found many helpful resources, but none that was complete.  The most helpful was Brad Hearn’s blog, and much of my documented process comes from his post.  Honestly, I would never have gotten my gateways operational without Brad’s excellent post, so many thanks!

Download “How to Install a Gateway Server.pdf” here.

OpsMgr 2007 DMZ-Based Agents Fail to Report to Gateway with Event ID 20070

February 19th, 2010

We recently noticed most of our DMZ-based OpsMgr agents were not connecting to their gateway server.  On the agent we saw the following event:

Event Type:          Error
Event Source:       OpsMgr Connector
Event Category:    None
Event ID:              20070
Computer:            <Computer>
Description:          The OpsMgr Connector connected to <domain>, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

On the gateway server the following event was being logged:

Event Type: Information
Event Source: OpsMgr Connector
Task Category: None
Event ID: 20000
Description: A device which is not part of this management group has attempted to access this Health Service.  Requesting Device Name : <computer>

The strange thing is that these agents had been working fine, and a few still were working!  We checked the usual things and did the usual recovery steps:
  • Is TCP 5723 open to the gateway server?
  • Restart the HealthService
  • Is agent in Pending Management?
  • Restart the HealthService
  • Wait 5 minutes
  • Restart the HealthService
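The first check in that list is easy to script if you have many agents to test.  Here is a minimal sketch in Python (not something OpsMgr ships, just a generic TCP connect test).  The function name and the default timeout are my own choices for illustration; only port 5723 comes from the post.

```python
import socket


def tcp_port_open(host: str, port: int = 5723, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A successful connect only shows that the port is reachable; it says
    nothing about whether the gateway will authorize the agent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run from the agent, tcp_port_open("<gateway server name>") returning True would tell you the firewall is not the problem, which is exactly where we ended up.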

All to no avail.  We found http://blogs.technet.com/operationsmgr/archive/2009/02/17/opsmgr-2007-port-requirements-for-scom-agents-in-a-dmz.aspx which suggested opening ports 88 and 389 from the agent to the RMS.  This did not make sense to us since some agents were working.  So we used Netmon 3.3 to trace the client while the HealthService started.  The agent never used any port other than 5723.

We even enabled verbose diagnostic tracing (http://support.microsoft.com/kb/942864) and reviewed the logs.  We could see where the 20070 event was being generated, but little else of interest:

5412.5956::02/19/2010-10:46:56.978 [Common] [] [Verbose] :Common::EventLogUtil::LogEvent{EventLogUtil_cpp311}Logging error event 20070 with args "<servername>", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"

5412.5956::02/19/2010-10:46:56.978 [Common] [] [Information] :Common::EventLogUtil::LogEvent{EventLogUtil_cpp397}Logging event 20070 from source "OpsMgr Connector" with severity Error and description "The OpsMgr Connector connected to <GatewayServer>, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.".

Solution…

We finally had to call Microsoft.  After about 30 minutes of troubleshooting, the engineer noticed that the OpsMgrConnector.Config.xml file in the C:\Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<MgmtGrpName> folder on the gateway server had last been modified several weeks earlier.  He had us rename the Health Service State folder under C:\Program Files\System Center Operations Manager 2007 and restart the HealthService.  The service then created a new Health Service State folder, and the new OpsMgrConnector.Config.xml had a much more current last modified date.  We then restarted the HealthService on the agents and they reported in to the gateway server correctly.
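The telltale sign was the old last modified date on the cached configuration file.  If you want to check for that without browsing to the folder, something like the following Python sketch would do it.  The function names and the age threshold are assumptions of mine; the post only establishes that "several weeks" was too old.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86400.0


def days_since_modified(path: str, now: Optional[float] = None) -> float:
    """Return how many days ago the file at path was last modified."""
    current = time.time() if now is None else now
    return (current - os.path.getmtime(path)) / SECONDS_PER_DAY


def is_stale(path: str, max_age_days: float = 7.0,
             now: Optional[float] = None) -> bool:
    """Return True if the file is older than max_age_days.

    The 7-day default is an arbitrary illustration, not a documented
    OpsMgr threshold.
    """
    return days_since_modified(path, now) > max_age_days
```

You would point is_stale at the OpsMgrConnector.Config.xml path given above.  A stale result is only a hint that the gateway has stopped receiving configuration, not proof.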

OpsMgr Event Correlation – Branch Site Monitoring

December 11th, 2008

I am a big fan of System Center Operations Manager 2007, but in its current version it has a limitation.  If your management server is at your primary datacenter and it monitors servers in a remote site, what happens if the WAN link between the two locations goes down?  Ideally your monitoring solution is smart enough to know that the branch site is unavailable, and will stop trying to monitor the end points at that branch until the site is available again.  This capability is known as event correlation.  However, I have found no way to configure this functionality in OpsMgr.  I have even posed the question on OpsMgr forums and asked OpsMgr MVPs, all to no avail.  So, necessity being the mother of invention (or creative re-utilization), I have pieced together a solution.

Assumptions:

  1. You have computer groups set up in OpsMgr for each branch site.  In my case, my branch sites are also set up in Active Directory as AD sites.  If this is true for you as well, be sure to check out Cameron Fuller’s blog about setting up dynamic computer groups based on AD sites.
  2. You know the IP address of the branch site’s router.  We will be pinging this to check the availability of the site as a whole.
  3. You have Boris Yanushpolsky’s “rock-on awesome” script to place an entire computer group in maintenance mode (get it here).  Put his script in a folder on your management server, such as “C:\Scripts\GroupMM”.
  4. Your management server has PowerShell and the OpsMgr PowerShell console component installed.

How Does It Work

The heavy lifting is done by Boris’ maintenance mode script.  All this solution does is determine whether the branch site router is reachable (via our friend “Ping”).  If the router is reachable, we record that in the results file (auto-created).  If it is not reachable, we put the computer group in maintenance mode and create a tracking file with the site name (so that later runs don’t try to put it in maintenance mode again).  The condition and action taken are recorded in the results file.  Later, when the router becomes reachable again, we remove the computer group from maintenance mode, delete the tracking file, and record the condition and action taken in the results file.
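The decision logic boils down to two inputs: is the router reachable, and does a tracking file already exist for the site?  Here is a small Python sketch of just that logic.  It is not the actual BranchSiteMonitoring script; the names Action, decide_action and apply_action and the ".down" file suffix are invented for illustration, and the call to the maintenance mode script is deliberately left out.

```python
from enum import Enum
from pathlib import Path


class Action(Enum):
    NONE = "no change"
    START_MAINTENANCE = "put group in maintenance mode"
    STOP_MAINTENANCE = "remove group from maintenance mode"


def decide_action(router_reachable: bool, tracking_file_exists: bool) -> Action:
    """Map the ping result and tracking-file state to the action to take."""
    if not router_reachable and not tracking_file_exists:
        # Site just went down: start maintenance mode once.
        return Action.START_MAINTENANCE
    if router_reachable and tracking_file_exists:
        # Site came back: end maintenance mode.
        return Action.STOP_MAINTENANCE
    # Either still up, or still down and already handled.
    return Action.NONE


def apply_action(action: Action, group_name: str, tracking_dir: Path) -> None:
    """Create or delete the tracking file that remembers a site is down."""
    marker = tracking_dir / f"{group_name}.down"  # hypothetical naming
    if action is Action.START_MAINTENANCE:
        marker.touch()
    elif action is Action.STOP_MAINTENANCE:
        marker.unlink()
```

The tracking file is what keeps a run every few minutes from putting the same group into maintenance mode over and over while the site stays down.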

To Implement the Solution:

  1. Create a directory on your management server for the branch availability monitoring script, such as “C:\Scripts\BranchSiteMonitoring”.
  2. Edit and save the BranchSiteMonitoring script (at the end of this blog entry) in the directory you just created.
  3. Create an input file called “input.txt” and place it in the directory you created in step 1.  This input file should list the IP address of the branch site router and the OpsMgr computer group name that corresponds to that site.  Separate these two bits of information with a “~”.  Each new IP address and computer group name should be on a new line.  Ensure there is not a blank line at the end!  The input file should look like this:

192.168.1.254~Houston Branch Site
192.168.2.254~Austin Branch Site
etc…

  4. Schedule a task on the management server to run the BranchSiteMonitoring script as often as you like.  Keep in mind the idea is to “catch” the site outage before OpsMgr does, so the script can place the end points at the unavailable site in maintenance mode BEFORE you get a bunch of unneeded alerts.  I would start by running it once every 5 minutes.
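If you want to sanity-check your input.txt before scheduling the task, the “IP~group name” format from step 3 is simple to validate.  The Python sketch below is only an illustration of the format, not part of the BranchSiteMonitoring script, and the function names are made up.

```python
import ipaddress
from typing import List, Tuple


def parse_site_line(line: str) -> Tuple[str, str]:
    """Split one 'router IP~computer group name' line into its two parts."""
    ip_text, separator, group_name = line.partition("~")
    ip_text, group_name = ip_text.strip(), group_name.strip()
    if not separator or not ip_text or not group_name:
        raise ValueError(f"expected 'IP~group name', got: {line!r}")
    ipaddress.ip_address(ip_text)  # raises ValueError if not a valid IP
    return ip_text, group_name


def parse_sites(text: str) -> List[Tuple[str, str]]:
    """Parse the whole file content, ignoring blank lines."""
    return [parse_site_line(line) for line in text.splitlines() if line.strip()]
```

Note that this sketch skips blank lines, which is more forgiving than the warning in step 3.  The warning still applies to the real input file.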

Cowardly Disclaimer:

Use this solution at your own risk.  I’ll take responsibility for my actions (like running scripts on my corporate network written by people I don’t know) and you can take responsibility for yours <g>.

The Script:

Big thanks to dm_4ever for his WMI ping function (you thought I was kidding about “re-utilization”!).  If the WMI version gives you grief, you can try his non-WMI version instead.

 Script Download