Troubleshooting Azure AD Hybrid Join and Intune AutoEnrollMDM
Working with a number of clients recently, we were deploying Self-Service Password Reset from the Windows 10 logon screen ( https://docs.microsoft.com/en-us/azure/active-directory/authentication/tutorial-sspr-windows ) and found that the machines would not Hybrid Join. Using dsregcmd /status we could see that AzureAdJoined still had a value of No, so we went through the following checklist:
Checked Hybrid Join was enabled using the Azure AD Connect wizard — https://docs.microsoft.com/en-us/azure/active-directory/devices/hybrid-azuread-join-managed-domains
Checked the device control Group Policies — https://docs.microsoft.com/en-us/azure/active-directory/devices/hybrid-azuread-join-control
Checked the device settings within the Azure portal — https://docs.microsoft.com/en-us/azure/active-directory/devices/azureadjoin-plan#configure-your-device-settings
Checked the SSO URLs had been added into the local intranet zone (SSPR) — https://docs.micr…
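A quick way to re-test after each checklist item is to kick the hybrid join attempt manually and re-check the device state; a minimal sketch (the Workplace Join task path is the standard Windows one):
# Re-run the scheduled task that performs the hybrid Azure AD join attempt
schtasks /Run /TN "\Microsoft\Windows\Workplace Join\Automatic-Device-Join"
# Then re-check the join state and look for AzureAdJoined : YES under Device State
dsregcmd /status | Select-String "AzureAdJoined"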
Modern Management — Part Seven — Bitlocker
Sorry it’s been a while, but following on from my last post, Modern Management — Part Six — Resetting Autopilot Devices, here is my latest post on Modern Management, covering the deployment of Bitlocker Device Configuration Profiles as part of an Autopilot deployment.
Modern Management — Part One — Autopilot Demo on Hyper-V
Modern Management — Part Two — Office 365 Deployment via Intune
Modern Management — Part Three — Packaging Win32 Application for Intune
Modern Management — Part Four — OneDrive Silent Configuration
Modern Management — Part Five — Windows Updates
Modern Management — Part Six — Resetting Autopilot Devices
Modern Management — Part Seven — Bitlocker
Modern Management — Part Eight — Windows Activation
Modern Management — Part Nine — BGinfo via Intune
Modern Management — Part Ten — Harvesting Autopilot Hardware IDs
Modern Management — Part Eleven — Migrate File Shares to Teams
Modern Management — Part Twelve — Synchronising AD G…
Exchange Transport Rules — Export and Import to EXO
Today I attempted to export transport rules from an Exchange Server (2013) and import these into Exchange Online using the script:
#Set-ExecutionPolicy
Set-ExecutionPolicy Unrestricted -Scope Process -Force
#Set PowerShell to use TLS 1.2
[Net.ServicePointManager]::SecurityProtocol = "tls12"
#Check and create directory
IF (Test-Path C:\Temp\TransportRules) { Write-Host "directory exists" }
ELSE { New-Item -Path C:\Temp\ -Name TransportRules -ItemType Directory }
##### EOP ######
#Load the Exchange Management Shell module
$CallEMS = ". '$env:ExchangeInstallPath\bin\RemoteExchange.ps1'; Connect-ExchangeServer -Auto -ClientApplication:ManagementShell"
Invoke-Expression $CallEMS
#Export transport rules
$file = Export-TransportRuleCollection
Set-Content -Path "C:\Temp\TransportRules\Rules.xml" -Value $file.FileData -Encoding Byte
##### EXOL ######
#Install Modules
Inst…
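The excerpt cuts off at the Exchange Online half; a minimal sketch of the import side, assuming the rules were exported to C:\Temp\TransportRules\Rules.xml as above and the ExchangeOnlineManagement module is installed:
# Connect to Exchange Online and import the exported rule collection
Connect-ExchangeOnline
[Byte[]]$Data = Get-Content -Path "C:\Temp\TransportRules\Rules.xml" -Encoding Byte -ReadCount 0
Import-TransportRuleCollection -FileData $Data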
AADC — 0x8023134a — AttributeValueMustBeUnique
I ran into an error during an SMTP matching exercise while merging Active Directory accounts with existing Azure AD accounts for an Office 365 project I was working on. The account just would not synchronise. Unable to update this object because the following attributes associated with this object have values that may already be associated with another object in your local directory services: [ProxyAddresses SMTP:user.name@domain.com;]. Correct or remove the duplicate values in your local directory. Please refer to http://support.microsoft.com/kb/2647098 for more information on identifying objects with duplicate attribute values. Tracking Id: f6334212-fc15-4eea-9407-xxxxxxxxxxxx ExtraErrorDetails: [{"Key":"ObjectId","Value":["cb866447-5dfb-4fdf-xxxx-xxxxxxxxxxxxx"]},{"Key":"ObjectIdInConflict","Value":["516338b6-8cc4-4d78-xxxx-xxxxxxxxxxxxx"]},{"Key":"AttributeConflictName"…
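To track down which on-premises object already holds the conflicting address, a directory-wide search helps; a sketch, using the placeholder SMTP address from the error above:
# Search AD for any object already carrying the conflicting proxy address (requires the ActiveDirectory module)
Get-ADObject -LDAPFilter "(proxyAddresses=SMTP:user.name@domain.com)" -Properties proxyAddresses | Format-List Name, DistinguishedName, proxyAddresses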
Azure AD SSO — Troubleshooting
If you have configured Azure Active Directory Connect to use Seamless Single Sign-On and are having trouble signing on, ensure the following:
You are logging onto a domain-joined machine connected to the corporate network; the machine must have line of sight to a Domain Controller to request a Kerberos ticket.
The following URLs are added to the Local Intranet zone via GPO (User Configuration\Administrative Templates\Windows Components\Internet Explorer\Internet Control Panel\Security Page, by modifying the "Site to Zone Assignment List"): https://autologon.microsoftazuread-sso.com https://aadg.windows.net.nsatc.net
Enhanced Protected Mode is disabled (Computer Configuration\Computer Policy\Administrative Templates\Windows Components\Internet Explorer\Internet Control Panel\Advanced Page\Turn on Enhanced Protected Mode).
You can use klist purge to purge the Kerberos tickets, then klist get AZUREADSSOACC to ensure that you can receive a Kerberos ticket…
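The ticket check at the end can be run straight from a command prompt in the user's session on the domain-joined machine:
# Clear cached Kerberos tickets, then request the Seamless SSO ticket
klist purge
klist get AZUREADSSOACC
# A successful response includes a ticket for the AZUREADSSOACC computer account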
EOL — Troubleshooting Exchange Online Mailbox Migration Speeds
Working for a client we ran into numerous issues where we were seeing sluggish performance when migrating mailboxes to Exchange Online; here are a few of the troubleshooting steps we went through. (The actual issue was being caused by some Barracuda load balancer devices, which we had to remove from the migration path.)
Ensure the Exchange servers are patched with the latest CU.
Ensure all flood mitigation, SSL offload, traffic inspection and any IP connection limits are removed from the firewall connections to mail.domain.com (including TMG).
Review the migration endpoint and ensure that MaxConcurrentMigrations is set to a reasonable number for your infrastructure; we are currently using 35, with 25 MaxConcurrentIncrementalSyncs (use Get-MigrationEndpoint | fl to check, as shown below).
Ensure that the correct AV exclusions are in place as per — https://docs.microsoft.com/en-us/exchange/anti-virus-software-in-the-operating-system-on-exchange-servers-exchange-2013-he…
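Checking and setting those endpoint values is a one-liner each; a sketch, where the endpoint identity is an assumed example for a typical hybrid setup:
# Review the current concurrency settings on the migration endpoint
Get-MigrationEndpoint | Format-List Identity, MaxConcurrentMigrations, MaxConcurrentIncrementalSyncs
# Apply the values that worked here (tune for your own infrastructure)
Set-MigrationEndpoint "Hybrid Migration Endpoint - EWS (Default Web Site)" -MaxConcurrentMigrations 35 -MaxConcurrentIncrementalSyncs 25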
Exchange Hybrid — Mailbox Permissions — ACLableSyncedObjectEnabled
The key here is to review the guidance at https://docs.microsoft.com/en-gb/exchange/hybrid-deployment/set-up-delegated-mailbox-permissions?redirectedfrom=MSDN#enable-aclable-object-synchronization It is important that ACLableSyncedObjectEnabled is set to $True before you migrate mailboxes, or you will need to reconfigure these manually following the migration of mailboxes to Exchange Online. To enable ACLable object synchronization, run the following on the on-premises Exchange organisation: Set-OrganizationConfig -ACLableSyncedObjectEnabled $True If you have forgotten to enable this, you can change it afterwards by running: #To enable ACLs on a single mailbox, run the following command Get-AdUser "UserMailbox Identity" | Set-AdObject -Replace @{msExchRecipientDisplayType = -1073741818 } #To ena…
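The truncated comment above continues in the linked Microsoft article with an org-wide variant; a sketch per that guidance (run on-premises; the LDAP filter value is the remote user mailbox display type, so verify it against the article before running):
#To enable ACLs on all mailboxes that have been migrated, run the following command
Get-AdUser -LdapFilter "(msExchRecipientDisplayType=-2147483642)" | Set-AdObject -Replace @{msExchRecipientDisplayType = -1073741818}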
SCCM — PS — Detection Method for User APPDATA
I needed a detection method to confirm the presence of a file under the %APPDATALOCAL% directory and a HKCU registry key. Thanks to this post and this one I was able to achieve this.
#Detects if C:\Users\%username%\Appdata\Roaming\folder_name\file_or_subfolder_name exists
Function CurrentUser{
#CurrentUser function converts the username object string "@{username=domain\user}"
#to the exact logon string "user" like the example below
#@{username=DOMAIN\USER}
#@{username DOMAIN\USER}
#DOMAIN\USER}
#DOMAIN\USER
#DOMAIN USER
#USER
$loggedInUserName = get-wmiobject win32_computersystem | select username
$loggedInUserName = [string]$loggedInUserName
$loggedInUserName = $loggedInUserName.Split("=")
$loggedInUserName = $loggedInUserName[1]
$loggedInUserName = $loggedInUserName.Split("}")
$loggedInUserName = $loggedInUserName[0]
$loggedInUserName = $loggedInUserName.Split("\")
$loggedInUserName…
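A hypothetical, tidier sketch of the same idea, splitting the WMI value directly; folder_name and file_or_subfolder_name are the placeholders from the comment above, and SCCM treats any output plus exit code 0 as "detected":
# Resolve the logged-on user from WMI (DOMAIN\user -> user)
$user = (Get-WmiObject Win32_ComputerSystem).UserName -replace '^.*\\'
# Test for the per-user file and signal detection by writing output
if (Test-Path "C:\Users\$user\AppData\Roaming\folder_name\file_or_subfolder_name") { Write-Host "Detected" }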
Office 365 — Deleting Enterprise Vault Items from Exchange Online Mailboxes
I had a number of Exchange Online mailboxes containing legacy Enterprise Vault items which I needed to remove. I managed this with the help of Michel de Rooij’s script on the TechNet gallery (1.72 is the latest version at the time of writing this) — https://gallery.technet.microsoft.com/office/Removing-Messages-by-e0f21615 Also the configuration of a new Role Group within Exchange Online, thanks to another of his articles — https://eightwone.com/2014/08/13/application-impersonation-to-be-or-pretend-to-be/ You will need to save the script into a directory (my example — D:\ExchangeMigration\EVCleanup\) and also install/copy the EWS DLL from the Microsoft Exchange Web Services Managed API 2.2. After you have all this in your directory, from an administrative PowerShell session, change to the directory and then execute: Set-ExecutionPolicy unrestricted $UserCredential = Get-Credential (Enter your Office365 admin…
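The role group from the second article can be created in one line; a sketch, with the group name and member as example values:
# Create a role group granting ApplicationImpersonation to the account running the cleanup script
New-RoleGroup -Name "EV Cleanup Impersonation" -Roles ApplicationImpersonation -Members admin@tenant.onmicrosoft.com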
SSPR 0029 We are unable to reset your password due to an error in your on-premises configuration.
This was one of those annoying ones that took hours (days) with Microsoft to resolve. We’re sorry. We cannot reset your password at this time because of a problem with your organisation’s password reset configuration. There is no further action you can take to resolve this situation. Please contact your admin and ask them to investigate. To learn more about the potential issue, read the article Troubleshoot password writeback. If you’d like, we can contact another administrator in your organisation to reset your password for you. Additional details: SSPR_0029: We are unable to reset your password due to an error in your on-premises configuration. Please contact your admin and ask them to investigate. EVENTID 6329 — An unexpected error has occurred during a password set operation. "ERR_: MMS(5624): E:\bt\863912\repo\src\dev\sync\ma\shared\inc\MAUtils.h(58): Failed getting registry value ‘ADMADoNormalization’, 0x2 B…
Posted Apr 18, 2011 05:17 PM
The tools were pretty old. From the 3.5 installs we had here. Upgrade was from that to 4.1.
From the logs, it appears that the cluster resources never fully moved from node 01 to node 02, or at least node 01 thought it was still the owner when it came back up. The last event log shows a duplicate IP on the network. I know there’s quite a few, but I posted the logs below in sequence if you’d like to see them.
We are using a heartbeat network.
————————————————————————
Event ID: 1129 Source: FailoverClustering Node: ClusterNode02
Cluster network ‘Cluster Network 1’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1126 Source: FailoverClustering Node: ClusterNode02
Cluster network interface ‘ClusterNode 02 — Local Area Connection’ for cluster node ‘ClusterNode 02’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1135 Source: FailoverClustering Node: ClusterNode02
Cluster node ‘ClusterNode 01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1177 Source: FailoverClustering Node: ClusterNode02
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1564 Source: FailoverClustering Node: ClusterNode02
File share witness resource ‘File Share Witness (\\shareSvr\share$)’ failed to arbitrate for the file share ‘\\shareSvr\share$’. Please ensure that file share ‘\\shareSvr\share$’ exists and is accessible by the cluster.
————————————————————————
Event ID: 1561 Source: FailoverClustering Node: ClusterNode02
The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node.
Try starting the cluster service on all nodes in the cluster. If the cluster service can be started on other nodes with the latest copy of the cluster configuration data, this node will be able to subsequently join the started cluster successfully.
If there are no nodes available with the latest copy of the cluster configuration data, please consult the documentation for ‘Force Cluster Start’ in the failover cluster management snapin, or the ‘forcequorum’ startup option. Note that this action of forcing quorum should be considered a last resort, since some cluster configuration changes may well be lost.
————————————————————————
Event ID: 1126 Source: FailoverClustering Node: ClusterNode01
Cluster network interface ‘ClusterNode02 — Local Area Connection’ for cluster node ‘ClusterNode02’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1126 Source: FailoverClustering Node: ClusterNode01
Cluster network ‘Cluster Network 1’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1135 Source: FailoverClustering Node: ClusterNode01
Cluster node ‘ClusterNode02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————
Event ID: 1069 Source: FailoverClustering Node: ClusterNode01
Cluster resource ‘File Share Witness (\\shareSvr\share$)’ in clustered service or application ‘Cluster Group’ failed.
————————————————————————
Event ID: 1564 Source: FailoverClustering Node: ClusterNode01
File share witness resource ‘File Share Witness (\\shareSvr\share$)’ failed to arbitrate for the file share ‘\\shareSvr\share$’. Please ensure that file share ‘\\shareSvr\share$’ exists and is accessible by the cluster.
————————————————————————
Event ID: 1069 Source: FailoverClustering Node: ClusterNode01
Cluster resource ‘Cluster IP Address’ in clustered service or application ‘Cluster Group’ failed.
————————————————————————
Event ID: 1205 Source: FailoverClustering Node: ClusterNode01
The Cluster service failed to bring clustered service or application ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
————————————————————————
Event ID: 1069 Source: FailoverClustering Node: ClusterNode01
Cluster resource ‘IPv4 Static Address 1 (ClusterResource01)’ in clustered service or application ‘ClusterResource01’ failed.
————————————————————————
Event ID: 1205 Source: FailoverClustering Node: ClusterNode01
Cluster IP address resource ‘BAN FrontEnd’ cannot be brought online because a duplicate IP address ‘192.168.8.27’ was detected on the network. Please ensure all IP addresses are unique.
————————————————————————
Event ID: 1205 Source: FailoverClustering Node: ClusterNode01
The Cluster service failed to bring clustered service or application ‘ClusterResource01’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
Exchange 2007 SP1 Cluster Continuous Replication (CCR) clusters built on Windows 2008 most commonly use the quorum model "Node Majority with File Share Witness".
In the past few weeks administrators have noticed that the file share resource is sometimes in a failed state. Under normal circumstances this is not an issue, since both nodes of the cluster are always available (meaning that there are two votes) and the cluster services stay running. Where this has become an issue is when the file share witness resource has failed and a node of the cluster is also unavailable.
Some background….
In Windows 2003 the file share witness is a private property of the "majority node set" quorum resource. If the file share witness was unavailable, an event would be logged but the resource would continue to remain online. (Reference http://support.microsoft.com/?kbid=921181 for information on the file share witness on Windows 2003.) The file share witness resource is only used when necessary to maintain quorum for the cluster. If at that time the witness was still unavailable, the cluster would lose quorum and the cluster service would be terminated.
In Windows 2008, when using the node majority with file share witness quorum model, the file share witness resource is enumerated as an actual resource. It can be seen in failover cluster manager -> cluster core resources.
*Cluster core resources.
Now that the file share witness exists as a resource, the cluster can do additional health checking. One of the health checks is to ensure that the file share witness folder is online and accessible. In the event that the FSW folder is not accessible, the cluster will fail the FSW resource and attempt to bring the FSW online on the other node. If the FSW folder continues to remain unavailable, the cluster will fail the resource. By default, the cluster will attempt to restart any failed resource every 60 minutes (1 hour). If the resource continues to fail to come online, it will remain on the node it failed on until administrator intervention is taken or the resource can be brought online during one of the 60-minute intervals.
*Failed file share witness resource in cluster core resources.
Just like Windows 2003, the file share witness in Windows 2008 is only accessed when it is necessary for the cluster to maintain quorum. If the file share witness resource is in a failed state, and the use of the FSW becomes necessary to maintain quorum, Windows 2008 will not attempt to access the file share witness, resulting in a lost quorum state and termination of cluster services.
Impact to Exchange…
Where this seems to impact Exchange administrators the most is during patch management. During patch management, patches are applied to the hub transport server owning the file share which supports the FSW resource in the cluster. When the server is rebooted, the file share witness is no longer available. The cluster performs status checking and determines that the FSW is not available. At this point the core Cluster Group, containing the FSW, is failed over to the second node. If the hub transport server is available, the file share witness will come online. Most commonly the hub transport server is not yet available, resulting in the file share witness failing, and remaining failed, on the second node. If left alone for 60 minutes, the resource will be automatically restarted, and by this time the hub transport server will be available.
In my experience the issue arises during those 60 minutes. It is possible that a reboot or loss of a cluster node could cause the cluster to lose quorum. If the file share witness resource is failed, and I reboot the node not owning the cluster core resources, this leaves one vote in the cluster. Since the cluster requires a majority of votes to be available, this results in the termination of the cluster service on the remaining node.
It is important that administrators understand this difference between Windows 2003 and Windows 2008, and account for it in how they manage their cluster.
What can be done to alleviate this condition?
As you have already read, the cluster automatically attempts to restart failed resources every 60 minutes. This time limit is the default setting for all resources. The rationale here is that if a resource is failing, the administrator should have the opportunity to troubleshoot, identify, and correct the issue causing the resource to fail. From a monitoring standpoint, the entire process is: monitoring software bubbles up the alert, the helpdesk notifies the admin, the admin accesses the machine, the admin reviews the logs, and the admin takes appropriate action(s). On the other hand, this process may not work, as the admin may be unavailable, in which case the cluster will still try to self-heal if no administrator intervention is taken. So essentially the first method of alleviating this condition is to understand the defaults, how the solution operates, and what to look for before rebooting nodes (all cluster core resources healthy), and make no configuration changes.
On the properties of each resource is the retry interval for restarting failed resources. The minimum value that is allowed here is 15 minutes. This would cause the cluster, in the event that the resource is failed, to be more aggressive in attempting to bring the resource online. From an Exchange 2-node perspective, this would limit the failure window to 15 minutes versus 60 minutes (assuming the witness is available after 15 minutes). To change this value:
1) Open the Failover Cluster Manager and connect to the cluster.
2) Select the cluster name at the top of the left hand pane.
3) In the center pane of the MMC, expand Cluster Core Resources.
4) Get the properties of the File Share Witness (\\Path\Share) resource.
5) Select the Policies tab on the resource.
6) On the Policies tab you will see the option "If all the restart attempts fail, begin restarting again after the specified period (hh:mm)", with a default of 01:00 (1 hour / 60 minutes). Here you could adjust this value to 15 minutes using the input box. If a change is made, select Apply -> OK to exit the properties.
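On Windows 2008 R2 and later the same change can be scripted through the FailoverClusters PowerShell module; a sketch, with the resource name as an example (the property is expressed in milliseconds, so 15 minutes = 900000):
Import-Module FailoverClusters
# RetryPeriodOnFailure drives the "begin restarting again after" interval shown in the GUI
$fsw = Get-ClusterResource -Name "File Share Witness (\\server\share)"
$fsw.RetryPeriodOnFailure = 900000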
Another method would be some way to issue an online command to the group, for example through a script. Post reboot it would be possible to issue a command, from the server with the failover cluster manager installed, similar to this: cluster.exe "cluster name" group "Cluster Group" /online [note: "Cluster Group" is the name of the group holding the cluster core resources; "cluster name" should be the FQDN of the cluster management name.] For example, cluster.exe 2008-Cluster3.exchange.msft group "Cluster Group" /online. If this command is run manually, the following output is returned in the command window:
Bringing resource group ‘Cluster Group’ online…
Group Node Status
——————– ————— ——
Cluster Group 2008-Node5 Online
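The PowerShell equivalent, again assuming the R2-era FailoverClusters module, would be a sketch like:
Import-Module FailoverClusters
# Bring the cluster core resources group online on the named cluster
Start-ClusterGroup -Name "Cluster Group" -Cluster 2008-Cluster3.exchange.msft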
DO NOT use any other distributed method, such as distributed file systems, to host the file share witness.
What will be done to correct this condition?
We have worked with the Windows Product Team to bring this behavior to their attention. We are working on a possible design change in the behavior of the file share witness on Windows 2008. As this progresses I will continue to update this blog with more information.
Relevant Event IDs…
- System Log – Event ID 1562 – Indicates that the file share witness is unavailable.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 1/7/2009 8:07:24 AM
Event ID: 1562
Task Category: File Share Witness Resource
Level: Warning
Keywords:
User: SYSTEM
Computer: 2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed a periodic health check on file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.
- System Log – Event ID 1069 – Indicates the file share witness resource is in failed state.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 1/7/2009 8:07:24 AM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: 2008-Node1.exchange.msft
Description:
Cluster resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ in clustered service or application ‘Cluster Group’ failed.
- System Log – Event ID 1564 – Indicates that the cluster cannot access the file share witness directory.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 1/7/2009 8:07:25 AM
Event ID: 1564
Task Category: File Share Witness Resource
Level: Critical
Keywords:
User: SYSTEM
Computer: 2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed to arbitrate for the file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.
- System Log – Event ID 1205 – Indicates that the cluster core resources group («Cluster Group») is not completely online or offline due to a failure of the file share witness resource.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 1/7/2009 8:07:25 AM
Event ID: 1205
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: 2008-Node1.exchange.msft
Description:
The Cluster service failed to bring clustered service or application ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
*Thanks to Chuck Timon, Sr Support Escalation Engineer, Platforms CSS for assisting in reviewing and modifying this information.
=======================================
Updated Wednesday – 08/19/09
Jeff Guillet – a Microsoft Windows MVP – has posted a sample batch file for starting the FSW and moving the cluster core resources group to a desired node. The instructions also include how to schedule the batch file as a startup script.
http://www.expta.com/2009/06/failure-of-fsw-causes-cluster-group-to.html
=======================================
=======================================
Updated Wednesday 12/14/2011
I failed to update this blog post previously with a Windows hotfix that corrects this behavior and makes the workarounds unnecessary.
http://blogs.technet.com/b/timmcmic/archive/2010/02/15/kb978790-update-to-windows-2008-to-change-the-failure-behavior-of-the-file-share-witness-quorum-resource.aspx
=======================================
Hi John,
Thank you for your help
The cluster.log on both nodes was not very useful. The log on the active node has no events logged between 1.04 and 22.04. The log on the passive node has some events logged on the 4th and 8th of April. Neither log has any events for the time of the failures.
The Failover Cluster Operational log also appears to have missed some periods of time, although not that large: no events were logged between 1:08 AM on 17.04 and 3:29 PM on 20.04. The first timestamp coincides with the time when the cluster recovered from a failure; the second timestamp is when the backup started.
The Windows System event log seems to be the most useful. I'm not sure if the cluster service crashed and that caused the disconnection from the active node, or the node lost connectivity to the quorum and that caused the cluster service to terminate. It also looks like there is some pattern in the time of the fault; occurrences in the last 2 weeks are:
23.04 — From 1:05:18 AM to 1:07:51 AM
17.04 — From 1:06:18 AM to 1:07:55 AM
14.04 — From 11:30:37 PM to 11:33:09 PM
Regards,
Ilian
Windows System Log
————————————————————————————
Level Date and Time Source Event ID Task Category
Information 23/04/2009 1:07:54 a.m. Microsoft-Windows-Time-Service 37 None The time provider NtpClient is currently receiving valid time data from nzsakldc01.nzsakl.bhp.com.au (ntp.d|0.0.0.0:123->152.153.40.60:123).
Information 23/04/2009 1:07:51 a.m. Tcpip 4201 None The system detected that network adapter Local Area Connection* 12 was connected to the network, and has initiated normal operation. (logged twice)
Information 23/04/2009 1:07:52 a.m. Microsoft-Windows-Time-Service 37 None The time provider NtpClient is currently receiving valid time data from nzsakldc01.nzsakl.bhp.com.au (ntp.d|0.0.0.0:123->152.153.40.60:123).
Information 23/04/2009 1:07:51 a.m. Service Control Manager 7036 None The Cluster Service service entered the running state.
Warning 23/04/2009 1:07:01 a.m. Microsoft-Windows-Time-Service 131 None NtpClient was unable to set a domain peer to use as a time source because of DNS resolution error on ‘nzsakldc01.nzsakl.bhp.com.au’. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9).
Critical 23/04/2009 1:06:55 a.m. Microsoft-Windows-FailoverClustering 1564 File Share Witness Resource File share witness resource ‘’ failed to arbitrate for the file share ‘\\STLAKLXCH03\Quorum’. Please ensure that file share ‘\\STLAKLXCH03\Quorum’ exists and is accessible by the cluster.
Error 23/04/2009 1:06:56 a.m. Service Control Manager 7031 None The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
Error 23/04/2009 1:06:56 a.m. Service Control Manager 7024 None The Cluster Service service terminated with service-specific error 5925 (0x1725).
Information 23/04/2009 1:06:55 a.m. Service Control Manager 7036 None The Cluster Service service entered the stopped state.
Critical 23/04/2009 1:06:49 a.m. Microsoft-Windows-FailoverClustering 1177 None The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Error 23/04/2009 1:06:48 a.m. Microsoft-Windows-FailoverClustering 1069 Resource Control Manager Cluster resource ‘File Share Witness (\\STLAKLXCH03\Quorum)’ in clustered service or application ‘Cluster Group’ failed.
Critical 23/04/2009 1:06:47 a.m. Microsoft-Windows-FailoverClustering 1564 File Share Witness Resource File share witness resource ‘File Share Witness (\\STLAKLXCH03\Quorum)’ failed to arbitrate for the file share ‘\\STLAKLXCH03\Quorum’. Please ensure that file share ‘\\STLAKLXCH03\Quorum’ exists and is accessible by the cluster.
The same 1069/1564 pair repeats with identical text at 1:06:40, 1:06:32, 1:06:24, 1:06:15, 1:06:08/1:06:07, 1:05:59, 1:05:51, 1:05:44, 1:05:37 and 1:05:31 a.m.
Information 23/04/2009 1:05:31 a.m. Service Control Manager 7036 None The Windows Modules Installer service entered the running state.
Information 23/04/2009 1:05:21 a.m. Tcpip 4201 None The system detected that network adapter Local Area Connection* 12 was connected to the network, and has initiated normal operation. (logged twice)
Information 23/04/2009 1:05:22 a.m. Microsoft-Windows-Time-Service 37 None The time provider NtpClient is currently receiving valid time data from nzsakldc01.nzsakl.bhp.com.au (ntp.d|0.0.0.0:123->152.153.40.60:123).
Critical 23/04/2009 1:05:18 a.m. Microsoft-Windows-FailoverClustering 1135 None Cluster node ‘STLAKLMB01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
————————————————————————————
-> I was working on an Always On failover issue. The Always On availability group was failing over every day, any time between 22:00 and 23:00.
-> The below messages were found in the Event Viewer logs:
Log Name: Application
Source: MSSQL$SQL01
Date: 6/01/2020 10:29:21 PM
Event ID: 41144
Task Category: Server
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The local availability replica of availability group ‘JBAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:29:18 PM
Event ID: 1561
Task Category: Startup/Shutdown
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER1.JBS.COM
Description:
Cluster service failed to start because this node detected that it does not have the latest copy of cluster configuration data. Changes to the cluster occurred while this node was not in membership and as a result was not able to receive configuration data updates.
Guidance:
Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. This node will then be able to join the cluster and will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the ‘Start-ClusterNode -FQ’ Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node’s copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster being lost.
Log Name: System
Source: Service Control Manager
Date: 6/01/2020 10:29:21 PM
Event ID: 7024
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The Cluster Service service terminated with the following service-specific error:
A quorum of cluster nodes was not present to form a cluster.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/01/2020 11:45:47 AM
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:28:25 PM
Event ID: 1135
Task Category: Node Mgr
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
Cluster node ‘JBSERVER1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
-> The below messages were found in Cluster.log:
[System] 00002420.00002004::2020/01/01-00:40:48.745 DBG Cluster node ‘JBSERVER3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00002004::2020/01/01-00:40:48.746 DBG Cluster node ‘JBSERVER2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00004598::2020/01/01-00:40:48.809 DBG The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was ‘1359’.
[System] 00002420.0000438c::2020/01/01-00:40:49.173 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
[System] 00002420.00005e5c::2020/01/01-00:40:49.174 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
-> The messages indicate that the Always On availability group failover may be due to a network issue. I requested help from my networking team and was advised that there were no network issues.
-> I configured verbose logging for the Always On availability group using this article and generated cluster.log the next time the issue happened.
-> I started a continuous ping with a timestamp embedded into it, running until the issue occurred again, using the below PowerShell commands. From JBSERVER1 I pinged JBSERVER2, JBSERVER3 and the file share witness server; from JBSERVER2 I pinged JBSERVER1, JBSERVER3 and the file share witness server; from JBSERVER3 I pinged JBSERVER1, JBSERVER2 and the file share witness server.
ping.exe -t JBSERVER1|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER1.txt
ping.exe -t JBSERVER2|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER2.txt
ping.exe -t JBSERVER3|Foreach{"{0} - {1}" -f (Get-Date),$_} > C:\temp\ping\JBSERVER3.txt
-> The issue happened the next day; below are the SQL Server error log details:
2020-01-06 22:28:16.580 spid22s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is shutting down. For more information, see the SQL Server
2020-01-06 22:29:02.950 spid47s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For
-> I checked the ping results and found “Request timed out” entries around the time of the failover.
-> I provided these results to the network team and asked why there were “Request timed out” responses if there were no network issues.
-> While the network team was investigating, I asked my infrastructure team to check whether the network card and firmware drivers were up to date. I got an update that they were current.
-> I also wanted to rule out the anti-virus software, and asked to uninstall it to verify, but this request was denied.
-> In the meantime, the application team requested a temporary workaround or fix until the network team completed their troubleshooting.
-> I advised them that we could increase the values of the below properties until we got to the root cause of the network issue. I clearly advised the application team that the default values for these properties are the recommended values, and that raising them as shown below can increase the RTO (Recovery Time Objective), since there will be a delay in failover in a genuine server/SQL-down scenario. It just masks or delays the problem and will never completely fix the issue; the best thing to do is find the root cause of the heartbeat failures and get it fixed. The application team understood the risk and accepted the increase as it would be temporary.
PS C:\Windows\system32> (get-cluster).SameSubnetDelay = 2000
PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20
PS C:\Windows\system32> (get-cluster).CrossSubnetDelay = 4000
PS C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20
-> You can check what these values are before and after the change using the below command:
PS C:\Windows\system32> get-cluster | fl *subnet*
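For reference, a sketch for restoring the defaults afterwards; the values below are the Windows Server 2012 R2 defaults, so check the output of the command above on your own build before relying on them:
# Restore the cluster heartbeat settings to their (assumed 2012 R2) defaults
(Get-Cluster).SameSubnetDelay = 1000
(Get-Cluster).SameSubnetThreshold = 5
(Get-Cluster).CrossSubnetDelay = 1000
(Get-Cluster).CrossSubnetThreshold = 5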
-> This gave us some temporary relief. After a week, the infrastructure team advised that a VM-level backup had been running at that time every day through Commvault, which can freeze the servers for 4 or 5 seconds. They suspended it, as it was no longer required.
-> At the same time, the network team advised that they had fixed the network issue and asked us to monitor.
-> I changed SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay and CrossSubnetThreshold back to their default values. There were no issues after this. Everyone was happy!
Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided “AS IS” with no warranties, and confer no rights.