-> I was working on an Always On failover issue. The Always On availability group was failing over every day, sometime between 22:00 and 23:00.
-> The following messages were found in the Event Viewer logs:
Log Name: Application
Source: MSSQL$SQL01
Date: 6/01/2020 10:29:21 PM
Event ID: 41144
Task Category: Server
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The local availability replica of availability group ‘JBAG’ is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: 41005). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:29:18 PM
Event ID: 1561
Task Category: Startup/Shutdown
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER1.JBS.COM
Description:
Cluster service failed to start because this node detected that it does not have the latest copy of cluster configuration data. Changes to the cluster occurred while this node was not in membership and as a result was not able to receive configuration data updates.
Guidance:
Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. This node will then be able join the cluster and will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the ‘Start-ClusterNode -FQ’ Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node’s copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.
Log Name: System
Source: Service Control Manager
Date: 6/01/2020 10:29:21 PM
Event ID: 7024
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: JBSERVER1.JBS.COM
Description:
The Cluster Service service terminated with the following service-specific error:
A quorum of cluster nodes was not present to form a cluster.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/01/2020 11:45:47 AM
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 6/01/2020 10:28:25 PM
Event ID: 1135
Task Category: Node Mgr
Level: Critical
Keywords:
User: SYSTEM
Computer: JBSERVER2.JBS.COM
Description:
Cluster node ‘JBSERVER1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
-> The following messages were found in Cluster.log:
[System] 00002420.00002004::2020/01/01-00:40:48.745 DBG Cluster node ‘JBSERVER3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00002004::2020/01/01-00:40:48.746 DBG Cluster node ‘JBSERVER2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
[System] 00002420.00004598::2020/01/01-00:40:48.809 DBG The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was ‘1359’.
[System] 00002420.0000438c::2020/01/01-00:40:49.173 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
[System] 00002420.00005e5c::2020/01/01-00:40:49.174 DBG The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
-> The messages indicate that the Always On availability group failover may be due to a network issue. I requested help from the networking team and was advised that there were no network issues.
-> I configured verbose logging for the Always On availability group using this article and generated cluster.log the next time the issue occurred.
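For reference, a minimal sketch of one way to capture more detail: raising the cluster log verbosity and dumping the log after the next failover. The destination folder is illustrative, and this is only the cluster-log route, not the only source of verbose Always On diagnostics.

```powershell
# Raise the cluster log verbosity (5 = most verbose; the default is 3).
# This increases log volume, so revert it once the capture is done.
Set-ClusterLog -Level 5

# After the next failover, dump the cluster log from every node.
# -UseLocalTime makes entries line up with the SQL Server error log.
Get-ClusterLog -UseLocalTime -Destination C:\temp\ClusterLogs

# Revert verbosity to the default once the log is captured.
Set-ClusterLog -Level 3
```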
-> I started a continuous ping with a timestamp embedded in the output, to run until the issue occurred again, using the PowerShell commands below. From JBSERVER1 I pinged JBSERVER2, JBSERVER3, and the file share witness server; from JBSERVER2 I pinged JBSERVER1, JBSERVER3, and the file share witness server; from JBSERVER3 I pinged JBSERVER1, JBSERVER2, and the file share witness server.
ping.exe -t JBSERVER1 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER1.txt
ping.exe -t JBSERVER2 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER2.txt
ping.exe -t JBSERVER3 | ForEach-Object { "{0} - {1}" -f (Get-Date), $_ } > C:\temp\ping\JBSERVER3.txt
-> The issue happened the next day; below are the SQL Server error log details:
2020-01-06 22:28:16.580 spid22s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is shutting down. For more information, see the SQL Server
2020-01-06 22:29:02.950 spid47s The state of the local availability replica in availability group ‘JBAG’ has changed from ‘RESOLVING_NORMAL’ to ‘SECONDARY_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For
-> I checked the ping results and noticed "Request timed out" entries around the failover time.
-> I provided these results to the network team and asked why there was a "Request timed out" if there were no network issues.
-> While the network team was investigating, I asked the infrastructure team to check whether the network card firmware and drivers were up to date. I was advised that they were the latest.
-> I also wanted to rule out the anti-virus software, so I requested to uninstall it temporarily and verify. That request was denied.
-> In the meantime, the application team requested a temporary workaround or fix until the network team completed their troubleshooting.
-> I advised them that we could increase the values of the properties below until we got to the root cause of the network issue. I clearly advised the application team that the default values for these properties are the recommended values, and that raising them as shown below increases the RTO (Recovery Time Objective), because failover will be delayed in a genuine server or SQL Server outage. It only masks or delays the problem; it will never fix it. The best thing to do is find the root cause of the heartbeat failures and get it fixed. The application team understood the risk and accepted the increase as a temporary measure.
PS C:\Windows\system32> (get-cluster).SameSubnetDelay = 2000
PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20
PS C:\Windows\system32> (get-cluster).CrossSubnetDelay = 4000
PS C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20
-> You can check these values before and after the change using the command below:
PS C:\Windows\system32> get-cluster | fl *subnet*
-> This gave us some temporary relief. After a week, the infrastructure team advised that a VM-level backup via Commvault had been running at that time every day, which can freeze the servers for 4 or 5 seconds. They suspended it, as it was no longer required.
-> At the same time, the network team advised that they had fixed the network issue and asked us to monitor.
-> I changed SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, and CrossSubnetThreshold back to their default values. There were no issues after this. Everyone was happy!
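Reverting is done the same way the values were raised. A sketch, assuming Windows Server 2012 R2 defaults; on Windows Server 2016 and later, SameSubnetThreshold defaults to 10 and CrossSubnetThreshold to 20, so confirm the defaults for your OS version before applying.

```powershell
# Revert the heartbeat settings to their defaults (2012 R2 values shown).
(Get-Cluster).SameSubnetDelay      = 1000   # ms between heartbeats
(Get-Cluster).SameSubnetThreshold  = 5      # missed heartbeats before a route is marked down
(Get-Cluster).CrossSubnetDelay     = 1000
(Get-Cluster).CrossSubnetThreshold = 5

# Verify the change took effect.
Get-Cluster | Format-List *subnet*
```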
Thank You,
Vivek Janakiraman
Disclaimer:
The views expressed on this blog are mine alone and do not reflect the views of my company or anyone else. All postings on this blog are provided "AS IS" with no warranties, and confer no rights.
Today, I had an issue with one node of a cluster. After installing the DPM agent on one of the nodes, the server would not join the cluster. An error like this is not something to be happy about.
Cluster node ‘ServerName’ was removed from the active failover cluster membership.!
From the cluster logs, I could see a lot of event ID 1135:
Cluster node ‘ServerName’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Tracking down the system logs from the server itself, I could see a few other errors:
Event IDs 1146 and 1070.
All in all, no usable information from these logs.
Troubleshooting
To get the correct information and logs, I used the Get-ClusterLog PowerShell command to generate a log file for each member of the cluster.
The one I used on a healthy node is:
Get-ClusterLog -TimeSpan 5 -Destination \\Node1\C$\ClusterLogs\
This initiates cluster log creation on every cluster member, covering the last 5 minutes, and collects the files to the same destination. During this period, you should try to start the Cluster service on the node that is causing the issue.
After the command completes, and you have tried to start the Cluster service on the problem node, you will end up with one cluster log file per node.
I analyzed the log file from the node that was having the issue (as you can see, that log file is the largest in the picture above).
From the logs provided by Get-ClusterLog, I found that one of the NICs (iSCSI B2 in this case) was causing the issue.
ERR mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'(Node5 – iSCSI B2)
WARN [DM] Aborting group transaction 80:80:5843+1
ERR [CORE] Node 5: exception caught (183)’ because of ‘Gum handler completed as failed’
ERR Exception in the PostForm is fatal (status = 183)
WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
Resolution.
After I identified what could be the issue, I tried a simple fix: renaming the affected NIC adapter (in this case iSCSI B2) and starting the Cluster service again.
After a few seconds, the service was running and the node had rejoined the cluster.
After you confirm that the cluster node is operating normally, you can stop the Cluster service from the Failover Cluster Manager console and rename the NIC back so that it matches the rest of the nodes.
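A sketch of that sequence in PowerShell, assuming the adapter and node names from the log above; Rename-NetAdapter comes from the NetAdapter module available on Windows Server 2012 and later.

```powershell
# Rename the NIC the cluster log flagged (names here are from this case --
# substitute your own).
Rename-NetAdapter -Name "iSCSI B2" -NewName "iSCSI B2 - renamed"

# Try to bring the node back into the cluster.
Start-ClusterNode -Name "Node5"

# Once the node is confirmed stable, stop the cluster service, rename the
# NIC back to match the other nodes, and start the service again.
Stop-ClusterNode -Name "Node5"
Rename-NetAdapter -Name "iSCSI B2 - renamed" -NewName "iSCSI B2"
Start-ClusterNode -Name "Node5"
```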
Hope it will help someone.
Event ID 1135 indicates that one or more Cluster nodes were removed from the active failover cluster membership.
- How do I find cluster event logs?
- How do I check my cluster failover time?
- What is Failover Clustering in Windows Server 2016?
- How do I check Windows cluster status?
- What is the event ID for cluster failover?
- How do I monitor Windows failover cluster?
- Where are Microsoft cluster logs?
- How do you check failover in Alwayson availability groups?
- How check if node is active in cluster?
- How do I check cluster resource status?
- How do you check if the server is clustered?
How do I find cluster event logs?
To locate the log, in Event Viewer, expand Applications and Services Logs, expand Microsoft, expand Windows, and then expand FailoverClustering. The log file is stored in systemroot\system32\winevt\Logs.
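The same events can be pulled from PowerShell without the Event Viewer UI; a sketch using Get-WinEvent (the event ID filter picks up the node-removal events discussed above):

```powershell
# Recent events from the FailoverClustering operational channel.
Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 50

# The critical events (1135, 1146, ...) land in the System log; filter for
# node-removal events specifically.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1135
} -MaxEvents 20
```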
How do I check my cluster failover time?
The time of the failover will be the same as the start time of the new error log file. The only way to tell from the error log whether it was a restart or a failover is to look for this message in the current log: "The NETBIOS name of the local node that is running the server is 'XXXXXXXXXX'."
What is Failover Clustering in Windows Server 2016?
A failover cluster is a group of independent computers that work together to increase the availability and scalability of clustered roles (formerly called clustered applications and services).
How do I check Windows cluster status?
Expand the cluster menu using the pop-out arrow to the left of the cluster name; this presents the various options available to you. You will see 5 menu items below the expanded cluster name; select the "Roles" option from the menu.
What is the event ID for cluster failover?
Event ID 1135 indicates that one or more Cluster nodes were removed from the active failover cluster membership.
How do I monitor Windows failover cluster?
Add a Failover Cluster Monitor
Once the Windows agent installation is complete, the agent will auto-discover and add the failover cluster for monitoring. Go to Server > Microsoft Failover Cluster to view the performance metrics for the failover cluster monitor.
Where are Microsoft cluster logs?
Default location is C:\Windows\Cluster\Reports.
How do you check failover in Alwayson availability groups?
Automatic failover in SQL Server Always On Availability Groups: this type of failover occurs automatically if the primary replica goes down. To verify that your AGs support automatic failover, right-click the availability group and open its properties; the dialog shows the failover configuration.
How check if node is active in cluster?
In “Failover Cluster Management” console, you can see the current active node under the label “Current Owner” from summary screen.
How do I check cluster resource status?
Open a command prompt and run "cluster group" to list all the available resource groups of the SQL Server failover cluster. This command lists the Group, Node, and Status of each group.
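cluster.exe is deprecated on newer versions of Windows; a PowerShell sketch of the equivalent (the role name in the second command is illustrative):

```powershell
# List every cluster group with its name, owner node, and state --
# the same information "cluster group" used to print.
Get-ClusterGroup

# Narrow to a specific role, e.g. a SQL Server failover cluster instance.
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" |
    Format-List Name, OwnerNode, State
```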
How do you check if the server is clustered?
You can query each node to see if it has the failover cluster feature installed, but that does not mean that it is part of a cluster. get-cluster -domain <domainname> will return a list of the clusters in your domain. get-clusternode -cluster <clustername> will return a list of the nodes in a cluster.
Issue Description:
The cluster is unable to communicate with the DC. The cluster runs on nodes ABSVS3 and ABSVS4, each running a copy of Microsoft Windows Server 2012 R2 Datacenter, version 6.3.9600 build 9600.
Initial Description:
>> As we know, in this case the resources fail over from one node to another. This generally happens when the node on which the resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running gets evicted from failover cluster membership (event ID 1135), which makes the resources fail over to another node.
Why is Event ID 1135 logged?
This event will be logged on all nodes in the Cluster except for the node that was removed. The reason for this event is because one of the nodes in the Cluster marked that node as down. It then notifies all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.
What caused the node to be marked down?
All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to Allow cluster network communication on this network. The nodes will send out heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes and then a response is sent back. Each node in the Cluster has its own heartbeats that it is going to monitor to ensure the network is up and the other nodes are up. The example below should help clarify this:
If any one of these packets are not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a request and receives a response from W2K8-R2-NODE1 to a heartbeat packet so it determines the network and the node is up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and W2K8-R2-NODE1 does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can have W2K8-R2-NODE1 show the network as down until another heartbeat request is received.
By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.
If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
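The detection arithmetic above can be sketched as follows, using the 2008/2008 R2 defaults the text describes (one heartbeat per second, five missed heartbeats tolerated):

```powershell
# Failure-detection window = heartbeat delay (ms) x missed-heartbeat threshold.
$sameSubnetDelayMs   = 1000  # default SameSubnetDelay: one heartbeat per second
$sameSubnetThreshold = 5     # default SameSubnetThreshold: five misses tolerated

$windowSeconds = ($sameSubnetDelayMs * $sameSubnetThreshold) / 1000
"A route is marked down after roughly $windowSeconds seconds without responses"
```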
Reference :
Having a problem with nodes being removed from active Failover Cluster membership?
http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx
_____________________________________________________________________________________
- Checked the logs of the VM that was crashing and found that the machine crashed at:
_________________________________________________________________________
Log Name: System
Source: EventLog
Date: 6/8/2016 1:39:07 AM
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SPSQUEEN.abc.local
Description:
The previous system shutdown at 21:06:45 on 07/06/2016 was unexpected.
_________________________________________________________________________
- As per this log, the issue occurred at 21:06:45 on 07/06/2016.
- Checked the events on the cluster at the time of issue.
System Information: ABSVS3
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABSVS3
System Manufacturer HP
System Model ProLiant DL380 Gen9
System Type x64-based PC
System SKU K8P38A
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P89, 27/12/2015
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:51 PM | Error | ABSVS3.abc.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
- Checked the events and found that the network started going down after the backup job started on the server.
6/7/2016 | 9:06:53 PM | Critical | ABSVS3.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABSVS4’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:55 PM | Warning | ABSVS3.abc.local | 9 | bxfcoe | The SAN link is down for port WWN 20:00:2C:44:FD:99:F5:B9. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:55 PM | Warning | ABSVS3.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume8. (A device which does not exist was specified.)
6/7/2016 | 9:06:56 PM | Warning | ABSVS3.abc.local | 4 | l2nd | HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS3.abc.local | 22 | Microsoft-Windows-Hyper-V-VmSwitch | Media disconnected on NIC /DEVICE/{406F2556-68B8-466C-A934-13988D1727B9} (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).
6/7/2016 | 9:06:56 PM | Error | ABSVS3.abc.local | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABSVS3 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS3’ on network ‘Cluster Network 2’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:56 PM | Error | ABSVS3.abc.local | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:08:48 PM | Error | ABSVS3.abc.local | 1291 | NIC Agents | NIC Agent: Connectivity has been lost for the NIC in slot 0, port 1. [SNMP TRAP: 18012 in CPQNIC.MIB]
6/7/2016 | 9:08:49 PM | Warning | ABSVS3.abc.local | 1014 | Microsoft-Windows-DNS-Client | Name resolution for the name _kerberos._tcp.Default-First-Site-Name._sites.dc._msdcs.abc.local. timed out after none of the configured DNS servers responded.
Application Events:
- Checked the event logs and found that the backup job started at 9:00:02 PM and failed at 9:09:23 PM.

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:00:02 PM | Information | ABSVS3.abc.local | 5632 | BackupAssist | Starting Job ‘DailyDataBackup’ for scheduled time: 07/06/2016 21:00 Job Method: File Replication Destination: Network location Job Execution ID: 5.454 Tag:FtpY1Ml3VmPO+DP9lqNlwkenKQK/EMTsA/1IVBnw6fw=
6/7/2016 | 9:09:23 PM | Error | ABSVS3.abc.local | 5634 | BackupAssist | Backup job DailyDataBackup failed with errors. Information: Could not copy directory attributes Ultra critical error: The network path was not found Destination: Network location Bytes: 123781013031 Files: 108217 Start time: 07/06/2016 21:00:07 End time: 07/06/2016 21:09:14 Duration: 00:09:07.4098342 Job Execution ID: 5.454
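The correlation called out in the notes above (backup start, then network failure, then backup failure) is easier to see as offsets from the backup start. A minimal Python sketch using the timestamps from the event tables above (event descriptions are abbreviated):

```python
from datetime import datetime

# Timestamps taken from the system and application event tables above (6/7/2016).
events = [
    ("2016-06-07 21:00:02", "BackupAssist 5632: job DailyDataBackup started"),
    ("2016-06-07 21:06:51", "FailoverClustering 5120: CSV 'Volume2' paused (c000020c)"),
    ("2016-06-07 21:06:53", "FailoverClustering 1135: node ABSVS4 removed from membership"),
    ("2016-06-07 21:06:56", "FailoverClustering 1130: 'Cluster Network 2' is down"),
    ("2016-06-07 21:09:23", "BackupAssist 5634: job DailyDataBackup failed"),
]

def minutes_since_start(events):
    """Return each event paired with its offset (in minutes) from the first event."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t0 = datetime.strptime(events[0][0], fmt)
    return [((datetime.strptime(ts, fmt) - t0).total_seconds() / 60, desc)
            for ts, desc in events]

for offset, desc in minutes_since_start(events):
    print(f"+{offset:5.1f} min  {desc}")
```

The output makes the sequence plain: the CSV pause and node eviction land roughly seven minutes into the backup window, and the job fails about two and a half minutes after the network drops.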
List of outdated drivers:

Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence
5/22/2013 22:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04 | (1.1:303.0) | (1.1:303.0) | Certit PTY LTD | VHD Virtual Disk Driver
3/1/2013 1:31 | (4.1:0.2980) | (4.1:0.2980) | Riverbed Technology, Inc. | npf.sys (NT5/6 AMD64) Kernel Driver
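One way to produce a list like the one above is to flag drivers whose build date exceeds some age cutoff. A minimal sketch using the dates from the table; the three-year threshold is an arbitrary illustrative choice, not a vendor recommendation:

```python
from datetime import datetime

# Driver timestamps copied from the table above.
drivers = [
    ("2/12/2010 23:33", "HP ProLiant iLO 3 PSHED Plugin Driver"),
    ("2/18/2014 16:02", "SAV On-Access and HIPS for Windows Vista (AMD64)"),
    ("7/28/2014 15:26", "Sophos Web Intelligence"),
    ("5/22/2013 22:41", "HP ProLiant iLO 3/4 Management Controller Core Driver"),
    ("11/24/2013 2:26", "HP ProLiant iLO 3/4 Channel Interface Driver"),
    ("2/26/2014 13:04", "VHD Virtual Disk Driver"),
    ("3/1/2013 1:31", "npf.sys (NT5/6 AMD64) Kernel Driver"),
]

def outdated(drivers, as_of, max_age_days=3 * 365):
    """Return the names of drivers whose build date is older than max_age_days."""
    return [name for stamp, name in drivers
            if (as_of - datetime.strptime(stamp, "%m/%d/%Y %H:%M")).days > max_age_days]

# Evaluated as of the incident date (6/7/2016).
stale = outdated(drivers, datetime(2016, 6, 7))
```

With the illustrative three-year cutoff, the iLO 3 PSHED plugin, the iLO 3/4 management controller core driver, and npf.sys would be flagged; in practice the cutoff should follow the vendor's support lifecycle.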
_________________________________________________________________________________________
System Information: ABSVS4
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABSVS4
System Manufacturer HP
System Model ProLiant DL380 Gen9
System Type x64-based PC
System SKU K8P38A
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P89, 27/12/2015
System Events:
- Checked the logs and found that node ABSVS3 was removed from the active cluster membership at the time of the issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:53 PM | Critical | ABSVS4.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABSVS3’ was removed from the active failover cluster membership.
- After this we found that the network started going down.
6/7/2016 | 9:06:55 PM | Warning | ABSVS4.abc.local | 9 | bxfcoe | The SAN link is down for port WWN 20:00:2C:44:FD:99:D2:89. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 4 | q57nd60a | HP Ethernet 1Gb 4-port 331i Adapter: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Error | ABSVS4.abc.local | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABSVS4 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS4’ on network ‘Cluster Network 2’ failed.
6/7/2016 | 9:06:56 PM | Error | ABSVS4.abc.local | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 4 | l2nd | HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 22 | Microsoft-Windows-Hyper-V-VmSwitch | Media disconnected on NIC (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).
Application Events:
- Checked the Application logs but was not able to find any events at the time of the issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence
5/22/2013 22:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04 | (1.1:303.0) | (1.1:303.0) | Certit PTY LTD | VHD Virtual Disk Driver
_________________________________________________________________________
Conclusion:
- After analyzing the logs, we can see that there were various network failures on the cluster, due to which the node got evicted and the virtual machines on the cluster crashed. We can also see that the issue started after the backup job initiated. Kindly uninstall the backup utility and monitor the machine.
- For monitoring purposes, kindly uninstall the antivirus from the cluster as well.
- Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
We need to investigate the network connectivity issues with the help of the in-house networking team. To avoid this issue in the future, the most critical part is to diagnose and investigate the recurring network connectivity issues on the cluster networks.
- Check the network adapters, cables, and network configuration for the networks that connect the nodes.
- Check the hubs, switches, or bridges in the networks that connect the nodes.
- Check for switch delays and proxy ARP issues with the help of the in-house networking team.
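When chasing intermittent drops like these, it helps to know how long an outage the cluster will tolerate before evicting a node (event 1135): the node is declared down after a threshold number of consecutive heartbeats go unanswered. A rough sketch, assuming the Windows Server 2012 R2 same-subnet defaults (SameSubnetDelay = 1000 ms between heartbeats, SameSubnetThreshold = 5 missed heartbeats); confirm the actual values on the cluster with `Get-Cluster | Format-List *SubnetDelay,*SubnetThreshold`.

```python
def max_tolerated_outage_ms(delay_ms, threshold):
    """Longest network interruption a node can survive before eviction:
    the cluster declares the node down after `threshold` consecutive
    heartbeats (sent every `delay_ms`) go unanswered."""
    return delay_ms * threshold

# Windows Server 2012 R2 same-subnet defaults (assumed here; confirm
# with Get-Cluster on the actual cluster).
same_subnet = max_tolerated_outage_ms(delay_ms=1000, threshold=5)   # 5 seconds
# Values are often raised on flaky networks, e.g. Threshold=10.
relaxed = max_tolerated_outage_ms(delay_ms=1000, threshold=10)      # 10 seconds
```

In other words, with the defaults a network blip of just over five seconds is enough to trigger the eviction seen in the logs; any packet-drop investigation with the networking team should look for interruptions at or above that scale.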