PROXMOX - Troubleshoot Cluster
Cluster Communication Issues
Cluster communication issues often occur when nodes are unable to properly sync with each other. These issues can prevent quorum, affect high availability, or result in split-brain scenarios.
Checking Cluster Status
To check the overall status of the cluster, use the `pvecm status` command. This will show whether all nodes are communicating properly and whether quorum is reached.
# Check the status of the cluster
pvecm status
If there are issues with the cluster communication, the output might show one or more nodes as `Offline` or `No quorum`. If the quorum is not met, consider the following solutions.
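If quorum is lost because a majority of nodes are down, a single surviving node can be told to operate with reduced expected votes via `pvecm expected 1` (use with care, as it bypasses the usual split-brain protection). For monitoring, the `Quorate:` line of `pvecm status` can be parsed; the sketch below runs against embedded sample output, since it assumes no live cluster is available:

```shell
#!/bin/sh
# Sketch: check quorum by parsing the "Quorate:" line of `pvecm status`.
# Sample output is embedded here; on a real node, replace the heredoc
# with: status_output=$(pvecm status)
status_output=$(cat <<'EOF'
Cluster information
-------------------
Name:             mycluster
Quorum information
------------------
Quorate:          Yes
EOF
)
quorate=$(printf '%s\n' "$status_output" | awk '$1 == "Quorate:" {print $2}')
echo "quorate: $quorate"
```

The same pattern works for any other field of the status output.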
Verifying Corosync Status
Corosync is the tool that manages communication between the nodes. If Corosync is not working properly, nodes might fail to join the cluster or lose communication.
# Check if Corosync is running
systemctl status corosync
# Check the logs for Corosync errors
journalctl -u corosync
If Corosync is down or showing errors, restart it and check the logs for any specific issues.
# Restart Corosync service
systemctl restart corosync
If the service fails to restart, examine the `/etc/corosync/corosync.conf` configuration for errors or misconfigurations.
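For reference, a minimal two-node `corosync.conf` looks roughly like the sketch below (cluster name, node names, and addresses are placeholders). Common problems are mismatched `ring0_addr` entries, a `config_version` that is not identical across nodes, or a `nodelist` that does not match the actual cluster members:

```
totem {
  version: 2
  cluster_name: mycluster      # placeholder
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  interface {
    linknumber: 0
  }
}

nodelist {
  node {
    name: node1                # placeholder
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11   # placeholder address
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.12
  }
}

quorum {
  provider: corosync_votequorum
}
```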
Reviewing Node Logs
Check the system logs on each node for any errors related to cluster communication. The logs might provide insights into network issues, Corosync failures, or other underlying problems.
# Check system logs for errors
journalctl -xe
Look for entries related to Corosync, network interfaces, or other services that might affect cluster communication.
Network Issues
Network misconfigurations or issues are often the root cause of communication problems in a cluster. Ensure that all nodes are using static IP addresses within the same subnet, and verify that the firewall is configured to allow Corosync and Proxmox ports.
Check that nodes can ping each other over the network:
# Ping Node 1 from Node 2
ping node1
# Ping Node 2 from Node 3
ping node2
If the ping test fails, check your network interface configuration on all nodes.
# Verify network interfaces
ip a
Ensure that the network interfaces are properly configured, up, and connected.
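Name-resolution problems between nodes can masquerade as network failures. A small sketch like the one below (with `NODES` as a placeholder list of your node hostnames) checks that each name resolves before you start debugging lower layers:

```shell
#!/bin/sh
# Sketch: confirm every cluster node hostname resolves from this node.
# NODES is a placeholder -- substitute your real node hostnames.
NODES="localhost"
for node in $NODES; do
  if getent hosts "$node" >/dev/null 2>&1; then
    echo "$node: resolves"
  else
    echo "$node: does NOT resolve -- check /etc/hosts or DNS"
  fi
done
```

Proxmox clusters commonly pin node names in `/etc/hosts`, so a stale entry there is a frequent culprit.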
Firewall Issues
Sometimes, firewall rules can block communication between nodes, preventing them from syncing. Verify that required ports are open.
Proxmox and Corosync typically use the following ports:
- Corosync: UDP 5405-5412 (Corosync uses UDP, not TCP)
- Proxmox Web UI: TCP 8006
- SSH: TCP 22
Ensure these ports are open on all nodes. If using `ufw` (Uncomplicated Firewall), you can check the status with:
# Check UFW status
ufw status
# Allow Corosync and Proxmox ports if they are blocked
ufw allow 5405:5412/udp
ufw allow 8006/tcp
ufw allow 22/tcp
If using `iptables`, ensure the rules allow communication on the required ports.
# View iptables rules
iptables -L
Storage Issues
Storage problems can arise in a Proxmox cluster when nodes are unable to access shared storage or if the Ceph storage cluster is not functioning correctly.
NFS or Shared Storage Issues
If your cluster relies on NFS or any other shared storage system (such as iSCSI or Ceph), issues with storage access can cause VM performance degradation or failures.
First, check if the storage is mounted on all nodes:
# Check if the NFS share is mounted
df -h
If the NFS share is not mounted on a node, try manually mounting it:
# Mount NFS share manually
mount -t nfs node1:/mnt/nfs_share /mnt/nfs_share
If there are errors mounting the NFS share, verify that the NFS server is running on the host that provides the share.
# Check NFS status
systemctl status nfs-kernel-server
# Check the NFS exports
exportfs -v
If NFS services are not running, start them:
# Start the NFS server
systemctl start nfs-kernel-server
If you’re using Ceph or another distributed storage system, check the status of the Ceph services:
# Check Ceph status
ceph -s
Look for any errors or warnings that may indicate storage problems.
Disk Space and I/O Errors
Ensure that your nodes have enough disk space for operation. A lack of disk space can cause problems with VM operations or even Proxmox cluster communication.
# Check disk space
df -h
# Check for I/O errors
dmesg | grep -i error
If disk space is full or I/O errors are reported, address the underlying issue (e.g., increase storage, clean up unused data, fix hardware errors).
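A check like the following can be dropped into cron to warn before the root filesystem fills up; the 90% threshold is an arbitrary example:

```shell
#!/bin/sh
# Sketch: warn when the root filesystem exceeds a usage threshold.
THRESHOLD=90   # percent; example value
usage=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is ${usage}% full"
else
  echo "OK: / is ${usage}% full"
fi
```

`df -P` forces POSIX single-line output so the `awk` field positions are stable across filesystems with long device names.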
Disk Management with ZFS
ZFS is a popular filesystem and volume manager that can be used with Proxmox. If you are using ZFS and experiencing issues, follow these steps to troubleshoot.
First, check the status of your ZFS pools:
# Check ZFS pool status
zpool status
If any pool is showing errors, such as "DEGRADED" or "FAULTED," examine the specific disk for failures.
# Check detailed ZFS disk errors
zpool status -v
If there are disk failures, attempt to replace the disk and then resilver the pool:
# Replace failed disk
zpool replace <pool_name> <old_disk> <new_disk>
# Monitor resilvering progress (resilvering starts automatically)
zpool status <pool_name>
Ensure that the pool is in an "ONLINE" state after the operation.
If there are issues with ZFS performance or disk failures are recurring, check the underlying hardware for issues.
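Pool health checks can also be scripted by parsing the `state:` line of `zpool status`. The sketch below uses embedded sample output so the parsing can be shown without a live pool; on a real node you would substitute the output of `zpool status <pool_name>`:

```shell
#!/bin/sh
# Sketch: extract the pool state from `zpool status` output.
# Sample output is embedded for illustration; replace the heredoc
# with real command output on an actual node.
zpool_output=$(cat <<'EOF'
  pool: tank
 state: ONLINE
  scan: resilvered 1.21G in 00:03:11 with 0 errors
EOF
)
state=$(printf '%s\n' "$zpool_output" | awk '$1 == "state:" {print $2}')
echo "pool state: $state"   # prints "pool state: ONLINE" for this sample
```

Anything other than `ONLINE` (e.g. `DEGRADED`, `FAULTED`) is worth alerting on.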
Troubleshooting ZFS Performance Issues
ZFS is a resource-intensive file system, and performance can degrade if the hardware is not properly configured. Common performance issues include high CPU usage, disk latency, and memory bottlenecks.
Use the following commands to check the ZFS performance:
# Check ZFS ARC (Adaptive Replacement Cache) usage
arcstat 1
# Check ZFS disk latency
zpool iostat -v
High latency on ZFS disks may indicate disk performance issues or misconfiguration. You can also check CPU usage and RAM consumption:
# Check CPU usage
top
# Check memory usage
free -h
If you notice high memory consumption, consider increasing system RAM or adjusting ZFS ARC parameters.
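On Linux, the ARC maximum can be capped via a module parameter so ZFS leaves more RAM for VMs; the 8 GiB value below is only an example, and the change takes effect after a reboot (or by writing the value to `/sys/module/zfs/parameters/zfs_arc_max` at runtime):

```
# /etc/modprobe.d/zfs.conf -- cap the ZFS ARC at 8 GiB (value in bytes; example only)
options zfs zfs_arc_max=8589934592
```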
Proxmox Backup Issues
Proxmox Backup is used for managing backups of virtual machines. If backups are failing or not being stored correctly, troubleshoot the following:
First, check if the Proxmox Backup Server is accessible:
# Check the backup server status
systemctl status proxmox-backup
# Check backup logs for errors
journalctl -u proxmox-backup
If the backup service is not running, restart it:
# Restart the backup service
systemctl restart proxmox-backup
Ensure that the backup server is properly configured in the Proxmox GUI and that there is sufficient space for backups.
Check the storage location for the backups:
# Check backup storage usage
df -h
If the storage is full, free up space or adjust the retention policy for backups.
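Retention can be set per storage with `prune-backups` in `/etc/pve/storage.cfg`; the excerpt below is a sketch with placeholder storage name, server address, and keep counts:

```
# /etc/pve/storage.cfg (excerpt) -- placeholder names and values
pbs: backup-store
        datastore store1
        server 192.168.1.50
        prune-backups keep-daily=7,keep-weekly=4,keep-monthly=6
```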
If backup jobs are failing, verify the backup configuration on the Proxmox node:
# Check backup job configuration on the Proxmox node
cat /etc/pve/jobs.cfg
Ensure that all paths and configurations are correct, and verify network connectivity if using remote backup storage.
Disk Configuration Conflicts
If you experience issues related to disk recognition or configuration conflicts, ensure that all disks are properly recognized by Proxmox. Use the following commands to check disk configuration:
# List all block devices
lsblk
# Check disk partitions
fdisk -l
Ensure that disks are correctly formatted and that there are no conflicting partition tables or missing devices.
VM Issues
Sometimes, issues arise specifically with virtual machines running on the Proxmox cluster. Common issues include VM not starting, performance degradation, or network problems.
Checking VM Logs
For VM-specific problems, check the task logs. Proxmox stores task logs under `/var/log/pve/tasks/`, indexed by task UPID rather than by VM name; the `index` and `active` files list recent tasks. Look for specific error messages that can help diagnose the issue.
# List recent tasks and locate the relevant log file
cat /var/log/pve/tasks/index
Virtual Machine Network Issues
If a VM cannot access the network, check if the virtual network interface is properly configured. Verify that the VM’s virtual NIC is connected to the correct bridge.
# Check VM network configuration
qm config 100
Make sure the VM's NIC is connected to the correct bridge (e.g., `vmbr0`).
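If the NIC points at the wrong bridge, it can be repointed with `qm set`; VM ID 100, the `virtio` model, and bridge `vmbr0` below are example values:

```shell
# Reattach the VM's first NIC to bridge vmbr0 (example values)
qm set 100 --net0 virtio,bridge=vmbr0
```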
If the VM is using a static IP, ensure that the subnet and gateway are correctly configured. For dynamic IP assignment, ensure the DHCP server is functioning.
VM Storage Issues
If a VM is unable to access its storage or if it cannot start due to storage issues, ensure that the VM’s storage is available and mounted.
# Check if the VM disk image is accessible (path depends on your storage configuration)
ls /mnt/pve/vm-100-disk-1.raw
If the disk file is missing or corrupted, restore it from a backup or troubleshoot the underlying storage issue.
VM Performance Issues
If a VM is experiencing performance degradation, check the resource usage for both the VM and the host node.
Check the CPU, RAM, and disk usage of the VM:
# Check VM resource usage
qm status 100 --verbose
Check the load on the host node:
# Check host node CPU load
top
# Check host node memory usage
free -h
Also, verify that the VM is not over-committing CPU or memory resources. You can adjust the resource allocation (CPU and RAM) for the VM via the Proxmox web interface or with the `qm` command.
# Adjust CPU cores and memory allocation (memory in MiB)
qm set 100 --cores 4 --memory 4096
If the problem persists, consider checking the disk I/O performance of the host node, as storage bottlenecks can also affect VM performance.
# Check disk I/O stats on the node
iostat -x 1
High Availability (HA) Issues
If HA is enabled, ensure that resources (e.g., VMs) are correctly configured and can failover to another node.
Checking HA Status
Check the status of HA resources and see if any resource is stuck or in a failed state.
# Check HA status
ha-manager status
If any resources are in an `error` or `failed` state, recover them by disabling and re-enabling the resource:
# Recover a failed HA resource by disabling and re-enabling it
ha-manager set <resource_name> --state disabled
ha-manager set <resource_name> --state started
Check the HA logs for more information:
# Check HA logs
journalctl -u pve-ha-lrm -u pve-ha-crm
Verifying HA Resource Configuration
Ensure that all HA resources are correctly configured and have appropriate constraints (e.g., node preference, failover policies).
# Check HA configuration
cat /etc/pve/ha/resources.cfg
Check for any configuration mismatches or errors that could prevent proper HA operation.
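For reference, a VM is registered as an HA resource with `ha-manager add`; the VM ID and limits below are example values:

```shell
# Register VM 100 as an HA resource (example values)
ha-manager add vm:100 --state started --max_restart 2 --max_relocate 1
# Pin it to an HA group, if one is defined
ha-manager set vm:100 --group <group_name>
```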
Disk Management and Troubleshooting ZFS
If you are using ZFS for storage in Proxmox, disk management issues can arise, especially when dealing with disk failures or degraded pools. ZFS provides robust management, but it requires careful monitoring.
Checking ZFS Pool Health
Use `zpool status` to check the health of your ZFS pool. Any errors or warnings related to disk failure will be reported here.
# Check ZFS pool status
zpool status
A `DEGRADED` state means that one or more disks in the pool have failed or are no longer accessible. If your pool is in a `DEGRADED` state, identify the affected disk using `zpool status -v`, and take action to replace the failed disk.
# Check detailed ZFS pool status
zpool status -v
If a disk is faulty, replace it with a new one, and let ZFS resilver the data:
# Replace failed disk in ZFS pool
zpool replace <pool_name> <failed_disk> <new_disk>
# Monitor resilvering progress
zpool status <pool_name>
Once the pool is fully resilvered, ensure it returns to an `ONLINE` state.
ZFS Performance Issues
If you notice ZFS performance degradation, check the system’s resources (CPU, RAM, and disk). ZFS is memory-intensive, and insufficient resources can cause slowdowns. Use `arcstat` to monitor ZFS ARC (Adaptive Replacement Cache) usage:
# Monitor ZFS ARC usage
arcstat 1
Check the system's CPU and memory usage:
# Check CPU usage
top
# Check memory usage
free -h
You can also check disk I/O for latency:
# Check disk I/O stats
zpool iostat -v
If necessary, consider adjusting ZFS settings such as increasing system RAM, optimizing ARC size, or tweaking other ZFS tunables for performance.
ZFS Dataset Issues
Sometimes, specific datasets or volumes may experience issues. You can verify the dataset status with:
# Check ZFS dataset status
zfs list
If any dataset is reporting errors or is not mounted, try unmounting and remounting it (export and import operate on whole pools via `zpool`, not on individual datasets):
# Unmount the ZFS dataset
zfs unmount <dataset_name>
# Mount the ZFS dataset again
zfs mount <dataset_name>
Ensure that the dataset is mounted and accessible.
Proxmox Backup Interaction
Proxmox Backup is essential for backing up and restoring virtual machines in the Proxmox environment. Sometimes issues arise related to backup configurations, restoration, or access.
Checking Proxmox Backup Server Status
Ensure that the Proxmox Backup Server is running. Use the following command to check the status of the backup service:
# Check Proxmox Backup Server status
systemctl status proxmox-backup
If the service is not running, attempt to start it:
# Restart Proxmox Backup Server
systemctl restart proxmox-backup
Check logs for any issues:
# Check Proxmox Backup logs
journalctl -u proxmox-backup
Look for error messages related to storage or access issues.
Verifying Backup Storage Configuration
Ensure that your backup storage is correctly configured and accessible from all nodes in the cluster. On the backup server, check the datastore configuration in `/etc/proxmox-backup/datastore.cfg` and ensure that all paths are correct.
# Check datastore configuration
cat /etc/proxmox-backup/datastore.cfg
Verify the disk space on the backup storage:
# Check disk space on backup storage
df -h
Ensure there is sufficient free space on the backup storage, as Proxmox Backup may fail if the storage is full.
Troubleshooting Backup Failures
If backups are failing, check the logs for errors related to permission issues, storage access, or network issues. You can also check the status of backup jobs:
# List configured backup jobs
cat /etc/pve/jobs.cfg
Make sure that the backup job configuration is correct. If backups are scheduled, ensure that the backup server is accessible and there are no connectivity issues.
# Check backup job configuration
cat /etc/pve/storage.cfg
Ensure that the correct backup type (e.g., `proxmox-backup` or `nfs`) is configured and that the backup server is reachable.
Useful Links
- [Proxmox Documentation - Troubleshooting](https://pve.proxmox.com/pve-docs/)
- [Proxmox Support Forum](https://forum.proxmox.com/)
- [Proxmox Cluster Documentation](https://pve.proxmox.com/wiki/Cluster_Manager)
- [Corosync Troubleshooting](https://corosync.github.io/corosync/)
- [Ceph Documentation](https://docs.ceph.com/en/latest/)
- [ZFS Documentation](https://openzfs.github.io/)
- [NFS Troubleshooting](https://wiki.archlinux.org/title/NFS)
- [Proxmox High Availability (HA)](https://pve.proxmox.com/wiki/High_Availability)
- [Proxmox Backup Documentation](https://pbs.proxmox.com/docs/)
- [ZFS Performance Tuning](https://docs.oracle.com/cd/E23824_01/html/819-5461/zfsperf-1.html)
