Troubleshooting Guide: Common Issues & Solutions

This guide collects the problems we actually hit during the migration, along with the fix that worked for each.

Boot Issues

Problem: Server Boots to Proxmox Instead of Talos ISO

Symptoms:

Root Cause:

Solutions:

1. Wipe Bootloader:

# SSH into Proxmox
ssh root@192.168.1.191

# Wipe the first 100MB of the disk
# (destructive: confirm /dev/sda is the intended disk first, e.g. with lsblk)
dd if=/dev/zero of=/dev/sda bs=1M count=100

# Reboot
reboot

2. Force Boot from Virtual Media:

3. Reset USB Status:

Problem: “No Bootable Device Found”

Symptoms:

Solution:

Talos Installation Issues

Problem: “specified install disk does not exist: /dev/sda”

Error:

error applying new configuration: rpc error: code = InvalidArgument 
desc = configuration validation failed: 
specified install disk does not exist: "/dev/sda"

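Root Cause:

The install disk is not enumerated as /dev/sda on this machine. When virtual media is attached it can take /dev/sda, which pushes the real disk to /dev/sdb.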

Solution:

# In controlplane.yaml
machine:
  install:
    disk: /dev/sdb  # Change from /dev/sda

Find Correct Disk:

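# Talos has no shell, so lsblk is not available on the node.
# Query the disks through the API while the node is booted from the ISO:
talosctl get disks --insecure --nodes <node-ip>
# Older talosctl releases use: talosctl disks --insecure --nodes <node-ip>
# Identify your disk (usually sdb if sda is virtual CD)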
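
Once the correct disk is known, the change can also be kept as a small patch file instead of a hand edit. The following is a minimal sketch, not the method we used: install_disk_patch is a hypothetical helper name and install-disk.yaml is an assumed file name.

```shell
#!/bin/sh
# install_disk_patch DISK: print a Talos machine-config patch that sets
# machine.install.disk. (Hypothetical helper, not part of talosctl.)
install_disk_patch() {
  disk="$1"
  case "$disk" in
    /dev/*) ;;  # accept only device paths
    *) echo "expected a /dev/... path, got '$disk'" >&2; return 1 ;;
  esac
  printf 'machine:\n  install:\n    disk: %s\n' "$disk"
}

# Example (not run here): write the patch to a file for later use.
# install_disk_patch /dev/sdb > install-disk.yaml
```

A patch file like this can be passed to talosctl gen config with --config-patch @install-disk.yaml, which keeps the disk choice out of the generated controlplane.yaml and easy to track in version control.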

Problem: Connection Refused After Installation

Symptoms:

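Root Cause:

Talos is still starting up, or Kubernetes has not been bootstrapped yet, so the API is not accepting connections.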

Solution:

# Wait for Talos to fully boot (2-3 minutes)
# Then bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.100

# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.100

# Verify
kubectl get nodes
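
Instead of waiting a fixed 2-3 minutes, the check can be polled until it succeeds. This is a minimal sketch under stated assumptions: retry is a hypothetical helper, and the attempt count and delay in the example are arbitrary values, not tuned for this hardware.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 1 if every try fails.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not run here): poll the cluster for up to 5 minutes after bootstrap.
# retry 30 10 kubectl get nodes
```

The helper only reports whether the command eventually succeeded, so a failure after the last attempt still needs the manual checks described in this section.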

Kubernetes Issues

Problem: kubeconfig Not Working

Symptoms:

The connection to the server localhost:8080 was refused

Root Cause:

kubectl found no kubeconfig, so it fell back to its default server address, localhost:8080.

Solution:

# Set TALOSCONFIG
export TALOSCONFIG=~/r430-migration/talos-config/talosconfig

# Force regenerate kubeconfig
talosctl -e 192.168.1.100 --nodes 192.168.1.100 kubeconfig --force

# Verify
kubectl get nodes
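
If the error persists, it helps to confirm which kubeconfig file kubectl will look at first. The sketch below is illustrative: active_kubeconfig is a hypothetical helper, and it reports only the first entry of KUBECONFIG even though kubectl merges every file in that list.

```shell
#!/bin/sh
# active_kubeconfig: print the first kubeconfig path kubectl would consult.
# kubectl reads the colon-separated KUBECONFIG list if it is set, otherwise
# it falls back to $HOME/.kube/config.
active_kubeconfig() {
  if [ -n "${KUBECONFIG:-}" ]; then
    printf '%s\n' "${KUBECONFIG%%:*}"
  else
    printf '%s\n' "$HOME/.kube/config"
  fi
}

# Example (not run here): check that the file exists before blaming the cluster.
# [ -f "$(active_kubeconfig)" ] || echo "no kubeconfig at $(active_kubeconfig)"
```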

Problem: Pods Stuck in Pending

Symptoms:

kubectl get pods
# NAME    READY   STATUS    RESTARTS
# my-pod  0/1     Pending   0

Diagnosis:

# Check why pending
kubectl describe pod my-pod

# Common causes:
# - No nodes available
# - PVC not bound
# - Resource constraints
# - PodSecurity violations

Solutions:

PVC Not Bound:

# Check PVC
kubectl get pvc
kubectl describe pvc my-pvc

# Check StorageClass
kubectl get storageclass
# Should have default StorageClass

PodSecurity Violations:

# Check namespace labels
kubectl get namespace my-namespace --show-labels

# Add privileged label
kubectl label namespace my-namespace \
  pod-security.kubernetes.io/enforce=privileged --overwrite

Problem: TLS Errors with kubectl logs

Symptoms:

Error from server: remote error: tls: internal error

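Root Cause:

The kubelet serving certificate signing requests (CSRs) are still pending, so the kubelet has no approved serving certificate for the API server to connect to.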

Solution:

# List pending CSRs
kubectl get csr

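# Approve pending kubelet-serving CSRs
# (the signer name is not part of the CSR name, so filter the table output rather than -o name)
kubectl get csr --no-headers | awk '/kubelet-serving/ && /Pending/ {print $1}' | xargs kubectl certificate approve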

# Verify
kubectl logs <pod-name>

Storage Issues

Problem: Longhorn Pods CrashLoopBackOff

Symptoms:

kubectl get pods -n longhorn-system
# longhorn-manager-xxxxx   0/1   CrashLoopBackOff

Root Cause:

Longhorn needs iscsiadm on the host, and it is not present on the Talos node.

Solution:

# Check for iscsiadm
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /usr/sbin/iscsiadm
# Error: no such file or directory

# Switch to Local Path Provisioner instead
kubectl delete namespace longhorn-system
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml

Problem: PVC Stuck in Pending

Symptoms:

kubectl get pvc
# NAME    STATUS    VOLUME
# my-pvc  Pending

Diagnosis:

# Check provisioner pod
kubectl get pods -n local-path-storage

# Check logs
kubectl logs -n local-path-storage -l app=local-path-provisioner

# Check StorageClass
kubectl get storageclass

Solution:

# Label namespace for privileged access
kubectl label namespace local-path-storage \
  pod-security.kubernetes.io/enforce=privileged --overwrite

# Delete and recreate PVC
kubectl delete pvc my-pvc
kubectl apply -f pvc.yaml

Networking Issues

Problem: MetalLB No External IP

Symptoms:

kubectl get svc
# NAME    TYPE           EXTERNAL-IP
# my-svc  LoadBalancer   <pending>

Diagnosis:

# Check MetalLB pods
kubectl get pods -n metallb-system

# Check IP pool
kubectl get ipaddresspool -n metallb-system

# Check L2 advertisement
kubectl get l2advertisement -n metallb-system

Solution:

# Verify IP pool configuration
kubectl describe ipaddresspool -n metallb-system default-pool

# IPs must be:
# - On same subnet as nodes
# - Not in DHCP range
# - Not assigned to other devices
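
The first condition, that pool addresses sit on the same subnet as the nodes, can be checked with a little shell arithmetic. This is a minimal sketch: ip_to_int and in_subnet are hypothetical helpers, and the 192.168.1.0/24 network and the .240 address in the example are assumptions based on the node address used in this guide.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# in_subnet IP NETWORK/BITS: succeed if IP falls inside the given network.
in_subnet() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example (not run here): check a candidate pool address against the node subnet.
# in_subnet 192.168.1.240 192.168.1.0/24 && echo "on the node subnet"
```

This covers only the subnet condition. Whether an address is inside the DHCP range or already in use still has to be checked on the router.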

Problem: Traefik Not Routing

Symptoms:

Diagnosis:

# Check Ingress
kubectl get ingress
kubectl describe ingress my-ingress

# Check Traefik pods
kubectl get pods -n traefik

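# Check Traefik logs
# (the label depends on how Traefik was installed; the Helm chart uses app.kubernetes.io/name=traefik)
kubectl logs -n traefik -l app=traefik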

Solution:

# Verify IngressClass
kubectl get ingressclass

# Ensure Ingress uses correct class
# ingressClassName: traefik

Application Issues

Problem: Docker Image Architecture Mismatch

Symptoms:

exec ./app: exec format error

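Root Cause:

The image was built for a different CPU architecture than the node runs (for example, an arm64 image on an amd64 server).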

Solution:

# In Dockerfile
FROM golang:1.21-alpine AS builder
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build

# In build script
docker build --platform linux/amd64 -t my-image:latest .
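
To spot the mismatch before deploying, the machine type of the build host can be mapped to the Docker platform string and compared with what the node needs. A minimal sketch, assuming only amd64 and arm64 matter here; platform_for is a hypothetical helper name.

```shell
#!/bin/sh
# platform_for MACHINE: map a `uname -m` value to a Docker platform string.
platform_for() {
  case "$1" in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    *) echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# Example (not run here): warn when the build host differs from the amd64 node.
# [ "$(platform_for "$(uname -m)")" = linux/amd64 ] || echo "pass --platform linux/amd64"
```

For an image that is already built, docker image inspect --format '{{.Os}}/{{.Architecture}}' my-image:latest prints the platform it targets.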

Problem: Image Pull Errors

Symptoms:

ErrImagePull
pull access denied
http: server gave HTTP response to HTTPS client

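Root Cause:

The registry at 192.168.1.100:30500 serves plain HTTP, but the node tries to pull from it over HTTPS.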

Solution:

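# patch-registry.yaml: route pulls for the registry to its plain-HTTP endpoint
# (Talos has no "protocol" key; the http:// scheme on the mirror endpoint selects HTTP)
machine:
  registries:
    mirrors:
      192.168.1.100:30500:
        endpoints:
          - http://192.168.1.100:30500

# Apply patch
talosctl -e 192.168.1.100 --nodes 192.168.1.100 patch machineconfig --patch @patch-registry.yaml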

Problem: NPM 404 Errors

Symptoms:

Diagnosis:

# Check NPM config
kubectl exec -n npm <pod-name> -- cat /data/nginx/proxy_host/1.conf

# Test service from NPM pod
kubectl exec -n npm <pod-name> -- curl http://service.namespace.svc.cluster.local:80

Solution:

KubeVirt Issues

Problem: VM Won’t Start

Symptoms:

kubectl get vmi
# NAME        AGE   PHASE
# my-vm       5m    Pending

Diagnosis:

# Check VMI status
kubectl describe vmi my-vm

# Check virt-handler logs
kubectl logs -n kubevirt-system -l kubevirt.io=virt-handler

# Check hardware virtualization
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo | grep vmx
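
The grep above matches only the Intel flag. A slightly broader check that also accepts the AMD flag can be sketched as follows; has_virt_flags is a hypothetical helper that reads cpuinfo text on standard input.

```shell
#!/bin/sh
# has_virt_flags: succeed if the cpuinfo text on stdin lists vmx (Intel VT-x)
# or svm (AMD-V) as a whole word.
has_virt_flags() {
  grep -Eq '(^|[[:space:]])(vmx|svm)([[:space:]]|$)'
}

# Example (not run here): feed it the node's cpuinfo.
# talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo | has_virt_flags \
#   || echo "hardware virtualization flags not found; check the BIOS settings"
```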

Solution:

Problem: VM Script Detection False Positive

Symptoms:

Root Cause:

Solution:

General Debugging Commands

Cluster Health

# Node status
kubectl get nodes -o wide

# All pods
kubectl get pods -A

# Cluster info
kubectl cluster-info

# Events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

Talos Health

# Talos health check
talosctl -e 192.168.1.100 --nodes 192.168.1.100 health

# Services
talosctl -e 192.168.1.100 --nodes 192.168.1.100 service list

# Logs
talosctl -e 192.168.1.100 --nodes 192.168.1.100 logs kubelet

Resource Usage

# Node resources
kubectl top nodes

# Pod resources
kubectl top pods -A

# Disk usage (Talos): lists mounted filesystems with size and available space
talosctl -e 192.168.1.100 --nodes 192.168.1.100 mounts

Prevention Tips

  1. Always Back Up - Before major changes
  2. Test in Stages - Don’t change everything at once
  3. Check Logs First - Most issues are visible in logs
  4. Verify Prerequisites - Hardware, BIOS, network
  5. Document Changes - Keep track of what you modify
  6. Use Git - Version control for configs
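
The backup tip can be as simple as a timestamped archive of the config directory taken before each change. A minimal sketch: backup_dir is a hypothetical helper, and the destination path in the example is an assumption.

```shell
#!/bin/sh
# backup_dir SRC DEST: archive directory SRC into DEST as
# <name>-<YYYYmmdd-HHMMSS>.tar.gz and print the archive path.
backup_dir() {
  src="$1"; dest="$2"
  mkdir -p "$dest" || return 1
  out="$dest/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && printf '%s\n' "$out"
}

# Example (not run here): snapshot the Talos configs before patching them.
# backup_dir ~/r430-migration/talos-config ~/r430-migration/backups
```

The archive contains talosconfig and the machine configs, which hold cluster secrets, so it should be stored with the same care as the originals.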

Getting Help

Resources:

Community:

