Troubleshooting Guide: Common Issues & Solutions

The problems we actually hit during the migration, and how we solved each one.

Boot Issues

Problem: Server Boots to Proxmox Instead of Talos ISO

Symptoms:

After mounting ISO via iDRAC, server still boots to Proxmox
GRUB menu appears instead of Talos installer

Root Cause:

Old Proxmox bootloader still on disk
BIOS boot order not updated
Virtual media not properly mapped

Solutions:

1. Wipe Bootloader:

# SSH into Proxmox
ssh root@192.168.1.191

# Wipe first 100MB of disk (destroys the bootloader and partition table;
# confirm the target device with lsblk first)
dd if=/dev/zero of=/dev/sda bs=1M count=100

# Reboot
reboot

2. Force Boot from Virtual Media:

Open the iDRAC Virtual Console and press F11 during POST to enter Boot Manager
Select “Virtual CD/DVD” explicitly
Or use iDRAC web interface → Virtual Console → Boot from ISO

3. Reset USB Status:

iDRAC → Virtual Media → Reset USB Status
Remap ISO as CD/DVD (not removable disk)

Problem: “No Bootable Device Found”

Symptoms:

Server tries to boot from virtual media but fails
Error: “No bootable device found”

Solution:

Verify ISO is properly mapped in iDRAC
Try remapping as “removable disk” instead of CD/DVD
Check ISO file integrity (re-download if needed)
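
The integrity check can be scripted. The sketch below is a minimal helper, assuming GNU coreutils `sha256sum` is available on the workstation (on macOS, `shasum -a 256` is the usual substitute). The function name `verify_sha256` is our own, and the expected digest has to come from wherever the ISO was downloaded.

```shell
# Compare a file's SHA-256 digest with an expected value.
# Usage: verify_sha256 <file> <expected-hex-digest>
verify_sha256() {
  iso_file=$1
  expected=$2
  # sha256sum prints "<digest>  <filename>"; keep only the digest
  actual=$(sha256sum "$iso_file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
  else
    echo "MISMATCH: got $actual" >&2
    return 1
  fi
}

# Example (file name and digest are placeholders):
# verify_sha256 talos.iso "<digest published with the ISO>"
```

A mismatch means the download is corrupt or incomplete, so re-download before remapping the ISO in iDRAC.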

Talos Installation Issues

Problem: “specified install disk does not exist: /dev/sda”

Error:

error applying new configuration: rpc error: code = InvalidArgument 
desc = configuration validation failed: 
specified install disk does not exist: "/dev/sda"

Root Cause:

The iDRAC virtual media device is enumerated first and takes /dev/sda
The physical install disk therefore shows up as /dev/sdb or another device

Solution:

# In controlplane.yaml
machine:
  install:
    disk: /dev/sdb  # Change from /dev/sda

Find Correct Disk:

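# Talos has no shell, so lsblk is not available on the node.
# Query the disks from your workstation while the node is in maintenance mode
# (use the IP shown on the console):
talosctl get disks --insecure --nodes 192.168.1.100
# On older Talos releases: talosctl disks --insecure --nodes 192.168.1.100
# Identify your disk (often sdb if sda is the virtual media device)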

Problem: Connection Refused After Installation

Symptoms:

Talos installed successfully
Server reboots
talosctl connection refused
Kubernetes API not accessible

Root Cause:

Kubernetes has not been bootstrapped yet, so the API server is not running
This is expected after a fresh installation

Solution:

# Wait for Talos to fully boot (2-3 minutes)
# Then bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.100

# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.100

# Verify
kubectl get nodes
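
Rather than waiting a fixed 2-3 minutes, the wait can be turned into a polling loop. This is a minimal sketch; `wait_for` is a hypothetical helper name, and using `talosctl version` as the readiness probe is our assumption (any command that fails while the node is unreachable would do).

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <max-attempts> <delay-seconds> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Discard the command's output; only its exit status matters here
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example: poll for up to 3 minutes (36 attempts, 5 s apart), then bootstrap
# wait_for 36 5 talosctl -e 192.168.1.100 --nodes 192.168.1.100 version \
#   && talosctl -e 192.168.1.100 --nodes 192.168.1.100 bootstrap
```

The same helper works for the later verification step, for example `wait_for 36 5 kubectl get nodes`.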

Kubernetes Issues

Problem: kubeconfig Not Working

Symptoms:

The connection to the server localhost:8080 was refused

Root Cause:

kubectl found no kubeconfig and fell back to its default, localhost:8080
Or a kubeconfig exists but the wrong context is selected

Solution:

# Set TALOSCONFIG
export TALOSCONFIG=~/r430-migration/talos-config/talosconfig

# Force regenerate kubeconfig
talosctl -e 192.168.1.100 --nodes 192.168.1.100 kubeconfig --force

# Verify
kubectl get nodes
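
Before regenerating anything, it can help to confirm whether kubectl has a file to read at all. The sketch below mirrors kubectl's lookup order as we understand it: `$KUBECONFIG` (a colon-separated list) when set, otherwise `~/.kube/config`. `find_kubeconfig` is a hypothetical helper, and it only reports the first existing file, whereas kubectl merges every file in the list.

```shell
# Print the first kubeconfig file that exists, or fail if there is none.
find_kubeconfig() {
  paths=${KUBECONFIG:-$HOME/.kube/config}
  old_ifs=$IFS
  IFS=:
  # Split the list on ":" and stop at the first regular file
  for p in $paths; do
    if [ -f "$p" ]; then
      IFS=$old_ifs
      echo "$p"
      return 0
    fi
  done
  IFS=$old_ifs
  echo "no kubeconfig found in: $paths" >&2
  return 1
}

# Example:
# find_kubeconfig || echo "run: talosctl kubeconfig --force"
```

If the function fails, the localhost:8080 error is explained and regenerating the kubeconfig is the right fix; if it succeeds, check the selected context with `kubectl config current-context`.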

Problem: Pods Stuck in Pending

Symptoms:

kubectl get pods
# NAME    READY   STATUS    RESTARTS
# my-pod  0/1     Pending   0

Diagnosis:

# Check why pending
kubectl describe pod my-pod

# Common causes:
# - No nodes available
# - PVC not bound
# - Resource constraints
# - PodSecurity violations
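
The scheduler's event message usually says which of these causes applies. The sketch below maps a message to a rough category. `classify_pending` is a hypothetical helper, and the matched substrings are assumptions based on common scheduler wording, which varies between Kubernetes versions.

```shell
# Map a FailedScheduling message to a rough cause.
# Usage: classify_pending "<message>"
classify_pending() {
  case $1 in
    *"unbound immediate PersistentVolumeClaims"*) echo "pvc-not-bound" ;;
    *Insufficient*)                               echo "resource-constraints" ;;
    *"nodes are available"*)                      echo "no-suitable-node" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Example: feed it the scheduling events recorded for the pod
# msg=$(kubectl get events \
#   --field-selector involvedObject.name=my-pod,reason=FailedScheduling \
#   -o jsonpath='{.items[*].message}')
# classify_pending "$msg"
```

Treat the result as a pointer for where to look next, not as a diagnosis; `kubectl describe pod` remains the authoritative view.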

Solutions:

PVC Not Bound:

# Check PVC
kubectl get pvc
kubectl describe pvc my-pvc

# Check StorageClass
kubectl get storageclass
# Expect one StorageClass marked (default)

PodSecurity Violations:

# Check namespace labels
kubectl get namespace my-namespace --show-labels

# Add privileged label (this turns off Pod Security restrictions for the namespace)
kubectl label namespace my-namespace \
  pod-security.kubernetes.io/enforce=privileged --overwrite

Problem: TLS Errors with kubectl logs

Symptoms:

Error from server: remote error: tls: internal error

Root Cause:

Kubelet server certificates not approved
CSR (Certificate Signing Request) pending

Solution:

# List pending CSRs
kubectl get csr

# Approve kubelet-serving CSRs
# (the signer name only appears in the default table output, not in -o name)
kubectl get csr --no-headers | awk '/kubelet-serving/ {print $1}' | xargs kubectl certificate approve

# Verify
kubectl logs <pod-name>
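
To approve only requests that are still pending, the table output can be filtered more strictly. This is a sketch that assumes the default column layout of `kubectl get csr` (NAME first, CONDITION last); `pending_kubelet_csrs` is a hypothetical helper name.

```shell
# Read `kubectl get csr --no-headers` output on stdin and print the names of
# kubelet-serving requests whose last column (CONDITION) is exactly "Pending".
pending_kubelet_csrs() {
  awk '$NF == "Pending" && /kubernetes\.io\/kubelet-serving/ { print $1 }'
}

# Example:
# kubectl get csr --no-headers | pending_kubelet_csrs | xargs kubectl certificate approve
```

Review the list before approving it: approving a serving certificate vouches for the node that requested it.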

Storage Issues

Problem: Longhorn Pods CrashLoopBackOff

Symptoms:

kubectl get pods -n longhorn-system
# longhorn-manager-xxxxx   0/1   CrashLoopBackOff

Root Cause:

Missing iSCSI tools (open-iscsi)
Talos doesn’t include iSCSI by default (it is packaged separately as the iscsi-tools system extension)

Solution:

# Check for iscsiadm
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /usr/sbin/iscsiadm
# Error: no such file or directory

# Switch to Local Path Provisioner instead
kubectl delete namespace longhorn-system
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml

Problem: PVC Stuck in Pending

Symptoms:

kubectl get pvc
# NAME    STATUS    VOLUME
# my-pvc  Pending

Diagnosis:

# Check provisioner pod
kubectl get pods -n local-path-storage

# Check logs
kubectl logs -n local-path-storage -l app=local-path-provisioner

# Check StorageClass
kubectl get storageclass

Solution:

# Label namespace for privileged access
kubectl label namespace local-path-storage \
  pod-security.kubernetes.io/enforce=privileged --overwrite

# Delete and recreate PVC
kubectl delete pvc my-pvc
kubectl apply -f pvc.yaml

Networking Issues

Problem: MetalLB No External IP

Symptoms:

kubectl get svc
# NAME    TYPE           EXTERNAL-IP
# my-svc  LoadBalancer   <pending>

Diagnosis:

# Check MetalLB pods
kubectl get pods -n metallb-system

# Check IP pool
kubectl get ipaddresspool -n metallb-system

# Check L2 advertisement
kubectl get l2advertisement -n metallb-system

Solution:

# Verify IP pool configuration
kubectl describe ipaddresspool -n metallb-system default-pool

# IPs must be:
# - On same subnet as nodes
# - Not in DHCP range
# - Not assigned to other devices
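
The "not in DHCP range" rule is easy to check mechanically. The sketch below converts IPv4 addresses to integers and tests whether one falls inside a range. The helper names and the DHCP range in the example are hypothetical; substitute the range configured on your router.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  # Split the address into its four octets ($1..$4)
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if <ip> lies within [<start>, <end>] inclusive.
# Usage: ip_in_range <ip> <start> <end>
ip_in_range() {
  ip=$(ip_to_int "$1")
  lo=$(ip_to_int "$2")
  hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

# Example: does a pool address collide with an assumed DHCP range?
# if ip_in_range 192.168.1.200 192.168.1.50 192.168.1.150; then
#   echo "pool address overlaps the DHCP range"
# fi
```

The same function covers the subnet rule if you pass the first and last host address of the node subnet as the range.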

Problem: Traefik Not Routing

Symptoms:

Ingress created but not accessible
404 errors

Diagnosis:

# Check Ingress
kubectl get ingress
kubectl describe ingress my-ingress

# Check Traefik pods
kubectl get pods -n traefik

# Check Traefik logs
kubectl logs -n traefik -l app=traefik

Solution:

# Verify IngressClass
kubectl get ingressclass

# Ensure Ingress uses correct class
# ingressClassName: traefik

Application Issues

Problem: Docker Image Architecture Mismatch

Symptoms:

exec ./app: exec format error

Root Cause:

Image built on ARM64 (Apple Silicon)
Server is AMD64

Solution:

# In Dockerfile
FROM golang:1.21-alpine AS builder
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build

# In build script
docker build --platform linux/amd64 -t my-image:latest .
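
To avoid hard-coding the platform in every build script, the target can be derived from what the server reports for `uname -m`. This is a minimal sketch; `docker_platform` is a hypothetical helper, and it covers only the two architectures discussed here.

```shell
# Translate a machine type (as printed by `uname -m`) into a
# Docker --platform value.
docker_platform() {
  case $1 in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *)             echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Example: build for the server (x86_64) from an Apple Silicon laptop
# docker build --platform "$(docker_platform x86_64)" -t my-image:latest .
#
# Check what an existing image was built for:
# docker image inspect --format '{{.Os}}/{{.Architecture}}' my-image:latest
```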

Problem: Image Pull Errors

Symptoms:

ErrImagePull
pull access denied
http: server gave HTTP response to HTTPS client

Root Cause:

Local registry uses HTTP
Talos tries HTTPS by default

Solution:

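# patch-registry.yaml: declare the registry's plain-HTTP endpoint as a mirror
# (machine.registries.config has no "protocol" field)
machine:
  registries:
    mirrors:
      192.168.1.100:30500:
        endpoints:
          - http://192.168.1.100:30500

# Apply patch
talosctl -e 192.168.1.100 --nodes 192.168.1.100 patch machineconfig --patch @patch-registry.yaml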

Problem: NPM (Nginx Proxy Manager) 404 Errors

Symptoms:

Proxy host configured but returns 404
Service accessible directly but not via NPM

Diagnosis:

# Check NPM config
kubectl exec -n npm <pod-name> -- cat /data/nginx/proxy_host/1.conf

# Test service from NPM pod
kubectl exec -n npm <pod-name> -- curl http://service.namespace.svc.cluster.local:80

Solution:

Verify forward scheme is http (not https)
Verify service name and namespace
Check service is accessible from NPM pod
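
Most 404s of this kind come down to a mistyped forward host. The sketch below builds the cluster-internal URL from its parts so the value entered in NPM can be compared against it. `svc_url` is a hypothetical helper, and it assumes the default cluster domain `cluster.local`.

```shell
# Build the cluster-internal URL for a Service.
# Usage: svc_url <service> <namespace> [port] [scheme]
# Port defaults to 80 and scheme to http.
svc_url() {
  echo "${4:-http}://$1.$2.svc.cluster.local:${3:-80}"
}

# Example: test the exact URL from inside the NPM pod
# kubectl exec -n npm <pod-name> -- curl -s -o /dev/null -w '%{http_code}\n' \
#   "$(svc_url my-service my-namespace 8080)"
```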

KubeVirt Issues

Problem: VM Won’t Start

Symptoms:

kubectl get vmi
# NAME        AGE   PHASE
# my-vm       5m    Pending

Diagnosis:

# Check VMI status
kubectl describe vmi my-vm

# Check virt-handler logs
kubectl logs -n kubevirt-system -l kubevirt.io=virt-handler

# Check hardware virtualization
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo | grep vmx

Solution:

Enable VT-x in BIOS
Verify CPU supports virtualization
Check resource availability
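
The `grep vmx` check only covers Intel CPUs and matches any line containing the string. A slightly stricter sketch is shown below: it looks only at `flags` lines and accepts either Intel VT-x (`vmx`) or AMD-V (`svm`). `has_virt_flag` is a hypothetical helper name.

```shell
# Read /proc/cpuinfo content on stdin; succeed if any "flags" line lists
# vmx (Intel VT-x) or svm (AMD-V) as a whole word.
has_virt_flag() {
  awk '/^flags/ {
         for (i = 1; i <= NF; i++)
           if ($i == "vmx" || $i == "svm") found = 1
       }
       END { exit !found }'
}

# Example:
# talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo | has_virt_flag \
#   && echo "hardware virtualization available" \
#   || echo "enable VT-x / AMD-V in BIOS"
```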

Problem: VM Script Detection False Negative

Symptoms:

Script reports “Virtualisation matérielle NON détectée!” (“Hardware virtualization NOT detected!”)
But cpuinfo shows vmx flags

Root Cause:

Script detection method not working
Hardware actually supports virtualization

Solution:

Ignore script warning
Proceed with installation
KubeVirt will detect and use VT-x automatically

General Debugging Commands

Cluster Health

# Node status
kubectl get nodes -o wide

# All pods
kubectl get pods -A

# Cluster info
kubectl cluster-info

# Events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
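
On a busy cluster the full pod list is noisy. The sketch below filters it down to pods that need attention. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name.

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace/name status" for every pod that is not Running or Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example:
# kubectl get pods -A --no-headers | unhealthy_pods
```

A pod that is Running but not Ready (for example 0/1 in the READY column) is not flagged by this filter, so it complements the events listing rather than replacing it.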

Talos Health

# Talos health check
talosctl -e 192.168.1.100 --nodes 192.168.1.100 health

# Services
talosctl -e 192.168.1.100 --nodes 192.168.1.100 service list

# Logs
talosctl -e 192.168.1.100 --nodes 192.168.1.100 logs kubelet

Resource Usage

# Node resources
kubectl top nodes

# Pod resources
kubectl top pods -A

# Disk usage (Talos): list mounts with size, used, and available space
talosctl -e 192.168.1.100 --nodes 192.168.1.100 mounts

Prevention Tips

Always Backup - Before major changes
Test in Stages - Don’t change everything at once
Check Logs First - Most issues visible in logs
Verify Prerequisites - Hardware, BIOS, network
Document Changes - Keep track of what you modify
Use Git - Version control for configs

Getting Help

Resources:

Talos Docs: https://www.talos.dev/docs/
Kubernetes Docs: https://kubernetes.io/docs/
KubeVirt Docs: https://kubevirt.io/user-guide/
Stack Overflow: Tag [kubernetes], [talos-linux]

Community:

Talos Slack: https://www.talos.dev/slack/
Kubernetes Forums: https://discuss.kubernetes.io/

Next: Automation Scripts