Troubleshooting Guide: Common Issues & Solutions
Real problems we hit during the migration from Proxmox to Talos, and how we solved each one.
Boot Issues
Problem: Server Boots to Proxmox Instead of Talos ISO
Symptoms:
- After mounting ISO via iDRAC, server still boots to Proxmox
- GRUB menu appears instead of Talos installer
Root Cause:
- Old Proxmox bootloader still on disk
- BIOS boot order not updated
- Virtual media not properly mapped
Solutions:
1. Wipe Bootloader:
# SSH into Proxmox
ssh root@192.168.1.191
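# Wipe the first 100 MB of the disk (destroys the bootloader and partition table;
# confirm /dev/sda is the Proxmox disk before running)
dd if=/dev/zero of=/dev/sda bs=1M count=100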
# Reboot
reboot
2. Force Boot from Virtual Media:
- Press F11 during POST to open the Boot Manager
- Select “Virtual CD/DVD” explicitly
- Or use iDRAC web interface → Virtual Console → Boot from ISO
3. Reset USB Status:
- iDRAC → Virtual Media → Reset USB Status
- Remap ISO as CD/DVD (not removable disk)
Problem: “No Bootable Device Found”
Symptoms:
- Server tries to boot from virtual media but fails
- Error: “No bootable device found”
Solution:
- Verify ISO is properly mapped in iDRAC
- Try remapping as “removable disk” instead of CD/DVD
- Check ISO file integrity (re-download if needed)
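The integrity check can be scripted. This is a minimal sketch that assumes `sha256sum` is available on your workstation and that you have copied the expected digest from the checksum published with the ISO release. The function name `verify_iso` and the file path in the usage line are hypothetical.

```shell
# verify_iso FILE EXPECTED_SHA256
# Prints OK and returns 0 when the file's SHA-256 digest matches,
# prints the actual digest to stderr and returns 1 otherwise.
verify_iso() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" 2>/dev/null | awk '{print $1}')
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
  else
    echo "MISMATCH: got '${actual}'" >&2
    return 1
  fi
}

# Usage (path and digest are placeholders):
#   verify_iso ./talos.iso "<expected-sha256>"
```

A mismatch usually means a truncated download, so re-download before remapping the ISO in iDRAC.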
Talos Installation Issues
Problem: “specified install disk does not exist: /dev/sda”
Error:
error applying new configuration: rpc error: code = InvalidArgument
desc = configuration validation failed:
specified install disk does not exist: "/dev/sda"
Root Cause:
- Virtual CD takes /dev/sda
- Talos disk is actually /dev/sdb or another device
Solution:
# In controlplane.yaml
machine:
  install:
    disk: /dev/sdb # Change from /dev/sda
Find Correct Disk:
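# Talos has no shell, so lsblk is not available on the node itself.
# With the node booted from the Talos ISO (maintenance mode), query the
# disks through the Talos API from your workstation:
talosctl get disks --insecure --nodes <node-ip>
# On older Talos releases: talosctl disks --insecure --nodes <node-ip>
# Identify your disk (usually sdb if sda is the virtual CD)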
Problem: Connection Refused After Installation
Symptoms:
- Talos installed successfully
- Server reboots
- talosctl returns "connection refused"
- Kubernetes API not accessible
Root Cause:
- Kubernetes not yet bootstrapped
- Normal after fresh installation
Solution:
# Wait for Talos to fully boot (2-3 minutes)
# Then bootstrap Kubernetes
talosctl bootstrap --nodes 192.168.1.100
# Get kubeconfig
talosctl kubeconfig --nodes 192.168.1.100
# Verify
kubectl get nodes
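Instead of waiting a fixed 2-3 minutes, the wait can be expressed as a polling loop. The sketch below is a generic retry helper; the name `retry_until` and the attempt and delay values in the usage line are illustrative assumptions, and it relies on the polled command returning a non-zero exit code while the node is unreachable.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "succeeded on attempt $i"
      return 0
    fi
    # Sleep between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# Usage: poll the Talos API for up to 3 minutes, then bootstrap
#   retry_until 18 10 talosctl --nodes 192.168.1.100 version \
#     && talosctl bootstrap --nodes 192.168.1.100
```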
Kubernetes Issues
Problem: kubeconfig Not Working
Symptoms:
The connection to the server localhost:8080 was refused
Root Cause:
- kubeconfig not properly configured
- Wrong context selected
Solution:
# Set TALOSCONFIG
export TALOSCONFIG=~/r430-migration/talos-config/talosconfig
# Force regenerate kubeconfig
talosctl -e 192.168.1.100 --nodes 192.168.1.100 kubeconfig --force
# Verify
kubectl get nodes
Problem: Pods Stuck in Pending
Symptoms:
kubectl get pods
# NAME READY STATUS RESTARTS
# my-pod 0/1 Pending 0
Diagnosis:
# Check why pending
kubectl describe pod my-pod
# Common causes:
# - No nodes available
# - PVC not bound
# - Resource constraints
# - PodSecurity violations
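When several pods are affected, it helps to list every Pending pod before describing them one by one. The sketch below filters the default table output of `kubectl get pods -A --no-headers`, where the fourth column is STATUS. The function name `pending_pods` is hypothetical.

```shell
# pending_pods
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# namespace/name for every pod whose STATUS column is Pending.
pending_pods() {
  awk '$4 == "Pending" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pods -A --no-headers | pending_pods
```

Each line of the result can then be passed to `kubectl describe pod` with the matching `-n` namespace.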
Solutions:
PVC Not Bound:
# Check PVC
kubectl get pvc
kubectl describe pvc my-pvc
# Check StorageClass
kubectl get storageclass
# Should have default StorageClass
PodSecurity Violations:
# Check namespace labels
kubectl get namespace my-namespace --show-labels
# Add privileged label
kubectl label namespace my-namespace \
pod-security.kubernetes.io/enforce=privileged --overwrite
Problem: TLS Errors with kubectl logs
Symptoms:
Error from server: remote error: tls: internal error
Root Cause:
- Kubelet server certificates not approved
- CSR (Certificate Signing Request) pending
Solution:
# List pending CSRs
kubectl get csr
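# Approve pending kubelet-serving CSRs
# (-o name prints only the CSR name, so filter the table output, which includes the signer)
kubectl get csr --no-headers | grep kubelet-serving | grep Pending | awk '{print $1}' | xargs kubectl certificate approve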
# Verify
kubectl logs <pod-name>
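To see exactly which requests would be approved before approving them, the selection step can be separated out. This sketch assumes the default `kubectl get csr --no-headers` column layout (NAME, AGE, SIGNERNAME, REQUESTOR, REQUESTEDDURATION, CONDITION). The function name `pending_serving_csrs` is hypothetical.

```shell
# pending_serving_csrs
# Reads `kubectl get csr --no-headers` output on stdin and prints the names
# of CSRs signed by kubernetes.io/kubelet-serving that are still Pending.
pending_serving_csrs() {
  awk '$3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { print $1 }'
}

# Usage: review the list first, then approve
#   kubectl get csr --no-headers | pending_serving_csrs
#   kubectl get csr --no-headers | pending_serving_csrs | xargs kubectl certificate approve
```

Matching on the signer column avoids approving client CSRs that happen to be pending at the same time.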
Storage Issues
Problem: Longhorn Pods CrashLoopBackOff
Symptoms:
kubectl get pods -n longhorn-system
# longhorn-manager-xxxxx 0/1 CrashLoopBackOff
Root Cause:
- Missing iSCSI tools (open-iscsi)
- Talos doesn't include iSCSI by default
Solution:
# Check for iscsiadm
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /usr/sbin/iscsiadm
# Error: no such file or directory
# Switch to Local Path Provisioner instead
kubectl delete namespace longhorn-system
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
Problem: PVC Stuck in Pending
Symptoms:
kubectl get pvc
# NAME STATUS VOLUME
# my-pvc Pending
Diagnosis:
# Check provisioner pod
kubectl get pods -n local-path-storage
# Check logs
kubectl logs -n local-path-storage -l app=local-path-provisioner
# Check StorageClass
kubectl get storageclass
Solution:
# Label namespace for privileged access
kubectl label namespace local-path-storage \
pod-security.kubernetes.io/enforce=privileged --overwrite
# Delete and recreate PVC
kubectl delete pvc my-pvc
kubectl apply -f pvc.yaml
Networking Issues
Problem: MetalLB No External IP
Symptoms:
kubectl get svc
# NAME TYPE EXTERNAL-IP
# my-svc LoadBalancer <pending>
Diagnosis:
# Check MetalLB pods
kubectl get pods -n metallb-system
# Check IP pool
kubectl get ipaddresspool -n metallb-system
# Check L2 advertisement
kubectl get l2advertisement -n metallb-system
Solution:
# Verify IP pool configuration
kubectl describe ipaddresspool -n metallb-system default-pool
# IPs must be:
# - On same subnet as nodes
# - Not in DHCP range
# - Not assigned to other devices
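The first rule (same subnet as the nodes) can be checked with integer arithmetic before the pool is applied. This is a minimal sketch for IPv4 only. The function names and the example pool address `192.168.1.240` are assumptions, not values from the actual pool.

```shell
# ip_to_int A.B.C.D
# Converts a dotted-quad IPv4 address to its 32-bit integer value.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# same_subnet IP1 IP2 PREFIX_LENGTH
# Returns 0 when both addresses share the same network for the given prefix.
same_subnet() {
  local ip1 ip2 prefix mask
  ip1=$(ip_to_int "$1")
  ip2=$(ip_to_int "$2")
  prefix="$3"
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( ip1 & mask )) -eq $(( ip2 & mask )) ]
}

# Usage: compare a candidate pool address against the node IP
#   same_subnet 192.168.1.240 192.168.1.100 24 && echo "same subnet"
```

The DHCP-range and address-conflict rules still need to be checked against the router configuration; this only covers the subnet rule.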
Problem: Traefik Not Routing
Symptoms:
- Ingress created but not accessible
- 404 errors
Diagnosis:
# Check Ingress
kubectl get ingress
kubectl describe ingress my-ingress
# Check Traefik pods
kubectl get pods -n traefik
# Check Traefik logs
kubectl logs -n traefik -l app=traefik
Solution:
# Verify IngressClass
kubectl get ingressclass
# Ensure Ingress uses correct class
# ingressClassName: traefik
Application Issues
Problem: Docker Image Architecture Mismatch
Symptoms:
exec ./app: exec format error
Root Cause:
- Image built on ARM64 (Apple Silicon)
- Server is AMD64
Solution:
# In Dockerfile
FROM golang:1.21-alpine AS builder
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build
# In build script
docker build --platform linux/amd64 -t my-image:latest .
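A build script can detect this mismatch up front by comparing the build host's machine type with the target platform. The sketch below maps `uname -m` values to Docker platform strings; the function names are hypothetical and only the two architectures mentioned above are covered.

```shell
# machine_to_platform MACHINE
# Maps a `uname -m` value to a Docker platform string.
machine_to_platform() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *) echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# needs_cross_build HOST_MACHINE TARGET_PLATFORM
# Returns 0 when the host's native platform differs from the target.
needs_cross_build() {
  local host
  host=$(machine_to_platform "$1") || return 1
  [ "$host" != "$2" ]
}

# Usage:
#   if needs_cross_build "$(uname -m)" linux/amd64; then
#     echo "host differs from target: build with --platform linux/amd64"
#   fi
# To check an image that is already built:
#   docker image inspect --format '{{.Os}}/{{.Architecture}}' my-image:latest
```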
Problem: Image Pull Errors
Symptoms:
ErrImagePull
pull access denied
http: server gave HTTP response to HTTPS client
Root Cause:
- Local registry uses HTTP
- Talos tries HTTPS by default
Solution:
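# Patch Talos config (patch-registry.yaml)
# An HTTP registry is declared as a mirror with an http:// endpoint;
# the registry config section has no "protocol" field.
machine:
  registries:
    mirrors:
      192.168.1.100:30500:
        endpoints:
          - http://192.168.1.100:30500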
# Apply patch
talosctl patch machineconfig --patch @patch-registry.yaml
Problem: NPM 404 Errors
Symptoms:
- Proxy host configured but returns 404
- Service accessible directly but not via NPM
Diagnosis:
# Check NPM config
kubectl exec -n npm <pod-name> -- cat /data/nginx/proxy_host/1.conf
# Test service from NPM pod
kubectl exec -n npm <pod-name> -- curl http://service.namespace.svc.cluster.local:80
Solution:
- Verify forward scheme is http (not https)
- Verify service name and namespace
- Check service is accessible from NPM pod
KubeVirt Issues
Problem: VM Won’t Start
Symptoms:
kubectl get vmi
# NAME AGE PHASE
# my-vm 5m Pending
Diagnosis:
# Check VMI status
kubectl describe vmi my-vm
# Check virt-handler logs
kubectl logs -n kubevirt-system -l kubevirt.io=virt-handler
# Check hardware virtualization
talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo | grep vmx
Solution:
- Enable VT-x in BIOS
- Verify CPU supports virtualization
- Check resource availability
Problem: VM Script Detection False Positive
Symptoms:
- Script reports “Virtualisation matérielle NON détectée!” (“Hardware virtualization NOT detected!”)
- But cpuinfo shows vmx flags
Root Cause:
- Script detection method not working
- Hardware actually supports virtualization
Solution:
- Ignore script warning
- Proceed with installation
- KubeVirt will detect and use VT-x automatically
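Before ignoring the warning, the flags can be confirmed independently of the script. This sketch reads cpuinfo text on stdin and looks for the `vmx` flag as a whole word. The `svm` alternative (the AMD equivalent) is an addition not covered in the case above, and the function name `has_hw_virt` is hypothetical.

```shell
# has_hw_virt
# Reads /proc/cpuinfo content on stdin; returns 0 when a "flags" line
# contains vmx (Intel VT-x) or svm (AMD-V) as a whole word.
has_hw_virt() {
  grep -E '^flags' | grep -Eqw 'vmx|svm'
}

# Usage:
#   talosctl -e 192.168.1.100 --nodes 192.168.1.100 read /proc/cpuinfo \
#     | has_hw_virt && echo "hardware virtualization available"
```

Matching whole words avoids false hits on longer flag names that merely contain the same letters.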
General Debugging Commands
Cluster Health
# Node status
kubectl get nodes -o wide
# All pods
kubectl get pods -A
# Cluster info
kubectl cluster-info
# Events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
Talos Health
# Talos health check
talosctl -e 192.168.1.100 --nodes 192.168.1.100 health
# Services
talosctl -e 192.168.1.100 --nodes 192.168.1.100 service list
# Logs
talosctl -e 192.168.1.100 --nodes 192.168.1.100 logs kubelet
Resource Usage
# Node resources
kubectl top nodes
# Pod resources
kubectl top pods -A
# Disk usage (Talos); sizes are already human-readable, and -h would print help instead
talosctl -e 192.168.1.100 --nodes 192.168.1.100 df
Prevention Tips
- Always Backup - Before major changes
- Test in Stages - Don’t change everything at once
- Check Logs First - Most issues visible in logs
- Verify Prerequisites - Hardware, BIOS, network
- Document Changes - Keep track of what you modify
- Use Git - Version control for configs
Getting Help
Resources:
- Talos Docs: https://www.talos.dev/docs/
- Kubernetes Docs: https://kubernetes.io/docs/
- KubeVirt Docs: https://kubevirt.io/user-guide/
- Stack Overflow: Tag [kubernetes], [talos-linux]
Community:
- Talos Slack: https://www.talos.dev/slack/
- Kubernetes Forums: https://discuss.kubernetes.io/