Troubleshooting
Verify that Ironic and Baremetal Operator are healthy
There is no point continuing before you have verified that the controllers are
healthy. A “standard” deployment will have Ironic and Baremetal Operator running
in the baremetal-operator-system
namespace. Check that the containers are
running, not restarting or crashing:
kubectl -n baremetal-operator-system get pods
Note: If you deploy Ironic outside of Kubernetes you will need to check on it in a different way.
Healthy example output:
NAME READY STATUS RESTARTS AGE
baremetal-operator-controller-manager-85b896f688-j27g5 1/1 Running 0 5m13s
ironic-6bcdcb99f8-6ldlz 3/3 Running 1 (2m2s ago) 5m15s
(There has been one restart, but it is not constantly restarting.)
Unhealthy example output:
NAME READY STATUS RESTARTS AGE
baremetal-operator-controller-manager-85b896f688-j27g5 1/1 Running 0 3m35s
ironic-6bcdcb99f8-6ldlz 1/3 Running 1 (24s ago) 3m37s
Waiting for IP
Make sure to check the logs also since Ironic may be stuck on “waiting for IP”. For example:
kubectl -n baremetal-operator-system logs ironic-6bcdcb99f8-6ldlz -c ironic
If Ironic is waiting for IP, you need to check the network configuration. Some things to look out for:
- What IP or interface is Ironic configured to use?
- Is Ironic using the host network?
- Is Ironic running on the expected (set of) Node(s)?
- Does the Node have the expected IP assigned?
- Are you using keepalived or similar to manage the IP, and is it working properly?
Host is stuck in cleaning, how do I delete it?
First and foremost, avoid using forced deletion, otherwise you’ll have a conflict. If you don’t care about disks being cleaned, you can edit the BareMetalHost resource and disable cleaning:
spec:
automatedCleaningMode: disabled
Alternatively, you can wait for 3 cleaning retries to finish. After that, the host will be deleted. If you do care about cleaning, you need to figure out why it does not finish.
MAC address conflict on registration
If you force deletion of a host after registration, Baremetal Operator will not be able to delete the corresponding record from Ironic. If you try to enroll the same host again, you will see the following error:
Normal RegistrationError 4m36s metal3-baremetal-controller MAC address 11:22:33:44:55:66 conflicts with existing node namespace~name
Currently, the only way to get rid of this error is to re-create the Ironic’s internal database. If your deployment uses SQLite (the default), it is enough to restart the pod with Ironic. If you use MariaDB, you need to restart its pod, clearing any persistent volumes.
Power requests are issued for deleted hosts
Similarly to the previous question, a host is not deleted from Ironic in case of a forced deletion of its BareMetalHost object. If valid BMC credentials were provided, Ironic will keep checking the power state of the host and enforcing the last requested power state. The only solution is again to delete the Ironic’s internal database.
BMH registration errors
BMC credentials may be incorrect or missing. These issues appear in the BareMetalHost’s status and in Events.
Check both kubectl describe bmh <name>
and recent Events for details.
Example output:
Normal RegistrationError 23s metal3-baremetal-controller Failed to get
power state for node 67ac51af-a6b3. Error: Redfish exception occurred.
Error: HTTP GET https://192.168.111.1:8000/redfish/v1/Systems/... returned code 401.
BMH inspection errors
The host is not able to communicate back results to Ironic
If the host cannot communicate with Ironic, it will result in a timeout. Accessing serial logs is necessary to determine the exact issue.
Example output from kubectl get bmh -A
:
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
metal3 node-1 inspecting true inspection error 46m
BareMetalHost’s events from kubectl describe bmh <name> -n <namespace>
:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal InspectionStarted 37m metal3-baremetal-controller Hardware inspection started
Normal InspectionError 7m12s metal3-baremetal-controller timeout reached while inspecting the node
Incompatible configuration
This can occur when attempting to use virtual media or UEFI on hardware that does not support it. The error will show in status and in events.
Example kubectl get bmh -A
:
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
metal3 node-1 inspecting true inspection error 8m17s
BareMetalHost’s events:
Normal InspectionError 5s metal3-baremetal-controller Failed to inspect hardware. Reason: unable to start inspection:
Redfish exception occurred. Error: Setting boot mode to bios failed for node ceec28f5-cedb.rror: HTTP PATCH
https://192.168.111.1:8000/redfish/v1/Systems/... returned code 500.
Provisioning errors
Errors during provisioning will be visible when listing the BareMetalHosts:
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
metal3 node-1 provisioning test1-dt8j2 true provisioning error 149m
Check BareMetalHost’s events for the specific reason.
Wrong image checksum example:
Normal ProvisioningError 10m metal3-baremetal-controller Image provisioning failed: Deploy
step deploy.write_image failed on node df880558-09da. Image failed to verify against checksum.
location: CENTOS_9_NODE_IMAGE.img; image ID: /dev/sda; image checksum: abcd1234; verification checksum: ...
No root device found example:
Normal ProvisioningStarted 15s metal3-baremetal-controller Image provisioning started for http://172.22.0.1/images/CENTOS_9_NODE_IMAGE.img
Normal ProvisioningError 1s metal3-baremetal-controller Image provisioning failed: Deploy step deploy.write_image failed on node d25ce8de-914e-4146-a0c0-58825274572d. No suitable device was found for deployment using these hints {'name': 's== /dev/vdb'}
No BareMetalHost available or matching
This appears in the Metal3Machine status:
Status:
Conditions:
Last Transition Time: 2025-08-15T10:53:05Z
Message: No available host found. Requeuing.. Object will be requeued after 30s
Reason: AssociateBMHFailed
Severity: Error
Status: False
Type: AssociateBMH
CAPM3 controller logs when there is no available hosts:
I0815 11:10:35.699004 1 metal3machine_manager.go:332] "No available host found. Requeuing." logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test-no-match-2" machine="test-no-match-2" cluster="test1" metal3-cluster="test1"
CAPM3 controller logs when the annotated host is not found:
I0815 06:08:54.687380 1 metal3machine_manager.go:788] "Annotated host not found" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test1-zxzn7-qvl6n" machine="test1-zxzn7-qvl6n" cluster="test1" metal3-cluster="test1" host="metal3/node-0"
Provider ID is missing
This occurs if cloudProviderEnabled
is set to true
on Metal3Cluster when no
external cloud provider is used. The Metal3Machine will remain stuck in the
Provisioning phase.
Example output from kubectl get metal3machine -A
:
NAMESPACE NAME AGE PROVIDERID READY CLUSTER PHASE
metal3 test1-82ljr 160m metal3://metal3/node-0/test1-82ljr true test1
metal3 test1-bv9mv-2f8th 35m test1
Metal3Machine’s status:
Status:
Conditions:
Reason: NotReady
Status: False
Type: Available
Message: * NodeHealthy: Waiting for Metal3Machine to report spec.providerID
nodeRef
missing
A CAPI-level issue. This can be caused by a failure to boot the image or join it to the cluster. Access to the node or serial logs is needed to determine the exact cause. In particular, cloud-init logs can help pinpoint the issue.
CAPM3 controller logs:
I0815 11:10:36.545990 1 metal3labelsync_controller.go:150] "Could not find Node Ref on Machine object, will retry" logger="controllers.Metal3LabelSync.metal3-label-sync-controller" metal3-label-sync="metal3/node-0"