Metal³

The Metal³ project (pronounced: “Metal Kubed”) provides components for bare metal host management with Kubernetes. You can enrol your bare metal machines, provision operating system images, and then, if you like, deploy Kubernetes clusters to them. From there, operating and upgrading your Kubernetes clusters can be handled by Metal³. Moreover, Metal³ is itself a Kubernetes application, so it runs on Kubernetes, and uses Kubernetes resources and APIs as its interface.

Metal³ is one of the providers for the Kubernetes sub-project Cluster API. Cluster API provides infrastructure-agnostic Kubernetes lifecycle management, and Metal³ brings the bare metal implementation.

This is paired with Ironic, a component from the OpenStack ecosystem, for booting and installing machines. Metal³ handles the installation of Ironic as a standalone component (there’s no need to bring along the rest of OpenStack). Ironic is supported by a mature community of hardware vendors and supports a wide range of bare metal management protocols which are continuously tested on a variety of hardware. Backed by Ironic, Metal³ can provision machines, no matter the brand of hardware.

In summary, you can write Kubernetes manifests representing your hardware and your desired Kubernetes cluster layout. Then Metal³ can:

  • Discover your hardware inventory
  • Configure BIOS and RAID settings on your hosts
  • Optionally clean a host’s disks as part of provisioning
  • Install and boot an operating system image of your choice
  • Deploy Kubernetes
  • Upgrade Kubernetes or the operating system in your clusters with a non-disruptive rolling strategy
  • Automatically remediate failed nodes by rebooting them and removing them from the cluster if necessary

You can even deploy Metal³ to your clusters so that they can manage other clusters using Metal³…

Metal³ is open-source and welcomes community contributions. The community meets at the following venues:

  • #cluster-api-baremetal on Kubernetes Slack
  • Metal³ development mailing list
  • From the mailing list, you’ll also be able to find the details of a weekly Zoom community call on Wednesdays at 14:00 GMT

About this guide

This user guide aims to explain the Metal³ feature set and provide how-tos for using Metal³. It’s not a tutorial (for that, see the Getting Started Guide). Nor is it a reference (for that, see the API Reference Documentation and, of course, the code itself).

Project overview

Metal3 consists of multiple sub-projects. The most notable are Bare Metal Operator, Cluster API provider Metal3 and the IP address manager. There is no requirement to use all of them.

The stack, when including Cluster API and Ironic, looks like this:

[Figure: Metal3 stack]

From a user perspective it may be more useful to visualize the Kubernetes resources. When using Cluster API, Metal3 works as any other infrastructure provider. The Machines get corresponding Metal3Machines, which in turn reference the BareMetalHosts.

[Figure: CAPI-machines]

The following diagram shows more details about the Metal3 objects. Note that it is not showing everything and is meant just as an overview.

[Figure: Metal3-CAPI objects]

How does it work?

Metal3 relies on Ironic for interacting with the physical machines. Ironic in turn communicates with Baseboard Management Controllers (BMCs) to manage the machines. Ironic can communicate with the BMCs using protocols such as Redfish, IPMI, or iDRAC. In this way, it can power the machines on or off, change the boot device, and so on. For more information, see Ironic in Metal3.

For more advanced operations, like writing an image to the disk, the Ironic Python Agent (IPA) is first booted on the machine. Ironic can then communicate with the IPA to perform the requested operation.

The Bare Metal Operator (BMO) is a Kubernetes controller that exposes parts of Ironic’s capabilities through the Kubernetes API. This is essentially done through the BareMetalHost custom resource.

The Cluster API infrastructure provider for Metal3 (CAPM3) provides the necessary functionality to make Metal3 work with Cluster API. This means that Cluster API can be used to provision bare metal hosts into workload clusters. Similar to other infrastructure providers, CAPM3 adds custom resources such as Metal3Cluster and Metal3MachineTemplate in order to implement the Cluster API contract.

A notable addition to the contract is the management of metadata through Metal3DataTemplates and related objects. Users can provide metadata and network data through these objects. For network data specifically, it is worth mentioning the Metal3 IP address manager (IPAM) that can be used to assign IP addresses to the hosts.

Requirements

  • Server(s) with baseboard management capabilities (e.g. Redfish, iDRAC, IPMI). For development you can use virtual machines with Sushy-tools. More information here.
  • An Ironic instance. More information here.
  • A Kubernetes cluster (the management cluster) where the user stores and manages the Metal3 resources. A kind cluster is enough for bootstrapping or development.

Quick-start for Metal3

This guide has been tested on Ubuntu server 22.04. It should be seen as an example rather than the absolute truth about how to deploy and use Metal3. We will cover two environments and two scenarios. The environments are

  1. a baremetal lab with actual physical servers and baseboard management controllers (BMCs), and
  2. a virtualized baremetal lab with virtual machines and sushy-tools acting as BMC.

In both of these, we will show how to use Bare Metal Operator and Ironic to manage the servers through a Kubernetes API, as well as how to turn the servers into Kubernetes clusters managed through Cluster API. These are the two scenarios.

In a nutshell, this is what we will do:

  1. Set up a management cluster
  2. Set up a DHCP server
  3. Set up a disk image server
  4. Deploy Ironic
  5. Deploy Bare Metal Operator
  6. Create BareMetalHosts to represent the servers
  7. (Scenario 1) Provision the BareMetalHosts
  8. (Scenario 2) Deploy Cluster API and turn the BareMetalHosts into a Kubernetes cluster

Prerequisites

You will need the following tools installed.

  • docker (or podman)
  • kind or minikube (management cluster, not needed if you already have a “real” cluster that you want to use)
  • clusterctl
  • kubectl
  • htpasswd
  • virsh and virt-install for the virtualized setup

Baremetal lab configuration

The baremetal lab has two servers that we will call bml-01 and bml-02, as well as a management computer where we will set up Metal3. The servers are equipped with iLO 4 BMCs. These BMCs are connected to an “out of band” network (192.168.1.0/24) and they have the following IP addresses.

  • bml-01: 192.168.1.13
  • bml-02: 192.168.1.14

There is a separate network for the servers (192.168.0.0/24). The management computer is connected to both of these networks with IP addresses 192.168.1.7 and 192.168.0.150 respectively.

Finally, we will need the MAC addresses of the servers to keep track of which is which.

  • bml-01: 80:c1:6e:7a:e8:10
  • bml-02: 80:c1:6e:7a:5a:a8

Virtualized configuration

If you do not have the hardware or perhaps just want to test things out without committing to a full baremetal lab, you may simulate it with virtual machines. In this section we will show how to create a virtual machine and use sushy-tools as a baseboard management controller for it.

The configuration is a bit simpler than in the baremetal lab because we don’t have a separate out of band network here. In the end we will have the BMC available as

  • bml-vm-01: 192.168.222.1:8000/redfish/v1/Systems/bmh-vm-01

and the MAC address:

  • bml-vm-01: 00:60:2f:31:81:01

Start by defining a libvirt network:

<network>
  <name>baremetal</name>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='metal3'/>
  <ip address='192.168.222.1' netmask='255.255.255.0'>
  </ip>
</network>

Save this as net.xml, define it and start it.

virsh -c qemu:///system net-define net.xml
virsh -c qemu:///system net-start baremetal

Next, we will create a virtual machine. Feel free to adjust it as you see fit, but make sure to note the MAC address, since it will be needed later. You can also create more than one if you like.

# use --ram=8192 for Scenario 2
virt-install \
  --connect qemu:///system \
  --name bmh-vm-01 \
  --description "Virtualized BareMetalHost" \
  --osinfo=ubuntu-lts-latest \
  --ram=4096 \
  --vcpus=2 \
  --disk size=25 \
  --graphics=none \
  --console pty \
  --serial pty \
  --pxe \
  --network network=baremetal,mac="00:60:2f:31:81:01" \
  --noautoconsole

Sushy-tools - AKA the BMC

Metal3 relies on baseboard management controllers to manage the baremetal servers, so we need something similar for our virtual machines. This comes in the form of sushy-tools.

We need to create a configuration file first:

# Listen on 192.168.222.1:8000
SUSHY_EMULATOR_LISTEN_IP = u'192.168.222.1'
SUSHY_EMULATOR_LISTEN_PORT = 8000
# The libvirt URI to use. This option enables libvirt driver.
SUSHY_EMULATOR_LIBVIRT_URI = u'qemu:///system'

Save this as sushy-tools.conf, then start sushy-tools in a container:

docker run --name sushy-tools --rm --network host -d \
  -v /var/run/libvirt:/var/run/libvirt \
  -v "$(pwd)/sushy-tools.conf:/etc/sushy/sushy-emulator.conf" \
  -e SUSHY_EMULATOR_CONFIG=/etc/sushy/sushy-emulator.conf \
  quay.io/metal3-io/sushy-tools:latest sushy-emulator
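
Before moving on, it can be useful to confirm that the emulated BMC answers Redfish requests. The following is a minimal sketch, assuming curl is installed and sushy-tools listens on the address configured above. The helper name list_bmc_systems is hypothetical and not part of sushy-tools.

```shell
# Sketch: query the Redfish Systems collection of the emulated BMC.
# list_bmc_systems is a hypothetical helper name. The default address
# matches SUSHY_EMULATOR_LISTEN_IP and SUSHY_EMULATOR_LISTEN_PORT above.
list_bmc_systems() {
  local bmc="${1:-192.168.222.1:8000}"
  # --fail makes curl return a non-zero status on HTTP errors
  curl --silent --show-error --fail "http://${bmc}/redfish/v1/Systems"
}

# Usage:
#   list_bmc_systems
#   list_bmc_systems 192.168.222.1:8000
```

If the emulator is running and can reach libvirt, the JSON response should list the virtual machine created earlier as a member of the collection.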

Common setup

This section is common to both the baremetal configuration and the virtualized environment, although specific values will differ between environments. We will go through how to configure and deploy Ironic and Bare Metal Operator.

Management cluster

If you already have a Kubernetes cluster that you want to use, go ahead and use that. Please ensure that it is connected to the relevant networks so that Ironic can reach the BMCs and so that the BareMetalHosts can reach Ironic.

If you do not have a cluster already, you can create one using kind. Please note that this is absolutely not intended for production environments.

We will use the following configuration file for kind. Save it as kind.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  # Open ports for Ironic
  extraPortMappings:
  # Ironic httpd
  - containerPort: 6180
    hostPort: 6180
    listenAddress: "0.0.0.0"
    protocol: TCP
  # Ironic API
  - containerPort: 6385
    hostPort: 6385
    listenAddress: "0.0.0.0"
    protocol: TCP
  # Inspector API
  - containerPort: 5050
    hostPort: 5050
    listenAddress: "0.0.0.0"
    protocol: TCP

As you can see, it has a few ports forwarded from the host. This is to make Ironic reachable when it is running inside the kind cluster.

Now go ahead and create the cluster:

kind create cluster --config kind.yaml

We will also need to install cert-manager. It will be used to manage the certificates for Ironic later.

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.3/cert-manager.yaml

DHCP server

The BareMetalHosts must be able to call back to Ironic when going through the inspection phase. This means that they must have IP addresses in a network where they can reach Ironic. We will set up a DHCP server for this purpose.

Any DHCP server can be used for this. Here we will use the Ironic container image, which includes dnsmasq and some scripts for configuring it.

Create a configuration file and save it as dnsmasq.env.

Baremetal lab:

# Specify the MAC addresses (separated by ;) of the hosts we know about and want to use
DHCP_HOSTS=80:c1:6e:7a:e8:10;80:c1:6e:7a:5a:a8
# Ignore unknown hosts so we don't accidentally give out IP addresses to other hosts in the network
DHCP_IGNORE=tag:!known
# Listen on this IP (management computer)
PROVISIONING_IP=192.168.0.150
# Give out IP addresses in this range
DHCP_RANGE=192.168.0.100,192.168.0.149
GATEWAY_IP=192.168.0.1

Virtualized environment:

DHCP_HOSTS=00:60:2f:31:81:01
DHCP_IGNORE=tag:!known
# IP of the host from VM perspective
PROVISIONING_IP=192.168.222.1
GATEWAY_IP=192.168.222.1
DHCP_RANGE=192.168.222.100,192.168.222.149

You can now run the DHCP server like this:

docker run --name dnsmasq --rm -d --net=host --privileged --user 997:994 \
  --env-file dnsmasq.env --entrypoint /bin/rundnsmasq \
  quay.io/metal3-io/ironic
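
With more than a couple of servers, typing the DHCP_HOSTS line by hand gets error-prone. The following is a small sketch that joins MAC addresses with the ; separator described in the configuration comments above. The helper name join_dhcp_hosts is hypothetical.

```shell
# Sketch: build the DHCP_HOSTS line for dnsmasq.env from a list of MAC addresses.
# join_dhcp_hosts is a hypothetical helper name.
join_dhcp_hosts() {
  # "$*" joins all arguments with the first character of IFS.
  # The subshell keeps the IFS change local to this function.
  (IFS=';'; printf 'DHCP_HOSTS=%s\n' "$*")
}

# Usage:
#   join_dhcp_hosts 80:c1:6e:7a:e8:10 80:c1:6e:7a:5a:a8 >> dnsmasq.env
```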

Image server

In order to do anything useful, we will need a server for hosting disk images that can be used to provision the servers.

Create a directory to hold the disk images:

mkdir disk-images

Download images to use for testing (pick those that you want):

pushd disk-images
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
wget https://cloud-images.ubuntu.com/jammy/current/SHA256SUMS
sha256sum --ignore-missing -c SHA256SUMS
wget https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2
wget https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2.SHA256SUM
sha256sum -c CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2.SHA256SUM
wget https://artifactory.nordix.org/artifactory/metal3/images/k8s_v1.29.0/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2
sha256sum CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2
popd
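
The last sha256sum call above only prints a checksum, since there is no checksum file to compare against. Scenario 2 needs that value as IMAGE_CHECKSUM, so it can be handy to capture it directly. This is a minimal sketch; image_checksum is a hypothetical helper name.

```shell
# Sketch: print only the SHA-256 checksum of a file.
# image_checksum is a hypothetical helper name.
image_checksum() {
  # sha256sum prints "<checksum>  <file>"; keep the first field only
  sha256sum "$1" | cut -d ' ' -f 1
}

# Usage:
#   export IMAGE_CHECKSUM="$(image_checksum disk-images/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2)"
```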

Run a basic http server to expose the disk images:

docker run --name image-server --rm -d -p 80:8080 \
  -v "$(pwd)/disk-images:/usr/share/nginx/html" nginxinc/nginx-unprivileged

Deploy Ironic

In this section we will create a kustomization containing configuration and credentials for deploying Ironic.

Create a folder to hold the kustomization:

mkdir ironic

Authentication configuration

Create authentication configuration for Ironic and Inspector. You will need to generate a username and password for each. We will here refer to them as IRONIC_USERNAME, IRONIC_PASSWORD, INSPECTOR_USERNAME and INSPECTOR_PASSWORD.

Create a file ironic-auth-config with configuration for how to access Ironic. This will be used by Inspector. It should have the following content:

[ironic]
auth_type=http_basic
username=IRONIC_USERNAME
password=IRONIC_PASSWORD

Create a file ironic-inspector-auth-config with configuration for how to access Inspector. This will be used by Ironic. It should have the following content:

[inspector]
auth_type=http_basic
username=INSPECTOR_USERNAME
password=INSPECTOR_PASSWORD

To enable basic auth, we need to create secrets containing the keys IRONIC_HTPASSWD and INSPECTOR_HTPASSWD with values generated from the credentials using htpasswd. We will do this by creating two files ironic-htpasswd and ironic-inspector-htpasswd with the following content.

ironic-htpasswd:

IRONIC_HTPASSWD="<output of `htpasswd -n -b -B IRONIC_USERNAME IRONIC_PASSWORD`>"

Similarly for ironic-inspector-htpasswd:

INSPECTOR_HTPASSWD="<output of `htpasswd -n -b -B INSPECTOR_USERNAME INSPECTOR_PASSWORD`>"
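
The four files described above can also be written in one go. The following is a sketch only, assuming htpasswd is installed and that you pass in credentials you generated yourself. The helper name write_ironic_auth is hypothetical, and the file contents simply mirror the formats shown above.

```shell
# Sketch: write the Ironic and Inspector auth files into a directory.
# write_ironic_auth is a hypothetical helper name, not part of Metal3.
# Arguments: <dir> <ironic user> <ironic password> <inspector user> <inspector password>
write_ironic_auth() {
  local dir="$1" ironic_user="$2" ironic_pass="$3"
  local inspector_user="$4" inspector_pass="$5"
  mkdir -p "$dir"

  # Client configuration used by Inspector to access Ironic
  printf '[ironic]\nauth_type=http_basic\nusername=%s\npassword=%s\n' \
    "$ironic_user" "$ironic_pass" > "$dir/ironic-auth-config"

  # Client configuration used by Ironic to access Inspector
  printf '[inspector]\nauth_type=http_basic\nusername=%s\npassword=%s\n' \
    "$inspector_user" "$inspector_pass" > "$dir/ironic-inspector-auth-config"

  # htpasswd entries (bcrypt) for the basic-auth secrets
  printf 'IRONIC_HTPASSWD="%s"\n' \
    "$(htpasswd -n -b -B "$ironic_user" "$ironic_pass")" > "$dir/ironic-htpasswd"
  printf 'INSPECTOR_HTPASSWD="%s"\n' \
    "$(htpasswd -n -b -B "$inspector_user" "$inspector_pass")" > "$dir/ironic-inspector-htpasswd"
}

# Usage:
#   write_ironic_auth ironic "$IRONIC_USERNAME" "$IRONIC_PASSWORD" \
#     "$INSPECTOR_USERNAME" "$INSPECTOR_PASSWORD"
```

Note that passing passwords as command-line arguments exposes them to other users on the same machine through the process list, so this is best kept to lab environments.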

Ironic environment variables

In this section we will create a file containing environment variables used to configure Ironic and related components. We will call the file ironic_bmo.env. It looks like this for the baremetal lab:

# Same port as exposed in kind.yaml
HTTP_PORT=6180
# This is the interface inside the container
PROVISIONING_INTERFACE=eth0
# URL where the http server is exposed (IP of management computer)
CACHEURL=http://192.168.0.150
IRONIC_KERNEL_PARAMS=console=ttyS0
# IP where the BMCs can access Ironic to get the virtualmedia boot image.
# This is the IP of the management computer in the out of band network.
IRONIC_EXTERNAL_IP=192.168.1.7
# URLs where the servers can callback during inspection.
# IP of management computer in the other network and same ports as in kind.yaml
IRONIC_EXTERNAL_CALLBACK_URL=https://192.168.0.150:6385
IRONIC_INSPECTOR_CALLBACK_ENDPOINT_OVERRIDE=https://192.168.0.150:5050

For the virtualized environment it looks like this:

HTTP_PORT=6180
PROVISIONING_INTERFACE=eth0
CACHEURL=http://192.168.222.1/images
IRONIC_KERNEL_PARAMS=console=ttyS0

For more details on available variables, see the ironic-image repository.

Patch Ironic Deployment

The Ironic kustomization that we build on includes a dnsmasq container used for DHCP and PXE booting. However, we already set this up separately, because it is tricky to expose a DHCP server running inside kind. This means that we do not need the dnsmasq container that comes with the kustomization by default.

We will create a patch for removing it. It looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ironic
spec:
  template:
    spec:
      containers:
      - name: ironic-dnsmasq
        $patch: delete

Save it as ironic-patch.yaml.

Ironic kustomization

Time to tie it all together by creating a kustomization.yaml. At this point you should have a file structure like this:

ironic/
├── ironic-auth-config
├── ironic-htpasswd
├── ironic-inspector-auth-config
├── ironic-inspector-htpasswd
├── ironic-patch.yaml
├── ironic_bmo.env
└── kustomization.yaml

Here is a commented kustomization.yaml. Check the IP addresses carefully, as they will differ between environments.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: baremetal-operator-system
# These are the kustomizations we build on. You can download them and change the URLs to relative
# paths if you do not want to access them over the network.
# Note that the ref=v0.5.1 specifies the version to use.
resources:
- https://github.com/metal3-io/baremetal-operator/config/namespace?ref=v0.5.1
- https://github.com/metal3-io/baremetal-operator/ironic-deployment/base?ref=v0.5.1
# The kustomize components configure basic-auth and TLS
components:
- https://github.com/metal3-io/baremetal-operator/ironic-deployment/components/basic-auth?ref=v0.5.1
- https://github.com/metal3-io/baremetal-operator/ironic-deployment/components/tls?ref=v0.5.1
images:
- name: quay.io/metal3-io/ironic
  newTag: v24.0.0
# Create a ConfigMap from ironic_bmo.env and call it ironic-bmo-configmap.
# This ConfigMap will be used to set environment variables for the containers.
configMapGenerator:
- envs:
  - ironic_bmo.env
  name: ironic-bmo-configmap
  behavior: create

patches:
# Patch for removing dnsmasq
- path: ironic-patch.yaml
# The TLS component adds certificates but it cannot know the exact IPs of our environment.
# Here we patch the certificates to have the correct IPs.
# - 192.168.1.7: management computer IP in out of band network
# - 172.18.0.2: kind cluster node IP. This is what Ironic will see attached to the interface
#   and use to communicate with Inspector.
# - 192.168.0.150: management computer IP in the other network
- patch: |-
    - op: replace
      path: /spec/ipAddresses/0
      value: 192.168.1.7
    - op: add
      path: /spec/ipAddresses/-
      value: 172.18.0.2
    - op: add
      path: /spec/ipAddresses/-
      value: 192.168.0.150
  # The same patch in the virtualized environment looks like this:
  # - op: replace
  #   path: /spec/ipAddresses/0
  #   value: 192.168.222.1
  # - op: add
  #   path: /spec/ipAddresses/-
  #   value: 172.18.0.2
  target:
    kind: Certificate
    name: ironic-cert|ironic-inspector-cert
# The CA certificate should not have any IP address so we remove it.
- patch: |-
    - op: remove
      path: /spec/ipAddresses
  target:
    kind: Certificate
    name: ironic-cacert
# Create secrets from the authentication configuration.
# These will be mounted or used for environment variables.
# See the basic-auth component for more details on how they are used.
secretGenerator:
- name: ironic-htpasswd
  behavior: create
  envs:
  - ironic-htpasswd
- name: ironic-inspector-htpasswd
  behavior: create
  envs:
  - ironic-inspector-htpasswd
- name: ironic-auth-config
  files:
  - auth-config=ironic-auth-config
- name: ironic-inspector-auth-config
  files:
  - auth-config=ironic-inspector-auth-config

You can check that it works and inspect the resulting manifest by running this:

kubectl create -k ironic --dry-run=client -o yaml

When you are happy with the output, apply it in the cluster:

kubectl apply -k ironic
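
Ironic takes a while to start, and the following steps depend on it. The sketch below waits until all Deployments in the namespace report Available. It assumes the baremetal-operator-system namespace from the kustomization above. The helper name wait_for_deployments and the 10 minute default timeout are arbitrary choices for illustration.

```shell
# Sketch: block until all Deployments in a namespace are Available.
# wait_for_deployments is a hypothetical helper name.
# Arguments: [namespace] [timeout]
wait_for_deployments() {
  local namespace="${1:-baremetal-operator-system}" timeout="${2:-10m}"
  kubectl --namespace "$namespace" wait deployment --all \
    --for=condition=Available --timeout="$timeout"
}

# Usage:
#   wait_for_deployments
#   wait_for_deployments baremetal-operator-system 5m
```

The same helper can be reused after deploying Bare Metal Operator, since it runs in the same namespace.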

Deploy Bare Metal Operator

Similar to Ironic, we will create a kustomization for deploying Bare Metal Operator. It will include credentials for accessing Ironic. Start by creating a folder for the kustomization:

mkdir bmo

Create files containing the credentials for Ironic and Inspector:

  • ironic-username
  • ironic-password
  • ironic-inspector-username
  • ironic-inspector-password

We will use kustomize to create secrets from these that Bare Metal Operator can use to access Ironic.

Next, create a file for environment variables. We will call it ironic.env. The content looks like this for the baremetal lab:

DEPLOY_KERNEL_URL=http://192.168.0.150:6180/images/ironic-python-agent.kernel
DEPLOY_RAMDISK_URL=http://192.168.0.150:6180/images/ironic-python-agent.initramfs
IRONIC_ENDPOINT=https://192.168.0.150:6385/v1/

The IP address is that of the management computer. The same in the virtualized environment looks like this:

DEPLOY_KERNEL_URL=http://192.168.222.1:6180/images/ironic-python-agent.kernel
DEPLOY_RAMDISK_URL=http://192.168.222.1:6180/images/ironic-python-agent.initramfs
IRONIC_ENDPOINT=https://192.168.222.1:6385/v1/

Finally, create the kustomization.yaml with this content:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: baremetal-operator-system
# This is the kustomization that we build on. You can download it and change
# the URL to a relative path if you do not want to access it over the network.
# Note that the ref=v0.5.1 specifies the version to use.
resources:
- https://github.com/metal3-io/baremetal-operator/config/overlays/basic-auth_tls?ref=v0.5.1
images:
- name: quay.io/metal3-io/baremetal-operator
  newTag: v0.5.1
# Create a ConfigMap from ironic.env and name it ironic.
configMapGenerator:
- name: ironic
  behavior: create
  envs:
  - ironic.env

# We cannot use suffix hashes since the kustomizations we build on
# cannot be aware of what suffixes we add.
generatorOptions:
  disableNameSuffixHash: true
# Create secrets with the credentials for accessing Ironic.
secretGenerator:
- name: ironic-credentials
  files:
  - username=ironic-username
  - password=ironic-password
- name: ironic-inspector-credentials
  files:
  - username=ironic-inspector-username
  - password=ironic-inspector-password

At this point, you should have a folder structure like this:

bmo/
├── ironic-password
├── ironic-username
├── ironic-inspector-username
├── ironic-inspector-password
├── ironic.env
└── kustomization.yaml

You can check that the kustomization works and inspect the resulting manifest by running this:

kubectl create -k bmo --dry-run=client -o yaml

When you are happy with the output, apply it in the cluster:

kubectl apply -k bmo

Deployment summary

You are not expected to go through all the above steps each time you want to deploy Metal3. Store the configuration and reuse it the next time.

Here is a summary of the deploy steps when all configuration is already in place.

  1. Create the management cluster.

    kind create cluster --config kind.yaml
    
  2. Deploy cert-manager.

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.3/cert-manager.yaml
    
  3. Start the DHCP server.

    docker run --name dnsmasq --rm -d --net=host --privileged --user 997:994 \
      --env-file dnsmasq.env --entrypoint /bin/rundnsmasq \
      quay.io/metal3-io/ironic
    
  4. Start the image server.

    docker run --name image-server --rm -d -p 80:8080 \
      -v "$(pwd)/disk-images:/usr/share/nginx/html" nginxinc/nginx-unprivileged
    
  5. Deploy Ironic.

    kubectl apply -k ironic
    
  6. Deploy Bare Metal Operator.

    kubectl apply -k bmo
    

Create BareMetalHosts

Now that we have Bare Metal Operator deployed, let’s put it to use by creating BareMetalHosts (BMHs) to represent our servers. You will need the protocol and IPs of the BMCs, as well as credentials for accessing them, and the servers’ MAC addresses.

Create one secret for each BareMetalHost, containing the credentials for accessing its BMC. No credentials are needed in the virtualized setup but you still need to create the secret with some values. Here is an example:

apiVersion: v1
kind: Secret
metadata:
  name: bml-01
type: Opaque
stringData:
  username: replaceme
  password: replaceme

Then continue by creating the BareMetalHost manifest. You can put it in the same file as the secret if you want. Just remember to separate the two resources with one line containing ---.

Here is an example of a BareMetalHost referencing the secret above with MAC address and BMC address matching our bml-01 server (see supported hardware for information on BMC addressing).

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: bml-01
spec:
  online: true
  bootMACAddress: 80:c1:6e:7a:e8:10
  # This particular hardware does not support UEFI so we use legacy
  bootMode: legacy
  bmc:
    address: ilo4-virtualmedia://192.168.1.13
    credentialsName: bml-01
    disableCertificateVerification: true

Here is the same for the virtualized BareMetalHost:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: bml-vm-01
spec:
  online: true
  bootMACAddress: 00:60:2f:31:81:01
  bootMode: UEFI # use 'legacy' for Scenario 2
  hardwareProfile: libvirt
  bmc:
    address: redfish-virtualmedia+http://192.168.222.1:8000/redfish/v1/Systems/bmh-vm-01
    credentialsName: bml-01

Apply these in the cluster with kubectl apply -f path/to/file.

You should now be able to see the BareMetalHost go through registering and inspecting phases before it finally becomes available. Check with kubectl get bmh. The output should look similar to this:

NAME      STATE         CONSUMER   ONLINE   ERROR   AGE
bml-01    available                true             26m

(Scenario 1) Provision BareMetalHosts

If you want to manage the BareMetalHosts directly, keep reading. If you would rather use Cluster API to make Kubernetes clusters out of them, skip to the next section.

Edit the BareMetalHost to add details of what image you want to provision it with. For example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: bml-01
spec:
  online: true
  bootMACAddress: 80:c1:6e:7a:e8:10
  bootMode: legacy
  bmc:
    address: ilo4-virtualmedia://192.168.1.13
    credentialsName: bml-01
    disableCertificateVerification: true
  image:
    checksumType: sha256
    checksum: http://192.168.0.150/SHA256SUMS
    format: qcow2
    url: http://192.168.0.150/jammy-server-cloudimg-amd64.img

Note that the URL for the disk image does not use the out of band network. During image provisioning, the Ironic Python Agent is first booted on the machine. From there (i.e. not in the out of band network), it downloads the disk image and writes it to disk. If the machine has several disks and you want to specify which one to use, set rootDeviceHints (otherwise, /dev/sda is used by default).
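
As an illustration of rootDeviceHints, the sketch below builds a JSON merge patch that selects a disk by device name. The device /dev/sdb and the helper name root_device_patch are assumptions for this example. The hint should be in place before the image is added, since it is evaluated when the host is provisioned.

```shell
# Sketch: print a JSON merge patch that sets spec.rootDeviceHints.deviceName.
# root_device_patch is a hypothetical helper name.
root_device_patch() {
  printf '{"spec":{"rootDeviceHints":{"deviceName":"%s"}}}\n' "$1"
}

# Usage (the device name is an example, adjust it to your hardware):
#   kubectl patch bmh bml-01 --type merge -p "$(root_device_patch /dev/sdb)"
```

The same field can of course be written directly into the BareMetalHost manifest under spec instead of patching.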

The manifest above is enough to provision the BareMetalHost, but unless you have everything you need already baked in the disk image, you will most likely want to add some user-data and network-data. We will show here how to configure authorized ssh keys using user-data (see instance customization for more details).

First, we create a file (user-data.yaml) with the user-data:

#cloud-config
users:
- name: user
  ssh_authorized_keys:
  - ssh-ed25519 ABCD... user@example.com

Then create a secret from it.

kubectl create secret generic user-data --from-file=value=user-data.yaml --from-literal=format=cloud-config

Add the following to the BareMetalHost manifest to make it use the user-data:

spec:
  ...
  userData:
    name: user-data
    namespace: default

Apply the changes with kubectl apply -f path/to/file. You should now see the BareMetalHost go into provisioning and eventually become provisioned.

NAME      STATE         CONSUMER   ONLINE   ERROR   AGE
bml-01    provisioned              true             2h

You can now check the logs of the DHCP server to see what IP the BareMetalHost got (docker logs dnsmasq) and try to ssh to it.

(Scenario 2) Metal3 and Cluster API

If you want to turn the BareMetalHosts into Kubernetes clusters, you should consider using Cluster API and the infrastructure provider for Metal3. In this section we will show how to do it.

Initialize the Cluster API core components and the infrastructure provider for Metal3:

clusterctl init --infrastructure metal3

Now we need to set some environment variables that will be used to render the manifests from the cluster template. Most of them are related to the disk image that we downloaded above.

Note: There are many ways to configure and expose the API endpoint of the cluster. You need to decide how to do it. It will not “just work”. Here are some options:

  1. Configure a specific IP for the control-plane server through the DHCP server. This doesn’t require anything extra, but it is also very limited. You will not be able to upgrade the cluster, for example.
  2. Set up a load balancer separately and use that as API endpoint.
  3. Use keepalived or kube-vip or similar to assign a VIP to one of the control-plane nodes.

export IMAGE_CHECKSUM="ab54897a1bcae83581512cdeeda787f009846cfd7a63b298e472c1bd6c522d23"
export IMAGE_CHECKSUM_TYPE="sha256"
export IMAGE_FORMAT="qcow2"
# Set IMAGE_URL for your environment (keep only one of the two lines below)
# Baremetal lab IMAGE_URL
export IMAGE_URL="http://192.168.0.150/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2"
# Virtualized setup IMAGE_URL
# export IMAGE_URL="http://192.168.222.1/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2"
export KUBERNETES_VERSION="v1.29.0"
# Make sure this does not conflict with other networks
export POD_CIDR='["192.168.10.0/24"]'
# These can be used to add user-data
export CTLPLANE_KUBEADM_EXTRA_CONFIG="
    users:
    - name: user
      sshAuthorizedKeys:
      - ssh-ed25519 ABCD... user@example.com"
export WORKERS_KUBEADM_EXTRA_CONFIG="
      users:
      - name: user
        sshAuthorizedKeys:
        - ssh-ed25519 ABCD... user@example.com"
# NOTE! You must ensure that this is forwarded or assigned somehow to the
# server(s) selected for the control-plane.
export CLUSTER_APIENDPOINT_HOST="192.168.0.101"
export CLUSTER_APIENDPOINT_PORT="6443"
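
A missing variable makes template rendering fail, so it can help to check the environment first. The following is a sketch that reports which of the variables listed above are unset or empty. The helper name missing_cluster_vars is hypothetical, and the list only covers the required variables shown in this guide, not the optional user-data ones.

```shell
# Sketch: print the names of cluster template variables that are unset or empty.
# missing_cluster_vars is a hypothetical helper name.
# Returns 0 when nothing is missing, 1 otherwise.
missing_cluster_vars() {
  local name value missing=""
  for name in IMAGE_CHECKSUM IMAGE_CHECKSUM_TYPE IMAGE_FORMAT IMAGE_URL \
      KUBERNETES_VERSION POD_CIDR CLUSTER_APIENDPOINT_HOST CLUSTER_APIENDPOINT_PORT; do
    # Look up the variable whose name is stored in $name.
    # The names come from the fixed list above, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space before printing
  echo "${missing# }"
  [ -z "$missing" ]
}

# Usage:
#   missing_cluster_vars || echo "set the variables listed above first"
```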

With the variables in place, we can render the manifests and apply:

clusterctl generate cluster my-cluster --control-plane-machine-count 1 --worker-machine-count 0 | kubectl apply -f -

You should see BareMetalHosts be provisioned as they are “consumed” by the Metal3Machines:

NAME      STATE         CONSUMER                        ONLINE   ERROR   AGE
bml-02    provisioned   my-cluster-controlplane-8z46n   true             68m

If all goes well and the API endpoint is correctly configured, you should eventually see a healthy cluster. Check with clusterctl describe cluster my-cluster:

NAME                                                READY  SEVERITY  REASON  SINCE  MESSAGE
Cluster/my-cluster                                  True                     76s
├─ClusterInfrastructure - Metal3Cluster/my-cluster  True                     15m
└─ControlPlane - KubeadmControlPlane/my-cluster     True                     76s
  └─Machine/my-cluster-cj5zt                        True                     76s
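
To talk to the new workload cluster you need its kubeconfig, which clusterctl can fetch from the management cluster. This is a minimal sketch; the helper name fetch_workload_kubeconfig and the output file name are assumptions for this example.

```shell
# Sketch: save the kubeconfig of a workload cluster and print the file name.
# fetch_workload_kubeconfig is a hypothetical helper name.
fetch_workload_kubeconfig() {
  local cluster="${1:-my-cluster}"
  clusterctl get kubeconfig "$cluster" > "${cluster}.kubeconfig"
  echo "${cluster}.kubeconfig"
}

# Usage:
#   kubectl --kubeconfig "$(fetch_workload_kubeconfig my-cluster)" get nodes
```

Nodes may report NotReady until a CNI plugin has been installed in the workload cluster.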

Cleanup

If you created a cluster using Cluster API, delete that first:

kubectl delete cluster my-cluster

Delete all BareMetalHosts with kubectl delete bmh <name>. This ensures that the servers are cleaned and powered off.

Delete the management cluster.

kind delete cluster

Stop DHCP and image servers. They are automatically removed when stopped.

docker stop dnsmasq
docker stop image-server

If you did the virtualized setup, you will also need to clean up the sushy-tools container and the VM.

docker stop sushy-tools

virsh -c qemu:///system destroy --domain bmh-vm-01
virsh -c qemu:///system undefine --domain bmh-vm-01 --remove-all-storage --nvram

virsh -c qemu:///system net-destroy baremetal
virsh -c qemu:///system net-undefine baremetal

Baremetal provisioning

This is a guide to provisioning bare metal servers using the Metal³ project. It is a generic guide with a basic implementation; different hardware may require different configuration.

In this guide we will use minikube as the management cluster.

All commands are executed on the host where minikube is set up.

This is a separate machine, e.g. your laptop or one of the servers, that has access to the network where the servers are in order to provision them.

Install requirements on the host

Install the following requirements on the host:

  • Python
  • Golang
  • Docker for Ubuntu and Podman for CentOS
  • Ansible

See Install Ironic for other requirements.

Configure host

  • Create network settings. We create two bridge interfaces: provisioning and external. The provisioning interface is used by Ironic to provision the BareMetalHosts, and the external interface allows them to communicate with each other and connect to the internet.

    # Create a veth interface pair.
    sudo ip link add ironicendpoint type veth peer name ironic-peer
    
    # Create provisioning bridge.
    sudo brctl addbr provisioning
    
    sudo ip addr add dev ironicendpoint 172.22.0.1/24
    sudo brctl addif provisioning ironic-peer
    sudo ip link set ironicendpoint up
    sudo ip link set ironic-peer up
    
    # Create the external bridge
    sudo brctl addbr external
    
    sudo ip addr add dev external 192.168.111.1/24
    sudo ip link set external up
    
    # Add udp forwarding to firewall, this allows to use ipmitool (port 623)
    # as well as allowing TFTP traffic outside the host (random port)
    iptables -A FORWARD -p udp -j ACCEPT
    
    # Add interface to provisioning bridge
    brctl addif provisioning eno1
    
    # Set VLAN interface to be up
    ip link set up dev bmext
    
    # Check if the bmext interface is added to the bridge
    brctl show baremetal | grep bmext
    
    # Add bmext to the baremetal bridge
    brctl addif baremetal bmext
    

Prepare image cache

  • Start the httpd container. This is used to host the OS images that the BareMetalHosts will be provisioned with.

    sudo docker run -d --net host --privileged --name httpd-infra -v /opt/metal3-dev-env/ironic:/shared --entrypoint /bin/runhttpd --env
    

    Download the node image and put it in the folder where the httpd container can host it.

    wget -O /opt/metal3-dev-env/ironic/html/images https://artifactory.nordix.org/artifactory/metal3/images/k8s_v1.27.1
    

    Convert the qcow2 image to raw format and get the hash of the raw image

    # Change IMAGE_NAME and IMAGE_RAW_NAME according to what you download from artifactory
    cd /opt/metal3-dev-env/ironic/html/images
    IMAGE_NAME="CENTOS_9_NODE_IMAGE_K8S_v1.27.1.qcow2"
    IMAGE_RAW_NAME="CENTOS_9_NODE_IMAGE_K8S_v1.27.1-raw.img"
    qemu-img convert -O raw "${IMAGE_NAME}" "${IMAGE_RAW_NAME}"
    
    # Create sha256 hash
    sha256sum "${IMAGE_RAW_NAME}" | awk '{print $1}' > "${IMAGE_RAW_NAME}.sha256sum"
    

Launch management cluster using minikube

  • Create a minikube cluster to use as management cluster.

    minikube start
    
    # Configuring ironicendpoint with minikube
    minikube ssh sudo brctl addbr ironicendpoint
    minikube ssh sudo ip link set ironicendpoint up
    minikube ssh sudo brctl addif ironicendpoint eth2
    minikube ssh sudo ip addr add 172.22.0.9/24 dev ironicendpoint
    
  • Initialize Cluster API and the Metal3 provider.

    kubectl create namespace metal3
    clusterctl init --core cluster-api --bootstrap kubeadm --control-plane kubeadm --infrastructure metal3
    # NOTE: In clusterctl init you can pin the version of a provider like this: 'cluster-api:v1.7.4'.
    # If no version is given, the latest stable release is used by default.
    

Install provisioning components

  • Export the necessary variables for the Bare Metal Operator and Ironic deployment.

    # The URL of the kernel to deploy.
    export DEPLOY_KERNEL_URL="http://172.22.0.1:6180/images/ironic-python-agent.kernel"
    
    # The URL of the ramdisk to deploy.
    export DEPLOY_RAMDISK_URL="http://172.22.0.1:6180/images/ironic-python-agent.initramfs"
    
    # The URL of the Ironic endpoint.
    export IRONIC_URL="http://172.22.0.1:6385/v1/"
    
    # The URL of the Ironic inspector endpoint - only before BMO 0.5.0.
    #export IRONIC_INSPECTOR_URL="http://172.22.0.1:5050/v1/"
    
    # Do not use a dedicated CA certificate for Ironic API.
    # Any value provided in this variable disables additional CA certificate validation.
    # To provide a CA certificate, leave this variable unset.
    # If unset, then IRONIC_CA_CERT_B64 must be set.
    export IRONIC_NO_CA_CERT=true
    
    # Disables basic authentication for Ironic API.
    # Any value provided in this variable disables authentication.
    # To enable authentication, leave this variable unset.
    # If unset, then IRONIC_USERNAME and IRONIC_PASSWORD must be set.
    #export IRONIC_NO_BASIC_AUTH=true
    
    # Disables basic authentication for Ironic inspector API (when used).
    # Any value provided in this variable disables authentication.
    # To enable authentication, leave this variable unset.
    # If unset, then IRONIC_INSPECTOR_USERNAME and IRONIC_INSPECTOR_PASSWORD must be set.
    #export IRONIC_INSPECTOR_NO_BASIC_AUTH=true
    
  • Launch baremetal operator.

    # Clone BMO repo
    git clone https://github.com/metal3-io/baremetal-operator.git
    # Run deploy.sh
    ./baremetal-operator/tools/deploy.sh -b -k -t
    
  • Launch Ironic.

    # Run deploy.sh
    ./baremetal-operator/tools/deploy.sh -i -k -t
    

Create Secrets and BareMetalHosts

Create yaml files for each BareMetalHost that will be used. Below is an example.

---
apiVersion: v1
kind: Secret
metadata:
  name: <<secret_name_bmh1>>
type: Opaque
data:
  username: <<username_bmh1>>
  password: <<password_bmh1>>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: <<id_bmh1>>
spec:
  online: true
  bootMACAddress: <<mac_address_bmh1>>
  bootMode: legacy
  bmc:
    address: <<address_bmh1>> # the format depends on the BMC protocol, which in turn depends on the hardware vendor
    credentialsName: <<secret_name_bmh1>>
    disableCertificateVerification: true
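
The data field of a Kubernetes Secret holds base64-encoded values, so the username and password placeholders above have to be encoded before the manifest is applied. The following is a minimal sketch, not part of the Metal³ tooling: the helper names b64 and render_bmc_secret, the Secret name and the sample credentials are all hypothetical.

```shell
#!/bin/sh
# Encode a value the way the `data` field of a Secret expects it.
# `tr` strips the line wrapping that some base64 implementations add.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# render_bmc_secret NAME USERNAME PASSWORD
# Prints a Secret manifest with encoded credentials to stdout.
render_bmc_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
type: Opaque
data:
  username: $(b64 "$2")
  password: $(b64 "$3")
EOF
}

# Hypothetical values, for illustration only.
render_bmc_secret bmh1-bmc-secret admin password
```

The output can be appended to the same yaml file as the BareMetalHost, separated by a --- line.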

Apply the manifests.

kubectl apply -f ./bmh1.yaml -n metal3

At this point, the BareMetalHosts will go through registering and inspection phases before they become available.

Wait for all of them to be available. You can check their status with kubectl get bmh -n metal3.
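
To see at a glance which hosts are still on their way, the table printed by kubectl get bmh can be filtered on its STATE column. This sketch assumes the column layout of a single-namespace listing (NAME first, STATE second); with -A an extra NAMESPACE column would shift the fields. The function name hosts_not_in_state and the host names in the sample input are hypothetical.

```shell
#!/bin/sh
# Print the names of hosts whose STATE column differs from the wanted state.
# Reads `kubectl get bmh` style output on stdin; the first line is the header.
hosts_not_in_state() {
  awk -v want="$1" 'NR > 1 && $2 != want { print $1 }'
}

# Sample input. Against a real cluster, pipe in `kubectl get bmh -n metal3`.
hosts_not_in_state available <<'EOF'
NAME     STATE         CONSUMER   ONLINE   ERROR   AGE
host-a   available                true             12m
host-b   inspecting               true             12m
host-c   registering              true             3m
EOF
```

An empty result means every listed host has reached the available state.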

The next step is to create a workload cluster from these BareMetalHosts.

Create and apply cluster, controlplane and worker template

#API endpoint IP and port for target cluster
export CLUSTER_APIENDPOINT_HOST="192.168.111.249"
export CLUSTER_APIENDPOINT_PORT="6443"

# Export the node image variable and the node image hash variable that we created before.
# Change the name according to what was downloaded from artifactory.
export IMAGE_URL=http://172.22.0.1/images/CENTOS_9_NODE_IMAGE_K8S_v1.27.1-raw.img
export IMAGE_CHECKSUM=http://172.22.0.1/images/CENTOS_9_NODE_IMAGE_K8S_v1.27.1-raw.img.sha256sum
export IMAGE_CHECKSUM_TYPE=sha256
export IMAGE_FORMAT=raw

# Generate templates with clusterctl, change control plane and worker count according to
# the number of BareMetalHosts
clusterctl generate cluster capm3-cluster \
  --kubernetes-version v1.27.0 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > capm3-cluster-template.yaml

# Apply the template
kubectl apply -f capm3-cluster-template.yaml

Bare Metal Operator

The Bare Metal Operator (BMO) is a Kubernetes controller that manages bare-metal hosts, represented in Kubernetes by BareMetalHost (BMH) custom resources.

BMO is responsible for the following operations:

  • Inspecting the host’s hardware and reporting the details on the corresponding BareMetalHost. This includes information about CPUs, RAM, disks, NICs, and more.
  • Optionally preparing the host by configuring RAID, changing firmware settings or updating the system and/or BMC firmware.
  • Provisioning the host with a desired image.
  • Cleaning the host’s disk contents before and after provisioning.

Under the hood, BMO uses Ironic to conduct these actions.

Enrolling BareMetalHosts

To enroll a bare-metal machine as a BareMetalHost, you need to know at least the following properties:

  1. The IP address and credentials of the BMC - the remote management controller of the host.
  2. The protocol that the BMC understands. Most common are IPMI and Redfish. See supported hardware for more details.
  3. Boot technology that can be used with the host and the chosen protocol. Most hardware can use network booting, but some Redfish implementations also support virtual media (CD) boot.
  4. MAC address that is used for booting. Important: it’s a MAC address of an actual NIC of the host, not the BMC MAC address.
  5. The desired boot mode: UEFI or legacy BIOS. UEFI is the default and should be used unless there are serious reasons not to.

This is a minimal example of a valid BareMetalHost:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
spec:
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: node-0-bmc-secret
  bootMACAddress: 00:5a:91:3f:9a:bd
  online: true

When this resource is created, it will undergo inspection that will populate more fields as part of the status.

Deploying BareMetalHosts

To provision a bare-metal machine, you will need a few more properties:

  1. The URL and checksum of the image. Images should be in QCOW2 or raw format. It is common to use various cloud images with BMO, e.g. Ubuntu or CentOS. Important: not all images are compatible with UEFI boot - check their description.
  2. Optionally, user data: a secret with a configuration or a script that is interpreted by the first-boot service embedded in your image. The most common service is cloud-init; some distributions use Ignition.
  3. Optionally, network data: a secret with the network configuration that is interpreted by the first-boot service. In some cases, the network data is embedded in the user data instead.

Here is a complete example of a host that will be provisioned with a CentOS 9 image:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
spec:
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: node-0-bmc-secret
  bootMACAddress: 00:5a:91:3f:9a:bd
  image:
    checksum: http://172.22.0.1/images/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2.sha256sum
    url: http://172.22.0.1/images/CENTOS_9_NODE_IMAGE_K8S_v1.29.0.qcow2
  networkData:
    name: test1-workers-tbwnz-networkdata
    namespace: metal3
  online: true
  userData:
    name: test1-workers-vd4gj
    namespace: metal3
status:
  hardware:
    cpu:
      arch: x86_64
      count: 2
    hostname: node-0
    nics:
    - ip: 172.22.0.73
      mac: 00:5a:91:3f:9a:bd
      name: enp1s0
    ramMebibytes: 4096
    storage:
    - hctl: "0:0:0:0"
      name: /dev/sda
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 53687091200
      type: HDD

Integration with Cluster API

CAPM3 is the Metal3 component that is responsible for integration between Cluster API resources and BareMetalHosts. When using Metal3 with CAPM3, you will enroll BareMetalHosts as described above first, then use Metal3MachineTemplate to describe how hosts should be deployed, i.e. which images and user data to use.

This happens for example when the user scales a MachineDeployment so that the server should be added to the cluster, or during an upgrade when it must change the image it is booting from:

[Figure: ipa-provisioning]

Install Baremetal Operator

Installing Baremetal Operator (BMO) usually involves three steps:

  1. Clone the Metal3 BMO repository https://github.com/metal3-io/baremetal-operator.git.
  2. Adapt the configuration settings to your specific needs.
  3. Deploy BMO in the cluster with or without Ironic.

Note: This guide assumes that a local clone of the repository is available.

Configuration Settings

Review and edit the file ironic.env found in config/default. The operator supports several configuration options for controlling its interaction with Ironic.

DEPLOY_RAMDISK_URL – The URL for the ramdisk of the image containing the Ironic agent.

DEPLOY_KERNEL_URL – The URL for the kernel to go with the deploy ramdisk.

DEPLOY_ISO_URL – The URL for the ISO containing the Ironic agent for drivers that support ISO boot. Optional if kernel/ramdisk are set.

IRONIC_ENDPOINT – The URL for the operator to use when talking to Ironic.

IRONIC_CACERT_FILE – The path of the CA certificate file of Ironic, if needed.

IRONIC_INSECURE – (“True”, “False”) Whether to skip the Ironic certificate validation. It is highly recommended not to set it to True.

IRONIC_CLIENT_CERT_FILE – The path of the Client certificate file of Ironic, if needed. Both Client certificate and Client private key must be defined for client certificate authentication (mTLS) to be enabled.

IRONIC_CLIENT_PRIVATE_KEY_FILE – The path of the Client private key file of Ironic, if needed. Both Client certificate and Client private key must be defined for client certificate authentication (mTLS) to be enabled.

IRONIC_SKIP_CLIENT_SAN_VERIFY – (“True”, “False”) Whether to skip the ironic client certificate SAN validation.

BMO_CONCURRENCY – The number of concurrent reconciles performed by the Operator. Default is the number of CPUs, but no less than 2 and no more than 8.

PROVISIONING_LIMIT – The desired maximum number of hosts that could be (de)provisioned simultaneously by the Operator. The limit does not apply to hosts that use virtual media for provisioning. The Operator will try to enforce this limit, but overflows could happen in case of slow provisioners and / or higher number of concurrent reconciles. For such reasons, it is highly recommended to keep BMO_CONCURRENCY value lower than the requested PROVISIONING_LIMIT. Default is 20.

IRONIC_EXTERNAL_URL_V6 – This is the URL where Ironic will find the image for nodes that use IPv6. In dual stack environments, this can be used to tell Ironic which IP version it should set on the BMC.
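
The default for BMO_CONCURRENCY described above amounts to clamping the CPU count to the range 2 to 8. The sketch below only illustrates that arithmetic, together with the recommendation to keep the value below PROVISIONING_LIMIT. It is not the operator's code, and both function names are hypothetical.

```shell
#!/bin/sh
# Clamp a CPU count to the documented default range for BMO_CONCURRENCY (2..8).
default_bmo_concurrency() {
  cpus="$1"
  if [ "$cpus" -lt 2 ]; then
    echo 2
  elif [ "$cpus" -gt 8 ]; then
    echo 8
  else
    echo "$cpus"
  fi
}

# Succeed only if the concurrency (first argument) is lower than the
# provisioning limit (second argument).
concurrency_below_limit() {
  [ "$1" -lt "$2" ]
}

default_bmo_concurrency 1    # prints 2
default_bmo_concurrency 4    # prints 4
default_bmo_concurrency 32   # prints 8
concurrency_below_limit 8 20 && echo "8 is below the default limit of 20"
```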

Deprecated options

IRONIC_INSPECTOR_ENDPOINT – The URL for the operator to use when talking to Ironic Inspector. Only supported before baremetal-operator 0.5.0.

Kustomization Configuration

It is possible to deploy baremetal-operator with three different operator configurations, namely:

  1. operator with ironic
  2. operator without ironic
  3. ironic without operator

A detailed overview of the configuration is presented in the following sections.

Notes on external Ironic

When an external Ironic is used, the following requirements must be met:

  • Either HTTP basic or no-auth authentication must be used (Keystone is not supported).

  • API version 1.74 (Xena release cycle) or newer must be available.

Authenticating to Ironic

Because hosts under the control of Metal³ need to contact the Ironic API during inspection and provisioning, it is highly advisable to require authentication on those APIs, since the provisioned hosts running user workloads will remain connected to the provisioning network.

Configuration

The baremetal-operator supports connecting to Ironic with the following auth_strategy modes:

  • noauth (no authentication)
  • http_basic (HTTP Basic access authentication)

Note that Keystone (OpenStack Identity) authentication methods are not yet supported.

Authentication configuration is read from the filesystem, beginning at the root directory specified in the environment variable METAL3_AUTH_ROOT_DIR. If this variable is empty or not specified, the default is /opt/metal3/auth.

Within the root directory, there is a separate subdirectory ironic for Ironic client configuration.

noauth

This is the default, and will be chosen if the auth root directory does not exist. In this mode, the baremetal-operator does not attempt to do any authentication against the Ironic APIs.

http_basic

This mode is configured by files in each authentication subdirectory named username and password, and containing the Basic auth username and password, respectively.
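
As an illustration of the layout described above, the following sketch writes http_basic credentials for the Ironic client under an auth root directory. The function name is hypothetical, and the demo writes to a temporary directory rather than /opt/metal3/auth. Whether a trailing newline in the files is tolerated is not stated here, so the sketch writes none.

```shell
#!/bin/sh
# write_ironic_basic_auth ROOT USERNAME PASSWORD
# Creates ROOT/ironic/username and ROOT/ironic/password.
write_ironic_basic_auth() {
  dir="$1/ironic"
  mkdir -p "$dir" || return 1
  printf '%s' "$2" > "$dir/username"
  printf '%s' "$3" > "$dir/password"
  # Keep the credentials readable by the owner only.
  chmod 600 "$dir/username" "$dir/password"
}

# Demo in a scratch directory. In a deployment, ROOT would be the value of
# METAL3_AUTH_ROOT_DIR (default /opt/metal3/auth).
demo_root="$(mktemp -d)"
write_ironic_basic_auth "$demo_root" admin example-password
ls "$demo_root/ironic"
rm -rf "$demo_root"
```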

Running Bare Metal Operator with or without Ironic

This section explains the scenarios for deploying the Bare Metal Operator (BMO) with or without Ironic, as well as deploying only Ironic.

These are the deployment use cases addressed:

  1. Deploying baremetal-operator with Ironic.

  2. Deploying baremetal-operator without Ironic.

  3. Deploying only Ironic.

Current structure of baremetal-operator config directory

tree config/
config/
├── basic-auth
│   ├── default
│   │   ├── credentials_patch.yaml
│   │   └── kustomization.yaml
│   └── tls
│       ├── credentials_patch.yaml
│       └── kustomization.yaml
├── certmanager
│   ├── certificate.yaml
│   ├── kustomization.yaml
│   └── kustomizeconfig.yaml
├── crd
│   ├── bases
│   │   ├── metal3.io_baremetalhosts.yaml
│   │   ├── metal3.io_firmwareschemas.yaml
│   │   └── metal3.io_hostfirmwaresettings.yaml
│   ├── kustomization.yaml
│   ├── kustomizeconfig.yaml
│   └── patches
│       ├── cainjection_in_baremetalhosts.yaml
│       ├── cainjection_in_firmwareschemas.yaml
│       ├── cainjection_in_hostfirmwaresettings.yaml
│       ├── webhook_in_baremetalhosts.yaml
│       ├── webhook_in_firmwareschemas.yaml
│       └── webhook_in_hostfirmwaresettings.yaml
├── default
│   ├── ironic.env
│   ├── kustomization.yaml
│   ├── manager_auth_proxy_patch.yaml
│   ├── manager_webhook_patch.yaml
│   └── webhookcainjection_patch.yaml
├── kustomization.yaml
├── manager
│   ├── kustomization.yaml
│   └── manager.yaml
├── namespace
│   ├── kustomization.yaml
│   └── namespace.yaml
├── OWNERS
├── prometheus
│   ├── kustomization.yaml
│   └── monitor.yaml
├── rbac
│   ├── auth_proxy_client_clusterrole.yaml
│   ├── auth_proxy_role_binding.yaml
│   ├── auth_proxy_role.yaml
│   ├── auth_proxy_service.yaml
│   ├── baremetalhost_editor_role.yaml
│   ├── baremetalhost_viewer_role.yaml
│   ├── firmwareschema_editor_role.yaml
│   ├── firmwareschema_viewer_role.yaml
│   ├── hostfirmwaresettings_editor_role.yaml
│   ├── hostfirmwaresettings_viewer_role.yaml
│   ├── kustomization.yaml
│   ├── leader_election_role_binding.yaml
│   ├── leader_election_role.yaml
│   ├── role_binding.yaml
│   └── role.yaml
├── render
│   └── capm3.yaml
├── samples
│   ├── metal3.io_v1alpha1_baremetalhost.yaml
│   ├── metal3.io_v1alpha1_firmwareschema.yaml
│   └── metal3.io_v1alpha1_hostfirmwaresettings.yaml
├── tls
│   ├── kustomization.yaml
│   └── tls_ca_patch.yaml
└── webhook
    ├── kustomization.yaml
    ├── kustomizeconfig.yaml
    ├── manifests.yaml
    └── service_patch.yaml

The config directory has one top-level folder for deployment, namely default, which deploys only baremetal-operator through a kustomization file that calls the manager folder. In addition, the basic-auth, certmanager, crd, namespace, prometheus, rbac, tls and webhook folders have their own kustomization and yaml files. The samples folder includes yaml representations of sample CRDs.

Current structure of ironic-deployment directory

tree ironic-deployment/
ironic-deployment/
├── base
│   ├── ironic.yaml
│   └── kustomization.yaml
├── components
│   ├── basic-auth
│   │   ├── auth.yaml
│   │   ├── ironic-auth-config
│   │   ├── ironic-auth-config-tpl
│   │   ├── ironic-htpasswd
│   │   └── kustomization.yaml
│   ├── keepalived
│   │   ├── ironic_bmo_configmap.env
│   │   ├── keepalived_patch.yaml
│   │   └── kustomization.yaml
│   └── tls
│       ├── certificate.yaml
│       ├── kustomization.yaml
│       ├── kustomizeconfig.yaml
│       └── tls.yaml
├── default
│   ├── ironic_bmo_configmap.env
│   └── kustomization.yaml
├── overlays
│   ├── basic-auth_tls
│   │   ├── basic-auth_tls.yaml
│   │   └── kustomization.yaml
│   └── basic-auth_tls_keepalived
│       └── kustomization.yaml
├── OWNERS
└── README.md

The ironic-deployment folder contains kustomizations for deploying Ironic. It makes use of kustomize components for basic auth, TLS and keepalived configurations. This makes it easy to combine the configurations, for example basic auth + TLS. There are some ready-made overlays in the overlays folder that show how this can be done. For more information, check the readme in the ironic-deployment folder.

Deployment commands

There is a useful deployment script that configures and deploys Bare Metal Operator and Ironic. It requires the following variables:

  • IRONIC_HOST : domain name for Ironic
  • IRONIC_HOST_IP : IP on which Ironic is listening

In addition you can configure the following variables. They are optional. If you leave them unset, then passwords and certificates will be generated for you.

  • KUBECTL_ARGS : Additional arguments to kubectl apply
  • IRONIC_USERNAME : username for ironic
  • IRONIC_PASSWORD : password for ironic
  • IRONIC_CACERT_FILE : CA certificate path for ironic
  • IRONIC_CAKEY_FILE : CA certificate key path, unneeded if ironic certificates exist
  • IRONIC_CERT_FILE : Ironic certificate path
  • IRONIC_KEY_FILE : Ironic certificate key path
  • MARIADB_KEY_FILE: Path to the key of MariaDB
  • MARIADB_CERT_FILE: Path to the cert of MariaDB
  • MARIADB_CAKEY_FILE: Path to the CA key of MariaDB
  • MARIADB_CACERT_FILE: Path to the CA certificate of MariaDB

Before version 0.5.0, Ironic Inspector parameters were also used:

  • IRONIC_INSPECTOR_USERNAME : username for inspector
  • IRONIC_INSPECTOR_PASSWORD : password for inspector
  • IRONIC_INSPECTOR_CERT_FILE : Inspector certificate path
  • IRONIC_INSPECTOR_KEY_FILE : Inspector certificate key path
  • IRONIC_INSPECTOR_CACERT_FILE : CA certificate path for inspector, defaults to IRONIC_CACERT_FILE
  • IRONIC_INSPECTOR_CAKEY_FILE : CA certificate key path, unneeded if inspector certificates exist

Then run :

./tools/deploy.sh [-b -i -t -n -k]
  • -b: deploy BMO
  • -i: deploy Ironic
  • -t: deploy with TLS enabled
  • -n: deploy without authentication
  • -k: deploy with keepalived

This will deploy BMO and / or Ironic with the proper configuration.
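
The three deployment use cases map onto the -b and -i flags. The helper below is a hypothetical convenience wrapper, not part of the repository. The scenario names are made up for this sketch, and it only prints the flags, so running it deploys nothing.

```shell
#!/bin/sh
# Print the deploy.sh flags for a scenario.
# Pass "tls" as the second argument to add -t.
deploy_flags() {
  case "$1" in
    bmo-with-ironic) flags="-b -i" ;;
    bmo-only)        flags="-b" ;;
    ironic-only)     flags="-i" ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
  if [ "${2:-}" = "tls" ]; then
    flags="$flags -t"
  fi
  echo "$flags"
}

deploy_flags bmo-with-ironic tls   # prints: -b -i -t
deploy_flags ironic-only           # prints: -i
```

From the root of the repository clone, ./tools/deploy.sh $(deploy_flags bmo-only tls) would then run the script with -b -t.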

Useful tips

The different configurations are useful in different situations. For example:

  1. Deploy only BMO when Ironic is already running, e.g. as part of Cluster API Provider Metal3 (CAPM3) after pivoting has succeeded and Ironic has been deployed.

  2. Deploy BMO and Ironic together when CAPM3 is not used and the baremetal-operator and Ironic containers should be deployed together.

  3. Deploy only Ironic when BMO is deployed as part of CAPM3 and only an Ironic setup is needed, e.g. when clusterctl, provided by Cluster API (CAPI), deploys BMO so that it can take care of moving the BareMetalHosts during pivoting.

Important note: when the baremetal-operator is deployed through metal3-dev-env, the baremetal-operator container inherits the following environment variables through a configmap:


$PROVISIONING_IP
$PROVISIONING_INTERFACE

In case you are deploying baremetal-operator locally, make sure to populate and export these environment variables before deploying.

Host State Machine

During its lifetime, a BareMetalHost resource goes through a series of various states. Some of them are stable (the host stays in them indefinitely without user input), some are transient (the state will change once a certain operation completes). These fields in the status resource define the current state of the host:

  • status.provisioning.state – the current phase of the provisioning process.
  • status.operationHistory – the history of the main provisioning phases: registration, inspection, provisioning and deprovisioning.
  • status.operationalStatus – the overall status of the host.
  • status.errorType – the type of the current error (if any).
  • status.poweredOn – the current power state of the host.

This is what the status of a healthy provisioned host may look like:

status:
  # ...
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2024-06-17T13:09:07Z"
      start: "2024-06-17T13:03:54Z"
    provision:
      end: "2024-06-17T13:11:18Z"
      start: "2024-06-17T13:09:26Z"
    register:
      end: "2024-06-17T13:03:54Z"
      start: "2024-06-17T12:54:18Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: e09032ea-1b7d-4c50-bfcd-b94ff7e8d431
    bootMode: UEFI
    image:
      checksumType: sha256
      checksum: http://192.168.0.150/SHA256SUMS
      format: qcow2
      url: http://192.168.0.150/jammy-server-cloudimg-amd64.img
    rootDeviceHints:
      deviceName: /dev/sda
    state: provisioned
  # ...

OperationalStatus

  • OK – the host is healthy and operational.
  • discovered – the host is known to Metal3 but lacks the required information for the normal operation (usually, the BMC credentials).
  • error – an error has occurred, see the status.errorType and status.errorMessage fields for details.
  • delayed – cannot proceed with the provisioning because the maximum number of the hosts in the given state has been reached.
  • detached – the host is detached, no provisioning actions are possible (see detached annotation for details).

Provisioning state machine

BaremetalHost provisioning state transitions

Provisioning states

Creating

Newly created hosts get an empty provisioning state briefly before moving either to unmanaged or registering.

Unmanaged

An unmanaged host is missing both the BMC address and credentials secret name, and does not have any information to access the BMC for registration.

The corresponding operational status is discovered.

Externally Provisioned

An externally provisioned host has been deployed using another tool. Hosts reach this state when they are created with the externallyProvisioned field set to true. Hosts in this state are monitored, and only their power status is managed.

Registering

The host will stay in the registering state while the BMC access details are being validated.

Inspecting

After the host is registered, an IPA ramdisk will be booted on it. The agent collects information about the available hardware components and sends it back to Metal3. The host will stay in the inspecting state until this process is completed.

Preparing

When setting up RAID or changing firmware settings, the host will be in preparing state.

Available

A host in the available state is ready to be provisioned. It will move to the provisioning state once the image field is populated.

Provisioning

While an image is being copied to the host, and the host is configured to run the image, the host will be in the provisioning state.

Provisioned

After an image is copied to the host and the host is running the image, it will be in the provisioned state.

Deprovisioning

When the previously provisioned image is being removed from the host, it will be in the deprovisioning state.

Powering off before delete

When a host that is not currently unmanaged is marked to be deleted, it will be powered off first and will stay in the powering off before delete state until that is done or until the retry limit is reached.

Deleting

When the host is marked to be deleted and has been successfully powered off, it will move from its current state to deleting, at which point the resource record is deleted.

Supported hardware

Metal3 supports many vendors and models of enterprise-grade hardware with a BMC (Baseboard Management Controller) that supports one of the remote management protocols described in this document. On top of that, one of the two boot methods must be supported:

  1. Network boot. Most hardware supports booting a Linux kernel and initramfs via TFTP. Metal3 augments it with iPXE - a higher level network boot firmware with support for scripting and TCP-based protocols such as HTTP.

    Booting over network relies on DHCP and thus requires a provisioning network for isolated L2 traffic between the Metal3 control plane and the machines.

  2. Virtual media boot. Some hardware models support directly booting an ISO 9660 image as a virtual CD device over HTTP(s). An important benefit of this approach is the ability to boot hardware over L3 networks, potentially without DHCP at all.

IPMI

IPMI is the oldest and by far the most widely available remote management protocol. Nearly all enterprise-grade hardware supports it. Its downsides include reduced reliability and weak security, especially if not configured properly.

WARNING: only network boot over iPXE is supported for IPMI.

BMC address format    Notes
ipmi://<host>:<port>  Port is optional, defaults to 623.
<host>:<port>         IPMI is the default protocol in Metal3.

Redfish and its variants

Redfish is a vendor-agnostic protocol for remote hardware management. It is based on HTTP(s) and JSON and thus does not suffer from the limitations of IPMI. It also exposes modern features such as virtual media boot, RAID management, firmware settings and updates.

Ironic (and thus Metal3) aims to support Redfish as closely to the standard as possible, with a few workarounds for known issues and explicit support for Dell iDRAC. Note, however, that all features are optional in Redfish, so you may encounter a Redfish-capable hardware that is not supported by Metal3. Furthermore, some features (such as virtual media boot) may require buying an additional license to function.

Since a Redfish API endpoint can manage several servers (systems in Redfish terminology), BMC addresses for Redfish-based drivers include a system ID - the URL of the particular server. For Dell machines it usually looks like /redfish/v1/Systems/System.Embedded.1, while other vendors may simply use /redfish/v1/Systems/1. Check the hardware documentation to find out which format is right for your machine.

Technology       Boot method    BMC address format                               Notes
Generic Redfish  iPXE           redfish://<host>:<port>/<systemID>
Generic Redfish  Virtual media  redfish-virtualmedia://<host>:<port>/<systemID>  Must not be used for Dell machines.
Dell iDRAC 8+    iPXE           idrac-redfish://<host>:<port>/<systemID>
Dell iDRAC 8+    Virtual media  idrac-virtualmedia://<host>:<port>/<systemID>    Requires firmware v6.10.30.00+ for iDRAC 9, v2.75.75.75+ for iDRAC 8.
HPE iLO 5 and 6  iPXE           ilo5-redfish://<host>:<port>/<systemID>          An alias of redfish for convenience. RAID management only on iLO 6.
HPE iLO 5 and 6  Virtual media  ilo5-virtualmedia://<host>:<port>/<systemID>     An alias of redfish for convenience. RAID management only on iLO 6.

Users have also reported success with certain models of SuperMicro, Lenovo, ZT Systems and Cisco UCS hardware, but hardware from these vendors is not regularly tested by the team.

All drivers based on Redfish allow optionally specifying the carrier protocol in the form of +http or +https, for example: redfish+http://... or idrac-virtualmedia+https. When not specified, HTTPS is used by default.
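
To make the address conventions above concrete, this sketch splits a BMC address into its driver name and, for Redfish-based drivers, the carrier protocol. It mirrors only the rules stated in this section (an address without a scheme means IPMI, a scheme without +http or +https means HTTPS). It is not how the operator parses addresses, and the function names and the bmc.example.com host are hypothetical.

```shell
#!/bin/sh
# Print the driver part of a BMC address, e.g. "redfish" for
# "redfish+http://...". An address without a scheme is treated as IPMI.
bmc_driver() {
  case "$1" in
    *://*) scheme="${1%%://*}" ;;
    *)     scheme="ipmi" ;;
  esac
  # Drop an optional "+http" or "+https" carrier suffix.
  echo "${scheme%%+*}"
}

# Print the carrier protocol of a Redfish-based address:
# "+http" selects http, anything else falls back to https.
bmc_carrier() {
  scheme="${1%%://*}"
  case "$scheme" in
    *+http) echo http ;;
    *)      echo https ;;
  esac
}

bmc_driver  "192.168.111.1:6230"                                   # prints ipmi
bmc_driver  "idrac-virtualmedia+https://bmc.example.com/redfish/v1/Systems/System.Embedded.1"  # prints idrac-virtualmedia
bmc_carrier "redfish+http://bmc.example.com/redfish/v1/Systems/1"  # prints http
bmc_carrier "redfish://bmc.example.com/redfish/v1/Systems/1"       # prints https
```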

Vendor-specific protocols

Technology    Protocol  Boot method    BMC address format                 Notes
Fujitsu iRMC  iRMC      iPXE           irmc://<host>:<port>               Port is optional, the default is 443.
HPE iLO 4     iLO       iPXE           ilo4://<host>:<port>               Port is optional, the default is 443.
HPE iLO 4     iLO       Virtual media  ilo4-virtualmedia://<host>:<port>
HPE iLO 5     iLO       iPXE           ilo5://<host>:<port>               Should only be used instead of Redfish if you need RAID support.

Baremetal Operator features

Basic features

Advanced features

Automated Cleaning

One of the Ironic features exposed through the Metal3 Baremetal Operator is automated node cleaning. When enabled, automated cleaning kicks off when a node is provisioned for the first time and on every deprovisioning.

There are two automated cleaning modes available which can be configured via automatedCleaningMode field of a BareMetalHost spec:

  • metadata (the default) enables the removal of partitioning tables from all disks
  • disabled disables the cleaning process

For example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example-host
spec:
  automatedCleaningMode: metadata
  bootMACAddress: 00:8a:b6:8e:ac:b8
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: example-node-bmc-secret
  online: true

Note: Ironic supports full data removal, which is not currently exposed in Metal3.

For a host with cleaning disabled, no cleaning will be performed during deprovisioning. This is faster but may cause conflicts on subsequent provisionings (e.g. Ceph is known not to tolerate stale data partitions).

Warning: when disabling cleaning, consider setting root device hints to specify the exact block device to install to. Otherwise, subsequent provisionings may end up with different root devices, potentially causing incorrect configuration because of duplicated config drives.

If you are using Cluster-api-provider-metal3, please see its cleaning documentation.

Automatic secure boot

The automatic secure boot feature allows enabling and disabling UEFI (Unified Extensible Firmware Interface) secure boot when provisioning a host. This feature requires supported hardware and a compatible OS image. Currently, the iLO, iRMC and Redfish drivers support enabling UEFI secure boot.

Why do we need it

Automatic secure boot is needed when provisioning a host with high security requirements. Using checksums and signatures, secure boot protects the host from loading malicious code during the boot process, before the provisioned operating system is loaded.

How to use it

To enable automatic secure boot, first check that the hardware is supported, then set bootMode to UEFISecureBoot in the BareMetalHost custom resource. Note that secure boot is enabled before booting into the deployed instance and disabled while the ramdisk is running and on tear down. For example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-1
spec:
  online: true
  bootMACAddress: 00:5c:52:31:3a:9c
  bootMode: UEFISecureBoot
  ...

This will enable UEFI secure boot before booting the instance and disable it when the host is deprovisioned. Note that the default value for bootMode is UEFI.

Firmware Settings

Metal3 supports modifying firmware settings of the hosts before provisioning them. This feature can be used, for example, to enable or disable CPU virtualization extensions, huge pages or SR-IOV support. The corresponding functionality in Ironic is called BIOS settings.

Reading and modifying firmware settings is only supported for drivers based on Redfish, iRMC or iLO (see supported hardware). The commonly used IPMI driver does not support this feature.

HostFirmwareSettings Resources

A HostFirmwareSettings resource is automatically created for each host that supports firmware settings, with the same name and in the same namespace as the host. The BareMetal Operator puts the current settings in the status.settings field:

apiVersion: metal3.io/v1alpha1
kind: HostFirmwareSettings
metadata:
  creationTimestamp: "2024-05-28T16:31:06Z"
  generation: 1
  name: worker-0
  namespace: my-cluster
  ownerReferences:
  - apiVersion: metal3.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: BareMetalHost
    name: worker-0
    uid: 663a1453-d4d8-43a3-b459-64ea94d1435f
  resourceVersion: "20653"
  uid: 46fc9ccb-0717-4ced-93aa-babbe1a8cd5b
spec:
  settings: {}
status:
  conditions:
  - lastTransitionTime: "2024-05-28T16:31:06Z"
    message: ""
    observedGeneration: 1
    reason: Success
    status: "True"
    type: Valid
  - lastTransitionTime: "2024-05-28T16:31:06Z"
    message: ""
    observedGeneration: 1
    reason: Success
    status: "False"
    type: ChangeDetected
  lastUpdated: "2024-05-28T16:31:06Z"
  schema:
    name: schema-f229959d
    namespace: my-cluster
  settings:
    BootMode: Uefi
    EmbeddedSata: Raid
    L2Cache: 10x256 KB
    NicBoot1: NetworkBoot
    NumCores: "10"
    ProcTurboMode: Enabled
    QuietBoot: "true"
    SecureBootStatus: Enabled
    SerialNumber: QPX12345

In this example (taken from a virtual testing environment):

  • The spec.settings mapping is empty - no change is requested by the user.

  • The status.settings mapping is populated with the current values detected by Ironic.

  • The Valid condition is True, which means that spec.settings are valid according to the host’s FirmwareSchema. The condition will be set to False if any value in spec.settings fails validation.

  • The ChangeDetected condition is False, which means that the desired settings and the real settings do not diverge. This condition will be set to True after you modify spec.settings until the change is reflected in status.settings.

  • The schema field contains a link to the firmware schema (see below).

Warning: Ironic does not constantly update the current settings to avoid an unnecessary load on the host’s BMC. The current settings are updated on enrollment, provisioning and deprovisioning only.

FirmwareSchema resources

One or more FirmwareSchema resources are created for hosts that support firmware settings. Each schema object represents a list of possible settings and limits on their values.

apiVersion: metal3.io/v1alpha1
kind: FirmwareSchema
metadata:
  creationTimestamp: "2024-05-28T16:31:06Z"
  generation: 1
  name: schema-f229959d
  namespace: my-cluster
  ownerReferences:
  - apiVersion: metal3.io/v1alpha1
    kind: HostFirmwareSettings
    name: worker-1
    uid: bd97a81c-c736-4a6d-aee5-32dccb26e366
  - apiVersion: metal3.io/v1alpha1
    kind: HostFirmwareSettings
    name: worker-0
    uid: d8fb3c8a-395e-4c0a-9171-5928a68305b3
spec:
  hardwareModel: KVM (8.6.0)
  hardwareVendor: Red Hat
  schema:
    BootMode:
      allowable_values:
      - Bios
      - Uefi
      attribute_type: Enumeration
      read_only: false
    NumCores:
      attribute_type: Integer
      lower_bound: 10
      read_only: true
      unique: false
      upper_bound: 20
    QuietBoot:
      attribute_type: Boolean
      read_only: false
      unique: false

The following fields are included for each setting:

  • attribute_type – The type of the setting (Enumeration, Integer, String, Boolean, or Password).
  • read_only – The setting is read-only and cannot be modified.
  • unique – The setting’s value is unique in this host (e.g. serial numbers).

For type Enumeration:

  • allowable_values – A list of allowable values.

For type Integer:

  • lower_bound – The lowest allowed integer value.
  • upper_bound – The highest allowed integer value.

For type String:

  • min_length – The minimum length that the string value can have.
  • max_length – The maximum length that the string value can have.
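To make the meaning of these schema fields concrete, here is a minimal Python sketch that checks a candidate value against a single schema entry. The function name check_setting is hypothetical, the sample entries are copied from the FirmwareSchema example above, and the logic only approximates the idea of validation; it is not the Baremetal Operator's actual implementation.

```python
def check_setting(value, entry: dict) -> list:
    """Return a list of problems found when checking value against one schema entry."""
    problems = []
    if entry.get("read_only"):
        problems.append("setting is read-only")
    kind = entry.get("attribute_type")
    if kind == "Enumeration":
        if value not in entry.get("allowable_values", []):
            problems.append(f"{value!r} is not an allowable value")
    elif kind == "Integer":
        try:
            number = int(value)
        except (TypeError, ValueError):
            problems.append(f"{value!r} is not an integer")
        else:
            low, high = entry.get("lower_bound"), entry.get("upper_bound")
            if low is not None and number < low:
                problems.append(f"{number} is below lower_bound {low}")
            if high is not None and number > high:
                problems.append(f"{number} is above upper_bound {high}")
    elif kind == "String":
        length = len(str(value))
        low, high = entry.get("min_length"), entry.get("max_length")
        if low is not None and length < low:
            problems.append(f"length {length} is below min_length {low}")
        if high is not None and length > high:
            problems.append(f"length {length} is above max_length {high}")
    return problems


# Entries copied from the FirmwareSchema example above
schema = {
    "BootMode": {
        "attribute_type": "Enumeration",
        "allowable_values": ["Bios", "Uefi"],
        "read_only": False,
    },
    "NumCores": {
        "attribute_type": "Integer",
        "lower_bound": 10,
        "upper_bound": 20,
        "read_only": True,
        "unique": False,
    },
}

if __name__ == "__main__":
    print(check_setting("Uefi", schema["BootMode"]))
    print(check_setting(25, schema["NumCores"]))
```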

Note: the FirmwareSchema has a unique identifier derived from its settings and limits. Multiple hosts may therefore share the same FirmwareSchema identifier, so it is likely that more than one HostFirmwareSettings resource references the same FirmwareSchema when hardware of the same vendor and model is used.

How to change firmware settings

To change one or more settings for a host, update the corresponding HostFirmwareSettings resource, changing or adding the required settings to spec.settings. For example:

apiVersion: metal3.io/v1alpha1
kind: HostFirmwareSettings
metadata:
  name: worker-0
  namespace: my-cluster
  # ...
spec:
  settings:
    QuietBoot: true
status:
  # ...

Hint: you don’t need to copy over the settings you don’t want to change.
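As an illustration of this hint, the following Python sketch reduces a set of desired settings to only those that differ from the current values reported in status.settings. The helper names are hypothetical. The sketch assumes that current values are strings, as in the HostFirmwareSettings example above, and compares booleans in their lowercase text form.

```python
def _as_text(value) -> str:
    """Render a value the way it appears in the status.settings example."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def settings_to_request(current: dict, desired: dict) -> dict:
    """Keep only the desired settings that are new or differ from the current value."""
    return {
        name: value
        for name, value in desired.items()
        if name not in current or _as_text(current[name]) != _as_text(value)
    }


if __name__ == "__main__":
    current = {"BootMode": "Uefi", "ProcTurboMode": "Enabled", "QuietBoot": "true"}
    desired = {"BootMode": "Uefi", "ProcTurboMode": "Disabled", "QuietBoot": True}
    # Only ProcTurboMode differs, so only it needs to go into spec.settings
    print(settings_to_request(current, desired))
```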

If the host is in the available state, it will be moved to the preparing state and the new settings will be applied. After some time, the host will move back to available, and the resulting changes will be reflected in the status of the HostFirmwareSettings object. Applying firmware settings requires 1-2 reboots of the machine and thus may take 5-20 minutes.

Warning: if the host is not in the available state, the settings will be pending until it gets to this state (e.g. as a result of deprovisioning).

Alternatively, you can create a HostFirmwareSettings object together with the BareMetalHost object. In this case, the settings will be applied after inspection is finished.

Inspect annotation

Re-running inspection

The inspect annotation can be used to request the BareMetal Operator to (re-)inspect an available BareMetalHost, for example, when the hardware changes. If an inspection request is made while the host is in any state other than available, the request will be ignored.

To request a new inspection, simply annotate the host with inspect.metal3.io. Once inspection is requested, you should see the BMH in inspecting state until inspection is completed, and by the end of inspection the inspect.metal3.io annotation will be removed automatically.

Here is an example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    # The inspect annotation with no value
    inspect.metal3.io: ""
spec:
  ...

Disabling inspection

If you do not need the HardwareData collected by inspection, you can disable it by setting the inspect.metal3.io annotation to disabled, for example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    inspect.metal3.io: disabled
spec:
  ...

For advanced use cases, such as providing externally gathered inspection data, see external inspection.

Instance Customization

When provisioning bare-metal machines, it is usually required to customize the resulting instances. Common use cases include injecting SSH keys, adding users, installing software, starting services or configuring networking.

It is recommended to use UserData or NetworkData together with a first-boot configuration software such as cloud-init, Glean or Ignition. Most cloud images already come with one of these programs installed and configured.

Note: all customizations described in this document apply only to the final instance provisioned by Metal3 and do not apply during the inspection, preparing and provisioning phases.

Modified images

Rather than using an official cloud image, a user may build a custom image per cluster or even per host. There are numerous tools to achieve that; the one that the Metal3 community often employs is diskimage-builder.

This approach has two major downsides:

  1. Per-host images take a lot of disk space, especially since Ironic has a local image cache.
  2. diskimage-builder allows only basic customization out of the box; code will need to be written for anything complex.

It is recommended to use UserData or NetworkData instead when possible.

NetworkData

Network data describes the desired networking configuration in the OpenStack network_data.json format supported by cloud-init and Glean. The format is not very well documented, but you can consult the network_data JSON schema shipped with OpenStack.

Usually, one network data secret is created per host and should be linked to it. For example, given a local file host-0-network.json, you can create a secret:

kubectl create secret generic host-0-networkdata --from-file=networkData=host-0-network.json

Then you can attach it to the host during its enrollment or when starting provisioning:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: host-0
  namespace: my-cluster
spec:
  online: true
  bootMACAddress: 80:c1:6e:7a:e8:10
  bmc:
    address: ipmi://192.168.1.13
    credentialsName: host-0-bmc
  image:
    checksum: http://192.168.0.150/SHA256SUMS
    url: http://192.168.0.150/jammy-server-cloudimg-amd64.img
  networkData:
    name: host-0-networkdata

UserData

User data describes the desired configuration of the instance in a format specific to the first-boot software:

  • cloud-init supports two formats: cloud-config YAML and a shell script (distinguished by the header).
  • Ignition uses its own format.
  • Glean does not support user data at all.

For example, you can create a cloud-config file host-0.yaml:

#cloud-config
users:
- name: metal3
  ssh_authorized_keys:
  - ssh-ed25519 ABCD... metal3@example.com

Create a secret from this file:

kubectl create secret generic host-0-userdata --from-file=userData=host-0.yaml

Then you can attach it to the host during its enrollment or when starting provisioning:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: host-0
  namespace: my-cluster
spec:
  online: true
  bootMACAddress: 80:c1:6e:7a:e8:10
  bmc:
    address: ipmi://192.168.1.13
    credentialsName: host-0-bmc
  image:
    checksum: http://192.168.0.150/SHA256SUMS
    url: http://192.168.0.150/jammy-server-cloudimg-amd64.img
  userData:
    name: host-0-userdata

Implementation notes

User and network data are passed to the instance via a so-called config drive, which is a small additional disk partition created on the root device during provisioning. This partition contains user and network data, as well as metadata with a host name, as files.

Ironic is responsible for creating a partition image (usually, in the ISO 9660 format) and passing it to the IPA ramdisk together with the rest of the deployment information. Once the instance boots, the partition is mounted by the first-boot software and the configuration is loaded from it.

Both cloud-init and Ignition support various data sources, from which user and network data are fetched. Depending on the image type, different sources may be enabled by default:

  • In case of cloud-init, make sure that the config drive data source is enabled. This is not the same as the OpenStack data source, although both are used with OpenStack.

  • For Ignition to work, you must use an OpenStack Platform image (see supported platforms).

RAID setup

RAID is a technology that allows creating volumes with certain properties out of two or more physical disks. Depending on the RAID level, you may be able to merge several disks into a larger one or achieve redundancy.

Metal3 supports two RAID implementations:

  • Hardware RAID is implemented by hardware itself and can be configured through the machine’s BMC.
  • Software RAID is implemented by the Linux kernel and can be configured using the standard mdadm tool.

To create or delete RAID volumes, you need to edit the spec.raid field of the BareMetalHost resource, changing either the hardwareRAIDVolumes or the softwareRAIDVolumes array. If the host is in the available state, it will be moved to the preparing state and the new settings will be applied. After some time, the host will move back to available, and the resulting changes will be reflected in its status.raid field.

Note: RAID setup requires 1-2 reboots of the machine and thus may take 5-20 minutes.

Warning: never try to configure both hardware and software RAID at the same time on the same host. While theoretically possible, this mode makes little sense and is not supported well by the underlying Ironic service.

Hardware RAID

Hardware RAID is a type of RAID that is configured by a special component of the bare-metal machine, the RAID controller. The resulting RAID volumes are normally presented transparently to the operating system and can be used as normal disks.

Not all hardware models and Metal3 drivers support RAID: check supported hardware for details.

Automatic allocation

One approach is to define the required level, disk count and volume size, letting Ironic automatically select the disks to place the RAID volume on, for example:

spec:
  raid:
    hardwareRAIDVolumes:
    - name: volume1
      level: "5"
      numberOfPhysicalDisks: 3
      sizeGibibytes: 350

The most common RAID levels are 0, 1, 5 and 1+0. Levels 2, 6, 5+0 and 6+0 are also supported by Metal3 but may not be supported by all hardware models. The level dictates the minimum number of physical disks and the maximum size of a RAID volume.

Note: because of values like 1+0, RAID level is a string, not a number.
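The relationship between RAID level, disk count and volume size can be sketched in Python using the usual textbook capacity formulas for equally sized disks. The function name usable_size_gib and the table of minimum disk counts are assumptions made for this illustration; real controllers may reserve some space and may enforce different minimums.

```python
# Commonly quoted minimum disk counts per level (assumption, controller-dependent)
MIN_DISKS = {"0": 2, "1": 2, "5": 3, "6": 4, "1+0": 4}


def usable_size_gib(level: str, disks: int, disk_size_gib: float) -> float:
    """Textbook usable capacity of a RAID volume built from equally sized disks."""
    if level not in MIN_DISKS:
        raise ValueError(f"level {level!r} is not covered by this sketch")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"level {level} needs at least {MIN_DISKS[level]} disks")
    if level == "0":
        return disks * disk_size_gib  # striping, no redundancy
    if level == "1":
        return disk_size_gib  # every disk holds a full copy
    if level == "5":
        return (disks - 1) * disk_size_gib  # one disk's worth of parity
    if level == "6":
        return (disks - 2) * disk_size_gib  # two disks' worth of parity
    return (disks // 2) * disk_size_gib  # "1+0": striped mirror pairs


if __name__ == "__main__":
    # The 350 GiB level-5 volume on 3 disks from the example above
    # needs disks of at least 175 GiB each.
    print(usable_size_gib("5", 3, 175))
```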

You can use the boolean rotational field to limit the types of physical disks:

  • true to use only rotational disks (traditional spinning hard drives)
  • false to use non-rotational storage (flash-based: SSD, NVMe)
  • if the field is not set, disks of any type are used

Manual allocation

Alternatively, you can provide the controller and a list of disk identifiers. Note that these are internal disk identifiers as reported by the BMC, not standard Linux names like /dev/sda. For example, on a Dell machine:

spec:
  raid:
    hardwareRAIDVolumes:
    - name: volume2
      level: "0"
      controller: RAID.Integrated.1-1
      physicalDisks:
      - Disk.Bay.5:Enclosure.Internal.0-1:RAID.Integrated.1-1
      - Disk.Bay.6:Enclosure.Internal.0-1:RAID.Integrated.1-1
      - Disk.Bay.7:Enclosure.Internal.0-1:RAID.Integrated.1-1

If you do not specify the size of the volume, the maximum possible size will be used (depending on the size of the physical disks).

Removing RAID

To remove the RAID configuration, set hardwareRAIDVolumes to an empty list:

spec:
  raid:
    hardwareRAIDVolumes: []

Warning: there is a crucial difference between setting hardwareRAIDVolumes to an empty list and removing the raid field completely: the former will remove any existing volumes, the latter will not touch any existing RAID configuration.

Software RAID

Warning: software RAID support is experimental. Please report any issues you encounter.

Software RAID is configured by the mdadm utility from within the IPA ramdisk, which will be automatically booted by Ironic when the host moves to the preparing state.

A subset of the hardware RAID API is provided for software RAID volumes with the following limitations:

  • The only supported levels are 0, 1 and 1+0.
  • Only one or two RAID volumes can be created on a host.
  • The first volume must have level 1 and should be used as the root device.
  • It is not possible to specify the number of physical disks.
  • The backing physical disks must not have any data or partitions on them.
  • Your instance image must have Linux software RAID support, including the mdadm utility. Other operating systems may not work at all.
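Some of the limitations above can be checked before a manifest is applied. The Python sketch below does this for a softwareRAIDVolumes list that has already been loaded into Python dictionaries. The function name check_software_raid is hypothetical, and only the limitations that are visible in the manifest itself are covered.

```python
SUPPORTED_LEVELS = {"0", "1", "1+0"}


def check_software_raid(volumes: list) -> list:
    """Return a list of problems for a softwareRAIDVolumes list (empty means OK)."""
    problems = []
    if len(volumes) > 2:
        problems.append("at most two volumes are supported")
    for index, volume in enumerate(volumes):
        level = volume.get("level")
        if level not in SUPPORTED_LEVELS:
            problems.append(f"volume {index}: level {level!r} is not supported")
        if "numberOfPhysicalDisks" in volume:
            problems.append(f"volume {index}: the number of physical disks cannot be set")
    # An empty list requests removal and has no first volume to check
    if volumes and volumes[0].get("level") != "1":
        problems.append("the first volume must have level 1")
    return problems


if __name__ == "__main__":
    print(check_software_raid([{"level": "1", "sizeGibibytes": 10}, {"level": "0"}]))
    print(check_software_raid([{"level": "5"}]))
```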

Check the Ironic software RAID guide for more implementation details.

Software RAID: automatic allocation

You can specify the sizes and the levels of the volume(s) and let Ironic do the rest. You can also omit the size of the last volume:

spec:
  raid:
    softwareRAIDVolumes:
    - level: "1"
      sizeGibibytes: 10
    - level: "0"

Note: the same physical disks will be used for both volumes. Each physical disk will have partitions corresponding to each of the volumes.

Software RAID: manual allocation

You can specify the backing physical disks using the same format as rootDeviceHints, for example:

spec:
  raid:
    softwareRAIDVolumes:
    - level: "1"
      physicalDisks:
      - serialNumber: abcd
      - serialNumber: efgh

Removing software RAID

To remove the RAID configuration, set softwareRAIDVolumes to an empty list:

spec:
  raid:
    softwareRAIDVolumes: []

Warning: even when automated cleaning is enabled, software RAID is not automatically removed on deprovisioning.

Reboot annotation

The reboot annotation can be used for rebooting BareMetalHosts in the provisioned state. The annotation key takes either of the following forms:

  • reboot.metal3.io
  • reboot.metal3.io/{key}

Note: use the online field to power hosts on/off instead of rebooting.

Simple reboot

In its basic form (reboot.metal3.io), the annotation will trigger a reboot of the BareMetalHost. The controller will remove the annotation as soon as it has restored power to the host.

The annotation value should be a JSON map containing the key mode and a value hard or soft to indicate if a hard or soft reboot should be performed. If the value is an empty string, the default is to first try a soft reboot, and if that fails, do a hard reboot.

Phased reboot

The advanced form (reboot.metal3.io/{key}) includes a unique suffix (indicated with {key}). In this form the host will be kept in PoweredOff state until the annotation has been removed. This can be useful if some tasks need to be performed while the host is in a known stable state. The purpose of the {key} is to allow multiple clients to use the API simultaneously in a safe way. Each client chooses a key and touches only the annotations that have this key to avoid interfering with other clients.

If there are multiple annotations, the controller will wait for all of them to be removed (by the clients) before powering on the host. Similarly, if both forms of annotations are used, the reboot.metal3.io/{key} form will take precedence. This ensures that the host stays powered off until all clients are ready (i.e. all annotations are removed).

Clients using this API must respect each other and clean up after themselves. Otherwise they will step on each other’s toes, for example by leaving an annotation indefinitely or by removing someone else’s annotation before that client is ready.

Examples

Immediate reboot via soft shutdown first, followed by a hard power-off if the soft shutdown fails:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    reboot.metal3.io: ""
spec:
  ...

Immediate reboot via hard power-off action:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    reboot.metal3.io: '{"mode": "hard"}'
spec:
  ...

Phased reboot, issued and managed by the client registered with the key cli42, via soft shutdown first, followed by a hard reboot if the soft reboot fails:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    reboot.metal3.io/cli42: ""
spec:
  ...

Phased reboot, issued and managed by the client registered with the key cli42, via hard shutdown:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    reboot.metal3.io/cli42: '{"mode": "hard"}'
spec:
  ...
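The annotation forms used in the examples above can also be generated by a client. The following Python sketch builds the annotation key and value for both the simple and the phased form. The function name reboot_annotation is hypothetical, and applying the annotation to the host (for example with a Kubernetes client) is left out.

```python
import json


def reboot_annotation(mode=None, key=None):
    """Return the (annotation key, annotation value) pair for a reboot request."""
    name = "reboot.metal3.io" if key is None else f"reboot.metal3.io/{key}"
    if mode is None:
        # Empty value: try a soft reboot first, then fall back to a hard one
        return name, ""
    if mode not in ("hard", "soft"):
        raise ValueError("mode must be 'hard' or 'soft'")
    return name, json.dumps({"mode": mode})


if __name__ == "__main__":
    print(reboot_annotation())
    print(reboot_annotation(mode="hard", key="cli42"))
```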

Implementation notes

The exact behavior of hard and soft reboot depends on the Ironic configuration. Please see the Ironic configuration reference for more details on this, e.g. the soft_power_off_timeout variable is relevant.

For more details please check the reboot interface proposal.

Root Device Hints

Bare-metal machines often have more than one block device, and in many cases a user will want to specify which of them to use as the root device. Root device hints allow selecting one device or a group of devices to choose from. You can provide the hints via the spec.rootDeviceHints field on your BareMetalHost:

spec:
  # ...
  rootDeviceHints:
    wwn: "0x55cd2e415652abcd"

Hint: root device hints in Metal3 are closely modeled on Ironic’s root device hints, but there are important differences in the available hints and the comparison operators they use.

Warning: the default root device depends on the hardware profile as explained below. Currently, the /dev/sda path is used when no hints are specified. This value is not going to work for NVMe storage. Furthermore, Linux does not guarantee that block device names are consistent across reboots.

RootDeviceHints format

One or more hints can be provided; the chosen device must match all of them. Available hints are:

  • deviceName – A string containing a canonical Linux device path like /dev/vda or a by-path alias like /dev/disk/by-path/pci-0000:04:00.0.

    Warning: as mentioned above, block device names are not guaranteed to be consistent across reboots. If possible, choose a more reliable hint, such as wwn or serialNumber.

    Hint: only by-path aliases are supported. Other aliases, such as by-id or by-uuid, cannot currently be used.

  • hctl – A string containing a SCSI bus address like 0:0:0:0.

  • model – A string containing a vendor-specific device identifier. The hint can be a substring of the actual value.

  • vendor – A string containing the name of the vendor or manufacturer of the device. The hint can be a substring of the actual value.

  • serialNumber – A string containing the device serial number.

  • minSizeGigabytes – An integer representing the minimum size of the device in Gigabytes.

  • wwn – A string containing the unique storage identifier.

  • wwnWithExtension – A string containing the unique storage identifier with the vendor extension appended.

  • wwnVendorExtension – A string containing the unique vendor storage identifier.

  • rotational – A boolean indicating whether the device must be a rotating disk (true) or not (false). Examples of non-rotational devices include SSD and NVMe storage.

Finding the right hint value

Since the root device hints are only required for provisioning, you can use the results of inspection to get an overview of available storage devices:

kubectl get hardwaredata/<BMHNAME> -n <NAMESPACE> -o jsonpath='{.spec.hardware.storage}' | jq .

This command produces JSON output, where you can find all the fields necessary to populate the root device hints before provisioning. For example, in a virtual testing environment:

[
  {
    "alternateNames": [
      "/dev/sda",
      "/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0"
    ],
    "hctl": "0:0:0:0",
    "model": "QEMU HARDDISK",
    "name": "/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0",
    "rotational": true,
    "serialNumber": "drive-scsi0-0-0-0",
    "sizeBytes": 32212254720,
    "type": "HDD",
    "vendor": "QEMU"
  }
]
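Given one entry of this output, a hint can be picked mechanically. The Python sketch below prefers the identifiers recommended earlier (wwn, then serialNumber) and falls back to a by-path device name. The function name suggest_root_device_hints is hypothetical, and the sketch assumes that a wwn field, when reported by inspection, carries the same name as the hint.

```python
def suggest_root_device_hints(device: dict) -> dict:
    """Suggest a rootDeviceHints mapping for one storage entry from inspection."""
    # Prefer identifiers that stay stable across reboots
    for field in ("wwn", "serialNumber"):
        if device.get(field):
            return {field: device[field]}
    # Fall back to a by-path alias, the only alias type accepted by deviceName
    names = [device.get("name", "")] + list(device.get("alternateNames", []))
    for name in names:
        if name.startswith("/dev/disk/by-path/"):
            return {"deviceName": name}
    return {}


if __name__ == "__main__":
    # Storage entry taken from the inspection output above
    device = {
        "alternateNames": ["/dev/sda", "/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0"],
        "hctl": "0:0:0:0",
        "model": "QEMU HARDDISK",
        "name": "/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0",
        "rotational": True,
        "serialNumber": "drive-scsi0-0-0-0",
        "sizeBytes": 32212254720,
        "type": "HDD",
        "vendor": "QEMU",
    }
    print(suggest_root_device_hints(device))
```

Serial numbers reported by virtual environments may not be unique, so the result is a starting point to review rather than a final answer.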

Interaction with hardware profiles

Hardware profiles are a deprecated concept that was introduced to describe homogeneous types of hardware. The default hardware profile is unknown, which implies using /dev/sda as the root device.

In a future version of the BareMetalHost API, the hardware profile concept will be disabled, and Metal3 will default to having no root device hints. In this case, the default logic in Ironic will apply: the smallest block device that is at least 4 GiB is chosen. If you want this logic to apply in the current version of the API, use the empty profile:

spec:
  # ...
  hardwareProfile: empty

In all other cases, use explicit root device hints.

Detached annotation

The detached annotation provides a way to prevent management of a BareMetalHost. It works by deleting the host information from Ironic without triggering deprovisioning. The BareMetal Operator will recreate the host in Ironic again once the annotation is removed. This annotation can be used with BareMetalHosts in Provisioned, ExternallyProvisioned or Available states.

Normally, deleting a BareMetalHost will always trigger deprovisioning. This can be problematic and unnecessary if we just want to, for example, move the BareMetalHost from one cluster to another. By applying the annotation before removing the BareMetalHost from the old cluster, we can ensure that the host is not disrupted by this (normally it would be deprovisioned). The next step is then to recreate it in the new cluster without triggering a new inspection. See the status annotation page for how to do this.

The annotation key is baremetalhost.metal3.io/detached and the value can be anything (it is ignored). Here is an example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example
  annotations:
    baremetalhost.metal3.io/detached: ""
spec:
  online: true
  bootMACAddress: 00:8a:b6:8e:ac:b8
  bootMode: legacy
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: example-bmc-secret
...

Why is this annotation needed?

  • It provides a way to move BareMetalHosts between clusters (essentially deleting them in the old cluster and recreating them in the new) without going through deprovisioning, inspection and provisioning.
  • It allows deleting the BareMetalHost object without triggering deprovisioning. This can be used to hand over management of the host to a different system without disruption.

For more details, please see the design proposal.

External inspection

Similar to the status annotation, external inspection makes it possible to skip the inspection step. The difference is that the status annotation can only be used on the very first reconcile and allows setting all the fields under status. In contrast, external inspection limits the changes so that only HardwareDetails can be modified, and it can be used at any time when inspection is disabled (with the inspect.metal3.io: disabled annotation) or when there is no existing HardwareDetails data.

External inspection is controlled through an annotation on the BareMetalHost. The annotation key is inspect.metal3.io/hardwaredetails and the value is a JSON representation of the BareMetalHost’s status.hardware field.

Here is an example of a BMH that has inspection disabled and uses the external inspection feature to add the HardwareDetails:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
  annotations:
    inspect.metal3.io: disabled
    inspect.metal3.io/hardwaredetails: |
      {"systemVendor":{"manufacturer":"QEMU", "productName":"Standard PC (Q35 + ICH9, 2009)","serialNumber":""}, "firmware":{"bios":{"date":"","vendor":"","version":""}},"ramMebibytes":4096, "nics":[{"name":"eth0","model":"0x1af4 0x0001","mac":"00:b7:8b:bb:3d:f6", "ip":"172.22.0.64","speedGbps":0,"vlanId":0,"pxe":true}], "storage":[{"name":"/dev/sda","rotational":true,"sizeBytes":53687091200, "vendor":"QEMU", "model":"QEMU HARDDISK","serialNumber":"drive-scsi0-0-0-0", "hctl":"6:0:0:0"}],"cpu":{"arch":"x86_64", "model":"Intel Xeon E3-12xx v2 (IvyBridge)","clockMegahertz":2494.224, "flags":["foo"],"count":4},"hostname":"hwdAnnotation-0"}
spec:
  ...

Why is this needed?

  • It allows avoiding an extra reboot for live-images that include their own inspection tooling.
  • It provides an arguably safer alternative to the status annotation in some cases.

Caveats:

  • If both baremetalhost.metal3.io/status and inspect.metal3.io/hardwaredetails are specified on BareMetalHost creation, inspect.metal3.io/hardwaredetails will take precedence and overwrite any hardware data specified via baremetalhost.metal3.io/status.
  • If the BareMetalHost is in the Available state the controller will not attempt to match profiles based on the annotation.

Live ISO

The live-iso API in Metal3 allows booting a BareMetalHost with an ISO image instead of writing an image to the local disk using the IPA deploy ramdisk.

This feature has two primary use cases:

  • Running ephemeral load on hosts (e.g. calculations or simulations that do not store local data).
  • Integrating a 3rd party installer (e.g. coreos installer).

Warning: this feature is designed to work with virtual media (see supported hardware). While it’s possible to boot an ISO over iPXE, the booted OS will not be able to access any data on the ISO except for the kernel and initramfs it booted from.

To boot a live ISO, you need to set the image URL to the location of the ISO and set the format field to live-iso, for example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: live-iso-booted-node
spec:
  image:
    url: http://1.2.3.4/image.iso
    format: live-iso
  online: true

Note: image.checksum, rootDeviceHints, networkData and userData will not be used since the image is not written to disk.

For more details, please see the design proposal.

Status annotation

The status annotation is useful when you need to avoid inspection of a BareMetalHost. This can happen if the status is already known, for example, when moving the BareMetalHost from one cluster to another. By setting this annotation, the BareMetal Operator will take the status of the BareMetalHost directly from the annotation.

The annotation key is baremetalhost.metal3.io/status and the value is a JSON representation of the BareMetalHost’s status field. One simple way of extracting the status and turning it into an annotation is using kubectl like this:

# Save the status in json format to a file
kubectl get bmh <name-of-bmh> -o jsonpath="{.status}" > status.json
# Save the BMH and apply the status annotation to the saved BMH.
kubectl -n metal3 annotate bmh <name-of-bmh> \
  baremetalhost.metal3.io/status="$(cat status.json)" \
  --dry-run=client -o yaml > bmh.yaml

Note that the above example does not apply the annotation to the BareMetalHost directly, since applying it to a host that already has a status is most likely not useful. Instead, it saves the BareMetalHost with the annotation applied to a file, bmh.yaml. This file can then be applied in another cluster. The status would be discarded at this point since the user is usually not allowed to set it, but the annotation is still there and would be used by the BareMetal Operator to set the status again. Once this is done, the operator will remove the status annotation. In this situation you may also want to check the detached annotation for how to remove the BareMetalHost from the old cluster without going through deprovisioning.

Here is an example of a BareMetalHost, first without the annotation, but with status and spec, and then the other way around. This shows how the status field is turned into the annotation value.

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
spec:
  automatedCleaningMode: metadata
  bmc:
    address: redfish+http://192.168.111.1:8000/redfish/v1/Systems/febc9f61-4b7e-411a-ada9-8c722edcee3e
    credentialsName: node-0-bmc-secret
  bootMACAddress: 00:80:1f:e6:f1:8f
  bootMode: legacy
  online: true
status:
  errorCount: 0
  errorMessage: ""
  goodCredentials:
    credentials:
      name: node-0-bmc-secret
      namespace: metal3
    credentialsVersion: "1775"
  hardwareProfile: ""
  lastUpdated: "2022-05-31T06:33:05Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: "2022-05-31T06:33:05Z"
    provision:
      end: null
      start: null
    register:
      end: "2022-05-31T06:33:05Z"
      start: "2022-05-31T06:32:54Z"
  operationalStatus: OK
  poweredOn: false
  provisioning:
    ID: 8d566f5b-a28f-451b-a70f-419507c480cd
    bootMode: legacy
    image:
      url: ""
    state: inspecting
  triedCredentials:
    credentials:
      name: node-0-bmc-secret
      namespace: metal3
    credentialsVersion: "1775"

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  namespace: metal3
  annotations:
    baremetalhost.metal3.io/status: |
      {"errorCount":0,"errorMessage":"","goodCredentials":{"credentials":{"name":"node-0-bmc-secret","namespace":"metal3"},"credentialsVersion":"1775"},"hardwareProfile":"","lastUpdated":"2022-05-31T06:33:05Z","operationHistory":{"deprovision":{"end":null,"start":null},"inspect":{"end":null,"start":"2022-05-31T06:33:05Z"},"provision":{"end":null,"start":null},"register":{"end":"2022-05-31T06:33:05Z","start":"2022-05-31T06:32:54Z"}},"operationalStatus":"OK","poweredOn":false,"provisioning":{"ID":"8d566f5b-a28f-451b-a70f-419507c480cd","bootMode":"legacy","image":{"url":""},"state":"inspecting"},"triedCredentials":{"credentials":{"name":"node-0-bmc-secret","namespace":"metal3"},"credentialsVersion":"1775"}}
spec:
  ...

Ironic

Ironic is an open-source service for automating provisioning and lifecycle management of bare metal machines. Born as the Bare Metal service of the OpenStack cloud software suite, it has evolved into a semi-autonomous project that can be deployed independently as a standalone service, for example using Bifrost, and that integrates with other tools and projects, as in the case of Metal3.

Ironic nowadays supports the two main standard hardware management interfaces, Redfish and IPMI, and thanks to its large community of contributors, it can provide native support for many different bare-metal hardware vendors, such as Dell, Fujitsu, HPE, and Supermicro.

The Metal3 project adopted Ironic as the back-end that manages bare-metal hosts behind a native Kubernetes API.

Why Ironic in Metal3

  • Ironic is open source! This aligns perfectly with the philosophy behind Metal3.
  • Ironic has a vendor agnostic interface provided by a robust set of RESTful APIs.
  • Ironic has a vibrant and diverse community, including small and large operators, hardware and software vendors.
  • Ironic provides features covering the whole hardware life-cycle: from registering bare metal machines and retrieving the hardware specifications of newly discovered machines, through configuration and provisioning with custom operating system images, up to resetting and cleaning machines for re-provisioning or end-of-life retirement.

How Metal3 uses Ironic

Bare Metal Operator is the main component that interfaces with the Ironic API for all operations needed to provision bare-metal hosts, such as hardware capabilities inspection, operating system installation, and re-initialization when restoring a bare-metal machine to its original status.

Metal3 provides a way to install Ironic with a suitable configuration. Alternatively, Bare Metal Operator can be set up to use an externally managed Ironic instance.

Requirements for external Ironic

  • HTTP basic authentication (OpenStack Identity is not supported - see issue 1218).
  • Enabled hardware types and interfaces that match the supported Metal3 drivers (at least the ones you intend to use).
  • API version 1.81 (2023.1 “Antelope” release cycle) or newer must be available.
  • Built-in in-band inspection (ironic-inspector is no longer supported).
  • Deploy interface direct enabled and used by default.
  • No-op network interface (OpenStack Networking is not supported).

Optionally:

  • Automated cleaning set to metadata only.
  • Deploy interfaces ramdisk and custom-deploy enabled.
  • Fast track mode enabled.
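When checking an external Ironic against the API version requirement above, note that Ironic API versions are compared by their numeric minor component, so 1.9 is older than 1.81. The following minimal sketch shows such a comparison. api_version_at_least is a hypothetical helper name, and how you obtain the maximum version advertised by your Ironic instance is left out.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# Versions are "MAJOR.MINOR" and both parts are compared as integers.
api_version_at_least() {
  have_major=${1%%.*}
  have_minor=${1#*.}
  want_major=${2%%.*}
  want_minor=${2#*.}
  if [ "$have_major" -gt "$want_major" ]; then return 0; fi
  if [ "$have_major" -lt "$want_major" ]; then return 1; fi
  [ "$have_minor" -ge "$want_minor" ]
}

# Usage: api_version_at_least "$max_version" 1.81 || echo "Ironic is too old" >&2
```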

Ironic database

Ironic keeps information in its own database, completely independent from the Kubernetes data storage. Metal3 treats the Kubernetes database (e.g. BareMetalHost resources) as the authoritative source of information about the desired state of the machines. On any discrepancies, Bare Metal Operator will use the Ironic API to enforce the desired state.

When Ironic is deployed by the Metal3 deployment scripts, its database is ephemeral by default. SQLite is used as the backend, and the data is removed when the Metal3 pod is restarted. When this happens, Bare Metal Operator will re-create hosts in Ironic and drive them through various actions to enforce the expected state:

  • Hosts in the provisioned state will go through adoption without provisioning them again.

  • For hosts in the available state, only the BMC credentials will be verified.

  • For hosts in various transient states, Bare Metal Operator will restart the action that led to this state. For instance, a host in the provisioning state will undergo cleaning, then a new provisioning will be started.

Host enrollment and hardware inventory

When a BareMetalHost is created, Bare Metal Operator tries to find an existing record in Ironic by its name or MAC address. The name in Ironic is generated by joining the namespace and the host name with a tilde. For example, host compute-0 in the metal3 namespace will receive the Ironic name metal3~compute-0. If no record is found:

  1. A new record is created in Ironic.
  2. BMC credentials are verified by Ironic by reading the current power state of the machine.
  3. The inspection process is started.

Once inspection finishes successfully, the hardware inventory is fetched from Ironic and stored in a corresponding HardwareData resource. Note that this information is never updated unless a new inspection happens (see inspect annotation).
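The naming rule described above (namespace and host name joined with a tilde) can be sketched as a small shell function, which may be handy when looking up a host's record in Ironic while debugging. ironic_node_name is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the Ironic node name for a BareMetalHost
# from its namespace and its name, joined with a tilde.
ironic_node_name() {
  namespace="$1"
  host_name="$2"
  printf '%s~%s\n' "$namespace" "$host_name"
}

ironic_node_name metal3 compute-0   # prints metal3~compute-0
```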

Host provisioning

Provisioning is triggered by populating either the image or the customDeploy field of the host. Under the hood, three modes of provisioning are supported:

  • When customDeploy is provided, Bare Metal Operator will configure the host to use the custom-agent deploy interface. The method field will be treated as the name of a custom deploy step to execute instead of the regular provisioning process. Your Ironic installation or IPA image must contain the implementation of this step. By default, Metal3 does not ship any such steps.

  • When customDeploy is not provided and the image.format field is set to live-iso, the host will be configured to use the ramdisk deploy interface, while image.url will be treated as a URL of an ISO 9660 image to boot. This mode is designed to integrate Metal3 with site-specific installers.

  • When customDeploy is not provided and the image.format field is not set to live-iso, the regular provisioning process is followed. The IPA-based service ramdisk (normally already booted on the host during inspection) will write the downloaded image to the root disk specified by the rootDeviceHints field.
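The decision between the three modes can be summarized in a short sketch. select_deploy_interface is a hypothetical helper, not Bare Metal Operator code. The interface names custom-agent and ramdisk come from the text above, while mapping the regular process to the direct deploy interface is an assumption based on the external Ironic requirements listed earlier.

```shell
#!/bin/sh
# Hypothetical helper: pick the deploy interface for a host.
#   $1 - customDeploy method (empty string if customDeploy is not provided)
#   $2 - image format (empty string if not set)
select_deploy_interface() {
  custom_deploy_method="$1"
  image_format="$2"
  if [ -n "$custom_deploy_method" ]; then
    # customDeploy takes priority over any image settings.
    echo "custom-agent"
  elif [ "$image_format" = "live-iso" ]; then
    # The ISO is booted directly and nothing is written to disk.
    echo "ramdisk"
  else
    # Assumption: regular provisioning uses the default "direct" interface.
    echo "direct"
  fi
}

select_deploy_interface "" "live-iso"   # prints ramdisk
```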

Host decommissioning

Each BareMetalHost will receive a finalizer that prevents this host from being immediately removed on deletion. Before the finalizer is removed, the host is:

  1. cleaned to remove the partitioning tables from all its disks,
  2. powered off to stop it from running the service ramdisk.

The cleaning process is retried several times. If cleaning is no longer possible because of a problem with the host, disable cleaning first by setting the automatedCleaningMode field to disabled.

WARNING: it is not recommended to manually remove the finalizer when the cleaning process is taking longer than desired or is failing. Doing so will remove the host record from Kubernetes but leave it in Ironic. The currently running action will continue in the background, and an attempt to add the host again may fail because of the conflict.

Install Ironic

Metal3 runs Ironic as a set of containers. Those containers can be deployed either in-cluster or out-of-cluster. In both scenarios, the following containers must run in order to provision bare metal nodes:

  • ironic (the main provisioning service)
  • ipa-downloader (init container to download and cache the deployment ramdisk image)
  • httpd (HTTP server that serves cached images and iPXE configuration)

A few other containers are optional:

  • ironic-endpoint-keepalived (to maintain a persistent IP address on the provisioning network)
  • dnsmasq (to support DHCP on the provisioning network and to implement network boot via iPXE)
  • ironic-log-watch (to provide access to the deployment ramdisk logs)
  • mariadb (the provisioning service database; SQLite can be used as a lightweight alternative)
  • ironic-inspector (the auxiliary inspection service - only used in older versions of Metal3)

Prerequisites

Networking

A separate provisioning network is required when network boot is used.

The following ports must be accessible by the hosts being provisioned:

  • TCP 6385 (Ironic API)
  • TCP 5050 (Inspector API; when used)
  • TCP 80 (HTTP server; can be changed via the HTTP_PORT environment variable)
  • UDP 67/68/546/547 (DHCP and DHCPv6; when network boot is used)
  • UDP 69 (TFTP; when network boot is used)

The main Ironic service must be able to access the hosts’ BMC addresses.

When virtual media is used, the hosts’ BMCs must be able to access HTTP_PORT.

Environment variables

The following environment variables can be passed to configure the Ironic services:

  • HTTP_PORT - port used by httpd server (default 6180)
  • PROVISIONING_IP - provisioning interface IP address to use for ironic, dnsmasq(dhcpd) and httpd (default 172.22.0.1)
  • CLUSTER_PROVISIONING_IP - cluster provisioning interface IP address (default 172.22.0.2)
  • PROVISIONING_INTERFACE - interface to use for ironic, dnsmasq(dhcpd) and httpd (default ironicendpoint)
  • CLUSTER_DHCP_RANGE - dhcp range to use for provisioning (default 172.22.0.10-172.22.0.100)
  • DEPLOY_KERNEL_URL - the URL of the kernel to deploy ironic-python-agent
  • DEPLOY_RAMDISK_URL - the URL of the ramdisk to deploy ironic-python-agent
  • IRONIC_ENDPOINT - the endpoint of the ironic
  • CACHEURL - the URL of the cached images
  • IRONIC_FAST_TRACK - whether to enable fast_track provisioning or not (default true)
  • IRONIC_KERNEL_PARAMS - kernel parameters to pass to IPA (default console=ttyS0)
  • IRONIC_INSPECTOR_VLAN_INTERFACES - VLAN interfaces included in introspection. Possible values: all, for all VLANs on all interfaces, using LLDP information (default); interface, for all VLANs on a particular interface, using LLDP information; interface.vlan, for a particular VLAN interface, not using LLDP
  • IRONIC_BOOT_ISO_SOURCE - where the boot iso image will be served from, possible values are: local (default), to download the image, prepare it and serve it from the conductor; http, to serve it directly from its HTTP URL
  • IPA_DOWNLOAD_ENABLED - enables the use of the Ironic Python Agent Downloader container to download IPA archive (default true)
  • USE_LOCAL_IPA - enables the use of a locally supplied IPA archive. This setting is handled by BMO and takes effect only when IPA_DOWNLOAD_ENABLED is “false”; otherwise IPA_DOWNLOAD_ENABLED takes precedence (default false)
  • LOCAL_IPA_PATH - takes effect only when USE_LOCAL_IPA is set to “true”. Points to the directory that contains the ironic-python-agent.tar archive; the path itself is arbitrary. This variable is handled by BMO
  • GATEWAY_IP - gateway IP address to use for ironic dnsmasq (dhcpd)
  • DNS_IP - DNS IP address to use for ironic dnsmasq (dhcpd)

To learn how to pass these variables, please see the sections below.
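As a rough illustration of how such settings can be collected in KEY=VALUE form, the sketch below prints a few of the variables listed above, using the documented defaults unless a value is already set in the calling shell. render_ironic_env is a hypothetical helper, and whether your deployment's env file accepts exactly these keys is an assumption to verify against the file shipped with your version.

```shell
#!/bin/sh
# Hypothetical helper: print selected Ironic settings as KEY=VALUE lines.
# Values already set in the environment win; otherwise the defaults
# documented in the list above are used.
render_ironic_env() {
  cat <<EOF
HTTP_PORT=${HTTP_PORT:-6180}
PROVISIONING_INTERFACE=${PROVISIONING_INTERFACE:-ironicendpoint}
IRONIC_FAST_TRACK=${IRONIC_FAST_TRACK:-true}
IRONIC_KERNEL_PARAMS=${IRONIC_KERNEL_PARAMS:-console=ttyS0}
EOF
}

# Usage: render_ironic_env > /tmp/ironic-settings.env
```

Writing to a scratch file and merging the result by hand avoids overwriting keys that an existing env file already contains.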

Ironic in-cluster installation

For an in-cluster Ironic installation, we run a set of containers within a single pod in a Kubernetes cluster. You can enable TLS, basic auth, both, or neither for Ironic and Inspector communication. The kustomize folders described below help install Ironic for each of these cases. In each of these deployments, a ConfigMap is created and mounted to the Ironic pod. The ConfigMap is populated from the environment variables in ironic-deployment/default/ironic_bmo_configmap.env. As such, update ironic_bmo_configmap.env with your custom values before deploying Ironic.

WARNING: Ironic normally listens on the host network of the control plane nodes. If you do not enable authentication, anyone with access to this network can use it to manipulate your nodes. It’s also highly advised to use TLS to prevent eavesdropping.

Installing with Kustomize

In the quickstart guide, we demonstrated how to install Ironic with kustomize by creating an Ironic kustomization overlay. That is still the approach to follow if you have specific requirements for your Ironic deployment, but we also provide a ready-made overlay for the most common use case: Ironic with basic authentication and TLS.

We assume you are inside your local baremetal-operator checkout. If not, clone the repository first and cd to its root:

 git clone https://github.com/metal3-io/baremetal-operator.git
 cd baremetal-operator

The overlay of interest is located at ironic-deployment/overlays/basic-auth_tls. To make this overlay work, we still need to set up Authentication and Ironic Environment Variables, as instructed in the quickstart guide.

Next, check the Ironic kustomization section in the quickstart guide to see how to generate the necessary configMap and Secrets for the deployment.

Also, cert-manager should have been installed in the cluster before deploying Ironic. If you haven’t installed cert-manager yet:

 kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.3/cert-manager.yaml

Wait a few minutes for all cert-manager deployments to reach the Ready state.

We can then deploy Ironic with basic authentication and TLS enabled:

 kustomize build ironic-deployment/overlays/basic-auth_tls | kubectl apply -f -

Alternatively, you can use the deploy.sh script to deploy Ironic with custom elements. Check out the detailed instructions, and the script itself, for more information.

Ironic out-of-cluster installation

For an out-of-cluster Ironic installation, we run a set of docker containers outside of a Kubernetes cluster. To pass Ironic settings, export the corresponding environment variables in the current shell before calling the run_local_ironic.sh installation script. This starts the following containers:

  • ironic
  • ironic-endpoint-keepalived
  • ironic-log-watch
  • ipa-downloader
  • dnsmasq
  • httpd
  • mariadb (only if IRONIC_USE_MARIADB is set to “true”)

Whereas the in-cluster installation uses different manifests for TLS and basic auth, here we export environment variables to enable or disable TLS and basic auth, and use the same script in every case.

TLS and Basic authentication disabled (not recommended)

 export IRONIC_FAST_TRACK="false"  # Example of manipulating Ironic settings
 export IRONIC_TLS_SETUP="false"   # Disable TLS
 export IRONIC_BASIC_AUTH="false"  # Disable basic auth
 ./tools/run_local_ironic.sh

Basic authentication enabled

 export IRONIC_TLS_SETUP="false"
 export IRONIC_BASIC_AUTH="true"
 ./tools/run_local_ironic.sh

TLS enabled

 export IRONIC_TLS_SETUP="true"
 export IRONIC_BASIC_AUTH="false"
 ./tools/run_local_ironic.sh
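The combinations above differ only in two variables, so they can be generated by one small function. The sketch below is illustrative: ironic_auth_exports is a hypothetical helper, and enabling both TLS and basic authentication at the same time with run_local_ironic.sh is an assumption based on the in-cluster basic-auth_tls overlay, so verify it against the script before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the export lines for a TLS / basic auth combination.
#   $1 - "true" or "false" for TLS
#   $2 - "true" or "false" for basic authentication
ironic_auth_exports() {
  tls="$1"
  basic_auth="$2"
  for value in "$tls" "$basic_auth"; do
    case "$value" in
      true|false) ;;
      *) echo "expected 'true' or 'false', got '$value'" >&2; return 1 ;;
    esac
  done
  printf 'export IRONIC_TLS_SETUP="%s"\n' "$tls"
  printf 'export IRONIC_BASIC_AUTH="%s"\n' "$basic_auth"
}

# Usage (assumption: both can be enabled together):
#   eval "$(ironic_auth_exports true true)" && ./tools/run_local_ironic.sh
```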

Ironic Python Agent (IPA)

IPA is a service written in Python that runs within a ramdisk. It provides remote access for Ironic to perform various operations on the managed server. It also sends information about the server to Ironic.

By default, we pull IPA images from the Ironic upstream archive, where an image is built on every commit to the master git branch.

However, another remote registry or a local IPA archive can be specified. ipa-downloader is responsible for downloading the IPA ramdisk image to a shared volume from where the nodes are able to retrieve it.

Data flow

IPA interacts with other components. The information exchanged, and the components it is sent to or received from, are described below. The communication between IPA and these components can be encrypted in transit with SSL/TLS.

  • Inspection: data about hardware details, such as CPU, disk, RAM and network interfaces.
  • Heartbeat: periodic message informing Ironic that the node is still running.
  • Lookup: data sent to Ironic that helps it determine Ironic’s node UUID for the node.

The above data is sent and received as follows:

  • Inspection results are sent to Ironic.
  • Lookup and heartbeat data is sent to Ironic.
  • The user-supplied boot image that will be written to the node’s disk is retrieved from the HTTPD server.

Ironic Container Images

The currently available ironic container images are:

Name and link to repository | Published image | Content/Purpose
ironic-image | quay.io/metal3-io/ironic | Ironic services / BMC emulators
ironic-ipa-downloader | quay.io/metal3-io/ironic-ipa-downloader | Download and cache the ironic python agent ramdisk
ironic-client | quay.io/metal3-io/ironic-client | Ironic command-line interface (for debugging)

The main ironic-image currently contains entry points to run both Ironic itself and its auxiliary services: dnsmasq and httpd.

How to build a container image

Each repository mentioned in the list contains a Dockerfile that can be used to build the corresponding container, for example:

git clone https://github.com/metal3-io/ironic-image.git
cd ironic-image
docker build . -f Dockerfile

In some cases a make sub-command is provided to build the image using docker, usually make docker.

Customizing source builds

When building the ironic image, it is also possible to specify a different source for ironic, ironic-lib or the sushy library using the build arguments IRONIC_SOURCE, IRONIC_LIB_SOURCE and SUSHY_SOURCE. It is also possible to apply local patches to the source. See ironic-image README for details.

Special resources: sushy-tools and virtualbmc

The Dockerfiles needed to build sushy-tools (Redfish emulator) and VirtualBMC (IPMI emulator) containers can be found in the ironic-image container repository, under the resources directory.

Kubernetes Cluster API Provider Metal3

Kubernetes-native declarative infrastructure for Metal3.

What is the Cluster API Provider Metal3

The Cluster API brings declarative, Kubernetes-style APIs to cluster creation, configuration and management. The API itself is shared across multiple cloud providers. Cluster API Provider Metal3 is one of the providers for Cluster API and enables users to deploy a Cluster API based cluster on top of bare metal infrastructure using Metal3.

Compatibility with Cluster API

CAPM3 version | Cluster API version | CAPM3 Release
v1alpha4 | v1alpha3 | v0.4.X
v1alpha5 | v1alpha4 | v0.5.X
v1beta1 | v1beta1 | v1.1.X
v1beta1 | v1beta1 | v1.2.X

Development Environment

There are multiple ways to set up a development environment:

Getting involved and contributing

Are you interested in contributing to Cluster API Provider Metal3? We, the maintainers and community, would love your suggestions, contributions, and help! Also, the maintainers can be contacted at any time to learn more about how to get involved.

To set up your environment, check out the development environment.

In the interest of getting more new people involved, we tag issues with good first issue. These are typically issues that have smaller scope but are good ways to start to get acquainted with the codebase.

We also encourage ALL active community participants to act as if they are maintainers, even if you don’t have “official” write permissions. This is a community effort, we are here to serve the Kubernetes community. If you have an active interest and you want to get involved, you have real power! Don’t assume that the only people who can get things done around here are the “maintainers”.

We also would love to add more “official” maintainers, so show us what you can do!

All the repositories in the Metal3 project, including the Cluster API Provider Metal3 GitHub repository, use the Kubernetes bot commands. The full list of the commands can be found here. Note that some of them might not be implemented in metal3 CI.

Community

Community resources and contact details can be found here.

GitHub issues

We use GitHub issues to keep track of bugs and feature requests. There are two different templates to help ensure that relevant information is included.

Bugs

If you think you have found a bug please follow the instructions below.

  • Please spend a small amount of time giving due diligence to the issue tracker. Your issue might be a duplicate.
  • Collect logs from relevant components and make sure to include them in the bug report you are going to open.
  • Remember users might be searching for your issue in the future, so please give it a meaningful title to help others.
  • Feel free to reach out to the metal3 community.

Tracking new features

We also use the issue tracker to track features. If you have an idea for a feature, or think you can help Cluster API Provider Metal3 become even more awesome, then follow the steps below.

  • Open a feature request.
  • Remember users might be searching for your feature request in the future, so please give it a meaningful title to help others.
  • Clearly define the use case, using concrete examples. e.g.: I type this and cluster-api-provider-metal3 does that.
  • Some of our larger features will require proposals. If you would like to include a technical design for your feature please open a feature proposal in metal3-docs using this template.

After the new feature is well understood and the design is agreed upon, we can start coding the feature. We would love for you to code it, so please open a WIP (work in progress) pull request. Happy coding!

Install Cluster-api-provider-metal3

You can either use clusterctl (recommended) to install the Metal³ infrastructure provider, or kustomize for manual installation. Both methods install the provider CRDs, its controllers and Ip-address-manager. Please keep in mind that Baremetal Operator and Ironic are decoupled from CAPM3 and will not be installed when the provider is initialized. As such, you need to install them yourself.

Prerequisites

  1. Install clusterctl, refer to Cluster API book for installation instructions.

  2. Install kustomize, refer to official instructions here.

  3. Install Ironic, refer to this page.

  4. Install Baremetal Operator, refer to this page.

  5. Install Cluster API core components, i.e., the core, bootstrap and control-plane providers. This will also install cert-manager, if it is not already installed.

     clusterctl init --core cluster-api:v1.7.4 --bootstrap kubeadm:v1.7.4 \
     --control-plane kubeadm:v1.7.4 -v5
    

With clusterctl

This method is recommended. You can specify the CAPM3 version you want to install by appending a version tag, e.g. :v1.7.1. If the version is not specified, the latest version available will be installed.

clusterctl init --infrastructure metal3:v1.7.1

With kustomize

To install a specific version, check out the github.com/metal3-io/cluster-api-provider-metal3.git repository at the tag with the desired version:

git clone https://github.com/metal3-io/cluster-api-provider-metal3.git
cd cluster-api-provider-metal3
git checkout v1.1.2 -b v1.1.2

Then, edit the controller-manager image version in config/default/capm3/manager_image_patch.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
      # Change the value of image/tag to your desired image URL or version tag
      - image: quay.io/metal3-io/cluster-api-provider-metal3:v1.1.2
        name: manager

Apply the manifests

cd cluster-api-provider-metal3
kustomize build config/default | kubectl apply -f -

Cluster-api-provider-metal3 features

Remediation Controller and MachineHealthCheck

The Cluster API includes a remediation feature that implements automated health checking of k8s nodes. It deletes an unhealthy Machine and replaces it with a healthy one. This approach can be challenging for providers that use hardware based clusters because of the slower (re)provisioning of unhealthy Machines. To overcome this, the CAPI remediation feature was extended so that provider specific external remediation can be plugged in. This makes it possible to plug in Metal3 specific remediation strategies to remediate unhealthy nodes. In this case, the Cluster API MHC finds unhealthy nodes while the CAPM3 Remediation Controller remediates them.

CAPI Remediation

A MachineHealthCheck is a Cluster API resource, which allows users to define conditions under which Machines within a Cluster should be considered unhealthy. Users can also specify a timeout for each of the conditions that they define to check on the Machine’s Node. If any of these conditions are met for the duration of the timeout, the Machine will be remediated. CAPM3 uses the MachineHealthCheck to create remediation requests based on the Metal3RemediationTemplate and Metal3Remediation CRDs, which plug in the remediation solution. For more info, please read the CAPI MHC link.
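The rule "a condition is met for the duration of its timeout" can be illustrated with a small sketch. condition_unhealthy is a hypothetical helper and not how Cluster API implements the check; treating an elapsed time exactly equal to the timeout as unhealthy is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: decide whether one unhealthy condition has been met.
#   $1 - status currently observed on the Node condition (e.g. Unknown)
#   $2 - status configured as unhealthy (e.g. Unknown or False)
#   $3 - seconds the condition has been in the observed status
#   $4 - configured timeout in seconds
condition_unhealthy() {
  observed="$1"
  configured="$2"
  elapsed="$3"
  timeout="$4"
  [ "$observed" = "$configured" ] && [ "$elapsed" -ge "$timeout" ]
}

# A Ready condition stuck at Unknown for 6 minutes with a 300s timeout:
if condition_unhealthy Unknown Unknown 360 300; then
  echo "machine would be remediated"
fi
```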

External Remediation

External remediation provides remediation solutions other than deleting an unhealthy Machine and creating a healthy one. Environments consisting of hardware based clusters are slower to (re)provision unhealthy Machines, so there is a growing need for a remediation flow that includes external remediation, which can significantly reduce the remediation time. Normally, condition-based remediation offers no option other than deleting an unhealthy Machine and replacing it with a new one. Other environments and vendors can also have specific remediation requirements, so there is a need for a generic mechanism for implementing custom remediation logic. External remediation integrates with the CAPI MHC and supports remediation based on power cycling the underlying hardware. It supports the use of the BMO reboot API and the CAPM3 unhealthy annotation as part of the automated remediation cycle. It is a generic mechanism for supporting externally provided custom remediation strategies. If no value for remediationTemplate is defined for the MachineHealthCheck CR, the condition-based flow is continued. For more info, see the External Remediation proposal.

Metal3 Remediation

The CAPM3 remediation controller reconciles Metal3Remediation objects created by the CAPI MachineHealthCheck. It locates a Machine with the same name as the Metal3Remediation object and uses the BMO and CAPM3 APIs to remediate the associated unhealthy node. The remediation controller supports a reboot strategy specified in the Metal3Remediation CRD and uses the same object to store the state of the current remediation cycle. The reboot strategy consists of three steps: power off the Machine, delete the related Node, and power the Machine on again. Deleting the Node indicates that the workloads on the Node are not running anymore, which results in quicker rescheduling and lower downtime of the affected workloads.

Enable remediation for worker nodes

Machines managed by a MachineSet (as identified by the nodepool label) can be remediated. Here is an example MachineHealthCheck and Metal3RemediationTemplate for worker nodes:


apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck
  namespace: metal3
spec:
  # clusterName is required to associate this MachineHealthCheck with a particular cluster
  clusterName: test1
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
  maxUnhealthy: 100%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy.
  # Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered unhealthy when
  # the Node it created has not yet registered with the cluster. This can be useful when
  # Nodes take a long time to start up or when you only want condition based checks for
  # Machine health.
  nodeStartupTimeout: 0m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines, if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
  remediationTemplate: # added infrastructure reference
    kind: Metal3RemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: worker-remediation-request

Metal3RemediationTemplate for worker nodes:


apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: worker-remediation-request
  namespace: metal3
spec:
  template:
    spec:
      strategy:
        type: "Reboot"
        retryLimit: 2
        timeout: 300s

Enable remediation for control plane nodes

Machines managed by a KubeadmControlPlane are remediated according to the KubeadmControlPlane proposal. At least 2 control plane machines are necessary in order to use the remediation feature. Control plane nodes are identified by the cluster.x-k8s.io/control-plane label. Here is an example MachineHealthCheck and Metal3RemediationTemplate for control plane nodes:


apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: controlplane-healthcheck
  namespace: metal3
spec:
  clusterName: test1
  maxUnhealthy: 100%
  nodeStartupTimeout: 0m
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  remediationTemplate: # added infrastructure reference
    kind: Metal3RemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: controlplane-remediation-request

Metal3RemediationTemplate for control plane nodes:


apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: controlplane-remediation-request
  namespace: metal3
spec:
  template:
    spec:
      strategy:
        type: "Reboot"
        retryLimit: 1
        timeout: 300s

Limitations and caveats of Metal3 remediation

  • Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck.

  • If the Node for a Machine is removed from the cluster, the CAPI MachineHealthCheck will consider this Machine unhealthy and remediate it immediately.

  • If no Node joins the cluster for a Machine within the NodeStartupTimeout, the Machine will be remediated.

  • If a Machine fails for any reason and the FailureReason is set, the Machine will be remediated immediately.

Node Reuse

This feature makes it possible to reuse the same BareMetalHosts (referred to as hosts below) during deprovisioning and provisioning, mainly as part of the rolling upgrade process in the cluster.

Importance of scale-in strategy

Host reuse relies solely on the scale-in upgrade strategy of the Cluster API objects KubeadmControlPlane and MachineDeployment. During an upgrade of these resources, the machines owned by the KubeadmControlPlane or MachineDeployment are removed one by one before new ones are created (delete-create method). This ensures that the intended host is reused during the upgrade: the host freed by the deleted machine is picked up when the new machine is provisioned.

Note: To get the delete-first, create-after behavior in the Cluster API objects mentioned above, the user has to modify:

  • the MaxSurge field in KubeadmControlPlane, setting it to 0, with a minimum of 3 control plane machine replicas
  • the MaxSurge and MaxUnavailable fields in MachineDeployment, setting them to 0 and 1 respectively

In contrast, if the CAPI objects use the scale-out strategy during the upgrade, they usually follow the create-swap-delete method: a new machine is created first and a new host is picked up for it, which breaks the node reuse logic right at the beginning of the upgrade.
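
The scale-in requirements can be checked before an upgrade is started. The following Python sketch is a hedged illustration: it assumes the manifests have been loaded into dictionaries (for example from kubectl get ... -o json) and that the settings live at spec.rolloutStrategy.rollingUpdate.maxSurge for KubeadmControlPlane and at spec.strategy.rollingUpdate for MachineDeployment, as in the Cluster API v1beta1 types. The helper name is_scale_in is hypothetical.

```python
def is_scale_in(resource):
    """Return True if the object deletes a machine before creating its replacement."""
    kind = resource.get("kind")
    spec = resource.get("spec", {})
    if kind == "KubeadmControlPlane":
        rolling = spec.get("rolloutStrategy", {}).get("rollingUpdate", {})
        # maxSurge must be 0, with at least 3 control plane replicas.
        return rolling.get("maxSurge") == 0 and spec.get("replicas", 0) >= 3
    if kind == "MachineDeployment":
        rolling = spec.get("strategy", {}).get("rollingUpdate", {})
        # maxSurge 0 and maxUnavailable 1 give the delete-create behavior.
        return rolling.get("maxSurge") == 0 and rolling.get("maxUnavailable") == 1
    raise ValueError(f"unsupported kind: {kind!r}")


# Hypothetical, trimmed manifests for illustration.
kcp = {"kind": "KubeadmControlPlane",
       "spec": {"replicas": 3,
                "rolloutStrategy": {"rollingUpdate": {"maxSurge": 0}}}}
md = {"kind": "MachineDeployment",
      "spec": {"strategy": {"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}}}}
md_scale_out = {"kind": "MachineDeployment",
                "spec": {"strategy": {"rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0}}}}

print(is_scale_in(kcp), is_scale_in(md), is_scale_in(md_scale_out))
```

An object whose fields are missing is reported as not scale-in, so an unconfigured object is never mistaken for one that supports node reuse.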

Workflow

The Metal3MachineTemplate (M3MT) custom resource is the object responsible for enabling the node reuse feature.

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: test1-controlplane
  namespace: metal3
spec:
  nodeReuse: true
  template:
    spec:
      image:
      ...

There can be two Metal3MachineTemplate objects, one referenced by the KubeadmControlPlane for control plane nodes and the other by the MachineDeployment for worker nodes. Before performing an upgrade, the user must set the nodeReuse field to true in the Metal3MachineTemplate whose hosts are to be reused. By default, nodeReuse is false, in which case no host is reused. To learn more about the internals of the controller logic, see the original proposal for the feature.

Once the nodeReuse field is set to true, the user has to make sure that the scale-in strategy is configured as described above, and can then update the desired fields in the KubeadmControlPlane or MachineDeployment to start a rolling upgrade.

Note: If you create a new Metal3MachineTemplate object (for control plane or worker nodes) rather than using the one created during provisioning, make sure to reference it from the corresponding Cluster API object (KubeadmControlPlane or MachineDeployment). Also keep in mind that already provisioned Metal3Machines were created from the old Metal3MachineTemplate and consume existing hosts, so setting nodeReuse to true in the new Metal3MachineTemplate has no effect on them. To use the new Metal3MachineTemplate in the workflow, the user has to reprovision the nodes, so that new Metal3Machines are created from the Metal3MachineTemplate referenced in the Cluster API object.

CAPM3 Pivoting

What is pivoting

Cluster API Provider Metal3 (CAPM3) implements support for CAPI’s ‘move/pivoting’ feature.

The CAPI pivoting feature is the process of moving the provider components and declared Cluster API resources from a source management cluster to a target management cluster using the clusterctl functionality called “move”. More information about the general CAPI “move” functionality can be found in the clusterctl documentation.

In Metal3, pivoting is performed with the clusterctl tool provided by the Cluster API project. clusterctl refers to pivoting as move. During the pivot process, clusterctl pauses reconciliation of the CAPI objects, and this is propagated to the CAPM3 objects as well. Once all objects are paused, they are created in the target cluster and deleted from the bootstrap cluster.

Prerequisite

  1. It is mandatory to use clusterctl for both the bootstrap and target cluster.

    If the provider components are not installed using clusterctl, it will not be able to identify the objects to move. Initializing the cluster using clusterctl adds the following labels to the CRDs of each related object.

    labels:
      clusterctl.cluster.x-k8s.io: ""
      cluster.x-k8s.io/provider: "<provider-name>"

    So if the clusters are not initialized using clusterctl, all the CRDs of the objects to be moved to the target cluster need to have these labels in both the bootstrap cluster and the target cluster before the move is performed.

    Note: This is not recommended, since the way clusterctl identifies objects to manage might change in the future; it is always safer to install CRDs and controllers through the clusterctl init sub-command.

  2. BareMetalHost objects have the correct status annotation.

    Since the BareMetalHost (BMH) status holds important information about the BMH itself, the BMH has to be moved together with its status, and the status has to be reconstructed in the target cluster before the BMH is reconciled. This is done through the BMH status annotation in BMO.

  3. Maintain connectivity towards the provisioning network.

    Bare metal machines boot over a network with a DHCP server. This requires maintaining a fixed IP endpoint towards the provisioning network, which is achieved through keepalived. The Ironic deployment includes a container named ironic-endpoint-keepalived, which maintains the Ironic endpoint using keepalived. The motivation is to ensure that the Ironic endpoint IP is also passed on to the target cluster control plane. This also guarantees that once the move is done and the management cluster is taken down, the target cluster control plane can reclaim the Ironic endpoint IP through keepalived. The end goal is to make the Ironic endpoint reachable in the target cluster.

  4. BMO is deployed as part of CAPM3.

    If not, it has to be deployed before clusterctl init, and the BMH CRDs need to be labeled manually. Separate labeling of the BMH CRDs is required because, since CAPM3 release v0.5.0, the BMO/BMH CRDs are no longer deployed as part of the CAPM3 deployment. This is a prerequisite for both the management and the target cluster.

  5. Objects should have a proper owner reference chain.

    clusterctl move moves all the objects to the target cluster following the owner reference chain. So it is necessary to verify that all the objects that need to be moved to the target cluster have a proper owner reference chain.
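
The label requirement from the first prerequisite lends itself to a quick automated check. Below is a minimal Python sketch, assuming the CRD has been loaded into a dictionary (for example from kubectl get crd <name> -o json). The helper name missing_move_labels and the provider value infrastructure-metal3 are assumptions for illustration; only the two label keys come from the text above.

```python
# Label keys taken from the first prerequisite.
MANAGED_LABEL = "clusterctl.cluster.x-k8s.io"
PROVIDER_LABEL = "cluster.x-k8s.io/provider"


def missing_move_labels(crd):
    """Return the label keys needed for the move that the CRD lacks."""
    labels = (crd.get("metadata") or {}).get("labels") or {}
    missing = []
    if MANAGED_LABEL not in labels:
        missing.append(MANAGED_LABEL)
    # The provider label must name a provider, so an empty value counts as missing.
    if not labels.get(PROVIDER_LABEL):
        missing.append(PROVIDER_LABEL)
    return missing


# Hypothetical CRD metadata for illustration.
labelled = {"metadata": {"name": "baremetalhosts.metal3.io",
                         "labels": {MANAGED_LABEL: "",
                                    PROVIDER_LABEL: "infrastructure-metal3"}}}
unlabelled = {"metadata": {"name": "baremetalhosts.metal3.io"}}

print(missing_move_labels(labelled))
print(missing_move_labels(unlabelled))
```

Running such a check against both the bootstrap and the target cluster before the move helps catch manually deployed CRDs that were never labeled.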

Important Notes

The following requirements are essential for the move process to run successfully:

  1. The move process should be done when the BMHs are in a steady state. BMHs should not be moved while any operation is ongoing, e.g. while a BMH is in the provisioning state. Doing so results in failure, since the interaction between IPA and Ironic gets broken; as a result, Ironic’s database might not be repopulated and the cluster will eventually end up in an erroneous state. Moreover, the IP of the BMH might change after the move, and the DHCP leases from the management cluster are not moved to the target cluster.

  2. Before the move process is initiated, it is important to delete the Ironic pod/Ironic containers. If Ironic is deployed in the cluster, the deployment is named metal3-ironic; if it is deployed locally outside the cluster, the user has to make sure that all Ironic-related containers are correctly deleted. If Ironic is not deleted before the move, the old Ironic might interfere with the operations of the new Ironic deployed in the target cluster, since the database of the first Ironic instance is not cleaned when the BMHs are moved. In addition, two Ironic deployments would mean two dnsmasq instances in the deployment, which is undesirable.

  3. The provisioning bridge to which the ironic-endpoint-IP is to be attached should have a static IP assigned before the Ironic pod/containers start to operate in the target cluster. This is important because the ironic-endpoint-keepalived container only assigns the ironic-endpoint-IP to the provisioning bridge in the target cluster when the bridge already has an IP on it. Otherwise it fails to attach the IP and Ironic is unreachable. A static assignment is needed because this interface hosts the DHCP server, so it cannot itself be configured to use DHCP.

Step by step pivoting process

As described in the clusterctl documentation, the whole process, from bootstrapping a management cluster to moving objects to the target cluster, can be described as follows:

The move process can be bounded with the creation of a temporary bootstrap cluster used to provision a target management cluster.

This can be achieved with the following procedure:

  1. Create a temporary bootstrap cluster, for example using Kind or Minikube. Once the bootstrap cluster is up and running, the CAPI and provider components can be installed on it with clusterctl.

  2. Install Ironic components, namely: ironic, ironic-endpoint-keepalived, httpd and dnsmasq.

  3. Use clusterctl init to install the provider components

    Example:

    clusterctl init --infrastructure metal3:v1.7.1 \
      --target-namespace metal3 --watching-namespace metal3
    

    This command creates the necessary CAPI controllers (CAPI, CABPK, CAKCP) and CAPM3 as the infrastructure provider. All of the controllers are installed in the metal3 namespace and watch objects in the metal3 namespace.

  4. Provision target cluster:

    Example:

    clusterctl generate cluster ... | kubectl apply -f -
    
  5. Wait for the target management cluster to be up and running, then get its kubeconfig.

  6. Use the new cluster’s kubeconfig to install the ironic-components in the target cluster.

  7. Use clusterctl init with the new cluster’s kubeconfig to install the provider components.

    Example:

    clusterctl init --kubeconfig target.yaml --infrastructure metal3:v1.7.1 \
      --target-namespace metal3 --watching-namespace metal3
    
  8. Use clusterctl move to move the Cluster API resources from the bootstrap cluster to the target management cluster.

    Example:

    clusterctl move --to-kubeconfig target.yaml -n metal3 -v 10
    
  9. Delete the bootstrap cluster

Automated Cleaning

Before reading this page, please see the Baremetal Operator Automated Cleaning page.

If you are using only the Metal3 Baremetal Operator, you can skip this page and refer to the Baremetal Operator automated cleaning page instead.

For deployments following the Cluster-api-provider-metal3 (CAPM3) workflow, automated cleaning can be, and is recommended to be, configured via CAPM3 custom resources (CR).

There are two automated cleaning modes, which can be set via the automatedCleaningMode field of a Metal3MachineTemplate spec or Metal3Machine spec.

  • metadata to enable the cleaning
  • disabled to disable the cleaning

When enabled (metadata), automated cleaning runs when a node is first provisioned and on every deprovisioning. There is no default value for automatedCleaningMode in Metal3MachineTemplate and Metal3Machine. If the user doesn’t set any mode, the field is omitted from the spec. Unsetting automatedCleaningMode in the Metal3MachineTemplate blocks the synchronization of the cleaning mode between the Metal3MachineTemplate and its Metal3Machines. This enables the selective operations described below.

Bulk operations

The CAPM3 controller replicates the automated cleaning mode from a Metal3MachineTemplate to all Metal3Machines that reference it. For example, a controlplane and a worker Metal3Machine both have automatedCleaningMode set to disabled because it is set to disabled in the template that they both reference.

Note: CAPM3 controller replicates the cleaning mode from Metal3MachineTemplate to Metal3Machine only if automatedCleaningMode is set (not empty) on the Metal3MachineTemplate resource. In other words, it synchronizes either disabled or metadata modes between Metal3MachineTemplate and Metal3Machines.

Selective operations

Normally automated cleaning mode is replicated from Metal3MachineTemplate spec to its referenced Metal3Machines’ spec and from Metal3Machines spec to BareMetalHost spec (if CAPM3 is used). However, sometimes you might want to have a different automated cleaning mode for one or more Metal3Machines than the others even though they are referencing the same Metal3MachineTemplate. For example, there is one worker and one controlplane Metal3Machine created from the same Metal3MachineTemplate, and we would like the automated cleaning to be enabled (metadata) for the worker while disabled (disabled) for the controlplane.

Here are the steps to achieve that:

  1. Unset automatedCleaningMode in the Metal3MachineTemplate. The CAPM3 controller then unsets it for the referenced Metal3Machines. Although it is unset in the Metal3Machine, BareMetalHosts get their default automated cleaning mode, metadata. As mentioned earlier, the CAPM3 controller replicates the cleaning mode from Metal3MachineTemplate to Metal3Machine ONLY when it is either metadata or disabled. As such, unsetting the cleaning mode in the Metal3MachineTemplate is enough to block synchronization between Metal3MachineTemplate and Metal3Machine.
  2. Set automatedCleaningMode to metadata on the worker Metal3Machine spec and to disabled on the controlplane Metal3Machine spec. Since no mode is set on the Metal3MachineTemplate, Metal3Machines can have different automated cleaning modes even if they reference the same Metal3MachineTemplate. The CAPM3 controller copies the cleaning modes from the Metal3Machines to their corresponding BareMetalHosts. As such, we end up with two nodes having different cleaning modes even though they reference the same Metal3MachineTemplate.
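
The propagation rules for bulk and selective operations can be summarised in a small model. The Python sketch below is an illustration under stated assumptions, not the CAPM3 controller code: None stands for an unset field, the function name propagate_cleaning_mode is hypothetical, and an unset template is modelled as leaving each Metal3Machine’s own value untouched. The example uses the goal stated above (cleaning enabled for the worker, disabled for the controlplane).

```python
# Default mode on the BareMetalHost when nothing is set, as described above.
BMH_DEFAULT_MODE = "metadata"


def propagate_cleaning_mode(template_mode, machine_modes):
    """Return {machine name: (Metal3Machine mode, BareMetalHost mode)}."""
    result = {}
    for name, own_mode in machine_modes.items():
        # A mode set on the template is copied to every referencing Metal3Machine;
        # an unset template (None) leaves the machine's own value in place.
        m3m_mode = template_mode if template_mode is not None else own_mode
        # The Metal3Machine mode is copied to its BareMetalHost; if it is unset,
        # the BareMetalHost keeps its default.
        bmh_mode = m3m_mode if m3m_mode is not None else BMH_DEFAULT_MODE
        result[name] = (m3m_mode, bmh_mode)
    return result


# Bulk operation: the template value wins for every machine.
bulk = propagate_cleaning_mode("disabled", {"controlplane": "metadata", "worker": None})

# Selective operation: template unset, each machine keeps its own mode.
selective = propagate_cleaning_mode(None, {"controlplane": "disabled", "worker": "metadata"})

print(bulk)
print(selective)
```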

IPAM (IP Address Manager)

The IPAM project provides a controller to manage static IP address allocations in Cluster API Provider Metal3.

In CAPM3, the network data needs to be passed to Ironic through the BareMetalHost. CAPI addresses the deployment of Kubernetes clusters and nodes using the Kubernetes API. As such, it uses objects such as MachineDeployments (similar to Deployments for pods) that take care of creating the requested number of machines based on templates. The user can increase the number of replicas, triggering the creation of new machines from the provided templates. Given how KubeadmControlPlane and MachineDeployment work in Cluster API, it is not possible to provide static IP addresses for each machine before the actual deployment.

In addition, all the resources from the source cluster must support CAPI pivoting, i.e. being copied and recreated in the target cluster. This means that all objects must contain all the information needed in their spec field to recreate the status in the target cluster without losing information. All objects must be attached to the cluster object through a tree of owner references for the pivoting to proceed properly.

Moreover, there are use cases where users want to specify multiple non-contiguous ranges of IP addresses, use the same pool across multiple Template objects, or rule out some IP addresses that might be in use for any reason after the deployment.

IPAM is introduced to manage the allocation of IP addresses from subnets according to requests, without handling any use of those addresses. IPAM adds flexibility by providing the address right before the node is provisioned. It can share a pool across MachineDeployment or KubeadmControlPlane objects, allows non-contiguous pools and external IP management through IPAddress CRs, offers predictable IP addresses, and is resilient to the clusterctl move operation.

In order to use IPAM, both the CAPI and IPAM controllers are required, since the IPAM controller has a dependency on Cluster API Cluster objects.

IPAM components

  • IPPool: A set of IP address pools to be used for IP address allocations
  • IPClaim: Request for an IP address allocation
  • IPAddress: IP address allocation

IPPool

Example of IPPool:

apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: pool1
  namespace: default
spec:
  clusterName: cluster1
  namePrefix: test1-prov
  pools:
    - start: 192.168.0.10
      end: 192.168.0.30
      prefix: 25
      gateway: 192.168.0.1
    - subnet: 192.168.1.1/26
    - subnet: 192.168.1.128/25
  prefix: 24
  gateway: 192.168.1.1
  preAllocations:
    claim1: 192.168.0.12

The spec field contains the following fields:

  • clusterName: Name of the cluster to which this pool belongs. It is used to verify whether the resource is paused.
  • namePrefix: The prefix used to generate the names of the IPAddress objects.
  • pools: List of IP address pools
  • prefix: Default prefix for this IPPool
  • gateway: Default gateway for this IPPool
  • preAllocations: Default preallocated IP address for this IPPool

The prefix and gateway can be overridden per pool. Here is the pool definition:

  • start: Start address of the IP range. It can be omitted if subnet is set.
  • end: End address of the IP range. It can be omitted.
  • subnet: Subnet for the allocation. It can be omitted if start is set. It is used to verify that the allocated address belongs to this subnet.
  • prefix: Override of the default prefix for this pool
  • gateway: Override of the default gateway for this pool

IPClaim

An IPClaim is an object representing a request for an IP address allocation.

Example of IPClaim:

apiVersion: ipam.metal3.io/v1alpha1
kind: IPClaim
metadata:
  name: test1-controlplane-template-0-pool1
  namespace: default
spec:
  pool:
    name: pool1
    namespace: default

The spec field contains the following:

  • pool: A reference to the IPPool from which the address is requested

IPAddress

An IPAddress is an object representing an IP address allocation. It is created by IPAM to fulfil an IPClaim, so the user does not have to create it manually.

Example IPAddress:

apiVersion: ipam.metal3.io/v1alpha1
kind: IPAddress
metadata:
  name: test1-prov-192-168-0-13
  namespace: default
spec:
  pool:
    name: pool1
    namespace: default
  claim:
    name: test1-controlplane-template-0-pool1
    namespace: default
  address: 192.168.0.13
  prefix: 24
  gateway: 192.168.0.1

The spec field contains the following:

  • pool: Reference to the IPPool this address is for
  • claim: Reference to the IPClaim this address is for
  • address: Allocated IP address
  • prefix: Prefix for this address
  • gateway: Gateway for this address

Installing IPAM as Deployment

This section will show how IPAM can be installed as a deployment in a cluster.

Deploying controllers

The CAPI and IPAM controllers need to be deployed at the beginning. The IPAM controller has a dependency on Cluster API Cluster objects. CAPI CRDs and controllers must be deployed and the cluster objects should exist for successful deployments.

Deployment

The user can create the IPPool object independently. It will wait for its cluster to exist before reconciling. If the user wants to create IPAddress objects manually, they should be created before any claims. It is highly recommended to use the preAllocations field itself or have the reconciliation paused.

After an IPClaim object is created, the controller lists all existing IPAddress objects. It then randomly selects an address that has not been allocated yet and is not in the preAllocations map. Finally, it creates an IPAddress object containing the references to the IPPool and IPClaim, the address, the prefix from the address pool or the default prefix, and the gateway from the address pool or the default gateway.
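
The allocation flow can be modelled with Python’s standard ipaddress module. This is a simplified sketch, not the IPAM controller’s implementation, and it makes several assumptions that are not confirmed by the text above: a claim whose name appears in preAllocations receives exactly that address, a pool given only as a subnet offers that subnet’s usable host addresses, and the IPAddress name is the namePrefix followed by the address with dots replaced by dashes (inferred from the example name test1-prov-192-168-0-13, IPv4 only). The function names are hypothetical.

```python
import ipaddress
import random


def candidate_addresses(pool):
    """Yield the addresses covered by one entry of spec.pools."""
    first = last = None
    if "subnet" in pool:
        # strict=False accepts a host address such as 192.168.1.1/26.
        hosts = list(ipaddress.ip_network(pool["subnet"], strict=False).hosts())
        first, last = hosts[0], hosts[-1]
    start = ipaddress.ip_address(pool["start"]) if "start" in pool else first
    end = ipaddress.ip_address(pool["end"]) if "end" in pool else last
    if start is None or end is None:
        raise ValueError("a pool needs a subnet, or both start and end")
    current = start
    while current <= end:
        yield current
        current += 1


def allocate(spec, claim_name, allocated, rng=random):
    """Pick a free address for claim_name and return the IPAddress fields."""
    pre = spec.get("preAllocations", {})
    reserved = {ipaddress.ip_address(a) for a in pre.values()}
    taken = {ipaddress.ip_address(a) for a in allocated}
    candidates = [(addr, pool) for pool in spec["pools"]
                  for addr in candidate_addresses(pool) if addr not in taken]
    if claim_name in pre:
        # Assumption: a pre-allocated claim gets its reserved address.
        wanted = ipaddress.ip_address(pre[claim_name])
        candidates = [(a, p) for a, p in candidates if a == wanted]
    else:
        candidates = [(a, p) for a, p in candidates if a not in reserved]
    if not candidates:
        raise RuntimeError("no free address left in the pool")
    address, pool = rng.choice(candidates)
    return {
        "name": f"{spec['namePrefix']}-{str(address).replace('.', '-')}",
        "address": str(address),
        # Per-pool values override the IPPool defaults.
        "prefix": pool.get("prefix", spec.get("prefix")),
        "gateway": pool.get("gateway", spec.get("gateway")),
    }


# Trimmed-down pool for illustration.
example_spec = {
    "namePrefix": "test1-prov",
    "pools": [{"start": "192.168.0.10", "end": "192.168.0.12",
               "prefix": 25, "gateway": "192.168.0.1"}],
    "prefix": 24,
    "gateway": "192.168.1.1",
    "preAllocations": {"claim1": "192.168.0.12"},
}
print(allocate(example_spec, "claim1", allocated=[]))
print(allocate(example_spec, "other-claim", allocated=["192.168.0.10"]))
```

In this example the second claim can only receive 192.168.0.11, because 192.168.0.10 is already allocated and 192.168.0.12 is reserved for claim1.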

Deploy IPAM

Deploy the IPAM CRDs and controllers by running the Makefile target from inside the cloned IPAM git repo.

    make deploy

Run locally

Run the IPAM controller locally

    kubectl scale -n capm3-system deployment.v1.apps/metal3-ipam-controller-manager \
      --replicas 0
    make run

Deploy an example pool

    make deploy-examples

Delete the example pool

    make delete-examples

Deletion

When deleting an IPClaim object, the controller will simply delete the associated IPAddress object. Once all IPAddress objects have been deleted, the IPPool object can be deleted. Before that point, the finalizer in the IPPool object will block the deletion.

References

  1. IPAM.
  2. IPAM deployment workflow.
  3. Custom resource (CR) examples in metal3-dev-env, in the templates.

Trying Metal3 on a development environment

Ready to start taking steps towards your first experience with metal3? Follow these commands to get started!


1. Environment Setup

info: “Naming” For the v1alpha3 release, the Cluster API provider for Metal3 was renamed from Cluster API provider BareMetal (CAPBM) to Cluster API provider Metal3 (CAPM3). Hence, from v1alpha3 onwards it is Cluster API provider Metal3.

1.1. Prerequisites

  • System with CentOS 9 Stream or Ubuntu 22.04
  • Bare metal preferred, as we will be creating VMs to emulate bare metal hosts
  • Run as a user with passwordless sudo access
  • Minimum resource requirements for the host machine: 4 CPU cores, 16 GB RAM

For execution with VMs

  • Setup passwordless sudo access
  sudo visudo
  • Include this line at the end of the sudoers file
  username  ALL=(ALL) NOPASSWD: ALL
  • Save and exit
  • Manually enable nested virtualization if you don’t have it enabled in your VM
  # To enable nested virtualization
  # On CentOS Stream 9 (other distros may vary)
  # check the current setting
  $ sudo cat /sys/module/kvm_intel/parameters/nested
  N     # disabled

  $ sudo vi /etc/modprobe.d/kvm.conf
  # uncomment one of the lines below
  # for Intel CPU, select [kvm_intel], for AMD CPU, select [kvm_amd]

  options kvm_intel nested=1
  #options kvm_amd nested=1

  # unload
  $ sudo modprobe -r kvm_intel

  # reload
  $ sudo modprobe kvm_intel

  $ sudo cat /sys/module/kvm_intel/parameters/nested
  Y     # just enabled

1.2. Setup

info: “Information” If you need detailed information regarding the process of creating a Metal³ emulated environment using metal3-dev-env, it is worth taking a look at the blog post “A detailed walkthrough of the Metal³ development environment”.

This is a high-level architecture of the Metal³-dev-env. Note that for an Ubuntu-based setup, either Kind or Minikube can be used to instantiate an ephemeral cluster, while for a CentOS-based setup, only Minikube is currently supported. The ephemeral cluster creation tool can be manipulated with the EPHEMERAL_CLUSTER environment variable.

metal3-dev-env image

The short version is: clone metal³-dev-env and run

 make

The Makefile runs a series of scripts, described here:

  • 01_prepare_host.sh - Installs all needed packages.

  • 02_configure_host.sh - Creates a set of VMs that will be managed as if they were bare metal hosts. It also downloads some images needed for Ironic.

  • 03_launch_mgmt_cluster.sh - Launches a management cluster using minikube or kind and runs the baremetal-operator on that cluster.

  • 04_verify.sh - Runs a set of tests that verify that the deployment was completed successfully.

When the environment setup is completed, you should be able to see the BareMetalHost (bmh) objects in the Ready state.

1.3. Tear Down

To tear down the environment, run

 make clean

info “Note” When redeploying metal³-dev-env with a different release version of CAPM3, you must set the FORCE_REPO_UPDATE variable in config_${user}.sh to true.

warning “Warning” If you see this error during the installation:

error: failed to connect to the hypervisor \
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock':  Permission denied

You may need to log out then log in again, and run make clean and make again.

1.4. Using Custom Image

If you want to run target cluster Nodes with your own image, you can override the following three variables: IMAGE_NAME, IMAGE_LOCATION, IMAGE_USERNAME. If the image named IMAGE_NAME does not exist in the IRONIC_IMAGE_DIR (/opt/metal3-dev-env/ironic/html/images) folder, it is automatically downloaded from the configured IMAGE_LOCATION.

1.5. Setting environment variables

info “Environment variables” More information about the specific environment variables used to set up metal3-dev-env can be found here.

To set environment variables persistently, export them from the configuration file used by metal³-dev-env scripts:

 cp config_example.sh config_$(whoami).sh
 vim config_$(whoami).sh

2. Working with the Development Environment

2.1. BareMetalHosts

This environment creates a set of VMs to manage as if they were bare metal hosts.

There are two different host OSs that the metal3-dev-env setup process is tested on.

  1. Host VM/Server on CentOS, while the target can be Ubuntu or CentOS, Cirros, or FCOS.
  2. Host VM/Server on Ubuntu, while the target can be Ubuntu or CentOS, Cirros, or FCOS.

The way the k8s cluster runs differs between the two scenarios. On CentOS, a minikube cluster is used as the source cluster; on Ubuntu, a kind cluster is created. As such, when the OS of the host (where the make command was issued) is CentOS, there should be three libvirt VMs, one of which is the minikube VM.

When the host OS is Ubuntu, the k8s source cluster is created using kind, so the minikube VM won’t be present.

The EPHEMERAL_CLUSTER environment variable configures which tool is used to create the source k8s cluster. By default, EPHEMERAL_CLUSTER is set to build a minikube cluster on a CentOS host and a kind cluster on an Ubuntu host.

VMs can be listed using the virsh CLI tool.

In case the EPHEMERAL_CLUSTER environment variable is set to kind the list of running virtual machines will look like this:

$ sudo virsh list
 Id    Name       State
--------------------------
 1     node_0     running
 2     node_1     running

In case the EPHEMERAL_CLUSTER environment variable is set to minikube the list of running virtual machines will look like this:

$ sudo virsh list
 Id   Name       State
--------------------------
 1    minikube   running
 2    node_0     running
 3    node_1     running

Each of the VMs (aside from the minikube management cluster VM) is represented by BareMetalHost objects in our management cluster. The yaml definition file used to create these host objects is in ${WORKING_DIR}/bmhosts_crs.yaml.

$ kubectl get baremetalhosts -n metal3 -o wide
NAME     STATUS   STATE       CONSUMER   BMC                                                                                         HARDWARE_PROFILE   ONLINE   ERROR   AGE
node-0   OK       available              ipmi://192.168.111.1:6230                                                                   unknown            true             58m
node-1   OK       available              redfish+http://192.168.111.1:8000/redfish/v1/Systems/492fcbab-4a79-40d7-8fea-a7835a05ef4a   unknown            true             58m

You can also look at the details of a host, including the hardware information gathered by doing pre-deployment introspection.

$ kubectl get baremetalhost -n metal3 -o yaml node-0


apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"metal3.io/v1alpha1","kind":"BareMetalHost","metadata":{"annotations":{},"name":"node-0","namespace":"metal3"},"spec":{"bmc":{"address":"ipmi://192.168.111.1:6230","credentialsName":"node-0-bmc-secret"},"bootMACAddress":"00:ee:d0:b8:47:7d","bootMode":"legacy","online":true}}
  creationTimestamp: "2021-07-12T11:04:10Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  name: node-0
  namespace: metal3
  resourceVersion: "3243"
  uid: 3bd8b945-a3e8-43b9-b899-2f869680d28c
spec:
  automatedCleaningMode: metadata
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: node-0-bmc-secret
  bootMACAddress: 00:ee:d0:b8:47:7d
  bootMode: legacy
  online: true
status:
  errorCount: 0
  errorMessage: ""
  goodCredentials:
    credentials:
      name: node-0-bmc-secret
      namespace: metal3
    credentialsVersion: "1789"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2694
      count: 2
      flags:
       - aes
       - apic
       # There are many more flags but they are not listed in this example.
      model: Intel Xeon E3-12xx v2 (Ivy Bridge)
    firmware:
      bios:
        date: 04/01/2014
        vendor: SeaBIOS
        version: 1.13.0-1ubuntu1.1
    hostname: node-0
    nics:
    - ip: 172.22.0.20
      mac: 00:ee:d0:b8:47:7d
      model: 0x1af4 0x0001
      name: enp1s0
      pxe: true
    - ip: fe80::1863:f385:feab:381c%enp1s0
      mac: 00:ee:d0:b8:47:7d
      model: 0x1af4 0x0001
      name: enp1s0
      pxe: true
    - ip: 192.168.111.20
      mac: 00:ee:d0:b8:47:7f
      model: 0x1af4 0x0001
      name: enp2s0
    - ip: fe80::521c:6a5b:f79:9a75%enp2s0
      mac: 00:ee:d0:b8:47:7f
      model: 0x1af4 0x0001
      name: enp2s0
    ramMebibytes: 4096
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 53687091200
      type: HDD
      vendor: QEMU
    systemVendor:
      manufacturer: QEMU
      productName: Standard PC (Q35 + ICH9, 2009)
  hardwareProfile: unknown
  lastUpdated: "2021-07-12T11:08:53Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: "2021-07-12T11:08:23Z"
      start: "2021-07-12T11:04:55Z"
    provision:
      end: null
      start: null
    register:
      end: "2021-07-12T11:04:55Z"
      start: "2021-07-12T11:04:44Z"
  operationalStatus: OK
  poweredOn: true
  provisioning:
    ID: 8effe29b-62fe-4fb6-9327-a3663550e99d
    bootMode: legacy
    image:
      url: ""
    rootDeviceHints:
      deviceName: /dev/sda
    state: ready
  triedCredentials:
    credentials:
      name: node-0-bmc-secret
      namespace: metal3
    credentialsVersion: "1789"

2.2. Provision Cluster and Machines

This section describes how to trigger the provisioning of a cluster and hosts via Machine objects as part of the Cluster API integration. This uses Cluster API v1beta1 and assumes that metal3-dev-env is deployed with the environment variable CAPM3_VERSION set to v1beta1. This is the default behaviour. The v1beta1 deployment can be done with Ubuntu 22.04 or CentOS 9 Stream target host images. Please make sure to meet the resource requirements for a successful deployment.

See support version for more on CAPI compatibility

The following scripts can be used to provision a cluster, controlplane node and worker node.

./tests/scripts/provision/cluster.sh
./tests/scripts/provision/controlplane.sh
./tests/scripts/provision/worker.sh

At this point, the Machine actuator will respond and try to claim a BareMetalHost for this Metal3Machine. You can check the logs of the actuator.

First, check the names of the pods running in the baremetal-operator-system namespace and the output should be something similar to this:

$ kubectl -n baremetal-operator-system get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
baremetal-operator-controller-manager-5fd4fb6c8-c9prs   2/2     Running   0          71m

To get the logs of the actuator, query the logs of the baremetal-operator-controller-manager instance with the following command:

$ kubectl logs -n baremetal-operator-system pod/baremetal-operator-controller-manager-5fd4fb6c8-c9prs -c manager
...
{"level":"info","ts":1642594214.3598707,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"metal3/node-1", "provisioningState":"provisioning","requeue":true,"after":10}
...

Keep in mind that the suffix hashes e.g. 5fd4fb6c8-c9prs are automatically generated and change in case of a different deployment.

If you look at the yaml representation of the Metal3Machine object, you will see a new annotation that identifies which BareMetalHost was chosen to satisfy this Metal3Machine request.

First list the Metal3Machine objects present in the metal3 namespace:

$ kubectl get metal3machines -n metal3
NAME                       PROVIDERID                                      READY   CLUSTER   PHASE
test1-controlplane-jjd9l   metal3://d4848820-55fd-410a-b902-5b2122dd206c   true    test1
test1-workers-bx4wp        metal3://ee337588-be96-4d5b-95b9-b7375969debd   true    test1

Based on the name of the Metal3Machine objects you can check the yaml representation of the object and see from its annotation which BareMetalHost was chosen.

$ kubectl get metal3machine test1-workers-bx4wp -n metal3 -o yaml
...
  annotations:
    metal3.io/BareMetalHost: metal3/node-1
...

You can also see in the list of BareMetalHosts that the hosts are now provisioned and each is associated with a Metal3Machine by looking at the CONSUMER output column of the following command:

$ kubectl get baremetalhosts -n metal3
NAME     STATE         CONSUMER                   ONLINE   ERROR   AGE
node-0   provisioned   test1-controlplane-jjd9l   true             122m
node-1   provisioned   test1-workers-bx4wp        true             122m
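Columns such as ERROR can be empty in this output, so splitting rows on whitespace would shift the values. A small Python sketch of one way around this, assuming the table text has been captured from the command above: it slices each row at the offsets where the header words start. The function name parse_kubectl_table is hypothetical.

```python
import re


def parse_kubectl_table(text):
    """Parse fixed-width kubectl output into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Each column starts where its header word starts.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = ln[start:end].strip()
        rows.append(row)
    return rows


# Output copied from the command above.
SAMPLE = """\
NAME     STATE         CONSUMER                   ONLINE   ERROR   AGE
node-0   provisioned   test1-controlplane-jjd9l   true             122m
node-1   provisioned   test1-workers-bx4wp        true             122m
"""

hosts = parse_kubectl_table(SAMPLE)
consumers = {h["NAME"]: h["CONSUMER"] for h in hosts}
print(consumers)
```

This relies on the values being aligned under their headers, which holds for the output shown here. For scripting against a live cluster, a structured output format such as kubectl's -o json is usually the more robust choice.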

It is also possible to check which Metal3Machine serves as the infrastructure for the Cluster API Machine objects.

First list the Machine objects:

$ kubectl get machine -n metal3
NAME                     CLUSTER   NODENAME                 PROVIDERID                                      PHASE     AGE   VERSION
test1-6d8cc5965f-wvzms   test1     test1-6d8cc5965f-wvzms   metal3://7f51f14b-7701-436a-85ba-7dbc7315b3cb   Running   53m   v1.22.3
test1-nphjx              test1     test1-nphjx              metal3://14fbcd25-4d09-4aca-9628-a789ba3e175c   Running   55m   v1.22.3

Next, check what serves as the infrastructure backend for a Machine object, e.g. test1-6d8cc5965f-wvzms:

$ kubectl get machine test1-6d8cc5965f-wvzms -n metal3 -o yaml
...
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Machine
    name: test1-workers-bx4wp
    namespace: metal3
    uid: 39362b32-ebb7-4117-9919-67510ceb177f
...

Based on the result of the query, the test1-6d8cc5965f-wvzms Cluster API Machine object is backed by the test1-workers-bx4wp Metal3Machine object.
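The two lookups (Machine to Metal3Machine via infrastructureRef, then Metal3Machine to BareMetalHost via the annotation) can be chained. Below is an illustrative Python sketch that works on already-parsed manifests represented as plain dicts. The function name backing_host and the dict layout of the lookup table are assumptions for the example; the field names and values are the ones shown in the YAML excerpts above.

```python
def backing_host(machine, metal3machines):
    """Follow Machine -> Metal3Machine -> BareMetalHost.

    metal3machines maps (namespace, name) to a parsed Metal3Machine manifest.
    Returns (metal3machine_name, host_namespace, host_name).
    """
    ref = machine["spec"]["infrastructureRef"]
    if ref["kind"] != "Metal3Machine":
        raise ValueError("unexpected infrastructure kind: " + ref["kind"])
    m3m = metal3machines[(ref["namespace"], ref["name"])]
    # The annotation value has the form "<namespace>/<host name>".
    annotation = m3m["metadata"]["annotations"]["metal3.io/BareMetalHost"]
    host_namespace, host_name = annotation.split("/", 1)
    return ref["name"], host_namespace, host_name


# Values taken from the command outputs above.
machine = {
    "metadata": {"name": "test1-6d8cc5965f-wvzms", "namespace": "metal3"},
    "spec": {
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "Metal3Machine",
            "name": "test1-workers-bx4wp",
            "namespace": "metal3",
        }
    },
}
metal3machines = {
    ("metal3", "test1-workers-bx4wp"): {
        "metadata": {"annotations": {"metal3.io/BareMetalHost": "metal3/node-1"}}
    }
}

result = backing_host(machine, metal3machines)
print(result)
```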

You should be able to ssh into your host once provisioning is completed. The default username for both CentOS and Ubuntu images is metal3. For the IP address, you can use either the API endpoint IP of the target cluster, which is 192.168.111.249 by default, or the predictable IP address of the first master node, 192.168.111.100.

 ssh metal3@192.168.111.249

2.3. Deprovision Cluster and Machines

Deprovisioning of the target cluster is done simply by deleting the Cluster and Machine objects, or by executing the de-provisioning scripts in the reverse order of provisioning:

./tests/scripts/deprovision/worker.sh
./tests/scripts/deprovision/controlplane.sh
./tests/scripts/deprovision/cluster.sh

Note that you can easily de-provision worker Nodes by decreasing the number of replicas in the MachineDeployment object created when executing the provision/worker.sh script:

kubectl scale machinedeployment test1 -n metal3 --replicas=0

Warning: The control plane and the cluster are tightly coupled. This means that you cannot de-provision the control plane of a cluster and then provision a new one within the same cluster. Therefore, if you want to de-provision the control plane, you need to de-provision the cluster as well and provision both again.

The de-provisioning can also be done manually by deleting the relevant Custom Resources (CRs), as shown below.

The order of deletion is:

  1. Machine objects of the workers
  2. Metal3Machine objects of the workers
  3. Machine objects of the control plane
  4. Metal3Machine objects of the control plane
  5. The cluster object

An additional detail is that the Machine object test1-workers-bx4wp is controlled by the test1 MachineDeployment object. To avoid reprovisioning of the Machine object, the MachineDeployment has to be deleted instead of the Machine object in the case of test1-workers-bx4wp.
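The ordering rules can be expressed as a small planning function. This Python sketch is only an illustration of the order listed above plus the MachineDeployment substitution. The names deletion_plan and RANK, the role and owner keys, and the assignment of test1-nphjx to the control plane are assumptions made for the example.

```python
# Deletion rank per (kind, role), following the numbered list above.
RANK = {
    ("Machine", "worker"): 1,
    ("Metal3Machine", "worker"): 2,
    ("Machine", "controlplane"): 3,
    ("Metal3Machine", "controlplane"): 4,
    ("Cluster", None): 5,
}


def deletion_plan(objects):
    """Return (kind, name) pairs in the order they should be deleted."""
    plan = []
    for obj in sorted(objects, key=lambda o: RANK[(o["kind"], o.get("role"))]):
        owner = obj.get("owner")
        if obj["kind"] == "Machine" and owner:
            # A Machine owned by a MachineDeployment would be recreated,
            # so the MachineDeployment is deleted in its place.
            target = ("MachineDeployment", owner)
        else:
            target = (obj["kind"], obj["name"])
        if target not in plan:
            plan.append(target)
    return plan


objects = [
    {"kind": "Cluster", "name": "test1"},
    {"kind": "Machine", "name": "test1-nphjx", "role": "controlplane"},
    {"kind": "Machine", "name": "test1-6d8cc5965f-wvzms", "role": "worker",
     "owner": "test1"},
]

plan = deletion_plan(objects)
for kind, name in plan:
    print(kind, name)
```

Metal3Machine objects are left out of the sample input because they are normally removed automatically together with their Machine or MachineDeployment.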

$ # By deleting the Machine or MachineDeployment object the related Metal3Machine object(s) should be deleted automatically.


$ kubectl delete machinedeployment test1 -n metal3
machinedeployment.cluster.x-k8s.io "test1" deleted


$ # The output 'machinedeployment.cluster.x-k8s.io "test1" deleted' is visible almost instantly, but that does not mean that the related Machine
$ # object(s) have been deleted right away. After the deletion command is issued, the Machine object(s) enter a "Deleting" state and can stay in that state for minutes
$ # before they are fully deleted.


$ kubectl delete machine test1-m77bn -n metal3
machine.cluster.x-k8s.io "test1-m77bn" deleted


$ # When a Machine object is deleted directly and not by deleting a MachineDeployment, the output 'machine.cluster.x-k8s.io "test1-m77bn" deleted' is only shown once the Machine and the
$ # related Metal3Machine object have been fully removed from the cluster. The deletion can take a few minutes, during which the command line is blocked.


$ kubectl delete cluster test1 -n metal3
cluster.cluster.x-k8s.io "test1" deleted

Once the deletion has finished, you can see that the BareMetalHosts are offline and the Cluster object is no longer present:

$ kubectl get baremetalhosts -n metal3
NAME     STATE       CONSUMER   ONLINE   ERROR   AGE
node-0   available              false            160m
node-1   available              false            160m


$ kubectl get cluster -n metal3
No resources found in metal3 namespace.

2.4. Running Custom Baremetal-Operator

The baremetal-operator comes up running in the cluster by default, using an image built from the metal3-io/baremetal-operator repository. If you’d like to test changes to the baremetal-operator, you can follow this process.

First, you must scale down the deployment of the baremetal-operator running in the cluster.

kubectl scale deployment baremetal-operator-controller-manager -n baremetal-operator-system --replicas=0

To be able to run baremetal-operator locally, you need to install operator-sdk. After that, you can run the baremetal-operator including any custom changes.

cd ~/go/src/github.com/metal3-io/baremetal-operator
make run

2.5. Running Custom Cluster API Provider Metal3

There are two Cluster API-related managers running in the cluster. One includes a set of generic controllers, and the other includes a custom Machine controller for Metal3.

Tilt development environment

Tilt setup can deploy CAPM3 in a local kind cluster. Since Tilt is applied in the metal3-dev-env deployment, you can make changes inside the cluster-api-provider-metal3 folder and Tilt will deploy the changes automatically. If you deployed CAPM3 separately and want to make changes to it, then follow CAPM3 instructions. This will save you from having to build all of the images for CAPI, which can take a while. If the scope of your development will span both CAPM3 and CAPI, then follow the CAPI and CAPM3 instructions.

2.6. Accessing Ironic API

Sometimes you may want to look directly at Ironic to debug something. The metal3-dev-env repository contains a clouds.yaml file with connection settings for Ironic.

Metal3-dev-env will install the unified OpenStack and standalone OpenStack Ironic command-line clients on the provisioning host as part of setting up the cluster.

Note that currently, you can use either a unified OpenStack client or an Ironic client. In this example, we are using an Ironic client to interact with the Ironic API.

Please make sure to export the CONTAINER_RUNTIME environment variable before you execute the commands.

Example:

[notstack@metal3 metal3-dev-env]$ export CONTAINER_RUNTIME=docker
[notstack@metal3 metal3-dev-env]$ baremetal node list
+--------------------------------------+---------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name          | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+---------------+--------------------------------------+-------------+--------------------+-------------+
| b423ee9c-66d8-48dd-bd6f-656b93140504 | metal3~node-1 | 7f51f14b-7701-436a-85ba-7dbc7315b3cb | power off   | available          | False       |
| 882533c5-2f14-49f6-aa44-517e1e404fd8 | metal3~node-0 | 14fbcd25-4d09-4aca-9628-a789ba3e175c | power off   | available          | False       |
+--------------------------------------+---------------+--------------------------------------+-------------+--------------------+-------------+

To view a particular node’s details, run the command below. The last_error, maintenance_reason, and provision_state fields are useful for troubleshooting why a node did not deploy.

[notstack@metal3 metal3-dev-env]$ baremetal node show b423ee9c-66d8-48dd-bd6f-656b93140504
+------------------------+------------------------------------------------------------+
| Field                  | Value                                                      |
+------------------------+------------------------------------------------------------+
| allocation_uuid        | None                                                       |
| automated_clean        | True                                                       |
| bios_interface         | redfish                                                    |
| boot_interface         | ipxe                                                       |
| chassis_uuid           | None                                                       |
| clean_step             | {}                                                         |
| conductor              | 172.22.0.2                                                 |
| conductor_group        |                                                            |
| console_enabled        | False                                                      |
| console_interface      | no-console                                                 |
| created_at             | 2022-01-19T10:56:06+00:00                                  |
| deploy_interface       | direct                                                     |
| deploy_step            | {}                                                         |
| description            | None                                                       |
| driver                 | redfish                                                    |
| driver_info            | {u'deploy_kernel': u'http://172.22.0.2:6180/images/ironic-python-agent.kernel', u'deploy_ramdisk': u'http://172.22.0.2:6180/images/ironic-python-agent.initramfs', u'redfish_address': u'http://192.168.111.1:8000', u'redfish_password': u'******', u'redfish_system_id': u'/redfish/v1/Systems/492fcbab-4a79-40d7-8fea-a7835a05ef4a', u'redfish_username': u'admin', u'force_persistent_boot_device': u'Default'} |
| driver_internal_info   | {u'last_power_state_change': u'2022-01-19T13:04:01.981882', u'agent_version': u'8.3.1.dev2', u'agent_last_heartbeat': u'2022-01-19T13:03:51.874842', u'clean_steps': None, u'agent_erase_devices_iterations': 1, u'agent_erase_devices_zeroize': True, u'agent_continue_if_secure_erase_failed': False, u'agent_continue_if_ata_erase_failed': False, u'agent_enable_nvme_secure_erase': True, u'disk_erasure_concurrency': 1, u'agent_erase_skip_read_only': False, u'hardware_manager_version': {u'generic_hardware_manager': u'1.1'}, u'agent_cached_clean_steps_refreshed': u'2022-01-19 13:03:47.558697', u'deploy_steps': None, u'agent_cached_deploy_steps_refreshed': u'2022-01-19 12:09:34.731244'} |
| extra                  | {}                                                         |
| fault                  | None                                                       |
| inspect_interface      | agent                                                      |
| inspection_finished_at | None                                                       |
| inspection_started_at  | 2022-01-19T10:56:17+00:00                                  |
| instance_info          | {u'capabilities': {}, u'image_source': u'http://172.22.0.1/images/CENTOS_8_NODE_IMAGE_K8S_v1.22.3-raw.img', u'image_os_hash_algo': u'md5', u'image_os_hash_value': u'http://172.22.0.1/images/CENTOS_8_NODE_IMAGE_K8S_v1.22.3-raw.img.md5sum', u'image_checksum': u'http://172.22.0.1/images/CENTOS_8_NODE_IMAGE_K8S_v1.22.3-raw.img.md5sum', u'image_disk_format': u'raw'} |
| instance_uuid          | None                                                       |
| last_error             | None                                                       |
| lessee                 | None                                                       |
| maintenance            | False                                                      |
| maintenance_reason     | None                                                       |
| management_interface   | redfish                                                    |
| name                   | metal3~node-1                                              |
| network_data           | {}                                                         |
| network_interface      | noop                                                       |
| owner                  | None                                                       |
| power_interface        | redfish                                                    |
| power_state            | power off                                                  |
| properties             | {u'capabilities': u'cpu_vt:true,cpu_aes:true,cpu_hugepages:true,boot_mode:bios', u'vendor': u'Sushy Emulator', u'local_gb': u'50', u'cpus': u'2', u'cpu_arch': u'x86_64', u'memory_mb': u'4096', u'root_device': {u'name': u's== /dev/sda'}}                                                                                                                                                                                        |
| protected              | False                                                      |
| protected_reason       | None                                                       |
| provision_state        | available                                                  |
| provision_updated_at   | 2022-01-19T13:03:52+00:00                                  |
| raid_config            | {}                                                         |
| raid_interface         | no-raid                                                    |
| rescue_interface       | no-rescue                                                  |
| reservation            | None                                                       |
| resource_class         | None                                                       |
| retired                | False                                                      |
| retired_reason         | None                                                       |
| storage_interface      | noop                                                       |
| target_power_state     | None                                                       |
| target_provision_state | None                                                       |
| target_raid_config     | {}                                                         |
| traits                 | []                                                         |
| updated_at             | 2022-01-19T13:04:03+00:00                                  |
| uuid                   | b423ee9c-66d8-48dd-bd6f-656b93140504                       |
| vendor_interface       | redfish                                                    |
+------------------------+------------------------------------------------------------+
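The troubleshooting fields can be pulled out of this table output programmatically. The sketch below is one possible approach in Python, assuming the text of baremetal node show has been captured. The function name parse_field_table is hypothetical, and the parser assumes that values do not contain the "|" character, which holds for the output shown here.

```python
TROUBLESHOOTING_FIELDS = ("last_error", "maintenance_reason", "provision_state")


def parse_field_table(text):
    """Turn a two-column '| Field | Value |' table into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        key, _, value = line.strip("|").partition("|")
        key, value = key.strip(), value.strip()
        if key and key != "Field":  # skip the header row
            fields[key] = value
    return fields


# A few rows copied from the output above.
SAMPLE = """\
+------------------------+------------------------------------------------------------+
| Field                  | Value                                                      |
+------------------------+------------------------------------------------------------+
| last_error             | None                                                       |
| maintenance            | False                                                      |
| maintenance_reason     | None                                                       |
| power_state            | power off                                                  |
| provision_state        | available                                                  |
+------------------------+------------------------------------------------------------+
"""

node = parse_field_table(SAMPLE)
summary = {name: node.get(name) for name in TROUBLESHOOTING_FIELDS}
print(summary)
```

All values are kept as strings, so a literal None in the table becomes the string "None" rather than Python's None.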

Supported release versions

The Cluster API Provider Metal3 (CAPM3) team maintains the two most recent minor releases; older minor releases are immediately unsupported when a new major/minor release is available. Test coverage will be maintained for all supported minor releases and for one additional release for the current API version in case we have to do an emergency patch release. For example, if v1.6 and v1.7 are currently supported, we will also maintain test coverage for v1.5 for one additional release cycle. When v1.8 is released, tests for v1.5 will be removed.
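The policy can be written down as a small function. This is a sketch in Python under the assumptions stated in the paragraph above (two supported minor releases plus one that is still tested). The function name release_status and its parameters are hypothetical and not part of any Metal³ tooling.

```python
def release_status(minors, supported=2, tested=1):
    """Map minor releases (e.g. "v1.7") to Supported, Tested or EOL."""
    def version_key(minor):
        return tuple(int(part) for part in minor.lstrip("v").split("."))

    status = {}
    # Newest release first, so the index decides the status.
    for index, minor in enumerate(sorted(minors, key=version_key, reverse=True)):
        if index < supported:
            status[minor] = "Supported"
        elif index < supported + tested:
            status[minor] = "Tested"
        else:
            status[minor] = "EOL"
    return status


current = release_status(["v1.4", "v1.5", "v1.6", "v1.7"])
after_next = release_status(["v1.4", "v1.5", "v1.6", "v1.7", "v1.8"])
print(current)
print(after_next)
```

With v1.7 as the newest release, v1.5 is still tested. Once v1.8 is added, v1.6 takes the tested slot and v1.5 becomes EOL, matching the example in the text. The supported parameter is configurable because other components may keep a different number of releases supported.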

Currently, in the Metal³ organization, only CAPM3 and IPAM follow CAPI release cycles. The supported versions (excluding release candidates) for CAPM3 and IPAM releases are as follows:

Cluster API Provider Metal3

Minor release   API version   Status
v1.7            v1beta1       Supported
v1.6            v1beta1       Supported
v1.5            v1beta1       Tested
v1.4            v1beta1       EOL
v1.3            v1beta1       EOL
v1.2            v1beta1       EOL
v1.1            v1beta1       EOL

IP Address Manager

Minor release   API version   Status
v1.7            v1beta1       Supported
v1.6            v1beta1       Supported
v1.5            v1beta1       Tested
v1.4            v1beta1       EOL
v1.3            v1beta1       EOL
v1.2            v1beta1       EOL
v1.1            v1beta1       EOL

The compatibility of IPAM and CAPM3 API versions with CAPI is discussed here.

Baremetal Operator

Since capm3-v1.1.2, BMO follows the semantic versioning scheme for its own release cycle, the same way as CAPM3 and IPAM. Currently, we have the release-0.6, release-0.5 and release-0.4 release branches for the v0.6.x, v0.5.x and v0.4.x release cycles respectively, and two branches are maintained as supported releases. The following table summarizes the BMO release/test process:

Minor release   Status
v0.6            Supported
v0.5            Supported
v0.4            Tested
v0.3            EOL
v0.2            EOL
v0.1            EOL

Ironic-image

Since v23.1.0, Ironic follows the semantic versioning scheme for its own release cycle, the same way as CAPM3 and IPAM. Currently, we have the release-25.0, release-24.1, release-24.0 and release-23.1 release branches for the v25.0.x, v24.1.x, v24.0.x and v23.1.x release cycles respectively, and two or three branches are maintained as supported releases. The following table summarizes the Ironic-image release/test process:

Minor release   Status
v25.0           Supported
v24.1           Supported
v24.0           Supported
v23.1           Tested

Image tags

The Metal³ team provides container images for all the main projects and also for many auxiliary tools that are needed for tests or are otherwise useful. Some of these images are tagged in a way that makes it easy to identify which version of Cluster API Provider Metal³ they were tested with. For example, we tag MariaDB container images with tags like capm3-v1.7.0, where v1.7.0 is the CAPM3 release it was tested with.

All container images are published through the Metal³ organization in Quay. Here are some examples:

  • quay.io/metal3-io/cluster-api-provider-metal3:v1.7.0
  • quay.io/metal3-io/baremetal-operator:v0.6.0
  • quay.io/metal3-io/ip-address-manager:v1.7.0
  • quay.io/metal3-io/ironic:v24.1.1
  • quay.io/metal3-io/mariadb:capm3-v1.7.0
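The tagging convention can be decoded mechanically. The Python sketch below splits an image reference into repository and tag and extracts the CAPM3 version from tags with the capm3- prefix. The function name parse_image is hypothetical, and the sketch assumes that the reference always carries a tag, as in the examples above.

```python
CAPM3_PREFIX = "capm3-"


def parse_image(reference):
    """Split 'repository:tag' and report the CAPM3 version a tag refers to."""
    repository, _, tag = reference.rpartition(":")
    capm3_version = tag[len(CAPM3_PREFIX):] if tag.startswith(CAPM3_PREFIX) else None
    return {"repository": repository, "tag": tag, "capm3_version": capm3_version}


images = [
    "quay.io/metal3-io/cluster-api-provider-metal3:v1.7.0",
    "quay.io/metal3-io/ironic:v24.1.1",
    "quay.io/metal3-io/mariadb:capm3-v1.7.0",
]

parsed = [parse_image(image) for image in images]
for item in parsed:
    print(item["repository"], item["tag"], item["capm3_version"])
```

Images that follow their own versioning, such as ironic, yield None for capm3_version because the tag says nothing about the CAPM3 release.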

CI Test Matrix

The table describes which branches/image-tags are tested in each periodic CI test:

INTEGRATION TESTS                                                CAPM3 branch   IPAM branch   BMO branch/tag   Keepalived tag   MariaDB tag   Ironic tag
metal3-periodic-ubuntu/centos-e2e-integration-test-main          main           main          main             latest           latest        latest
metal3_periodic_main_integration_test_ubuntu/centos              main           main          main             latest           latest        latest
metal3-periodic-ubuntu/centos-e2e-integration-test-release-1-7   release-1.7    release-1.7   release-0.6      v0.6.1           latest        v24.1.1
metal3-periodic-ubuntu/centos-e2e-integration-test-release-1-6   release-1.6    release-1.6   release-0.5      v0.5.0           latest        v24.0.0
metal3-periodic-ubuntu/centos-e2e-integration-test-release-1-5   release-1.5    release-1.5   release-0.5      v0.5.0           latest        v23.1.0

FEATURE AND E2E TESTS                                            CAPM3 branch   IPAM branch   BMO branch/tag   Keepalived tag   MariaDB tag   Ironic tag
metal3-periodic-ubuntu/centos-e2e-feature-test-main              main           main          main             latest           latest        latest
metal3-periodic-ubuntu/centos-e2e-feature-test-release-1-7       release-1.7    release-1.7   release-0.6      v0.6.1           latest        v24.1.1
metal3-periodic-ubuntu/centos-e2e-feature-test-release-1-6       release-1.6    release-1.6   release-0.5      v0.5.0           latest        v24.0.0
metal3-periodic-ubuntu/centos-e2e-feature-test-release-1-5       release-1.5    release-1.5   release-0.4      v0.4.0           latest        v23.1.0

EPHEMERAL TESTS                                                  CAPM3 branch   IPAM branch   BMO branch/tag   Keepalived tag   MariaDB tag   Ironic tag
metal3_periodic_e2e_ephemeral_test_centos                        main           main          main             latest           latest        latest

All tests use latest images of VBMC and sushy-tools.

Metal3-io security policy

This document explains the general security policy for the whole project, so it applies to all of its active repositories. This file has to be referenced in each repository’s SECURITY_CONTACTS file.

Way to report a security issue

The Metal3 Community asks that all suspected vulnerabilities be disclosed by reporting them to the metal3-security@googlegroups.com mailing list, which will forward the vulnerability report to the Metal3 security committee.

Security issue handling, severity categorization, fix process organization

The actions listed below should be completed within 7 days of the security issue’s disclosure on the metal3-security@googlegroups.com mailing list.

The Security Lead (SL) of the Metal3 Security Committee (M3SC) is tasked with reviewing the security issue disclosure and giving initial feedback to the reporter as soon as possible. Any disclosed security issue will be visible to all M3SC members.

For each reported vulnerability, the SL will work quickly to identify committee members who are able to work on a fix and CC those developers into the disclosure thread. These selected developers are the Fix Team. The Fix Team is also allowed to invite additional developers into the disclosure thread based on the repo’s OWNERS file. They will then also become members of the Fix Team but not the M3SC.

M3SC members are encouraged to volunteer for the Fix Teams even before the SL contacts them, if they think they are ready to work on the issue. M3SC members are also encouraged to correct both the SL and each other on the disclosure threads if they find mistakes while reading them, even if they have not been selected for the Fix Team.

The Fix Team will start working on the fix either on a private fork of the affected repo or in the public repo, depending on the severity of the issue and the decision of the SL. Based on the issue’s severity level (discussed later in this document), the SL makes the final call about whether the issue can be fixed publicly or should stay on a private fork until the fix is disclosed.

The SL and the Fix Team will create a CVSS score using the CVSS Calculator. The SL makes the final call on the calculated risk.

If the CVSS score is under ~4.0 (a low severity score) or the assessed risk is low, the Fix Team can decide to slow the release process down in the face of holidays, developer bandwidth, etc. These decisions must be discussed on the metal3-security@googlegroups.com mailing list.

If the CVSS score is under ~7.0 (a medium severity score), the SL may choose to carry out the fix semi-publicly. Semi-publicly means that PRs are made directly in the public Metal3-io repositories, while restricting discussion of the security aspects to private channels. The SL will make the determination whether there would be user harm in handling the fix publicly that outweighs the benefits of open engagement with the community.

If the CVSS score is over ~7.0 (high severity score), fixes will typically receive an out-of-band release.

More information can be found about severity scores here.

Note: CVSS is convenient but imperfect. Ultimately, the SL has discretion on classifying the severity of a vulnerability.

No matter the CVSS score, if the vulnerability requires User Interaction, or otherwise has a straightforward, non-disruptive mitigation, the SL may choose to disclose the vulnerability before a fix is developed if they determine that users would be better off being warned against a specific interaction.
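The score bands described above can be summarized in code. This Python sketch is illustrative only: the thresholds are approximate in the policy (marked with "~"), a score of exactly 7.0 is treated as high here by assumption, and the SL retains discretion over the final classification. The function name handling_for and the returned keys are hypothetical.

```python
def handling_for(score, simple_mitigation=False):
    """Suggest the default handling for a CVSS score between 0.0 and 10.0."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score < 4.0:
        severity = "low"
        handling = "release may be slowed down after discussion on the security list"
    elif score < 7.0:
        severity = "medium"
        handling = "SL may choose a semi-public fix"
    else:
        severity = "high"
        handling = "fix typically receives an out-of-band release"
    return {
        "severity": severity,
        "handling": handling,
        # Independent of the score: requires User Interaction or has a
        # straightforward, non-disruptive mitigation.
        "early_disclosure_possible": simple_mitigation,
    }


for example in (3.1, 5.4, 9.8):
    print(example, handling_for(example)["severity"])
```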

Fix Disclosure Process

With the Fix Development underway the SL needs to come up with an overall communication plan for the wider community. This Disclosure process should begin after the Fix Team has developed a Fix or mitigation so that a realistic timeline can be communicated to users. Emergency releases for critical and high severity issues or fixes for issues already made public may affect the below timelines for how quickly or far in advance notifications will occur.

The SL will lead the process of creating a GitHub security advisory for the repository that is affected by the issue. In case the SL has no administrator privileges, the advisory will be created in cooperation with a repository admin. The SL will have to request a CVE number for the security advisory. As GitHub is a CVE Numbering Authority (CNA), there is an option to either use an existing CVE number or request a new one from GitHub. More about the GitHub security advisory and the CVE numbering process can be found here.

As soon as the SL has decided on a date for the fix disclosure, the original reporter(s) of the security issue have to be notified about the release date and the content of both the fix and the advisory.

If a repository that has a release process requires a high severity fix then the fix has to be released as a patch release for all supported release branches where the fix is relevant as soon as possible.

In case the repository does not have a release process, but it needs a critical fix then the fix has to be merged to the main branch as soon as possible.

In repositories that have a release process, Medium and Low severity vulnerability fixes will be released as part of the next upcoming minor or major release, whichever happens sooner. Simultaneously with the upcoming release, the fix also has to be released to all supported release branches as a patch release if the fix is relevant for the given release.

In case the fix was developed on a private repository, either the SL or someone designated by the SL has to cherry-pick the fix and push it to the public repository. The SL and the Fix Team have to be able to push the PR through the public repo’s review process as soon as possible and merge it.

Metal3 security committee members

Name              GitHub ID   Affiliation
Dmitry Tantsur    dtantsur    Red Hat
Riccardo Pittau   elfosardo   Red Hat
Zane Bitter       zaneb       Red Hat
Kashif Khan       kashifest   Ericsson Software Technology
Lennart Jern      lentzi90    Ericsson Software Technology
Tuomo Tanskanen   tuminoid    Ericsson Software Technology
Adam Rozman       Rozzii      Ericsson Software Technology

Please don’t report any security vulnerability to the committee members directly.

API reference

Bare Metal Operator

Cluster API provider Metal3

IP Address Manager