docs(design): phase-1 docs for layer2 support in CAPP: version 1 #786

Draft
wants to merge 14 commits into
base: main
371 changes: 371 additions & 0 deletions docs/layer2/layer2_support.md
@@ -0,0 +1,371 @@
Motivation/Abstract
===================

By default, all servers created on Equinix Metal via Cluster API use Layer 3 networking. When provisioning an Equinix Metal cluster, there is no option to specify the networking type for instances or to create additional L2 interfaces with specific local IP addresses and VLANs.

To solve this, CAPP should provide options to specify:

- Network type (L2/L3/Hybrid)

- Network interfaces attached to a specific VLAN

- IP address range for L2 interfaces

* * * * *

Limitations
===========

CAPP-managed clusters running without an internet connection would need to pull images from a repository that is also on the Layer2 network, or they would need a bastion host that acts as a gateway and NAT. This isn't supported today, so complete Layer2 will not be supported in the initial phases of the feature.
Member

Noting some additional points. These limitations could be left to the user, to define their image store and routing within the L2 network. Other limitations that are perhaps more core to ClusterAPI and CPEM functioning include the need to interact with Equinix Metal APIs (including Metadata, where userdata scripts are accessed at node startup). There may be clever ways to work around these limitations, but we are intentionally keeping ideation and solutioning around full layer2 modes out of scope to get the direct benefits of networking modes that enable L2 capabilities without removing the default L3 public address capabilities.


* * * * *

Background
==========

User stories
------------

As a **user of Cluster API provider Packet (CAPP)**
I want **to configure L2 interfaces and define my own IP Address range**
so that **machines are able to communicate over layer2 VLAN**


* * * * *

Goals
===============

In Phase 1 of integrating Layer2 support, Cluster API Provider Packet (CAPP) will focus on Bring Your Own (BYO) Infrastructure.
Key objectives for this phase include:
- Implementing Hybrid Bonded and Hybrid Unbonded modes to enhance Layer2 functionality in CAPP.
- Enabling CAPP to attach network ports to specific VLANs or VXLANs.
- Allowing CAPP to configure Layer2 networking at the OS level on a metal node, including creating sub-interfaces and assigning IP addresses.
- Ensuring CAPP can track the lifecycle of available IP addresses from a VRF range.


Non-Goals
===============

- Complete Layer2 will not be supported in the initial phases.

- An IPAM provider will be supported in Phase 2.

Proposal Design/Approach
========================

* * * * *

**Understanding the context and problem space**: The problem space primarily concerns the operating system (OS) and, to a lesser extent, the cluster. Specifically, it concerns how Cluster API Provider Packet (CAPP) clusters and machines are defined in terms of IP addresses, networks, and gateways.
A critical aspect of this space is how CAPP provisions infrastructure, particularly network infrastructure: VLANs, gateways, virtual circuits, and IP address ranges such as elastic IPs or IP reservations. It also involves managing VRFs and attaching these network resources to nodes, ensuring that newly created nodes have ports in a ready state for these attachments. The default approach will be Layer2 networking in hybrid bonded mode, though other configurations may be supported in the future.
This understanding forms the foundation for addressing the technical challenges in provisioning and managing network infrastructure with CAPP.

**Bring Your Own Infrastructure (BYOI)**:

The BYOI approach allows users to leverage their existing infrastructure, such as VLANs, VRFs, Metal Gateways, and similar components.
In this model, users specify the IP ranges to be assigned to metal nodes on VLAN-tagged interfaces. Importantly, CAPP is not responsible for creating or managing this infrastructure; it is assumed to already exist.
However, CAPP needs to be informed of the VLAN ID to attach the network port to the appropriate VLAN using the Equinix Metal (EM) API. This ensures that the network configuration aligns with the pre-existing infrastructure provided by the user.

### Custom Resource Changes:
**PacketMachineTemplate**

To support enhanced layer2 networking capabilities, we propose adding a new Ports field under the spec of the *PacketMachineTemplate*. This field will allow users to define various network port configurations for an Equinix Metal Machine. Below is an outline of the proposed changes:

```go
// PacketMachineSpec defines the desired state of PacketMachine.
type PacketMachineSpec struct {
	// ... existing fields omitted ...

	// Ports lists the port configurations for each Packet machine.
	// +optional
	Ports []Port `json:"ports,omitempty"`
}

// Port describes the configuration of a single network port.
type Port struct {
	// Name of the port, e.g. bond0, eth0, and eth1 for 2-NIC servers.
	Name string `json:"name"`
	// Bonded reports whether the port is bonded. Defaults to true.
	Bonded bool `json:"bonded,omitempty"`
	// Layer2 converts the port to Layer 2. It is false by default on new devices.
	// Changes result in /ports/id/convert/layer-[2|3] API calls.
	Layer2 bool `json:"layer2"`
	// IPAddresses holds the IP address configurations associated with this port.
	// These are typically IP reservations carved out of a VRF.
	IPAddresses []IPAddress `json:"ip_addresses,omitempty"`
}

// IPAddress represents an IP address configuration.
type IPAddress struct {
	// IPAddressReservation is the range to reserve for these cluster nodes,
	// e.g. one carved out of a VRF IP range.
	IPAddressReservation string `json:"ipAddressReservation"`
	// VXLANIDs are the VLANs that the EM API finds by matching VXLAN, project,
	// and metro, and then attaches to the device. The OS userdata template
	// also configures these VLANs on the bond device.
	VXLANIDs []string `json:"vxlan_ids,omitempty"`
	// VLANIDs are the UUIDs of the VLANs to which this port should be assigned.
	// Either VXLANIDs or VLANIDs should be provided.
	VLANIDs []string `json:"vlan_ids,omitempty"`
	// Gateway is the IP address of the gateway.
	Gateway string `json:"gateway,omitempty"`
}
```
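The constraints stated in the field comments could be checked before any API call is made. The following is a minimal sketch, not part of the proposal: `validateIPAddress` is a hypothetical helper, "either VXLANIDs or VLANIDs" is read as "at least one of them", and the requirement that the gateway lies inside the reserved range is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// IPAddress is a trimmed copy of the proposed type, without JSON tags.
type IPAddress struct {
	IPAddressReservation string
	VXLANIDs             []string
	VLANIDs              []string
	Gateway              string
}

// validateIPAddress is a hypothetical helper. It checks that the reservation
// is a CIDR, that at least one VLAN reference is present, and (an assumption
// of this sketch) that the gateway, when set, falls inside the reservation.
func validateIPAddress(a IPAddress) error {
	prefix, err := netip.ParsePrefix(a.IPAddressReservation)
	if err != nil {
		return fmt.Errorf("ipAddressReservation %q is not a CIDR: %w", a.IPAddressReservation, err)
	}
	if len(a.VXLANIDs) == 0 && len(a.VLANIDs) == 0 {
		return errors.New("either vxlan_ids or vlan_ids must be provided")
	}
	if a.Gateway == "" {
		return nil
	}
	gw, err := netip.ParseAddr(a.Gateway)
	if err != nil {
		return fmt.Errorf("gateway %q is not an IP address: %w", a.Gateway, err)
	}
	if !prefix.Contains(gw) {
		return fmt.Errorf("gateway %s is outside %s", gw, prefix)
	}
	return nil
}

func main() {
	valid := IPAddress{
		IPAddressReservation: "192.168.2.0/24",
		VXLANIDs:             []string{"1000"},
		Gateway:              "192.168.2.1",
	}
	fmt.Println("valid:", validateIPAddress(valid))

	missingVLAN := IPAddress{IPAddressReservation: "192.168.2.0/24", Gateway: "192.168.2.1"}
	fmt.Println("missing VLAN:", validateIPAddress(missingVLAN))
}
```

In a real provider this kind of check would more likely live in an admission webhook or in CRD validation markers; a plain function keeps the sketch self-contained.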

For example:
The following example configures the bond0 port of each node in a cluster in hybrid bonded mode, attaches the VLAN with VXLAN ID 1000, and assigns each node an IP address from the range "192.168.2.0/24" with gateway 192.168.2.1.

```yaml
kind: PacketMachineTemplate
metadata:
  name: example-packet-machine-template
spec:
  template:
    spec:
      facility: ny5
      metro: ny
      plan: c3.small.x86
      billingCycle: hourly
      project: your-packet-project-id
      sshKeys:
        - ssh-rsa AAAAB3...your-public-key...
      operatingSystem: ubuntu_20_04
      ports:
        - name: bond0
          layer2: false
          ip_addresses:
            - ipAddressReservation: "192.168.2.0/24"
              vxlan_ids: [1000]
              gateway: "192.168.2.1"
```

The following example configures the eth1 port of each node in a cluster in hybrid unbonded mode: it removes the port from the bond, converts it to Layer 2 mode, attaches the VLAN with VXLAN ID 1001, and assigns each node an IP address from the range "10.50.10.0/24" with gateway 10.50.10.1.

```yaml

kind: PacketMachineTemplate
metadata:
  name: example-packet-machine-template
spec:
  template:
    spec:
      facility: ny5
      metro: ny
      plan: c3.small.x86
      billingCycle: hourly
      project: your-packet-project-id
      sshKeys:
        - ssh-rsa AAAAB3...your-public-key...
      operatingSystem: ubuntu_20_04
      ports:
        - name: eth1
          bonded: false
          layer2: true
          ip_addresses:
            - ipAddressReservation: "10.50.10.0/24"
              vxlan_ids: [1001]
              gateway: "10.50.10.1"

```
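The two examples imply different sequences of Equinix Metal port operations. A rough sketch of how a `Port` could be mapped to those operations is shown here. `plannedOperations` and the operation labels are hypothetical, and `Bonded` is modelled as a pointer (an assumption of this sketch, not of the proposal) so that an omitted value can be told apart from an explicit `false` and default to true.

```go
package main

import "fmt"

// Port is a trimmed copy of the proposed type. Bonded is a *bool here so that
// "not set" (nil) can default to bonded; this differs from the proposal.
type Port struct {
	Name     string
	Bonded   *bool
	Layer2   bool
	VXLANIDs []string
}

// plannedOperations is a hypothetical helper that lists, in order, the port
// operations the examples describe: remove from bond, convert to Layer 2,
// then attach each VLAN.
func plannedOperations(p Port) []string {
	var ops []string
	if p.Bonded != nil && !*p.Bonded {
		ops = append(ops, "disbond")
	}
	if p.Layer2 {
		ops = append(ops, "convert/layer-2")
	}
	for _, id := range p.VXLANIDs {
		ops = append(ops, "assign vxlan "+id)
	}
	return ops
}

func main() {
	unbonded := false

	hybridBonded := Port{Name: "bond0", VXLANIDs: []string{"1000"}}
	hybridUnbonded := Port{Name: "eth1", Bonded: &unbonded, Layer2: true, VXLANIDs: []string{"1001"}}

	fmt.Println(hybridBonded.Name, plannedOperations(hybridBonded))
	fmt.Println(hybridUnbonded.Name, plannedOperations(hybridUnbonded))
}
```

For the hybrid bonded example only the VLAN attachment is planned, while the hybrid unbonded example plans all three steps.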

### APIs:

* * * * *

Following are some of the APIs provided by EM that would be used:
1. **Convert the port to a layer2 port**:

a. https://deploy.equinix.com/developers/api/metal/#tag/Ports/operation/convertLayer2
b. Endpoint: https://api.equinix.com/metal/v1/ports/{id}/convert/layer-2
c. Required params: vnid (VLAN ID)

2. **Assign a port to a virtual network (VLAN)**:

a. https://deploy.equinix.com/developers/api/metal/#tag/Ports/operation/assignPort
b. Endpoint: https://api.equinix.com/metal/v1/ports/{id}/assign
c. Required params: vnid (VLAN ID)
d. Type: POST
e. Batch mode:
```
curl -X POST \
  -H "Content-Type: application/json" \
  -H "X-Auth-Token: <API_TOKEN>" \
  "https://api.equinix.com/metal/v1/ports/{id}/vlan-assignments/batches" \
  -d '{
    "vlan_assignments": [
      {
        "vlan": "string",
        "state": "assigned"
      },
      {
        "vlan": "string",
        "state": "assigned"
      }
    ]
  }'
```

3. **Device Events API**:
a. Endpoint: `https://api.equinix.com/metal/v1/devices/<id>/events`

4. **Remove port from the bond**
a. Endpoint:
```
curl -X POST \
-H "Content-Type: application/json" \
-H "X-Auth-Token: <API_TOKEN>" \
"https://api.equinix.com/metal/v1/ports/{id}/disbond" \
-d '{
"bulk_disable": false
}'
```


### User-Data Script for Network Configuration
To configure the operating system (OS), create new sub-interfaces for handling VLAN-tagged traffic, and assign IP addresses to those sub-interfaces, a user-data script is required to run at the time of OS boot.
Below is the user-data script that would be used.

```sh
#cloud-config
package_update: true
package_upgrade: true
packages:
  - jq
  - vlan

write_files:
  - path: /tmp/final_configuration.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail

      echo "Running final configuration commands"
      apt-get update -qq
      apt-get install -y -qq jq vlan

      # Generate the network configuration and append it to /etc/network/interfaces for each VLAN-tagged sub-interface.
      cat <<EOL >> /etc/network/interfaces
Member

We can write out directly to interfaces.d/something. An example of that here:

https://github.com/equinix-labs/terraform-equinix-metal-hybrid-gateway/blob/main/modules/backend/cloud-config.cfg#L7

Member

Err .. nevermind that example.. It was more involved and manipulated cloud-config.d rather than interfaces.d, it also depended on changing the OS to use netplan:

https://github.com/equinix-labs/terraform-equinix-metal-hybrid-gateway/blob/main/modules/backend/pre-cloud-config.sh

That said, we could write out to an interfaces.d/ file directly. This is just a nitpick and there may be reasons why it is better to take the inline approach you offered, to ensure that the modification is made at the right time relative to other operations.

Author

I really don't have any opinions on /etc/network/interfaces vs /interfaces.d directory 😅
If it seems reasonable to write into /interfaces.d , I will try that out once.

      {{- range .VLANs }}
      auto {{ .PortName }}.{{ .ID }}
      iface {{ .PortName }}.{{ .ID }} inet static
        pre-up sleep 5
        address {{ .IPAddress }}
        netmask {{ .Netmask }}
        gateway {{ .Gateway }}
        vlan-raw-device {{ .PortName }}
      {{- end }}
      EOL

      echo "VLAN configuration appended to /etc/network/interfaces."

      # Function to send user state events
      url="$(curl -sf https://metadata.platformequinix.com/metadata | jq -r .user_state_url)"
      send_user_state_event() {
        local state="$1"
        local code="$2"
        local message="$3"
        local data

        data=$(jq -n --arg state "$state" --arg code "$code" --arg message "$message" \
          '{state: $state, code: ($code | tonumber), message: $message}')

        curl -s -X POST -d "$data" "$url" || echo "Failed to send user state event"
      }

      send_user_state_event running 1000 "Configuring Network"

      systemctl restart networking

      # Verify network configuration
      verification_failed=false
      {{- range .VLANs }}
      if ip addr show {{ .PortName }}.{{ .ID }} | grep -q {{ .IPAddress }}; then
        echo "Configuration for VLAN {{ .ID }} on {{ .PortName }} with IP {{ .IPAddress }} successful"
      else
        echo "Configuration for VLAN {{ .ID }} on {{ .PortName }} with IP {{ .IPAddress }} failed" >&2
        verification_failed=true
      fi
      {{- end }}

      if [ "$verification_failed" = true ]; then
        send_user_state_event failed 1002 "Network configuration failed"
        exit 1
      else
        send_user_state_event succeeded 1001 "Network configuration successful"
      fi

runcmd:
  - |
    # Fetch metadata and set up network interfaces
    metadata=$(curl -sf https://metadata.platformequinix.com/metadata)

    # Extract MAC addresses for eth0 and eth1
    mac_eth0=$(echo "$metadata" | jq -r '.network.interfaces[] | select(.name == "eth0") | .mac')
    mac_eth1=$(echo "$metadata" | jq -r '.network.interfaces[] | select(.name == "eth1") | .mac')

    # Function to find interface name by MAC address
    find_interface_by_mac() {
      local mac="$1"
      for iface in $(ls /sys/class/net/); do
        iface_mac=$(ethtool -P "$iface" 2>/dev/null | awk '{print $NF}')
        # Use "=" rather than "==": runcmd entries run under sh, where "==" is not portable.
        if [ "$iface_mac" = "$mac" ]; then
          echo "$iface"
          return
        fi
      done
      echo "Interface not found for MAC $mac" >&2
      return 1
    }

    # Find interface names for eth0 and eth1
    iface_eth0=$(find_interface_by_mac "$mac_eth0")
    iface_eth1=$(find_interface_by_mac "$mac_eth1")

    # Replace eth0 and eth1 in the script with the actual interface names
    sed -i "s/eth0/${iface_eth0}/g" /tmp/final_configuration.sh
    sed -i "s/eth1/${iface_eth1}/g" /tmp/final_configuration.sh

    # Execute the modified script
    bash /tmp/final_configuration.sh
```

CAPP will use Go templates to substitute the placeholders with the values provided by the user.

### Layer 2 Networking Setup by the CAPP Operator
When provisioning a metal node with Layer 2 networking, the Cluster API Provider (CAPP) Operator will perform the following steps:
1. **Create a ConfigMap for IP Address Management**: The operator will create a new ConfigMap named `<cluster_name>-<port_name>` for each port to manage IP addresses. This ConfigMap is critical for tracking and allocating IP addresses as detailed in the *IP Address Management* section.
2. **Select an Available IP Address**: CAPP will select an available IP address from the ConfigMap to be assigned to the machine, node, or server being provisioned.
3. **Generate User-Data Script**: Using Go templates, CAPP will substitute the necessary variables in the user-data script, such as port name, IP address, gateway, and VXLAN. These values are provided by the user through the custom resource definition.
4. **Submit Device Creation Request**: CAPP will then submit a request to create the device, incorporating the generated user-data script for OS and network configuration.
5. **Verify Network Configuration**: After the machine or device is successfully provisioned, CAPP will poll the device events API to check whether the network configuration was successful. If not, it will handle the failure or timeout as needed.
6. **Perform Post-Provisioning Network Operations**: Once the device is provisioned and the network configuration from the user-data script is in place, CAPP will make calls to the /ports API to perform additional operations. These include assigning the VLAN to the port, converting the port to Layer 2 if required, and other necessary configurations.

### Explanation of send_user_state_event Function
The send_user_state_event function in the script is responsible for sending status updates to the user_state_url fetched from Equinix Metadata API. The Metadata API is a service available on every Equinix Metal server instance that allows the server to access and share various data about itself. Here’s how the function works:
1. **Retrieve the user_state_url**: The script fetches the user_state_url from the Equinix Metadata API. This URL is used to send custom user state events that report on the progress or status of the server's configuration.
2. **Prepare the Event Data**: The function constructs a JSON payload containing the state, code, and message. The jq tool is used to create this JSON object dynamically, based on the input parameters.
3. **Send the Event**: The constructed JSON data is then sent to the user_state_url via a POST request. This allows the system to log the state of the network configuration process (e.g., "running," "succeeded," or "failed") along with an appropriate status code and message.
This approach enables tracking of the server's state during the boot process, particularly for critical operations like network configuration.


### IP Address Management:

* * * * *

In Phase 1, Cluster API Provider Packet (CAPP) will manage IP allotment to individual machines using Kubernetes ConfigMaps. This approach allows for tracking allocations and assigning available IP addresses dynamically.

Example:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: capp-ip-allocations
  namespace: cluster-api-provider-packet-system
data:
  cidr: 192.168.2.0/24
  allocations: |
    {
      "machine1": "192.168.2.2",
      "machine2": "192.168.2.3"
    }
```

In the example above, the `capp-ip-allocations` ConfigMap in the `cluster-api-provider-packet-system` namespace tracks IP allocations. The `cidr` field specifies the IP range, while the `allocations` field is a JSON object mapping machine names to their allocated IP addresses.