
Automating Cilium CNI Installation on OCI OKE with Pulumi and Python

By Nikhil Verma

In the realm of cloud-native applications, Kubernetes stands as the top choice for orchestrating containerized workloads. However, selecting the right Container Network Interface (CNI) is crucial for both performance and security. Oracle Cloud Infrastructure (OCI) Container Engine for Kubernetes (OKE) does not ship with Cilium CNI by default, which can be a setback for users who want to use Cilium's powerful features. In this guide, we will show you how to automate the installation of Cilium CNI on OCI OKE using Pulumi and Python.


By following this guide, you will learn how to remove the existing Flannel overlay, install Cilium CNI, and restart the nodes, all through automation with Pulumi. By the end of this article, you will be able to implement Cilium CNI in your OCI OKE environment with confidence.


Understanding Cilium CNI


Cilium is an open-source networking solution that provides networking, security, and observability for containerized applications. It is built on eBPF (extended Berkeley Packet Filter), a technology that runs sandboxed programs inside the Linux kernel and enables significant performance improvements. Here are some compelling benefits of using Cilium:


  • Enhanced Security: With layer 7 policies, Cilium gives you fine-grained control over traffic between services. For instance, you can enforce policies that allow only specific microservices, or even specific HTTP methods and paths, to communicate, greatly reducing the attack surface (see the sketch after this list).


  • Performance Improvements: Cilium can achieve up to 5x lower latency and significantly higher throughput compared to traditional CNIs like Flannel, because its eBPF datapath bypasses much of the iptables processing; the gains are most noticeable in service-to-service communication under high load.


  • Observability Tools: Cilium offers robust monitoring and tracing features through its Hubble observability layer, making it easier to pinpoint and resolve network problems. This is critical for maintaining uptime in production environments, where even a small network issue can cause costly service disruptions.
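
As a quick illustration of the layer 7 policies mentioned above, here is a minimal sketch of how such a policy could be declared with Pulumi's Kubernetes provider. The resource name, namespace, labels, port, and path are hypothetical placeholders, not part of the deployment built later in this guide.

import pulumi_kubernetes as k8s

# Hypothetical L7 policy: only GET /health from pods labelled app=frontend
# may reach pods labelled app=backend on port 8080. In the full program you
# would also pass the Kubernetes provider via pulumi.ResourceOptions.
l7_policy = k8s.apiextensions.CustomResource("allow-frontend-health",
    api_version="cilium.io/v2",
    kind="CiliumNetworkPolicy",
    metadata={
        "name": "allow-frontend-health",
        "namespace": "default",
    },
    spec={
        "endpointSelector": {"matchLabels": {"app": "backend"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"app": "frontend"}}],
            "toPorts": [{
                "ports": [{"port": "8080", "protocol": "TCP"}],
                "rules": {"http": [{"method": "GET", "path": "/health"}]},
            }],
        }],
    },
)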


Given these advantages, many companies are migrating to Cilium for their networking needs. However, the manual installation can often deter users, which is where automation becomes invaluable.


Why Use Pulumi for Automation?


Pulumi is an advanced Infrastructure as Code (IaC) tool that allows users to manage cloud resources in numerous programming languages such as Python, JavaScript, and Go. Unlike traditional IaC tools that often use specialized languages, Pulumi lets developers use familiar programming concepts. This flexibility helps to blend infrastructure deployment with application logic seamlessly.


Using Pulumi to automate the installation of Cilium CNI on OCI OKE provides several key benefits:


  • Flexibility: You can write custom scripts that fit your specific needs using Python. This means you can add complex logic that might be required for unique installation flows.


  • Reusability: Pulumi encourages packaging infrastructure into reusable components, which can save hours on future deployments and lets you replicate setups consistently across environments (a minimal component sketch follows this list).


  • Version Control: By using standard programming languages such as Python, you can utilize version control systems like Git. This feature not only tracks changes but also makes collaboration simpler across teams.
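
To make the reusability point concrete, here is a minimal sketch of a Pulumi ComponentResource that bundles a VCN and an Internet Gateway into one reusable unit. The class name and arguments are illustrative assumptions, not part of the program built later in this guide.

import pulumi
import pulumi_oci as oci

class OkeNetwork(pulumi.ComponentResource):
    """Illustrative reusable component grouping basic OKE networking resources."""

    def __init__(self, name, compartment_id, cidr_block, opts=None):
        super().__init__("custom:oke:OkeNetwork", name, None, opts)
        # Child resources are parented to the component so they appear as one unit
        self.vcn = oci.core.Vcn(f"{name}-vcn",
            compartment_id=compartment_id,
            cidr_blocks=[cidr_block],
            display_name=f"{name}-vcn",
            opts=pulumi.ResourceOptions(parent=self),
        )
        self.internet_gateway = oci.core.InternetGateway(f"{name}-igw",
            compartment_id=compartment_id,
            vcn_id=self.vcn.id,
            display_name=f"{name}-igw",
            opts=pulumi.ResourceOptions(parent=self),
        )
        self.register_outputs({"vcn_id": self.vcn.id})

A component like this can be instantiated once per environment (dev, staging, production) with different CIDR blocks, keeping the setups consistent.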


Prerequisites


Before getting started with the automation process, ensure you have the following prerequisites:


  1. OCI Account: You need an active Oracle Cloud account with access to OKE.

  2. Install Pulumi: Make sure the Pulumi CLI is installed on your local machine; the official installation guide covers all supported platforms.

  3. Python Environment: Ensure you have Python installed, and you might want to set up a virtual environment for better package management, especially if you're juggling various projects.

  4. kubectl: Install `kubectl` for interacting with your Kubernetes cluster.

  5. Access Permissions: Confirm you have the necessary permissions to manage the OKE cluster.


Step-by-Step Automation Process


Step 1: Set Up Networking for the New Cluster


We begin by creating a Virtual Cloud Network (VCN) named "oke-vcn" with a CIDR block of 10.0.0.0/16. An Internet Gateway attached to this VCN provides external connectivity, and a public route table directs all outbound traffic (0.0.0.0/0) through it.

Security is a priority, so we create a security list for the cluster, allowing all outbound traffic and specifying rules for ingress traffic. These rules include allowing TCP traffic for the Kubernetes API server and kubelet API, ICMP traffic for Cilium, and specific ports for Cilium health checks and VXLAN overlay.

We also set up a security list for worker nodes, permitting all outbound traffic and controlling ingress traffic within the VCN, while enabling SSH access and NodePort services.

Finally, we create two subnets: a "cluster subnet" with a CIDR block of 10.0.1.0/24, and a "node subnet" with a CIDR block of 10.0.2.0/24. Both subnets are linked to the public route table and their respective security lists, and they allow public IPs on VNICs.


# Create VCN for the cluster
vcn = oci.core.Vcn("oke-vcn",
    compartment_id=compartment_id,
    cidr_blocks=["10.0.0.0/16"],
    display_name="oke-vcn",
    dns_label="okevcn"
)

# Create Internet Gateway
internet_gateway = oci.core.InternetGateway("oke-internet-gateway",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    display_name="oke-internet-gateway"
)

# Create Route Table for public subnet
public_route_table = oci.core.RouteTable("oke-public-route-table",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    display_name="oke-public-route-table",
    route_rules=[{
        "destination": "0.0.0.0/0",
        "network_entity_id": internet_gateway.id,
    }]
)

# Create Security List for cluster
cluster_security_list = oci.core.SecurityList("oke-cluster-security-list",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    display_name="oke-cluster-security-list",
    egress_security_rules=[
        {
            "protocol": "all",
            "destination": "0.0.0.0/0",
            "description": "Allow all outbound traffic",
        },
    ],
    ingress_security_rules=[
        {
            "protocol": "6",  # TCP
            "source": "0.0.0.0/0",
            "tcp_options": {
                "min": 6443,
                "max": 6443,
            },
            "description": "Kubernetes API server",
        },
        {
            "protocol": "6",  # TCP
            "source": "10.0.0.0/16",
            "tcp_options": {
                "min": 12250,
                "max": 12250,
            },
            "description": "Kubernetes kubelet API",
        },
        {
            "protocol": "1",  # ICMP
            "source": "10.0.0.0/16",
            "description": "ICMP traffic for Cilium",
        },
        # Cilium specific ports
        {
            "protocol": "6",  # TCP
            "source": "10.0.0.0/16",
            "tcp_options": {
                "min": 4240,
                "max": 4240,
            },
            "description": "Cilium health checks",
        },
        {
            "protocol": "17",  # UDP
            "source": "10.0.0.0/16",
            "udp_options": {
                "min": 8472,
                "max": 8472,
            },
            "description": "Cilium VXLAN overlay",
        },
    ]
)

# Create Security List for worker nodes
node_security_list = oci.core.SecurityList("oke-node-security-list",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    display_name="oke-node-security-list",
    egress_security_rules=[
        {
            "protocol": "all",
            "destination": "0.0.0.0/0",
            "description": "Allow all outbound traffic",
        },
    ],
    ingress_security_rules=[
        {
            "protocol": "all",
            "source": "10.0.0.0/16",
            "description": "Allow all traffic within VCN",
        },
        {
            "protocol": "6",  # TCP
            "source": "0.0.0.0/0",
            "tcp_options": {
                "min": 22,
                "max": 22,
            },
            "description": "SSH access",
        },
        {
            "protocol": "6",  # TCP
            "source": "0.0.0.0/0",
            "tcp_options": {
                "min": 30000,
                "max": 32767,
            },
            "description": "NodePort services",
        },
    ]
)

# Create Cluster Subnet
cluster_subnet = oci.core.Subnet("oke-cluster-subnet",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    cidr_block="10.0.1.0/24",
    display_name="oke-cluster-subnet",
    dns_label="cluster",
    route_table_id=public_route_table.id,
    security_list_ids=[cluster_security_list.id],
    prohibit_public_ip_on_vnic=False
)

# Create Node Pool Subnet
node_subnet = oci.core.Subnet("oke-node-subnet",
    compartment_id=compartment_id,
    vcn_id=vcn.id,
    cidr_block="10.0.2.0/24",
    display_name="oke-node-subnet",
    dns_label="nodes",
    route_table_id=public_route_table.id,
    security_list_ids=[node_security_list.id],
    prohibit_public_ip_on_vnic=False
)

Step 2: Availability Domains

We look up the availability domains (ADs) because the node pool will place its nodes across them.

# Get availability domains
availability_domains = oci.identity.get_availability_domains(
    compartment_id=compartment_id
)

Step 3: Define the OKE Cluster

In this section, we will deploy an Enhanced cluster without specifying any CNI.

# Create OKE Cluster without default CNI (to install Cilium)
cluster = oci.containerengine.Cluster("oke-cluster",
    compartment_id=compartment_id,
    kubernetes_version="v1.34.1",
    name="oke-cilium-cluster",
    vcn_id=vcn.id,
    type="ENHANCED_CLUSTER",
    endpoint_config={
        "subnet_id": cluster_subnet.id,
        "is_public_ip_enabled": True,
    },
    options={
        "service_lb_subnet_ids": [cluster_subnet.id],
        "kubernetes_network_config": {
            "pods_cidr": "10.244.0.0/16",
            "services_cidr": "10.96.0.0/16",
        },
        "add_ons": {
            "is_kubernetes_dashboard_enabled": False,
            "is_tiller_enabled": False,
        },
        "admission_controller_options": {
            "is_pod_security_policy_enabled": False,
        },
    }
    # Note: No cluster_pod_network_options specified to avoid default CNI installation
)

Step 4: Deploy the Node Pool

We first define placement configurations to spread the nodes across different ADs.

# Create placement configurations for node pool
placement_configs = []
for ad in availability_domains.availability_domains[:3]:
    placement_configs.append({
        "availability_domain": ad.name,
        "subnet_id": node_subnet.id,
    })

# Create Node Pool
node_pool = oci.containerengine.NodePool("oke-node-pool",
    cluster_id=cluster.id,
    compartment_id=compartment_id,
    kubernetes_version="v1.33.1",
    name="oke-node-pool",
    node_config_details={
        "size": 2,
        "placement_configs": placement_configs,
    },
    node_shape="VM.Standard.E3.Flex",
    node_shape_config={
        "memory_in_gbs": 16,
        "ocpus": 4,
    },
    node_source_details={
        "image_id": "ocid1.image.oc1.iad.aaaaaaaasnbi4aalhxsv36r32eejjomlzrhsfbbhrcbzwptrbzhlspc2kqqa",
        "source_type": "IMAGE",
    },
    initial_node_labels=[
        {
            "key": "node-pool",
            "value": "oke-cilium",
        },
    ],
    ssh_public_key=ssh_public_key
)

Step 5: Remove the Flannel CNI

To access the cluster, we first obtain its kubeconfig file and then create a Kubernetes provider from it.

# Get kubeconfig
def get_kubeconfig(cluster_id):
    return oci.containerengine.get_cluster_kube_config(
        cluster_id=cluster_id,
        token_version="2.0.0"
    )

kubeconfig = cluster.id.apply(get_kubeconfig)

# Create Kubernetes provider using the kubeconfig
k8s_provider = k8s.Provider("oke-k8s",
    kubeconfig=kubeconfig.content,
    opts=pulumi.ResourceOptions(depends_on=[cluster, node_pool])
)

Next, we create a dedicated namespace called "cilium-system" and a small helper function that extracts the API server host from the cluster endpoint URL. To guarantee a clean slate before Cilium is deployed, we also create a service account "remove-cni-sa" in the "kube-system" namespace, along with a ClusterRole granting permission to manage daemonsets and configmaps and a ClusterRoleBinding that ties the two together. A job named "remove-default-cni" then runs an Alpine-based kubectl container that deletes the daemonsets and configmaps belonging to the default CNI; it uses a restart policy of "OnFailure" and a backoff limit of 3. All of these resources are created through the Kubernetes provider defined above.

# Create Cilium namespace
cilium_namespace = k8s.core.v1.Namespace("cilium-system",
    metadata={
        "name": "cilium-system",
    },
    opts=pulumi.ResourceOptions(provider=k8s_provider)
)

# Helper function to extract host from endpoint
def extract_host_from_endpoint(endpoint):
    if endpoint:
        return endpoint.replace("https://", "").replace(":6443", "")
    return ""

# Create service account and RBAC for the cleanup job
remove_cni_sa = k8s.core.v1.ServiceAccount("remove-cni-sa",
    metadata={
        "name": "remove-cni-sa",
        "namespace": "kube-system",
    },
    opts=pulumi.ResourceOptions(provider=k8s_provider)
)

remove_cni_cluster_role = k8s.rbac.v1.ClusterRole("remove-cni-role",
    metadata={
        "name": "remove-cni-role",
    },
    rules=[
        {
            "api_groups": ["apps"],
            "resources": ["daemonsets"],
            "verbs": ["get", "list", "delete"],
        },
        {
            "api_groups": [""],
            "resources": ["configmaps"],
            "verbs": ["get", "list", "delete"],
        },
    ],
    opts=pulumi.ResourceOptions(provider=k8s_provider)
)

remove_cni_cluster_role_binding = k8s.rbac.v1.ClusterRoleBinding("remove-cni-binding",
    metadata={
        "name": "remove-cni-binding",
    },
    role_ref={
        "api_group": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": remove_cni_cluster_role.metadata["name"],
    },
    subjects=[{
        "kind": "ServiceAccount",
        "name": remove_cni_sa.metadata["name"],
        "namespace": "kube-system",
    }],
    opts=pulumi.ResourceOptions(provider=k8s_provider)
)

# Create a job to remove the default CNI after the cluster is ready but before Cilium
remove_default_cni_job = k8s.batch.v1.Job("remove-default-cni",
    metadata={
        "name": "remove-default-cni",
        "namespace": "kube-system",
    },
    spec={
        "template": {
            "spec": {
                "service_account_name": "remove-cni-sa",
                "containers": [{
                    "name": "remove-cni",
                    "image": "alpine/kubectl:1.34.1",
                    "command": ["/bin/sh"],
                    "args": ["-c", """
                        echo "Checking for existing CNI components..."
                        kubectl delete daemonset -n kube-system oci-cni --ignore-not-found=true
                        kubectl delete daemonset -n kube-system kube-flannel-ds --ignore-not-found=true
                        kubectl delete configmap -n kube-system kube-flannel-cfg --ignore-not-found=true
                        kubectl delete configmap -n kube-system cni-config --ignore-not-found=true
                        echo "CNI cleanup completed"
                        sleep 10
                    """],
                }],
                "restart_policy": "OnFailure",
            },
        },
        "backoff_limit": 3,
    },
    opts=pulumi.ResourceOptions(
        provider=k8s_provider,
        # Ensure the service account and its permissions exist before the job runs
        depends_on=[remove_cni_sa, remove_cni_cluster_role_binding],
    )
)

Step 6: Install Cilium CNI


After Flannel is removed, the next step is to install Cilium with Helm, making the release depend on the CNI cleanup job. The configuration uses the "cilium" chart, version 1.18.2, from the official Cilium Helm repository and deploys it into the "cilium-system" namespace. Key settings: the cluster is named "oke-cilium-cluster", IPAM runs in "cluster-pool" mode, kube-proxy is replaced by Cilium, and the Kubernetes service host and port are derived from the cluster endpoint. Routing uses "tunnel" mode with the "vxlan" protocol and IPv4 masquerading enabled. AWS ENI mode and the host firewall are disabled for OKE compatibility, while NodePort and external IPs are enabled. The CNI settings mark Cilium as the exclusive CNI, policy enforcement stays in "default" mode, and transparent encryption is left disabled (it can be switched on later, for example with WireGuard). Finally, the chart's resource options declare dependencies on the Cilium namespace and the CNI removal job.


# Install Cilium using Helm - depends on CNI cleanup job
cilium_chart = k8s.helm.v3.Chart(
    "cilium",
    k8s.helm.v3.ChartOpts(
        chart="cilium",
        version="1.18.2",  # Use latest stable version
        namespace="cilium-system",
        fetch_opts=k8s.helm.v3.FetchOpts(
            repo="https://helm.cilium.io/",
        ),
        values={
            # Cilium configuration for OKE
            "cluster": {
                "name": "oke-cilium-cluster",
                "id": 1,
            },
            "ipam": {
                "mode": "cluster-pool",  # Use cluster-pool for better control
                "operator": {
                    "clusterPoolIPv4PodCIDRList": ["10.244.0.0/16"],
                }
            },
            "kubeProxyReplacement": True,  # Replace kube-proxy entirely with Cilium
            "k8sServiceHost": cluster.endpoints.apply(
                lambda endpoints: extract_host_from_endpoint(
                    endpoints[0].public_endpoint
                )
            ),
            "k8sServicePort": "6443",
            "operator": {
                "replicas": 1,
                "rollOutPods": True,
            },
            # Ensure Cilium manages all networking
            "routingMode": "tunnel",
            "tunnelProtocol": "vxlan",
            "autoDirectNodeRoutes": False,
            "enableIPv4Masquerade": True,
            "enableIPv6Masquerade": False,
            "installIptablesRules": True,
            "masqueradeInterfaces": "eth0",
            # OKE/OCI specific configurations
            "eni": {
                "enabled": False,  # Disable AWS ENI mode
            },
            "nodePort": {
                "enabled": True,
            },
            "externalIPs": {
                "enabled": True,
            },
            "hostFirewall": {
                "enabled": False,  # Disable for OKE compatibility
            },
            "bpf": {
                "masquerade": True,
                "hostLegacyRouting": False,
            },
            # CNI configuration
            "cni": {
                "install": True,
                "exclusive": True,  # Cilium should be the only CNI
                "chainingMode": "none",
            },
            # Security and networking features
            "policyEnforcementMode": "default",
            "encryption": {
                "enabled": False,  # Can be enabled for encryption in transit
                "type": "wireguard",
            }
        }
    ),
    opts=pulumi.ResourceOptions(
        provider=k8s_provider,
        depends_on=[cilium_namespace, remove_default_cni_job]
    )
)

This sets the CNI type to Cilium and installs it on your OKE cluster.


Step 7: Restart the Nodes


After installing Cilium, it is essential to restart the worker nodes so they come back up with Cilium-managed networking.

First, we look up the node pool through OCI's Container Engine data source and extract all of the node IDs, which are exported as a Pulumi output. A helper function, reboot_nodes_command, builds a command string that soft-resets each node via the OCI CLI and sleeps between reboots. Finally, a rolling reboot is triggered with a local command that depends on the Cilium chart, so it runs only after the installation completes.

# Look up the node pool to collect the worker node OCIDs
node_pool_info = node_pool.id.apply(
    lambda np_id: oci.containerengine.get_node_pool(node_pool_id=np_id)
)

# Extract all node IDs
node_ids = node_pool_info.apply(lambda info: [n.id for n in info.nodes])

# Export them
pulumi.export("node_ids", node_ids)


def reboot_nodes_command(ids):
    # `ids` is the resolved list of node OCIDs at this point
    cmds = []
    for nid in ids:
        # Soft-reset the instance through the OCI CLI
        cmds.append(f"oci compute instance action --action SOFTRESET --instance-id {nid}")
        # Give the node time to come back before moving on to the next one
        cmds.append("sleep 30")

    return " && ".join(cmds)

# Rolling reboot of the worker nodes; the command string is built from the resolved node IDs
rolling_reboot = local.Command(
    "rolling-reboot-nodes",
    create=node_ids.apply(reboot_nodes_command),
    opts=pulumi.ResourceOptions(
        depends_on=[cilium_chart]  # Ensure this runs after Cilium installation
    )
)


This reboots every worker node in the pool, one at a time, so each node rejoins the cluster with Cilium managing its networking.


Step 8: The Complete Pulumi Program


Putting it all together, your full Pulumi program is simply the snippets above combined into a single __main__.py, preceded by the imports and stack configuration they rely on.
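
As a reference, the snippets above assume a shared preamble along these lines; the configuration key names (compartment_id, ssh_public_key) are assumptions, so use whatever keys you define in your stack configuration:

import pulumi
import pulumi_oci as oci
import pulumi_kubernetes as k8s
from pulumi_command import local

# Stack configuration (assumed key names; set them with `pulumi config set`)
config = pulumi.Config()
compartment_id = config.require("compartment_id")
ssh_public_key = config.require("ssh_public_key")

# ...followed by the networking, availability domain lookup, cluster, node pool,
# CNI cleanup, Cilium Helm chart, and rolling-reboot resources shown above.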


This complete program automates the entire process of transitioning from Flannel to Cilium and restarting the nodes.


Pulumi Deployment Validation


After running pulumi up, review the stack in the Pulumi console (or with pulumi stack) to confirm that the VCN, cluster, node pool, and Kubernetes resources were created successfully.


Testing the Installation


After executing your Pulumi program, it's essential to verify that Cilium is installed correctly. Run the following commands:


# Check Cilium installation status:
kubectl get pods -n cilium-system
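# The next two commands require the Cilium CLI to be installed on your workstation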
cilium status --wait
cilium connectivity test

You should see the Cilium pods running in the cilium-system namespace. Once everything is confirmed to be operational, you can start taking advantage of Cilium's advanced security and performance features in your applications.


Wrapping Up


Automating the installation of Cilium CNI on OCI OKE with Pulumi and Python streamlines your transition and enhances your Kubernetes networking capabilities. By following the clear steps in this guide, you can efficiently move from Flannel to Cilium, gaining access to advanced features for security, performance, and observability.


As cloud-native technologies evolve, leveraging automation tools like Pulumi will become increasingly important for managing infrastructure efficiently. By adopting these practices, you ensure your Kubernetes environment remains not only efficient but also secure and scalable.


For additional information, consider checking out the complete Pulumi code available on GitHub.

