Cisco Application Centric Infrastructure Calico Design White Paper

White Paper

Updated: February 9, 2022


Introduction

With the increasing adoption of container technologies and Cisco® Application Centric Infrastructure (Cisco ACI®) as a pervasive data-center fabric technology, it is only natural to see customers trying to integrate these two solutions.

At the time of writing, the most popular container orchestrator engine is Kubernetes (often abbreviated K8s). A key design choice every Kubernetes administrator has to make when deploying a new cluster is selecting a network plugin. The network plugin is responsible for providing network connectivity, IP address management, and security policies to containerized workloads.

A series of network plugins is available for Kubernetes, with different transport protocols and/or features being offered. To browse the full list of network plugins supported by Kubernetes, follow this link: https://kubernetes.io/docs/concepts/cluster-administration/networking/#how-to-implement-the-kubernetes-networking-model.

Note:      While Cisco offers a CNI (container network interface) plugin directly compatible and integrated with Cisco ACI, that is not covered in this document. In this white paper we will be discussing the current best practices for integrating Cisco ACI with Project Calico.

Calico

Calico supports two main network modes: direct container routing (no overlay transport protocol) or network overlay using VXLAN or IPinIP (default) encapsulations to exchange traffic between workloads. The direct routing approach means the underlying network is aware of the IP addresses used by workloads. Conversely, the overlay network approach means the underlying physical network is not aware of the workloads’ IP addresses. In that mode, the physical network only needs to provide IP connectivity between K8s nodes while container to container communications are handled by the Calico network plugin directly. This, however, comes at the cost of additional performance overhead as well as complexity in interconnecting your container-based workloads with external non-containerized workloads.

When the underlying networking fabric is aware of the workloads’ IP addresses, an overlay is not necessary. The fabric can directly route traffic between workloads inside and outside of the cluster as well as allowing direct access to the services running on the cluster. This is the preferred Calico mode of deployment when running on premises.[1] This guide details the recommended ACI configuration when deploying Calico in direct routing mode.

You can read more about Calico at https://docs.projectcalico.org/

Calico routing architecture

In a Calico network, each compute server acts as a router for all the endpoints that are hosted on that compute server. We call that function a vRouter. The data path is provided by the Linux kernel, the control plane by a BGP protocol server, and the management plane by Calico’s on-server agent, Felix.

Each endpoint can only communicate through its local vRouter. The first and last hop in any Calico packet flow is an IP router hop through a vRouter. Each vRouter announces all of the endpoints it is responsible for to all the other vRouters and other routers on the infrastructure fabric using BGP, usually with BGP route reflectors to increase scale.[2]
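On a node you can see the result of these announcements in the Linux routing table. The following is a minimal sketch, assuming the default BIRD-based Calico data plane (the exact output depends on your pod CIDR and cluster size):

cisco@calico-node-1:~$ ip route show proto bird

This lists the pod blocks learned over BGP from the other vRouters, each with the advertising node as next hop.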

Calico proposes three BGP design options:

      The BGP AS Per Rack design

      The BGP AS Per Compute Server design

      The Downward Default design

They are all detailed at https://docs.projectcalico.org/reference/architecture/design/l3-interconnect-fabric

After taking into consideration the characteristics and capabilities of ACI and Calico, Cisco’s current recommendation is to implement an alternative design where a single AS number is allocated to the whole cluster.

AS Per Cluster design – overview

In this design, a dedicated Cisco ACI L3Out will be created for the entire Kubernetes cluster. This will remove control-plane and data-plane overhead on the K8s cluster, thus providing improved performance and enhanced visibility to the workloads.

Each Kubernetes node will have the same AS number and will peer via eBGP with a pair of ACI Top-of-Rack (ToR) switches configured in a vPC pair. Having a vPC pair of leaf switches provides redundancy within the rack.

This eBGP design does not require running any route reflector nor full mesh peering (iBGP) in the Calico infrastructure; this results in a more scalable, simpler, and easier to maintain architecture.

To remove the need to run iBGP in the cluster, the ACI BGP configuration needs to enable the following features:

      AS override: The AS override function will replace the AS number from the originating router with the AS number of the sending BGP router in the AS Path of outbound routes.

      Disable Peer AS Check: Disables the peer autonomous system number check.

With these two configuration options enabled, ACI will advertise the pod subnets between the Kubernetes nodes (even if learned from the same AS), ensuring all the pods in the cluster can communicate with each other.

An added benefit of configuring all the Kubernetes nodes with the same AS is that it allows the use of BGP dynamic neighbors: the cluster subnet is configured as the BGP neighbor, greatly reducing the ACI configuration complexity.

Note: You need to ensure the Kubernetes nodes are configured to peer only with the border leaf switches in the same rack.

Once this design is implemented, the following connectivity is expected:

      Pods running on the Kubernetes cluster can be directly accessed from inside ACI or outside through transit routing.

      Pod-to-pod and node-to-node connectivity will happen over the same L3Out and external endpoint group (EPG).

      Exposed services can be directly accessed from inside or outside ACI. Those services will be load balanced by ACI through 64-way ECMP provided by BGP.

Figure 1 shows an example of such a design with two racks and a six-node K8s cluster:

      Each rack has a pair of ToR leaf switches and three Kubernetes nodes.

      ACI uses AS 65002 for its leaf switches.

      The six Kubernetes nodes are allocated AS 65003.

Figure 1. AS Per Cluster design

Physical connectivity

The physical connectivity will be provided by a virtual Port-Channel (vPC) configured on the ACI leaf switches toward the Kubernetes nodes. One L3Out is configured on the ACI fabric to run eBGP with the vRouter in each Kubernetes node through the vPC port-channel.

The vPC design supports both virtualised and bare-metal servers.

Standard SVI vs. floating SVI

When it comes to connecting the K8s nodes to the ACI fabric, an important decision to make is choosing to use standard SVIs or floating SVIs. If you are not familiar with floating SVI, you should take a moment to read https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/Cisco-ACI-Floating-L3Out.html

The following table should help you decide between the two options. The table is based on Cisco ACI Release 5.2.3 scale and features.

Table 1.        Floating SVI vs Standard SVI

 

Max cluster rack span[3]
  Floating SVI: 3 racks (6 anchor nodes) with optimal traffic flows; up to 19 racks (6 anchor nodes + 32 non-anchor nodes) with suboptimal traffic flows for the nodes connected to the non-anchor leaves
  Standard SVI: 6 racks (12 border leaves) with optimal traffic flows

Node subnets required
  Floating SVI: One subnet for the whole cluster
  Standard SVI: One /29 subnet per node

Static paths binding
  Floating SVI: None; the binding is done at the physical domain level
  Standard SVI: One per vPC

VM mobility
  Floating SVI: Yes, with suboptimal traffic flows if the VMs move to a different rack
  Standard SVI: No

Per fabric scale
  Floating SVI: 200 floating SVI IPs (an IPv4/IPv6 dual-stack cluster uses two floating SVIs)
  Standard SVI: Unlimited

L3Out per fabric
  Floating SVI: 100
  Standard SVI: 2400

ACI version
  Floating SVI: 5.0 or newer
  Standard SVI: Any (this design was tested on 4.2 and 5.x code)

Note:      Floating SVI L3Out and regular, or “standard,” SVI L3Out can coexist in the same ACI fabric and leaves.

Recommendation: Floating SVI provides a simpler and more flexible configuration. As long as it can fulfil the scale requirements of your K8s environment, it is the preferred design choice.

The rest of the document will cover both the floating SVI and the regular or standard SVI approach. Since most of the configuration between the two options is identical, this document will first focus on the specific implementation details of a floating SVI, continue with the details of a standard SVI design, and end with the common configurations.

A detailed configuration example will be also provided for both options.

Floating SVI design

The floating SVI feature enables you to configure an L3Out without specifying logical interfaces. The feature saves you from having to configure multiple L3Out logical interfaces to maintain routing when Virtual Machines (VMs) move from one host to another. Floating SVI is supported for VMware vSphere Distributed Switch (VDS) as of ACI Release 4.2(1) and on physical domains as of ACI Release 5.0(1). Using the physical domain approach is recommended for the following reasons:

      Can support any hypervisor

      Can support mixed mode clusters (VMs and bare-metal)

Using floating SVI also relaxes the requirement that there be no Layer-2 communication between the routing nodes; this allows the design to use:

      A single subnet for the whole cluster

      A single encapsulation (VLAN) for the whole cluster

Figure 2. Floating SVI design

A strong recommendation for this design is to peer the K8s nodes with the local (same-rack) anchor nodes. This is needed to limit traffic tromboning because, currently, a compute leaf (Leaf105 in the example below) will install, as the next hop for the routes received from the K8s nodes, the TEP IP address of the anchor nodes where the eBGP peering is established. This leads to suboptimal traffic flow, as shown in Figure 3.

VM mobility is, however, still supported and can be used to perform, for example, maintenance on the hypervisor without the need to power off nodes in the K8s cluster.

Note:      The techniques described in the “Avoiding suboptimal traffic from an ACI internal EP to a floating L3Out” section in the "Floating SVI design” document cannot be applied to this design.

Figure 3. Tromboning with floating SVI

Standard SVI design

If floating SVI cannot be used due to scale limitations or the ACI version, a valid alternative is to use standard SVI. The high-level architecture is still the same, where the K8s nodes peer with the local (same-rack) border leaves; however, a “standard” L3Out is meant to attach routing devices only. It is not meant to attach servers directly to the SVI of an L3Out, because the Layer-2 domain created by an L3Out with SVIs is not equivalent to a regular bridge domain.

For this reason, it is preferred not to have any Layer-2 switching between the K8s nodes connected to an external bridge domain.

To ensure that our cluster follows this design “best practice,” we will need to allocate:

      A /29 subnet for every server

      A dedicated encapsulation (VLAN) on a per-node basis

Figure 4. A standard SVI design

K8s node gateway

Each Kubernetes node uses ACI as default gateway (GW). The GW IP address will be the secondary IP of the floating SVI or standard SVI interfaces.

If the Kubernetes nodes are Virtual Machines (VMs), follow these additional steps:

      Configure the virtual switch’s port-group load-balancing policy (or its equivalent) to “route based on IP address”

      Avoid running more than one Kubernetes node per hypervisor. It is technically possible to run multiple Kubernetes nodes on the same hypervisor, but this is not recommended because a hypervisor failure would result in a double (or more) Kubernetes node failure.

      For standard SVI design only: Ensure that each VM is hard-pinned to a hypervisor, and ensure that no live migration of the VMs from one hypervisor to another can take place.

If the Kubernetes nodes are bare-metal, follow these additional steps:

      Configure the physical NICs in an LACP bond (often called 802.3ad mode), as sketched below.

This design choice allows creating nodes with a single interface, simplifying the routing management on the nodes; however, it is possible to create additional interfaces as needed.
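As an illustration, below is a minimal netplan sketch for a bare-metal node using an LACP bond. The interface names (eno1/eno2), VLAN ID, and addresses are assumptions for this example only; the default gateway is the L3Out secondary IP described in the following sections. Adapt all values to your environment.

# Hypothetical netplan example: two NICs in an LACP (802.3ad) bond, a VLAN
# subinterface matching the L3Out encapsulation, and the ACI secondary IP
# as default gateway. Names, VLAN ID, and addresses are placeholders.
network:
  version: 2
  ethernets:
    eno1: {dhcp4: false}
    eno2: {dhcp4: false}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: 802.3ad
        lacp-rate: fast
        transmit-hash-policy: layer3+4
  vlans:
    bond0.1100:
      id: 1100
      link: bond0
      addresses: [192.168.2.1/24]
      routes:
        - to: 0.0.0.0/0
          via: 192.168.2.254   # assumed L3Out secondary IP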

Kubernetes node egress routing

Because ACI leaf switches are the default gateway for the Kubernetes nodes, in theory the nodes only require a default route that uses the ACI L3Out as the next hop, both for node-to-node traffic and for communication with the outside world.

However, for ease of troubleshooting and visibility, the ACI fabric will be configured to advertise back, to the individual nodes, the /26 pod subnets.[4]

Benefits of using the L3Out secondary IP as default gateway include:

      Zero impact during leaf reload or interface failures: the secondary IP and its MAC address are shared between the ACI leaves. In the event of one leaf failure, traffic will seamlessly converge to the remaining leaf.

      No need for a dedicated management interface. The node will be reachable even before eBGP is configured.

Kubernetes node ingress routing

Each Kubernetes node will be configured to advertise the following subnets to ACI:

      Node subnet

      Its allocated subnet(s) inside the pod supernet (a /26 by default[5])

      Host route (/32) for any pod on the node outside of the pod subnets allocated to the node

      The whole service subnet advertised from each node

      A host route (a /32 in the service subnet) for each exposed service configured with externalTrafficPolicy: Local

ACI BGP configuration

AS override

The AS override function will replace the AS number from the originating router with the AS number of the sending BGP router in the AS Path of outbound routes.

Disable Peer AS Check

Disables the peer autonomous system number check.

BGP Graceful Restart

Both ACI and Calico are configured by default to use BGP Graceful Restart. When a BGP speaker restarts its BGP process or when the BGP process crashes, neighbors will not discard the received paths from the speaker, ensuring that connectivity is not impacted as long as the data plane is still correctly programmed.

This feature ensures that, if the BGP process on the Kubernetes node restarts (Calico CNI BGP process upgrade or crash), no traffic is impacted. In the event of an ACI switch reload, the feature does not provide a significant benefit, because the Kubernetes nodes do not depend on the routes received from ACI (their default route is statically configured).

BGP timers

The ACI BGP timers should be set to 1s/3s to match the Calico configuration.

AS path policy

By default, ACI will only install one ECMP path for a received subnet even if the remote AS is different. To allow installing more than one ECMP path, it is required to configure an AS path policy to relax the AS_path restriction when choosing multiple paths.

Max BGP ECMP path

By default, ACI will only install 16 eBGP/iBGP ECMP paths. This would limit spreading the load to up to 16 K8s nodes. The recommendation is to increase both values to 64. Increasing also the iBGP max path value to 64 will ensure that the internal MP-BGP process is also installing additional paths.

BGP hardening

In order to protect the ACI against potential Kubernetes BGP misconfigurations, the following settings are recommended:

      Enable BGP password authentication

      Set the maximum AS limit to one:

a.   Per the eBGP architecture, the AS path should always be one.

      Configure BGP import route control to accept only the expected subnets from the Kubernetes cluster:

a.   Pod subnet

b.   Node subnet

c.   Service subnet

      (Optional) Set a limit on the number of received prefixes from the nodes.

Kubernetes node maintenance and failure

Before performing any maintenance or reloading a node, you should follow the standard Kubernetes best practice of draining the node. Draining a node ensures that all the pods present on that node are terminated first, then restarted on other nodes. For more info see: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

While draining a node, the BGP process running on that node stops advertising the service addresses toward ACI, ensuring there is no impact on the traffic. Once the node has been drained, it is possible to perform maintenance on the node with no impact on traffic flows.
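For example, a typical drain-and-restore sequence could look like the following (worker-3 is just an example node name; on older kubectl releases the last flag is --delete-local-data):

calico-master-1#kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data

Once the maintenance is complete and the node is back online, return it to service:

calico-master-1#kubectl uncordon worker-3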

Scalability

As of ACI Release 5.2, the scalability of this design is bounded by the following parameters:

Nodes per cluster

A single L3Out can be composed of up to 12 border leaves or six anchor and two non-anchor nodes. Assuming an architecture with two leaf switches per rack, this will limit the scale to a maximum of six or three to four racks per Kubernetes cluster. Considering current rack server densities, this should not represent a significant limit for most deployments. Should a higher rack/server scale be desired, it is possible to spread a single cluster over multiple L3Outs. This requires additional configuration that is not currently covered in this design guide. If you are interested in pursuing such a large-scale design, please reach out to Cisco Customer Experience (CX) and services for further network design assistance.

Longest prefix match scale

The routes that are learned by the border/anchor leaves through peering with external routers are sent to the spine switches. The spine switches act as route reflectors and distribute the external routes to all of the leaf switches that have interfaces that belong to the same tenant. These routes are called Longest Prefix Match (LPM) and are placed in the leaf switch's forwarding table with the VTEP IP address of the remote leaf switch where the external router is connected.

In this design, every Kubernetes node advertises to the local border leaf its pod host routes (aggregated to /26 blocks when possible) and service subnets, plus host routes for each service configured with externalTrafficPolicy: Local. Currently, on ACI ToR switches of the -EX, -FX, and -FX2 hardware families, it is possible to change the number of supported LPM entries using “scale profiles,” as described in: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/b_Cisco_APIC_Forwarding_Scale_Profile_Policy.html

In summary, depending on the selected profile, the design can support from 20,000 to 128,000 LPM entries.

Detailed configurations – example

This section shows the required ACI and Calico steps to configure integration with floating SVI or standard SVI as shown in figures 2 and 4.

Note:      It is assumed the reader is familiar with essential ACI concepts and the basic fabric configuration (interfaces, VLAN pools, external routed domain, AAEP) and that a tenant with the required VRF already exists.

Example floating SVI IP allocation

With floating SVI, the IP allocation scheme is extremely simple and consists of allocating a subnet large enough to contain one host address per required K8s node, plus one host address for every anchor node, one floating address, and one secondary address.
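For example (hypothetical sizing): a 50-node cluster peering with six anchor nodes requires 50 + 6 + 1 (floating address) + 1 (secondary address) = 58 host addresses, so a /26 (62 usable hosts) would fit, while a /25 would leave more room for growth.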

Example standard SVI IP allocation

A possible IP allocation scheme is shown below (a worked example follows the list):

1.     Allocate a supernet big enough to contain as many /29 subnets as nodes. For example, a 32-node cluster could use a /24 subnet, a 64-node cluster a /23 subnet, and so on.

2.     For every /29 subnet:

    Allocate the first usable IP to the node

    Allocate the last three usable IPs for the ACI border leaves:

i     Last IP-2 for the first leaf

ii    Last IP-1 for the second leaf

iii    Last IP for the secondary IP
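As a worked example (addresses are illustrative only): for the /29 subnet 192.168.1.0/29, the usable range is 192.168.1.1 through 192.168.1.6. The node takes 192.168.1.1, the first leaf 192.168.1.4, the second leaf 192.168.1.5, and the shared secondary IP 192.168.1.6. The next node would then use 192.168.1.8/29, and so on.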

ACI BGP configuration (ACI Release 5.2)

Note that because each K8s node is acting as a BGP router, the nodes attach to the ACI fabric through an external routed domain instead of a regular physical domain. When configuring the interface policy groups and Attachable Access Entity Profiles (AAEP) for your nodes, bind them to that single external routed domain and physical domain. You will attach that routed domain to your L3Out in the next steps.

3.     Create a new L3Out.

    Go to Tenant <Name> -> Networking, right click on “External Routed Network,” and create a new L3Out.

Configure the basic L3Out parameters:

    Name: <Your L3Out Name>

    VRF: <Your VRF Name>

    External Routed Domain: <Your External Routed Domain>

    Enable BGP.

Click Next.

Note:      If you use floating SVI, follow the steps in point 4, OR, if you use standard SVI, follow the steps in point 5.

4.     Configure floating SVI:

    Select “Floating SVI” as interface type.

    Domain Type: Physical

    Domain: <Your Domain>

    Floating Address: An IP in the node subnet

    VLAN: Select a valid VLAN ID

    MTU: Select the desired MTU; 9000 is the recommended value.

    Node: Add here all your anchor nodes:

    Node ID: Select your node

    Router ID: A valid IP address

    IPv4/6: Your node primary address

5.     Configure SVI:

    Select “SVI” as interface type

    Select the Layer 2 interface type: You should be using vPC.

    Select the physical path.

    VLAN: Select a valid VLAN ID.

    MTU: Select the desired MTU; 9000 is the recommended value.

    Configure side A and side B:

    Node ID should be selected automatically, based on your path.

    Router ID: A valid IP address

    IPv4/6: Your node primary address

    If you need more than two border leaves, you will have to add them manually after the creation of the L3OUT:

    Go to the L3OUT -> Logical Node Profiles -> <Node Profile Name> -> Click on “+” to add nodes. Give the node a router ID.

    Go to the L3OUT -> Logical Node Profiles -> Logical Interface Profiles -> <Logical Interface Profiles Name> -> SVI -> Click on “+” to add new paths.

Click Next twice, skipping the Protocols section.

Create a new external EPG.

    Name: “default”

    Subnets:

Add all your subnets:

i     Node subnets (use the supernet in the SVI case)

ii    Pod subnets

iii    Cluster subnet

iv    External service subnet

    (Optional) If you need to perform any route leaking, you should add the required subnet scope options, such as Export Route Control Subnet. This is required if these subnets are to be advertised outside of ACI.

6.     Floating SVI and SVI: Add a secondary address.

    For each anchor node or SVI path, add an IPv4 secondary address.

    Floating SVI: This is one address for the whole cluster, as per Figure 2.

    SVI: Each pair of border leaves has a dedicated secondary IP, as per Figure 4.

7.     Enable Import Route Control Enforcement:

    Select your L3OUT -> Policy -> Route Control Enforcement -> Enabled Import

8.     Set the BGP timers to 1s Keep Alive Interval and 3s Hold Interval, to align with the default configuration of Calico and provide fast node-failure detection. We are also configuring the maximum AS limit to 1 and Graceful Restart, as discussed previously.

    Expand “Logical Node Profiles,” right click on the Logical Node Profile, and select “Create BGP Protocol Profile.”

    Click on the “BGP Timers” drop-down menu and select “Create BGP Timers Policy.”

    Name: <Name>

    Keep Alive Interval: 1s

    Hold Interval: 3s

    Maximum AS Limit: 1

    Graceful Restart Control: Enabled

    Press: Submit.

    Select the <Name> BGP timer policy.

    Press: Submit.

9.     Configure AS-Path Relaxation.

    Select the BGP Protocol Profile created in step 8.

    Click on the “AS-Path Policy” drop-down menu and select “Create BGP Best Path Policy.”

    Name: <Name>

    AS-Path Control – Relax AS-Path: Enabled

    Press: Submit.

    Select the <Name> BGP best path policy.

    Press: Submit.

(Optional) Repeat Step 8 for all the remaining “Logical Node Profiles.”

    Note: The 1s/3s BGP timer policy already exists now; there is no need to re-create it.

10.  In order to protect the ACI fabric from potential BGP prefix misconfigurations on the Kubernetes cluster, Import Route Control was enabled in step 7. In this step, we are going to configure the required route map to allow only the node, pod, cluster service, and external service subnets to be accepted by ACI.

    Expand your L3Out, right click on “Route map for import and export route control,” and select “Create Route Map for…”

    Name: From the drop-down menu, select the pre-existing route map called “default-import.” Do not create a new one.

    Type: “Match Prefix and Routing Policy”

    Click on “+” to create a new context.

    Order: 0

    Name: <Name>

    Action: Permit

    Click on “+” to create a new “Associated Matched Rules.”

    From the drop-down menu, select “Create Match Rule for a Route Map.”

    Name: <Name>

    Click “+” on Match Prefix:

    IP: POD Subnet

    Aggregate: True

    Click Update.

    Repeat the above step for all the subnets that you want to accept.

    Click Submit.

    Ensure the selected Match Rule is the one we just created.

    Click OK.

    Click Submit.

It is possible to use the same steps (10) to configure what routes ACI will export by selecting the “default-export” option. It is recommended to configure the same subnets as in step 10.

11.  Create max ECMP Policies:

    Go to Policies -> Protocol -> BGP -> BGP Address Family Context -> Create a new one and set:

    eBGP Max ECMP to 64

    iBGP Max ECMP to 64

    Leave the rest to the default.

12.  Apply the ECMP Policies:

    Go to Networking -> VRFs -> Your VRF -> Click “+” on BGP Context Per Address Family:

    Type: IPv4 unicast address family

    Context: Your Context

13.  (Optional) Create a BGP peer prefix policy to limit how many prefixes ACI can accept from a single K8s node.

    Go to Policies -> Protocol -> BGP -> BGP Peer Prefix -> Create a new one and set:

    Action: Reject

    Max Number of Prefixes: Choose a value aligned with your cluster requirements.

14.  Configure the eBGP Peers: Each of your K8s nodes should peer with the two ToR leaves as shown in Figure 2, "Floating SVI design.”

    Select your L3OUT -> Logical Node Profiles -> <Node Profile> -> Logical Interface Profile

    Floating SVI: Double click on the anchor node -> Select “+” under the BGP Peer Connectivity Profile.

    SVI: Right click on the SVI Path -> Select Create BGP Peer Connectivity Profile

    Configure the BGP Peer:

    Peer Address: Calico Cluster Subnet

    Remote Autonomous System Number: Cluster AS

    BGP Controls:

i      AS Override

ii      Disable Peer AS Check

    Password/Confirm Password: BGP password

    BGP Peer Prefix Policy: BGP peer prefix policy name

Repeat this step for all your anchor nodes.
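For example, if the Kubernetes nodes sit in 192.168.2.0/24 (the node addressing used in several verification outputs later in this document), entering 192.168.2.0/24 as the peer address creates a dynamic-neighbor range: the anchor leaves will accept an eBGP session from any node in that subnet without requiring per-node peer entries.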

Calico routing and BGP configuration

This section assumes that the basic Kubernetes node configurations have already been applied:

      The Kubernetes node network configuration is complete (that is, default route is configured to be the ACI L3Out secondary IP, and connectivity between the Kubernetes nodes is possible).

      Kubernetes is installed.

      Calico is installed with its default setting as per
https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-50-nodes-or-less

      No configuration has been applied.

Note:      In the next examples, it is assumed that Calico and “calicoctl” have been installed. If you are using K8s as a datastore, you should add the following line to your shell environment configuration:

export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config
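With that environment in place, a quick sanity check that calicoctl can reach the datastore is to list the registered nodes:

calico-master-1#./calicoctl get nodes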

Configure your IPPool as follows, replacing the cidr with the one used in your cluster.

apiVersion: projectcalico.org/v3

kind: IPPool

metadata:

  name: default-ipv4-ippool

spec:

  cidr: 10.1.0.0/16

  ipipMode: Never

  natOutgoing: false

  vxlanMode: Never

  disabled: false

  nodeSelector: all()

    Re-apply the configuration with calicoctl.

calico-master-1#./calicoctl apply -f IPPool.yaml

Create your cluster BGPConfiguration: in this file we disable BGP full mesh between the K8s nodes and set the serviceClusterIPs and serviceExternalIPs subnets so that they can be advertised by eBGP. These subnets are the service and external service subnets in Kubernetes.

apiVersion: projectcalico.org/v3

kind: BGPConfiguration

metadata:

  name: default

spec:

  logSeverityScreen: Info

  nodeToNodeMeshEnabled: false

  serviceClusterIPs:

  - cidr: 192.168.8.0/22

  serviceExternalIPs:

  - cidr: 192.168.3.0/24

calico-master-1#./calicoctl apply -f BGPConfiguration.yaml

Create a secret to store the BGP password. We also need to add a Role and RoleBinding to ensure the calico-node ServiceAccount can access the Secret.

apiVersion: v1

kind: Secret

metadata:

  name: bgp-secrets

  namespace: kube-system

type: Opaque

stringData:

  rr-password: 123Cisco123

---

apiVersion: rbac.authorization.k8s.io/v1

kind: Role

metadata:

  name: secret-access

  namespace: kube-system

rules:

- apiGroups: [""]

  resources: ["secrets"]

  resourceNames: ["bgp-secrets"]

  verbs: ["watch", "list", "get"]

---

apiVersion: rbac.authorization.k8s.io/v1

kind: RoleBinding

metadata:

  name: secret-access

  namespace: kube-system

roleRef:

  apiGroup: rbac.authorization.k8s.io

  kind: Role

  name: secret-access

subjects:

- kind: ServiceAccount

  name: calico-node

  namespace: kube-system

calico-master-1#kubectl apply -f BGPPassSecret.yaml

Calico BGP Node Config: Each node needs to be configured with the cluster AS number (asNumber) and the IP address it uses for BGP peering (ipv4Address).

apiVersion: projectcalico.org/v3

kind: Node

metadata:

  name: calico-4

spec:

  bgp:

    asNumber: 650011

    ipv4Address: 192.168.12.14

---

apiVersion: projectcalico.org/v3

kind: Node

metadata:

  name: calico-1

spec:

  bgp:

    asNumber: 650011

    ipv4Address: 192.168.12.11

---

apiVersion: projectcalico.org/v3

kind: Node

metadata:

  name: calico-2

spec:

  bgp:

    asNumber: 650011

    ipv4Address: 192.168.12.12

---

apiVersion: projectcalico.org/v3

kind: Node

metadata:

  name: calico-3

spec:

  bgp:

    asNumber: 650011

    ipv4Address: 192.168.12.13
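Assuming the Node definitions above are saved in a file (calico-nodes.yaml is just an example name), apply them with calicoctl as with the previous resources:

calico-master-1#./calicoctl apply -f calico-nodes.yaml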

Calico BGP Peering: To simplify the peering configuration, it is recommended to label the K8s nodes with a rack_id. We can then use the rack_id label to select the K8s nodes when we configure the BGP peering in Calico. For example:

calico-master-1#kubectl label node master-1 rack_id=1

calico-master-1#kubectl label node master-2 rack_id=2

Now when we create the BGP peering configuration, we can do the following:

      Set the peerIP to the Leaf IP address for the BGP peering.

      Set the asNumber to the ACI BGP AS number.

      Set the nodeSelector equal to the rack where the ACI leaf is located.

      Set the password to the secret created previously.

With this configuration, all the nodes with rack_id == X will automatically be selected and configured to peer with the selected ACI leaf.

For example:

---

apiVersion: projectcalico.org/v3

kind: BGPPeer

metadata:

  name: "201"

spec:

  peerIP: "192.168.2.201"

  asNumber: 65002

  nodeSelector: rack_id == "1"

  password:

    secretKeyRef:

      name: bgp-secrets

      key: rr-password

---

apiVersion: projectcalico.org/v3

kind: BGPPeer

metadata:

  name: "202"

spec:

  peerIP: "192.168.2.202"

  asNumber: 65002

  nodeSelector: rack_id == "1"

  password:

    secretKeyRef:

      name: bgp-secrets

      key: rr-password

---

apiVersion: projectcalico.org/v3

kind: BGPPeer

metadata:

  name: "203"

spec:

  peerIP: "192.168.2.203"

  asNumber: 65002

  nodeSelector: rack_id == "2"

  password:

    secretKeyRef:

      name: bgp-secrets

      key: rr-password

---

apiVersion: projectcalico.org/v3

kind: BGPPeer

metadata:

  name: "204"

spec:

  peerIP: "192.168.2.204"

  asNumber: 65002

  nodeSelector: rack_id == "2"

  password:

    secretKeyRef:

      name: bgp-secrets

      key: rr-password

 

calico-master-1#./calicoctl apply -f BGPPeer.yaml

Verify that the configuration is applied.

cisco@calico-master-1:~$ calicoctl get bgppeer

NAME    PEERIP                NODE             ASN

201     192.168.2.201         rack_id == "1"   65002

202     192.168.2.202         rack_id == "1"   65002

203     192.168.2.203         rack_id == "2"   65002

204     192.168.2.204         rack_id == "2"   65002

Verify that BGP peering is established:

      From the Kubernetes node:

cisco@calico-node-1:~$ sudo calicoctl node status

Calico process is running.

 

IPv4 BGP status

+---------------+---------------+-------+------------+-------------+

| PEER ADDRESS  |   PEER TYPE   | STATE |   SINCE    |    INFO     |

+---------------+---------------+-------+------------+-------------+

| 192.168.2.201 | node specific | up    | 2021-10-05 | Established |

| 192.168.2.202 | node specific | up    | 02:09:14   | Established |

+---------------+---------------+-------+------------+-------------+

 

      From ACI:

fab2-apic1# fabric 203 show ip bgp summary vrf common:calico

----------------------------------------------------------------

 Node 203 (Leaf203)

----------------------------------------------------------------

BGP summary information for VRF common:calico, address family IPv4 Unicast

BGP router identifier 1.1.4.203, local AS number 65002

BGP table version is 291, IPv4 Unicast config peers 4, capable peers 4

16 network entries and 29 paths using 3088 bytes of memory

BGP attribute entries [19/2736], BGP AS path entries [0/0]

BGP community entries [0/0], BGP clusterlist entries [6/24]

 

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd

192.168.1.1     4 65003    3694    3685      291    0    0 00:01:30 2

192.168.1.9     4 65003    3618    3612      291    0    0 00:00:10 2

192.168.1.17    4 65003    3626    3620      291    0    0 00:00:13 3

By default, Kubernetes service IPs and node ports are accessible through any node and will be load balanced by kube-proxy across all the pods backing the service. To advertise a service directly from just the nodes hosting it (without kube-proxy load balancing), configure the service as “NodePort” and set the “externalTrafficPolicy” to “Local.” This will result in the /32 service IP being advertised to the fabric only by the nodes where the service is active.

apiVersion: v1

kind: Service

metadata:

  name: frontend

  labels:

    app: guestbook

    tier: frontend

spec:

  # if your cluster supports it, uncomment the following to automatically create

  # an external load-balanced IP for the frontend service.

  # type: LoadBalancer

  ports:

  - port: 80

  selector:

    app: guestbook

    tier: frontend

  type: NodePort

  externalTrafficPolicy: Local
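Assuming the manifest above is saved as frontend-service.yaml (an example file name) and that the guestbook namespace used in the verification output below already exists, it can be applied and checked with kubectl:

calico-master-1#kubectl -n guestbook apply -f frontend-service.yaml

calico-master-1#kubectl -n guestbook get svc frontend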

Verify that ACI is receiving the correct routes:

    Every Calico node advertises a /26 subnet to ACI from the pod subnet.[6]

    Every exposed service should be advertised as a /32 host route. For example:

Connect to one of the ACI border leaves and check that we are receiving these subnets:

# Pod subnets: A neat trick here is to use the supernet (/16 in this example) with the longer-prefixes option. In this output, the 192.168.2.x IPs are the K8s nodes.

 

Leaf203# show ip route vrf common:calico  10.1.0.0/16 longer-prefixes

IP Route Table for VRF "common:calico"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.1.38.64/26, ubest/mbest: 1/0

    *via 192.168.2.7%common:calico, [20/0], 12:16:10, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.7/32%common:calico

10.1.39.0/26, ubest/mbest: 1/0

    *via 192.168.2.1%common:calico, [20/0], 2d03h, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.1/32%common:calico

10.1.133.192/26, ubest/mbest: 1/0

    *via 192.168.2.5%common:calico, [20/0], 2d03h, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.5/32%common:calico

10.1.183.0/26, ubest/mbest: 1/0

    *via 192.168.2.9%common:calico, [20/0], 03:04:10, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.9/32%common:calico

 

# Service subnets: Assuming you have a service configured with the externalTrafficPolicy: Local option, it should be advertised:

 

In this example, 192.168.10.150 is advertised by three nodes.

 

cisco@calico-master-1:~$ kubectl -n guestbook get svc frontend

NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE

frontend   NodePort   192.168.10.150   <none>        80:30291/TCP   31s

 

This CLUSTER-IP is front-ending three pods:

kubectl  -n guestbook get pod -o wide -l tier=frontend

NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE      

frontend-d7f77b577-cvqp4   1/1     Running   0          90s   10.1.183.4    worker-6  

frontend-d7f77b577-fhsv8   1/1     Running   0          90s   10.1.226.67   worker-1  

frontend-d7f77b577-lv6vc   1/1     Running   0          90s   10.1.97.196   worker-3

 

These nodes are not all peering with the same leaves:

 

Leaf203# show ip route vrf common:calico  192.168.10.150

192.168.10.150/32, ubest/mbest: 1/0

    *via 192.168.2.9%common:calico, [20/0], 00:16:04, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.9/32%common:calico

 

Leaf201# show ip route vrf common:calico  192.168.10.150

192.168.10.150/32, ubest/mbest: 2/0

    *via 192.168.2.4%common:calico, [20/0], 00:01:14, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.4/32%common:calico

    *via 192.168.2.6%common:calico, [20/0], 00:01:14, bgp-65002, external, tag 65003

         recursive next hop: 192.168.2.6/32%common:calico

Cluster connectivity outside of the fabric – transit routing

In order for nodes of the Kubernetes cluster to communicate with devices located outside the fabric, transit routing needs to be configured between the Calico L3Out and one (or more) L3Outs connecting to an external routed domain.

This configuration requires adding the required consumed/provided contracts between the external EPGs. Please refer to https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-application-centric-infrastructure-design-guide.html#Transitrouting for best practices on how to configure transit routing in Cisco ACI.

Cluster connectivity inside the fabric

Once the eBGP configuration is completed, pods and service IPs will be advertised to the ACI fabric. To provide connectivity between the cluster’s L3Out external EPG and a different EPG, all that is required is a contract between those EPGs.

Dual stack considerations

This design is intended for both IPv4 and IPv6. ACI also supports dual-stack IPv4/IPv6 deployments with the Calico CNI Kubernetes design.

Visibility

Following this design guide, a network administrator can easily verify the status of the K8s cluster.

Under the L3Out logical interface profile, it is possible to visualise the BGP neighbors configured as shown in the following figure:

Figure 5. BGP Neighbors Configuration

Each neighbor’s state can be also verified as shown in the following figure:

Figure 6. BGP Neighbour State

Finally, the routes learned from the K8s nodes are rendered at the L3Out VRF level, as shown in the figure below:

Figure 7. VRF Routing Table

Segmentation

To apply microsegmentation based on specific pod endpoints or services exposed by the K8s cluster, additional external EPGs can be created to match the specific workloads intended for microsegmentation. Contracts, including service graphs and policy-based redirect (PBR), are supported for this use case. A typical use case is redirecting traffic for specific critical services to a firewall that logs and performs network traffic analysis.

Conclusion

By combining Cisco ACI and Calico, customers can design Kubernetes clusters that deliver both high performance (no overlay overhead) and exceptional resilience, while keeping the design simple to manage and troubleshoot.

Version history

Version 1.0 (10-Jul-19): Initial Release

Version 2.0 (14-Jan-22): Floating SVI

 

 

 



[3] This assumes two ToR leaves per rack.
[4] Calico allocates to each node a /26 subnet from the pod subnet.
[5] The pod supernet is, by default, split into multiple /26 subnets and allocated to each node as needed.
In case of IP exhaustion, a node can potentially borrow IPs from a different node pod subnet. In that case, a host route will be advertised from the node for the borrowed IP.
More details on the Calico IPAM can be found here: https://www.projectcalico.org/calico-ipam-explained-and-enhanced

[6] This is the default behavior. Additional /26 subnets will be allocated to nodes automatically if they exhaust their existing /26 allocations. In addition, the node may advertise /32 pod specific routes if it is hosting a pod that falls outside of its /26 subnets. Refer to https://www.projectcalico.org/calico-ipam-explained-and-enhanced for more details on Calico IPAM configuration and controlling IP allocation for specific pods, namespaces, or nodes.
