NSX Compendium

VMware NSX for vSphere

By Anthony Burke – VMware NSBU Senior System Engineer.

Disclaimer: This is not an official reference and should not be treated as one. Any mistakes on this page are a reflection of my writing and knowledge, not the product itself. I endeavour to be technically accurate, but we are only human! These pages serve as a formalisation of my own notes about NSX for vSphere. Everything discussed on this page is currently shipping within the NSX product.

Introduction

This page serves as a resource for the components and deployment of VMware’s NSX for vSphere. I work with the product daily and educate customers and the industry at large on the benefits of Network Function Virtualisation (NFV) and the Software Defined Data Centre (SDDC). This resource aims to provide information at both a high level and in technical depth regarding the components and use cases for VMware NSX. This page will evolve as I add more content to it. It will eventually cover all aspects of VMware NSX and how to use, consume, and run an environment with it. In time there will be a collection of text, video and images that I am binding together into a compendium.

VMware NSX delivers a software based solution that solves many of the challenges faced in the data centre today. For a long time administrators and organisations have been able to deploy x86 compute at lightning pace. The notion of delivering an application from a template, and the excitement of doing this in the time it takes to boil the kettle, has had its sheen taken off by the three weeks it can take to provision network services.

Network function virtualisation and delivering network services in software has always been a challenge for many. The notion of not only delivering a user-space instance of a service but also being able to program the end-to-end workflow, from end user right through to storage, has been a dream for a long time. It wasn’t until VMware’s acquisition of Nicira that this came about and the ability to deliver many functions of the data centre in software took a strong foothold.

With the ability to deliver DC features such as a distributed in-kernel firewall and routing function, NSX Edge functionality, and L2 switching across L3 boundaries thanks to VXLAN, NSX re-defines the architecture of the data centre. Whilst rapidly reducing time to deploy, decreasing administrative overhead and empowering the next generation of DC architectures, NSX provides the flexibility to build and define the next generation data centre.

There are some major components of NSX which provide varying functions. This page is a technical resource for NSX and its deployment on VMware infrastructure.

Whilst NSX for vSphere is very far reaching it is surprisingly lightweight. There are only a handful of components that make up this solution to provide the final piece in VMware’s SDDC vision.

NSX Manager

The NSX Manager is one of the touch points for the NSX for vSphere solution. NSX Manager provides a centralised management plane across your data centre. It provides the management UI and API for NSX. Upon installation the NSX Manager injects a plugin into the vSphere Web Client for consumption within the web management platform. Along with providing management APIs and a UI for administrators, the NSX Manager component installs a variety of VIBs to the host when initiating host preparation. These VIBs are VXLAN, Distributed Routing, Distributed Firewall and a user world agent. The benefit of leveraging a VMware solution is that access to the kernel is much easier to obtain. With that, VMware provide the distributed firewall and distributed routing functions in kernel. This delivers extremely fast in-kernel packet processing without the inadequacies of traditional user space or physical firewall network architectures.
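
Because the UI and the REST API expose the same management plane, most of what follows can also be driven programmatically. As a rough sketch (the manager address and credentials are placeholders, and the controller-inventory path shown is the commonly documented NSX-v endpoint; adjust to your environment), a query against NSX Manager might look like this:

```python
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical NSX Manager address and credentials - replace with your own.
NSX_MANAGER = "https://nsx-manager.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!")

# Query the controller inventory via the NSX Manager REST API.
# verify=False only because lab NSX Managers commonly use self-signed certificates.
response = requests.get(
    f"{NSX_MANAGER}/api/2.0/vdn/controller",
    auth=AUTH,
    verify=False,
)
response.raise_for_status()

# NSX Manager returns XML; print it raw for inspection.
print(response.text)
```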

NSX Controller

The NSX Controller is a user space VM that is deployed by the NSX Manager. It is one of the core components of NSX and could be termed the “distributed hive mind” of NSX. It provides a control plane to distribute network information to hosts. To achieve a high level of resiliency the NSX Controller is clustered for scale out and HA.

The NSX Controller holds three primary tables: a MAC address table, an ARP table and a VTEP table. These tables collate VM and host information together and are replicated throughout the NSX domain. The benefit of this is enabling multicast-free VXLAN on the underlay. Previous versions of vCNS and other VXLAN enabled solutions required VXLAN to be enabled on the Top of Rack switches or the entire physical fabric. This carried a significant administrative overhead and removing it alleviates a lot of complexity.

By maintaining these tables an additional benefit is ARP suppression. ARP suppression allows for a reduction in ARP requests throughout the environment. This is important when layer two segments stretch across various L3 domains. If a VM sends an ARP request for an IP address whose owner resides on another host, the local host can answer from the replicated tables pushed to it by the controller.
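
To make the mechanics concrete, here is a small illustrative sketch – not NSX code – of how a host could answer an ARP request locally from controller-pushed tables before ever flooding it (table contents are made up):

```python
# Illustrative controller-pushed tables, keyed per VNI (logical switch).
arp_table = {
    (5001, "172.16.10.11"): "00:50:56:aa:bb:01",   # (VNI, VM IP) -> VM MAC
    (5001, "172.16.10.12"): "00:50:56:aa:bb:02",
}
mac_table = {
    (5001, "00:50:56:aa:bb:02"): "192.168.250.52",  # (VNI, VM MAC) -> VTEP IP
}

def handle_arp_request(vni, target_ip):
    """Answer an ARP request from local tables (ARP suppression) if possible."""
    mac = arp_table.get((vni, target_ip))
    if mac:
        # Reply locally - no broadcast ever leaves the host.
        return f"ARP reply: {target_ip} is-at {mac}"
    # Miss: fall back to querying the controller, then flooding if still unknown.
    return "ARP miss: query controller, then flood if still unknown"

print(handle_arp_request(5001, "172.16.10.12"))
```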

Roles and function

The NSX Controller has five roles:

  • API Provider
  • Persistence Server
  • Logical Manager
  • Switch Manager
  • Directory server

The API provider maintains the Web-services API which is consumed by NSX Manager. The Persistence server assures preservation across nodes of data that must not be lost, such as network state information. The Logical manager deals with the computation of policy and the network topology. The Switch manager role manages the hypervisors and pushes the relevant configuration to the hosts. The Directory server focuses on VXLAN and the distributed logical routing directory of information.

Whilst each role requires its own master, the masters for different roles can be elected onto the same or different hosts. If a node failure occurs and a role is left without a master, a new node is promoted to master for that role after the election process.

Most deployment scenarios see three, five or seven controllers deployed. This is due to the controller running ZooKeeper. A ZooKeeper cluster, known as an ensemble, requires a majority to function and this is best achieved through an odd number of machines. This tie-breaker scenario is used in many cases and HA conditions during NSX for vSphere operations.
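
The majority requirement is simple arithmetic: an ensemble of n members needs floor(n/2) + 1 alive to keep functioning, which is why odd sizes give the best failure tolerance per node added. A quick illustration:

```python
def quorum(n):
    """Members required for a ZooKeeper-style ensemble of n nodes to keep a majority."""
    return n // 2 + 1

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {n - quorum(n)} failure(s)")
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 4 nodes: quorum 3, tolerates 1 failure(s)  <- a 4th node adds no extra tolerance
# 5 nodes: quorum 3, tolerates 2 failure(s)
# 7 nodes: quorum 4, tolerates 3 failure(s)
```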

Slicing

In a rapidly changing environment that may see multiple changes per second, how do you dynamically distribute workload across available cluster members, re-arrange workloads when new cluster members are added and sustain failure without impact, all while this occurs behind the scenes? Slicing.

Controller Slicing 01

A role is told to create a number of slices of itself. The application collates its slices and assigns each object to a slice. This ensures that the failure of an individual node cannot take down an entire NSX Controller role.

Controller Slicing 02

When a failure of a Controller node occurs, the slices that the controller was in charge of are replicated and redistributed onto the surviving controllers. This ensures consistent network information and continuous state.
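
A minimal sketch of the idea – purely illustrative, not the NSX implementation – where slices of a role are spread across controller nodes and redistributed across the survivors on failure:

```python
from itertools import cycle

def assign_slices(num_slices, nodes):
    """Spread slices of a role round-robin across the available controller nodes."""
    owners = {}
    node_cycle = cycle(nodes)
    for slice_id in range(num_slices):
        owners[slice_id] = next(node_cycle)
    return owners

def handle_failure(owners, failed, nodes):
    """Reassign the failed node's slices across the remaining nodes."""
    survivor_cycle = cycle([n for n in nodes if n != failed])
    return {s: (next(survivor_cycle) if owner == failed else owner)
            for s, owner in owners.items()}

nodes = ["controller-1", "controller-2", "controller-3"]
owners = assign_slices(9, nodes)
owners = handle_failure(owners, "controller-2", nodes)
print(owners)  # controller-2's slices now live on controller-1 and controller-3
```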

VXLAN

VXLAN is a multi-vendor, industry-supported network virtualisation technology. It enables much larger networks to be built at layer 2 without the crippling scale limitations found with traditional layer 2 technologies. Where a VLAN tags a layer 2 frame with a logical ID, VXLAN encapsulates the original layer 2 frame inside a VXLAN header, UDP header and outer IP headers. From a virtual machine perspective, VXLAN enables VMs to be deployed on any server in any location, regardless of the IP subnet or VLAN that the physical server resides in.

VXLAN solves many issues that have arisen in the DC through the implementation of Layer 2 domains.

  • Creation of large Layer 2 domains without the associated blast radius.
  • Scales beyond 4094 VLANs.
  • Enables layer 2 connectivity across traditional DC boundaries.
  • Enables smarter traffic management abstracted from the underlay.
  • Enables large layer 2 networks to be built without the high consumption of CAM table capacity on the ToR.
  • VXLAN is an industry-standard method of supporting layer 2 overlays across layer 3. There is an alliance of vendors supporting a variety of VXLAN integrations: as a software feature on hypervisor-resident virtual switches, on firewall and load-balancing appliances, and on VXLAN hardware gateways built into L3 switches.

VXLAN Header

Scaling beyond the 4094 VLAN limitation of traditional switches has been solved thanks to the 24 bit VXLAN Network Identifier (VNI). Similar to the field in the VLAN header where a VLAN ID is stored, the 24 bit identifier allows for 16 million potential logical networks.
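
For illustration only, the 8-byte VXLAN header can be built by hand: a flags byte with the I bit (0x08) marking the VNI as valid, 24 reserved bits, the 24-bit VNI itself, and a final reserved byte. A short sketch:

```python
import struct

def vxlan_header(vni):
    """Build the 8-byte VXLAN header for a given 24-bit VNI (illustrative only)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08000000          # I flag set: VNI field is valid; remaining bits reserved
    vni_field = vni << 8        # 24-bit VNI followed by 8 reserved bits
    return struct.pack("!II", flags, vni_field)

print(vxlan_header(5001).hex())   # 0800000000138900
print(f"{2**24:,} possible VNIs vs {2**12 - 2:,} usable VLANs")
```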

VXLAN Enhancements – Data Plane

There are a few VXLAN enhancements in NSX for vSphere. It is possible to support multiple VXLAN vmknics per host, which allows uplink load balancing. QoS is supported by copying the DSCP and CoS tags from the internal frame to the external VXLAN header. It is also possible to provide guest VLAN tagging. Due to the VXLAN format used, there is potential to later consume hardware offload for VXLAN in network adapters such as those from Mellanox.

VXLAN Enhancements – Control Plane

Control plane enhancements come through adjustments in the VXLAN headers. These allow the removal of multicast or PIM routing on the physical underlay. It is also possible to suppress broadcast traffic in VXLAN networks, thanks to ARP directory services and the role the NSX Controller plays in the environment.

VXLAN Replication – Control Plane

Unicast mode along with Hybrid mode selects a single VTEP in every remote segment. This is selected from its mapping table. This VTEP is used as a proxy. This is performed on a per-VNI basis and load is balanced across proxy VTEPs.

Unicast mode calls this proxy a UTEP – Unicast Tunnel Endpoint. Hybrid mode calls this a MTEP – Multicast Tunnel End Point. The table of UTEPs and MTEPs are synchronised to all VTEPs in the cluster.

Optimised replication occurs because the source VTEP performs software replication of broadcast, unknown unicast and multicast (BUM) traffic. This replication is to local VTEPs and to one UTEP/MTEP for each remote segment.

VXLAN Header Breakdown

This is achieved through an update to how NSX uses VXLAN. A REPLICATE_LOCALLY bit in the VXLAN header is used for this. This is used in the Unicast and Hybrid modes. A UTEP or MTEP receiving a unicast frame with the REPLICATE_LOCALLY bit set is now responsible for injecting the frame into the local network.

VXLAN Unicast UTEP

The source VTEP replicates an encapsulated frame to each remote UTEP via unicast and also replicates the frame to each active VTEP in the local segment. The UTEP role is responsible for delivering a copy of the de-encapsulated inner frame to its local VMs.
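
A simplified model of the replication decision – illustrative only, with made-up VTEP addresses – showing a source VTEP sending a copy to every other VTEP on its own segment and a single copy to the elected proxy in each remote segment:

```python
# Illustrative VTEP membership for one VNI, grouped by L3 segment (subnet).
segments = {
    "192.168.250.0/24": ["192.168.250.51", "192.168.250.52", "192.168.250.53"],
    "192.168.150.0/24": ["192.168.150.51", "192.168.150.52"],
}
# One proxy per remote segment: a UTEP in unicast mode, an MTEP in hybrid mode.
proxies = {"192.168.150.0/24": "192.168.150.51"}

def replication_targets(source_vtep, source_segment):
    """Targets for a BUM frame: all other local VTEPs plus one proxy per remote segment."""
    targets = [v for v in segments[source_segment] if v != source_vtep]
    targets += [proxies[seg] for seg in segments if seg != source_segment]
    return targets

print(replication_targets("192.168.250.51", "192.168.250.0/24"))
# ['192.168.250.52', '192.168.250.53', '192.168.150.51']
```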

VXLAN Unicast UTEP replication

This alleviates dependencies on the physical network, though a slight overhead is incurred. The mode is configurable per VNI during the provisioning of the logical switch.

Preparing for VXLAN

NSX Manager deploys the NSX Controllers. A subsequent action after deploying the controllers is preparing the vSphere clusters for VXLAN. Host preparation installs the network VIBs onto hosts in the cluster. These are the Distributed Firewall, LDR and VXLAN host kernel components. After this an administrator creates VTEP VMkernel interfaces for each host in the cluster. The individual host VMkernel interfaces can be allocated IPs from a pool that can be set up.

Because the original L2 frame is encapsulated, the Ethernet payload increases by 50 bytes of overhead. An MTU of 1600 is recommended on the physical underlay.
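
The 50 bytes come straight from the outer headers wrapped around the original frame (assuming an IPv4 underlay and no outer VLAN tag). A standard 1514-byte inner frame therefore needs roughly a 1550-byte underlay MTU, so 1600 leaves comfortable headroom:

```latex
\underbrace{14}_{\text{outer Ethernet}} + \underbrace{20}_{\text{outer IPv4}} + \underbrace{8}_{\text{UDP}} + \underbrace{8}_{\text{VXLAN}} = 50\ \text{bytes},\qquad 1514 + 50 - 14 = 1550 < 1600
```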

Transport Zone

A transport zone is created to delineate the width of the VXLAN scope. This can span one or more vSphere clusters. An NSX environment can contain one or more transport zones based on user requirements. The use of transport zone types is interchangeable and an environment can have unicast, hybrid and multicast communication planes.

Transport zones can be used as a method to further carve up infrastructure which is under a single NSX administrative domain. For example, an administrator may have a DMZ environment. Allocating the hosts that provide the DMZ functionality to a separate transport zone ensures virtual functions – namely Distributed Logical Routers and Logical Switches – are confined to the scope of that transport zone. The rest of the infrastructure will be assigned to another transport zone.

Another example of this is when a business is seeking to achieve a compliant architecture. Whilst NSX can build a compliant mixed-mode architecture, there are QSAs out there who deem an appropriate gap between elements’ control planes necessary. Using a separate transport zone for elements that are inside an environment requiring a PCI compliant architecture can provide further assurance for the elements attached.

Transport Zone Control Plane communication

  • Multicast mode leverages multicast IP addresses on the physical underlay network for control plane VXLAN replication. It is the recommended transport zone control plane mode when upgrading from older VXLAN deployments. It requires PIM or IGMP on the physical network.
  • Unicast control plane mode is handled entirely by the NSX Controller. The controller creates and replicates the VTEP, ARP and MAC tables, which are subsequently distributed to eligible clusters in the transport zone.
  • Hybrid is an optimised unicast mode that offloads local traffic replication to the physical network, which requires L2 multicast. IGMP snooping on the first-hop switch is required but PIM is not. The first-hop switch replicates the traffic for the subnet.

NSX Logical Switching

The NSX logical switch creates logically abstracted segments to which applications or tenant machines can be wired. This provides administrators with increased flexibility and speed of deployment whilst providing traditional switching characteristics. The environment allows traditional switching without the constraints of VLAN sprawl or spanning-tree issues.

A logical switch is distributed and reaches across compute clusters. This allows connectivity in the data centre for virtual machines. Delivered in a virtual environment, this switching construct is not restricted by historical MAC/FIB table limits, because the broadcast domain is a logical container that resides within software.

With VMware NSX a logical switch is mapped to a unique VXLAN. When mapped to a VXLAN the virtual machine traffic is encapsulated and is sent out over the physical IP network. The NSX controller is a central control point for logical switches. Its function is to maintain state information of all virtual machines, hosts, logical switches and VXLANs on the network.

Segment ID range

The segment ID range pool is configured on setup and preparation of the host cluster. VNI IDs are allocated from this pool and one ID is allocated per Logical Switch. If you defined an example range of 5000-5999 you could provision 1000 logical switches within that range.

Logical Switch layout

When creating a logical switch it is wise to first consider what you are connecting to it. The creation of a logical switch will consume a VXLAN ID from the segment ID range pool previously defined. Upon creation you will select the control plane replication mode, aligned with the transport zone selected for the control plane.

VXLAN Logical Switch across Different Hosts 01

The logical topology of a NSX logical switch looks like this. This highlights the seamless L2 nature the VMs experience even though they traverse different L3 boundaries.

What is really happening?

VXLAN Logical Switch on Different Hosts 02

When Web01 communicates with Web02 it communicates over the VXLAN transport network. When the VM communicates and the switch looks up the MAC address of Web02, the host already knows from the ARP/MAC/VTEP tables pushed to it by the NSX Controller where this VM resides. The frame is forwarded out into the VXLAN transport network, encapsulated within a VXLAN header and routed to the destination host based on the knowledge of the source host. Upon reaching the destination host the VXLAN header is stripped off and the preserved internal IP packet and frame continues to the destination VM.

Logical Distributed Routing

NSX for vSphere provides L3 routing without leaving the hypervisor. Known as the Logical Distributed Router, this advancement sees routing occur within the kernel of each host, distributing the routing data plane across the NSX enabled domain. It is now possible to optimise traffic flows between network tiers, no longer break out to core or aggregation devices for routing, and support single or multi-tenancy models.

Logical routing provides scalable routing, supporting a large number of LIFs – up to 1000 per Logical Distributed Router. This, along with support for dynamic routing protocols such as BGP and OSPF, allows for scalable routing topologies. An additional benefit is that there is no longer the hair-pinning of traffic found in traditional application and network architectures. The LDR allows for heavy optimisation of east-west traffic flows and improves application and network architectures.

DLR Logical Overview

Data Path components

Logical interfaces, known as LIFs, are configured on logical routers. They are analogous to switched virtual interfaces or routed virtual interfaces (SVIs/RVIs) on traditional network infrastructure. IP addresses are assigned to LIFs and there can be multiple LIFs per Logical Distributed Router. When a LIF is configured it is distributed to all other hosts. An ARP table is also built and maintained for every LIF.

DLR Logical on the same host

The above image highlights how routing between two L3 segments on a single host occurs. The LIF or gateway for each tier is in the LIF table. The routing table is populated with directly connected networks. When a packet destined for the App tier reaches its gateway it has its destination MAC re-written and is placed onto the L3 segment in which the destination resides. This is all done in kernel and does not require the packet to traverse the physical infrastructure for L3 function.
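
A rough sketch of that in-kernel decision – illustrative only, with made-up LIF, prefix and MAC values: find the connected LIF for the destination, resolve the destination VM’s MAC via that LIF’s ARP table, then rewrite and deliver onto the destination segment:

```python
import ipaddress

# Illustrative LIF table for one Distributed Logical Router instance.
lif_table = {
    "172.16.10.0/24": {"name": "web-lif", "vni": 5001},
    "172.16.20.0/24": {"name": "app-lif", "vni": 5002},
}
# ARP entries learnt per LIF: (LIF name, IP) -> VM MAC.
arp_per_lif = {("app-lif", "172.16.20.11"): "00:50:56:aa:cc:11"}

def route_in_kernel(dst_ip):
    """Pick the connected LIF for dst_ip and rewrite the destination MAC."""
    for prefix, lif in lif_table.items():
        if ipaddress.ip_address(dst_ip) in ipaddress.ip_network(prefix):
            dst_mac = arp_per_lif.get((lif["name"], dst_ip), "arp-needed")
            return {"egress_lif": lif["name"], "vni": lif["vni"], "dst_mac": dst_mac}
    return {"egress_lif": None}   # no connected route: hand off towards the default gateway

print(route_in_kernel("172.16.20.11"))
# {'egress_lif': 'app-lif', 'vni': 5002, 'dst_mac': '00:50:56:aa:cc:11'}
```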

DLR on Different Host 01

The above image demonstrates a similar scenario to the single host in-kernel routing. The same routing process applies when dealing with different host scenarios. Where this differs is where the ARP lookup occurs. After being routed locally, in kernel and closest to the source, the packet is placed onto the App segment and sent towards the VXLAN transport network.

DLR on Different Hosts 02

The VXLAN vmk interface wraps the packet destined for App02 in a VXLAN header. Knowing the destination from the tables built by the controllers and pushed to each host, the encapsulated packet is forwarded to the destination VTEP. Upon reaching the destination host with App02 on it, the VXLAN header is stripped off and the packet is routed to its destination on the local segment.

There is a MAC address assigned to each LIF. This is known as the vMAC. It is the same on all hosts and is never seen by the physical network. The physical uplink interface has a pMAC associated with it. This is the interface through which traffic flows to the network. If the uplink is to a VLAN network the pMAC is seen, whereas a VXLAN uplink will not expose the pMAC.

It is important to remember that the pMAC is not the physical MAC address. These MAC addresses are generated for the number of uplinks on a VDS enabled for logical routing. The vMAC is replaced by the pMAC on the source host after the routing decision is made but before the packets reach the physical network. Once arriving at the destination host, traffic is sent directly to the virtual machine.

Control VM and distributed routing operations

The control VM is a user space virtual machine that is responsible for the LIF configuration and control-plane management of dynamic routing protocols, and works in conjunction with the NSX Controller to ensure correct LIF configuration on all hosts.

DLR Routing control

When deploying a Logical Distributed Router the following Order of Operations occurs:

  1. When deploying a Logical Distributed Router, a logical router control VM is deployed. NSX Manager creates the instance on the controller and hosts. This is a user space VM and should be deployed on the edge and management cluster.
  2. The controller pushes a new LIF configuration to hosts.
  3. Routing updates are received from an external router. This can be any device.
  4. The LR control VM sends route updates to the controller.
  5. The controller sends these route updates to the hosts.
  6. The routing kernel module on the hosts handles the data path traffic.

VLAN LIF

Not all networks require or have VXLAN connectivity everywhere. The Logical Distributed Router can have an uplink that connects to VLAN port groups. First hop routing is handled in the host and traffic is then routed into the VLAN segment. There must be a VLAN ID associated with the dvPortGroup. VLAN 0 is not supported. VLAN LIFs require a designated instance.

VLAN LIFs generally introduce some design constraints to a network. This uplink type is limited by the design consideration of one port group per virtual distributed switch, and there can only be one VDS. The same VLAN must span all hosts in the VDS. This doesn’t scale, as network virtualisation seeks to reduce the consumption of VLANs.

Designated Instance

The role of a Designated Instance is to resolve ARP on a VLAN LIF. The election of a host as the DI is performed by the NSX Controller. This information is subsequently pushed to all other hosts. Any ARP requests on the particular segment or subnet are handled by that host. If the host fails or is removed, the Controller selects a new host as the Designated Instance. This information is then re-advertised to all hosts.

VXLAN LIF

VXLAN LIFs are a more common uplink type. Logical Distributed Routing works with VXLAN logical switch segments. First hop routing is handled on the host and traffic is routed to the corresponding VXLAN. If required, it is encapsulated to travel across the transport network to reach the destination on another host.

A Designated Instance is not required in the case of VXLAN LIFs. The next hop router is generally a VM within the transport zone – such as an NSX Edge Services Gateway. It is recommended that Distributed Logical Routing leverages VXLAN LIFs as they work best with this feature. A VXLAN LIF can span all VDS in the transport zone.

LIF Deployment types

There are three internal-to-uplink LIF configurations in which LIF interfaces can be used:

  1. VXLAN to VXLAN
  2. VXLAN to VLAN
  3. VLAN to VLAN

NSX Edge Services Gateway

The NSX Edge Services Gateway is a critical component of NSX. The virtual appliance provides a vast array of network functionality. As the evolution of the vCNS gateway edge, the NSX ESG is a leaner, meaner and resource optimised gateway. It can provide nearly 10Gbps of throughput. This is almost double what other virtual appliances can move through them. The joys of owning the hypervisor, right? Forming one of the termination points between the physical and virtual worlds, the NSX Edge can provide routing, VPN services, firewall capability, L2 bridging and load balancing.

Virtual appliance

Deployed as an OVA by the NSX Manager, the NSX Edge has a few requirements. Being a virtual appliance, unlike its in-kernel counterparts, each Edge requires vCPU, memory and storage resources. It should be treated like any virtual machine would be.

Requirements

The requirements are as follows:

  • Compact 1vCPU, 512MB RAM
  • Large 2vCPU, 1024MB RAM
  • Quad-Large 4vCPU, 4096MB RAM
  • Extra-Large 6vCPU, 8192MB RAM

As a rule of thumb, your mileage will vary based on the applications and workloads that reside behind the Edge. VMware has found that Quad-Large is good for a high-performance firewall whilst Extra-Large is suitable for load balancing and SSL offload. This may be for numerous pools or a single application pool. A simple interaction allows the re-sizing of the Edge gateway. This goes both ways – small to large and large to small. This suits environments where performance is known to grow for a certain event or period of time.

Routing

A key function of the NSX Edge is acting as an L3 gateway. The ability to provide a subnet interface and attach a logical segment allows for optimisation of network traffic. No longer do virtual workloads require an SVI/RVI on a physical Top of Rack or aggregation switch; these interfaces can live on NSX Edges. In most topologies the Distributed Logical Router is used for this function to provide in-kernel LIFs. There are some topologies, such as a micro segment residing behind an Edge on a single logical switch, that use an Edge for an L3 gateway plus NAT. With that said, connectivity needs to be provided to these networks. They need to be reachable.

NSX Edges support OSPF, BGP (external and internal), IS-IS and static routing. This provides administrators flexibility in how they choose to peer with the physical infrastructure and advertise subnets into the network. ECMP support is also new in 6.1, allowing multiple routes to a destination. This provides redundancy and resiliency in the IP network.

Redistribution plays a critical part in a scalable and dynamic network. It is possible to redistribute from one protocol to another. Prefix list filtering is also available.

NAT

Network Address Translation is a staple of modern networks. NAT can be performed for traffic that flows through the Edge. Both Source and Destination NAT are supported.

In customer environments where hosting of applications occurs, such as a cloud platform, NAT plays a critical role in IP address reuse. Where topologies are defined by a catalogue or template, the reuse of IP addresses allows for simple topology deployment. On an NSX Edge, NAT can be used to translate a private range to a public range. Servers can be reached through this NAT’d IP address. This allows public IP addresses to be consumed only on the NSX Edge as opposed to by all virtual machines within a topology.

VPN Services

The Edge gateway provides a number of VPN services. Layer 2 and Layer 3 VPNs give flexibility in deployment. L2 VPN services allow connectivity between separate DCs, giving layer 2 domains the ability to connect to each other. With the decoupling of the NSX Edge gateway in 6.1, it is possible to create an L2 VPN tunnel from a non-NSX-enabled environment to an NSX enabled cloud/environment. This enables the ability to move and connect workloads with other sites or clouds.

L3 VPNs allow for IPsec Site to Site connections between NSX edges or other devices. SSL VPN connections also allow users to connect to an application topology with ease if security policies dictate this.

L2 Bridging

The NSX Edge supports the ability to bridge an L2 VLAN into a VXLAN. This allows connectivity to a physical VLAN or a VLAN backed port group. Connecting to physical workloads is still a reality in this day and age. The ability to bridge allows migration from P-to-V, connection to legacy systems, and a host of other use cases.

Firewall

The NSX Edge provides a stateful firewall which complements the Distributed Firewall. Whilst the Distributed Firewall is primarily used for enforcing intra-DC communication, the NSX Edge can be used for filtering communication that leaves an application topology.

When configuring rulesets via the Distributed Firewall or from a Security Policy, the administrator can select NSX Edges from the Applied To field.

An example of this:

The customer sought to adjust the default Edge firewall rules from a single point. Using the Web Client, browse to Network & Security > Firewall.

Here is an example ruleset:
* SRC IP Set A (192.168.32.0/24, 172.16.203.0/24)
* DST mgt-sv-01a
* Port RDP
* Action Block

By default this would be applied to the Distributed Firewall after saving the ruleset. This is not what was desired.

Applied To - 01

If the administrator needs to enforce this across all edges as per the original request then this can be done by modifying the Applied To field.

Applied To - 02

Checking the box that states “Apply this rule on all the Edge Gateways” enforces this rule on all NSX Edge firewalls.

Alternatively you could select the Edges that are pertinent to a customer or application topology. This can be done by not selecting Apply to All and calling out the individual objects.

Applied To - 03

The result under the Firewall management panel looks something like this.

Applied To - 04

To validate that it is applied, browse to the relevant Edge you have enforced the rule on: Networking & Security > NSX Edges and, in this case, double click on Edge-Gateway-01. Select the Firewall tab. The result is the following.

Applied To - 05

You will find the rule is faithfully represented and enforced on the edge.

Load balancing

NSX Edge provides a load balancing function. For most server farms the features and options provided to server pools suit most real world requirements. Where the NSX Load Balancer does not meet the requirements, partner integration such as F5 allow administrators flexibility in their application deployments.

DHCP Servers

The NSX Edge can be a DHCP server for the application topology that resides behind it. This allows automatic IP address management and simplification. Customers using a hosted platform do not need to rely on an infrastructure management solution and can use DHCP from the Edge. Handy for environments that require dynamic addressing but are volatile in nature (development, test environments).

DHCP Relay

In 6.1, NSX added DHCP relay support. Before this, the NSX Edge could either be the DHCP server or the application topology that resided behind it required its own DHCP server. This wasn’t always suitable for customers or application topologies. DHCP relay support on the NSX Edge and DLR allows for the relaying of the Discover, Offer, Request and Acknowledge messages that make up DHCP.

In this scenario the messages are proxied by the DHCP relay enabled Edge to a device running DHCP, in this case our infrastructure server. This means in certain environments you can have a centralised server cluster that manages IP addresses, which the numerous services that reside behind different networks can access. This can be configured on the DLR or ESG.

Numerous relay agents can reside in the data path and they support numerous DHCP servers. Any server listed will have requests sent to it. One thing it cannot do is DHCP option 82, and overlapping address space is not supported.

If you issue the show config dhcp command on the Edge gateway you get the following output:

vShield Edge DHCP Config:
{
   "dhcp" : {
      "relay" : {
         "maxHopCounts" : 10,
         "servers" : [
            "192.168.254.2"
         ],
         "agents" : [
            {
               "interface" : "vNic_1",
               "giaddr" : [
                  "192.168.10.1"
               ]
            }
         ]
      },
      "logging" : {
         "enable" : true,
         "logLevel" : "debug"
      },
      "enable" : true,
      "bindings" : {},
      "leaseRotateTime" : 900,
      "leaseRotateThreshold" : 10000
   }
}

DHCP requests are expected to arrive on vNic_1 – 192.168.10.1 – which acts as the relay agent (giaddr). Requests are relayed to the server 192.168.254.2 with a maximum hop count of 10. Lease rotation is configured (rotate time 900, threshold 10000) and debug-level logging is enabled.

DHCP Relay

Equal Cost Multipathing

With the advent of NSX 6.1 for vSphere, Equal Cost Multi Path (ECMP) has been introduced. Each NSX Edge appliance can push up to 10Gbps of traffic through it. There may be applications that require more bandwidth, and ECMP helps solve this problem. It also allows increased resiliency. Instead of active/standby scenarios where one link is not used at all, ECMP can enable numerous paths to a destination. Load-sharing also means that when failure occurs only a subset of bandwidth is lost, not feature functionality.

ECMP

North-South traffic is handled by all active edges. ECMP can be enabled on both the distributed logical router and the NSX edge appliance. As long as the path is equal cost then there will be multiple routes usable for traffic.

ECMP Peering

The OSPF adjacencies shown here highlight that there are peerings between the DLR and all the Edges, and between the Edges and the physical router. Confirm this with show ip route and it will demonstrate that routes to the destination have multiple equal cost next hops.

ECMP Flow

Here you can see that traffic takes varying paths inbound and outbound because there are only two hops between assets behind the DLR and the physical infrastructure.

vSphere HA should be enabled for the NSX Edge VMs to help achieve higher levels of availability during failure. It is also recommended that timers are aggressively tuned. Hello and hold timers should be 1 and 3 seconds respectively to speed up traffic recovery.

It is important to remember that Active/Active ECMP has zero support for stateful firewall, load balancing or NAT services on Edges running ECMP. This is because the NSX Edges do not share connection state amongst each other.

An example of this would be three NSX Edges participating in a routing domain: Edges 1, 2 and 3. All Edges peer with a DLR southbound on a common subnet. All Edges peer with an upstream device northbound on a common subnet. Either BGP or OSPF can be used, but this example uses BGP.

Hashing mechanisms

The importance of hashing cannot be overlooked. ECMP in NSX uses one of two hashing approaches. On the NSX Edge, load-balancing is based on the Linux kernel. It uses a flow based random round robin for next-hop selection. A flow is determined by a pair of source and destination IPs. On the Distributed Logical Router the hashing is simply done on the source and destination IP.

The Linux flow based random round robin algorithm is rather interesting. It uses the concept of budgets. The kernel defines the budget as the sum of all next-hop weights. Each next hop, when initialised, is assigned a budget of 1. Each round, the kernel generates a random value from 0 to the total round-robin budget. It searches the next-hop list until it finds a next hop with a budget that is equal to or greater than the generated random value. It decrements both the round-robin budget and the selected next-hop budget after selection.
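
A Python sketch of that budget mechanism as described above – an approximation for intuition, not the kernel code:

```python
import random

class FlowBasedRoundRobin:
    """Rough sketch of the budget-based random round robin described above."""

    def __init__(self, next_hops):
        self.next_hops = list(next_hops)
        self._new_round()

    def _new_round(self):
        # Each next hop starts a round with a budget of 1 (equal weights); the
        # round-robin budget is the sum of all next-hop budgets.
        self.budgets = {hop: 1 for hop in self.next_hops}
        self.round_budget = sum(self.budgets.values())

    def select_for_new_flow(self):
        if self.round_budget <= 0:
            self._new_round()
        target = random.randint(1, self.round_budget)
        for hop in self.next_hops:
            if self.budgets[hop] >= target:
                # Decrement both the round-robin budget and the selected hop's budget.
                self.budgets[hop] -= 1
                self.round_budget -= 1
                return hop
            target -= self.budgets[hop]   # skip past hops that cannot satisfy the draw
        return self.next_hops[-1]         # not expected with consistent budgets

rr = FlowBasedRoundRobin(["edge-1", "edge-2", "edge-3"])
# Each new flow (a source/destination IP pair) triggers one next-hop selection.
for flow in [("10.0.0.1", "172.16.10.11"), ("10.0.0.2", "172.16.10.11"), ("10.0.0.3", "172.16.10.12")]:
    print(flow, "->", rr.select_for_new_flow())
```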

When an Edge fails, the corresponding flows are re-hashed through the remaining active Edges. Once the adjacencies time out and the routing table entries are removed, a re-hash is triggered.


Distributed Firewall

Within VMware NSX for vSphere there is a feature known as the Distributed Firewall. The Distributed Firewall provides an in-kernel firewall that enforces policy at the guest level. This is done by taking rule sets applied centrally at the vCenter level and matching them down at the vNIC level. This ensures consistent enforcement in a distributed fashion irrespective of where the workload is placed. By optimising traffic flow with a vNIC level firewall, the Distributed Firewall removes the odd traffic manipulation otherwise needed to reach a firewall attached to an aggregation or services layer.

Before the Distributed Firewall was brought to market there was an unmet need for east-west firewalling. There has long been a focus on perimeter security, brought on through the industry focus on north-south application and network architectures that placed security at the DMZ and internet edge. Firewalls were littered around on a per-application basis but nothing targeted east-west enforcement. Virtual appliances such as vShield App, vSRX gateway and ASA gateway permeated the market, but being virtual appliances they were limited by poor performance – generally 4-5 Gbps – and a reduced feature set. Each also had licensing issues and a substantial memory and vCPU footprint which made scaling horizontally quite the issue. Not suited at all to attempting to firewall Tbps of lateral traffic. Enter the Distributed Firewall, which scales based on CPU allowing upwards of 18+ Gbps per host.

To use the Distributed Firewall an administrator can use two touch points – vCenter Web Client or the REST API client via NSX Manager. Validation can be performed in three places – on the ESX host itself, on the NSX Manager with Central CLI, or via the vSphere Web Interface. The REST API and the vCenter Web Client will propagate all rule changes to all hosts within an NSX enabled domain. Host and Manager CLI access provides advanced troubleshooting and verification techniques.

Component Interaction

Upon cluster and host preparation a Distributed Firewall vSphere Installation Bundle (VIB) is installed on every host. This package is installed by NSX Manager via the ESX Agency Manager (EAM). This VIB, the esx-vsip kernel module, is instantiated and the vsfwd daemon is automatically started in the user space of the hypervisor.

vCenter communicates the NSX Manager IP address to the host. NSX Manager then communicates directly with the host through the User World Agent (UWA), speaking to a message bus (RabbitMQ in this case) on tcp/5671.

NSX Manager sends rules to the vsfwd user world process over the message bus in a format known as protobuf. Protocol Buffers serialise structured data into a stream of bytes that communicating programs can parse back into structured data.

The vsfwd process converts the protobuf messages into VMkernel ioctls and configures the Distributed Firewall kernel module (vsip) with the appropriate filters and rules. It is important to note that firewall filters are created for all virtual machines on all hosts unless an ‘Applied To’ or an exclusion has been applied.

The message bus used by NSX Manager is based on the Advanced Message Queuing Protocol (AMQP). It allows NSX Manager to securely transfer firewall rules to the ESX host. AMQP is an open standard application layer protocol for message-oriented middleware.

The VMware Internetworking Service Insertion Platform (esx-vsip) is the main component of the Distributed Firewall kernel module. vsip receives firewall rules from NSX Manager and injects them into the relevant vNICs.

The vNIC-FW is the construct where firewall rules are stored and subsequently enforced, at the vNICs which present connectivity to a virtual machine. The memory space of this construct contains two tables: the rule table and the connection tracker table.

The Distributed Firewall maintains two tables: the rule table and the connection tracker table. The rule table is the collection of matching rules based on the defined matching criteria. This is a standard numerically indexed table that can match on IP addresses, ports and vCenter objects. The connection tracker table’s function is to manage current flows through the firewall, which amounts to the traffic permitted by the firewall. The first packet of each flow is inspected against the rule table, and the connection tracker table then acts as the fast path.

Building Distributed Firewall filters

Tight integration with vCenter and the hypervisor puts the Distributed Firewall into a unique position. It can use the traditional source and destination IP and ports to create filters and object groups. It can also take advantage of vCenter objects such as Clusters, Data Centers, Logical Switches, Tags, VM name, Guest Operating System and more. This allows administrators to use context in rule sets that allow granularity and scalability.

What role does VM Tools play?

Depending on your version of deployed NSX for vSphere there are two paths.

NSX for vSphere 6.0.x / 6.1.x

Distributed Firewall IP address or IP Set rules under 6.0/6.1 do not require VM Tools to be installed on the guest VM.

Distributed Firewall vCenter objects under NSX 6.0/6.1 require VM Tools installed on the guest. When VM Tools is stopped on the guest the IP address of the VM is removed from the DFW address set. This results in the specific security policy that is written for the VM no longer being enforced. Only default rules apply.

NSX for vSphere 6.2

NSX for vSphere 6.2 removes the dependency on VM Tools on guest virtual machines. Distributed Firewall rules can be written and enforced using vCenter objects without VM Tools. This supports both IPv6 and IPv4.

The method of enforcing Distributed Firewall rules based on vCenter objects with VM Tools is known as VM Trust Mode. VM Trust Mode uses the VM Tools to validate the IP address to vNIC mapping to enforce the rules.

In NSX for vSphere 6.2 the new mode is Secure Mode. This method of enforcement removes the dependency on VM Tools for IP validation. SpoofGuard is considered a trusted source for the IP address to vNIC mapping. SpoofGuard can be used for IPv4 and IPv6 addresses.

SpoofGuard has two modes of enablement:
* Trust on First Use
* Manual mode / Approval

Trust on First Use will accept the first IP address seen via ARP on the vNIC and register this back to NSX Manager via the message bus. Manual Mode / Approval will learn the first address and present the administrator an option to accept or deny the IP address learnt on the vNIC. An action must be taken before traffic can flow.


Packet Walk

From the source guest VM a packet is sent out towards the vSwitch. Before egressing onto the vSwitch from the vNIC the Distributed Firewall performs its actions. By firewalling at the vNIC with an in-kernel module it is possible to reduce the amount of unauthorised traffic within the network. A sketch of the lookup logic follows the packet walk below.

First-packet lookup

  1. A lookup is performed against the connection tracker table to check whether an entry for the flow already exists.
  2. If the flow is not in the connection tracker table it is a miss result. A rule lookup then occurs against the rule table to find a matching rule applicable to the flow.
  3. Upon finding a matching rule in the table, a new entry is created within the connection tracker table. The packets are then transmitted.

Subsequent packets

  4. A lookup is performed against the connection tracker table to check whether an entry for the flow already exists.
  5. An entry exists for the flow within the connection tracker table. The packets are transmitted.
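
A minimal model of that two-table behaviour – illustrative only: a connection tracker hit fast-paths the packet, a miss consults the ordered rule table and, on a permit, installs a new flow entry:

```python
import ipaddress

# Ordered rule table: first match wins; rule 1001 stands in for the default rule.
rule_table = [
    {"id": 1, "dst": "172.16.10.0/24", "port": 443, "action": "allow"},
    {"id": 1001, "dst": "any", "port": "any", "action": "block"},
]
connection_tracker = set()    # established flows: (src_ip, dst_ip, dst_port)

def _matches(rule, dst_ip, dst_port):
    dst_ok = rule["dst"] == "any" or ipaddress.ip_address(dst_ip) in ipaddress.ip_network(rule["dst"])
    port_ok = rule["port"] == "any" or rule["port"] == dst_port
    return dst_ok and port_ok

def process_packet(src_ip, dst_ip, dst_port):
    flow = (src_ip, dst_ip, dst_port)
    if flow in connection_tracker:                 # fast path: flow already permitted
        return "transmit (connection tracker hit)"
    for rule in rule_table:                        # slow path: first packet of the flow
        if _matches(rule, dst_ip, dst_port):
            if rule["action"] == "allow":
                connection_tracker.add(flow)       # install flow entry for subsequent packets
                return f"transmit (new flow, matched rule {rule['id']})"
            return f"drop (matched rule {rule['id']})"
    return "drop (no rule matched)"

print(process_packet("10.0.0.5", "172.16.10.11", 443))   # rule lookup, entry created
print(process_packet("10.0.0.5", "172.16.10.11", 443))   # connection tracker hit
```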

If communication is between two guests on the same host then traffic will not hit the physical network. If the two guests are on different hosts then traffic only needs to make its way to the other host. No longer is traversing a virtual appliance or physical firewall at an aggregation layer necessary.

By applying policy on ingress at the vNIC it is possible to provide granular policy control at both source and destination.

Distributed Firewall Logs

There are three types of logs that the firewall keeps and stores. They contain a variety of information. It is important to understand where these are kept if advanced troubleshooting or auditing is required.

Distributed Firewall Logging

The NSX manager stores two types of logs. Stored at /home/secureall/secureall/logs/vsm.log are Audit logs and System events.

The audit logs include administration logs and Distributed Firewall configuration changes. From an auditing perspective this includes pre and post rule changes. The System Event logs include Distributed Firewall configuration applied, filter created, deleted or failed, VMs added to a security group, and more.

On each host there is a Rules message log that is kept at /var/log/vmkernel.log.

This set of logs has PASS or DROP associations for each flow. The combination of these logs provides the required information for audited environments such as PCI-DSS and other compliance frameworks.

Distributed Firewall SpoofGuard

SpoofGuard allows NSX to ensure the IP associated with an endpoint is correct at a given time and to block traffic if an unauthorised change is detected. The vNIC level enforcement point provides a place in the network where additional network connection security can be delivered.

SpoofGuard will allow the administrator to acknowledge and validate a change of IP on a Virtual Machine. Without this acknowledgement the Virtual Machine will not be able to communicate on the network.

NSX Manager collects and learns IP addresses of all the Virtual Machines. Currently the vNIC to IP association is what information is collected – not the MAC address of the VM.

Service Composer

Service Composer within VMware NSX provides an administrator the ability to define a scalable and tiered security policy independent of the underlying infrastructure or routed topology.

This is the feature within the NSX platform that allows security to scale. Providing security that is enforced at a unit level, protecting virtual to physical or physical to virtual communications and allowing event-driven security actions, Service Composer is the beating heart of NSX.

This section introduces the numerous elements of Service Composer, their respective touch points, and how to securely enforce application workloads. The flexibility of Service Composer and how it applies to workloads is key to the security architecture enforced by NSX.

Services such as IDS/IPS, anti-malware and advanced firewall function can be inserted into the traffic flow and effectively chained together between VMs on a granular, per workload basis. API driven tagging of VMs allows services to be applied dynamically, allowing instant reaction to new threats. NSX and Service Composer provides the foundation for creating granular, zero-trust security architectures as discussed below.

Security Groups

Security Groups provide administrators a mechanism to dynamically associate and group workloads. This abstraction allows a membership definition based upon one of many vCenter constructs. An administrator has the ability to create numerous Security Groups.

Security groups can have the following types of memberships:

  • Dynamic Membership based on object, abstraction or expression
  • Static Membership based on manual selection
  • Inheritance through another Security Group. Also known as Nested.

My definition of an object, abstraction, or expression is one of the following – Security Tag, IP Set, Active Directory Group, VM Name, OS Type, Computer Name, Security Group, etc. Something that is expressed in vCenter that is not a note, folder, or label.

It is possible to match on one or more of the aforementioned objects. Membership can require matching any of the criteria or all of them. This granularity and control, within one policy or logical box, allows the right workloads to be matched.

If a workload is instantiated that matches one or all of the parameters defined by the Security Group’s membership rules, it will be associated with the Security Group. At this stage all that has occurred is a manual, dynamic, or inherited membership of workloads.
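
A toy evaluation of dynamic membership – illustrative only, since real membership is computed by NSX from vCenter inventory – showing how workloads could match a group on any, or all, of its criteria:

```python
# Hypothetical workloads with a few of the attributes NSX can match on.
workloads = [
    {"name": "web-01", "os": "Ubuntu", "security_tags": {"Web", "Production"}},
    {"name": "db-01", "os": "Windows Server 2012", "security_tags": {"DB"}},
]

# A group defined by dynamic criteria plus optional static members.
web_group = {
    "criteria": [
        lambda vm: "Web" in vm["security_tags"],      # Security Tag match
        lambda vm: vm["name"].startswith("web-"),     # VM name match
    ],
    "match_all": False,       # "any" semantics; True would require every criterion
    "static_members": set(),
}

def members(group, inventory):
    combine = all if group["match_all"] else any
    dynamic = {vm["name"] for vm in inventory if combine(c(vm) for c in group["criteria"])}
    return dynamic | group["static_members"]

print(members(web_group, workloads))   # {'web-01'}
```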

Security Tags

A Security Tag is a labelling mechanism that can be used as an abstraction to describe a state. This can be impressed upon a workload or be the matching criterion of a Security Group.
An administrator can create numerous labels to suit how they want to identify a specific workload. Given that the matching criterion of a Security Group can be a Security Tag, a workload that is tagged can be automatically placed into a Security Group. Whilst an administrator can apply a Security Tag to a workload via the Web Client, the API or 3rd party integrations can also be used to tag a workload.

Something that uses the API directly would be a cloud management platform such as vRealize Automation. When a blueprint is selected by a user or an administrator, it can be configured to tag workloads with one or many security tags. As a result the workloads will inherit membership of the relevant group.

A 3rd party integration that uses Security Tags to change the group membership of a workload is Endpoint Security. An agent-less anti-virus solution could scan the VMDK associated to a selection of workloads. On detection of a severity one threat the anti-virus solution could revoke a particular tag (say, Production Tier) and invoke a new Security Tag upon the workload.

But why not use Labels?

Security Tags may be a dose of déjà vu to VMware administrators who have used labels for a long time. Security Tags are specific and exclusive to NSX. The story goes that NSX Security Tags were introduced to the product due to the heavy usage of labels and folders. Heavy usage is a good thing – the problem is that they have been used solely with a compute mentality in mind. This meant that where roles and responsibilities were isolated there was a chance that labels and folders used by compute administrators could adversely alter the security domain.

Security Policy

Security Policies are re-usable rulesets that can be applied to Security Groups. Security Policies are created by administrators and can express three types of rulesets:

  • Endpoint Services – guest-based services such as AV solutions
  • Firewall rules – Distributed Firewall policies
  • Network Introspection services – network services such as IDS, IPS, and encryption

Security Policies are created in such a way that any combination of rulesets can be derived.

An example of Security Policy and multiple labels – Security Policy A may take advantage of Firewall rules. It is applied to Security Group A which matches workloads based on the Security Tag – Web. Security Policy B may take advantage of Network Introspection rules that redirect tcp/1433 for Deep Packet Inspection. It is applied to Security Group B which matches workloads based on Security Tag – DB Inspection.

The workload ‘Web VM’ is subsequently tagged with the Security Tag – Web. It inherits the Distributed Firewall rules defined in Security Policy A. This policy explicitly states the following:

Security Policy ‘A’

  • SRC – 172.16.42.0/24, 172.16.43.0/24
  • DST – Policy’s Security Group
  • PRT – 443, ICMP
  • Action – Permit

Due to the reusable nature of Security Policy it is possible to match source or destination to Any, Source Security Group, Destination Security Group, or a selection of other Security Groups. This allows the Security Policy to have its Source or Destination modified based on what it is applied to further extending its reusability.

The administrator is told that all Web VMs must have Deep Packet Inspection to determine if DB queries are legitimate or of a malicious nature. Traditionally that involved a bit of network wizardry (buy me a beer and I’ll tell you how 000/911 still do it!) that may have had substantial lead times.

The 3rd Party Network Introspection solution registered with this environment is Palo Alto Networks. It enables advanced services on a per cluster basis. In short – it deploys a virtual appliance to each host in the cluster. The Distributed Firewall has an option to redirect the packet via VMCI (kernel path) to the virtual appliance. The Panorama management platform is aware of the kernel redirection rule defined by the Security Policy applied to the Distributed Firewall. The rule exposed into Panorama can have Palo Alto rules applied to it. In this case that is Deep Packet Inspection – SQL Injection. Based on the outcome of the advanced rule the packet is dropped or passed back via the kernel and out of the Distributed Firewall.

By applying the relevant third party redirection service (in this case by an Advanced policy applied to a Security Group with membership based on a Security Tag) it is possible for administrators to define advanced function on a per application basis independent of the underlying topology.

Third party service chaining

With the ability to provide advanced Network Introspection and End Point services based on membership of a Security Group or application of a Security Policy it is possible to provide per application chaining.

The ability to Service Chain workloads and provide advanced services to workloads comes about from two core abilities:

  • Application of multiple Security Tags
  • Nesting of Security Groups

These two methods define how and what is matched for the advanced service. What if I have a workload that is tagged with two Security Tags and becomes a member of two different Security Groups? Which Security Policy takes precedence?

Security Policies can have a weighting applied to them. This weight is an arbitrary number ranging, from what I have tested, from 1,000 to 16,000,000. A Security Policy with a higher weight has higher precedence. An example of a highly weighted policy that takes precedence over other policies attached to a workload might be a quarantine policy. If an Endpoint Service detects a threat it may apply the tag threat.found.sev.HIGH to a workload. A Quarantine Security Group matches group membership based on the Security Tag threat.found.sev.HIGH. That Security Group has the Security Policy ‘Quarantine – High’ applied to it. Its rules are:

Security Policy ‘Quarantine – High’

Rule 1

  • SRC – Policy’s Security Group
  • DST – AV-Remediation Security Group
  • PRT – Trend Deep Security Remediation Service Group (4118, 4120, 5274, 80, 443)
  • Action – Permit
  • Log – Disabled

Rule 2

  • SRC – Policy’s Security Group
  • DST – Policy’s Security Group
  • PRT – All
  • Action – Deny
  • Log – Enabled

The workload that was tagged with threat.found.sev.HIGH based on an Endpoint service detecting a threat will automatically inherit the above two rules, because the weighting of this policy is drastically higher than the policy that matched the workload’s original group – Security Group A.

Weighting of Security Policy allows such precedence:

  • Security Policy ‘Quarantine – High’ Weight = 9000
  • Security Policy ‘A’ Weight = 5000

This would enforce Security Policy ‘Quarantine – High’ in its entirety BEFORE Security Policy A. This is in a scenario where a workload is dual tagged based on an event.
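
A small sketch of how that weight-based precedence could be resolved when a workload picks up rules from more than one Security Policy (illustrative only; the weights reuse the example above):

```python
# Policies applied to the groups a workload belongs to, with the weights from the example.
applied_policies = [
    {"name": "A", "weight": 5000, "rules": ["permit 443/ICMP from web subnets"]},
    {"name": "Quarantine - High", "weight": 9000,
     "rules": ["permit remediation ports to AV-Remediation", "deny all within group"]},
]

def effective_rule_order(policies):
    """Higher weight wins: its entire ruleset is enforced before lower-weighted policies."""
    ordered = sorted(policies, key=lambda p: p["weight"], reverse=True)
    return [(p["name"], rule) for p in ordered for rule in p["rules"]]

for name, rule in effective_rule_order(applied_policies):
    print(f"{name}: {rule}")
# 'Quarantine - High' rules are evaluated first, then Security Policy A.
```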

How many links in the chain?

What if I wanted to have two different network introspection services on one flow type? What comes first? What is the order of operations? The Distributed Firewall for Network Introspection services is key for redirection into 3rd Party Integrations. The Distributed Firewall has 16 slots of which VMware reserve 0-3 and 12-15. Slots 4-11 can be used for registered Network Introspection services. This gives the administrator the flexibility to register services and use the correct 3rd Party Integration based on the desired outcome.

If an administrator had Palo Alto Networks and Symantec registered to the NSX for vSphere platform for IDS functionality it can be deployed on a per application basis. With the redirection policy enforced by a Security Policy applied to a Security Group there is choice down to a flow level what action is taken. Application A could leverage Symantec IDS on a flow, Application B could leverage Palo Alto IDS on a flow, and Application C could use both in order for a dual vendor strategy. The flexibility of the architecture leaves the choice to the administrator.

Resource Requirements

There is an amount of hardware required to support the virtual appliances that drive VMware NSX. This is measured in RAM, storage and CPU. There is very little overhead as you scale in terms of impact on resources. The vCPU requirement is the important one, and these numbers can help when designing an NSX environment.

Specification          vCPU  Memory  Storage  Quantity
NSX Manager            4     12GB    60GB     1
NSX Controller         4     4GB     25GB     3
NSX ESG – Compact      1     512MB   512MB    Varies
NSX ESG – Large        2     1GB     512MB    Varies
NSX ESG – Quad Large   4     1GB     512MB    Varies
NSX ESG – X Large      6     8GB     4.5GB    Varies
DLR Control VM         1     512MB   512MB    2 in HA pair; pair per DLR

NSX Firewall Port Requirements

Description Port(s) Protocol Direction
NSX Manager Admin interface 5480 TCP Inbound
NSX Manager REST API 443 TCP Inbound
NSX Manager SSH 22 TCP Inbound
NSX Manager VIB access 80 TCP Inbound
NSX Controller SSH 22 TCP Inbound
NSX Controller REST API 443 TCP Inbound
NSX Control Plane Protocol (UWA to Controllers) 1234 TCP Inbound
Message Bus Agent (AMQP) to NSX Manager 5671 TCP Inbound
NSX Manager vSphere Web access to vCenter Server 443, 902 TCP Outbound
NSX Manager to vSphere host 443, 902 TCP Outbound
VXLAN encapsulation between VTEPs 8472 UDP Both
DNS client 53 TCP + UDP Outbound
NTP client 123 TCP + UDP Outbound
Syslog 514 TCP or UDP Outbound

NSX resources and reference documents.

VMware NSX Network Virtualization Design Guide PDF

The VMware NSX design guide looks at common deployment scenarios and explores, from the ground up, the requirements and considerations in a VMware NSX deployment. I did a write up here about this when it was released and since then additional content has been added surrounding spine and leaf switch configurations. There are also sections on QoS, DMZ designs and L3 edge services offered by VMware NSX.

VMware Brownfield and Migration Solution Guide

This Solution Guide takes a look at how an administrator can integrate VMware NSX for vSphere into an existing environment or architecture. This includes NSX core infrastructure, design tips, caveats, and migration strategies.

VMware Micro segmentation technical note

This paper is an introduction to a key use case of NSX – micro segmentation. At a pseudo-technical level it looks at the place and concepts of micro segmentation and how it delivers value to existing architectures. By no longer using just IP addresses but consuming context from the virtual infrastructure as a foundation for security policy, administrators can start building the policy they need immediately.

VMware NSX leveraging Nexus 9000 and Cisco UCS infrastructure

This design guide looks at NSX that uses Cisco Infrastructure to provide network connectivity and compute resources. It talks about Cisco’s new data centre switches and how to take full advantage of NSX across a robust switching and compute infrastructure.

VMware NSX leveraging Nexus 7000 and Cisco UCS infrastructure

This new design guide looks at NSX running over the top of existing Cisco infrastructure. Cisco Nexus and UCS are a mainstay of many data centres and this design document highlights the ease with which NSX can run over the top. Packed full of UCS and Nexus tips and tricks, this guide is worth a read.

Next Generation Security with VMware NSX and Palo Alto Security VM-series

Our Net-X API provides partner integration into NSX. The network fabric which we deliver with VMware NSX can be further expanded by partners such as Palo Alto Networks. Their VM-Series user-space firewall specialises in integrating into existing PAN deployments and in Layer 7 advanced application filtering.

VMware and Arista Network Virtualization Reference Design Guide for VMware vSphere Environments

VMware has published an Arista paper. It shows network topologies that leverage Arista infrastructure integrated with VMware NSX. It demonstrates VTEP integration with hardware offload on the Top of Rack switch and more. It is worth a read if you are looking at alternatives to, or already have alternatives to, the current incumbent.

Change Log

  • v0.1 – Introduction and Core NSX components.
  • v0.2 – Logical Switching and VXLAN replication.
  • v0.3 – LDR and Reference design documentation.
  • v0.4 – Hardware requirements, Distributed Firewall.
  • v0.5 – Service Composer, Whitepapers, changed to markup

