Clearing stale records from NSX Manager

I was working in the lab and had installed a fresh vCenter 6 Server Appliance, then seized the hosts from a previous install. What remained was an old installation of NSX (the NSX Manager had previously been attached to the old vCenter) and a Log Insight deployment. I had my vCenter running with the new hosts and my new DVS built.

When I went to prepare the clusters for NSX, the VIBs deployed to each host successfully. But when I went to configure VXLAN I would get an error: "Error retrieving DVS forwarding class for switch dvs-40". Odd. I went onto my ESXi host and checked the object ID of my installed DVS: it was dvs-33. So somewhere I had stale information from my previous installation. Information about the DVS used for Logical Switching is stored in the database on NSX Manager: the DVS ID, the associated clusters, the uplinks assigned to the DVS, and the port group used for the VXLAN VMkernel interface.

This can only be performed from the shell of NSX Manager. The shell password is used by GSS/VMware Technical Support staff to gain access to the heart of NSX Manager, so please don’t ask me for it.

Sussing the database

After logging into the shell I drop into the database stored on NSX Manager. Let's see what stale information is in there.
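
For reference, getting to the database looks roughly like this. The exact invocation and database user are assumptions on my part; the prompts in the transcripts below only confirm that the Postgres database itself is called secureall.

# From the NSX Manager shell (user and database names are assumptions):
psql -U secureall secureall

-- Once inside psql, a pattern-matched \dt gives a quick map of the
-- VDN-related tables before touching anything:
secureall=# \dt vdn*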

Here are the DVS contexts that NSX Manager is aware of. Note that dvs-40 and dvs-35 are here, but there is no dvs-33. Both of those rows are stale records!

secureall=# select * from vdn_vds_context;
 id  | switch_id | mtu  | teaming_policy | vmknic_dvpg_id | promiscuous_mode
-----+-----------+------+----------------+----------------+------------------
 310 | dvs-35    | 1600 | 0              |                | f
 363 | dvs-40    | 1600 | 0              |                | f
(2 rows)

Let's see what compute clusters are registered with NSX. Again, more stale records.

secureall=# select * from vdn_cluster;
 id  | cluster_id | vlan_id | ip_pool_id      | vmknic_count
-----+------------+---------+-----------------+--------------
 315 | domain-c9  | 0       | ipaddresspool-3 | 1
 368 | domain-c7  | 0       | ipaddresspool-7 | 1
(2 rows)

Okay, old clusters. My current clusters are c3 and c5. Let's look at the uplinks on the DVS.

secureall=# select * from vds_teaming_uplink_port;
 id  | uplink_port_name | vds_context | active_port
-----+------------------+-------------+-------------
 311 | Uplink 4         | 310         | t
 312 | Uplink 3         | 310         | f
 313 | Uplink 2         | 310         | f
 314 | Uplink 1         | 310         | f
 364 | Uplink 4         | 363         | t
 365 | Uplink 3         | 363         | f
 366 | Uplink 2         | 363         | f
 367 | Uplink 1         | 363         | f
(8 rows)

As I suspected, old uplinks are pinned to each stale DVS context (with Uplink 4 marked active in both). The port groups that the VMkernel interfaces sit in will likewise reference a stale DVS context, which in turn points at a port group that no longer exists.

secureall=# select * from vdn_vmknic_portgroup;
 id  | moid           | vds_context | vlan_id | backing_status
-----+----------------+-------------+---------+----------------
 318 | dvportgroup-60 | 310         | 0       | 0
 374 | dvportgroup-64 | 363         | 0       | 0
(2 rows)

With that assumption confirmed, it is clear why the forwarding class cannot be retrieved when deploying the VXLAN VMkernel interfaces.

To cleanse some stale records

Now that we have confirmed there are a number of stale records, let's try removing the DVS context outright. That should delete all linked objects, right?

secureall=#
secureall=# delete from vdn_vds_context where id = '363'
secureall-# ;
ERROR: update or delete on table "vdn_vds_context" violates foreign key constraint "vds_teaming_uplink_port_vds_context_fkey" on table "vds_teaming_uplink_port"
DETAIL: Key (id)=(363) is still referenced from table "vds_teaming_uplink_port".

Whoops! That definitely didn't work. The delete violates a foreign key constraint, which essentially means all child records referencing the parent must be unlinked (deleted, in this case) before the parent row can be removed.
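
If you are not sure which tables hold those child records, psql itself can tell you. A minimal check, assuming the standard \d meta-command works in this shell:

secureall=# \d vdn_vds_context
-- The "Referenced by:" section at the bottom lists the foreign keys that
-- point at this table; in this case vds_teaming_uplink_port,
-- vdn_vmknic_portgroup and vdn_cluster_vds_contexts, which are exactly
-- the tables cleaned up below.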

Here we go

secureall=# delete from vdn_vmknic_portgroup where vds_context = '363';
DELETE 1
secureall=# delete from vds_teaming_uplink_port where vds_context='363';
DELETE 4
secureall=# delete from vdn_cluster_vds_contexts where vds_context_id_fk = '363';
DELETE 1
secureall=# delete from vdn_vds_context where id = '363' ;
DELETE 1
secureall=#
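
Re-running the earlier select should confirm the dvs-40 row is gone. The dvs-35 context (id 310) is just as stale, so the same sequence clears it too. This follow-up isn't in the transcript above, but it follows the exact same pattern:

secureall=# select * from vdn_vds_context;
-- the id 363 / dvs-40 row should no longer be returned
secureall=# delete from vdn_vmknic_portgroup where vds_context = '310';
secureall=# delete from vds_teaming_uplink_port where vds_context = '310';
secureall=# delete from vdn_cluster_vds_contexts where vds_context_id_fk = '310';
secureall=# delete from vdn_vds_context where id = '310';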

Now that the stale information has been cleared from the database, it is time to check that the VXLAN configuration completes against the correct DVS.

Given this is a lab environment with no working NSX, the quickest way to resolve this would have been a redeploy of the NSX Manager. But due to my mentality of troubleshoot, don't restore, I was adamant about figuring this one out. Shout out to Dmitri Kalintsev and Nick Bradford (bad conscience, good conscience) who were sitting on my shoulders as we hacked through the database!

Learning by doing: Adding routes to Neutron

This post outlines how to add static routes to a Neutron router, with the goal of allowing the jump host to access VMs and networks advertised behind the SRX. In my lab environment I have some server infrastructure and jump hosts on the 192.168.100.0/24 network. Because Neutron routing is very plain, I could not dynamically peer the SRX with the Neutron gateway.

First, let's list the routers in my project.

admin@mgt-lnxjump:~$ neutron router-list
+--------------------------------------+-----------------+-----------------------------------------------------------------------------+
| id                                   | name            | external_gateway_info                                                       |
+--------------------------------------+-----------------+-----------------------------------------------------------------------------+
| 27d89917-bb77-46c3-95d5-250a259ba304 | public_router   | {"network_id": "083ad060-d6dd-4e49-84e1-c8a2259982ff", "enable_snat": true} |
| 60aefbeb-d2f2-4daf-91b2-6f59391bfee5 | external_router | {"network_id": "083ad060-d6dd-4e49-84e1-c8a2259982ff", "enable_snat": true} |
| a41a761d-9ee1-449d-80be-3ea0f599c4f9 | isolated_router | {"network_id": "083ad060-d6dd-4e49-84e1-c8a2259982ff", "enable_snat": true} |
+--------------------------------------+-----------------+-----------------------------------------------------------------------------+

The router I want to use is the isolated_router. The ID is a41a761d-9ee1-449d-80be-3ea0f599c4f9.

The attached image below shows the rough network environment.

[Image: srx-ecmp-nsx]

The three networks attached to the Distributed Logical Router are unknown beyond the edge of the SRX. WIN-MGT on the 192.168.100.0/24 network has no idea they exist; it can only see the interface of the SRX on the 192.168.110.0/24 network. We need to teach the Neutron router that routes between these two networks about the subnets behind the SRX: the 172.16.200.0/26 range, 172.16.201.0/24, and 172.16.83.0/24.

This can be done by updating the Neutron router.

admin@mgt-lnxjump:~$ neutron router-update a41a761d-9ee1-449d-80be-3ea0f599c4f9 --routes type=dict list=true \
      destination=172.16.83.0/24,nexthop=192.168.110.200 \
      destination=172.16.200.0/28,nexthop=192.168.110.200 \
      destination=172.16.200.16/28,nexthop=192.168.110.200 \
      destination=172.16.200.32/28,nexthop=192.168.110.200 \
      destination=172.16.201.0/24,nexthop=192.168.110.200
Updated router: a41a761d-9ee1-449d-80be-3ea0f599c4f9

The result when we look at the Neutron router again is much better.

admin@mgt-lnxjump:~$ neutron router-show a41a761d-9ee1-449d-80be-3ea0f599c4f9
+-----------------------+-----------------------------------------------------------------------------+
| Field                 | Value                                                                       |
+-----------------------+-----------------------------------------------------------------------------+
| admin_state_up        | True                                                                        |
| distributed           | False                                                                       |
| external_gateway_info | {"network_id": "083ad060-d6dd-4e49-84e1-c8a2259982ff", "enable_snat": true} |
| id                    | a41a761d-9ee1-449d-80be-3ea0f599c4f9                                        |
| name                  | isolated_router                                                             |
| routes                | {"destination": "172.16.200.0/28", "nexthop": "192.168.110.200"}            |
|                       | {"destination": "172.16.200.16/28", "nexthop": "192.168.110.200"}            |
|                       | {"destination": "172.16.200.32/28", "nexthop": "192.168.110.200"}            |
|                       | {"destination": "172.16.201.0/24", "nexthop": "192.168.110.200"}            |
|                       | {"destination": "172.16.83.0/24", "nexthop": "192.168.110.200"}             |
| status                | ACTIVE                                                                      |
| tenant_id             | c3485cfe92be4f47852db87ca06b4383                                            |
+-----------------------+-----------------------------------------------------------------------------+

As you can see there is now a routes field listing the static routes I have programmed into my Neutron router, and I have connectivity from the 192.168.100.0/24 network into the networks advertised off the DLR. Between the SRX and the DLR is an ECMP fabric, and the mtr trace below from the jump host shows the full path.

mgt-lnxjump (0.0.0.0)                   Tue Jul 28 00:32:10 2015
Keys:  Help   Display mode   Restart statistics   Order of fields
   quit                 Packets               Pings
 Host                 Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.100.1      0.0%   173    0.5   0.3   0.2   4.7   0.3
 2. 192.168.110.200    0.0%   173    6.6   8.0   1.2  11.6   2.3
 3. 172.16.83.3        0.0%   173    3.9   4.0   1.1  23.4   2.3
 4. ???
 5. 172.16.200.17      0.0%   172    7.9   8.7   5.9  22.9   2.1

End-to-end connectivity. At hop three we can see 172.16.83.3, which is E3, the edge currently passing traffic. If that edge drops or is turned off, this hop will be updated to 172.16.83.1, .2, or .4. ECMP is great!

Gotcha: Neutron does not append additional static routes each time you execute the command; it replaces the whole list. Make sure you re-specify every existing route or you may have some connectivity issues!
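
In practice that means adding one more network later requires repeating every existing destination alongside the new one. A sketch of what that would look like, where 172.16.202.0/24 is a purely hypothetical new subnet (and, if I recall the client correctly, --no-routes can be used to clear the list entirely):

# Re-specify all existing routes plus the new (hypothetical) 172.16.202.0/24:
admin@mgt-lnxjump:~$ neutron router-update a41a761d-9ee1-449d-80be-3ea0f599c4f9 --routes type=dict list=true \
      destination=172.16.83.0/24,nexthop=192.168.110.200 \
      destination=172.16.200.0/28,nexthop=192.168.110.200 \
      destination=172.16.200.16/28,nexthop=192.168.110.200 \
      destination=172.16.200.32/28,nexthop=192.168.110.200 \
      destination=172.16.201.0/24,nexthop=192.168.110.200 \
      destination=172.16.202.0/24,nexthop=192.168.110.200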

The alternative is to assign host routes on the subnet's DHCP scope, which is pretty easy. A host route is a DHCP option (121, classless static routes) passed to an instance at boot that installs a set of pre-defined static routes. That would have done the job, but in my case my instance had already spawned, and the other machine accessing this environment was not a Nova instance at all, so it was never in scope for an address (or DHCP options) from Neutron's DHCP service.
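
For completeness, a sketch of that alternative. The subnet ID is a placeholder and the syntax mirrors the --routes example above, so treat it as an assumption rather than a tested command:

# Push a host route to instances on a given subnet via DHCP (placeholder subnet ID):
admin@mgt-lnxjump:~$ neutron subnet-update <subnet-id> --host-routes type=dict list=true \
      destination=172.16.200.0/26,nexthop=192.168.110.200

Instances would then pick the route up the next time they request or renew a DHCP lease.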

There you are: connectivity to my remote network. OpenStack is pretty powerful!