NSX Traceflow in VMC on AWS for Self-Service Traffic Troubleshooting

In a recent post, I talked about NSX Manager Standalone UI access, which was released in VMware Cloud on AWS version 1.16.

This capability now gives customers access to a very useful feature called Traceflow, which many NSX customers are familiar with and which lets them troubleshoot connectivity issues in their SDDC.

What is Traceflow and how does it work?

VMware Cloud on AWS customers can leverage Traceflow to inspect the path of a packet from any source to any destination virtual machine running in the SDDC. In addition, Traceflow provides visibility into external communication over VMware Transit Connect or the Internet.

Traceflow allows you to inject a packet into the network and monitor its flow across the network. Following this packet lets you check your network path and identify issues such as bottlenecks or disruptions.

Traceflow observes the marked packet at each point as it traverses the overlay network, until it reaches a destination guest VM or an Edge uplink. Note that the injected marked packet is never actually delivered to the destination guest VM.
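To make the idea concrete, here is a toy Python model of that behavior. It is only a sketch for illustration and not how NSX implements Traceflow: the `Observation` class, the `trace` function, and the hop names are all hypothetical. It shows a marked packet being observed at every hop until the trace ends or the packet is dropped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    hop: str   # component that reported seeing the marked packet
    kind: str  # "INJECTED", "FORWARDED", "DELIVERED" or "DROPPED"


def trace(path: List[str], drop_at: Optional[str] = None) -> List[Observation]:
    """Walk a marked packet along `path`, recording one observation per hop."""
    observations = []
    last = len(path) - 1
    for index, hop in enumerate(path):
        if hop == drop_at:
            # The trace stops at the component that dropped the packet.
            observations.append(Observation(hop, "DROPPED"))
            break
        if index == 0:
            kind = "INJECTED"
        elif index == last:
            # The trace ends here; in this toy model nothing is handed to a guest OS.
            kind = "DELIVERED"
        else:
            kind = "FORWARDED"
        observations.append(Observation(hop, kind))
    return observations


# Hypothetical hop names, for illustration only.
path = ["source vNIC", "Compute Gateway", "Edge uplink"]
for obs in trace(path):
    print(f"{obs.kind:<10} {obs.hop}")
```

Calling `trace(path, drop_at="Compute Gateway")` stops the walk at that hop, which is similar in spirit to what the firewall scenario at the end of this post shows.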

Let’s see how it can help you gain visibility into, and troubleshoot, network connectivity in a VMC on AWS SDDC.

Troubleshooting Connectivity between an SDDC and a native VPC over a vTGW/TGW

First let’s have a look at the diagram of this lab.

In my lab, I have deployed two SDDCs, SDDC1 and SDDC2, in two different regions and have attached them together within an SDDC Group. As they are in two different regions, two Transit Connect instances (vTGWs) are required. I have two VMs deployed in SDDC1: Deb10-App01 (172.18.12.100) and Deb10-Web001 (172.18.11.100).

I have also deployed a native VPC (CIDR 172.20.2.0/24) attached to the SDDC Group through a TGW peered to the vTGW. I then deployed two EC2 instances in the attached VPC with IPs 172.20.2.148 and 172.20.2.185.

In this example, the traffic I want to gain visibility into will flow over the vTGW (VMware Managed Transit Gateway) and the native Transit Gateway (TGW) that is peered to it.

Peering a vTGW to a native AWS Transit Gateway is a new capability we recently introduced. The two can be peered across different regions as well as within the same region. If you want to know more about how to set up this architecture, have a look at my post where I describe the whole process.

Once all connectivity was established, I tested ping connectivity from the VM Deb10-App01 (172.18.12.100), running on a Compute segment in SDDC1, to the EC2 instance (172.20.2.148) running in the native VPC (172.20.2.0/24).
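Before tracing, it can help to double-check that the source and destination addresses really sit in the networks you expect. The sketch below uses Python's standard `ipaddress` module with the addresses from this lab; the 172.18.12.0/24 segment is the one mentioned later in this post, and the `locate` helper and the network labels are hypothetical names chosen for illustration.

```python
import ipaddress

# Networks from the lab described in this post.
networks = {
    "SDDC1 compute segment": ipaddress.ip_network("172.18.12.0/24"),
    "native VPC": ipaddress.ip_network("172.20.2.0/24"),
}


def locate(address: str):
    """Return the label of the network containing `address`, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in networks.items():
        if ip in network:
            return label
    return None


print(locate("172.18.12.100"))  # source VM Deb10-App01
print(locate("172.20.2.148"))   # destination EC2 instance

# Overlapping ranges on both sides would make routing ambiguous.
segment, vpc = networks.values()
print("overlap:", segment.overlaps(vpc))
```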

Let’s launch Traceflow from the NSX Manager UI. After connecting to the interface, the tool is accessible under the Plan & Troubleshoot menu.

Select Source machine in SDDC

To gain visibility into the traffic between the two machines, I first selected the VM in the SDDC in the left-hand panel, which is where you define the source machine.

Select Destination Machine in the native VPC

To select the destination EC2 instance running in the native VPC, I had to select IP – Mac/ Layer 3 instead of Virtual Machine, because the instance is not part of the SDDC inventory.

Displaying and Analyzing the Results

The traceflow is ready to be started!

The analysis starts immediately after clicking the Trace button.

After a few seconds, the results are displayed. The NSX interface graphically displays the trace route based on the parameters I set (IP address type, traffic type, source, and destination). This display page also enables you to edit the parameters, retrace the traceflow, or create a new one.

Traffic flow diagram with the hops

The screen is split into two sections.

The first section, at the top, shows the diagram with the multiple hops crossed by the traffic. Here we can see that the packet first flowed over the CGW, then reached the Intranet Uplink of the Edge, hit the vTGW (Transit Connect), and finally crossed the native TGW.

The destination MAC address that was collected is shown at the top, next to the Traceflow title.

The second section details every step taken by the packet, with the associated timestamps. The first column shows the number of physical hops.

The final step shows that the packet was correctly delivered to the TGW.

We can confirm that the Distributed Firewall (DFW) rules were enforced and correctly configured:

To confirm which Distributed Firewall rule was enforced, you can look up the corresponding rule in the console by searching for its rule ID:

The same applies to the Edge Firewall for north-south traffic.

Again, I checked the Compute Gateway Firewall rule to confirm that the right one was picked and that it was configured correctly.

Let’s now do a test with a Route Based VPN to see the difference.

Troubleshooting Connectivity between an SDDC and a TGW via a RB VPN

Now, instead of using the vTGW-to-TGW peering, I have established a Route Based VPN directly to the native Transit Gateway so that traffic does not flow over the vTGW.

I enabled the RB VPN from the Console:

I enabled only the first one.

After a few minutes the BGP session is established.

The VPN tunnel in AWS shows that 8 routes have been learned over the BGP session.

I just need to remove the static route in the native TGW route table to avoid asymmetric traffic.

The 172.18.12.0/24 segment, where my virtual machine runs, is now learned through the BGP session.
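Why does the static route have to go? AWS documents that a transit gateway picks the most specific matching route and, for identical prefixes, prefers a static route over a propagated one. If the old static route stayed in place, return traffic would keep going back over the peering while outbound traffic arrives over the VPN. The sketch below models just those two rules; the `Route` class, the `select_route` function, the target names, and the assumption that the static route covered exactly 172.18.12.0/24 are mine, for illustration only.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    cidr: str
    target: str  # hypothetical attachment name
    origin: str  # "static" or "propagated"


def select_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Longest prefix wins; on a tie, a static route beats a propagated one."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.cidr)]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (ipaddress.ip_network(r.cidr).prefixlen, r.origin == "static"),
    )


static_route = Route("172.18.12.0/24", "vtgw-peering", "static")  # assumed prefix
bgp_route = Route("172.18.12.0/24", "rb-vpn", "propagated")

# With the static route still present, return traffic uses the peering.
print(select_route([static_route, bgp_route], "172.18.12.100").target)

# Once it is removed, the BGP-learned route over the VPN takes over.
print(select_route([bgp_route], "172.18.12.100").target)
```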

Let’s start the analysis again by clicking the Retrace button.

Click on Retrace button to relaunch the analysis on the same Source and Destination

Click Proceed to start the new request.

The result of the new Traceflow request is displayed.

This time the packet used the Internet Uplink and the Internet Gateway of the Shadow VPC managed by VMware where the SDDC is deployed. The observations show that packets were successfully delivered to both the NSX-Edge-0 through IPsec and to the Internet Gateway (igw).

Troubleshooting Firewall rules

The last thing you can test with Traceflow is how to troubleshoot connectivity when a firewall rule is blocking a packet.

For this scenario, I have changed the Compute Gateway Firewall rule to drop the packets.

I have started the request again and the result is now showing a red exclamation mark.

The reason for the dropped packet is a firewall rule

The details confirm that the packet was dropped by the firewall rule and display the ID of that rule.
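The rule ID is useful because gateway firewall rules are evaluated from top to bottom and the first matching rule decides the action. Here is a minimal Python sketch of that first-match logic, assuming rules that match only on source and destination CIDR. The `Rule` class, the `evaluate` function, the rule IDs, and the default-drop fallback are hypothetical and only meant to illustrate why knowing the ID points you straight at the rule to fix.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    rule_id: int       # hypothetical ID, as reported in the trace details
    source: str        # source CIDR
    destination: str   # destination CIDR
    action: str        # "ALLOW" or "DROP"


def evaluate(rules: List[Rule], src: str, dst: str) -> Tuple[str, Optional[int]]:
    """Return (action, rule_id) of the first rule matching src and dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in rules:
        if (src_ip in ipaddress.ip_network(rule.source)
                and dst_ip in ipaddress.ip_network(rule.destination)):
            return rule.action, rule.rule_id
    # Assumption for this sketch: traffic matching no rule is dropped.
    return "DROP", None


# Hypothetical rule set: the first rule was changed from ALLOW to DROP.
rules = [
    Rule(1015, "172.18.12.0/24", "172.20.2.0/24", "DROP"),
    Rule(1016, "172.18.0.0/16", "0.0.0.0/0", "ALLOW"),
]

action, rule_id = evaluate(rules, "172.18.12.100", "172.20.2.148")
print(action, rule_id)
```

Note that the broader ALLOW rule below it never gets a chance to match, which is exactly the kind of ordering problem the rule ID in the trace helps you spot.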

That concludes this blog post on how to easily troubleshoot your network connectivity by leveraging the Traceflow tool from the NSX Manager UI in VMware Cloud on AWS.

Thanks for visiting my blog! If you have any questions, please leave a comment below.