How DVR Is Implemented

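DVR means distributed routing. Put simply, an L3 agent runs on every compute node. When a tenant creates a router, every host that runs a VM on one of the subnets attached to that router gets its own local copy of the router. All east-west traffic is forwarded by this local router rather than by the network node. If a VM has a floating IP, its north-south traffic is also forwarded by the local router rather than by the network node. Only when a VM without a floating IP accesses the external network is the traffic forwarded through SNAT on the network node.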
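
The summary above can be condensed into a small decision function. This is only an illustrative Python sketch: the function name, its arguments and the returned labels are made up for this note and are not Neutron terms, and 203.0.113.10 is a placeholder for some external address.

```python
from ipaddress import ip_address, ip_network


def forwarding_point(dst_ip, router_subnets, has_floating_ip):
    """Return where a VM's packet is routed, following the summary above.

    Hypothetical helper for illustration only.
    """
    dst = ip_address(dst_ip)
    if any(dst in ip_network(subnet) for subnet in router_subnets):
        # East-west: handled by the router copy on the VM's own host.
        return "local router"
    if has_floating_ip:
        # North-south with a floating IP: also handled on the VM's host.
        return "local router"
    # North-south without a floating IP: SNAT on the network node.
    return "network node"


subnets = ["10.88.0.0/16", "10.99.0.0/16"]
print(forwarding_point("10.99.0.4", subnets, has_floating_ip=False))
print(forwarding_point("203.0.113.10", subnets, has_floating_ip=True))
print(forwarding_point("203.0.113.10", subnets, has_floating_ip=False))
```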

The implementation details follow.

First, the path of east-west packets. These are the router namespaces on one compute node:

[root@muti-devstack-03 ~]# ip netns list
fip-6cb429d2-640f-4ae8-9f0e-6509866b8600
qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5

When a VM wants to reach a host in another subnet, it sends the packet to the default gateway of its own subnet, which here is 10.88.0.1 on the qr port:

[root@muti-devstack-03 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: rfp-a4c2c84a-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 72:30:f8:91:e4:28 brd ff:ff:ff:ff:ff:ff
    inet 169.254.31.28/31 scope global rfp-a4c2c84a-1
       valid_lft forever preferred_lft forever
    inet 10.0.2.23/32 brd 10.0.2.23 scope global rfp-a4c2c84a-1
       valid_lft forever preferred_lft forever
    inet6 fe80::7030:f8ff:fe91:e428/64 scope link 
       valid_lft forever preferred_lft forever
83: qr-bf86b1b9-c0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:14:cf:4a brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.1/16 brd 10.88.255.255 scope global qr-bf86b1b9-c0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe14:cf4a/64 scope link 
       valid_lft forever preferred_lft forever
85: qr-27b1a623-dc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:c8:50:a7 brd ff:ff:ff:ff:ff:ff
    inet 10.99.0.1/16 brd 10.99.255.255 scope global qr-27b1a623-dc
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec8:50a7/64 scope link 
       valid_lft forever preferred_lft forever

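With DVR, however, an identical router exists on every qualifying host, and every copy has the same qr ports with the same IP and MAC addresses. The network node, for example, has the same qr ports: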

[root@muti-devstack-01 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
60: qr-bf86b1b9-c0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:14:cf:4a brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.1/16 brd 10.88.255.255 scope global qr-bf86b1b9-c0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe14:cf4a/64 scope link 
       valid_lft forever preferred_lft forever
64: qr-27b1a623-dc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:c8:50:a7 brd ff:ff:ff:ff:ff:ff
    inet 10.99.0.1/16 brd 10.99.255.255 scope global qr-27b1a623-dc
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec8:50a7/64 scope link 
       valid_lft forever preferred_lft forever

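When a VM sends an ARP request for the MAC address of its default gateway, the broadcast would cause problems if the qr ports on two hosts both received it. DVR therefore installs the following rule on br-tun of the compute node: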

[root@muti-devstack-03 ~]# ovs-ofctl dump-flows br-tun | grep 10.88.0.1
 cookie=0x0, duration=15143.040s, table=1, n_packets=3, n_bytes=126, idle_age=15125, priority=3,arp,dl_vlan=3,arp_tpa=10.88.0.1 actions=drop

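This rule drops ARP packets whose target is the default gateway before they leave through br-tun, which ensures that such broadcasts reach only the qr port on the local host.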
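
The effect of that single flow can be modelled in a few lines of Python. This is a simplified sketch: the dictionary layout and the function name are assumptions made for illustration, and the real br-tun pipeline contains many more tables and flows than this one rule.

```python
# Match fields copied from the br-tun flow shown above.
GATEWAY_ARP_DROP = {"eth_type": "arp", "dl_vlan": 3, "arp_tpa": "10.88.0.1"}


def leaves_host(packet, drop_rule=GATEWAY_ARP_DROP):
    """Return False when the packet matches the drop rule, True otherwise.

    Hypothetical helper: only this one rule is modelled.
    """
    matches = all(packet.get(field) == value for field, value in drop_rule.items())
    return not matches


arp_for_gateway = {"eth_type": "arp", "dl_vlan": 3, "arp_tpa": "10.88.0.1"}
arp_for_vm = {"eth_type": "arp", "dl_vlan": 3, "arp_tpa": "10.88.0.6"}
print(leaves_host(arp_for_gateway))  # dropped, stays on this host
print(leaves_host(arp_for_vm))       # not matched by this rule
```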

These are the routes in qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5:

[root@muti-devstack-03 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip route show
10.88.0.0/16 dev qr-bf86b1b9-c0  proto kernel  scope link  src 10.88.0.1 
10.99.0.0/16 dev qr-27b1a623-dc  proto kernel  scope link  src 10.99.0.1 
169.254.31.28/31 dev rfp-a4c2c84a-1  proto kernel  scope link  src 169.254.31.28 

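Note that there is no default route here, and that there is an extra entry for 169.254.31.28/31, which is covered further down. In any case, a packet headed for the other subnet, 10.99.0.0/16, leaves through qr-27b1a623-dc, exactly as in an ordinary router.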

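That covers east-west traffic, which behaves as in an ordinary router. Now for traffic to the external network. With an ordinary router, a VM without a floating IP goes out through SNAT on the router's external gateway. With DVR, a VM without a floating IP likewise goes out through SNAT on the external gateway, and this still happens on the network node rather than on the compute node (the aim is to avoid spending public IP addresses unnecessarily).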

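Consider the north-south path without a floating IP. A VM with IP 10.99.0.4 is created in 10.99.0.0/16, and it lands on muti-devstack-01, the network node. The router there has these routes: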

[root@muti-devstack-01 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip route show
10.88.0.0/16 dev qr-bf86b1b9-c0  proto kernel  scope link  src 10.88.0.1 
10.99.0.0/16 dev qr-27b1a623-dc  proto kernel  scope link  src 10.99.0.1 

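It has routes only to the internal networks and none to the public network. So how does public-bound traffic get out? DVR provides the extra routes through custom route tables, which are selected by policy routing rules. The rules are: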

[root@muti-devstack-01 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip rule show
0:  from all lookup local 
32766:  from all lookup main 
32767:  from all lookup default 
173539329:  from 10.88.0.1/16 lookup 173539329 
174260225:  from 10.99.0.1/16 lookup 174260225 
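
A side note on the large numbers in this listing: in this output they coincide with the 32-bit integer value of the gateway address of each subnet. The Python check below only confirms the arithmetic for the two values shown here. It is an observation about this listing, not a claim about how every Neutron release derives the table number, and the helper name is made up.

```python
from ipaddress import IPv4Address


def address_as_int(ip):
    """Return the IPv4 address as an unsigned 32-bit integer."""
    return int(IPv4Address(ip))


print(address_as_int("10.88.0.1"))  # 173539329
print(address_as_int("10.99.0.1"))  # 174260225
```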

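local, main and default are the standard tables; inspect their contents if you are curious. None of the three has a route that matches our packet. Rules are evaluated in ascending numeric order, so a smaller number means a higher priority, and each rule selects packets by the condition after from. That is the point of policy rules: an ordinary route entry cannot match on the source address. Here the packet matches 174260225: from 10.99.0.1/16 lookup 174260225, and that table contains: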

[root@muti-devstack-01 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip route show table 174260225
default via 10.99.0.3 dev qr-27b1a623-dc 
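
The lookup just described can be replayed with a small Python model. It is a sketch under stated assumptions: the data structures and the function name are invented for illustration, the contents of the local table and of table 173539329 are not shown in this article and are left empty, and 203.0.113.10 is a placeholder for an external destination.

```python
from ipaddress import ip_address, ip_network

# (priority, source selector, table) as printed by `ip rule show`;
# "from all" is written as 0.0.0.0/0.
RULES = [
    (0, "0.0.0.0/0", "local"),
    (32766, "0.0.0.0/0", "main"),
    (32767, "0.0.0.0/0", "default"),
    (173539329, "10.88.0.1/16", "173539329"),
    (174260225, "10.99.0.1/16", "174260225"),
]

# table -> list of (destination prefix, next hop or None, device)
TABLES = {
    "local": [],  # not shown in the article, left empty here
    "main": [
        ("10.88.0.0/16", None, "qr-bf86b1b9-c0"),
        ("10.99.0.0/16", None, "qr-27b1a623-dc"),
    ],
    "default": [],
    "174260225": [("0.0.0.0/0", "10.99.0.3", "qr-27b1a623-dc")],
}


def lookup(src, dst):
    """Walk the rules in ascending priority and return the first route hit."""
    src, dst = ip_address(src), ip_address(dst)
    for _priority, selector, table in sorted(RULES):
        if src not in ip_network(selector, strict=False):
            continue
        hits = [r for r in TABLES.get(table, []) if dst in ip_network(r[0])]
        if hits:
            # Inside one table the longest prefix wins.
            prefix, via, dev = max(hits, key=lambda r: ip_network(r[0]).prefixlen)
            return table, via, dev
    return None


print(lookup("10.99.0.4", "10.88.0.6"))     # answered by table main
print(lookup("10.99.0.4", "203.0.113.10"))  # falls through to table 174260225
```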

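Table 174260225 sends the packet out through qr-27b1a623-dc to 10.99.0.3. An ordinary router has only one port in each subnet, the qr port, which serves as the default gateway. A DVR router plugs two ports into each subnet. One is the qr port, which works as in an ordinary router and is the port used by the east-west traffic described earlier. The other is a port used for SNAT: packets sent to it have their source address rewritten by an iptables SNAT rule to the router's public address and are then forwarded out. This port exists on only one host in the cluster, the one whose L3 agent configuration sets agent_mode to dvr_snat (ordinary hosts are configured with dvr). The port is named sg-XXX and lives in a dedicated snat namespace. In our example: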

[root@muti-devstack-01 ~]# ip netns exec snat-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
61: sg-3f75facf-c5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:e8:56:c6 brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.5/16 brd 10.88.255.255 scope global sg-3f75facf-c5
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee8:56c6/64 scope link 
       valid_lft forever preferred_lft forever
62: qg-4cbbb8c4-20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:a8:96:fd brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.22/24 brd 10.0.2.255 scope global qg-4cbbb8c4-20
       valid_lft forever preferred_lft forever
    inet6 2001:db8::4/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea8:96fd/64 scope link 
       valid_lft forever preferred_lft forever
65: sg-be13ad9d-1d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:c7:c5:12 brd ff:ff:ff:ff:ff:ff
    inet 10.99.0.3/16 brd 10.99.255.255 scope global sg-be13ad9d-1d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec7:c512/64 scope link 

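The XXX in snat-XXX is the same as the XXX in qrouter-XXX, namely the router's UUID. Packets from a VM without a floating IP are forwarded by the separate route table to sg-be13ad9d-1d. If you think of sg-be13ad9d-1d as the old qr port, the relationship between sg and qg here is the same as the one between qr and qg in an ordinary router. The namespace has these NAT rules: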

[root@muti-devstack-01 ~]# ip netns exec snat-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 iptables-save -t nat
# Generated by iptables-save v1.4.21 on Thu May 14 22:51:37 2015
*nat
:PREROUTING ACCEPT [2:632]
:INPUT ACCEPT [2:632]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:neutron-l3-agent-OUTPUT - [0:0]
:neutron-l3-agent-POSTROUTING - [0:0]
:neutron-l3-agent-PREROUTING - [0:0]
:neutron-l3-agent-float-snat - [0:0]
:neutron-l3-agent-snat - [0:0]
:neutron-postrouting-bottom - [0:0]
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-POSTROUTING ! -i qg-4cbbb8c4-20 ! -o qg-4cbbb8c4-20 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-snat -o qg-4cbbb8c4-20 -j SNAT --to-source 10.0.2.22
-A neutron-l3-agent-snat -m mark ! --mark 0x2 -m conntrack --ctstate DNAT -j SNAT --to-source 10.0.2.22
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat
COMMIT
# Completed on Thu May 14 22:51:37 2015

The rule -A neutron-l3-agent-snat -o qg-4cbbb8c4-20 -j SNAT --to-source 10.0.2.22 is the one that performs the SNAT.
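
To see which rule fires for which packet, the POSTROUTING path in the dump above can be read as the following Python function. This is a simplified model, not iptables itself: the function name and parameters are made up, connection tracking is reduced to a single dnat flag, and reply traffic is not modelled.

```python
QG = "qg-4cbbb8c4-20"
SNAT_ADDR = "10.0.2.22"


def postrouting_source(src, in_iface, out_iface, dnat=False, mark=0x0):
    """Return the source address after the nat POSTROUTING rules shown above."""
    # neutron-l3-agent-POSTROUTING: traffic that neither enters nor leaves
    # through qg and was not DNATed is accepted unchanged.
    if in_iface != QG and out_iface != QG and not dnat:
        return src
    # neutron-l3-agent-snat, first rule: anything leaving through qg.
    if out_iface == QG:
        return SNAT_ADDR
    # neutron-l3-agent-snat, second rule: DNATed traffic without mark 0x2.
    if mark != 0x2 and dnat:
        return SNAT_ADDR
    return src


print(postrouting_source("10.99.0.4", "sg-be13ad9d-1d", QG))
print(postrouting_source("10.99.0.4", "sg-be13ad9d-1d", "sg-3f75facf-c5"))
```

The first call is the case discussed in the text: a packet from 10.99.0.4 that arrives on the sg port and leaves through qg gets the router's public address as its source.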

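Finally, packets from a VM that has a floating IP. For these, DVR sends the traffic out directly from the compute node, which greatly reduces the load on the network node. The VM 10.88.0.6 has the floating IP 10.0.2.23. First, the routes of its router: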

[root@muti-devstack-03 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip route show
10.88.0.0/16 dev qr-bf86b1b9-c0  proto kernel  scope link  src 10.88.0.1 
10.99.0.0/16 dev qr-27b1a623-dc  proto kernel  scope link  src 10.99.0.1 
169.254.31.28/31 dev rfp-a4c2c84a-1  proto kernel  scope link  src 169.254.31.28 

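Again there is no route to anything outside the internal networks, but there is the 169.254.31.28/31 entry. Following the same analysis as before, look at the policy rules: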

[root@muti-devstack-03 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip rule show
0:  from all lookup local 
32766:  from all lookup main 
32767:  from all lookup default 
32768:  from 10.88.0.6 lookup 16 
173539329:  from 10.88.0.1/16 lookup 173539329 
174260225:  from 10.99.0.1/16 lookup 174260225 

For a VM that has a floating IP, the router on its compute node carries an explicit rule that points to a dedicated route table, here 32768: from 10.88.0.6 lookup 16. The table contains:

[root@muti-devstack-03 ~]# ip netns exec qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 ip route show table 16
default via 169.254.31.29 dev rfp-a4c2c84a-1 

The traffic goes out through rfp-a4c2c84a-1 to 169.254.31.29. A word on rfp: DVR creates a floating IP namespace that is connected to the qrouter namespace through a veth pair. For example:

[root@muti-devstack-03 ~]# ip netns list
fip-6cb429d2-640f-4ae8-9f0e-6509866b8600
qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5
[root@muti-devstack-03 ~]# ip netns exec fip-6cb429d2-640f-4ae8-9f0e-6509866b8600 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: fpr-a4c2c84a-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 3e:e0:01:8f:f6:57 brd ff:ff:ff:ff:ff:ff
    inet 169.254.31.29/31 scope global fpr-a4c2c84a-1
       valid_lft forever preferred_lft forever
    inet6 fe80::3ce0:1ff:fe8f:f657/64 scope link 
       valid_lft forever preferred_lft forever
84: fg-71947aea-a6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:e3:b8:01 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.24/24 brd 10.0.2.255 scope global fg-71947aea-a6
       valid_lft forever preferred_lft forever
    inet6 2001:db8::6/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee3:b801/64 scope link 
       valid_lft forever preferred_lft forever
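
The addresses on rfp (169.254.31.28/31) and fpr (169.254.31.29/31) form a two-address point-to-point link. A quick Python check with the standard ipaddress module, using only the addresses shown in the two listings (the helper name is made up for illustration):

```python
from ipaddress import ip_interface

rfp = ip_interface("169.254.31.28/31")  # end inside the qrouter namespace
fpr = ip_interface("169.254.31.29/31")  # end inside the fip namespace


def same_link(a, b):
    """Return True when both interface addresses sit in the same network."""
    return a.network == b.network


# A /31 contains exactly two addresses, one for each end of the link.
link_addresses = [str(addr) for addr in rfp.network]
print(same_link(rfp, fpr), link_addresses)
```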

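fpr-a4c2c84a-1 here and rfp-a4c2c84a-1 in qrouter are the two ends of one veth pair, so by the routing rules, packets bound for the external network enter this namespace. The namespace follows the familiar qr-qg and sg-qg pattern, except that the qg port is called fg here. It is attached to br-ex on the compute node: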

[root@muti-devstack-03 ~]# ovs-vsctl show 
21dd833b-833c-433f-9aac-1982043b7a09
    Bridge br-ex
        Port "fg-71947aea-a6"
            Interface "fg-71947aea-a6"
                type: internal
        Port br-ex
            Interface br-ex
                type: internal
        ......

The nat table in the qrouter namespace shows the translation between the internal address and the external address:

[root@muti-devstack-03 ~]# ip netns exec  qrouter-a4c2c84a-1baa-4868-8b3f-0424bb8856b5 iptables-save -t nat
# Generated by iptables-save v1.4.21 on Thu May 14 23:08:26 2015
*nat
:PREROUTING ACCEPT [3:716]
:INPUT ACCEPT [3:716]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:neutron-l3-agent-OUTPUT - [0:0]
:neutron-l3-agent-POSTROUTING - [0:0]
:neutron-l3-agent-PREROUTING - [0:0]
:neutron-l3-agent-float-snat - [0:0]
:neutron-l3-agent-snat - [0:0]
:neutron-postrouting-bottom - [0:0]
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-OUTPUT -d 10.0.2.23/32 -j DNAT --to-destination 10.88.0.6
-A neutron-l3-agent-POSTROUTING ! -i rfp-a4c2c84a-1 ! -o rfp-a4c2c84a-1 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 10.0.2.23/32 -j DNAT --to-destination 10.88.0.6
-A neutron-l3-agent-float-snat -s 10.88.0.6/32 -j SNAT --to-source 10.0.2.23
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat
COMMIT
# Completed on Thu May 14 23:08:26 2015
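
The floating IP rules in this dump amount to a one-to-one address mapping: DNAT on the way in, SNAT on the way out. A minimal Python sketch of that mapping follows. The dictionary and function names are invented for illustration, and connection tracking and the metadata REDIRECT rule are left out.

```python
# floating IP -> fixed IP, taken from the DNAT and SNAT rules above
FLOATING_TO_FIXED = {"10.0.2.23": "10.88.0.6"}
FIXED_TO_FLOATING = {fixed: fip for fip, fixed in FLOATING_TO_FIXED.items()}


def dnat_inbound(dst):
    """neutron-l3-agent-PREROUTING: rewrite a floating destination to the fixed IP."""
    return FLOATING_TO_FIXED.get(dst, dst)


def snat_outbound(src):
    """neutron-l3-agent-float-snat: rewrite a fixed source to its floating IP."""
    return FIXED_TO_FLOATING.get(src, src)


print(dnat_inbound("10.0.2.23"))   # 10.88.0.6
print(snat_outbound("10.88.0.6"))  # 10.0.2.23
```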

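When a packet follows the routing rules out through rfp-a4c2c84a-1, the ACCEPT rule in neutron-l3-agent-POSTROUTING does not match it, because that rule covers only traffic that neither enters nor leaves through rfp. The packet therefore continues to neutron-l3-agent-float-snat, where its source 10.88.0.6 is rewritten to 10.0.2.23 before it leaves.

Now a question arises. The fg port does not hold the floating IP 10.0.2.23 itself (its own address is 10.0.2.24), so how are ARP requests for that address answered with a MAC address when return packets arrive? This is done with proxy ARP. First, the feature is indeed enabled: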

[root@muti-devstack-03 ~]# ip netns exec  fip-6cb429d2-640f-4ae8-9f0e-6509866b8600 cat /proc/sys/net/ipv4/conf/fg-71947aea-a6/proxy_arp
1

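With proxy ARP enabled, the fip namespace answers an ARP request for any IP it knows how to route. Its route table contains the entry 10.0.2.23 via 169.254.31.28 dev fpr-a4c2c84a-1, so it answers for 10.0.2.23. Packets from the public network are therefore sent to fg, routed out through fpr-a4c2c84a-1 into qrouter, and finally delivered to the VM by DNAT.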
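
The proxy ARP decision can be sketched in Python as follows. This is a simplified model of the kernel behaviour, and it assumes that a reply is sent only when the route to the target leaves through a different interface than the one the request arrived on. The /32 route is the one quoted above; the 10.0.2.0/24 connected route is inferred from the 10.0.2.24/24 address on fg; the function name is made up.

```python
from ipaddress import ip_address, ip_network

# (destination prefix, outgoing device) in the fip namespace
FIP_ROUTES = [
    ("10.0.2.23/32", "fpr-a4c2c84a-1"),  # route quoted in the text
    ("10.0.2.0/24", "fg-71947aea-a6"),   # connected route, inferred
]


def answers_arp(target_ip, in_iface, proxy_arp=True):
    """Return True if the namespace would answer an ARP request by proxy."""
    if not proxy_arp:
        return False
    target = ip_address(target_ip)
    hits = [(ip_network(p), dev) for p, dev in FIP_ROUTES if target in ip_network(p)]
    if not hits:
        return False
    # Longest prefix decides which device the target is reached through.
    _net, out_dev = max(hits, key=lambda hit: hit[0].prefixlen)
    return out_dev != in_iface


print(answers_arp("10.0.2.23", "fg-71947aea-a6"))  # floating IP, routed via fpr
print(answers_arp("10.0.2.50", "fg-71947aea-a6"))  # on the fg segment itself
```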
