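Exploring How Container Networking Works with Linux Network Virtualization

In the article "Exploring How Containerization Works with Go and Linux Kernel Technologies", we showed that a container is essentially just a special process: special in that it runs in an environment isolated by Namespaces, with its resource usage controlled by Cgroups.

These two low-level technologies are enough to containerize an application. Container networking raises further questions, though: how can multiple containers communicate with each other without their network environments interfering, how can a container reach external networks, and how can external networks reach a specific container? Answering them takes a few more Linux network virtualization technologies.

Container Network Isolation: Namespace

To keep the network environments of multiple containers from interfering with each other, we can build on what we already know about Namespaces.

Among the 8 kinds of Namespace introduced earlier is the Network Namespace, which we can use to give each container its own independent view of the network.

Let's first look at the Default Network Namespace that the host lives in: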

[root@host ~]# readlink /proc/$$/ns/net
net:[4026531956]
[root@host ~]#

Taking net:[4026531956] as an example, net is the type of the Namespace and 4026531956 is its inode number.

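As the earlier article established, a container is essentially a process, so we won't belabor the point here. The steps below skip the upper layers and go straight to the essentials, using capabilities provided by Linux itself to explore and reproduce how container networking is implemented. (From here on, treat every process created in this article as a container.)

First, use the ip netns tool to create two network namespaces, netns1 and netns2: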

[root@host ~]# ip netns add netns1
[root@host ~]# ip netns add netns2
[root@host ~]# ip netns list
netns2
netns1
[root@host ~]#

On top of these two network namespaces, start one bash process in each to act as a container. First, container1:

[root@host ~]# ip netns exec netns1 /bin/bash --rcfile <(echo "PS1=\"container1> \"")
container1> readlink /proc/$$/ns/net
net:[4026532165]
container1> ip link # list network devices
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
container1> route -n # show the routing table
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
container1> iptables -L # show iptables rules
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
container1>

Likewise, container2:

[root@host ~]# ip netns exec netns2 /bin/bash --rcfile <(echo "PS1=\"container2> \"")
container2> readlink /proc/$$/ns/net
net:[4026532219]
container2> ip link # list network devices
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
container2> route -n # show the routing table
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
container2> iptables -L # show iptables rules
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
container2>

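As the output shows, thanks to Network Namespace isolation, each container (container1 and container2) has its own independent network stack, including network devices, routing table, ARP table, iptables rules, and sockets. Every container believes it is running in a network environment of its own.

Now prepare a simple Go web service and run it in the background in both container1 and container2: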

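package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// The container name is passed as the first argument and echoed in every response.
	if len(os.Args) < 2 {
		log.Fatalf("usage: %s <name>", os.Args[0])
	}
	name := os.Args[1]

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("req") // log each incoming request to stdout
		fmt.Fprintln(w, name)
	})

	fmt.Println(name, "listen :8080")
	// ListenAndServe only returns on failure, so treat any return as fatal.
	log.Fatal(http.ListenAndServe(":8080", nil))
}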

container1:

container1> go run main.go container1 > container1.log &
[1] 2866
container1> tail container1.log
container1 listen :8080
container1>

container2:

container2> go run main.go container2 > container2.log &
[1] 2955
container2> tail container2.log
container2 listen :8080
container2>

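Even though container1 and container2 run on the same host and both listen on port 8080, there is no port conflict.

Let's check that the service we just started is reachable, taking container1 as an example: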

container1> curl localhost:8080
curl: (7) Failed to connect to ::1: Network is unreachable
container1>

The request fails because we have not brought up any network device yet, not even the basic lo loopback device. Bringing it up is all that is needed:

container1> ifconfig
container1> ifup lo
container1> ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

container1> curl localhost:8080
container1
container1>

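The same applies to container2.

Container Point-to-Point Communication: Veth

The two containers currently sit in different Network Namespaces, so their network environments are isolated from each other. Neither knows the other exists, and naturally they cannot communicate over the network.

Before getting to the solution, recall how this works in the real world: when two computers need to talk to each other, we can simply connect their network ports with a cable.

Back to containers. Thinking a bit more abstractly, can a container also have a similar network port to "plug a cable" into?

It can. Linux network virtualization provides exactly such a way of emulating a hardware network card in software: Veth (Virtual Ethernet devices).

Just as a cable has two ends, Veth devices always come in pairs, which is why they are also called a veth pair. If veth1 and veth2 form a pair, a packet sent into veth1 is received on veth2, and vice versa. So once the two ends of a pair are placed in two different Network Namespaces, those namespaces can communicate as if connected by a cable.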
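The pairing behavior can be pictured with a toy model in Go. This is only an illustration of the "whatever goes in one end comes out the other" property, not how the kernel implements veth; the vethEnd type and newVethPair function are hypothetical names made up for this sketch.

```go
package main

import "fmt"

// vethEnd is a toy stand-in for one end of a veth pair: frames written to
// tx come out of the peer's rx, like the two ends of a cable.
type vethEnd struct {
	name string
	tx   chan<- string
	rx   <-chan string
}

// newVethPair cross-wires two buffered channels so that each end's tx
// is the other end's rx.
func newVethPair(nameA, nameB string) (*vethEnd, *vethEnd) {
	aToB := make(chan string, 16)
	bToA := make(chan string, 16)
	a := &vethEnd{name: nameA, tx: aToB, rx: bToA}
	b := &vethEnd{name: nameB, tx: bToA, rx: aToB}
	return a, b
}

func main() {
	veth1, veth2 := newVethPair("veth1", "veth2")

	veth1.tx <- "ping from netns1"
	fmt.Println(veth2.name, "received:", <-veth2.rx)

	veth2.tx <- "pong from netns2"
	fmt.Println(veth1.name, "received:", <-veth1.rx)
}
```

Neither end is useful on its own, which matches the fact that veth devices are always created and deleted as a pair.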

Time to try it out. First, list the network devices that already exist on the host:

[root@host ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:26:eb:d4 brd ff:ff:ff:ff:ff:ff
[root@host ~]#

Create a veth pair (consisting of the two virtual network devices veth1 and veth2):

[root@host ~]# ip link add veth1 type veth peer name veth2
[root@host ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:26:eb:d4 brd ff:ff:ff:ff:ff:ff
3: veth2@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:01:c8:fa:9e:6e brd ff:ff:ff:ff:ff:ff
4: veth1@veth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:7e:de:c6:89:ff brd ff:ff:ff:ff:ff:ff
[root@host ~]#

Move the veth1 end into the netns1 namespace and the veth2 end into the netns2 namespace. This is equivalent to connecting the two namespaces with a cable:

[root@host ~]# ip link set veth1 netns netns1
[root@host ~]# ip link set veth2 netns netns2
[root@host ~]#

Once they are connected, container1 and container2 can each see their own end of the pair:

container1> ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: veth1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:7e:de:c6:89:ff brd ff:ff:ff:ff:ff:ff link-netnsid 1
container1>

container2> ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: veth2@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:01:c8:fa:9e:6e brd ff:ff:ff:ff:ff:ff link-netnsid 0
container2>

Assign an IP address to each of the two interfaces so that both are in the same subnet, 172.17.0.0/24, then bring the interfaces up:

container1> ip addr add 172.17.0.101/24 dev veth1
container1> ip link set dev veth1 up
container1> ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 10  bytes 942 (942.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 942 (942.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.101  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 42:7e:de:c6:89:ff  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

container1>

container2> ip addr add 172.17.0.102/24 dev veth2
container2> ip link set dev veth2 up
container2> ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 10  bytes 942 (942.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 942 (942.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.102  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::6801:c8ff:fefa:9e6e  prefixlen 64  scopeid 0x20<link>
        ether 6a:01:c8:fa:9e:6e  txqueuelen 1000  (Ethernet)
        RX packets 6  bytes 516 (516.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6  bytes 516 (516.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

container2>

Test that container1 and container2 can reach each other's service:

container1> curl 172.17.0.102:8080
container2
container1>

container2> curl 172.17.0.101:8080
container1
container2>

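At this point, with Veth alone, we have a point-to-point layer-2 network topology, and the problem of point-to-point communication between containers is solved.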

[Figure: How Veth works]

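Communication Among Containers: Bridge

In the real world there are never just two computers. As a third, a fourth, and eventually countless computers join the network, we cannot possibly have enough network ports to connect every pair directly. The layer-2 switch (or bridge) was invented to solve this problem.

The same holds for containers: if 3 or more Namespaces need to join the same layer-2 network, Veth alone is no longer enough. But just as Veth provides a virtual network card, Linux also provides a virtual implementation of a bridge (switch): Bridge.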
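What a layer-2 switch does with a frame can be sketched as a small learning table. The Go code below is a simplified toy model, assuming made-up MAC labels (mac1, mac2) and the port names used later in this article; it leaves out things a real bridge handles, such as entry aging and VLANs.

```go
package main

import "fmt"

// bridge is a toy learning switch: it remembers which port each source MAC
// was last seen on and uses that table to decide where to send frames.
type bridge struct {
	ports []string          // attached port names, e.g. veth1-br
	fdb   map[string]string // forwarding database: MAC -> port
}

func newBridge(ports ...string) *bridge {
	return &bridge{ports: ports, fdb: make(map[string]string)}
}

// forward learns srcMAC on inPort and returns the ports the frame leaves on:
// the single learned port for a known dstMAC, otherwise every other port.
func (b *bridge) forward(inPort, srcMAC, dstMAC string) []string {
	b.fdb[srcMAC] = inPort
	if out, ok := b.fdb[dstMAC]; ok {
		if out == inPort {
			return nil // destination sits on the ingress port; nothing to do
		}
		return []string{out}
	}
	// Unknown destination: flood to all ports except the one it came in on.
	var flood []string
	for _, p := range b.ports {
		if p != inPort {
			flood = append(flood, p)
		}
	}
	return flood
}

func main() {
	br0 := newBridge("veth1-br", "veth2-br", "veth3-br")
	// mac2 is not known yet, so the frame is flooded to the other two ports.
	fmt.Println(br0.forward("veth1-br", "mac1", "mac2"))
	// mac1 was learned above, so the reply leaves through veth1-br only.
	fmt.Println(br0.forward("veth2-br", "mac2", "mac1"))
}
```

This is why a single bridge can replace a full mesh of cables: each device needs only one link to the bridge, and the table decides where each frame goes.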

In addition to the netns1 and netns2 namespaces, create a netns3 namespace:

[root@host ~]# ip netns add netns3
[root@host ~]# ip netns list
netns3
netns2 (id: 1)
netns1 (id: 0)
[root@host ~]#

Repeat the earlier steps to create the container3 container:

[root@host ~]# ip netns exec netns3 /bin/bash --rcfile <(echo "PS1=\"container3> \"")
container3> readlink /proc/$$/ns/net
net:[4026532277]
container3> go run main.go container3 > container3.log &
[1] 4270
container3> tail container3.log
container3 listen :8080
container3> ifup lo
container3> ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

container3> curl localhost:8080
container3
container3>

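Before networking these three containers together, first unplug the "cable" (veth1 and veth2) that connected container1 and container2. Deleting one end of a veth pair removes its peer as well, so this only needs to be done in one of the containers: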

container1> ip link delete veth1
container1> ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
container1>

Now none of the three containers knows about the others.

Back to the hands-on part: create a Bridge and bring it up:

[root@host ~]# ip link add br0 type bridge
[root@host ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:26:eb:d4 brd ff:ff:ff:ff:ff:ff
5: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 9e:ac:12:15:98:64 brd ff:ff:ff:ff:ff:ff
[root@host ~]# ip link set dev br0 up
[root@host ~]# ifconfig
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::9cac:12ff:fe15:9864  prefixlen 64  scopeid 0x20<link>
        ether 9e:ac:12:15:98:64  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6  bytes 516 (516.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.12.15  netmask 255.255.252.0  broadcast 10.0.15.255
        inet6 fe80::5054:ff:fe26:ebd4  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:26:eb:d4  txqueuelen 1000  (Ethernet)
        RX packets 114601  bytes 160971385 (153.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 18824  bytes 2035143 (1.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 15  bytes 2000 (1.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15  bytes 2000 (1.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@host ~]#

Prepare three "cables" (three veth pairs):

[root@host ~]# ip link add veth1 type veth peer name veth1-br
[root@host ~]# ip link add veth2 type veth peer name veth2-br
[root@host ~]# ip link add veth3 type veth peer name veth3-br
[root@host ~]#

Plug veth1 into netns1 and veth1-br into br0, veth2 into netns2 and veth2-br into br0, and veth3 into netns3 and veth3-br into br0 (remember to bring up the veth*-br devices):

[root@host ~]# ip link set dev veth1 netns netns1
[root@host ~]# ip link set dev veth2 netns netns2
[root@host ~]# ip link set dev veth3 netns netns3
[root@host ~]# ip link set dev veth1-br master br0
[root@host ~]# ip link set dev veth2-br master br0
[root@host ~]# ip link set dev veth3-br master br0
[root@host ~]# ip link set dev veth1-br up
[root@host ~]# ip link set dev veth2-br up
[root@host ~]# ip link set dev veth3-br up
[root@host ~]#

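In each of the three containers, assign an IP address to its interface so that all of them are in the same subnet, 172.17.0.0/24, and then bring the interface up as before: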

container1> ip addr add 172.17.0.101/24 dev veth1
container1> ip link set dev veth1 up
container1>

container2> ip addr add 172.17.0.102/24 dev veth2
container2> ip link set dev veth2 up
container2>

container3> ip addr add 172.17.0.103/24 dev veth3
container3> ip link set dev veth3 up
container3>

Test that container1, container2, and container3 can reach each other's services:

container1> curl 172.17.0.102:8080
container2
container1> curl 172.17.0.103:8080
container3
container1>

container2> curl 172.17.0.101:8080
container1
container2> curl 172.17.0.103:8080
container3
container2>

container3> curl 172.17.0.101:8080
container1
container3> curl 172.17.0.102:8080
container2
container3>

With that, by adding a Bridge on top of Veth, we have connected multiple Namespaces to the same layer-2 network, and the problem of communication among containers is solved.

[Figure: How Veth+Bridge works]

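Communication Between Containers and External Networks: route and iptables

So far, all of our experiments have stayed within a single subnet. In practice, containers more often need to communicate with the outside world.

In the real world, a layer-2 switch only handles traffic within one subnet. Traffic between different subnets has to be forwarded by a layer-3 router (or gateway).

Unlike the switch, for which Linux provides the virtual implementation Bridge, Linux provides no virtual router device. It does not need one: Linux itself can already act as a router. More precisely, in Linux a single Network Namespace can take on the role of a router.

Within a Linux Network Namespace, routing is defined quite simply: the rules in the routing tables decide which network device a packet is sent to.

Routing rules live in routing tables. The two tables we use most often are local and main, although additional tables can be configured. The local table takes priority over main: requests to the local machine (localhost) are matched directly in the local table and never reach main.

The names of all routing tables can be listed like this: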

[root@host ~]# cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep
[root@host ~]#

To view the rules in a specific routing table, use ip route list table <table name>:

[root@host ~]# ip route list table local
broadcast 10.0.12.0 dev eth0 proto kernel scope link src 10.0.12.15
local 10.0.12.15 dev eth0 proto kernel scope host src 10.0.12.15
broadcast 10.0.15.255 dev eth0 proto kernel scope link src 10.0.12.15
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
[root@host ~]# ip route list table main
default via 10.0.12.1 dev eth0
10.0.12.0/22 dev eth0 proto kernel scope link src 10.0.12.15
169.254.0.0/16 dev eth0 scope link metric 1002
[root@host ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.12.1       0.0.0.0         UG    0      0        0 eth0
10.0.12.0       0.0.0.0         255.255.252.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
[root@host ~]#

The route -n command we normally use actually shows the main routing table.

Connecting Containers and the Host

We now have three containers, container1, container2, and container3, with the IPs 172.17.0.101, 172.17.0.102, and 172.17.0.103 respectively, all in the same subnet.

Check the host's IP, which here is 10.0.12.15:

[root@host ~]# ifconfig
...
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.12.15  netmask 255.255.252.0  broadcast 10.0.15.255
        inet6 fe80::5054:ff:fe26:ebd4  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:26:eb:d4  txqueuelen 1000  (Ethernet)
        RX packets 119923  bytes 161411733 (153.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 24106  bytes 2884317 (2.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
...
[root@host ~]#

Check the routing rules of a container (taking container1 as an example):

container1> route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.17.0.0      0.0.0.0         255.255.255.0   U     0      0        0 veth1
container1>

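The container's routing table currently contains a single rule: when an IP inside the 172.17.0.0/24 subnet is accessed, for example 172.17.0.102, the packet is sent straight to the veth1 device.

By the nature of Veth, the other end of veth1, the veth1-br device, receives the packet. Because veth1-br is attached to the br0 layer-2 switch, the packet is passed on to veth2-br as well, and finally arrives at the target device veth2.

Our goal now is for the container to also be able to send packets to IPs outside the 172.17.0.0/24 subnet (such as the host IP 10.0.12.15), which means adding a routing rule to the container.

Think back to the Bridge. From the network's point of view we treated it as a layer-2 switch, so it did not need an IP address. From the host's point of view, however, br0 is also a network interface inside the Default Network Namespace. We can assign an IP directly to this interface and let it act as a layer-3 router (gateway) that takes part in the host's routing and forwarding.

On the host, assign the IP address 172.17.0.1 (also inside the subnet 172.17.0.0/24) to the br0 device: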

[root@host ~]# ip addr add local 172.17.0.1/24 dev br0
[root@host ~]# ifconfig
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::9cac:12ff:fe15:9864  prefixlen 64  scopeid 0x20<link>
        ether 16:bd:7d:ca:53:bf  txqueuelen 1000  (Ethernet)
        RX packets 27  bytes 1716 (1.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
......

The host then automatically gains a routing rule whose Destination is 172.17.0.0:

[root@host ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.12.1       0.0.0.0         UG    0      0        0 eth0
10.0.12.0       0.0.0.0         255.255.252.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.255.0   U     0      0        0 br0
[root@host ~]#

With this routing rule (packets destined for 172.17.0.0/24 are sent to the br0 device), the host can now reach the containers directly:

[root@host ~]# ping 172.17.0.101
PING 172.17.0.101 (172.17.0.101) 56(84) bytes of data.
64 bytes from 172.17.0.101: icmp_seq=1 ttl=64 time=0.025 ms
64 bytes from 172.17.0.101: icmp_seq=2 ttl=64 time=0.037 ms
64 bytes from 172.17.0.101: icmp_seq=3 ttl=64 time=0.031 ms
^C
--- 172.17.0.101 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.025/0.031/0.037/0.005 ms
[root@host ~]# curl 172.17.0.101:8080
container1
[root@host ~]#

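The host can now reach the containers. Returning to the question of the container reaching the host, we add a routing rule to the container so that, by default, all traffic (0.0.0.0, that is, everything except IPs inside the 172.17.0.0/24 subnet) is sent through the br0 gateway (172.17.0.1), which operates at layer 3: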

container1> ip route add default via 172.17.0.1
container1> route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 veth1
172.17.0.0      0.0.0.0         255.255.255.0   U     0      0        0 veth1
container1>

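From now on, when the container accesses the host IP, the packet is sent to the br0 interface in the Default Network Namespace. The host sees that the received IP packet is addressed to itself and processes it.

In other words, the container can now reach the host as well: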

container1> ping 10.0.12.15
PING 10.0.12.15 (10.0.12.15) 56(84) bytes of data.
64 bytes from 10.0.12.15: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 10.0.12.15: icmp_seq=2 ttl=64 time=0.031 ms
64 bytes from 10.0.12.15: icmp_seq=3 ttl=64 time=0.032 ms
^C
--- 10.0.12.15 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.022/0.028/0.032/0.006 ms
container1>

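Containers Accessing Other Hosts (External Networks)

Above, we configured a routing rule in the container so that packets addressed to the host IP are sent to the br0 interface in the Default Network Namespace, where the host recognizes them as its own and processes them.

By default, however, when Linux receives an IP packet that is not addressed to itself, it simply drops it. To test this, we prepared another host, host2, with the IP address 10.0.12.11, in the same subnet as the host (10.0.12.15):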

[root@host2 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.12.11  netmask 255.255.252.0  broadcast 10.0.15.255
        inet6 fe80::5054:ff:fe78:a238  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:78:a2:38  txqueuelen 1000  (Ethernet)
        RX packets 6103  bytes 2710292 (2.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4662  bytes 640093 (625.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 10  bytes 1360 (1.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 1360 (1.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@host2 ~]#

From inside container1, ping host2 and try to access the external network. As expected, neither works:

container1> ping 10.0.12.11
PING 10.0.12.11 (10.0.12.11) 56(84) bytes of data.
^C
--- 10.0.12.11 ping statistics ---
16 packets transmitted, 0 received, 100% packet loss, time 14999ms

container1> curl baidu.com
curl: (6) Could not resolve host: baidu.com; Unknown error
container1>

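If containers are to reach other hosts or the external network, Linux must forward packets that are not addressed to itself instead of dropping them. All this takes is enabling Linux's IP Forward feature: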

[root@host ~]# cat /proc/sys/net/ipv4/ip_forward # 0 means disabled, 1 means enabled
0
[root@host ~]# vi /etc/sysctl.d/30-ipforward.conf
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
[root@host ~]# sysctl -p /etc/sysctl.d/30-ipforward.conf
net.ipv4.ip_forward = 1
net.ipv6.conf.default.forwarding = 1
net.ipv6.conf.all.forwarding = 1
[root@host ~]# cat /proc/sys/net/ipv4/ip_forward # 0 means disabled, 1 means enabled
1
[root@host ~]#
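The check done with cat above can also be performed from a program. The sketch below assumes a Linux host and only reads the setting; the helper name forwardingEnabled is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// forwardingEnabled interprets the contents of the ip_forward file:
// "1" means the kernel forwards packets, "0" means it does not.
func forwardingEnabled(contents string) bool {
	return strings.TrimSpace(contents) == "1"
}

func main() {
	const path = "/proc/sys/net/ipv4/ip_forward"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("cannot read", path+":", err)
		return
	}
	fmt.Println("IPv4 forwarding enabled:", forwardingEnabled(string(data)))
}
```

A setup tool could run a check like this before adding NAT rules, since those rules have no effect on forwarded traffic while forwarding is switched off.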

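One problem remains, though. When a container accesses another host or the external network (veth1->br0), Linux now forwards the packet on (br0->eth0). But when that host or external network responds to the container's request, it knows nothing about the 172.17.0.0/24 network we configured, so the request still cannot complete. We therefore also need NAT (Network Address Translation): rewriting the IP address and port in the IP packet header to a different IP address and port.

NAT is also used to mitigate the IPv4 address exhaustion problem.

What needs to be rewritten here is the source IP address, so this is SNAT (Source NAT): the source IP is translated to the IP of the host's outgoing interface.

In Linux, this SNAT can be implemented with the MASQUERADE target in iptables: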

[root@host ~]# iptables -t nat -A POSTROUTING -s 172.17.0.0/24 ! -o br0 -j MASQUERADE
[root@host ~]# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/24        0.0.0.0/0
[root@host ~]#

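This rule applies SNAT to packets whose source address is in 172.17.0.0/24 (that is, packets sent from the containers) and that leave through any interface other than br0.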

Our containers can now reach other hosts (and the external network):

container1> ping 10.0.12.11
PING 10.0.12.11 (10.0.12.11) 56(84) bytes of data.
64 bytes from 10.0.12.11: icmp_seq=1 ttl=63 time=0.231 ms
64 bytes from 10.0.12.11: icmp_seq=2 ttl=63 time=0.216 ms
64 bytes from 10.0.12.11: icmp_seq=3 ttl=63 time=0.206 ms
^C
--- 10.0.12.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.206/0.217/0.231/0.019 ms
container1> curl baidu.com
<html>
<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">
</html>
container1>

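External Access to Containers (Container Port Mapping)

Of the three aspects of communication between containers and external networks, we have handled communication between containers and the host, and containers accessing other hosts (external networks). One remains: access to containers from outside.

In Docker, to make a container's service reachable from outside, we map a container port, for example: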

[root@host ~]# docker run -p 8000:8080 xxx

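The -p flag maps port 8080 inside the container to port 8000 on the host, so the outside world can reach our container's service through the host IP and port 8000.

This is also implemented with NAT. Unlike the SNAT above, though, what we rewrite here is the destination IP address, that is, DNAT (Destination NAT): requests arriving at port 8000 on the host are forwarded to the container's address, 172.17.0.101:8080.

In Linux, this DNAT can be implemented with the DNAT target in iptables: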

[root@host ~]# iptables -t nat -A PREROUTING  ! -i br0 -p tcp -m tcp --dport 8000 -j DNAT --to-destination 172.17.0.101:8080
[root@host ~]# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 to:172.17.0.101:8080

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/24        0.0.0.0/0
[root@host ~]#

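This rule forwards TCP requests for port 8000 that arrive on any interface other than br0 to port 8080 on 172.17.0.101. In other words, from host2 (10.0.12.11) we can now reach the actual container1 service by accessing 10.0.12.15:8000: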

[root@host2 ~]# curl 10.0.12.15:8000
container1
[root@host2 ~]#

With that, we have built a topology equivalent to Docker's default network mode:

[Figure: Container network topology]

Summary

The technologies Docker container networking uses are the same Veth+Bridge+route+iptables shown above, except that the configuration is carried out in Go.

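Only by understanding these underlying technologies can we handle container networking problems with confidence.

This article covers only Docker's own default network model. What comes after it, such as the CNCF's CNI container network interface (the Kubernetes network model) and the layered SDN built from Service Mesh + CNI, only gets more complex.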

Source: https://mp.weixin.qq.com/s/aXotIih1RkpyDTaokJjGPw