Giving Docker/LXC containers a routable IP address

At work, we are evaluating Docker as part of our “epic next generation deployment platform”. One of the requirements that our operations team has given us is that our containers have “identity” by virtue of their IP address. In this post, I will describe how we achieved this.

But first, let me explain that a little. Docker (as of version 0.6.3) has two networking modes: one which goes slightly further than what full virtualisation platforms would typically call “host only”, and no networking at all (if you can call that a mode!) In “host only” mode, the “host” (that is, the server running the container) can communicate with the software inside the container very easily. However, accessing the container from beyond the host (say, from a client – shock! horror!) isn’t possible.

As mentioned, Docker goes a little bit further by providing “port forwarding” via iptables/NAT on the host. It selects a “random” port, say 49153, and adds iptables rules such that if you attempt to access this port on the host’s IP address, you will actually reach the container. This is fine until you stop the container and restart it, at which point your container will get a new port. Or, if you restart it on a different host, it will get a new port AND IP address.

One way to address this is via a service discovery mechanism, whereby when the service comes up it registers itself with some kind of well-known directory and clients can discover the service’s location by looking it up in the directory. This has its own problems – not least of which is that the service inside the container has no way of knowing what the IP address of the host it’s running on is, and no way of knowing which port Docker has selected to forward to it!

So, back to the problem in hand. Our ops guys want to treat each container as a well-known service in its own right and give it a proper, routable IP address. This is very much like what a virtualisation platform would call “bridged mode” networking. In my opinion, this is a very sensible choice.

Here’s how I achieved this goal:

Defining a new virtual interface

In order to have a separate IP address from the host, we will need a new virtual interface. I chose to use a macvlan type interface so that it could have its own MAC address and therefore receive an IP by DHCP if required.

ip link add virtual0 link eth0 type macvlan mode bridge

This creates a new virtual interface called virtual0. You will need a separate interface for each container. I am using bridged mode as it means container-to-container traffic doesn’t travel across the wire to the switch/router and back.

If you need a fixed MAC address – perhaps because you want a fixed IP address and you achieve this by putting a fixed MAC-to-IP mapping in your DHCP server – you need to add the following to the end of the line:

ip link add ... address 00:11:22:33:44:55

(substitute your own MAC address!)

If you do not do this, the kernel will randomly generate a MAC for you.
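To make that concrete, here is a hypothetical invocation that creates the interface with a fixed MAC in one go. Note that some iproute2 versions insist on the address option appearing before the type keyword, and you can always set the MAC in a separate step if the combined form is rejected:

ip link add virtual0 link eth0 address 00:11:22:33:44:55 type macvlan mode bridge
# or, equivalently, in two steps:
ip link add virtual0 link eth0 type macvlan mode bridge
ip link set virtual0 address 00:11:22:33:44:55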

Giving the interface an IP address

If you want a fixed IP address and you’re happy to statically configure it:

ip address add 10.10.10.88/24 broadcast 10.10.10.255 dev virtual0

(substitute your own IP address, subnet mask and broadcast address)

If, however, you need to use DHCP to get an address (development environment, for example):

dhclient virtual0

This will start the DHCP client managing just this one interface and leave it running in the background dealing with lease renewals, etc. Once you’re ready to tear down the container and networking, you need to remember to kill the DHCP client, or, better yet:

dhclient -d -r virtual0

will locate the existing instance of the client for that interface, tell it to properly release the DHCP lease, and then exit.
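While we’re on the subject of teardown: once the container has gone, you will also want to remove the virtual interface itself, for example:

ip link delete virtual0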

Finally, bring the interface up:

ip link set virtual0 up

Find the container’s internal IP address

If you haven’t done so already, start your container as a daemon.

Use docker inspect <container id> to find the internal IP address under NetworkSettings/IPAddress.
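If you want to script this, something along these lines should pull the address out into a shell variable (a rough sketch – the exact JSON layout varies between Docker versions, so treat the grep/sed as illustrative):

CONTAINER_IP=$(docker inspect <container id> | grep '"IPAddress"' | head -1 | sed 's/[^0-9.]//g')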

Setting up inbound routing/NAT

Create a chain for the inbound routing rules for this container. I prefer to use a separate chain for each container as it makes cleaning up easier.

iptables -t nat -N BRIDGE-VIRTUAL0 # some unique name for the iptables chain for this container

Now, send all incoming traffic to this chain. We add a rule in PREROUTING which will match traffic coming in from outside, and a rule in OUTPUT which will match traffic generated on the host. Here, <container external IP> is the routable IP allocated statically or via DHCP to the virtual0 interface.

iptables -t nat -A PREROUTING -p all -d <container external IP> -j BRIDGE-VIRTUAL0
iptables -t nat -A OUTPUT -p all -d <container external IP> -j BRIDGE-VIRTUAL0

Finally, NAT all inbound traffic to the container. Here, <container internal IP> is the internal address of the container as discovered using the docker inspect command.

iptables -t nat -A BRIDGE-VIRTUAL0 -p all -j DNAT --to-destination <container internal IP>

This particular rule will forward inbound traffic for any port on the external IP to the same port on the internal IP – effectively exposing anything that the container exposes to the outside world. As an alternative, you can expose individual ports like so (note that DNAT to a specific port needs a specific protocol, hence -p tcp rather than -p all):

iptables -t nat -A BRIDGE-VIRTUAL0 -p tcp --dport 80 -j DNAT --to-destination <container internal IP>:8080

In this example, we forward any TCP traffic hitting port 80 on our external IP to port 8080 on the internal IP – effectively exposing a web server listening on port 8080 inside the container as port 80 on the external IP. Nice tidy URLs that don’t have port numbers in them. Neat, huh? You can even map multiple external ports to the same internal port by using multiple rules if you wish.
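As an aside, this is where using a separate chain per container pays off. When the container goes away, cleanup is roughly a matter of deleting the two jump rules and then flushing and removing the chain, for example:

iptables -t nat -D PREROUTING -p all -d <container external IP> -j BRIDGE-VIRTUAL0
iptables -t nat -D OUTPUT -p all -d <container external IP> -j BRIDGE-VIRTUAL0
iptables -t nat -F BRIDGE-VIRTUAL0   # flush the chain's rules
iptables -t nat -X BRIDGE-VIRTUAL0   # delete the now-empty chain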

Setting up outbound NAT

What we have so far works, but if the container initiates any outbound requests, they will appear (by virtue of Docker’s default MASQUERADE rule) to come from the host’s own IP. This may be fine depending on your circumstances, but it probably won’t play well if you want to get firewalls involved anywhere, and it might be a nuisance if you’re using netstat or similar to diagnose where all the incoming connections to a heavily loaded server are coming from.

So… we’ll add “source NAT” into the equation so that outbound connections from the container come from its own IP:

iptables -t nat -I POSTROUTING -p all -s <container internal IP> -j SNAT --to-source <container external IP>

This rule says that any traffic we’re routing outbound which has a source address of the container’s internal IP should have its source address changed to the container’s external IP.

Note that we use -I (not -A) to ensure our rule goes ahead of Docker’s default MASQUERADE rule.
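You can double-check the ordering with something like:

iptables -t nat -L POSTROUTING -n -v --line-numbers

The container’s SNAT rule should be listed above Docker’s MASQUERADE rule.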

And finally

This all worked fine when I set it up at home, and then failed in the office. It turns out that it fails when you’re on a routed network (my network at home is a simple, flat network) and you try to reach your container from a different subnet. To cut a long story short, you need to relax “reverse path filtering” by switching it to “loose” mode (that’s what the value 2 below means)!

echo 2 > /proc/sys/net/ipv4/conf/eth0/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/virtual0/rp_filter
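Bear in mind that writes under /proc don’t survive a reboot. If you need the setting to persist, the usual approach is to add the equivalent lines to /etc/sysctl.conf (or a file under /etc/sysctl.d/), something like:

net.ipv4.conf.eth0.rp_filter = 2
net.ipv4.conf.virtual0.rp_filter = 2

Remember, though, that virtual0 only exists once you’ve created it, so in practice you may prefer to set its rp_filter value from the same script that creates the interface.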

For more information about this, see this ServerFault question, this article and this article (linked from the previous one).

Summary

There’s a lot of information in this post, but in summary what we’ve done is:

  1. Created a new virtual interface for the container
  2. Given the virtual interface an IP address either statically or using DHCP
  3. Used iptables to set up inbound routing to allow the container to be reached via its external/routable IP from both the host and the network beyond the host
  4. Optionally forwarded (and even more optionally, aliased) only specific inbound ports
  5. Set up source NAT to rewrite outbound traffic from the container to use the correct external IP
  6. Given the kernel a bit of a helping hand to understand that we’re not terrorists and it’s safe to “do the right thing” with our slightly odd-looking traffic

The first 5 steps of this took me about 20 minutes to figure out. Step 6 took me 3 days, the help of several network engineers, one extremely patient ops guy and no small amount of blind luck!
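If it helps, here is the whole procedure collected into a single rough sketch of a script. The interface name, addresses and chain name are the placeholders from the examples above (and the static IP could equally be replaced with a dhclient call), so adapt to taste:

#!/bin/bash
set -e

CONTAINER_ID=$1                 # pass the container id as the first argument
HOST_IF=eth0
VIRT_IF=virtual0
CHAIN=BRIDGE-VIRTUAL0
EXTERNAL_IP=10.10.10.88         # routable IP for the container
EXTERNAL_PREFIX=24
BROADCAST=10.10.10.255

# 1. Create the virtual interface and give it the external (routable) IP
ip link add $VIRT_IF link $HOST_IF type macvlan mode bridge
ip address add $EXTERNAL_IP/$EXTERNAL_PREFIX broadcast $BROADCAST dev $VIRT_IF
ip link set $VIRT_IF up

# 2. Find the container's internal IP
INTERNAL_IP=$(docker inspect $CONTAINER_ID | grep '"IPAddress"' | head -1 | sed 's/[^0-9.]//g')

# 3. Inbound DNAT via a per-container chain
iptables -t nat -N $CHAIN
iptables -t nat -A PREROUTING -p all -d $EXTERNAL_IP -j $CHAIN
iptables -t nat -A OUTPUT -p all -d $EXTERNAL_IP -j $CHAIN
iptables -t nat -A $CHAIN -p all -j DNAT --to-destination $INTERNAL_IP

# 4. Outbound SNAT, inserted ahead of Docker's MASQUERADE rule
iptables -t nat -I POSTROUTING -p all -s $INTERNAL_IP -j SNAT --to-source $EXTERNAL_IP

# 5. Loose reverse path filtering so routed traffic isn't dropped
echo 2 > /proc/sys/net/ipv4/conf/$HOST_IF/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/$VIRT_IF/rp_filter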

I hope having it all written down in one place will be helpful to others.

42 Responses to Giving Docker/LXC containers a routable IP address

  1. nadrimajstor says:

    You’ve spoiled me! And… let’s see… saved me circa 3 days… :) Nah… probably more. Thank you.
    I hope that a new version of Docker will offer some helper function for this kind of network setup.

    • Danny says:

      You’re welcome. I’m talking to the Docker guys about getting some kind of hybrid of my approach and the pipework approach built into Docker. I’d do it myself, but I don’t know Go (not a great excuse) and I’d have to do it on my limited free time.

  2. Patrik Sundberg says:

    I’m using openvswitch and virtual ethernet interfaces for LXC containers, and I attach the virtual interfaces to the OVS switch via the LXC up and down scripts.

    I’m not using docker but my home grown LXC solution that is similar.

    Once you’ve got the containers on the OVS, things are easy: you can run dnsmasq to hand out IPs, add routes and firewall rules easily, etc. I’ve been very happy with the combo of LXC and OVS.

    Could be worth evaluating for your situation.

  3. Patrik says:

    I’m using openvswitch together with LXC for this purpose in my homegrown setup. The container gets a veth interface that I attach to the OVS, and that lets me easily handle IP addresses (dynamic or static), routing, firewall rules etc. I don’t use Docker but something similar homegrown on top of LXC, so this should be applicable.

    It may be worth giving that kind of thing a spin; it separates things quite nicely. I never liked the port-forwarding stuff in Docker.

  4. nadrimajstor says:

    hmm…. SDN…
    Can I pull off some crazy stunts with Open vSwitch, like having a separate Docker/LXC instance for every service (mail, http, …) while all of them share the same public IPv4 address, but without using NAT?

  5. Patrik Sundberg says:

    Openvswitch is just that – a switch. In my case I don’t want the containers to share the same public IP; I want them to have separate private IPs and to be able to talk to each other over said private network. If you connect your containers to OVS it won’t help with sharing the same public IP – it’s just a switch, and you’d need port forwarding etc. from an interface on the physical host.

  6. Danny Iland says:

    One problem with this kind of setup is that the container cannot reach itself using the external IP. Any ideas on how to make that work efficiently with iptables?

  7. saravanan says:

    How do you run multiple services like sshd, a web server and a database in a single Docker container?
    Any ideas?

  8. Lance Boregard says:

    Hi, do you have a script that summarises all of this? I would gladly accept some guidance, coming from a Linux/Docker newbie.

    From what I gather, you would need to run this “script” each time the container is started/restarted (since the internal IP may change), right?

  9. Vijay says:

    I am very new to this topic, so I believe I have screwed something up while trying to follow your steps. I ran the commands below to complete the configuration.

    1) ip link add vipr2 link eth0 type macvlan mode bridge
    2) ip address add 10.10.192.89/24 broadcast 10.10.192.255 dev vipr2
    3) ip link set vipr2 up
    4) iptables -t nat -N BRIDGE-VIPR2
    5) iptables -t nat -A PREROUTING -p all -d 10.10.192.89 -j BRIDGE-VIPR2
    6) iptables -t nat -A OUTPUT -p all -d 10.10.192.89 -j BRIDGE-VIPR2
    7) iptables -t nat -A BRIDGE-VIPR2 -p all -j DNAT --to-destination 192.168.227.5
    8) iptables -t nat -I POSTROUTING -p all -s 192.168.227.5 -j SNAT --to-source 10.10.192.89
    9) echo 2 > /proc/sys/net/ipv4/conf/eth0/rp_filter
    10) echo 2 > /proc/sys/net/ipv4/conf/vipr2/rp_filter

    After this, I was trying to ping the host machine(s) and container from each other.

    Host-A
    ——
    Host IP: 10.10.192.233
    Docker IP: 192.168.227.1
    Container IP: 192.168.227.5
    MCVLAN IP: 10.10.192.89

    HOST-B
    ——
    Host IP: 10.10.192.200

    I am able to ping
    1) HOST A -> 192.168.227.1, 192.168.227.5, 10.10.192.89, 10.10.192.200
    2) Container -> 10.10.192.233, 192.168.227.1, 10.10.192.89, 10.10.192.200
    3) Host B -> 10.10.192.233 (host a)

    But I am not able to ping
    Host B -> 10.10.192.89 (mcvlan)

    I believe the communication from outside world to the container should happen through the mcvlan IP and in my case it is not working.

    Please let me know if I am missing something?

    • Danny says:

      I see you’ve posted this on the Docker group. To be honest, you’ll probably get a better response there, so I’ll leave this for now if you don’t mind.

  10. Jeroen van Bemmel says:

    If this is your ‘epic next generation deployment platform’, isn’t it time for IPv6 (without NAT)?

  11. Vijay says:

    Hi Danny,

    Thanks for your response. It looks like so far there is no reply in the docker group. Appreciate your inputs and please let me know if I am not following the steps correctly. My requirement is to see the containers created from docker as an identifiable host (using IP address) from within the hosted network (meaning all the machines from the network should be able to talk to the container using IP address).

    Thanks for your time.

  12. kz says:

    Is this post still relevant? Does Docker now support something like this with the docker0 bridge interface? Or does that not let you give public IPs to containers?

    • Danny says:

      I haven’t looked at networking in a while. Docker has “always” had the docker0 bridge, so I don’t think that’s directly relevant. However, it has recently acquired the ability to bind to any interface of your choosing. So it’s probably possible to use some parts of this post (creating the interface, etc.), and then use some native Docker capabilities to replace some of the iptables stuff in this post. I haven’t tried this though. It would be great to hear if that works.

  13. Terry says:

    Thanks!
    Your howto just keeps on giving in 2014 …..

    G’day from down under :)

    I had spent the last few days trying to solve the current Docker networking issue of external container access, and tried a few solutions, but yours is the easiest, and it works like a charm on an Ubuntu 12.04 (LTS) test Docker machine when added to /etc/rc.local.

    Cheers
    Terry

  14. kz says:

    Are there limits to how many network interfaces I can create? Are there any limits on the iptables mappings? I’m thinking of using this technique for a Docker vhost site that could have N containers on one machine.

  15. Jeroen van Bemmel says:

    Check out https://github.com/jbemmel/ecDock – it’s a little script I wrote that may do what you need, and more.

    Using OpenVSwitch and Docker, ecDock places containers in ‘slots’ with a fixed IP and MAC address (without NAT).

  16. Terry Porter says:

    Hi Jeroen,

    Thanks for your post, EC looks really useful!

    I gave ‘EC’ a go on my Ubuntu 10.04 system running ovs-vsctl (Open vSwitch) 1.4.6 (compiled Jan 10 2014 01:45:55), but got the following error: “ovs-vsctl: Interface does not contain a column whose name matches “ofport_request””. I’ve had a look in the OVS docs, but it’s a lot to absorb, so I thought perhaps, if you have the time, you might know what the problem is?

    #> ec create-vswitch 192.168.0.254/24
    ec: [INFO] create vswitch: NAME=ovs0 IP=192.168.0.254/24 MAC=52:00:c0:a8:00:fe optional DEV=
    ec: [INFO] Docker and OpenVSwitch dependencies OK
    ec: [INFO] New vswitch ‘ovs0’ created with IP 192.168.0.254/24 and MAC 52:00:c0:a8:00:fe

    #> ec --slot=85 start terryp/unifi_v3 /usr/bin/supervisord
    ec: [INFO] start: VSWITCH=ovs0 SLOT=85 IMAGE=terryp/unifi_v3, parameters for container: /usr/bin/supervisord
    ec: [INFO] Docker and OpenVSwitch dependencies OK
    ovs-vsctl: Interface does not contain a column whose name matches “ofport_request”

    ifconfig
    docker0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
    inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
    inet6 addr: fe80::1049:d7ff:fe70:b6ad/64 Scope:Link
    UP BROADCAST MULTICAST MTU:1500 Metric:1
    RX packets:136508 errors:0 dropped:0 overruns:0 frame:0
    TX packets:203891 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:23557567 (23.5 MB) TX bytes:267418409 (267.4 MB)

    eth0 Link encap:Ethernet HWaddr 44:1e:a1:3c:ab:98
    inet addr:192.168.0.105 Bcast:192.168.0.255 Mask:255.255.255.0
    inet6 addr: fe80::461e:a1ff:fe3c:ab98/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:671303 errors:0 dropped:0 overruns:0 frame:0
    TX packets:234731 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:334540491 (334.5 MB) TX bytes:52212249 (52.2 MB)
    Interrupt:18

    lo Link encap:Local Loopback
    inet addr:127.0.0.1 Mask:255.0.0.0
    inet6 addr: ::1/128 Scope:Host
    UP LOOPBACK RUNNING MTU:65536 Metric:1
    RX packets:107339 errors:0 dropped:0 overruns:0 frame:0
    TX packets:107339 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:22939616 (22.9 MB) TX bytes:22939616 (22.9 MB)

    lxcbr0 Link encap:Ethernet HWaddr 7e:71:f3:9b:98:33
    inet addr:10.0.3.1 Bcast:10.0.3.255 Mask:255.255.255.0
    inet6 addr: fe80::7c71:f3ff:fe9b:9833/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:0 (0.0 B) TX bytes:648 (648.0 B)

    ovs0 Link encap:Ethernet HWaddr 52:00:c0:a8:00:fe
    inet addr:192.168.0.254 Bcast:192.168.0.255 Mask:255.255.255.0
    inet6 addr: fe80::5000:c0ff:fea8:fe/64 Scope:Link
    UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:500
    RX bytes:0 (0.0 B) TX bytes:648 (648.0 B)

    Cheers
    Terry

  17. Terry Porter says:

    Jeroen,

    Upgrading to openvswitch 2.0.0 fixed the problem :)

    ec --slot=85 start terryp/unifi_v3:working_base /usr/bin/supervisord
    ec: [INFO] start: VSWITCH=ovs0 SLOT=85 IMAGE=terryp/unifi_v3:working_base, parameters for container: /usr/bin/supervisord
    ec: [INFO] Docker and OpenVSwitch dependencies OK
    ec: [INFO] New container started in slot ovs0-s85 with ID=aaad2306c42b1126dc029585bb7892392c053e1f1347d19af4d93e735e694f87
    ec: [INFO] Found NSPID=340 for container in slot ovs0-s85
    ec: [INFO] New container up and running in slot ovs0-s85, IP=192.168.0.85/24 and MAC=52:00:c0:a8:00:55 with gateway 192.168.0.254 ( dual NIC: )

  18. Terry Porter says:

    Jeroen,
    One last problem I think: when I run “ec attach 85” it hangs at
    “ec: [INFO] Attaching to container in slot ovs0-s85…” and doesn’t return control to the initiating terminal. I can’t ping 192.168.0.85 externally or on the Docker server.

    Any suggestions ?

    ec: [INFO] attach: VSWITCH=ovs0 SLOT=85
    ec: [INFO] Docker and OpenVSwitch dependencies OK
    ec: [INFO] Attaching to container in slot ovs0-s85…

  19. Jeroen van Bemmel says:

    Hi Terry,
    Attaching to a container presumes that it is running some kind of shell to attach to. For example, putting “/bin/bash” as the container’s entrypoint can do the trick.
    You typically have to press Enter when attaching, to get a prompt. I considered sending a newline character automatically somehow, but since one never knows what is running (it could be an application where Enter detonates the bomb… ;) I decided not to.

    Cheers,
    Jeroen

  20. Danny says:

    I haven’t been following along, but you should also be aware that you can attach to a container using lxc-attach --name <long container id> and it doesn’t require you to have anything in particular running in the container – it simply starts a new shell inside the container namespace. This is not very well documented anywhere on the Docker site (or at least it wasn’t last time I looked).

  21. Haro says:

    Hi Danny, great post! Thank you for it :-)

    It works, except for one thing (I don’t know very much about iptables and I don’t even know what to google for).

    I don’t want to expose all ports from the container to the outside world (which btw. works when I try it). So I tried your solution for that:

    iptables -t nat -A BRIDGE-VIRTUAL0 -p all -m tcp --dport 80 -j DNAT --to-destination <container internal IP>:80
    (The container exposes Port 80, too)

    But when I run this, iptables throws this error:
    iptables v1.4.18: Need TCP, UDP, SCTP or DCCP with port specification

    Can you please give me a hint?

  22. KZ says:

    Danny, great post I keep coming back to. Could you give some high-level details on how to accomplish this with IPv6? I’m specifically interested in the routing/NAT part. I have a public IPv6 /64 address pool that I would like to use with this technique. I’ve created macvlan interfaces and attached public IPv6 addresses, which I can then ping publicly. I am having a hard time connecting the containers to the macvlans. I’m not sure how to handle the routing/NAT steps in IPv6.

    I’ve tried using Docker Pipework for this but I can’t seem to get it working.

  23. Pingback: Docker network performance | Research notes

  24. Jeroen Peeters says:

    Thanks for your article! It helped me a lot in creating a set of scripts that automatically create a ‘network’ container that does routing with iptables on a publicly routable IP acquired via DHCP. Such a network container is then linked to a container that you want to make available on the network.

    The scripts, more info and an example are available on: https://github.com/jeroenpeeters/docker-network-containers

  25. Pingback: Docker | Tomorrow's Technology Delivered Today

  26. shankar says:

    I’m not getting an IP from DHCP on the virtual0 interface.

    # dhclient -v virtual0
    Internet Systems Consortium DHCP Client 4.1.1-P1
    Copyright 2004-2010 Internet Systems Consortium.
    All rights reserved.
    For info, please visit https://www.isc.org/software/dhcp/

    Listening on LPF/virtual0/0e:e2:20:b7:cc:f8
    Sending on LPF/virtual0/0e:e2:20:b7:cc:f8
    Sending on Socket/fallback
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 5 (xid=0x52a7cec2)
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 7 (xid=0x52a7cec2)
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 7 (xid=0x52a7cec2)
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 15 (xid=0x52a7cec2)
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 17 (xid=0x52a7cec2)
    DHCPDISCOVER on virtual0 to 255.255.255.255 port 67 interval 10 (xid=0x52a7cec2)
    No DHCPOFFERS received.
    No working leases in persistent database – sleeping.

  27. lingzhi says:

    Thank you very much!!! But if so, I will use the container’s internal IP to ping the container, but will see the container’s external IP when I get packets back from the container, right?

  28. franco says:

    Hello, I’ve been using your example for a few months now. I have 4 public IPs on a system. I put 3 of the IPs on different Docker containers. But now I want to add another container sharing an IP address with an existing container – will it be possible to bridge 2 or more containers per IP?
