So this is our problem. It's a very common use case, and plenty of people online
are asking about it, but nobody seems to have a specific solution. I debugged it,
and the details are here.
To figure out what's going on, I made a syscall log on a machine inside the LAN,
where a simple rostopic echo does work:
sysdig -A proc.name=rostopic and fd.type contains ipv -s 2000
This shows us all the communication between inner running rostopic and the
server. It's really chatty. It's all TCP. There are multiple connections to the
router on port 11311. It also starts up multiple TCP servers on the client that
listen for connections; these would likely break if we were running the client
on outer and a machine inside the LAN tried to talk to them, but thankfully in
my limited testing nothing actually tried to talk to them. The conversations on
port 11311 are really long, but here's the punchline.
inner tells the router:
POST /RPC2 HTTP/1.1
Host: 10.0.1.1:11311
Accept-Encoding: gzip
Content-Type: text/xml
User-Agent: Python-xmlrpc/3.11
Content-Length: 390

<?xml version='1.0'?>
<methodCall>
<methodName>registerSubscriber</methodName>
<params>
<param>
<value><string>/rostopic_2447878_1698362157834</string></value>
</param>
<param>
<value><string>/some/topic</string></value>
</param>
<param>
<value><string>*</string></value>
</param>
<param>
<value><string>http://inner:38229/</string></value>
</param>
</params>
</methodCall>
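For reference, here's a minimal sketch of how a payload like this can be produced with Python's standard xmlrpc.client module. The function name and the argument names (caller_id, topic, topic_type, caller_api) are my own labels for the four string parameters seen in the log, not something taken from the rostopic source, and the exact whitespace of the output may differ from what was captured on the wire.

```python
import xmlrpc.client


def build_register_subscriber(caller_id, topic, topic_type, caller_api):
    """Serialize a registerSubscriber XML-RPC call to a string.

    The parameter names are illustrative labels for the four strings
    seen in the captured request.
    """
    params = (caller_id, topic, topic_type, caller_api)
    return xmlrpc.client.dumps(params, methodname="registerSubscriber")


body = build_register_subscriber(
    "/rostopic_2447878_1698362157834",  # node name seen in the log
    "/some/topic",                      # topic being subscribed to
    "*",                                # third parameter from the log
    "http://inner:38229/",              # XML-RPC server the client started
)
print(body)
```

Note that the last parameter advertises one of the TCP servers that the client itself started, which is the other half of the reachability problem mentioned above.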
Yes. It's laughably chatty. Then the router replies:
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.10
Date: Thu, 26 Oct 2023 23:15:28 GMT
Content-type: text/xml
Content-length: 342

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><array><data>
<value><int>1</int></value>
<value><string>Subscribed to [/some/topic]</string></value>
<value><array><data>
<value><string>http://10.0.1.1:45517/</string></value>
</data></array></value>
</data></array></value>
</param>
</params>
</methodResponse>
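The interesting part of this reply is the URI buried in the innermost array. Here's a small sketch, using only the Python standard library, of pulling the host and port out of a reply shaped like the one above. The helper name publisher_endpoints is hypothetical, and treating the three array elements as (code, status message, list of URIs) is my reading of the captured reply.

```python
import xmlrpc.client
from urllib.parse import urlparse

# The reply body exactly as captured above
RESPONSE = """<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><array><data>
<value><int>1</int></value>
<value><string>Subscribed to [/some/topic]</string></value>
<value><array><data>
<value><string>http://10.0.1.1:45517/</string></value>
</data></array></value>
</data></array></value>
</param>
</params>
</methodResponse>
"""


def publisher_endpoints(response_xml):
    """Return [(host, port), ...] for each URI in the reply."""
    # loads() returns (params, methodname); a response carries one param
    (reply,), _method = xmlrpc.client.loads(response_xml)
    code, status, uris = reply
    if code != 1:
        raise RuntimeError(status)
    endpoints = []
    for uri in uris:
        parsed = urlparse(uri)
        endpoints.append((parsed.hostname, parsed.port))
    return endpoints


print(publisher_endpoints(RESPONSE))
```

Whatever address comes out of this is what the client will try to connect to next.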
Then this sequence of system calls happens in the rostopic process (an excerpt
from the sysdig log):
> connect fd=10(<4>) addr=10.0.1.1:45517
< connect res=-115(EINPROGRESS) tuple=10.0.1.99:47428->10.0.1.1:45517 fd=10(<4t>10.0.1.99:47428->10.0.1.1:45517)
< getsockopt res=0 fd=10(<4t>10.0.1.99:47428->10.0.1.1:45517) level=1(SOL_SOCKET) optname=4(SO_ERROR) val=0 optlen=4
So the inner client makes an outgoing TCP connection to the address given to it
by the ROS master above: 10.0.1.1:45517. This IP is only accessible from within
the LAN, which works fine when talking to it from inner, but would be a problem
from the outside. Furthermore, some sort of single-port-forwarding scheme
wouldn't fix connecting from outer either, since the port number is dynamic.
To confirm what we think is happening, here's the sequence of syscalls when
trying to rostopic echo from outer; it does indeed fail:
connect fd=10(<4>) addr=10.0.1.1:45517
connect res=-115(EINPROGRESS) tuple=10.0.1.1:46204->10.0.1.1:45517 fd=10(<4t>10.0.1.1:46204->10.0.1.1:45517)
getsockopt res=0 fd=10(<4t>10.0.1.1:46204->10.0.1.1:45517) level=1(SOL_SOCKET) optname=4(SO_ERROR) val=-111(ECONNREFUSED) optlen=4
That's the breakage mechanism: the ROS master asks us to communicate on an
address we can't talk to.
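The syscall pattern in both logs (connect returning EINPROGRESS, then getsockopt asking for SO_ERROR) is the usual non-blocking connect idiom. Here's a rough Python sketch of that idiom, handy for checking by hand whether an address the master gave out is reachable from a given machine. The function name and the timeout are my own choices, and this is not the code rostopic actually runs.

```python
import errno
import select
import socket


def nonblocking_connect_errno(host, port, timeout=5.0):
    """Return 0 if the TCP connect succeeds, else the errno value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # A non-blocking connect normally returns EINPROGRESS right away
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS):
            return err  # failed immediately
        # The socket becomes writable once the connect has finished,
        # whether it succeeded or failed
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # SO_ERROR holds the outcome: 0 on success, an errno otherwise
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()


if __name__ == "__main__":
    # Address taken from the master's reply in the log above
    err = nonblocking_connect_errno("10.0.1.1", 45517)
    print(errno.errorcode.get(err, err) if err else "connected")
```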
Debugging this is easy with sysdig:
sudo sysdig -A -s 400 evt.buffer contains '"Subscribed to"' and proc.name=rostopic
This prints out all syscalls made by the rostopic command that contain the
string "Subscribed to", so you can see the different addresses the ROS master
gives us in response to different commands.
OK. So can we get the ROS master to give us an address that we can actually talk
to? Sorta. Remember that we invoked the master with:
ROS_IP=10.0.1.1 roslaunch whatever
The ROS_IP environment variable is exactly the address that the master gives
out. So in this case, we can fix it by doing this instead:
ROS_IP=12.34.56.78 roslaunch whatever
Then the outer machine will be asked to talk to 12.34.56.78:45517, which works.
Unfortunately, if we do that, then the inner machine won't be able to
communicate.
So some sort of ssh port forward cannot fix this: the master hands out a single
address with a dynamic port, so we need a lower-level tunnel, like a VPN or
something.
And another rant. Here rostopic tried to connect to an unreachable address,
which failed. But rostopic knows the connection failed! It should show an error
message to the user. Something like this would be wonderful:
ERROR! Tried to connect to 10.0.1.1:45517 ($ROS_IP:dynamicport), but connect() returned ECONNREFUSED
That would be immensely helpful. It would tell the user that something went
wrong (instead of no data being sent), and it would give a strong indication of
the problem and how to fix it. But that would be asking too much.
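To show how little it would take, here's a hypothetical sketch of such a check in Python. This is not how rostopic is implemented, and the function name connect_or_explain is made up. It just tries the connection and turns the errno into the message above.

```python
import errno
import socket


def connect_or_explain(host, port, timeout=5.0):
    """Return None if the connect succeeds, else a readable error string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex() returns an errno value instead of raising
        err = sock.connect_ex((host, port))
    if err == 0:
        return None
    # Map the number to its symbolic name, e.g. 111 -> ECONNREFUSED on Linux
    name = errno.errorcode.get(err, str(err))
    return (f"ERROR! Tried to connect to {host}:{port} "
            f"($ROS_IP:dynamicport), but connect() returned {name}")


if __name__ == "__main__":
    # Address taken from the master's reply in the log above
    message = connect_or_explain("10.0.1.1", 45517)
    if message:
        print(message)
```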