Idiot’s guide to TCPIP connectivity

I had a working TCPIP network, and made a few “improvements”. Unfortunately these improvements sometimes stopped the connectivity between systems, and I had a frustrating time understanding the problems and fixing them. The idiot in the blog post is me, for next time when I need to connect boxes together.

In concept, TCPIP connectivity is simple, but there are some subtle, non-obvious things you need to be aware of.

As I was writing this post I found I did not really know how IPV4 worked, because it used “the wrong” IP address but still worked.

I found many ways of failing to connect to TCPIP, and some complex ways of getting it to work – I just wanted a simple way of being able to ping z/OS from my laptop. It is complicated because some definitions need to be done in order; doing things in a different order sometimes worked, and sometimes did not.

Basic TCPIP concepts that everyone should know

  • The term socket is used by applications to communicate with TCP/IP, not where you connect a network/phone cable.
  • Think of a connection between two boxes. I have a yellow Ethernet cable between them. There are several terms for where the cable is plugged in. A common term is the interface.
  • IP addresses
    • Each end of the connection has one or more IP addresses.  I think of it as having plastic labels tied to the end of the cable.
    • IPV6 addresses beginning with fe… and ff… are used by (internal use) advanced technology and can be ignored. You can use them, but the addresses may change every time the connection is started, which makes it hard to automate using them.
    • The system may generate some IPV6 addresses, but you can define your own. The system generated an address like 2a00:9999:8888:7777:894e:9876:781:32f1. Sometimes parts of these (the right hand part) are randomised (to make it harder for people to observe traffic patterns and so hack your system).
    • I use addresses like 2001:db8::f which are shorter to type.
    • On z/OS an IPV4 interface can have only one IP address. An IPV6 interface can have multiple addresses (see ADDADDR). On z/OS an interface can be IPV4 or IPV6 but not both.
    • On Linux, an interface can have multiple IPV4, and multiple IPV6 addresses (but only the first IPV4 may be visible to applications)
    • For IPV6, TCP/IP can generate its own IPV6 addresses for internal processing, such as routing.
  • To get data from this machine to that machine over the yellow Ethernet cable, you have a route definition like “for this range of remote addresses use the yellow Ethernet cable, which has the address xxxx at the far end”.
  • If you use TCP/IP to send a request, you usually want a response to come back. As well as defining a route to get to the remote end, you need a route defined to get from the remote machine back to the local machine. A ping request can fail because
    • The local end does not have a valid route to the remote end. The packet could be sent to the wrong place(down the wrong cable), or just discarded.
    • An intermediate box does not have a route to the remote end.
    • The remote end receives the request but does not have a route definition to send the response back to the requester.
    • An intermediate box does not have a route to the local end.
    • A firewall says no.
    • You can use the traceroute command to find the path taken to the remote end. This will tell you the path it took to get there. It does not tell you the route back. For this you need to issue the traceroute command on the remote end, and perhaps on intermediate boxes.
  • You define a route from this box using the yellow cable with label xxxx on it. The remote end of the cable has IP address….
  • You need at least two route statements
    • to get the data from the local system to the remote system,
    • the remote system needs a route statement to get to the local system.
  • You can find these address using
    • the Linux command ip -6 addr or ip -4 addr for TCP IPV6 and IPV4 respectively.
    • the z/OS command TSO NETSTAT HOME
  • Subnet: an IPV6 address has 32 hex digits. These are written as eight groups of four hex digits (16 bits each), for example 2001:0DB8… Leading zeros in a group can be dropped, so this can be written as 2001:db8… The subnet specifies which bits are significant when routing packets to the router. With z/OS usually the top 64 bits are used. This is written as …/64.
  • An address 2001:db8:9::1/64 is in a different subnet to address 2001:db8:8::1/64.
  • Address 2001:db8:8:1::2/64 is in the same subnet as 2001:db8:8:1::3 because only the top 64 bits count towards the subnet (2001:db8:8:1).
  • A gateway is a network point that acts as an entrance to another network. On the Internet, a node or stopping point can be either a gateway node or a host (end-point) node. Both the computers of Internet users and the computers that serve pages to users are host nodes. A gateway can have one protocol in, and output the data in a different protocol. For example I have broadband coming to my house. The gateway router converts this to TCP/IP, and converts it to wireless.
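The subnet examples in the list above can be checked with Python's standard ipaddress module. This is a small sketch for experimenting on a laptop; the helper name same_subnet is mine, and the default of 64 bits simply follows the /64 used in this post.

```python
import ipaddress


def same_subnet(a: str, b: str, prefix_len: int = 64) -> bool:
    """Return True when both addresses share the same top prefix_len bits."""
    net_a = ipaddress.ip_interface(f"{a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{b}/{prefix_len}").network
    return net_a == net_b


# The two pairs of addresses used in the list above
print(same_subnet("2001:db8:9::1", "2001:db8:8::1"))      # different subnets
print(same_subnet("2001:db8:8:1::2", "2001:db8:8:1::3"))  # same subnet

# Short and long forms of the same address
print(ipaddress.ip_address("2001:0DB8::f").compressed)
print(ipaddress.ip_address("2001:db8::f").exploded)
```

The compressed form is the one that is shorter to type; the exploded form shows all 32 hex digits.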

Things that you may not know

  • My end of a connection has two IP addresses defined. If I ping a remote site it uses the first IP address in its list, the remote site sees a packet of data from the first IP address in the list. You may have configured a route at the remote system to get back to your local system, but if you define your local addresses in a different order, a different IP address will be sent – and the remote end may not have a route for it.
  • If the interface at the next machine has two IP addresses 10.1.0.3 and 7.168.1.2 , I have to use the first IP address in the list defining a route sudo ip -4 route add 7.168.1.74 via 10.1.0.3 dev enp0s31f6. If I delete the first address, then I need to use the 7.168.1.2
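The first point above can be modelled in a few lines of Python. This is a simplified sketch of the behaviour I saw, not how either TCP/IP stack is implemented: it assumes the first local address is used as the source, and that the remote end can reply only if one of its routes covers that source. The function name and the /24 route are illustrative assumptions.

```python
import ipaddress


def reply_can_return(local_addresses, remote_routes):
    """Model: the first local address is the source of the ping.

    The reply can come back only if the remote system has a route
    (given here as a list of prefixes) that covers that source address.
    """
    source = ipaddress.ip_address(local_addresses[0])
    return any(source in ipaddress.ip_network(route) for route in remote_routes)


# Hypothetical remote system which only has a route back to 10.1.0.0/24
remote_routes = ["10.1.0.0/24"]

print(reply_can_return(["10.1.0.2", "192.168.1.222"], remote_routes))
print(reply_can_return(["192.168.1.222", "10.1.0.2"], remote_routes))
```

The same two local addresses give a different result when they are defined in a different order, which is the trap described above.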

Why does it take an application using TCPIP, so long to start?

I had problems with a couple of applications taking over 30 seconds to start. For example FTP and the RMF DDS Server.


I found this was caused by a misconfigured TCPIP resolver. An application can ask the TCPIP Resolver function for the IP address (10.1.1.2) or the string address (BBC.CO.UK). On my system this was configured in ADCD.Z31A.TCPPARMS(GBLTDATA) as

LOOKUP DNS LOCAL

This says, go to the network, and ask the DNS server “out there” for information. If this request times out, use the local information. On my system the path to the DNS server was not configured, so it waited, and eventually timed out.

When I changed the LOOKUP definition to

LOOKUP LOCAL

it came up with no delays.
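A quick way to see whether name resolution is the slow part is to time a lookup. The sketch below uses Python's socket.getaddrinfo and is meant for a machine where Python is handy (for example my laptop); it does not use the z/OS resolver configuration itself, and the helper name timed_lookup is mine.

```python
import socket
import time


def timed_lookup(name: str):
    """Resolve name and return (seconds taken, sorted list of addresses)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, None)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Resolution failed (or timed out); report how long that took
        addresses = []
    elapsed = time.monotonic() - start
    return elapsed, addresses


# A numeric address needs no DNS server, so this should return at once.
# Replace it with a host name to time a real lookup.
elapsed, addresses = timed_lookup("127.0.0.1")
print(f"{elapsed:.3f} seconds, addresses {addresses}")
```

If a lookup of a host name takes many seconds before failing, the resolver is probably waiting for a DNS server it cannot reach.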

Setting up a JES2 output NJE TCPIP node as a client using AT-TLS

This is part of some work I did to configure AT-TLS for a JES2 TCPIP node to another system.

I didn’t have a remote system to connect to, but I had a Python TLS server which the NJE node could connect to (and then end), which demonstrated the TLS connection.

The JES2 definition

The address of the remote end, running the Python TLS server was 10.1.0.2.

$ADDSOCKET(LAPTOP),IPADDR=10.1.0.2,LINE=3,NETSRV=1,NODE=50,PORT=2175,SECURE=NO 

Starting the NJE node

$SN,SOCKET=LAPTOP

The AT-TLS definitions

This definition acts as a client to a remote server, so AT-TLS needs to be configured as an AT-TLS client.

TTLSRule CPJES2OUT
{
  RemoteAddr 10.1.0.2
  RemotePortRange 2175
  Direction Output
  TTLSGroupAction
  {
    TTLSEnabled On
  }
  TTLSEnvironmentAction
  {
    HandshakeRole Client
    TTLSEnvironmentAdvancedParms
    {
      # clientAuthType needs to be required or Passthru
      ClientAuthType PassThru
      TLSv1 Off
      TLSv1.1 Off
      TLSv1.2 On
      # TLSv1.3 On
    }
    TTLSKeyringParms AZFKeyringParms
    {
      Keyring start1/TN3270
    }
  }
  TTLSConnectionAction
  {
    TTLSCipherParmsRef AZFCipherParms
    TTLSConnectionAdvancedParms
    {
      # ServerCertificateLabel is for a server connection
      # ServerCertificateLabel RSA2048
      CertificateLabel RSA2048
      # ApplicationControlled OFF
    }
  }
}

AZFCipherParms

I put common definitions into their own section, for example

TTLSCipherParms AZFCipherParms 
{
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# TLSv1.3
V3CipherSuites TLS_CHACHA20_POLY1305_SHA256
}

Using TLSv1.3

You need TTLSEnvironmentAdvancedParms to contain

TTLSEnvironmentAdvancedParms 
{
TLSv1.1 Off
TLSv1.2 On
TLSv1.3 On
}

and at least one TLSV1.3 cipher spec.

TTLSCipherParms AZFCipherParms 
{
# TLSv1.2
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
V3CipherSuites TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
V3CipherSuites TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
# TLSv1.3
V3CipherSuites TLS_CHACHA20_POLY1305_SHA256
# TLSv1.2
V3CipherSuites4Char TLS_CHACHA20_POLY1305_SHA256
V3CipherSuites4Char TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384
V3CipherSuites4Char TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256
V3CipherSuites4Char TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256 C02
}

An example of a TLSv1.3 cipher spec is TLS_CHACHA20_POLY1305_SHA256.

See Cipher suite definitions and search for 1301 (TLS_AES_128_GCM_SHA256), 1302 (TLS_AES_256_GCM_SHA384), 1303 (TLS_CHACHA20_POLY1305_SHA256).
There is a column called TLSv1.3 (but it is hard to find). There are two tables; you need to use the second table to find which versions of TLS each cipher spec supports.

Python server

The code below acted as a remote TLS server for the handshake.

import pprint
import socket
import ssl

HOST = ''    # listen on all local addresses
PORT = 2175

cafile = "/home/colinpaice/ssl/ssl2/jun24/docca256.pem"
certfile = "/home/colinpaice/ssl/ssl2/jun24/docec521june.pem"
keyfile = "/home/colinpaice/ssl/ssl2/jun24/docec521june.key.pem"

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.maximum_version = ssl.TLSVersion.TLSv1_3
context.load_cert_chain(certfile, keyfile)
context.load_verify_locations(cafile=cafile)

# the client (the z/OS NJE node) must send a certificate
context.verify_mode = ssl.CERT_REQUIRED
getciphers = context.get_ciphers()
#for gc in getciphers:
#    print("get cipher",gc)

# the with statements close the sockets when the block ends
with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:
    sock.bind((HOST, PORT))
    sock.listen(1)
    with context.wrap_socket(sock, server_side=True) as ssock:
        # accept one connection; the TLS handshake is done here
        conn, addr = ssock.accept()
        cert = conn.getpeercert()
        pprint.pprint(cert)
        v = conn.version()
        print("version",v)
        c = conn.cipher()
        print("ciphers",c)

When this ran, and the z/OS NJE node connected to it ($SN,SOCKET=LAPTOP), the output was

{'issuer': ((('organizationName', 'COLIN'),),
(('organizationalUnitName', 'CA'),),
(('commonName', 'DocZosCA'),)),
'notAfter': 'Jun 17 23:59:59 2025 GMT',
'notBefore': 'Jun 17 00:00:00 2024 GMT',
'serialNumber': '07',
'subject': ((('organizationName', 'RSA2048'),),
(('organizationalUnitName', 'SSS'),),
(('commonName', '10.1.1.2'),)),
'subjectAltName': (('IP Address', '10.1.1.2'),),
'version': 3}
version TLSv1.3
ciphers ('TLS_CHACHA20_POLY1305_SHA256', 'TLSv1.3', 256)

Showing the certificate, the level of TLS and the cipher spec used.
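If you want to check the Python server before involving z/OS, a small Python client can do the same handshake. This is a sketch under assumptions: the function names are mine, you would pass your own cafile, certfile and keyfile, and because hostname checking is left on, the server certificate has to contain the host name or IP address you connect to.

```python
import socket
import ssl


def make_client_context(cafile=None, certfile=None, keyfile=None):
    """Build a TLS client context which insists on TLS 1.2 or later."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cafile:
        # CA used to validate the certificate sent by the server
        context.load_verify_locations(cafile=cafile)
    if certfile:
        # certificate sent to the server when it asks for one
        context.load_cert_chain(certfile, keyfile)
    return context


def probe(host, port, context, timeout=5.0):
    """Connect, do the handshake, and return (TLS version, cipher)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()


# Example use (needs the server above to be running, and your own files):
# ctx = make_client_context(cafile="ca.pem", certfile="me.pem", keyfile="me.key.pem")
# print(probe("10.1.0.2", 2175, ctx))
```

The version and cipher returned should match what the server prints.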

The messages on the z/OS console were

$SN,SOCKET=LAPTOP                                                        
$HASP000 OK
IAZ0543I NETSRV1 TCP/IP connection with IP Addr: 10.1.0.2 Port: 2175
Initiated
IAZ0543I NETSRV1 TCP/IP connection with IP Addr: 10.1.0.2 Port: 2175
Successful
IAZ0543I NETSRV1 TCP/IP connection with IP Addr: ::ffff:10.1.0.2 Port:
2175 ended due to TCP/IP error, rc: 1121

Setting up JES2 input NJE node (server) and AT-TLS

I got this working in response to a question about AT-TLS and JES2.

You need to configure the port and IP address of the destination node using AT-TLS.

I created the socket definitions

$ADDSOCKET(TLS),NODE=1,IPADDR=10.1.1.2,NETSRV=1,PORT=2275

Before you start

Get a working JES2 NJE, and a working AT-TLS environment, first. It is difficult to get AT-TLS configured at the same time as getting NJE to work.

JES2 NJE needs a Netserver (NETSRV) to do the TCP/IP communication.

When you configure AT-TLS this intercepts the traffic to the IP address and port and does the TLS magic. This means you need a different netserver, and a tls specific port, and a TLS specific socket. It looks like the default TLS port is 2252. The doc says

SECURE=OPTIONAL|REQUIRED|USE_SOCKET
Indicates whether the NETSERV should accept only connection requests with a secure protocol in use such as TLS/SSL. When SECURE=REQUIRED is specified, the NETSERV rejects all connection requests that do not specify a secure protocol is to be used for the connection. When SECURE=OPTIONAL is specified, the NETSERV allows connections with or without a secure protocol in use.
The default, USE_SOCKET, inherits the SECURE setting from the SOCKET statement associated with the NETSERV. If the SOCKET says SECURE=YES, then processing is the same as specifying
SECURE=REQUIRED on the NETSERV.
To specify that the NETSERV should use NJENET-SSL (2252) as the PORT it is listening on and the default port for outgoing connections, but not require all connections to use TLS/SSL, you must specify SOCKET SECURE=YES on the socket that is associated with the NETSERV and set the NETSERV to SECURE=OPTIONAL.

I do not understand this because AT-TLS will try to do a TLS handshake and fail if the session is not a TLS session.

It feels like the easiest way is to have a netserver just for TLS with its own port. I may be wrong.

In my PAGENT configuration, I took a working TLSrule and created

TTLSRule CPJES2IN 
{
LocalAddr ALL
RemoteAddr ALL
LocalPortRange 2252
Direction Inbound
Priority 255
TTLSGroupActionRef AZFGroupAction1
TTLSEnvironmentActionRef AZFEnvAction1
TTLSConnectionActionRef AZFConnAction1
}

This is for the inbound traffic on port 2252.

I defined the JES2 node

$TSOCKET(TLS),NODE=1,IPADDR=10.1.1.2,NETSRV=1,PORT=2252 

with the matching port=2252

I assigned this socket to netsrv1, and started it

$TNETSRV1,SOCKET=TLS
$SNETSRV1

I used a modified version of the Python NJE client to connect to z/OS, in which I defined a certfile, keyfile and cafile.

I used

nje = njelib.NJE("N50","S0W1")
nje.set_debuglevel(1)
# nje.setTLS is code added by Colin
#nje.setTLS(certfile="/home/colinpaice/ssl/ssl2/jun24/docec521june.pem",
# keyfile="/home/colinpaice/ssl/ssl2/jun24/docec521june.key.pem",
# cafile="/home/colinpaice/ssl/ssl2/jun24/docca256.pem")
connected = nje.session(host="10.1.1.2",port=2252,timeout=1)

Where the JES2 system is called S0W1, the node used is N50.

The z/OS IP address is 10.1.1.2, and the port is 2252.

There were no helpful messages to say the session was using TLS. I used Wireshark on the connection, and AT-TLS trace to check the TLS calls.

If I used a non TLS connection to the z/OS node I got

EZD1287I TTLS Error RC: 5003 Data Decryption    
LOCAL: ::FFFF:10.1.1.2..2252
REMOTE: ::FFFF:10.1.0.2..41288
JOBNAME: JES2S001 RULE: CPJES2IN

showing the AT-TLS definition was CPJES2IN.

RC 5003 will occur when the AT-TLS process is expecting a TLS message but receives a clear-text message – so no TLS request coming in.
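The difference between a TLS message and clear text is visible in the first few bytes on the wire, for example in a Wireshark trace. A TLS handshake record starts with content type 0x16 followed by a version whose first byte is 0x03. The sketch below checks for that; it is an illustration of the idea, not a description of how AT-TLS makes its decision, and the function name is mine.

```python
def looks_like_tls_record(first_bytes: bytes) -> bool:
    """Return True if the data starts like a TLS handshake record.

    A TLS record header is: content type (1 byte), version (2 bytes),
    length (2 bytes). A handshake record has content type 0x16 and a
    version whose major byte is 0x03.
    """
    if len(first_bytes) < 3:
        return False
    content_type = first_bytes[0]
    major_version = first_bytes[1]
    return content_type == 0x16 and major_version == 0x03


# Start of a typical ClientHello record, and some clear text
print(looks_like_tls_record(bytes.fromhex("1603010200")))
print(looks_like_tls_record(b"hello"))
```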

Setting up JES2 NJE using TCP/IP

I was trying to test TLS and JES2 NJE, and needed to get JES2 NJE working. I did not have a remote system to use, so I used Python NJE. I also used openssl s_server to act as a server – just for the connection.

For more information on setting up JES 2 NJE with TLS see:

Setting up NJE on JES2

You can use static definitions (in the JES2PARM member) or define the resources dynamically using commands.

The bits you need

TCP/IP work is done in a net server NETSRV task. You can define more than one of these to allow you to partition the work.

The net server needs a SOCKET definition. This socket definition needs the IP address on the local system, and the port used to connect to the socket code. If you let it default to the local IP address, it may not pick the IP address you want to use.

You need a NODE definition for the remote end.

You need a TCP/IP LINE definition for the connection to the remote system.

You need a SOCKET for the remote connection, giving the IP address of the remote end, the port to be used at the remote end, the LINE definition to be used, and the NODE to be used.

These have to be started before they can be used.

I had firewall problems on my Linux server, where it was not forwarding packets to the remote system. Once I fixed this, the connection was easy.

Static definition

The address of my z/OS is 10.1.1.2. The address of the remote end is 10.1.0.2

In the JES2 parmlib members I added

NODE(2)     NAME=LAPTOP    
SOCKET(LOC) NODE=1,IPADDR=10.1.1.2,netsrv=1,PORT=175
NETSRV(1) SOCKET=LOC
SOCKET(LAPTOP) NODE=50,IPADDR=10.1.0.2,LINE=2,NETSRV=1,port=22
LINE(2) UNIT=TCP

Dynamic definitions

I used the following operator commands to define the resources, rather than define them statically

$ADDSOCKET(LOC),NODE=1,IPADDR=10.1.1.2,netsrv=1,PORT=175
$Addnetsrv(1),socket=LOC
$addline(2),unit=tcp
$ADDSOCKET(LAPTOP),IPADDR=10.1.0.2,line=2,netsrv=1,node=50

You need to use a statically defined NODE.

Starting them up

I then issued

  • $SNetsrv1 This starts an address space with name JES2S001.
  • $SLNE2 to start the line
  • $Sn,socket=LAPTOP

Other useful commands

  • $DNETSRV1
  • $DNetsrv1,sessions this gave output like
    • $HASP898 NETSRV1 SESSIONS=(LNE2/LAPTOP/S6)
  • $DNetsrv1,socket this displays which socket the net server is using.
  • $DSOCKET to display all sockets
  • $DSOCKET(LAPTOP4)
  • $TSOCKET(LOC),SECURE=YES,PORT=2275

Destination unreachable, Port unreachable. Which firewall rule is blocking me?

I was trying to connect an application on z/OS through a server to my laptop – so three systems involved.

On the connection from the server to my laptop, using Wireshark I could see no traffic from the application.

When I used Wireshark on the z/OS to server connection I got

   Source   Destination port Protocol info 
>1 10.1.1.2 10.1.0.2 2175 TCP ..
<2 10.1.1.1 10.1.1.2 2175 ICMP Destination unreachable (Port unreachable)

This means

  1. There was a TCP/IP Packet from 10.1.1.2 (z/OS) to 10.1.0.2 (mylaptop) port 2175
  2. Response:Destination unreachable (Port unreachable)

This was a surprise because I could ping from z/OS through the server to the laptop.

Looking in the firewall log ( /var/log/ufw.log) I found

[UFW BLOCK] IN=tap0 OUT=eno1 MAC=... SRC=10.1.1.2 DST=10.1.0.2 ... PROTO=TCP SPT=1050 DPT=2175 ...

This says

  • Packet was blocked. When using the ufw firewall – all of its messages and definitions contain ufw.
  • From 10.1.1.2
  • To 10.1.0.2
  • Source port 1050
  • Destination port 2175
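When there are many entries in the log, it can help to pull the fields out with a few lines of Python. This is a minimal sketch; the function name is mine, and the sample line is the one above with the elided parts left out.

```python
import re

# The log entry from above, without the parts shown as "..."
LOG_LINE = ("[UFW BLOCK] IN=tap0 OUT=eno1 SRC=10.1.1.2 DST=10.1.0.2 "
            "PROTO=TCP SPT=1050 DPT=2175")


def parse_ufw_line(line: str):
    """Return the KEY=value fields of a ufw log line as a dict, or None."""
    action = re.search(r"\[UFW (\w+)\]", line)
    if action is None:
        return None          # not a ufw message
    fields = dict(re.findall(r"(\w+)=(\S*)", line))
    fields["ACTION"] = action.group(1)
    return fields


entry = parse_ufw_line(LOG_LINE)
print(entry["ACTION"], entry["SRC"], "->", entry["DST"], "port", entry["DPT"])
```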

I fixed this with the command

sudo ufw route allow in on tap0 out on eno1

This allows traffic to be routed through this node from interface tap0 to interface eno1, and solved my problem.

What caused the problem?

iptables allows the systems administrator to define rules (or chains of rules – think subroutines) to control the flow of packets through the Linux kernel. For example

  • control input packets destined for this system
  • control output packets from this system
  • control forwarded packets flowing through this system.

ufw is an interface to iptables which makes it easier to define rules.

You can use

sudo ufw status

to display the ufw definitions, for example

To                         Action      From
-- ------ ----
22/tcp ALLOW Anywhere
Anywhere on eno1 ALLOW Anywhere
Anywhere on tap0 ALLOW Anywhere (log) # ‘colin-ethernet’

You can use

sudo iptables -L -v

to display the iptables. The -v option shows you how many times the rules have been used.

sudo iptables-save reports on all of the rules. For example (a very small subset of my rules)

-A FORWARD -j ufw-before-forward
-A ufw-before-forward -j ufw-user-forward
-A ufw-user-forward -i tap0 -o eno1 -j ACCEPT
-A ufw-user-forward -i eno1 -o tap0 -j ACCEPT

-A ufw-skip-to-policy-forward -j REJECT --reject-with icmp-port-unreachable

Where

  • -A FORWARD.… says when doing forwarding use the rule (subroutine) called ufw-before-forward. You can have many of these statements
  • -A ufw-before-forward -j ufw-user-forward add to the end of subroutine ufw-before-forward, call (-jump to) subroutine ufw-user-forward
  • -A ufw-user-forward -i tap0 -o eno1 -j ACCEPT in subroutine ufw-user-forward, if the input interface is tap0, and the output interface is eno1, then ACCEPT the traffic, and pass it on to interface eno1.
  • -A ufw-user-forward -i eno1 -o tap0 -j ACCEPT in subroutine ufw-user-forward, if the input interface is eno1, and the output interface is tap0, then ACCEPT the traffic, and pass it on to interface tap0.
  • -A ufw-skip-to-policy-forward -j REJECT --reject-with icmp-port-unreachable. In this subroutine do not allow the packet to pass through, but send back a response icmp-port-unreachable. This is the response I saw in Wireshark.

With -j REJECT you can specify

icmp-net-unreachable
icmp-host-unreachable
icmp-port-unreachable
icmp-proto-unreachable
icmp-net-prohibited
icmp-host-prohibited
icmp-admin-prohibited

The processing starts at the top of the tree and goes into each relevant “subroutine” in sequence till it finds an ACCEPT or REJECT.
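The “subroutine” idea can be sketched in Python. This is a much simplified model, not the kernel's algorithm: each chain is a list of (match, target) pairs, a target is either a verdict or the name of another chain, and a chain with no matching rule returns to its caller. The chain contents are a cut-down version of my rules; putting ufw-reject-forward directly under FORWARD is an assumption made to keep the example short.

```python
VERDICTS = ("ACCEPT", "REJECT", "DROP")

# Simplified model of the rules: chain name -> list of (match, target)
CHAINS = {
    "FORWARD": [({}, "ufw-before-forward"), ({}, "ufw-reject-forward")],
    "ufw-before-forward": [({}, "ufw-user-forward")],
    "ufw-user-forward": [
        ({"in": "tap0", "out": "eno1"}, "ACCEPT"),
        ({"in": "eno1", "out": "tap0"}, "ACCEPT"),
    ],
    "ufw-reject-forward": [({}, "REJECT")],
}


def evaluate(chain, packet, chains=CHAINS):
    """Walk a chain; return a verdict, or None to return to the caller."""
    for match, target in chains[chain]:
        if any(packet.get(key) != value for key, value in match.items()):
            continue                      # rule does not match this packet
        if target in VERDICTS:
            return target                 # processing stops here
        verdict = evaluate(target, packet, chains)   # "call" the subroutine
        if verdict is not None:
            return verdict
    return None


def forward_verdict(packet, policy="DROP"):
    """The chain policy applies only if no rule gave a verdict."""
    return evaluate("FORWARD", packet) or policy


print(forward_verdict({"in": "tap0", "out": "eno1"}))
print(forward_verdict({"in": "tap0", "out": "wlan0"}))
```

A packet with no matching ACCEPT falls through to the REJECT rule, which is what happened to my traffic before I added the ufw route.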

If you use sudo iptables -L -v it lists all the rules and the use count. For example

Chain FORWARD (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
...
259 16364 ufw-before-forward all -- any any anywhere anywhere

Chain ufw-before-forward (1 references)
pkts bytes target prot opt in out source destination
...
77 4620 ufw-user-forward all -- any any anywhere anywhere

Chain ufw-user-forward (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- eno1 tap2 anywhere anywhere
0 0 ACCEPT all -- tap2 eno1 anywhere anywhere
9 540 ACCEPT all -- tap0 eno1 anywhere anywhere
0 0 ACCEPT all -- eno1 tap0 anywhere anywhere

Chain ufw-reject-forward (1 references)
pkts bytes target ...
45 2700 REJECT ... reject-with icmp-port-unreachable
  • For the packet forwarding it processed a number of “rules”
    • 259 packets were processed by subroutine ufw-before-forward
  • Within ufw-before-forward, there were several calls to subroutines
    • 77 packets were processed by subroutine ufw-user-forward
  • Within ufw-user-forward the line for tap0 to eno1 said there were 9 packets, which were forwarded when the input interface was tap0 and the output was eno1.
  • Within the subroutine ufw-reject-forward, 45 packets were rejected with icmp-port-unreachable.

The ufw-reject-forward was the only instance of icmp-port-unreachable with packet count > 0. This was the rule which blocked me.
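Looking for the one REJECT rule with a non-zero count is tedious by eye, so here is a rough Python sketch which does it. The function name is mine, and the middle columns of the REJECT line in the sample are filled in as an assumption, because I elided them above. It only understands plain numeric counters, so with large counts you would want exact counters (iptables -L -v -x) rather than abbreviated ones.

```python
# Sample listing; the "all -- any any anywhere anywhere" columns of the
# REJECT line are an assumption, as they were elided in the output above.
LISTING = """\
Chain ufw-user-forward (1 references)
pkts bytes target prot opt in out source destination
9 540 ACCEPT all -- tap0 eno1 anywhere anywhere
0 0 ACCEPT all -- eno1 tap0 anywhere anywhere

Chain ufw-reject-forward (1 references)
pkts bytes target prot opt in out source destination
45 2700 REJECT all -- any any anywhere anywhere reject-with icmp-port-unreachable
"""


def rejecting_rules(listing: str):
    """Return (chain, packet count, rest of rule) for used REJECT rules."""
    hits = []
    chain = None
    for line in listing.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Chain":
            chain = parts[1]              # remember which chain we are in
        elif (len(parts) >= 3 and parts[0].isdigit()
              and parts[2] == "REJECT" and int(parts[0]) > 0):
            hits.append((chain, int(parts[0]), " ".join(parts[3:])))
    return hits


for chain, count, rule in rejecting_rules(LISTING):
    print(chain, count, rule)
```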

Log file

In /var/log/ufw.log there was a [UFW BLOCK] entry for the address and port.

One minute networking: getting your data to flow around the corner; IP tunnelling

This is another of the little bits of networking knowledge, which, once you understand it, is obvious! Some of the documentation on the web is either wrong or is missing information.

The original problem

I wanted to use a route management protocol (OSPF) for managing the routing information known by each router. It has its own format packets. Not every device or router supports these packets.

You configure the interface name, and the OSPF data flows through the interface.

When the connection is a direct line, the data is passed to the remote system and it can use it. When the connection is indirect, for example via a wireless router, the wireless router does not know how to handle the OSPF packets and throws them away. The result is that my remote machine does not get the OSPF packets.

The solution – use a tunnel

One solution is to wrap the packets of data, so they get passed up to the router, round the corner, and back down to the remote system.

When I was employed, we had an internal mail system for paper correspondence . If we wanted to send a letter to a different site, we took the piece of internal mail, put it in an envelope and sent it through the national mail to the remote site. At the remote site, the mail room removed the external envelope, and sent the internal letter on to the recipient. It is a similar process with IP tunnelling.

I have a laptop with IP address A.B.C.D and a server with address W.X.Y.Z. I can ping from A.B.C.D to W.X.Y.Z, so there is an existing path between the machines.

You define a tunnel to W.X.Y.Z (the external envelope) and give which interface address on your system it should use. (Think of having two mail boxes for your letter, one for Royal Mail, another for FedEx).

You define a route so as to say to get to address p.q.r.s use tunnel ….
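The envelope idea can be written down as a toy model in Python. This is only a sketch of the concept, not of the real packet formats; the class and function names are mine, and the addresses are the ones used in the definitions later in this post.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: Union[bytes, "Packet"]


def encapsulate(inner: Packet, tunnel_local: str, tunnel_remote: str) -> Packet:
    """Put the inner packet into an outer envelope addressed to the tunnel end."""
    return Packet(src=tunnel_local, dst=tunnel_remote, payload=inner)


def decapsulate(outer: Packet) -> Packet:
    """At the far end, remove the envelope and recover the inner packet."""
    if not isinstance(outer.payload, Packet):
        raise ValueError("this packet did not come through the tunnel")
    return outer.payload


# The inner packet is the real request; routers only look at the envelope
inner = Packet(src="10.1.0.2", dst="192.168.3.3", payload=b"ping")
outer = encapsulate(inner, "192.168.1.222", "192.168.1.230")

print(outer.src, "->", outer.dst)                          # what the wireless router sees
print(decapsulate(outer).src, "->", decapsulate(outer).dst)  # what the server sees
```

The router in the middle only has to understand the envelope. The contents are looked at only when the envelope is removed at the far end.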

The definitions

The wireless interface for my laptop was 192.168.1.222 . The wireless address of my server was 192.168.1.230

I defined a tunnel from Laptop to Server called LS

sudo ip tunnel add LS mode gre local 192.168.1.222 remote 192.168.1.230 

Make it active, and define a route through the tunnel to the address on the server, 192.168.3.3 .

sudo ip link set LS  up
sudo ip route add 192.168.3.3 dev LS

If I ping 192.168.3.3 the enveloped packet goes to the server machine 192.168.1.230 . If this address is defined on the server the ping sends a response – and the ping worked!

Except it didn’t quite. The packet got there, but the response did not get back to my laptop.

At the server the ping “from” IP address was 10.1.0.2, attached to my laptop’s Ethernet. This was not known on the server.

I had three choices

  • Define a tunnel back from the server to the laptop.
  • Use ping -I 192.168.1.222 192.168.3.3 which says send the ping request to 192.168.3.3 , and set the originator address to 192.168.1.222. The server knows how to route to this address.
  • Define a route from the server back to my laptop.

The simplest option was to use ping -I … because no additional definitions are required.

This does not solve my problem

To get OSPF data from the server to my laptop, I need a tunnel from the server to my laptop; so a tunnel each way

Different sorts of data are used in an IP network

  • IPV6 and IPV4 – different network addressing schemes
  • unicast and multicast.
    • Unicast – Has one destination address, for example ping, or ftp
    • Multicast – Often used by routers and switches. A router can send a multicast to all nodes on the local network, for example ‘does any node have IP address a.b.c.d?‘. The data is cast to multiple nodes.

When I defined the tunnel above I initially specified mode ipip. There are different types of tunnel; mode ipip is just one. The list includes

  • ipip – Virtual tunnel interface IPv4 over IPv4. This can send unicast traffic, not multicast.
  • sit – Virtual tunnel interface IPv6 over IPv4.
  • ip6tnl – Virtual tunnel interface IPv4 or IPv6 over IPv6.
  • gre – Virtual tunnel interface GRE over IPv4. This supports IPv6 and IPv4, unicast and multicast.
  • ip6gre – Virtual tunnel interface GRE over IPv6. This supports IPv6 and IPv4, unicast and multicast.

The mode ipip did not work for the OSPF data.

I guess that the best protocol is gre.

Setting up a gre tunnel

You may need to load the gre functionality

sudo modprobe ip_gre
lsmod | grep gre

create your tunnel

sudo ip tunnel add GRE mode gre local 192.168.1.222 remote 192.168.1.230
sudo ip link set GRE up
sudo ip route add 192.168.3.3 dev GRE

and you will need a matching definition with the same mode at the remote end.

Displaying the tunnel

The command

ip link show dev AB 

gives information like

9: AB@NONE: mtu 1476 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/gre 192.168.1.222 peer 192.168.1.230

where

  • link/gre this was defined using mode gre
  • 192.168.1.222 the local interface to be used to send the traffic
  • peer 192.168.1.230 the IP address for the far end

The command

ip route 

gave me

192.168.3.3 dev AB scope link

so we can see it gets routed over link(tunnel AB).

Using the tunnel

I could use the tunnel name in my definitions, for example for OSPF

interface AB
area 0.0.0.0

IPV6 getting an address automagically

You can use static definitions to give a device or link an IP address. You can use modern (last 20 years) technology to do this for you – and get additional advantages.

A server application needs a fixed IP address and port. A client, connecting to the server, can use a different IP address and port on different days. This has the advantage that it makes it harder for the bad guys to track you from your address and port combination.

Client applications usually use the option “allocate me any free port”.

To get a different IP address every time you can use IPv6 Stateless Address Auto-configuration (SLAAC). It is called stateless because it does not need to remember any state information from one day to the next. The client application says “give me an IP address, any IP Address” and then uses the IP address, until the device is shutdown, or the interface is closed.

On Linux you need radvd for this to work.

Router Advertisement Daemon (radvd)

You used to have dedicated routers. Now you can run radvd on a computer and it acts like a router. You can run it on your personal machine, or run it in its own machine.

This supports Neighbor Discovery Protocol. When your machine connects to the network, it asks all routers on your local network for configuration information. It gets back a list of prefixes defined on the router (for example 2001:db8::/64). If your machine wants to send a packet to 2001:db8::99, it sends a request to all routers on the local network, asking if any router has 2001:db8::99 defined. If so, the router responds, and so your machine knows where to send the packet to.

When an IP address is allocated to a device, it sends a request to all devices in the local network, asking “does anyone have this address”. This avoids devices with the same IP address. It is known as Duplicate Address Detection (DAD).

My radvd config file

The syntax of the configuration file is defined here

For my interface vl100 I wanted it to give it IP addresses 2100… and 2200…

interface  vl100
{
AdvSendAdvert on;
MaxRtrAdvInterval 60;
MinDelayBetweenRAs 3;

prefix 2100::/64
{
AdvAutonomous on;
};
prefix 2200::/64
{
};
};

Where

  • AdvAutonomous on (the default) says support SLAAC

Creates

: vl100@enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2200::3905:281e:909b:5e00/64 scope global temporary dynamic
valid_lft 86398sec preferred_lft 14398sec
inet6 2200::8e16:45ff:fe36:f48a/64 scope global dynamic mngtmpaddr
valid_lft 86398sec preferred_lft 14398sec
inet6 2100::3863:da22:619a:42e0/64 scope global temporary dynamic
valid_lft 86398sec preferred_lft 14398sec
inet6 2100::8e16:45ff:fe36:f48a/64 scope global dynamic mngtmpaddr
valid_lft 86398sec preferred_lft 14398sec
inet6 fe80::8e16:45ff:fe36:f48a/64 scope link
valid_lft forever preferred_lft forever

See here for the meaning of the fields

The attributes of the connection include: scope global temporary dynamic

  • dynamic – the address was created by stateless address auto-configuration (SLAAC). If the address was created by an ip -6 addr add … dev … command, it will not have dynamic.
  • tentative – in the process of Duplicate Address Detection processing.
  • temporary – it expires after the time interval.
  • mngtmpaddr – is used as a template for temporary addresses.
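The mngtmpaddr addresses and the fe80 address above all end in 8e16:45ff:fe36:f48a. The ff:fe in the middle suggests this part was built from the interface's MAC address using the modified EUI-64 scheme: flip one bit in the first byte, and insert ff:fe in the middle. The sketch below shows the arithmetic. The MAC address 8c:16:45:36:f4:8a is worked backwards from the output and is an assumption, as is the function name.

```python
import ipaddress


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with a modified EUI-64 interface identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this scheme needs a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has six bytes")
    octets[0] ^= 0x02                       # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return network.network_address + int.from_bytes(interface_id, "big")


# MAC address inferred from the output above (an assumption)
mac = "8c:16:45:36:f4:8a"
for prefix in ("2100::/64", "2200::/64", "fe80::/64"):
    print(slaac_address(prefix, mac))
```

The temporary addresses use a random right hand part instead, which is why they look different.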

You can change the attributes of an address using the change command. For example to change the time out value

sudo ip -6 addr change 2200::… dev vl100 valid_lft 100 preferred_lft 10

For me it expired and generated another connection with the same address.

One minute networking: TCP buffer sizes

When data flows over a TCPIP connection there are several factors which control the rate at which data can be sent. You can influence some of these factors.

Data is sent as packets typically of size about 1440 bytes – because old hardware could only support this. You could use larger packets, but you may hit a router which chops it into smaller blocks.

The basic TCPIP flow

Consider a client-server connection. The client application wants to send some data to a server application.

  • The client uses send() to put some data into a TCPIP buffer and returns.
  • TCPIP sends some data (a packet) from this buffer, sets a timer and waits.
  • The server receives the data, and sends back an ACK saying so far I have received this many bytes from you.
  • The application on the server does a receive (if there is no data, the application is suspended until data arrives). If there is enough data to satisfy the receive, the application returns, otherwise it is suspended.
  • At the client end, when TCPIP has received the ACK, it no longer needs the data which has just been acknowledged. It can send more data.
  • If no ACK was received and the timer has timed out, TCPIP resends the data.
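The steps above can be sketched as a toy simulation. This is not real TCP, just the send, wait for ACK, resend on timeout loop; the function name and the lost_attempts input (which says which transmissions the pretend network drops) are made up for illustration.

```python
def send_with_retransmit(chunks, lost_attempts):
    """Send one packet at a time, resending when no ACK arrives.

    chunks: list of bytes objects, one per packet.
    lost_attempts: set of (packet_index, attempt_number) pairs that the
                   pretend network drops (hypothetical input).
    Returns (bytes acknowledged by the server, total transmissions).
    """
    acked = 0            # the "so far I have received this many bytes" count
    transmissions = 0
    for index, chunk in enumerate(chunks):
        attempt = 0
        while True:
            attempt += 1
            transmissions += 1
            if (index, attempt) in lost_attempts:
                continue             # timer popped with no ACK: resend
            acked += len(chunk)      # ACK received, data no longer needed
            break
    return acked, transmissions

# Three packets; the first attempt at sending the second packet is lost.
print(send_with_retransmit([b"a" * 1440, b"b" * 1440, b"c" * 500], {(1, 1)}))
# prints (3380, 4)
```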

There are several parts to this:

  • Putting things into the pipe – the send buffer
  • The pipe
  • Getting things from the pipe, the receive buffer

The send buffer

  • TCPIP has a buffer for its use.
  • The application
    • An application does a send() and passes data to TCPIP.
    • If there is space in the TCPIP buffer, the data is moved into the buffer, and the application returns.
    • If there is not enough space for all of the data, enough data is moved to fill the buffer, and the application waits until more space is available in the buffer.
    • When all of the data has been passed to the TCPIP buffer, the application returns, and can do more application work.
  • TCPIP
    • TCPIP takes a chunk of the buffer (a packet), sends it over the network, and sets a timer.
    • It can then process another chunk of data, and send it over the network, so there are multiple packets in flight.
    • When TCPIP at the far end has put the data into its receive buffer, it sends the ACK back.
    • When the local end has received the ACK for a chunk of data, it knows the data has been received by TCPIP at the remote end. It no longer needs to keep a copy of the data, and frees up the space in the buffer.

How big a buffer is needed to get good throughput?

Data stays in the TCPIP send buffer from the time the application puts it there until the ACK comes back, that is, the time waiting to be sent plus the round trip time. This could be 10s of milliseconds. Multiple packets can be in flight (perhaps 10s or 100s), which improves the throughput. For example send 10 packets and wait; when the first ACK is received, send another packet, and so on, so there are always 10 packets in flight.

If the buffer is too small the application has to wait. Increasing the send buffer size will increase throughput up to a point (when the application does not have to wait) after this point making it larger may make no difference.

As more data is in flight, the connection needs a bigger send buffer.
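A back-of-envelope calculation shows how the numbers fit together. This is a rough sketch: it assumes the only limit is that unacknowledged data must stay in the send buffer for one round trip, and it ignores loss and congestion. The function names and the 100 Mbit/s and 20 ms figures are just examples I picked.

```python
def send_buffer_needed(bytes_per_second, rtt_ms):
    """Bytes that must sit unacknowledged in the buffer to keep the pipe full."""
    return (bytes_per_second * rtt_ms + 999) // 1000   # rounded up

def packets_in_flight(buffer_bytes, packet_bytes=1440):
    """How many whole packets fit in the buffer at once."""
    return buffer_bytes // packet_bytes

def max_throughput(buffer_bytes, rtt_ms):
    """Upper bound in bytes/second: one buffer's worth per round trip."""
    return buffer_bytes * 1000 / rtt_ms

# A 100 Mbit/s link is 12,500,000 bytes/second. With a 20 ms round trip:
print(send_buffer_needed(12_500_000, 20))   # 250000 bytes
# A 64KB buffer with the same round trip time:
print(packets_in_flight(64 * 1024))         # 45 packets
print(max_throughput(64 * 1024, 20))        # 3276800.0 bytes/second
```

So in this example a 64KB send buffer would cap the connection at roughly a quarter of what the link could carry.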

An application can set the send buffer size using the SETSOCKOPT call. If this is not used, then there will be a TCPIP default send buffer size. On z/OS this is the system wide TCPCONFIG TCPSENDBFRSIZE …. parameter.

The default used to be 16KB, and currently is typically 64KB. There is a TCPIP enhancement where, if the send buffer size is larger than 64KB, TCPIP can dynamically increase it if this will improve performance. See Outbound Right Sizing (ORS).

Note: If you change the system wide send buffer size (TCPCONFIG TCPSENDBFRSIZE on z/OS), this will affect all applications that do not set the size using SETSOCKOPT. You should test this before putting it into production because it may affect many applications.

The receive buffer

At the receiving end, TCPIP has a buffer. Data from the network is put into this buffer. After the data has been put into the buffer, TCPIP sends back an ACK with three fields saying

  • so far I’ve received this many bytes from you
  • I’ve sent you this many bytes
  • my buffer has space for this many bytes

An application does a receive to get the data. If there is insufficient data to satisfy the receive, the application can wait, or return with just the data in the buffer, depending on the options.

If the receive buffer is full, any incoming data will be thrown away. If the application receives the data, then does lots of processing on it, then receives more data and so on, the receive buffer may fill up. Some applications receive the data, give it to a subtask to process, and immediately do another receive, to try to keep the receive buffer empty.

If the amount of arriving data is larger than the free space in the buffer, TCPIP will return “no space left in the buffer” as part of an ACK. The sender then knows to wait. When the application receives the data, and makes space, “x bytes are available in the buffer” is sent as part of the ACK, and the sender can start sending data again. This “space available” is known as the Window Size, and helps regulate the flow of data.

If you think about this for several minutes, you will realise that there is a time lag between the receive available buffer size going to zero, and the sender receiving the ACK saying no space in receive buffer. Any in-flight packets may get thrown away, or the end application may get all the data from the buffer. The “no space left in receive buffer” tells the sender to stop sending data until there is space in the buffer, and the sender may then reduce the amount of in-flight data.

A zero sized window is a sign of a problem: the application is not getting the data from the buffer fast enough.
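The way the window throttles the sender can be seen in a small tick-by-tick simulation. It is a simplification under my own assumptions: the sender always knows the current window (so the time lag described above is ignored), one packet is sent per tick, and the application reads a fixed number of bytes per tick. All the names are invented for the sketch.

```python
def simulate_window(total_bytes, receive_buffer, packet, app_reads_per_tick):
    """Return (ticks until all data is read, ticks the sender had to wait)."""
    if packet > receive_buffer or app_reads_per_tick <= 0:
        raise ValueError("packet must fit in the buffer and the app must read")
    sent = 0        # bytes the sender has put on the network
    buffered = 0    # bytes sitting in the receive buffer
    ticks = 0
    stalled = 0     # times the window was too small to send a packet
    while sent < total_bytes or buffered > 0:
        ticks += 1
        window = receive_buffer - buffered        # space advertised in the ACK
        size = min(packet, total_bytes - sent)
        if size > 0:
            if size <= window:
                buffered += size
                sent += size
            else:
                stalled += 1                      # sender waits for space
        buffered -= min(buffered, app_reads_per_tick)   # application receive
    return ticks, stalled

print(simulate_window(4000, 2000, 1000, 1000))   # application keeps up
print(simulate_window(4000, 2000, 1000, 500))    # slow application
```

With the slow application the buffer fills up, the sender has to wait, and the whole transfer takes longer.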

How big a receive buffer is needed to get good throughput?

If the buffer is too small the application has to wait, and packets may be thrown away.

An application can set the receive buffer size using the SETSOCKOPT call. If this is not used, then there will be a TCPIP default receive buffer size. On z/OS this is the TCPCONFIG TCPRCVBUFRSIZE …. parameter.

The maximum receive buffer size is specified in TCPMAXRCVBUFRSIZE.

If the receive buffer size is greater than 64KB, then a performance enhancement called Dynamic Right Sizing (DRS) can come into action which automatically increases the buffer size up to 2MB.

Inside the pipe

I have described the sender side filling the send buffer for the connection, and the application on the receiver side taking data from the connection’s receive buffer. I’ll look at the pipe in between.

Data is sent across the network in packets. The packets are usually small – for example 1500 bytes for Ethernet. Some protocols support larger packet sizes. Data sent within a z/OS system can have 56KB packet sizes. The Maximum Segment Size (MSS) is the maximum size of the user data in a packet.

If a packet is too large for a device, it may be cut into smaller chunks and then passed on – or the packet may just be dropped.

The simplest and slowest transmission is send one packet and wait for the ACK, then send another packet.

It is much more efficient to send multiple packets. For example send 10 packets, when the first ACK comes back (saying the first packet has been received), send the next packet and so on, so there are always 10 packets (or less) in the pipe.

The amount of data on the network is limited by the smaller of the send buffer size and the receive window size. This means you need both a big send buffer, and a big receive buffer to get maximum throughput.
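To see why one packet at a time is so slow, here is a simple model. It assumes sending a packet takes no time, nothing is lost, and every ACK comes back exactly one round trip after the packet was sent, so this is an idealised sketch rather than a prediction. The function names are mine.

```python
import math

def effective_window(send_buffer, receive_window):
    """The data in flight is limited by the smaller of the two."""
    return min(send_buffer, receive_window)

def transfer_time_ms(packets, window_packets, rtt_ms):
    """Time until the last ACK arrives, with window_packets in flight."""
    rounds = math.ceil(packets / window_packets)
    return rounds * rtt_ms

# 100 packets over a connection with a 20 ms round trip time.
print(transfer_time_ms(100, 1, 20))    # one at a time: 2000 ms
print(transfer_time_ms(100, 10, 20))   # 10 in flight: 200 ms

# A big send buffer does not help if the receive window is small.
limit = effective_window(64 * 1024, 16 * 1024)
print(limit, limit // 1440)            # 16384 bytes, 11 packets
```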

The TCP window is the maximum number of bytes that can be sent before the ACK must be received. If the network is unreliable it is better to keep the window small to reduce the amount of data that needs to be resent after a missing ACK.

Where can I get more information?

I wrote a blog post about tuning MQ channels which gives additional information.

How do I display this buffer information?

On z/OS you can use

  • TSO NETSTAT CONFIG command reports the default receive buffer size, the default send buffer size, and the default maximum receive buffer size
  • TSO NETSTAT ALL (IPPORT nnnn where nnnn is the port number.
  • TCPMON on GitHub to monitor the buffer and window sizes in near real time.

On Linux

You can use the command

  • ss -im -at '( dport = :21 )' which displays information about connections with destination port of 21.
  • ss -im -at '( dst = 10.1.1.2 )' which displays information about connections with destination IP address of 10.1.1.2.

Is there more information available about buffers and windows?

There is a lot of information on the web, but it is not usually easy to digest.

I thought this article was clear about the different buffers and windows.

How do I change the buffer sizes?

An application can change them using the SETSOCKOPT call (see here), with the options SO_RCVBUF and SO_SNDBUF.
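As an illustration, this is how it looks from Python, whose socket module passes the call through to the operating system. It is a minimal sketch; the helper name and the 256KB figure are my own. The operating system is free to adjust what you ask for (Linux, for example, doubles the value to allow for its own overhead and caps it at a system limit), so it is worth reading the value back. The sizes are normally set before the connection is made.

```python
import socket

def request_buffer_sizes(sock, send_bytes, receive_bytes):
    """Ask for buffer sizes and return what the system actually set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, receive_bytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

# A TCP socket which has not been connected yet.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(request_buffer_sizes(s, 256 * 1024, 256 * 1024))
```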

Some applications have their own way of setting the buffer sizes

  • MQ for midrange RcvBuffSize etc
  • MQ on z/OS use +cpf RECOVER QMGR(TUNE CHINTCPRBDYNSZ nnnnn)
    +cpf RECOVER QMGR(TUNE CHINTCPSBDYNSZ nnnnn)
  • FTP on Linux -x option

Otherwise the system defaults are used.

Other information provided with display commands

Commands like netstat provide other information

For example

  • round trip time – this is the average time in milliseconds between a packet being sent over the network and its ACK being received
  • RoundTripVariance – this gives the spread of the response times. It is the sum of the squares of the response times. A measure of the spread is the standard deviation = sqrt((sum of squares / N) - (average round trip time ** 2)), where N is the number of packets sent. If all the packets have the same round trip time, this will be close to zero.
  • Local 0 window count – the number of times there was 0 space in the receive buffer
  • Remote 0 window count – the number of times the remote end had 0 space in its receive buffer.

One minute networks: MAC address

A MAC address is a Media Access Control address. It has two parts: the manufacturer, and the manufacturer’s unique number for the device. For example on my laptop I have an Ethernet socket. I can see from a Wireshark trace that a packet is being broadcast with Ethernet address MicroStarInt_e9:31:2a, or 00 d8 61 e9 31 2a in hex. This adapter was made by Micro-Star International.

On a different machine the address is LCFCHeFE_36:f4:8a or 8c 16 45 36 f4 8a. This Ethernet adapter was made by LCFC Electronics Technology, with serial number 36f48a.
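Splitting an address into its two parts is simple enough to do by hand, but here is a small sketch in Python. The function name split_mac is made up; it accepts the address written with colons, spaces or hyphens.

```python
def split_mac(mac):
    """Split a MAC address into (manufacturer part, device part)."""
    octets = mac.lower().replace("-", ":").replace(" ", ":").split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError("expected six two-digit hex values: " + mac)
    for o in octets:
        int(o, 16)                   # raises ValueError if not hex
    return ":".join(octets[:3]), ":".join(octets[3:])

print(split_mac("00 d8 61 e9 31 2a"))   # ('00:d8:61', 'e9:31:2a')
print(split_mac("8c:16:45:36:f4:8a"))   # ('8c:16:45', '36:f4:8a')
```

The first three bytes are the part that identifies the manufacturer.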

On an Ethernet network, there are various IPV6 multicast addresses used to reach groups of devices, such as

  • ff02::1 – all nodes
  • ff02::2 – all routers
  • ff02::5 – all OSPF (Open Shortest Path First) routers

Using Wireshark I can see a packet sent to ff02::2, which is an IPV6 Router Solicitation request from 8c:16:45:36:f4:8a. This is basically saying “have any routers been configured on this Ethernet network, if so, please tell me”. I can map the 8c:16:45:36:f4:8a back to the Ethernet adapter LCFCHeFE_36:f4:8a.

Wireshark has logic to map the Ethernet address prefix to the manufacturer, and convert 8c:16:45 to LCFCHeFE.
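The same MAC address also turns up inside some of the IPV6 addresses on the vl100 interface shown earlier, such as 2100::8e16:45ff:fe36:f48a. This matches the standard "modified EUI-64" way of building the right hand half of an address from a MAC address: flip one bit in the first byte and insert ff:fe in the middle. The sketch below reproduces it; the function names are mine, and the randomised temporary addresses are not built this way.

```python
import ipaddress

def eui64_interface_id(mac):
    """Build the 8 byte interface identifier from a 6 byte MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02                          # flip the universal/local bit
    return bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])

def slaac_address(prefix, mac):
    """Combine a /64 prefix with the interface identifier."""
    net = ipaddress.ip_network(prefix)
    iid = int.from_bytes(eui64_interface_id(mac), "big")
    return str(ipaddress.ip_address(int(net.network_address) | iid))

print(slaac_address("2100::/64", "8c:16:45:36:f4:8a"))
# 2100::8e16:45ff:fe36:f48a
print(slaac_address("fe80::/64", "8c:16:45:36:f4:8a"))
# fe80::8e16:45ff:fe36:f48a
```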