Open Fabrics Enterprise Distribution (OFED)
		    IPoIB in OFED 1.4 Release Notes
			  
			   December 2008


===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. DHCP Support of IPoIB
5. The ib-bonding driver
6. Bug Fixes and Enhancements Since OFED 1.3
7. Bug Fixes and Enhancements Since OFED 1.3.1
8. Performance tuning

===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that enables transmitting IP and ARP
protocol packets over an InfiniBand UD channel. The implementation conforms to
the relevant IETF working group's RFCs (http://www.ietf.org).


===============================================================================
2. New Features
===============================================================================
1. Datagram mode: Large Receive Offload (LRO) reduces the CPU overhead of
   handling received packets. LRO aggregates multiple incoming packets from a
   single stream into a larger buffer before they are passed higher up the
   networking stack, thus reducing the number of packets that have to be
   processed.
   This feature is enabled on HCAs that support LRO, e.g., ConnectX.
2. Datagram mode: Large Send Offload (LSO) allows the networking stack to pass
   SKBs with a data size larger than the MTU to the IPoIB driver and have the
   HCA hardware fragment the data into multiple MSS-sized packets. To support
   this, the release adds a device capability flag, IB_DEVICE_UD_TSO, for
   devices that can perform TCP segmentation offload; a new send work request
   opcode, IB_WR_LSO; header, hlen, and mss fields in the work request
   structure; and a new completion type, IB_WC_LSO.
   This feature is enabled on HCAs that support LSO, e.g., ConnectX.
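
Whether these offloads are active on a given interface can be checked from
user space. The following is a minimal sketch, not part of OFED: the helper
name offload_state is hypothetical, and it assumes that the installed driver
and ethtool version report LRO and TSO state for the IPoIB interface in the
output of "ethtool -k", which may not hold on every kernel.

```shell
#!/bin/sh
# Hypothetical helper: print the state (on/off) of one offload feature.
# $1    feature name, e.g. large-receive-offload
# stdin output of "ethtool -k <interface>"
offload_state() {
    # Accept both the newer "large-receive-offload" spelling and the
    # older "large receive offload" spelling used by early ethtool versions.
    pattern=$(printf '%s' "$1" | sed 's/-/[- ]/g')
    sed -n "s/^[[:space:]]*$pattern:[[:space:]]*\([a-z]*\).*/\1/p" | head -n 1
}
```

Example usage (prints nothing if the feature is not reported at all):
   ethtool -k ib0 | offload_state large-receive-offload
   ethtool -k ib0 | offload_state tcp-segmentation-offload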


Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
   cat /sys/class/net/ib0/mode
2. To disable IPoIB CM at compile time, enter:
   cd OFED-1.4
   export OFA_KERNEL_PARAMS="--without-ipoib-cm"
   ./install.pl
3. To change the run-time configuration for IPoIB, edit
   /etc/infiniband/openib.conf and change the following parameters:
   # Enable IPoIB Connected Mode
   SET_IPOIB_CM=yes
   # Set IPoIB MTU
   IPOIB_MTU=65520

4. You can also change the mode and MTU for a specific interface manually.
   
   To enable connected mode for interface ib0, enter:
   echo connected > /sys/class/net/ib0/mode
   
   To increase MTU, enter:
   ifconfig ib0 mtu 65520

5. Switching between CM and UD mode can be done at run time.

   To set the mode of ib0 to UD, enter:
   echo datagram > /sys/class/net/ib0/mode

   To set the mode of ib0 to CM, enter:
   echo connected > /sys/class/net/ib0/mode
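
The manual steps above can be wrapped in a small script that validates the
requested mode and reads the value back after writing it. This is a sketch
under stated assumptions: the function name set_ipoib_mode and the SYSFS_NET
variable are hypothetical (SYSFS_NET exists only so the function can be tried
against a scratch directory) and are not provided by OFED.

```shell
#!/bin/sh
# SYSFS_NET is an assumption of this sketch; it defaults to the real sysfs path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Hypothetical helper: set_ipoib_mode <interface> <datagram|connected>
set_ipoib_mode() {
    ifname="$1"
    mode="$2"
    mode_file="$SYSFS_NET/$ifname/mode"

    # Only the two modes described in this document are accepted.
    case "$mode" in
        datagram|connected) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac

    if [ ! -e "$mode_file" ]; then
        echo "$mode_file not found; is $ifname an IPoIB interface?" >&2
        return 1
    fi

    echo "$mode" > "$mode_file" || return 1

    # Read the value back to confirm that the write took effect.
    [ "$(cat "$mode_file")" = "$mode" ]
}
```

Example usage, followed by the MTU change described in item 4:
   set_ipoib_mode ib0 connected && ifconfig ib0 mtu 65520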


===============================================================================
3. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
   different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
   they are connected to the same IB Switch, then the host violates the IP rule
   requiring different broadcast domains. Consequently, the host may build an
   incorrect ARP table.

   The correct setting of a multi-homed IPoIB host is achieved by using a
   different PKEY for each IP subnet. If a host has multiple interfaces on the
   same IP subnet, then to prevent a peer from building an incorrect ARP entry
   (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
   stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
   causes the network stack to send ARP replies only on the interface with the
   IP address specified in the ARP request:

   sysctl -w net.ipv4.conf.ib0.arp_ignore=1
   sysctl -w net.ipv4.conf.ib1.arp_ignore=1

   Or, globally,

   sysctl -w net.ipv4.conf.all.arp_ignore=1

   To learn more about the arp_ignore parameter, see
   Documentation/networking/ip-sysctl.txt in the kernel source tree.
   Note that distributions have the means to make kernel parameters persistent.

2. There are IPoIB alias lines in modprobe.conf which prevent stopping/
   unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). 
   These alias lines cause the drivers to be loaded again by udev scripts.

   Workaround: Set OFA_KERNEL_PARAMS="--without-modprobe" in the environment
   before running install.pl, or remove the alias lines from modprobe.conf.
   
3. On SLES 10:
   The ib1 interface uses the configuration script of ib0.

   Workaround: Invoke ifup/ifdown using both the interface name and the
   configuration script name (example: ifup ib1 ib1).

4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
   MTU is reduced to 2K.
   Workaround: Re-enable connected mode and increase MTU manually:
   echo connected > /sys/class/net/ib0/mode
   ifconfig ib0 mtu 65520

5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
   standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
   and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
   does not prevent the loading of IPoIB on boot.

6. On RedHat EL 4 up4, the IPoIB implementation is not spec-compliant:
   - IPoIB multicast does not work
   - IPoIB cannot interoperate between RHEL4U4 and other hosts. This is due to
     missing code in the kernel which was available in U3 and U5 but removed in
     U4. As a workaround, upgrade to RHEL4U5.

7. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
   messages and a small MTU for datagram (in particular, multicast) messages,
   and relies on path MTU discovery to adjust the MTU appropriately. Packets
   sent to a destination before path MTU discovery has reduced the MTU for
   that destination are dropped, producing the following message in the
   system log:
   "packet len <actual length> (> <max allowed length>) too long to send, dropping"

   To warn about this, a message is produced in the system log each time MTU is
   set to a value higher than 2K.

8. In connected mode, TCP latency for short messages is approximately 1 usec
   (~5%) higher than in datagram mode. As a workaround, use datagram mode.

9. Single-socket TCP bandwidth is lower on kernels older than 2.6.18 than on
   newer kernels. For best IPoIB performance, use kernel 2.6.18 or later.

10. Connectivity issues have been encountered when using IPv6 on ia64 systems.

11. The IPoIB module uses the Linux implementation of Large Receive Offload
    (LRO) in kernel 2.6.24 and later. These kernels require installing the
    "inet_lro" module.
    
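For known issue 1 above, the per-interface sysctl commands can be generated
for every non-child IPoIB interface on the host. The sketch below is
illustrative only: the function names and the SYSFS_NET variable are
hypothetical, and it assumes that IPoIB interface names start with "ib" and
that child interfaces carry a dot in their name (e.g., ib0.8001). It prints
the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# SYSFS_NET is an assumption of this sketch; it defaults to the real sysfs path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Hypothetical helper: list IPoIB parent (non-child) interfaces, one per line.
list_ipoib_parents() {
    for entry in "$SYSFS_NET"/ib*; do
        [ -e "$entry" ] || continue     # the glob matched nothing
        name=${entry##*/}
        case "$name" in
            *.*) continue ;;            # child interface, e.g. ib0.8001
        esac
        echo "$name"
    done
}

# Hypothetical helper: print one arp_ignore command per parent interface.
print_arp_ignore_cmds() {
    for ifname in $(list_ipoib_parents); do
        echo "sysctl -w net.ipv4.conf.$ifname.arp_ignore=1"
    done
}
```

Example usage (review the output, then pipe it to a shell as root):
   print_arp_ignore_cmds
   print_arp_ignore_cmds | sh
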
===============================================================================
4. DHCP Support of IPoIB
===============================================================================
Note: To use DHCP the user must apply a special patch (see "DHCP Notes" below).

DHCP Supported Operating Systems
--------------------------------
1. SLES 10
2. RHEL 5
3. All kernels from 2.6.14 and up

DHCP Unsupported Operating Systems
----------------------------------
RedHat EL 4 distributions are not supported.


DHCP Notes
----------
1. It may be necessary to use UDP ports other than the well-known ports
   (67 and 68). In that case, choose free port numbers greater than 0x8000. To
   specify a server or a client port number, use the option -p <port number>.
   The client's port number must be the chosen server's port number plus one.

2. For IPoIB to use DHCP, you must patch ISC's DHCP. The patch file can be
   found under OFED-1.4/docs/dhcp after extracting the distribution file.
   (After installation it can also be found under <prefix>/docs/dhcp.) The
   patch should be applied for the server and for each client. Tests were run
   on version 3.0.4 of the DHCP package.
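
The port rule in note 1 (server port greater than 0x8000, client port equal
to the server port plus one) can be checked before starting the daemons. The
following is a minimal sketch; the function name dhcp_client_port is
hypothetical and not part of OFED or ISC DHCP. It additionally requires the
client port to fit in 16 bits, which the note implies but does not state.

```shell
#!/bin/sh
# Hypothetical helper: print the client port that matches a server port,
# or fail if the server port does not satisfy the rule in note 1.
dhcp_client_port() {
    server_port="$1"

    # Accept only plain decimal numbers without a leading zero.
    case "$server_port" in
        ''|0*|*[!0-9]*) echo "port must be a decimal number" >&2; return 2 ;;
    esac

    # 0x8000 is 32768; the client port (server + 1) must not exceed 65535.
    if [ "$server_port" -le 32768 ] || [ "$server_port" -ge 65535 ]; then
        echo "server port must be in the range 32769..65534" >&2
        return 1
    fi

    echo $((server_port + 1))
}
```

Example usage: "dhcp_client_port 40000" prints 40001, so the server would be
started with -p 40000 and each client with -p 40001.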


===============================================================================
5. The ib-bonding driver
===============================================================================
The ib-bonding driver is a High Availability solution for IPoIB interfaces. 
It is based on the Linux Ethernet Bonding Driver and was adapted to work with
IPoIB. The ib-bonding package contains a bonding driver and a utility called 
ib-bond to manage and control the driver operation. 
The ib-bonding driver comes with the ib-bonding package (run rpm -qi ib-bonding
to get the package information).

Using the ib-bonding driver
---------------------------
The ib-bonding driver can be loaded manually or automatically.

1. Manual operation:
   Use the utility ib-bond to start, query, or stop the driver. For details on
   this utility, read the documentation for the ib-bonding package.

2. Automatic operation:
   Use standard OS tools (sysconfig in SuSE and initscripts in RedHat)
   to create a configuration that will come up with network restart. For details
   on this, read the documentation for the ib-bonding package.

Notes:
* Using /etc/infiniband/openib.conf to create a persistent configuration is
  no longer supported.


===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: One should manually
  specify the full IP configuration or use the ofed_net.conf file. See
  OFED_Installation_Guide.txt for details on IPoIB configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
  SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel Oops during "port up/down test" (bug 1040)
- Restarting the stack while iperf 2.0.4 is running on the client side causes
  a kernel panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
- Fix CQ size calculations for IPoIB
- Bonding: Enable build for SLES10 SP2
- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
  documentation for details)

===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to improve
  failover response
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting bond
- Bonding: Set default number of gratuitous ARPs sent after failover to three
  (was one)

===============================================================================
8. Performance tuning
===============================================================================
- In IPoIB connected mode, the throughput of medium and large messages can be
  increased by setting the following TCP and network core parameters:

        /sbin/sysctl -w net.ipv4.tcp_timestamps=0
        /sbin/sysctl -w net.ipv4.tcp_sack=0
        /sbin/sysctl -w net.core.netdev_max_backlog=250000
        /sbin/sysctl -w net.core.rmem_max=16777216
        /sbin/sysctl -w net.core.wmem_max=16777216
        /sbin/sysctl -w net.core.rmem_default=16777216
        /sbin/sysctl -w net.core.wmem_default=16777216
        /sbin/sysctl -w net.core.optmem_max=16777216
        /sbin/sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
        /sbin/sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
        /sbin/sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
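
Before applying the values above, it can be useful to record the current
settings so that the tuning can be reverted. The sketch below is one possible
approach and is not part of OFED: the function name snapshot_sysctl and the
PROC_SYS variable are hypothetical (PROC_SYS exists only so the function can
be tried against a scratch directory). It prints each parameter as
"key = value", a form that "sysctl -p <file>" can read back.

```shell
#!/bin/sh
# PROC_SYS is an assumption of this sketch; it defaults to the real procfs path.
PROC_SYS="${PROC_SYS:-/proc/sys}"

# Hypothetical helper: snapshot_sysctl <key>...
# Prints the current value of each named parameter as "key = value".
snapshot_sysctl() {
    for key in "$@"; do
        # net.ipv4.tcp_sack maps to $PROC_SYS/net/ipv4/tcp_sack
        file="$PROC_SYS/$(printf '%s' "$key" | tr . /)"
        if [ -r "$file" ]; then
            # Multi-value entries such as tcp_rmem are tab separated;
            # turn the tabs into single spaces.
            value=$(tr -s '\t' ' ' < "$file")
            printf '%s = %s\n' "$key" "$value"
        else
            echo "skipping $key: $file is not readable" >&2
        fi
    done
}
```

Example usage:
        snapshot_sysctl net.ipv4.tcp_timestamps net.ipv4.tcp_sack \
            net.ipv4.tcp_rmem net.ipv4.tcp_wmem > /tmp/sysctl.before
        /sbin/sysctl -p /tmp/sysctl.before     (restores the saved values)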