Infiniband
From Earlham Cluster Department
Infiniband
Installing
To start, there are two flavours of the Infiniband software stack: the OpenFabrics Enterprise Distribution (OFED) as packaged by Red Hat, and Mellanox OFED.
You can either get the required Infiniband packages from the RHEL package manager, or directly from Mellanox.
When making your choice, keep in mind the following:
We had oddities with our IB network until we started using the Mellanox OFED. One of the joys of OFED as an industry standard is that every IB vendor has their own perversion of it. What makes it especially frustrating is that RHEL/CentOS ship their own OFED and disentangling them in an automated way can be challenging. Mellanox OFED will uninstall RHEL OFED during its installation, but woe be unto the one who tries to do a "yum upgrade" at some point in the future. -Skylar Thompson
Mellanox OFED will remove a previous installation of RHEL OFED. After installation you will have to treat the packages in the "Infiniband Support" yum group separately, because a yum upgrade will otherwise overwrite the Mellanox files with RHEL's packages. More on this later, after installation.
Installation
We opted for installing the latest Mellanox OFED.
First, download the Mellanox OFED package that matches your CentOS release (the commands below assume the .tgz archive). Once the download is complete, run the following commands:
    tar -xvzf /path/to/MLNX.tgz
    cd MLNX
    cd MLNX_OFED_LINUX-2.4-1.0.0-rhel6.5-x86_64/
    ls
    ./mlnxofedinstall --all
Then reboot for good measure.
After installation, to prevent yum upgrade issues you can add exclusions to the /etc/yum.conf file. View the packages in the yum group "Infiniband Support" using the command
    yum groupinfo "Infiniband Support"
Using a text editor, open /etc/yum.conf and add a line starting with exclude=ibutils-libs* followed by all the packages you don't want upgraded (found in the Infiniband Support group list). Separate the packages with commas and do not add spaces, e.g.
    exclude=foobar1,foobar2
All of the packages listed by the groupinfo command should be added after that first ibutils entry (be mindful of the wildcard).
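As a rough illustration of the comma-separated format described above, the sketch below joins a list of package patterns into one exclude line. The helper name build_exclude_line is made up for this page, and the package names are placeholders. The real list should come from the groupinfo output.

```shell
#!/bin/sh
# Hypothetical helper: join package names or glob patterns into a single
# yum.conf "exclude=" line, comma-separated and without spaces.
build_exclude_line() {
    line="exclude="
    sep=""
    for pkg in "$@"; do
        line="${line}${sep}${pkg}"
        sep=","
    done
    printf '%s\n' "$line"
}
```

For example, build_exclude_line 'ibutils-libs*' foobar1 foobar2 prints a line that can be pasted into /etc/yum.conf. Quote any pattern containing a wildcard so the shell does not expand it.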
IPoIB vs. native IB, or NFS / RDMA
IPoIB implements a TCP/IP layer on top of Infiniband and presents the Host Channel Adapter (HCA) to the system as a Network Interface Card (NIC), e.g. ib0.
Using Infiniband "natively" with NFS / RDMA potentially allows for sending messages (packets) with greater bandwidth and significantly less CPU involvement, as long as you have RDMA-compatible hardware of course.
Unreliable Datagram vs. Connected Mode
I have read that connected mode is comparable to using jumbo frames (thus favorable), but recently it seems datagram has become more stable and is preferred. In any case you can switch between modes at run-time with:
    echo datagram > /sys/class/net/ibX/mode
    echo connected > /sys/class/net/ibX/mode
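The mode switch can be wrapped in a small function that rejects anything other than the two mode names. This is only a sketch: set_ipoib_mode and the SYSFS_NET variable are names invented here, and SYSFS_NET exists purely so the function can be tried against a scratch directory instead of the real /sys tree.

```shell
#!/bin/sh
# SYSFS_NET is a hypothetical override; by default the real sysfs path is used.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Hypothetical helper: set_ipoib_mode <interface> <datagram|connected>
set_ipoib_mode() {
    iface="$1"
    mode="$2"
    # Only the two IPoIB modes are accepted.
    case "$mode" in
        datagram|connected) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    mode_file="$SYSFS_NET/$iface/mode"
    if [ ! -w "$mode_file" ]; then
        echo "cannot write $mode_file" >&2
        return 1
    fi
    echo "$mode" > "$mode_file"
    # Read the value back so the caller sees what is now in effect.
    cat "$mode_file"
}
```

On a real node this would be run as root, for example set_ipoib_mode ib0 datagram.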
Set up IPoIB
Installing the Mellanox Infiniband drivers with the --all flag should configure much of IPoIB already. There is a network configuration file in /etc/sysconfig/network-scripts/ifcfg-ib0.
You can configure IPoIB to use its own static IP address, or use the network configuration for an existing Ethernet configuration.
Here is an example ifcfg-ib<n> taken from the Mellanox OFED user manual (second entry under References).
    # Static settings; all values provided by this file
    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1

    # Based on eth0; each '*' will be replaced with a corresponding octet
    # from eth0.
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1

    # Based on the first eth<n> interface that is found (for n=0,1,...);
    # each '*' will be replaced with a corresponding octet from eth<n>.
    LAN_INTERFACE_ib0=
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
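To make the '*' substitution in the example concrete, here is a small sketch that fills each '*' in an address template with the matching octet of an Ethernet address. The function derive_ib_addr is hypothetical. It only imitates the behaviour described in the comments of the example and is not part of Mellanox OFED. The eth0 address used in the usage note is also made up.

```shell
#!/bin/sh
# Hypothetical helper: derive_ib_addr <template> <lan-address>
# Each '*' octet in the template is replaced by the octet in the same
# position of the LAN address; other octets are kept as written.
derive_ib_addr() {
    awk -v tpl="$1" -v lan="$2" 'BEGIN {
        n = split(tpl, t, "[.]")
        m = split(lan, l, "[.]")
        if (n != 4 || m != 4) exit 1   # both must be dotted quads
        out = ""
        for (i = 1; i <= 4; i++) {
            octet = (t[i] == "*") ? l[i] : t[i]
            out = out (i > 1 ? "." : "") octet
        }
        print out
    }'
}
```

With an assumed eth0 address of 192.168.3.175, derive_ib_addr '11.4.*.*' 192.168.3.175 yields 11.4.3.175, the same address as in the static variant of the example.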
Subnet Manager (OpenSM) (We run our subnet manager on the switch, so this section can largely be ignored)
OpenSM setup
If your Infiniband switch does not support running a subnet manager in hardware, you will need to set up opensm to run on the head node.
Upon installation, the init script for the opensm daemon will be found in /etc/init.d/opensmd. To streamline things, enable tab completion of init script names (opensmd as well as any others not already offered) for the service command using:
    complete -W "$(ls /etc/init.d/)" service
Next, attempt to start opensm by using
    service opensmd start
Make sure that opensmd is set to start on boot-up. Check its current setting with
    chkconfig --list opensmd
and enable it with
    chkconfig opensmd on
Troubleshooting
Should the service not start for any reason, use lsmod | grep ^ib to check which Infiniband modules are loaded. Here is an example of the output you should see:
    ib_ucm 12120 0
    ib_ipoib 122881 0
    ib_cm 42214 3 ib_ucm,rdma_cm,ib_ipoib
    ib_uverbs 61976 2 rdma_ucm,ib_ucm
    ib_umad 12562 0
    ib_sa 35753 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
    ib_mad 43632 4 ib_cm,ib_umad,mlx4_ib,ib_sa
    ib_core 117605 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
    ib_addr 7796 3 rdma_cm,ib_uverbs,ib_core
I found that the ib_umad module is directly related to opensm. If it or any other module isn't loaded, you will need to add it to the /etc/rc.modules file
    echo modprobe "module name" >> /etc/rc.modules
and then update permissions
    chmod +x /etc/rc.modules
Example:
    echo modprobe ib_umad >> /etc/rc.modules
    chmod +x /etc/rc.modules
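Checking the module list by eye is easy to get wrong, so a sketch like the following can report which expected modules are absent. It reads lsmod-style text on standard input. The function name missing_ib_modules is invented for this page, and which modules to require is an assumption you should adapt to your own hardware.

```shell
#!/bin/sh
# Hypothetical helper: missing_ib_modules <module>...
# Reads lsmod-style output on stdin and prints every named module
# that does not appear in the first column.
missing_ib_modules() {
    loaded=$(awk '{ print $1 }')
    for mod in "$@"; do
        # -x matches whole lines, -F treats the name as a fixed string.
        if ! printf '%s\n' "$loaded" | grep -qxF "$mod"; then
            printf '%s\n' "$mod"
        fi
    done
}
```

A typical call would be lsmod | missing_ib_modules ib_umad ib_ipoib ib_core. Each name it prints is a candidate for a modprobe line in /etc/rc.modules.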
Subnet Manager Failover
Setting up failover for opensm isn't challenging, but it is good to document which nodes are the subnet managers as the behavior of the network will be strange without any of the managers running. We discovered that with our GPFS cluster when we accidentally rebooted both managers at the same time - no nodes could join the network, including the subnet managers, until we took some manual action. -Skylar Thompson
Failover is necessary when running a subnet manager (SM) on your Infiniband machines (rather than on a switch). Essentially, failover is a configuration that ensures that if one machine goes down, an SM is still running on another machine. With Infiniband, you need an SM to be active, otherwise the machines will not be able to communicate with each other.
Switch Configuration
Initial Setup
Certain Infiniband switches can run a subnet manager. This is ideal, and in this situation failover is not necessary. To configure our switch, the Mellanox SX6018, you need to connect the console port to the serial port of an Infiniband machine. Next, install and run the serial terminal program minicom and log in with username admin and password admin. Go through the configuration wizard (the defaults are fine). We did not enable IPv6.
Installing minicom
yum install minicom
And set the port to /dev/ttyS0
minicom -s
Running the switch setup wizard. Run minicom and log in. Then run the following commands (type only the part that follows the > or # prompt).
    switch > enable
    switch # configure terminal
    switch (config) # jump-start
From the Mellanox switch manual:
Before attempting a remote (for example, SSH) connection to the switch, check the mgmt0 interface configuration. Specifically, verify the existence of an IP address. To check the current mgmt0 configuration, enter the following commands:
Note that the commands start after the > or #.
    switch > enable
    switch # configure terminal
    switch (config) # show interfaces mgmt0
Enabling / Running the Subnet Manager
You can enable, manage, configure, and run the subnet manager (along with many other things) through the Mellanox switch web interface control (management) panel. However, if you don't want to bother with getting it working, you can simply enable the subnet manager straight from a logged-in minicom switch prompt.
Again, the commands start after the > and #.
    switch > enable
    switch # configure terminal
    switch (config) # ib sm
OpenMPI testing
We tested OpenMPI using a prime number generator program found here: /cluster/home/charliep/cvs-hopper/primes/. We ran primes_batch with mpirun, specifying the desired number of machines using a machinefile / hostfile.
Creating a Machine / Hosts File
A machinefile, or hostfile, lists information about the nodes for mpirun to use. You should be able to write an appropriate file, build the program, and run it on the machines connected over Infiniband.
    make primes_batch
    mpirun -np 4 --hostfile hostfile ./primes_batch
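For reference, an Open MPI hostfile has one line per node, optionally followed by slots=<n> to say how many processes that node may run. The sketch below generates such a file. The function name make_hostfile and the node names in the usage note are placeholders, not the names of our machines.

```shell
#!/bin/sh
# Hypothetical helper: make_hostfile <slots-per-node> <host>...
# Prints one Open MPI hostfile line per host: "<host> slots=<n>".
make_hostfile() {
    slots="$1"
    shift
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots"
    done
}
```

For example, make_hostfile 2 node0 node1 > hostfile writes a two-node file that the mpirun command above can use with -np 4.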
General testing
The OFED comes with loads of testing programs.
- ibping
- ibdiagnet
- ibstatus
- ibstat
- ibhosts
- ibswitches
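Before relying on any of these tools it can help to confirm which ones are actually on the PATH, since the set installed depends on the OFED flavour and the install options. The following is a sketch only. list_ib_tools is a made-up name, and the function just reports presence without running the tools.

```shell
#!/bin/sh
# Hypothetical helper: list_ib_tools <command>...
# Reports, for each named command, whether it can be found on the PATH.
list_ib_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: found\n' "$tool"
        else
            printf '%s: missing\n' "$tool"
        fi
    done
}
```

A typical call would be list_ib_tools ibping ibdiagnet ibstatus ibstat ibhosts.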
Testing with CHARMM
References
- http://www.mellanox.com/page/products_dyn?product_family=26
- http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v2.2-1.0.1.pdf
- http://www.shocksolution.com/2012/12/installing-and-configuring-infiniband-on-a-red-hat-system/
- https://access.redhat.com/solutions/301643
- https://niktips.wordpress.com/2011/02/02/activating-infiniband-stack-in-linux/
- https://software.intel.com/en-us/articles/understanding-the-infiniband-subnet-manager/
- https://docs.oracle.com/cd/E19802-01/820-2189-10/ib-nem-sw-overview.html
- http://people.redhat.com/dledford/infiniband_get_started.html
- http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
- https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt
- http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
- http://www.mcs.anl.gov/~balaji/pubs/2010/ispass/ispass10.ipoib.pdf
- http://www.bctes.com/nat-linux-iptables.html
- http://www.mellanox.com/page/products_dyn?product_family=150&mtag=sx6015_sx6018
- https://thegeekinthecorner.wordpress.com/category/infiniband-verbs-rdma/
- http://www.mellanox.com/related-docs/user_manuals/SX60XX_User_Manual.pdf
- http://www.cyberciti.biz/tips/connect-soekris-single-board-computer-using-minicom.html
- http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch04.html#Z1226348317tls
- https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt
- https://community.mellanox.com/docs/DOC-2172