From Earlham Cluster Department
To start, there are two flavours of the Infiniband software stack: the OpenFabrics Enterprise Distribution (OFED) as shipped by Red Hat, and Mellanox OFED.
You can either get the required Infiniband packages from the RHEL package manager, or directly from Mellanox.
When making your choice, keep in mind the following:
We had oddities with our IB network until we started using the Mellanox OFED. One of the joys of OFED as an industry standard is that every IB vendor has their own perversion of it. What makes it especially frustrating is that RHEL/CentOS ship their own OFED and disentangling them in an automated way can be challenging. Mellanox OFED will uninstall RHEL OFED during its installation, but woe be unto the one who tries to do a "yum upgrade" at some point in the future. -Skylar Thompson
Mellanox OFED will remove a previous installation of RHEL OFED. After installation you will have to protect the packages in the "infiniband support" yum group from RHEL's versions, because a yum upgrade will otherwise overwrite the Mellanox files with RHEL's packages. More on this later, after installation.
We opted for installing the latest Mellanox OFED.
First, download the Mellanox OFED package that matches your CentOS release. Once the download is complete, run the following commands:
tar -xvzf /path/to/MLNX.tgz
cd MLNX
cd MLNX_OFED_LINUX-2.4-1.0.0-rhel6.5-x86_64/
ls
./mlnxofedinstall --all
Then reboot for good measure.
After installation, to prevent yum upgrade issues you can add exclusions to the /etc/yum.conf file. View the packages in the yum group "Infiniband Support" using the command
yum groupinfo "Infiniband Support"
Using a text editor, open /etc/yum.conf and add the line
exclude=ibutils-libs*
followed by all packages you don't want upgraded (found in the Infiniband Support group list). Do not add spaces, and use commas to separate the packages. All of the packages listed by the groupinfo command should be added after ibutils-bin* (be mindful of the wildcard).
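Typing a long exclude line by hand is error-prone, so here is a minimal sketch of a helper that builds it. The function name build_exclude_line is hypothetical (it is not part of yum), and it assumes you have already copied the package names you want to protect into a plain list, one name per line. It does not parse yum groupinfo output itself.

```shell
#!/bin/sh
# build_exclude_line: hypothetical helper, not part of yum.
# Reads package names (one per line) on stdin and prints a single
# "exclude=" line: each name gets a trailing wildcard, names are
# joined with commas, and no spaces are emitted.
build_exclude_line() (
    list=""
    # The "|| [ -n ... ]" keeps a final line that lacks a newline.
    while IFS= read -r pkg || [ -n "$pkg" ]; do
        # Remove any whitespace around the name; skip blank lines.
        pkg=$(printf '%s' "$pkg" | tr -d '[:space:]')
        [ -z "$pkg" ] && continue
        if [ -z "$list" ]; then
            list="${pkg}*"
        else
            list="${list},${pkg}*"
        fi
    done
    printf 'exclude=%s\n' "$list"
)
```

For example, printf 'ibutils-libs\nibutils-bin\n' | build_exclude_line prints exclude=ibutils-libs*,ibutils-bin*, which can then be pasted into /etc/yum.conf.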
IPoIB vs. native IB, or NFS / RDMA
IPoIB implements an IP layer on top of Infiniband, so ordinary TCP/IP applications can use the fabric, and presents the Host Channel Adapter (HCA) to the system as a network interface (e.g., ib0).
Using Infiniband "natively" with NFS / RDMA potentially allows for sending messages (packets) with greater bandwidth and significantly less CPU usage / involvement, as long as you have RDMA-compatible hardware of course.
Unreliable Datagram vs. Connected Mode
I have read that connected mode is comparable to using jumbo frames, since it allows a much larger MTU (thus favorable), but recently it seems datagram mode has become more stable and is preferred. In any case you can switch between modes at run-time with:
echo datagram > /sys/class/net/ibX/mode
echo connected > /sys/class/net/ibX/mode
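The two echo commands can be wrapped in a small function that rejects a mistyped mode and reads the value back afterwards. This is only a sketch: set_ipoib_mode is a hypothetical name, and the optional third argument (the sysfs directory) is an assumption added so the function can be tried against a scratch directory instead of the live /sys/class/net.

```shell
#!/bin/sh
# set_ipoib_mode IFACE MODE [SYSFS_NET_DIR]
# Hypothetical wrapper around "echo MODE > /sys/class/net/IFACE/mode".
# SYSFS_NET_DIR defaults to /sys/class/net.
set_ipoib_mode() (
    iface=$1
    mode=$2
    netdir=${3:-/sys/class/net}
    # Only the two IPoIB modes are accepted.
    case $mode in
        datagram|connected) ;;
        *) echo "unknown mode: $mode" >&2; exit 2 ;;
    esac
    modefile="$netdir/$iface/mode"
    if [ ! -e "$modefile" ]; then
        echo "no such IPoIB mode file: $modefile" >&2
        exit 1
    fi
    echo "$mode" > "$modefile" || exit 1
    # Read the value back so the caller can see what is now set.
    cat "$modefile"
)
```

Run as root, set_ipoib_mode ib0 datagram would switch ib0 to datagram mode and print the mode the file reports afterwards.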
Set up IPoIB
Installing the Mellanox Infiniband drivers with the --all flag should configure much of IPoIB already. There is a network configuration file in
You can configure IPoIB to use its own static IP address, or use the network configuration for an existing Ethernet configuration.
Here is an example ifcfg-ib<n> taken from the [ref:two Mellanox user manual].
# Static settings; all values provided by this file
IPADDR_ib0=126.96.36.199
NETMASK_ib0=255.255.0.0
NETWORK_ib0=188.8.131.52
BROADCAST_ib0=184.108.40.206
ONBOOT_ib0=1
# Based on eth0; each '*' will be replaced with a corresponding octet
# from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=220.127.116.11
BROADCAST_ib0=18.104.22.168
ONBOOT_ib0=1
# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=22.214.171.124
BROADCAST_ib0=126.96.36.199
ONBOOT_ib0=1
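The NETWORK_ and BROADCAST_ values in a static stanza follow from the address and the netmask, so it is worth checking them before writing the file. The sketch below computes both using plain shell arithmetic. net_and_bcast is a hypothetical helper name, and the address used in the usage note (11.4.1.2) is only an illustrative value, not one taken from our cluster.

```shell
#!/bin/sh
# net_and_bcast IP NETMASK
# Prints "<network> <broadcast>" for a dotted-quad IPv4 address and
# netmask. Hypothetical helper for checking NETWORK_/BROADCAST_ values.
net_and_bcast() (
    ip=$1
    mask=$2
    set -f          # no globbing while we split unquoted values
    IFS=.           # split on dots (this subshell only)
    set -- $ip $mask
    [ $# -eq 8 ] || { echo "expected two dotted quads" >&2; exit 2; }
    # Network: address AND mask, octet by octet.
    net="$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    # Broadcast: address OR inverted mask, octet by octet.
    bc="$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
    echo "$net $bc"
)
```

For instance, net_and_bcast 11.4.1.2 255.255.0.0 prints 11.4.0.0 11.4.255.255, which are the values you would expect for NETWORK_ib0 and BROADCAST_ib0 with that address and mask.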
Subnet Manager (OpenSM) (We run our subnet manager on the switch, this can largely be ignored)
If your Infiniband switch does not support a subnet manager in hardware, you will need to set up opensm to run on the head node. Upon installation the opensm daemon will be found in /etc/init.d/opensmd. To streamline things, enable tab completion of the init script names (opensmd as well as any others the service command does not already complete) using:
complete -W "$(ls /etc/init.d/)" service
Next, attempt to start opensm by using
service opensmd start
Make sure that opensmd is set to start on boot-up. Check its current setting with
chkconfig --list opensmd
and, if it is off, turn it on with
chkconfig opensmd on
Should the service not start for any reason, use lsmod | grep ^ib to check which Infiniband modules are loaded. Here is an example of the output you should see:
ib_ucm 12120 0
ib_ipoib 122881 0
ib_cm 42214 3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs 61976 2 rdma_ucm,ib_ucm
ib_umad 12562 0
ib_sa 35753 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad 43632 4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core 117605 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr 7796 3 rdma_cm,ib_uverbs,ib_core
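Rather than comparing the listing by eye, a short function can report which of the modules you care about are absent. This is a sketch under the assumption that the input is in lsmod's usual column layout (module name in the first column); missing_modules is a hypothetical name.

```shell
#!/bin/sh
# missing_modules MODULE...
# Reads lsmod-style text on stdin and prints each named module that
# does not appear in the first column. Prints nothing if all of the
# named modules are present.
missing_modules() (
    # Keep only the module-name column of the input.
    loaded=$(awk '{ print $1 }')
    for mod in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string.
        printf '%s\n' "$loaded" | grep -qxF -- "$mod" || echo "$mod"
    done
)
```

A typical call would be lsmod | missing_modules ib_umad ib_ipoib ib_core, and any name it prints is a candidate for loading with modprobe.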
I found that the ib_umad module is directly related to opensm. If it or any other modules aren't loaded, you will need to add them to /etc/rc.modules with
echo modprobe "module name" >> /etc/rc.modules
and then update permissions with
chmod +x /etc/rc.modules
For example, to load ib_umad at boot:
echo modprobe ib_umad >> /etc/rc.modules
chmod +x /etc/rc.modules
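Appending with >> adds a duplicate line every time the command is repeated. A minimal sketch of an idempotent version follows. ensure_modprobe_line is a hypothetical name, and the optional file argument is an assumption added so the function can be tried on a scratch file before touching /etc/rc.modules.

```shell
#!/bin/sh
# ensure_modprobe_line MODULE [FILE]
# Appends "modprobe MODULE" to FILE (default /etc/rc.modules) unless
# that exact line is already present, then marks FILE executable.
ensure_modprobe_line() (
    mod=$1
    file=${2:-/etc/rc.modules}
    line="modprobe $mod"
    # Create the file if it does not exist yet.
    touch "$file" || exit 1
    # Append only when no identical line is found.
    grep -qxF -- "$line" "$file" || echo "$line" >> "$file"
    chmod +x "$file"
)
```

Calling ensure_modprobe_line ib_umad twice leaves a single modprobe ib_umad line in the file.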
Subnet Manager Failover
Setting up failover for opensm isn't challenging, but it is good to document which nodes are the subnet managers as the behavior of the network will be strange without any of the managers running. We discovered that with our GPFS cluster when we accidentally rebooted both managers at the same time - no nodes could join the network, including the subnet managers, until we took some manual action. -Skylar Thompson
Failover is necessary when running a subnet manager (SM) on your Infiniband machines (rather than on a switch). Essentially, failover is a configuration that ensures that if one machine goes down, there is guaranteed to be an SM running on another machine. With Infiniband, you need an SM to be active, otherwise the machines will not be able to communicate with each other.
Certain Infiniband switches can run a subnet manager. This is ideal, and in this situation failover is not necessary. To configure our switch, the Mellanox SX6018, you need to connect the console port to the serial port of an Infiniband machine. Next, install and run the serial terminal program minicom and log in with username admin and password admin. Go through the configuration wizard (the defaults are fine). We did not enable IPv6.
yum install minicom
And set the port to
Running the switch setup wizard. Run minicom and log in. Then run the following commands, typing only what follows the prompt (> or #):
switch > enable
switch # configure terminal
switch (config) # jump-start
From the Mellanox switch manual:
Before attempting a remote (for example, SSH) connection to the switch, check the mgmt0 interface configuration. Specifically, verify the existence of an IP address. To check the current mgmt0 configuration, enter the following commands:
Note that the commands start after the prompt (> or #).
switch > enable
switch # configure terminal
switch (config) # show interfaces mgmt0
Enabling / Running the Subnet Manager
You can enable, manage, configure, and run the subnet manager (along with many other things) through the Mellanox switch web interface control (management) panel. However, if you don't want to bother with getting it working, you can simply enable the subnet manager straight from a logged-in minicom switch prompt.
Again, the commands start after the prompt (> or #).
switch > enable
switch # configure terminal
switch (config) # ib sm
We tested OpenMPI using a prime number generator script found here: /cluster/home/charliep/cvs-hopper/primes/. We ran mpirun, specifying the desired number of machines using a
Creating a Machine / Hosts File
A machinefile, or hostfile, lists information about the nodes for mpirun to use. You should be able to make an appropriate file and run it on the connected machines using Infiniband.
make primes_batch
mpirun -np 4 -hostfile hostfile ./primes_batch
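If you don't already have a hostfile, the sketch below generates one in the "name slots=N" form that Open MPI's mpirun reads. make_hostfile is a hypothetical helper, and the node names in the usage note (node0, node1) are placeholders for whatever your machines are called.

```shell
#!/bin/sh
# make_hostfile SLOTS NODE...
# Prints one "NODE slots=SLOTS" line per node, i.e. an Open MPI style
# hostfile giving every listed node the same number of slots.
make_hostfile() (
    [ $# -ge 2 ] || { echo "usage: make_hostfile SLOTS NODE..." >&2; exit 2; }
    slots=$1
    shift
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
)
```

For example, make_hostfile 2 node0 node1 > hostfile writes two lines (node0 slots=2 and node1 slots=2), which provides the four slots that a run with -np 4 needs.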
The OFED comes with loads of testing programs.
Testing with CHARMM