Infiniband

From Earlham Cluster Department

Revision as of 17:08, 20 April 2017 by Anschwa (Talk | contribs)


Installing

To start, there are two flavours of the Infiniband software stack: the OpenFabrics Enterprise Distribution (OFED) as shipped with RedHat, and Mellanox OFED.

You can either get the required Infiniband packages from the RHEL package manager, or directly from Mellanox.

When making your choice, keep in mind the following:

We had oddities with our IB network until we started using the Mellanox OFED. One of the joys of OFED as an industry standard is that every IB vendor has their own perversion of it. What makes it especially frustrating is that RHEL/CentOS ship their own OFED and disentangling them in an automated way can be challenging. Mellanox OFED will uninstall RHEL OFED during its installation, but woe be unto the one who tries to do a "yum upgrade" at some point in the future. -Skylar Thompson

Mellanox OFED will remove a previous installation of RHEL OFED. After installation you will have to exclude the packages in the "Infiniband Support" yum group from upgrades, because a yum upgrade would otherwise overwrite the Mellanox files with RHEL's packages. More on this later, after installation.

Installation

We opted for installing the latest Mellanox OFED.

First, download the Mellanox OFED archive that matches your CentOS release. Once the download is complete, run the following commands:

tar -xvzf /path/to/MLNX.tgz 
cd MLNX
cd MLNX_OFED_LINUX-2.4-1.0.0-rhel6.5-x86_64/
ls
./mlnxofedinstall --all

Then reboot for good measure.

After installation, to prevent yum upgrade issues you can add exclusions to the /etc/yum.conf file. View the packages in the yum group "Infiniband Support" using the command
yum groupinfo "Infiniband Support"
Using a text editor, open /etc/yum.conf and add the line
exclude=ibutils-libs*
followed by all the packages you don't want upgraded (found in the Infiniband Support group list). Separate the packages with commas and do not add spaces (e.g., exclude=foobar1,foobar2).

All of the packages listed by the groupinfo command should be added after ibutils-libs* (be mindful of the wildcard).
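
As a rough sketch of the step above, the exclude line can be assembled from a list of package patterns instead of being typed by hand. The helper name build_exclude_line and the foobar package names are placeholders for illustration only; use the names that yum groupinfo reports on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of yum): join package patterns into a
# single comma-separated "exclude=" line suitable for /etc/yum.conf.
build_exclude_line() {
    line="exclude="
    sep=""
    for pkg in "$@"; do
        line="${line}${sep}${pkg}"
        sep=","
    done
    printf '%s\n' "$line"
}

# Placeholder patterns; quote them so the shell does not expand the '*'.
build_exclude_line 'ibutils-libs*' 'foobar1' 'foobar2'
```

The printed line can then be pasted into /etc/yum.conf. Review it before doing so, since an overly broad wildcard will also block upgrades you may want.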

IPoIB vs. native IB, or NFS / RDMA

IPoIB implements a TCP/IP layer on top of Infiniband and presents the Host Channel Adapter (HCA) to the system as a Network Interface Card (NIC), for example ib0.

Using Infiniband "natively" with NFS / RDMA potentially allows for sending messages (packets) with greater bandwidth and significantly less CPU involvement, as long as you have RDMA-compatible hardware.

Unreliable Datagram vs. Connected Mode

I have read that connected mode is comparable to using jumbo frames because it allows a much larger MTU (thus favorable), but recently it seems datagram mode has become more stable and is preferred. In any case you can switch between modes at run time (replace ibX with your interface, e.g. ib0):

echo datagram > /sys/class/net/ibX/mode 
echo connected > /sys/class/net/ibX/mode
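
The two echo commands above can be wrapped in a small function that validates the mode name and reports the result. This is only a sketch: the name ipoib_mode and its optional third argument (an alternative sysfs root, added so the function can be tried against a scratch directory) are assumptions, not part of any Mellanox tooling.

```shell
#!/bin/sh
# Hypothetical helper: show or change the IPoIB mode of one interface.
#   $1  interface name, e.g. ib0
#   $2  new mode (datagram or connected); empty means "just show"
#   $3  sysfs root; defaults to the real /sys/class/net
ipoib_mode() {
    iface=$1
    new_mode=$2
    root=${3:-/sys/class/net}
    mode_file="$root/$iface/mode"

    if [ ! -e "$mode_file" ]; then
        echo "no mode file for $iface" >&2
        return 1
    fi
    case $new_mode in
        "") ;;                                      # no change requested
        datagram|connected) echo "$new_mode" > "$mode_file" ;;
        *) echo "unknown mode: $new_mode" >&2; return 2 ;;
    esac
    cat "$mode_file"                                # print the active mode
}
```

For example, ipoib_mode ib0 prints the current mode, and ipoib_mode ib0 connected (run as root) switches it.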

Set up IPoIB

Installing the Mellanox Infiniband drivers with the --all flag should configure much of IPoIB already. There is a network configuration file in /etc/sysconfig/network-scripts/ifcfg-ib0.

You can configure IPoIB to use its own static IP address, or use the network configuration for an existing Ethernet configuration.

Here is an example ifcfg-ib<n> taken from the Mellanox user manual. It shows three alternative configurations; use only one of them.

# Static settings; all values provided by this file
IPADDR_ib0=11.4.3.175
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

# Based on eth0; each '*' will be replaced with a corresponding octet
# from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
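
To make the '*' substitution concrete, here is a small illustration of the rule described in the comments above: every '*' octet in the pattern is replaced by the octet in the same position of the Ethernet address. The function derive_ib_addr and the address 192.168.7.23 are made up for this example; this mimics the documented behaviour and is not the code the Mellanox scripts run.

```shell
#!/bin/sh
# Hypothetical illustration of the '*' octet substitution.
#   $1  address pattern, e.g. 11.4.*.*
#   $2  reference IPv4 address (for instance, that of eth0)
derive_ib_addr() {
    pattern=$1
    ref=$2
    result=""
    i=1
    while [ "$i" -le 4 ]; do
        p=$(printf '%s\n' "$pattern" | cut -d. -f"$i")   # i-th pattern octet
        r=$(printf '%s\n' "$ref" | cut -d. -f"$i")       # i-th reference octet
        if [ "$p" = "*" ]; then
            p=$r
        fi
        result="${result}${result:+.}${p}"
        i=$((i + 1))
    done
    printf '%s\n' "$result"
}

# With an assumed eth0 address of 192.168.7.23:
derive_ib_addr '11.4.*.*' '192.168.7.23'
```

A pattern without any '*' comes back unchanged, which corresponds to the static configuration.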

Subnet Manager (OpenSM) (We run our subnet manager on the switch, so this section can largely be ignored)

OpenSM setup

If your Infiniband switch does not support running a subnet manager in hardware, you will need to set up opensm to run on the head node.

After installation, the opensm daemon's init script is found at /etc/init.d/opensmd. To streamline things, you can make tab completion for the service command offer it (along with every other init script in /etc/init.d) using:
complete -W "$(ls /etc/init.d/)" service
Next, attempt to start opensm with
service opensmd start
Then check whether opensmd is set to start at boot
chkconfig --list opensmd
and, if it is not, enable it with
chkconfig opensmd on
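
The check-then-enable step can be scripted. The sketch below assumes the usual one-line chkconfig --list output of the form "opensmd 0:off 1:off 2:on 3:on ..."; the helper name is_on_in_runlevel is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read one "chkconfig --list <service>" line on stdin
# and succeed only if the service is "on" in the given runlevel.
is_on_in_runlevel() {
    level=$1
    # Put each whitespace-separated field on its own line, then look for
    # an exact "<level>:on" field.
    tr -s ' \t' '\n\n' | grep -qx "${level}:on"
}

# Usage on a SysV-init system (run as root):
#   chkconfig --list opensmd | is_on_in_runlevel 3 || chkconfig opensmd on
```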

Troubleshooting

Should the service not start for any reason, use lsmod | grep ^ib to check which Infiniband modules are loaded. Here is an example of the output you should see:

ib_ucm                 12120  0 
ib_ipoib              122881  0 
ib_cm                  42214  3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs              61976  2 rdma_ucm,ib_ucm
ib_umad                12562  0 
ib_sa                  35753  5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad                 43632  4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core               117605  12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr                 7796  3 rdma_cm,ib_uverbs,ib_core

I found that the ib_umad module is directly related to opensm. If it or any other module isn't loaded, add it to the /etc/rc.modules file so that it is loaded at boot

echo modprobe "module name" >> /etc/rc.modules
and then make the file executable
chmod +x /etc/rc.modules

Example:

echo modprobe ib_umad >> /etc/rc.modules
chmod +x /etc/rc.modules
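
Rather than comparing the lsmod output by eye, a short function can report which of the modules you care about are missing. This is a sketch; the name missing_modules is hypothetical, and the list of required modules is whatever you pass in.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod output on stdin and print every module
# named in the arguments that does not appear in it.
missing_modules() {
    # First column of lsmod is the module name; skip the header line.
    loaded=$(awk 'NR > 1 || $1 != "Module" { print $1 }')
    for mod in "$@"; do
        printf '%s\n' "$loaded" | grep -qxF "$mod" || printf '%s\n' "$mod"
    done
}

# Usage:
#   lsmod | missing_modules ib_umad ib_ipoib ib_core
```

Each name it prints is a candidate for a modprobe line in /etc/rc.modules.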

Subnet Manager Failover

Setting up failover for opensm isn't challenging, but it is good to document which nodes are the subnet managers as the behavior of the network will be strange without any of the managers running. We discovered that with our GPFS cluster when we accidentally rebooted both managers at the same time - no nodes could join the network, including the subnet managers, until we took some manual action. -Skylar Thompson

Failover is necessary when running a subnet manager (SM) on your Infiniband machines (rather than on a switch). Essentially, failover is a configuration that ensures that if one machine goes down, an SM is still running on another machine. With Infiniband, an SM must be active; otherwise the machines will not be able to communicate with each other.
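
When more than one SM is running, only one acts as master and the others stand by. As I understand the Infiniband rule, the SM with the highest priority wins, and ties go to the lowest port GUID. The sketch below only illustrates that rule on a list you write down yourself; the function pick_master_sm, the host names, and the GUIDs are all made up, and it does not query the fabric.

```shell
#!/bin/sh
# Hypothetical illustration of master SM selection.
# Input lines on stdin: "<priority> <port GUID> <host>"
# GUIDs are assumed to be fixed-width lowercase hex so that a plain
# byte-wise sort orders them numerically.
pick_master_sm() {
    LC_ALL=C sort -k1,1nr -k2,2 | head -n 1 | awk '{ print $3 }'
}

# Made-up example: head and node1 share priority 15, node1 has the
# lower GUID.
printf '%s\n' \
    '10 0x0002c90300a1b2c4 node2' \
    '15 0x0002c90300a1b2c9 head' \
    '15 0x0002c90300a1b2c3 node1' | pick_master_sm
```

Keeping such a list alongside the cluster documentation also records which nodes are expected to run a subnet manager.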

Switch Configuration

Initial Setup

Certain Infiniband switches can run a subnet manager. This is ideal and in this situation, failover is not necessary. To configure our switch, the Mellanox SX6018, connect its console port to the serial port of an Infiniband machine. Next, install and run the serial terminal program minicom and log in with username admin and password admin. Go through the configuration wizard (the defaults are fine). We did not enable IPv6.

Installing minicom

yum install minicom

Then open the minicom setup menu and set the serial port to /dev/ttyS0

minicom -s

To run the switch setup wizard, start minicom and log in, then enter the following commands (the text after the > or # prompt).

switch > enable
switch # configure terminal
switch (config) # jump-start

From the Mellanox switch manual:

Before attempting a remote (for example, SSH) connection to the switch, check the mgmt0 interface configuration. Specifically, verify the existence of an IP address. To check the current mgmt0 configuration, enter the following commands:

Note that the commands start after the > or #.

switch > enable
switch # configure terminal
switch (config) # show interfaces mgmt0 

Enabling / Running the Subnet Manager

You can enable, manage, configure, and run the subnet manager (along with many other things) through the Mellanox switch web interface control (management) panel. However, if you don't want to bother with getting it working, you can simply enable the subnet manager straight from a logged-in minicom switch prompt.

Again, the commands start after the > and #.

switch > enable
switch # configure terminal
switch (config) # ib sm

OpenMPI testing

We tested OpenMPI using a prime number generator program found here: /cluster/home/charliep/cvs-hopper/primes/. We ran primes_batch with mpirun, specifying the desired number of machines using a machinefile / hostfile.

Creating a Machine / Hosts File

A machinefile, or hostfile, lists the nodes for mpirun to use, one per line, optionally with the number of slots (processes) each node may run. Once you have an appropriate file, you can build the program and run it across the machines connected by Infiniband.

make primes_batch
mpirun -np 4 -hostfile hostfile ./primes_batch
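
A hostfile for Open MPI is a plain text file with one "hostname slots=N" entry per line. As a minimal sketch, it can be generated from a list of node names; the function make_hostfile and the names node1 and node2 are placeholders, not our actual machines.

```shell
#!/bin/sh
# Hypothetical helper: print an Open MPI hostfile that gives every listed
# node the same number of slots.
#   $1     slots per node
#   $2...  node names
make_hostfile() {
    slots=$1
    shift
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Two placeholder nodes with two slots each; redirect to a file to use it:
#   make_hostfile 2 node1 node2 > hostfile
make_hostfile 2 node1 node2
```

With four slots in total, a run with four processes fills both nodes.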

General testing

OFED comes with many diagnostic and benchmarking programs, such as ibstat, ibv_devinfo, and the ib_*_bw / ib_*_lat bandwidth and latency tests.
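
A quick first check is whether the local ports are up. The sketch below counts the ports that ibstat reports as active; it assumes ibstat's usual per-port "State: Active" line, and the function name count_active_ports is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read ibstat output on stdin and print the number of
# ports whose logical state is Active. The "Physical state:" lines are
# ignored because they do not start with "State:".
count_active_ports() {
    grep -c '^[[:space:]]*State: Active'
}

# Usage on a node with an HCA installed:
#   ibstat | count_active_ports
```

A count of zero on a cabled node usually means either the link is down or no subnet manager is running on the fabric.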

Testing with CHARMM

References
