From Earlham Cluster Department

Revision as of 19:11, 15 July 2015 by Buzzlightyear (Talk | contribs)


Current To Do

Cluster Pages

New Software

Building software packages

Installing a yum package into Modules structure

Enabling built software packages within Modules structure

If you think your new package is important enough to be loaded by default, then add it to the list in /mounts/al-salam/software/Modules/3.2.7/init/al-salam.{sh,csh}

DNS/DHCP for a single host

Find an IP that's not in use. The easiest way to do that is to look in the zone file under /var/named/master/. At the top of the file, be sure to change the serial number to represent the year, month, day, and version. Then add the name and IP following the pattern in the file, like below.

    <hostname>	IN	A	<IP address>

Save the file. Every time you add an entry to the zone file, you also have to edit the reverse zone file, which lives under /var/named/master/. Add an entry for the host you added in the zone file. Notice that the first number there is the last octet of the IP that you gave the host.

    126	IN	PTR 
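As a rough sketch of the two bookkeeping values described above, the shell functions below build a serial number in year, month, day, version form and pull the last octet out of an address for the reverse zone entry. The function names and the sample address are made up for illustration; check the results against the pattern already in your zone files.

```shell
# Sketch only: zone_serial and ptr_label are hypothetical helper names.

# Build a serial such as 2015071501 from a date (YYYY-MM-DD) and a
# per-day version number (pass the version without a leading zero).
zone_serial() {
    printf '%s%02d\n' "$(printf '%s' "$1" | tr -d '-')" "$2"
}

# Print the last octet of an IPv4 address, which is the first field of
# the PTR entry in the reverse zone file.
ptr_label() {
    printf '%s\n' "${1##*.}"
}

zone_serial 2015-07-15 1     # prints 2015071501
ptr_label 192.168.1.126      # prints 126 (example address, not a real host)
```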

Next, restart DNS by stopping and then starting it with the following commands.

   service named stop
   service named start

Now that DNS is updated, we have to update DHCP. The file you want to edit is /etc/dhcp/dhcpd.conf. Towards the bottom of the file you'll add

    host <hostname> { hardware ethernet <MACaddress>; fixed-address <hostname>; }

Save the file. Just like we did for DNS, we need to stop and then start DHCP.
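To avoid typos in the braces and semicolons, a host stanza like the one above can be generated rather than typed. This is a minimal sketch; the function name, hostname, and MAC address are invented for the example.

```shell
# Sketch only: dhcp_host_entry is a hypothetical helper name.
# $1 = hostname (it must already resolve in DNS), $2 = MAC address
dhcp_host_entry() {
    printf 'host %s { hardware ethernet %s; fixed-address %s; }\n' "$1" "$2" "$1"
}

# Example with a made-up hostname and MAC address:
dhcp_host_entry node01 00:11:22:33:44:55
```

The output can be reviewed and then pasted near the bottom of /etc/dhcp/dhcpd.conf.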

   service dhcpd stop 
   service dhcpd start

As a test, reboot the client.

Setting up LDAP

Installing and configuring LDAP can be tedious and frustrating, but no worries! I went through the trouble and took notes as I went so no one else would have to suffer like I did! These notes are pretty detailed, but I would suggest using one of the other servers with a newer CentOS version (layout, fatboy) as a reference when installing and configuring, especially if you are configuring it for a cluster.

Packages that need to be installed (both head and compute nodes):

We use NSS and NSLCD in conjunction with PAM for ldap authentication. It may be older than SSSD, but we already know how to do it. So, we want to turn off SSSD. If sssd is not running, then great, that'll make your life a lot easier!

   service sssd stop
   chkconfig sssd off #so it doesn't restart if the machine reboots
   chkconfig --del sssd #delete it as a service because we don't want it

There are a lot of files that need to be modified in order for ldap to work correctly.

   URI ldap://
   BASE dc=cluster, dc=loc
   TLS_CACERTDIR /etc/openldap/cacerts
    passwd:  ldap files
    group:    ldap files
    shadow:  files ldap

    ethers:     files
    netmasks:   files
    networks:   files
    protocols:  files
    rpc:        files
    services:   files ldap

    netgroup:   ldap files

    publickey:  nisplus

    automount:  files ldap
    aliases:    files

    sudoers:    ldap files
   nss_base_passwd ou=people,dc=cluster,dc=loc?one
   nss_base_shadow ou=people,dc=cluster,dc=loc?one
   nss_base_group ou=group,dc=cluster,dc=loc?one
   nss_map_attribute uid userName
   nss_map_attribute gidNumber gid
   nss_map_attribute uidNumber uid
   base dc=cluster,dc=loc
   pam_password crypt
   uri ldap://
   ssl no
   tls_cacertdir /etc/openldap/cacert
   uri ldap://
   instead of pam_sss.so, it should be
   base   group  ou=group,dc=cluster,dc=loc
   base   passwd ou=people,dc=cluster,dc=loc
   base   shadow ou=people,dc=cluster,dc=loc
   uid nslcd
   gid ldap
   uri ldap://
   base dc=cluster, dc=loc
   ssl no
   tls_cacertdir /etc/openldap/cacerts
   UsePAM yes

Since we deleted sssd, we need to start the alternative, and make sure it starts on boot up.

   service nslcd start
   chkconfig nslcd on

Check to make sure nscd is turned off. nscd is the name service caching daemon, which caches lookups such as the ones answered by LDAP. Since we're so small here, we don't really need that.

   service nscd stop
   chkconfig nscd off

Users and Groups

Users are authenticated using an LDAP (Lightweight Directory Access Protocol) server running on Hopper. This is how users are authenticated throughout the entire cluster realm. We use LDAP for all users and groups except for the ccg-admin user, the root user, and the wheel group. Those users and that group are local to each cluster. Every user is a part of the users LDAP group, which is group number 115, and all clusters should look at LDAP first and then files. This is specified in the /etc/nsswitch.conf file.

A user can change their password with the passwd command while on Hopper. It will prompt them for their current password and then for their desired new password. If it's successful, it will say so at the end, with something like 'All LDAP tokens successfully changed.' or something close to that.
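To confirm the lookup order on a given node, the relevant line can be pulled out of /etc/nsswitch.conf. The function below is a small sketch; its name is made up, and it simply prints whatever sources are listed for the database you ask about.

```shell
# Sketch only: nss_order is a hypothetical helper name.
# $1 = database name (passwd, group, shadow, ...)
# $2 = file to read (defaults to /etc/nsswitch.conf)
nss_order() {
    awk -v db="$1:" '$1 == db { $1 = ""; sub(/^ +/, ""); print }' \
        "${2:-/etc/nsswitch.conf}"
}

# Usage on a cluster node (the output should begin with ldap):
#   nss_order passwd
#   nss_order group
```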

Creating New Users

Because creating users in LDAP is somewhat confusing, sample files and a Python script were written to help. The script lives in ~root/ldap-files/ on hopper. I'll explain things here, but there's a README file in that directory that explains everything as well.

To create a user in LDAP, you must create an .ldif file for that user. This is what the script does for you. It takes a file of new users as a command line argument. The file must specify First Name:username:email for each user, with each user on a separate line. The file add-user.ldif is an example of what the file should look like.

Running the script with sudo python and passing add-user.ldif as its argument will create an .ldif file for each user and use ldapadd to add them to the LDAP database. The contents of the .ldif file for each user added will be printed to the screen, and each user will be sent an email with their username and password. The script is set to clean up after itself, so you don't need to worry about that. There's one more thing that has to be done after this step: we need to set up the ssh keys for each user. For each user created:
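Before handing a user list to the script, it can help to check that every line really has the three colon-separated fields described above. The checker below is a sketch and is not part of the files in ~root/ldap-files/; the function name and sample data are invented.

```shell
# Sketch only: check_user_list is a hypothetical helper name.
# Prints every line of the file in $1 that does not have exactly three
# non-empty colon-separated fields (First Name:username:email) and
# returns non-zero if any such line was found.
check_user_list() {
    awk -F: 'NF != 3 || $1 == "" || $2 == "" || $3 == "" {
                 print "line " NR ": " $0; bad = 1
             }
             END { exit bad }' "$1"
}

# Usage:
#   check_user_list new-users.txt && echo "format looks fine"
```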

Become that user: su - user

SSH to as0: ssh as0

It will ask you for the password of the user. Then it will prompt you for information about where to save the public key file and for a passphrase. For all of these, just hit enter. That will set it to the default. Go back to hopper and do the same thing for all the new users.

It is VERY important that you use UID and GID numbers that have not already been taken. If new users and groups have been added correctly, then there shouldn't be a problem with that. maxuid is a file that specifies the next UID to use when creating a new user. The script reads from that file when creating the .ldif files for each user, and at the end overwrites that file with the new maxuid. If you're nervous about the UID numbers, it is ok to double check. Doing ldapsearch -x should output everything in the database with the latest entry at the bottom. Look for the UID in that entry and compare it with maxuid. If maxuid is one above that number then all is golden. It's also safe to look in /etc/passwd to make sure no one is using that number either.
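The double check against a passwd-format listing can be sketched as follows. The function name and the 60000 ceiling (used to skip accounts such as nfsnobody) are assumptions for the example; the value it prints should match the number stored in maxuid.

```shell
# Sketch only: next_free_uid is a hypothetical helper name.
# $1 = passwd-format file (name:x:uid:gid:...)
# $2 = ignore UIDs at or above this value (default 60000, an assumption)
# Prints one more than the highest UID found below the ceiling.
next_free_uid() {
    awk -F: -v limit="${2:-60000}" '
        $3 ~ /^[0-9]+$/ && $3 + 0 < limit + 0 && $3 + 0 > max { max = $3 + 0 }
        END { print max + 1 }' "$1"
}

# Usage on hopper, including LDAP users:
#   getent passwd > /tmp/all-users.txt
#   next_free_uid /tmp/all-users.txt
```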

Other modifications to the DB

Other modifications to the database, like adding a new group, adding users to that group, or deleting users, all use .ldif files similar to the ones for adding users. In the same directory as the above files, there are sample .ldif files that do these operations. Each file's name says what it does. add-group.ldif will add a group, using the ldapadd command. add-user-to-group.ldif can be used to add a user to a group, del-from-grp.ldif can delete users from groups, and chg-pw shows an example of changing the password of a user. All three use the ldapmodify command. Make sure you modify the files to what you need, and especially don't forget to change the GID if you add a group. Make sure it's a GID that's not already in use.

For example, this is the command for modifying the database when adding a user to a group. You'll need to append the Manager password to the end of this command. It has been redacted here but can be found in the README file, only with root privileges.

ldapmodify -f add-user-to-group.ldif -D "cn=Manager,dc=cluster,dc=loc"

If you're adding a group, change ldapmodify to ldapadd. The cn=Manager stuff is just specifying the manager of the database, which will allow you to change it.
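The sample files themselves are not reproduced on this page. As a hedged sketch, a modify request that adds a member to a group entry generally looks like the output of the function below. The function name is made up, and the memberUid attribute and the ou=group container are assumptions based on the usual posixGroup layout, so compare the result with add-user-to-group.ldif before using it.

```shell
# Sketch only: add_user_to_group_ldif is a hypothetical helper name.
# $1 = username, $2 = group name
add_user_to_group_ldif() {
    cat <<EOF
dn: cn=$2,ou=group,dc=cluster,dc=loc
changetype: modify
add: memberUid
memberUid: $1
EOF
}

# Print an example request for user sbsp and group users:
add_user_to_group_ldif sbsp users
```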

To delete a user from the ldap database, it's a little simpler. Use the command below. Again, the Manager password must be specified and has been redacted here, but it will be the same as the password used in the other commands, and can be found in the README file. You will need to change the uid to be equal to the uid of the person you want to delete. The uid here is just the person's username.

ldapdelete "uid=sbsp,ou=people,dc=cluster,dc=loc" -D "cn=Manager,dc=cluster,dc=loc"


Disable graphical booting screen in CentOS

To enable verbose booting and remove the loading bar graphic, simply remove rhgb quiet from the file /boot/grub/grub.conf.

rhgb stands for Red Hat graphical boot; the quiet option tells CentOS to suppress even more boot information.
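The edit can also be previewed without touching the real file. The sketch below prints a copy of a grub.conf-style file with rhgb and quiet removed from its kernel lines; the function name is made up, and it assumes the options appear on lines beginning with the word kernel.

```shell
# Sketch only: strip_graphical_boot is a hypothetical helper name.
# Prints the file in $1 with rhgb and quiet removed from kernel lines.
# The file itself is not modified.
strip_graphical_boot() {
    sed -E \
        -e '/^[[:space:]]*kernel[[:space:]]/ s/[[:space:]]+rhgb([[:space:]]|$)/\1/' \
        -e '/^[[:space:]]*kernel[[:space:]]/ s/[[:space:]]+quiet([[:space:]]|$)/\1/' \
        "$1"
}

# Usage: write the result to a scratch file, review it, then copy it
# into place by hand.
#   strip_graphical_boot /boot/grub/grub.conf > /tmp/grub.conf.new
#   diff /boot/grub/grub.conf /tmp/grub.conf.new
```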

Rebuilding and Creating RAID 1 Arrays with mdadm

Creating Arrays

To create a mirrored array with two drives, sda and sdb, on partitions, sda1 and sdb1:

mdadm --create --verbose /dev/md0 --level=raid1 --raid-devices=2 /dev/sda1 /dev/sdb1

Now you can monitor the status of the building of the array with:

cat /proc/mdstat

Once finished, save your mdadm configuration with:

mdadm --verbose --detail --scan > /etc/mdadm.conf

You may need to edit this file to remove unwanted lines or to add an email address to MAILADDR to be notified if a drive failure occurs.


On some systems, mdadm's configuration file is /etc/mdadm/mdadm.conf. It is very important to put the configuration in the correct location.

Rebuilding Arrays

If a drive ever fails, or if the system is booted with a drive removed, you will need to add it back into the array.

Failed Drive

In this example, /dev/sda1 and /dev/sdb1 make up the RAID 1 array /dev/md0. Let us say that /dev/sdb fails.

Determining failed drive


cat /proc/mdstat
[root@lo4 ~]# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[1](F) sda1[0]
      204736 blocks super 1.0 [2/1] [U_]

unused devices: <none>

When a drive fails or is missing, you will see an underscore in the array output ([U_] instead of [UU]). (F) will be displayed next to the failed drive (sdb1[1](F)).
If not, running lsblk or fdisk -l may help you determine which drive failed.

hdparm -I /dev/sda | grep "Serial Number"

This will give you the serial number of /dev/sda, which may help you identify the physical disks as well.
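Spotting the underscore can be automated. The sketch below prints the name of every array whose status brackets contain an underscore; the function name is made up, and it only looks at the [U_] style field shown in the output above.

```shell
# Sketch only: degraded_arrays is a hypothetical helper name.
# $1 = file to read (defaults to /proc/mdstat)
# Prints the name of each md array whose status field, such as [U_],
# contains an underscore.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}

# Usage on a live system:
#   degraded_arrays
```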

Remove Failed Drive

If a drive has failed, it should be removed from the mdadm array before being replaced.

mdadm --manage /dev/md0 --fail /dev/sdb1
[root@lo4 ~]# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0

Now, we can remove it from the array.

mdadm --manage /dev/md0 --remove /dev/sdb1
[root@lo4 ~]# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0

Check /proc/mdstat. There should no longer be any (F) or listed drive besides sda1[0].

Power down the system.

shutdown -h now

Replace Drive

Now that everything is powered down, remove the failed HDD then replace it with the new one.

Once the drive is replaced, boot the system back up.

Add New Drive to Array

Recreate the partitioning scheme of /dev/sda on the new drive.

sfdisk -d /dev/sda | sfdisk /dev/sdb

Then verify with lsblk or fdisk -l, and add the new partition to the array.

mdadm --manage /dev/md0 --add /dev/sdb1
[root@lo4 ~]# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1

Finally, check the status of the rebuilding with

cat /proc/mdstat
[root@lo4 ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      204736 blocks super 1.0 [2/1] [U_]
      [===========>.........]  recovery = 57.7% (118400/204736) finish=0.0min speed=118400K/sec

unused devices: <none>

Missing Drive

In this example, /dev/sda1 and /dev/sdb1 make up the RAID 1 array /dev/md0. Let us say that /dev/sdb1 is missing.

Use lsblk to examine HDD partitions with block sizes. Alternatively, you can use fdisk -l or any other utility you prefer.

Now check the status of mdadm with:

cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[1](F) sda1[0]
      204736 blocks super 1.0 [2/1] [U_]

unused devices: <none>

When a drive fails or is missing, you will see an underscore in the array output ([U_] instead of [UU]).

Use the output from lsblk and /proc/mdstat to match the present drive in an active mdadm array (/dev/md0) with the corresponding partition on the missing drive. For example, "match" /dev/sda1 with /dev/sdb1 (after verifying their block sizes are the same).

Now add /dev/sdb1 back into the array:

mdadm --manage /dev/md0 --add /dev/sdb1
[root@lo4 ~]# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1

You can view the status of the rebuilding array with:

cat /proc/mdstat
[root@lo4 ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      204736 blocks super 1.0 [2/1] [U_]
      [===========>.........]  recovery = 57.7% (118400/204736) finish=0.0min speed=118400K/sec

unused devices: <none>
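To check on the rebuild from a script, the percentage can be extracted from the recovery line shown above. This is only a sketch; the function name is made up, and it only recognizes lines labelled recovery.

```shell
# Sketch only: rebuild_progress is a hypothetical helper name.
# $1 = file to read (defaults to /proc/mdstat)
# Prints the recovery percentage of each rebuilding array, one per line,
# and prints nothing when no array is rebuilding.
rebuild_progress() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*/\1/p' "${1:-/proc/mdstat}"
}

# Usage on a live system:
#   rebuild_progress
```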

References and Resources

Torque PBS

Modifying pbs_server Configuration

  qmgr -c 'print server' > qmgr_pbs_server.backup

Note: you can simply list the qmgr pbs_server configuration with: qmgr -c 'p s'.

Modify qmgr Server Variable

An example for modifying a server variable for pbs_server with qmgr.

  $ qmgr
  Qmgr: unset server acl_hosts
  Qmgr: set server acl_hosts = headnode.hostname

Restarting pbs_server

Sometimes you need to restart pbs_server after making a change (or adding a new node).

The following shuts down pbs_server without killing jobs, then starts it again.

  $ qterm -t quick
  $ pbs_server

Running the Head Node as a Compute Node

Make sure the correct hostname (local to the compute nodes) is specified in /var/spool/torque/server_priv/nodes and /var/spool/torque/mom_priv/config.

When running pbs_mom from the head node, it may be necessary to specify the local hostname with:

  pbs_mom -H headnode.hostname

If you are running pbs as a service, you may also need to modify the init script for pbs_mom.

Starting pbs_mom at boot (pbs as a Service)

Copy pbs_mom init script into /etc/init.d/

To find where the pbs_mom init script is located, use the locate command.

$ locate /init.d/pbs_mom

  cp /path/to/contrib/init.d/pbs_mom /etc/init.d/pbs_mom

Add to chkconfig

  $ chkconfig --add pbs_mom
  $ chkconfig pbs_mom on

Custom init script for pbs_mom

You can make a copy of /etc/init.d/pbs_mom called something like my_pbs_mom in order to specify your own pbs_mom flags, for example if you need to specify the local hostname with -H headnode.hostname. If you do this, the above chkconfig commands should be issued with my_pbs_mom instead of pbs_mom.

Note: init scripts should have permissions 0755.



Mellanox vs. RedHat Open Fabrics distributions (OFED)

You can either get the required Infiniband packages from the RHEL package manager, or directly from Mellanox.

When making your choice, keep in mind the following:

We had oddities with our IB network until we started using the Mellanox OFED. One of the joys of OFED as an industry standard is that every IB vendor has their own perversion of it. What makes it especially frustrating is that RHEL/CentOS ship their own OFED and disentangling them in an automated way can be challenging. Mellanox OFED will uninstall RHEL OFED during its installation, but woe be unto the one who tries to do a "yum upgrade" at some point in the future. -Skylar Thompson

We opted for installing the latest Mellanox OFED. Mellanox OFED will remove a previous installation of RHEL OFED. After installation, we have to keep the Mellanox OFED packages from the "Infiniband Support" yum group separate, because otherwise yum upgrade will overwrite the files with RHEL's packages.

We can do this by editing the /etc/yum.conf file and adding an exclusion. View the packages in the yum group "Infiniband Support" using the command
yum groupinfo "Infiniband Support"
Using a text editor, open /etc/yum.conf and add the line
exclude=
followed by all packages you don't want upgraded (found in the Infiniband Support group list). No spaces should be added; use commas to separate the different packages.

Once the appropriate Mellanox OFED package for your CentOS release is downloaded, execute the following commands:
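Since the exclusion has to be one comma-separated line with no spaces, it can be assembled from a plain list of package names. The sketch below uses yum's exclude= option; the function name is made up and the package names are only examples, so substitute the ones reported by yum groupinfo.

```shell
# Sketch only: yum_exclude_line is a hypothetical helper name.
# Reads package names from standard input, one per line, and prints a
# single exclude= line suitable for /etc/yum.conf.
yum_exclude_line() {
    printf 'exclude=%s\n' "$(paste -s -d, -)"
}

# Example with illustrative package names:
printf 'libibverbs\nlibrdmacm\nopensm\n' | yum_exclude_line
```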

tar -xvzf /path/to/MLNX.tgz 
cd MLNX_OFED_LINUX-2.4-1.0.0-rhel6.5-x86_64/
./mlnxofedinstall [OPTIONS]

Then reboot for good measure.

IPoIB vs. native IB, or NFS / RDMA

IPoIB implements a TCP/IP layer on top of Infiniband and adds the Host Channel Adapter (HCA) as a Network Interface Card (NIC) to the system (Ex: ib0).

Using Infiniband "natively" with NFS / RDMA potentially allows for sending messages (packets) with greater bandwidth and significantly less CPU usage / involvement, as long as you have RDMA compatible hardware of course.

Unreliable Datagram vs. Connected Mode

I have read that connected mode is comparable to using jumbo frames (thus favorable), but recently it seems datagram has become more stable and is preferred. In any case you can switch between modes at run-time with:

echo datagram > /sys/class/net/ibX/mode 
echo connected > /sys/class/net/ibX/mode
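A small wrapper can guard against typos in the mode name before anything is written. This is a sketch under assumptions: the function name is made up, and the optional third argument exists only so the function can be tried against a scratch directory instead of /sys/class/net.

```shell
# Sketch only: set_ipoib_mode is a hypothetical helper name.
# $1 = interface (for example ib0)
# $2 = datagram or connected
# $3 = directory holding the interfaces (defaults to /sys/class/net)
set_ipoib_mode() {
    case "$2" in
        datagram|connected) ;;
        *) echo "unknown mode: $2" >&2; return 1 ;;
    esac
    mode_file="${3:-/sys/class/net}/$1/mode"
    [ -w "$mode_file" ] || { echo "cannot write $mode_file" >&2; return 1; }
    echo "$2" > "$mode_file"
}

# Usage as root on a node with an IPoIB interface:
#   set_ipoib_mode ib0 datagram
```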

Set up IPoIB

Installing the Mellanox Infiniband drivers with the --all flag should configure much of IPoIB already. There is a network configuration file in /etc/sysconfig/network-scripts/ifcfg-ib0.

You can configure IPoIB to use its own static IP address, or use the network configuration for an existing Ethernet configuration.

Here is an example ifcfg-ib<n> taken from the Mellanox user manual.

# Static settings; all values provided by this file
# Based on eth0; each '*' will be replaced with a corresponding octet
# from eth0.
# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.

Subnet Manager (OpenSM)

openSM setup

If your Infiniband switch does not support a subnet manager on the hardware, you will need to set up opensm to be run by the head node.

Upon installation, the opensm daemon's init script will be found at /etc/init.d/opensmd. To streamline things, add tab completion for it (as well as any other init scripts not already known to the service command) using:
complete -W "$(ls /etc/init.d/)" service 
Next, attempt to start opensm by using
 service opensmd start 
Make sure that opensmd is set to start on boot-up
 chkconfig --list opensmd
and, if it is not, enable it with
chkconfig opensmd on


Should the service not start for any reason, use lsmod | grep ^ib to check which Infiniband modules are loaded. Here is an example of what you should see:

ib_ucm                 12120  0 
ib_ipoib              122881  0 
ib_cm                  42214  3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs              61976  2 rdma_ucm,ib_ucm
ib_umad                12562  0 
ib_sa                  35753  5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad                 43632  4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core               117605  12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr                 7796  3 rdma_cm,ib_uverbs,ib_core

I found that the ib_umad module is directly related to opensm. If it or any other modules aren't loaded, you will need to add them to the rc.modules file

echo modprobe "module name" >> /etc/rc.modules 
and then update permissions
chmod +x /etc/rc.modules

For example, for ib_umad:

echo modprobe ib_umad >> /etc/rc.modules
chmod +x /etc/rc.modules
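To see which modules still need a modprobe line, a saved lsmod listing can be compared against the names you expect. The function below is a sketch; its name is made up, and the list of required modules is whatever you pass in.

```shell
# Sketch only: missing_modules is a hypothetical helper name.
# $1 = file containing lsmod output; remaining arguments = module names
# Prints each requested module that does not appear in the first column.
missing_modules() {
    listing=$1; shift
    for mod in "$@"; do
        awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing" \
            || printf '%s\n' "$mod"
    done
}

# Usage:
#   lsmod > /tmp/lsmod.txt
#   missing_modules /tmp/lsmod.txt ib_umad ib_ipoib ib_core
```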

Subnet Manager Failover

Setting up failover for opensm isn't challenging, but it is good to document which nodes are the subnet managers as the behavior of the network will be strange without any of the managers running. We discovered that with our GPFS cluster when we accidentally rebooted both managers at the same time - no nodes could join the network, including the subnet managers, until we took some manual action. -Skylar Thompson

Failover is necessary when running a subnet manager (SM) on your Infiniband machines (rather than a switch). Essentially, failover is a configuration that ensures that if one machine goes down, there is guaranteed to be an SM running on another machine. With Infiniband, you need an SM to be active, otherwise the machines will not be able to communicate with each other.

Switch Configuration

Initial Setup

Certain Infiniband switches can run a subnet manager. This is ideal and in this situation, failover is not necessary. To configure our switch, the Mellanox SX6018, you need to connect the console port to the serial port of an Infiniband machine. Next, install and run the serial terminal program minicom and login with username: admin and password: admin. Go through the configuration wizard (the defaults are fine). We did not enable IPv6.

Installing minicom

yum install minicom

And set the port to /dev/ttyS0

minicom -s

Running the switch setup wizard: run minicom and log in, then run the following commands (the text following the > or #).

switch > enable
switch # configure terminal
switch (config) # jump-start

From the Mellanox switch manual:

Before attempting a remote (for example, SSH) connection to the switch, check the mgmt0 interface configuration. Specifically, verify the existence of an IP address. To check the current mgmt0 configuration, enter the following commands:

Note that the commands start after the > or #.

switch > enable
switch # configure terminal
switch (config) # show interfaces mgmt0 

Enabling / Running the Subnet Manager

You can enable, manage, configure, and run the subnet manager (along with many other things) through the Mellanox switch web interface control (management) panel. However, if you don't want to bother with getting it working, you can simply enable the subnet manager straight from a logged-in minicom switch prompt.

Again, the commands start after the > and #.

switch > enable
switch # configure terminal
switch (config) # ib sm

OpenMPI testing

We tested OpenMPI using a prime number generator script found here: /cluster/home/charliep/cvs-hopper/primes/. We ran primes_batch with mpirun, specifying the desired number of machines using a machinefile / hostfile.

Creating a Machine / Hosts File

A machinefile, or hostfile lists information about the nodes for mpirun to use. You should be able to make an appropriate file and run it on the connected machines using Infiniband.

make primes_batch
mpirun -np 4 -hostfile hostfile ./primes_batch
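A hostfile for Open MPI is a list of node names, each optionally followed by slots=N. The sketch below prints one; the function name and node names are made up, so replace them with the hostnames of your Infiniband-connected machines.

```shell
# Sketch only: make_hostfile is a hypothetical helper name.
# $1 = slots per node; remaining arguments = node hostnames
make_hostfile() {
    slots=$1; shift
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
}

# Example with invented node names (redirect to a file named hostfile):
make_hostfile 4 node01 node02
```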

General testing

The OFED comes with loads of testing programs, such as ibstat, ibv_devinfo, and ib_write_bw.

Testing with CHARMM


Installing CHARMM

Load latest openMPI

module load modules
module load gcc/4.9.0
module load openmpi


./ gnu M

Clean (if needed)

  ./ gnu M distclean