From Earlham Cluster Department

Revision as of 21:26, 13 April 2017


Current To Do

Cluster Pages

New Software

We install much of the software on the clusters as modules. Environment modules are an easy way to install multiple versions of software and let users trivially change their environment variables (PATH, LD_LIBRARY_PATH, C_INCLUDE_PATH, etc.) to point to whichever version they want to use. If I want to use gcc version 5.1.0 instead of the system default 4.4.7, all I have to type is:

    module load gcc/5.1.0  

If gcc 4.4.7 were also a module and I wanted to swap them to make sure I'm using gcc 5.1.0, all I would type is:

    module swap gcc/4.4.7 gcc/5.1.0 
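
As a rough illustration of what loading a module does to the environment, here is a minimal sketch in plain shell. It is not how Environment Modules is implemented; the function name `prepend_path` and the example install path are assumptions made for illustration only.

```shell
#!/bin/sh
# prepend_path VARNAME DIR
# Put DIR at the front of the colon-separated variable VARNAME,
# which is roughly the effect a module has on PATH-like variables.
# (Hypothetical helper, not part of Environment Modules.)
prepend_path() {
    # Read the current value of the named variable (empty if unset).
    eval "current=\${$1:-}"
    if [ -n "$current" ]; then
        eval "$1=\$2:\$current"
    else
        eval "$1=\$2"
    fi
    export "$1"
}

# Example (path is an assumption following the layout described on this page):
# prepend_path PATH /mounts/layout/software/gcc/5.1.0/bin
```

After a real `module load`, `which gcc` is a quick way to confirm that the expected version is first in PATH.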

Before installing the software, figure out whether it should be a yum package or a source kit, whether it should be installed into the system space or modules, and what library dependencies it has. Proceed as appropriate.

Building software packages

Needs to be Updated

Installing a yum package into Modules structure

Enabling built software packages within Modules structure

On the head node of each cluster, modules are installed into /mounts/machine/software, where machine is the name of the actual machine. This directory is visible to all of the nodes of the machine. In that directory, there should be a subdirectory for each available module, and within those each version has its own directory. (As of July 15 2015, the modules setup and organization is messy on many of the clusters; the notes here are how it should be set up from here on out.)

So, for example, if we have gcc versions 5.1.0, 4.7.1, and 4.9.0 installed on layout, it would look like this:

   $ ls -1 /mounts/fatboy/software/gcc

If you think your new package is important enough to be loaded by default, then add it to the list in /mounts/al-salam/software/Modules/3.2.7/init/al-salam.{sh,csh}

DNS/DHCP for a single host

Find an IP that's not in use. The easiest way to do that is to look in this file: /var/named/master/ Add the name and IP following the pattern in the file, like below. Be sure to change the serial number at the top of the file to represent the year, month, day, and version.

    	IN	A 

Save the file. Every time you add an entry to the zone file, you also have to edit the reverse zone file. The reverse zone file is in /var/named/master/. Add an entry there for the host you added in the zone file. Notice that the first number is the last octet of the IP that you gave the host.

    126	IN	PTR 
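
The serial-number and last-octet conventions above can be sketched in shell. This assumes the serial has the form YYYYMMDDNN (date plus a two-digit version), which is a common convention but is not confirmed by this page; the function names and the example address are hypothetical.

```shell
#!/bin/sh
# next_serial CURRENT TODAY
# CURRENT: existing zone serial, assumed to be in YYYYMMDDNN form.
# TODAY:   today's date as YYYYMMDD, e.g. "$(date +%Y%m%d)".
# Prints the serial to use after editing the zone file.
# Note: does not handle more than 100 edits in one day.
next_serial() {
    current=$1
    today=$2
    current_day=${current%??}   # strip the two-digit version suffix
    if [ "$current_day" = "$today" ]; then
        echo $((current + 1))   # same day: bump the version
    else
        echo "${today}00"       # new day: start the version at 00
    fi
}

# last_octet IP
# Prints the last octet of an IPv4 address, which is the number
# that starts the PTR line in the reverse zone file.
last_octet() {
    echo "${1##*.}"
}
```

For example, `last_octet 192.0.2.126` prints `126`, matching the first field of the PTR entry shown above (the address itself is only a placeholder).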

Next you'll want to stop DNS and then start it again with the following commands.

   service named stop
   service named start

Now that DNS is updated, we have to update DHCP. The file you want to edit is /etc/dhcp/dhcpd.conf. Towards the bottom of the file you'll add

    host <hostname> { hardware ethernet <MACaddress>; fixed-address <hostname>; }

Save the file. Just like we did for the DNS config file, we need to stop and then start DHCP.

   service dhcpd stop 
   service dhcpd start
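
To avoid typos in the host entry described above, a small helper can print it for pasting into /etc/dhcp/dhcpd.conf. This is only a sketch; the function name `dhcp_host_entry` and the example hostname and MAC address are assumptions.

```shell
#!/bin/sh
# dhcp_host_entry HOSTNAME MAC
# Prints a dhcpd.conf host stanza. fixed-address is set to the
# hostname, so the DNS entry must already exist for it to resolve.
dhcp_host_entry() {
    printf 'host %s {\n' "$1"
    printf '    hardware ethernet %s;\n' "$2"
    printf '    fixed-address %s;\n' "$1"
    printf '}\n'
}

# Example (hypothetical host and MAC address):
# dhcp_host_entry newnode 00:11:22:33:44:55
```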

As a test, reboot the client.

Setting up LDAP

Installing and configuring LDAP can be tedious and frustrating, but no worries! I went through the trouble and took notes as I went so no one else would have to suffer like I did! These notes are pretty detailed, but I would suggest using one of the other servers with a newer CentOS version (layout, fatboy) as a resource when installing and configuring, especially if you are configuring it for a cluster.

Packages that need to be installed (both head and compute nodes):

We use NSS and NSLCD in conjunction with PAM for ldap authentication. It may be older than SSSD, but we already know how to do it. So, we want to turn off SSSD. If sssd is not running, then great, that'll make your life a lot easier!

   service sssd stop
   chkconfig sssd off #so it doesn't restart if the machine reboots
   chkconfig --del sssd #delete it as a service because we don't want it

There are a lot of files that need to be modified in order for ldap to work correctly.

   URI ldap://
   BASE dc=cluster, dc=loc
   TLS_CACERTDIR /etc/openldap/cacerts
    passwd:  ldap files
    group:  ldap files
    shadow: ldap files

    ethers:     files
    netmasks:   files
    networks:   files
    protocols:  files
    rpc:        files
    services:   files ldap

    netgroup:   ldap files

    publickey:  nisplus

    automount:  files ldap
    aliases:    files

    sudoers:    ldap files
   rootbinddn cn=Manager,dc=cluster,dc=loc
   nss_base_passwd ou=people,dc=cluster,dc=loc?one
   nss_base_shadow ou=people,dc=cluster,dc=loc?one
   nss_base_group ou=group,dc=cluster,dc=loc?one
   nss_map_objectclass posixAccount User
   nss_map_objectclass shadowAccount User
   nss_map_objectclass posixGroup Group
   nss_map_attribute uid userName
   nss_map_attribute gidNumber gid
   nss_map_attribute uidNumber uid
   nss_map_attribute cn groupName
   base dc=cluster,dc=loc
   pam_password crypt
   uri ldap://
   ssl no
   tls_cacertdir /etc/openldap/cacert
   uri ldap://
   instead of pam_sss.so, it should be
   base   group  ou=group,dc=cluster,dc=loc
   base   passwd ou=people,dc=cluster,dc=loc
   base   shadow ou=people,dc=cluster,dc=loc
   uid nslcd
   gid ldap
   uri ldap://
   base dc=cluster, dc=loc
   ssl no
   tls_cacertdir /etc/openldap/cacerts
   UsePAM yes

Since we deleted sssd, we need to start the alternative, and make sure it starts on boot up.

   service nslcd start
   chkconfig nslcd on

Check to make sure nscd is turned off. It is a caching daemon for name service lookups, including the ones answered by LDAP. Since we're so small here, we don't really need it.

   service nscd stop
   chkconfig nscd off

Users and Groups

Users are authenticated using an LDAP (Lightweight Directory Access Protocol) server running on Hopper. This is how users are authenticated throughout the entire cluster realm. We use LDAP for all users and groups except for the ccg-admin user, the root user, and the wheel group. Those users and that group are local to each cluster. Every user is a part of the users LDAP group, which is group number 115, and all clusters should look at LDAP first and then files. This is specified in the /etc/nsswitch.conf file.
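
A quick way to confirm the lookup order on a node is to print the sources listed for a database in /etc/nsswitch.conf. The helper below is a minimal sketch; the name `nss_sources` is an assumption, and it does not handle trailing comments on a line.

```shell
#!/bin/sh
# nss_sources DATABASE [FILE]
# Prints the sources configured for DATABASE (e.g. passwd, group),
# in order, from FILE (default: /etc/nsswitch.conf).
nss_sources() {
    db=$1
    file=${2:-/etc/nsswitch.conf}
    # Match the line whose first field is "DATABASE:", drop that
    # field, and print the remaining sources separated by spaces.
    awk -v key="$db:" '$1 == key { $1 = ""; sub(/^ +/, ""); print }' "$file"
}

# Example: on a correctly configured node this should print "ldap files".
# nss_sources passwd
```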

A user can change their password with the passwd command while on Hopper. It will prompt them for their current password and then for their desired new password. If it's successful, it will say so at the end with a message along the lines of 'All LDAP tokens successfully changed.'

Creating New Users

Because creating users in LDAP is somewhat confusing, sample files and a Python script were written to help. The script lives in ~root/ldap-files/ on hopper. I'll explain things here, but there's also a README file in that directory that explains everything.

To create a user in LDAP, you must create an .ldif file for that user. This is what the script does for you. It takes a file of new users as a command line argument. The file must specify First Name:username:email for each user, with each user on a separate line. The file add-user.ldif is an example of what the file should look like.

Running the script with sudo python, with add-user.ldif as its argument, will create an .ldif file for each user and use ldapadd to add them to the LDAP database. The contents of the .ldif file for each user added will be printed to the screen, and each user will be sent an email with their username and password. The script is set to clean up after itself, so you don't need to worry about that. There's one more thing that has to be done after this step: we need to set up the ssh keys for each user. For each user created:

Become that user:

    su - user 

SSH to as0:

    ssh as0  

It will ask you for the password of the user. Then it will prompt you for information about where to save the public key file and for a passphrase. For all of these, just hit enter. That will set it to the default. Go back to hopper and do the same thing for all the new users.

It is VERY important that you use UID and GID numbers that have not already been taken. If new users and groups have been added correctly, then there shouldn't be a problem with that. maxuid is a file that specifies the next UID to use when creating a new user. The script reads from that file when creating the .ldif files for each user, and at the end overwrites that file with the new maxuid. If you're nervous about the UID numbers, it is OK to double check. Doing ldapsearch -x should output everything in the database with the latest entry at the bottom. Look for the UID in that entry and compare it with maxuid. If maxuid is one above that number then all is golden. It's also safe to look in /etc/passwd to make sure no one is using that number either.
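
The maxuid bookkeeping described above can be sketched as follows. This is not the actual script on hopper, only an illustration of the idea under the assumption that maxuid holds a single number; the function name `assign_uids` and the example users are hypothetical.

```shell
#!/bin/sh
# assign_uids USERS_FILE MAXUID_FILE
# USERS_FILE:  one "First Name:username:email" entry per line.
# MAXUID_FILE: contains the next free UID.
# Prints "username uid email" for each user and then overwrites
# MAXUID_FILE with the next unused UID.
assign_uids() {
    users_file=$1
    maxuid_file=$2
    next_uid=$(cat "$maxuid_file")
    while IFS=: read -r _name username email; do
        [ -n "$username" ] || continue   # skip blank or malformed lines
        printf '%s %s %s\n' "$username" "$next_uid" "$email"
        next_uid=$((next_uid + 1))
    done < "$users_file"
    echo "$next_uid" > "$maxuid_file"
}
```

Running it on a copy of maxuid first is a safe way to preview which UIDs would be handed out.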

Other modifications to the DB

Other modifications to the database, like adding a new group, adding users to that group, or deleting users, all use .ldif files similar to the ones for adding users. In the same directory as the above files, there are sample .ldif files that do these operations. Each file's name says what it does. add-group.ldif will add a group, using the ldapadd command. add-user-to-group.ldif can be used to add a user to a group, del-from-grp.ldif can delete users from groups, and chg-pw shows an example of changing the password of a user. All three use the ldapmodify command. Make sure you modify the files to what you need, and especially don't forget to change the gid if you add a group. Make sure it's a GID that's not already in use.

The command below modifies the database, in this case adding a user to a group. You'll need to append the Manager password to the end of this command. It has been redacted here but can be found in the README file, which is readable only with root privileges.

    ldapmodify -f add-user-to-group.ldif -D "cn=Manager,dc=cluster,dc=loc" 

If you're adding a group, change ldapmodify to ldapadd. The cn=Manager stuff is just specifying the manager of the database, which will allow you to change it.
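
For reference, an LDIF that adds a user to a posixGroup generally looks like the output of the sketch below. The sample add-user-to-group.ldif on hopper may differ; the function name `add_to_group_ldif` is an assumption, and the group base ou=group,dc=cluster,dc=loc is taken from the configuration earlier on this page.

```shell
#!/bin/sh
# add_to_group_ldif GROUP USERNAME
# Prints an LDIF change record that adds USERNAME to GROUP
# using the standard posixGroup memberUid attribute.
add_to_group_ldif() {
    group=$1
    user=$2
    cat <<EOF
dn: cn=${group},ou=group,dc=cluster,dc=loc
changetype: modify
add: memberUid
memberUid: ${user}
EOF
}

# Example: write the record to a file, review it, then pass the
# file to ldapmodify with -f as shown above.
# add_to_group_ldif users sbsp > add-user-to-group.ldif
```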

To delete a user from the ldap database, it's a little simpler. Use the command below. Again, the Manager password must be specified and has been redacted here, but it will be the same as the password used in the other commands, and can be found in the README file. You will need to change the uid to be equal to the uid of the person you want to delete. The uid here is just the person's username.

    ldapdelete "uid=sbsp,ou=people,dc=cluster,dc=loc" -D "cn=Manager,dc=cluster,dc=loc" 


Disable graphical booting screen in CentOS

To enable verbose booting and remove the loading bar graphic, simply remove rhgb quiet from the file /boot/grub/grub.conf.

rhgb stands for Red Hat graphical boot; the quiet option tells CentOS to suppress even more boot information.
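
The edit can also be previewed with sed before touching the real file. The sketch below prints a modified copy of a grub.conf to standard output and leaves the original alone; the function name `strip_rhgb_quiet` is an assumption, and it only looks at lines that begin with kernel.

```shell
#!/bin/sh
# strip_rhgb_quiet FILE
# Prints FILE with the words "rhgb" and "quiet" removed from
# kernel lines. The file itself is not modified.
strip_rhgb_quiet() {
    sed \
        -e '/^[[:space:]]*kernel[[:space:]]/s/[[:space:]]rhgb[[:space:]]/ /' \
        -e '/^[[:space:]]*kernel[[:space:]]/s/[[:space:]]rhgb$//' \
        -e '/^[[:space:]]*kernel[[:space:]]/s/[[:space:]]quiet[[:space:]]/ /' \
        -e '/^[[:space:]]*kernel[[:space:]]/s/[[:space:]]quiet$//' \
        "$1"
}

# Example: compare the result with the current file before replacing it.
# strip_rhgb_quiet /boot/grub/grub.conf | diff /boot/grub/grub.conf -
```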

Rebuilding and Creating RAID arrays with mdadm

mdadm RAID Documentation

Torque PBS

Torque PBS Documentation


Infiniband Documentation

Installing CHARMM

Load latest openMPI

module load modules
module load gcc/4.9.0
module load openmpi


./ gnu M

Clean (if needed)

  ./ gnu M distclean

Bacula Backup Management


Fluorite (Machine)

The jail quartz is the CS bacula director, which lives on the machine fluorite.

The configuration files on BSD (i.e. quartz) are /usr/local/etc/bacula/bacula-dir.conf and /usr/local/etc/bacula/bacula-fd.conf. Each bacula client has its own bacula-fd.conf configuration file that points back to quartz.

Helpful Commands for working with jails

Bacula Commands (on Quartz)

Using Infiniband for Layout's NFS

Internal NFS mounts on Layout are now done over Infiniband

  1. add to /etc/hosts of all layout nodes
    • lo0.layout.ib
  2. update /etc/exports on lo0 with infiniband IP address
    • then run exportfs -a to update nfs
    • use showmount to check mounted machines
  3. change nfs mounts to ib fabric on layout nodes
  4. un-mount then mount under lo0.layout.ib

    # umount /scratch
    # mount lo0.layout.ib:/scratch /scratch
    # umount /mounts
    # mount lo0.layout.ib:/mounts /mounts
    # umount /var/www/
    # mount lo0.layout.ib:/var/www/ /var/www/
    • if it says "device is busy", then you can try umount -l /some/mount
  5. update /etc/fstab to use lo0.layout.ib hostname
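
Step 5 can be previewed with sed before editing /etc/fstab by hand. This sketch assumes the existing NFS entries start with the old server hostname followed by a colon; the old hostname is passed as a parameter because this page does not say what it is, and the function name `fstab_use_ib` is hypothetical.

```shell
#!/bin/sh
# fstab_use_ib OLD_HOST FSTAB
# Prints FSTAB with NFS sources of the form "OLD_HOST:/path"
# rewritten to "lo0.layout.ib:/path". The file is not modified.
# OLD_HOST is used as a regular expression, so a dot in it
# matches any character.
fstab_use_ib() {
    old_host=$1
    fstab=$2
    sed "s|^${old_host}:|lo0.layout.ib:|" "$fstab"
}

# Example (the old hostname "lo0" is an assumption):
# fstab_use_ib lo0 /etc/fstab | diff /etc/fstab -
```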

NFS Reference:

Routing between 10Gb and IB in Layout

In order to route traffic over the 10Gb network to layout's compute nodes, we need lo0 to mediate the exchange.

To temporarily establish routing between servers, we can use the ip route add command: ip route add via dev ib0

In order to make this persistent, we must create the file /etc/sysconfig/network-scripts/route-ib0 with the static routing rule inside it.
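
On CentOS, a route-<interface> file can hold one rule per line in the same form as the arguments to ip route add. The helper below prints such a line; the function name `route_rule` is an assumption, and the network and gateway in the example are documentation placeholders, not the real Layout addresses.

```shell
#!/bin/sh
# route_rule NETWORK GATEWAY [DEVICE]
# Prints a static routing rule in "ip route add" argument form,
# suitable as a line in /etc/sysconfig/network-scripts/route-ib0.
# DEVICE defaults to ib0.
route_rule() {
    printf '%s via %s dev %s\n' "$1" "$2" "${3:-ib0}"
}

# Example (placeholder addresses):
# route_rule 192.0.2.0/24 198.51.100.1
```

The same three values are what the temporary `ip route add` command takes, so the rule can be tried by hand before it is written to the file.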
