Opened 11 years ago

Closed 11 years ago

#452 closed task (fixed)

Test newly merged changes

Reported by: skylar Owned by: skylar
Priority: critical Milestone:
Component: Version:
Keywords: Cc:
Blocked By: Blocking:
Estimated Hours: 3 Total Hours: 1.83

Description

in r2158

Change History (5)

comment:1 Changed 11 years ago by skylar

  • Status changed from new to assigned

galaxsee works fine, liberating now

comment:2 Changed 11 years ago by skylar

fit'z report:

Running this build in a VirtualBox liberation environment I ran into the following:

  • changing hostnames and/or IPs requires restart of mpd
  • Restarting mpd on headnode requires restart of mpd on compute nodes

The mpd init script will choose a new port for the headnode on restart. This could easily be changed but I don't know [yet] if compute nodes' mpds need to be bounced even if the headnode bounces on the same port.

  • if you switch MPI implementation, you must bccd-snarfhost again (different machines file syntax)

a couple of us talked briefly a while ago about an inline solution to this, but nothing has been implemented yet
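
For reference, the syntax difference can be sketched as below. This is only an illustration, not what bccd-snarfhost does: it assumes MPICH-style lines of the form `host` or `host:ncpus` and Open MPI-style lines of the form `host slots=N`. The function names and the sample hostnames are hypothetical.

```python
def parse_mpich_machinefile(text):
    """Parse MPICH-style lines ("host" or "host:ncpus") into (host, ncpus) pairs."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        host, _, count = line.partition(":")
        hosts.append((host, int(count) if count else 1))  # no count means 1 cpu
    return hosts


def to_openmpi_hostfile(hosts):
    """Render (host, ncpus) pairs as Open MPI hostfile lines ("host slots=N")."""
    return "\n".join(f"{host} slots={n}" for host, n in hosts) + "\n"


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    sample = "node000:2\nnode001\n"
    print(to_openmpi_hostfile(parse_mpich_machinefile(sample)), end="")
```

A converter along these lines would avoid re-running bccd-snarfhost after switching MPI implementations, but only if the two formats really differ in nothing more than this.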

  • choosing 192.168.3.1 for BCCD net:

    • during live boot: causes eth1:1 to fail to be brought up
    • during bccd-reset-network after liberation: causes eth1:1 to fail to be brought up AND causes dhcpd to fail to start (duplicate entries in conf)
      • One-time, "works /right now/," fix: Edit /etc/dhcp3/bccd_net.conf and remove first section
    • causes pxe nodes to be booted with hostname == "nodeXXX"; minor side effect, doesn't affect functionality
    • Plus side: OpenMPI issue below doesn't appear; name affiliation on head node isn't confused.
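
A quick way to confirm the duplicate-entry condition before restarting dhcpd might look like the sketch below. It assumes the duplicated sections are ordinary `subnet ... netmask ... {` declarations, which has not been verified against the generated /etc/dhcp3/bccd_net.conf. The function name and the sample config text are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a dhcpd subnet declaration and captures network + netmask.
SUBNET_RE = re.compile(r"^\s*subnet\s+(\S+)\s+netmask\s+(\S+)", re.MULTILINE)


def duplicate_subnets(conf_text):
    """Return the (network, netmask) pairs that are declared more than once."""
    counts = Counter(SUBNET_RE.findall(conf_text))
    return sorted(pair for pair, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical config contents, for illustration only.
    sample = """\
subnet 192.168.3.0 netmask 255.255.255.0 {
  range 192.168.3.100 192.168.3.200;
}
subnet 192.168.3.0 netmask 255.255.255.0 {
  range 192.168.3.100 192.168.3.200;
}
"""
    print(duplicate_subnets(sample))
```

An empty result would mean the dhcpd failure has some other cause than a repeated subnet section.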

  • pxe node hangs after "Opening socket /var/cache/pdnsd/pdnsd.status"; it takes 2+ reboots to reach the login prompt
  • OpenMPI programs hang under certain conditions

    • 2(+) nodes. Head node has a virtual nic (e.g. eth1:1). Non-virtual nic assigned to BCCD subnet, virtual nic assigned to something else.
    • Running -np 2(+) over the nodes will hang indefinitely.
    • Fix is to ifdown the virtual nic.
    • This condition does not appear if you manually choose 192.168.3.0/24 when setting up the BCCD net (ethX:1 is never brought up in that case, see above).
    • My guess is a bug in OpenMPI itself, but I'd need to test in other environments to be sure.
    • Tried with a more recent version, same behavior (yes, even though we /just/ upgraded OpenMPI, there's a newer "stable" version).
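
The ifdown workaround could be scripted roughly as follows. This sketch assumes that any interface name containing ":" (such as eth1:1) is a virtual nic, and it only builds the commands rather than running them. The function names and the interface list are hypothetical.

```python
def alias_interfaces(names):
    """Return the interface names that look like virtual nics, e.g. "eth1:1"."""
    return [name for name in names if ":" in name]


def ifdown_commands(names):
    """Build (but do not run) one ifdown command per virtual nic."""
    return [["ifdown", name] for name in alias_interfaces(names)]


if __name__ == "__main__":
    # Hypothetical interface list, for illustration only.
    for cmd in ifdown_commands(["lo", "eth0", "eth1", "eth1:1"]):
        print(" ".join(cmd))
```

The commands are returned as argument lists so they can be reviewed first and then passed to something like `subprocess.run` on the head node.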

Some of these issues are ticket-able, and will start appearing in trac soon.

comment:3 Changed 11 years ago by skylar

  • Estimated Hours changed from 1 to 3
  • Not sure whether the MPI problems are actually BCCD problems.
  • Can't replicate the eth1:1 problem, testing liberation now.

comment:4 Changed 11 years ago by skylar

not able to replicate the liberation problem either

comment:5 Changed 11 years ago by skylar

  • Resolution set to fixed
  • Status changed from assigned to closed