Bug #1948

closed

xmipp CL2D jobs not running and blocking cluster

Added by Dmitry Lyumkis over 12 years ago. Updated over 12 years ago.

Status: Closed
Priority: Urgent
Assignee: Sargis Dallakyan
Category: -
Target version: -
Start date: 07/25/2012
Due date:
% Done: 0%
Estimated time:
Affected Version: Appion/Leginon 2.1.0
Show in known bugs: No
Workaround:

Description

I've noticed that many of the CL2D jobs that have been launched are not running: they produce no output, despite having very few (<1000) particles and small box sizes. They are stalled on the cluster, and as a result no one else can get on. Could this be related to the recent changes to the MPI installation on the nodes that Xmipp uses? Has anyone had a successful CL2D run recently, and if so, in which directory?

Actions #1

Updated by Sargis Dallakyan over 12 years ago

  • Status changed from New to In Test

Thank you Dmitry. I have checked the logs and I can see that Daniel Murin currently has a CL2D job running on guppy.

 tracejob -n 3 78614
...
Job: 78614.guppy.emg.nysbc.org

07/25/2012 15:51:04  S    Job Queued at request of dmurin@guppy.emg.nysbc.org, owner = dmurin@guppy.emg.nysbc.org, job name = cl2d2.appionsub.job, queue = batch
07/25/2012 15:51:04  S    Job Modified at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 15:51:04  L    Not enough of the right type of nodes available
07/25/2012 15:51:04  S    enqueuing into batch, state 1 hop 1
07/25/2012 15:51:04  A    queue=batch
07/25/2012 16:13:02  S    Job Modified at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 16:13:02  L    Job Run
07/25/2012 16:13:02  S    Job Run at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 16:13:02  A    user=dmurin group=users jobname=cl2d2.appionsub.job queue=batch ctime=1343256664 qtime=1343256664 etime=1343256664 start=1343257982 owner=dmurin@guppy.emg.nysbc.org
                          exec_host=guppy-24/3+guppy-24/2+guppy-24/1+guppy-24/0 Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.walltime=240:00:00 
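The exec_host field in the record above lists one host/slot pair per allocated core. A minimal sketch of how it could be grouped by node, assuming the '+'-separated host/slot format shown in this log (the function name parse_exec_host is hypothetical, not part of any PBS tool):

```python
def parse_exec_host(exec_host):
    """Group 'host/slot' entries joined by '+' into {host: [slots]}."""
    slots_by_host = {}
    for entry in exec_host.split("+"):
        # Split on the last '/' so the host part is everything before it.
        host, _, slot = entry.rpartition("/")
        slots_by_host.setdefault(host, []).append(int(slot))
    return slots_by_host


hosts = parse_exec_host("guppy-24/3+guppy-24/2+guppy-24/1+guppy-24/0")
print(hosts)  # {'guppy-24': [3, 2, 1, 0]}
```

For this job the result has a single key, which matches Resource_List.nodes=1:ppn=4: all four processes run on guppy-24.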

I logged in to guppy-24 over ssh and can see xmipp_mpi_class running there:

[root@guppy-24 jobs]# top

top - 10:24:44 up 113 days, 3 min,  1 user,  load average: 4.02, 4.01, 3.93
Tasks: 104 total,   5 running,  99 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.8%us,  0.1%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16440060k total, 16266248k used,   173812k free,   109580k buffers
Swap:  2096472k total,    15100k used,  2081372k free, 15800352k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19327 dmurin    25   0  191m  23m 6176 R 99.8  0.1   1056:13 xmipp_mpi_class
19328 dmurin    18   0  191m  23m 6036 R 99.8  0.1   1056:39 xmipp_mpi_class
19329 dmurin    18   0  191m  23m 6036 R 99.8  0.1   1055:55 xmipp_mpi_class
19330 dmurin    25   0  191m  23m 6040 R 99.8  0.1   1055:47 xmipp_mpi_class

The folder where this job runs has the following files:
[root@guppy-24 jobs]# ls -al /ami/data00/appion/12jul23c/align/cl2d2/
total 913868
drwxrwxr-x  3 dmurin users      4096 Jul 26 08:21 .
drwxrwxr-x  6 dmurin users        58 Jul 25 15:52 ..
-rw-rw-r--  1 dmurin users  43236352 Jul 25 16:25 12jul25q13.hed
-rw-rw-r--  1 dmurin users 875536128 Jul 25 16:25 12jul25q13.img
-rw-rw-r--  1 dmurin users       544 Jul 25 15:51 cl2d2.appionsub.job
-rw-rw-r--  1 dmurin users     69573 Jul 25 16:27 cl2d2.appionsub.log
-rw-rw-r--  1 dmurin users       423 Jul 25 16:25 .emanlog
-rw-rw-r--  1 dmurin users   1103127 Jul 25 18:05 part12jul25q13_level_00_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 18:05 part12jul25q13_level_00_000000.xmp
-rw-rw-r--  1 dmurin users   1894706 Jul 25 18:05 part12jul25q13_level_00_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 18:05 part12jul25q13_level_00_000001.xmp
-rw-rw-r--  1 dmurin users       276 Jul 25 18:05 part12jul25q13_level_00_.doc
-rw-rw-r--  1 dmurin users       154 Jul 25 18:05 part12jul25q13_level_00_.sel
-rw-rw-r--  1 dmurin users    918527 Jul 25 22:55 part12jul25q13_level_01_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000000.xmp
-rw-rw-r--  1 dmurin users    808903 Jul 25 22:55 part12jul25q13_level_01_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000001.xmp
-rw-rw-r--  1 dmurin users    425077 Jul 25 22:55 part12jul25q13_level_01_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000002.xmp
-rw-rw-r--  1 dmurin users    845326 Jul 25 22:55 part12jul25q13_level_01_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000003.xmp
-rw-rw-r--  1 dmurin users       497 Jul 25 22:55 part12jul25q13_level_01_.doc
-rw-rw-r--  1 dmurin users       308 Jul 25 22:55 part12jul25q13_level_01_.sel
-rw-rw-r--  1 dmurin users    531009 Jul 26 05:53 part12jul25q13_level_02_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000000.xmp
-rw-rw-r--  1 dmurin users    419965 Jul 26 05:53 part12jul25q13_level_02_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000001.xmp
-rw-rw-r--  1 dmurin users    255316 Jul 26 05:53 part12jul25q13_level_02_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000002.xmp
-rw-rw-r--  1 dmurin users    286840 Jul 26 05:53 part12jul25q13_level_02_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000003.xmp
-rw-rw-r--  1 dmurin users    487912 Jul 26 05:53 part12jul25q13_level_02_000004.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000004.xmp
-rw-rw-r--  1 dmurin users    427917 Jul 26 05:53 part12jul25q13_level_02_000005.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000005.xmp
-rw-rw-r--  1 dmurin users    281089 Jul 26 05:53 part12jul25q13_level_02_000006.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000006.xmp
-rw-rw-r--  1 dmurin users    307785 Jul 26 05:53 part12jul25q13_level_02_000007.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000007.xmp
-rw-rw-r--  1 dmurin users       934 Jul 26 05:53 part12jul25q13_level_02_.doc
-rw-rw-r--  1 dmurin users       616 Jul 26 05:53 part12jul25q13_level_02_.sel
-rw-rw-r--  1 dmurin users    160247 Jul 26 10:16 part12jul25q13_level_03_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000000.xmp
-rw-rw-r--  1 dmurin users    231318 Jul 26 10:16 part12jul25q13_level_03_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000001.xmp
-rw-rw-r--  1 dmurin users    235720 Jul 26 10:16 part12jul25q13_level_03_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000002.xmp
-rw-rw-r--  1 dmurin users    241471 Jul 26 10:16 part12jul25q13_level_03_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000003.xmp
-rw-rw-r--  1 dmurin users    203841 Jul 26 10:16 part12jul25q13_level_03_000004.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000004.xmp
-rw-rw-r--  1 dmurin users    214420 Jul 26 10:16 part12jul25q13_level_03_000005.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000005.xmp
-rw-rw-r--  1 dmurin users    199510 Jul 26 10:16 part12jul25q13_level_03_000006.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000006.xmp
-rw-rw-r--  1 dmurin users    214562 Jul 26 10:16 part12jul25q13_level_03_000007.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000007.xmp
-rw-rw-r--  1 dmurin users    142142 Jul 26 10:16 part12jul25q13_level_03_000008.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000008.xmp
-rw-rw-r--  1 dmurin users    151727 Jul 26 10:16 part12jul25q13_level_03_000009.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000009.xmp
-rw-rw-r--  1 dmurin users    138450 Jul 26 10:16 part12jul25q13_level_03_000010.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000010.xmp
-rw-rw-r--  1 dmurin users    149739 Jul 26 10:16 part12jul25q13_level_03_000011.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000011.xmp
-rw-rw-r--  1 dmurin users    169051 Jul 26 10:16 part12jul25q13_level_03_000012.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000012.xmp
-rw-rw-r--  1 dmurin users    114310 Jul 26 10:16 part12jul25q13_level_03_000013.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000013.xmp
-rw-rw-r--  1 dmurin users    217402 Jul 26 10:16 part12jul25q13_level_03_000014.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000014.xmp
-rw-rw-r--  1 dmurin users    213923 Jul 26 10:16 part12jul25q13_level_03_000015.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000015.xmp
-rw-rw-r--  1 dmurin users      1814 Jul 26 10:16 part12jul25q13_level_03_.doc
-rw-rw-r--  1 dmurin users      1232 Jul 26 10:16 part12jul25q13_level_03_.sel
drwxrwxr-x 14 dmurin users       126 Jul 25 16:27 partfiles
-rw-rw-r--  1 dmurin users   2997833 Jul 25 16:27 partlist.sel
-rw-rw-r--  1 dmurin users       382 Jul 25 16:13 runXmippCL2D.log
-rw-rw-r--  1 dmurin users        77 Jul 25 16:13 thread000.log
-rw-rw-r--  1 dmurin users        77 Jul 25 16:13 thread001.log
-rw-rw-r--  1 dmurin users       307 Jul 25 16:27 xmipp.log
-rw-rw-r--  1 dmurin users    959977 Jul 26 10:25 xmipp.std

Am I missing something?
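In the listing above, new level files keep appearing every few hours and xmipp.std was modified at 10:25, which suggests this particular job is making progress. A rough sketch of how the same check could be scripted, assuming that "no file modified for a long time" is a reasonable proxy for a stalled run (the function names and the idle threshold are hypothetical):

```python
import os
import time


def newest_file(directory):
    """Return (path, mtime) of the most recently modified regular file, or None."""
    newest = None
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            mtime = os.path.getmtime(path)
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def looks_stalled(directory, max_idle_seconds, now=None):
    """True if no regular file in the directory changed within max_idle_seconds."""
    now = time.time() if now is None else now
    newest = newest_file(directory)
    if newest is None:
        # No output at all is treated as stalled.
        return True
    return now - newest[1] > max_idle_seconds
```

For example, looks_stalled("/ami/data00/appion/12jul23c/align/cl2d2/", 12 * 3600) would flag the run only if nothing had been written for twelve hours. The right threshold depends on how long one CL2D level takes for the data set in question.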

Actions #2

Updated by Dmitry Lyumkis over 12 years ago

JH ran a CL2D job specifying nodes=1:ppn=8, and it worked without problems. When specifying nodes=2:ppn=4, the Xmipp portion stalls. Are there different OpenMPI installations on the different nodes?
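Both requests ask for the same number of processes; the only difference is whether they span more than one node. A small sketch that makes this explicit, assuming the simple numeric N:ppn=M form used above and a default of one process per node when ppn is omitted (parse_nodes_spec is a hypothetical helper):

```python
def parse_nodes_spec(spec):
    """Parse a 'N:ppn=M' resource string into (nodes, procs_per_node)."""
    nodes_part, _, rest = spec.partition(":")
    nodes = int(nodes_part)
    ppn = 1  # assumed default when no ppn= field is given
    for field in rest.split(":"):
        if field.startswith("ppn="):
            ppn = int(field[len("ppn="):])
    return nodes, ppn


for spec in ("1:ppn=8", "2:ppn=4"):
    nodes, ppn = parse_nodes_spec(spec)
    print(spec, "->", nodes * ppn, "processes on", nodes, "node(s)")
```

Since the working and the stalling request both resolve to 8 processes, the process count alone does not explain the stall; what differs is that the second request has to start processes on a second node.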

Actions #3

Updated by Sargis Dallakyan over 12 years ago

  • Status changed from In Test to Closed

Thank you Dmitry. I had moved /usr/local off the worker nodes and mounted it from the head node instead, and this was causing the issue. I copied the following files from /usr/local_old/sbin/ to /usr/local/sbin/ and now it's working fine again:

pbs_demux pbs_iff pbs_mom qnoded
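A check along these lines could be run on each node after a change to /usr/local to confirm the four files listed above are still present and executable. This is only a sketch: the helper name missing_binaries is hypothetical, and the list of names is taken from this ticket rather than from any complete list of what a node needs.

```python
import os

# File names copied back in this ticket; other installations may need more.
REQUIRED = ("pbs_demux", "pbs_iff", "pbs_mom", "qnoded")


def missing_binaries(sbin_dir, names=REQUIRED):
    """Return the names that are absent or not executable in sbin_dir."""
    missing = []
    for name in names:
        path = os.path.join(sbin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

Calling missing_binaries("/usr/local/sbin") on a node returns an empty list when all four files are in place, and the names of the missing ones otherwise.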
