Bug #1948
xmipp CL2D jobs not running and blocking cluster (Closed)
Added by Dmitry Lyumkis over 12 years ago. Updated over 12 years ago.
Description
I've noticed that many of the CL2D jobs that have been launched are not running, that is, they're producing no output, despite having very few (<1000) particles and small box sizes. They are stalled on the cluster, and as a result, no one else can get on. Could this be related to the recent changes to the MPI installation on the nodes that Xmipp uses? Has anyone had a successful CL2D run recently, and if so, in which directory?
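(A hypothetical first check, assuming the cluster runs Torque/PBS as the transcripts below suggest: listing the queue shows which jobs are holding nodes and which are stuck waiting.)

qstat -an          # all jobs, with the nodes each one has been assigned
qstat | grep ' Q ' # only jobs still sitting in the queued state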
Updated by Sargis Dallakyan over 12 years ago
- Status changed from New to In Test
Thank you Dmitry. I have checked the logs, and I can see that Daniel Murin currently has a CL2D job running on guppy.
tracejob -n 3 78614
...
Job: 78614.guppy.emg.nysbc.org

07/25/2012 15:51:04  S  Job Queued at request of dmurin@guppy.emg.nysbc.org, owner = dmurin@guppy.emg.nysbc.org, job name = cl2d2.appionsub.job, queue = batch
07/25/2012 15:51:04  S  Job Modified at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 15:51:04  L  Not enough of the right type of nodes available
07/25/2012 15:51:04  S  enqueuing into batch, state 1 hop 1
07/25/2012 15:51:04  A  queue=batch
07/25/2012 16:13:02  S  Job Modified at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 16:13:02  L  Job Run
07/25/2012 16:13:02  S  Job Run at request of Scheduler@guppy.emg.nysbc.org
07/25/2012 16:13:02  A  user=dmurin group=users jobname=cl2d2.appionsub.job queue=batch ctime=1343256664 qtime=1343256664 etime=1343256664 start=1343257982 owner=dmurin@guppy.emg.nysbc.org exec_host=guppy-24/3+guppy-24/2+guppy-24/1+guppy-24/0 Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1 Resource_List.nodes=1:ppn=4 Resource_List.walltime=240:00:00
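(A job's current state and node placement can also be double-checked directly; this is a hypothetical follow-up using standard Torque commands, with 78614 being the job ID from the trace above.)

qstat -f 78614 | grep -E 'job_state|exec_host|Resource_List'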
I did ssh to guppy-24 and saw xmipp_mpi_class running there:
[root@guppy-24 jobs]# top
top - 10:24:44 up 113 days, 3 min,  1 user,  load average: 4.02, 4.01, 3.93
Tasks: 104 total,   5 running,  99 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.8%us,  0.1%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16440060k total, 16266248k used,   173812k free,   109580k buffers
Swap:  2096472k total,    15100k used,  2081372k free, 15800352k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19327 dmurin    25   0  191m  23m 6176 R 99.8  0.1  1056:13  xmipp_mpi_class
19328 dmurin    18   0  191m  23m 6036 R 99.8  0.1  1056:39  xmipp_mpi_class
19329 dmurin    18   0  191m  23m 6036 R 99.8  0.1  1055:55  xmipp_mpi_class
19330 dmurin    25   0  191m  23m 6040 R 99.8  0.1  1055:47  xmipp_mpi_class
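All four xmipp_mpi_class processes are pegged near 100% CPU with over 1000 minutes of CPU time each, which looks like real computation rather than a hang. One simple way to confirm this (not part of the original transcript) is to check whether the job's output files keep getting newer timestamps:

ls -lt /ami/data00/appion/12jul23c/align/cl2d2/ | head  # newest first; advancing timestamps mean progress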
The folder where this job runs has the following files:
[root@guppy-24 jobs]# ls -al /ami/data00/appion/12jul23c/align/cl2d2/
total 913868
drwxrwxr-x  3 dmurin users      4096 Jul 26 08:21 .
drwxrwxr-x  6 dmurin users        58 Jul 25 15:52 ..
-rw-rw-r--  1 dmurin users  43236352 Jul 25 16:25 12jul25q13.hed
-rw-rw-r--  1 dmurin users 875536128 Jul 25 16:25 12jul25q13.img
-rw-rw-r--  1 dmurin users       544 Jul 25 15:51 cl2d2.appionsub.job
-rw-rw-r--  1 dmurin users     69573 Jul 25 16:27 cl2d2.appionsub.log
-rw-rw-r--  1 dmurin users       423 Jul 25 16:25 .emanlog
-rw-rw-r--  1 dmurin users   1103127 Jul 25 18:05 part12jul25q13_level_00_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 18:05 part12jul25q13_level_00_000000.xmp
-rw-rw-r--  1 dmurin users   1894706 Jul 25 18:05 part12jul25q13_level_00_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 18:05 part12jul25q13_level_00_000001.xmp
-rw-rw-r--  1 dmurin users       276 Jul 25 18:05 part12jul25q13_level_00_.doc
-rw-rw-r--  1 dmurin users       154 Jul 25 18:05 part12jul25q13_level_00_.sel
-rw-rw-r--  1 dmurin users    918527 Jul 25 22:55 part12jul25q13_level_01_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000000.xmp
-rw-rw-r--  1 dmurin users    808903 Jul 25 22:55 part12jul25q13_level_01_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000001.xmp
-rw-rw-r--  1 dmurin users    425077 Jul 25 22:55 part12jul25q13_level_01_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000002.xmp
-rw-rw-r--  1 dmurin users    845326 Jul 25 22:55 part12jul25q13_level_01_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 25 22:55 part12jul25q13_level_01_000003.xmp
-rw-rw-r--  1 dmurin users       497 Jul 25 22:55 part12jul25q13_level_01_.doc
-rw-rw-r--  1 dmurin users       308 Jul 25 22:55 part12jul25q13_level_01_.sel
-rw-rw-r--  1 dmurin users    531009 Jul 26 05:53 part12jul25q13_level_02_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000000.xmp
-rw-rw-r--  1 dmurin users    419965 Jul 26 05:53 part12jul25q13_level_02_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000001.xmp
-rw-rw-r--  1 dmurin users    255316 Jul 26 05:53 part12jul25q13_level_02_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000002.xmp
-rw-rw-r--  1 dmurin users    286840 Jul 26 05:53 part12jul25q13_level_02_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000003.xmp
-rw-rw-r--  1 dmurin users    487912 Jul 26 05:53 part12jul25q13_level_02_000004.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000004.xmp
-rw-rw-r--  1 dmurin users    427917 Jul 26 05:53 part12jul25q13_level_02_000005.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000005.xmp
-rw-rw-r--  1 dmurin users    281089 Jul 26 05:53 part12jul25q13_level_02_000006.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000006.xmp
-rw-rw-r--  1 dmurin users    307785 Jul 26 05:53 part12jul25q13_level_02_000007.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 05:53 part12jul25q13_level_02_000007.xmp
-rw-rw-r--  1 dmurin users       934 Jul 26 05:53 part12jul25q13_level_02_.doc
-rw-rw-r--  1 dmurin users       616 Jul 26 05:53 part12jul25q13_level_02_.sel
-rw-rw-r--  1 dmurin users    160247 Jul 26 10:16 part12jul25q13_level_03_000000.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000000.xmp
-rw-rw-r--  1 dmurin users    231318 Jul 26 10:16 part12jul25q13_level_03_000001.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000001.xmp
-rw-rw-r--  1 dmurin users    235720 Jul 26 10:16 part12jul25q13_level_03_000002.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000002.xmp
-rw-rw-r--  1 dmurin users    241471 Jul 26 10:16 part12jul25q13_level_03_000003.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000003.xmp
-rw-rw-r--  1 dmurin users    203841 Jul 26 10:16 part12jul25q13_level_03_000004.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000004.xmp
-rw-rw-r--  1 dmurin users    214420 Jul 26 10:16 part12jul25q13_level_03_000005.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000005.xmp
-rw-rw-r--  1 dmurin users    199510 Jul 26 10:16 part12jul25q13_level_03_000006.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000006.xmp
-rw-rw-r--  1 dmurin users    214562 Jul 26 10:16 part12jul25q13_level_03_000007.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000007.xmp
-rw-rw-r--  1 dmurin users    142142 Jul 26 10:16 part12jul25q13_level_03_000008.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000008.xmp
-rw-rw-r--  1 dmurin users    151727 Jul 26 10:16 part12jul25q13_level_03_000009.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000009.xmp
-rw-rw-r--  1 dmurin users    138450 Jul 26 10:16 part12jul25q13_level_03_000010.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000010.xmp
-rw-rw-r--  1 dmurin users    149739 Jul 26 10:16 part12jul25q13_level_03_000011.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000011.xmp
-rw-rw-r--  1 dmurin users    169051 Jul 26 10:16 part12jul25q13_level_03_000012.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000012.xmp
-rw-rw-r--  1 dmurin users    114310 Jul 26 10:16 part12jul25q13_level_03_000013.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000013.xmp
-rw-rw-r--  1 dmurin users    217402 Jul 26 10:16 part12jul25q13_level_03_000014.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000014.xmp
-rw-rw-r--  1 dmurin users    213923 Jul 26 10:16 part12jul25q13_level_03_000015.sel
-rw-rw-r--  1 dmurin users     21888 Jul 26 10:16 part12jul25q13_level_03_000015.xmp
-rw-rw-r--  1 dmurin users      1814 Jul 26 10:16 part12jul25q13_level_03_.doc
-rw-rw-r--  1 dmurin users      1232 Jul 26 10:16 part12jul25q13_level_03_.sel
drwxrwxr-x 14 dmurin users       126 Jul 25 16:27 partfiles
-rw-rw-r--  1 dmurin users   2997833 Jul 25 16:27 partlist.sel
-rw-rw-r--  1 dmurin users       382 Jul 25 16:13 runXmippCL2D.log
-rw-rw-r--  1 dmurin users        77 Jul 25 16:13 thread000.log
-rw-rw-r--  1 dmurin users        77 Jul 25 16:13 thread001.log
-rw-rw-r--  1 dmurin users       307 Jul 25 16:27 xmipp.log
-rw-rw-r--  1 dmurin users    959977 Jul 26 10:25 xmipp.std
Am I missing something?
Updated by Dmitry Lyumkis over 12 years ago
JH ran a CL2D job specifying nodes=1:ppn=8, and it worked without problems. When specifying nodes=2:ppn=4, the Xmipp portion stalls. Are there different OpenMPI installations on the different nodes?
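A minimal cross-node launch test would help isolate this (hypothetical; the node names and slot counts below are illustrative, assuming OpenMPI's mpirun is on the PATH of each node):

# hosts file, one worker node per line, e.g.:
#   guppy-23 slots=4
#   guppy-24 slots=4
mpirun -np 8 --hostfile hosts hostname

If hostnames from both nodes come back, inter-node MPI launch works and the problem is elsewhere; if the command hangs, the multi-node MPI setup itself is broken.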
Updated by Sargis Dallakyan over 12 years ago
- Status changed from In Test to Closed
Thank you Dmitry. I had /usr/local moved off the work nodes and mounted from the head node instead; this was causing the issue. I copied the following files from /usr/local_old/sbin/ to /usr/local/sbin/ and now it's working fine again:
pbs_demux pbs_iff pbs_mom qnoded
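For the record, the restore amounted to something like the following (the pbs_mom restart step is an assumption; the daemons on this setup may have been restarted differently):

# Copy the Torque daemons back into /usr/local/sbin on each worker node.
cp -p /usr/local_old/sbin/pbs_demux /usr/local_old/sbin/pbs_iff \
      /usr/local_old/sbin/pbs_mom /usr/local_old/sbin/qnoded /usr/local/sbin/

# Restart the node daemon so it runs from the restored binary.
service pbs_mom restart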