Author Topic: how to stop the calculation  (Read 10107 times)

golden

  • Full Member
  • ***
  • Posts: 34
  • Karma: +0/-0
how to stop the calculation
« on: January 23, 2012, 05:35:30 PM »
Hi, I ran a calculation using jobex:

Code:
nohup jobex -ri -c 200 -energy 8 actual -r > jobex.out &
At the moment it seems to be running, but it hasn't written anything new for ages. I tried to stop the calculation by creating a file called stop with "touch stop", but it is not responding.
I also tried to kill it from top, but that does not seem to work either. My next option is to restart the computer, but I would rather not go there. Is there any other way to kill the calculation?

Thank you very much.


antti_karttunen

  • Sr. Member
  • ****
  • Posts: 227
  • Karma: +1/-0
Re: how to stop the calculation
« Reply #1 on: January 23, 2012, 06:20:24 PM »
Hi,

When you used "top", which Turbomole executable was running at the moment? And what signal did you use to kill it? (9 should do it if the default 15 does not help.) I've never seen a TM executable hang so badly that kill did not work.
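The "try 15 first, then 9" sequence can also be run from the command line instead of top. The following is only a sketch: stop_pid is a made-up helper name, and the 5-second grace period is an arbitrary assumption, not anything Turbomole prescribes.

```shell
# Hypothetical helper: send SIGTERM (15) first, then SIGKILL (9)
# if the process is still alive after a grace period.
stop_pid() {
    target=$1
    grace=${2:-5}                      # seconds to wait after SIGTERM (assumed default)
    # If the signal cannot be delivered, the process is already gone.
    kill -15 "$target" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only checks whether the PID still exists.
        kill -0 "$target" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still alive after the grace period: escalate to SIGKILL.
    kill -9 "$target" 2>/dev/null
    return 0
}
```

For the PID shown later in this thread, the call would be "stop_pid 3061".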

By the way, was the "actual -r" part really included in your command line or is that a typo? (actual -r is a separate command that should be executed in a TM job directory, not used as a jobex switch).

Regards,
Antti

golden

Re: how to stop the calculation
« Reply #2 on: January 23, 2012, 09:27:05 PM »
Quote
When you used "top", which Turbomole executable was running at the moment? And what signal did you use to kill it? (9 should do it if the default 15 does not help.) I've never seen a TM executable hang so badly that kill did not work.

rdgrad_mpi was running at the time I issued the kill command in top.
Yes, I think I used 15 (which is the default):
Quote
Kill PID 3061 with signal [15]:

As suggested, I tried using 9 to kill the job, as follows:
Quote
Kill PID 3061 with signal [15]: 9
 
It still did not work.

Quote
By the way, was the "actual -r" part really included in your command line or is that a typo? (actual -r is a separate command that should be executed in a TM job directory, not used as a jobex switch).

Please correct me on this, as I did use actual -r in the command line:
Quote
nohup jobex -ri -c 200 -energy 8 actual -r > jobex.out &
If it's not supposed to be in the command line, how should I issue "actual -r", and where?

I really appreciate the help given.

Thanks

antti_karttunen

Re: how to stop the calculation
« Reply #3 on: January 24, 2012, 08:31:49 AM »
Huh, sounds strange. Maybe if you list all TM-related processes and try to "kill" them starting from the bottom? (use kill -9 PID). For example:

[antti@compute-0-4 ~]$ ps -fu antti
UID        PID  PPID  C STIME TTY          TIME CMD
antti    10353 10333  0 09:24 ?        00:00:00 -ksh /opt/gridengine/default/spool/compute-0-4/job_scripts/16438
antti    10646 10353  0 09:24 ?        00:00:00 /bin/sh /joy/chem/turbomole/tm63/scripts/jobex -ri
antti    10759 10646  0 09:24 ?        00:00:00 /bin/sh /joy/chem/turbomole/tm63/bin/em64t-unknown-linux-gnu_mpi/ridft -l /joy/chem/turbomole/tm63/bin/em64t-unknown-linux-gnu_mpi
antti    10975 10759  0 09:24 ?        00:00:00 mpirun.mpich -f ./hp_mpi_appfile
antti    10978 10975  0 09:24 ?        00:00:00 /joy/chem/turbomole/tm63/mpirun_scripts/em64t-unknown-linux-gnu_mpi/HPMPI/bin/mpid ...
antti    11092 10978  0 09:24 ?        00:00:00 /joy/chem/turbomole/tm63/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi
antti    11093 10978 96 09:24 ?        00:00:23 /joy/chem/turbomole/tm63/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi
antti    11094 10978 97 09:24 ?        00:00:23 /joy/chem/turbomole/tm63/bin/em64t-unknown-linux-gnu_mpi/ridft_mpi
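Walking such a listing from the bottom up can be scripted. This is a rough sketch only: descendants and kill_tree are hypothetical helper names, and it assumes a ps that accepts "-e -o pid= -o ppid=" (as procps on Linux does). It collects the children of a given PID recursively, deepest first, and sends signal 9 to each before the top-level process itself.

```shell
# Hypothetical helper: print all descendants of a PID, deepest first.
# The body runs in a subshell ( ... ) so that the loop variable stays
# local to each level of the recursion.
descendants() (
    for child in $(ps -e -o pid= -o ppid= | awk -v p="$1" '$2 == p { print $1 }'); do
        descendants "$child"
        echo "$child"
    done
)

# Hypothetical helper: kill -9 the whole tree, children before parents.
kill_tree() {
    for pid in $(descendants "$1") "$1"; do
        kill -9 "$pid" 2>/dev/null
    done
    return 0
}
```

With the listing above, "kill_tree 10646" would go through the ridft_mpi processes, mpid, mpirun.mpich and the ridft wrapper before reaching jobex itself. Running "descendants 10646" first shows what would be killed.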

For the jobex command line, I would use the following:
nohup jobex -ri -c 200 -energy 8 > jobex.out &
(although -energy 8 sounds a bit tight for DFT)

You can run actual -r in the TM job directory at any time. For example, if your jobex job has crashed, you could issue actual -r (after fixing the real problem):

[antti@sandels tm_test]$ actual -r
ridft step seems to have been in serious trouble
[antti@sandels tm_test]$

Hope this helps,
Antti

golden

Re: how to stop the calculation
« Reply #4 on: January 24, 2012, 04:50:47 PM »
Hi,

Quote
Huh, sounds strange. Maybe if you list all TM-related processes and try to "kill" them starting from the bottom? (use kill -9 PID). For example:

Thank you very much, it worked really nicely.

 :) :) :)