
Troubleshooting

Submission errors

If your script is accepted into the queue system, msub returns a job id number in its response.  This number is needed to trace any problems with the job and must be included in any problem report.  If msub does not return a job id number, the system did not accept your script.
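One way to make sure the job id is never lost is to capture it at submission time.  The following is a minimal sketch, not a site-provided tool: the helper names extract_job_id and submit_and_record are hypothetical, and it assumes that msub prints the job id on standard output, possibly surrounded by blank lines.

```shell
#!/bin/bash
# Sketch: keep the job id that msub prints so it can be quoted in problem reports.
# Assumption: msub writes the job id to standard output.

# extract_job_id (hypothetical helper): read msub's response on stdin and
# print the first non-blank line with surrounding whitespace removed.
extract_job_id() {
    awk 'NF { gsub(/^[ \t]+|[ \t]+$/, ""); print; exit }'
}

# submit_and_record (hypothetical helper): submit the script given as $1.
# Prints the job id on success; returns 1 if no job id came back.
submit_and_record() {
    local script="$1" jobid
    jobid=$(msub "$script" | extract_job_id)
    if [ -z "$jobid" ]; then
        echo "No job id returned: $script was not accepted" >&2
        return 1
    fi
    echo "$jobid"
}
```

A possible use, with myjob.sh as a placeholder script name, is jobid=$(submit_and_record myjob.sh) followed by saving "$jobid" to a file kept alongside the job's output.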

Debugging scripts

  • Any script can be run through its interpreter (bash, csh, perl, etc.) as an ordinary Linux script.  Try running the script from the command line on the head node to see whether any errors are reported.
  • Printing progress messages at various points in the script with echo or printf commands helps locate where the problem occurs.
  • Adding env and ulimit commands (for example, ulimit -a in bash) at the beginning and end of the script may provide useful information about the environment and resource limits the job runs under.
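The echo/printf and env/ulimit suggestions above can be combined in a job script along the following lines.  This is only a sketch for a bash script; the function names log_stage and dump_state are hypothetical, and the placeholder comment marks where the job's real commands would go.

```shell
#!/bin/bash
# Sketch: progress messages plus environment and limit dumps for a job script.

# log_stage (hypothetical helper): print a timestamped marker so the job
# output shows how far the script got before any failure.
log_stage() {
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

# dump_state (hypothetical helper): record the environment variables and
# resource limits, labelled with $1 (for example "start" or "end").
dump_state() {
    log_stage "environment ($1)"
    env | sort
    log_stage "limits ($1)"
    ulimit -a
}

dump_state "start"
log_stage "starting main work"
# ... the commands the job actually runs go here ...
log_stage "main work finished"
dump_state "end"
```

Comparing the "start" and "end" dumps, and noting the last marker that appears in the job output, narrows down where the script stopped and what it was running with.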

Common mistakes

  • Specifying a non-existent file or directory in the stagein directive will not cause the script to fail, but will place the job in a waiting state.  Torque periodically retries the transfer, so correcting the missing path may resolve the problem.
  • Specifying a non-existent directory in the stageout directive will leave the job completed but in a held state.  The files may be delivered once the directory is created.
  • Requesting resources that the system cannot deliver will cause the job to be placed in the queue but never executed.  If your job never starts, the checkjob command will indicate whether this is the problem.
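
The staging mistakes above can often be caught before submission by checking the paths first.  The sketch below is an illustration under assumptions: the function names check_stagein and check_stageout are hypothetical, and the checks are only meaningful when the staging paths are visible from the host where the check runs.

```shell
#!/bin/bash
# Sketch: verify staging paths before submitting a job.
# Assumption: the paths are visible from the host running this check.

# check_stagein (hypothetical helper): succeed only if the stagein source
# file or directory given as $1 exists.
check_stagein() {
    if [ ! -e "$1" ]; then
        echo "stagein source missing: $1" >&2
        return 1
    fi
}

# check_stageout (hypothetical helper): succeed only if the stageout
# destination given as $1 is an existing directory.
check_stageout() {
    if [ ! -d "$1" ]; then
        echo "stageout directory missing: $1" >&2
        return 1
    fi
}
```

For example, with placeholder paths, check_stagein /path/to/input.dat && check_stageout /path/to/results && msub myjob.sh submits only when both paths exist.  If a job is accepted but never starts, running checkjob with the job id returned by msub is the next step.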