What is ntuple.pmon.gz, and Why Does it go Missing?



Message boards : Number crunching : What is ntuple.pmon.gz, and Why Does it go Missing?

Nick Name
Joined: 22 Jun 14
Posts: 13
Credit: 135,078
RAC: 0
    
Message 5817 - Posted: 14 Dec 2016, 3:06:06 UTC

I posted about this error several days ago in another thread, but as there's been no response I thought I'd start my own. At the time I posted, I had crunched 20 jobs: one validated, 18 were invalid, and I manually aborted the last one. Every failed job had this error, from which I have removed a time stamp:

------
PyJobTransforms.trfExe.preExecute 2016-11-25 13:25:46,420 INFO Now writing wrapper for substep executor EVNTtoHITS
PyJobTransforms.trfExe._writeAthenaWrapper 2016-11-25 13:25:46,420 INFO Valgrind not engaged
PyJobTransforms.trfExe.preExecute 2016-11-25 13:25:46,421 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
Guest Log: PyJobTransforms.trfExe.execute 2016-11-25 13:25:46,421 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
Guest Log: PyJobTransforms.trfExe.execute 2016-11-25 13:34:21,094 INFO EVNTtoHITS executor returns 33
Guest Log: PyJobTransforms.trfExe.postExecute 2016-11-25 13:34:21,124 WARNING Failed to process expected perfMon stats file ntuple.pmon.gz: [Errno 2] No such file or directory: 'ntuple.pmon.gz'
Guest Log: PyJobTransforms.trfExe.validate 2016-11-25 13:34:21,124 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)
PyJobTransforms.trfExe.validate 2016-11-25 13:34:21,141 INFO Scanning logfile log.EVNTtoHITS for errors
Guest Log: PyJobTransforms.transform.execute 2016-11-25 13:34:21,215 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "PyG4AtlasAlg FATAL Standard std::exception is caught"
------
At this point the task terminates.

I don't know which of these is the real problem. Is it "Valgrind not engaged", and that leads to the errors that follow? Why is ntuple.pmon.gz missing; how is it generated? Is it supposed to be bundled with the task or downloaded during the run? This is important because many of my invalid tasks did validate after a third or fourth resend, but several didn't. Most of the ones that didn't had this error.
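One way to narrow this down is to sort the quoted lines by log level. What follows is a minimal Python sketch, not part of the ATLAS tooling: the names `parse_lines` and `summarize` are made up for illustration, and the line pattern is only inferred from the excerpt above. On that excerpt it shows that "Valgrind not engaged" is logged at INFO level, and that the ntuple.pmon.gz warning appears only after the executor has already returned 33, which suggests (but does not prove) that the missing file is a symptom rather than the cause.

```python
import re

# Pattern inferred from the quoted excerpt (an assumption, not an official format).
# Some lines carry an optional "Guest Log: " prefix.
LINE_RE = re.compile(
    r"^(?:Guest Log: )?(?P<logger>\S+) "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<msg>.*)$"
)
RETURN_RE = re.compile(r"executor returns (\d+)")
SEVERITY = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}


def parse_lines(log_text):
    """Return one dict per line that matches the assumed pattern."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def summarize(log_text):
    """Return (first executor return code, WARNING-or-worse entries in log order)."""
    return_code = None
    problems = []
    for entry in parse_lines(log_text):
        found = RETURN_RE.search(entry["msg"])
        if found and return_code is None:
            return_code = int(found.group(1))
        if SEVERITY.get(entry["level"], 0) >= SEVERITY["WARNING"]:
            problems.append((entry["stamp"], entry["level"], entry["msg"]))
    return return_code, problems


# Four lines copied from the excerpt above.
SAMPLE = """\
PyJobTransforms.trfExe._writeAthenaWrapper 2016-11-25 13:25:46,420 INFO Valgrind not engaged
Guest Log: PyJobTransforms.trfExe.execute 2016-11-25 13:34:21,094 INFO EVNTtoHITS executor returns 33
Guest Log: PyJobTransforms.trfExe.postExecute 2016-11-25 13:34:21,124 WARNING Failed to process expected perfMon stats file ntuple.pmon.gz: [Errno 2] No such file or directory: 'ntuple.pmon.gz'
Guest Log: PyJobTransforms.trfExe.validate 2016-11-25 13:34:21,124 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)
"""

if __name__ == "__main__":
    code, problems = summarize(SAMPLE)
    print("executor return code:", code)
    for stamp, level, msg in problems:
        print(stamp, level, msg)
```

The sketch only orders and filters what is already in the log. The underlying failure would still have to be found in log.EVNTtoHITS, which the CRITICAL line points to.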

This affects both Mac and Windows (7 in my case). I didn't see it on any Linux hosts, but I don't know whether that's because they're immune or it just happened that way. It affects VirtualBox 5.0.2 through at least 5.1.8. In my case, I had ATLAS as a backup to Cosmology, where I've been running VirtualBox jobs for months without any real problem. I had successfully run ATLAS in the past, so seeing this many invalid tasks was an unpleasant surprise.

A lot more work would get done if this problem were solved. There's a lot of wasted time on machines that are capable but have this error, and a lot of wasted bandwidth resending these jobs until they get to a good host or finally bomb out.
____________
Team USA form

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 5827 - Posted: 15 Dec 2016, 8:29:03 UTC - in response to Message 5817.

Unfortunately the WUs in the old thread have already been deleted from the database. Can you try running some more now to see whether you still get the same problem?

Nick Name
Joined: 22 Jun 14
Posts: 13
Credit: 135,078
RAC: 0
    
Message 5828 - Posted: 15 Dec 2016, 16:25:50 UTC - in response to Message 5827.
Last modified: 15 Dec 2016, 16:42:47 UTC

Thank you for the response.

I will try some once I have cleared some other things. Right now I have another VM project clogging things up.

In the meantime take a look at this one. Three machines including mine, all with this error.

http://atlasathome.cern.ch/workunit.php?wuid=5915574

[edit]
Here are a couple more I found.

http://atlasathome.cern.ch/workunit.php?wuid=5948664
This was cancelled after three tries. These machines are all returning valid work.

http://atlasathome.cern.ch/workunit.php?wuid=5950771
This one was sent to four hosts before successfully completing. One of these is a Linux machine.
http://atlasathome.cern.ch/show_host_detail.php?hostid=55341
Every machine it failed on is returning valid work.

I found these in just a few minutes manually searching users on the forum. I think someone who can properly query the database can easily find many examples of this problem.[/edit]
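As a rough illustration of that kind of search, here is a minimal Python sketch. It assumes the result records have already been exported as (work unit ID, host ID, stderr text) tuples. That layout, the function name `workunits_with_repeated_failure`, and the `min_hosts` parameter are all hypothetical and do not reflect the project's actual database schema. It lists the work units on which at least two different hosts logged the ntuple.pmon.gz warning.

```python
from collections import defaultdict

# Text of the warning quoted earlier in this thread.
SIGNATURE = "Failed to process expected perfMon stats file ntuple.pmon.gz"


def workunits_with_repeated_failure(results, min_hosts=2):
    """Map each work unit ID to the sorted hosts that logged SIGNATURE.

    `results` is an iterable of (wuid, hostid, stderr_text) tuples
    (a hypothetical export format). Only work units where at least
    `min_hosts` distinct hosts show the warning are returned.
    """
    hosts_by_wu = defaultdict(set)
    for wuid, hostid, stderr_text in results:
        if SIGNATURE in stderr_text:
            hosts_by_wu[wuid].add(hostid)
    return {
        wuid: sorted(hosts)
        for wuid, hosts in hosts_by_wu.items()
        if len(hosts) >= min_hosts
    }


if __name__ == "__main__":
    # Placeholder records for illustration only, not real results.
    example = [
        ("wu-A", "host-1", "WARNING " + SIGNATURE + ": [Errno 2]"),
        ("wu-A", "host-2", "WARNING " + SIGNATURE + ": [Errno 2]"),
        ("wu-B", "host-1", "INFO finished normally"),
    ]
    print(workunits_with_repeated_failure(example))
```

A work unit that fails the same way on several hosts that otherwise return valid work would point toward the task rather than the machines.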
____________
Team USA form
