Nothing but validate errors



Message boards : Number crunching : Nothing but validate errors

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
    
Message 5491 - Posted: 14 Oct 2016, 3:36:58 UTC
Last modified: 14 Oct 2016, 3:37:42 UTC

I checked a few other users, and we all seem to be getting nothing but validate errors after 7 or so minutes.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5492 - Posted: 14 Oct 2016, 5:15:32 UTC - in response to Message 5491.
Last modified: 14 Oct 2016, 5:28:38 UTC

Same thing here, on both of my PCs, with single-core as well as multi-core tasks. What's going wrong?

Also, the status column of the BOINC Manager often says "Postponed: VM Environment needed to be cleaned up".

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5493 - Posted: 14 Oct 2016, 5:52:15 UTC

What also catches my eye:

As was the case before with the faulty WUs from application version 1.05 (which has since been withdrawn), the crunching process does not use any CPU.
This can be seen in the "Processes" tab of the Windows Task Manager: there are a lot of VBoxHeadless.exe processes, but none of them shows any CPU usage.

As said, the same thing happened with the faulty WUs under application 1.05. Now this is obviously also true under application 1.04.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5494 - Posted: 14 Oct 2016, 10:10:57 UTC

No news on this?

No other crunchers experiencing these problems?

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 5495 - Posted: 14 Oct 2016, 13:14:34 UTC

On all machines, nothing but validation errors - all multicore tasks.
I had to suspend this project under these circumstances.
Before you change anything, you NEED to test things in house, then release it to the beta testers and only then release it to the regular crunchers. PLEASE.

Michael.
____________
President of Rechenkraft.net

rbpeake
Joined: 27 Jun 14
Posts: 86
Credit: 8,794,961
RAC: 0
    
Message 5496 - Posted: 14 Oct 2016, 13:55:47 UTC - in response to Message 5495.

On all machines, nothing but validation errors - all multicore tasks.
I had to suspend this project under these circumstances.
Before you change anything, you NEED to test things in house, then release it to the beta testers and only then release it to the regular crunchers. PLEASE.

Michael.

Agreed. There needs to be some procedure put in place to avoid this happening in the future.
The loss of simulation time to ATLAS is bad enough, but so is the frustration to BOINC participants, some of whom may move to other projects and be lost to ATLAS.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5497 - Posted: 14 Oct 2016, 14:02:51 UTC - in response to Message 5495.
Last modified: 14 Oct 2016, 14:04:30 UTC

On all machines, nothing but validation errors - all multicore tasks.

Well, as said in my posting above, I had these problems with multicore PLUS single core tasks, even on two different systems.

For the time being, I have stopped ATLAS on both machines and have switched to WCG.
Still, it would be great if ATLAS is functioning well very soon.

Jim1348
Joined: 15 Nov 14
Posts: 111
Credit: 1,319,885
RAC: 0
    
Message 5498 - Posted: 14 Oct 2016, 14:26:27 UTC

I don't mind the errors so much, but VirtualBox has regularly caused me other problems on both Linux and Windows. On Ubuntu 16.04 I get regular error messages about crashes, which are usually harmless but sometimes actually hang the machine. That happened while running ATLAS on one machine and vLHC on another, which is not good. And on Win7 64-bit, Cosmology would run fine for a while but then cause crashes, or at least interfere with a video recording program that I use.

I have had to give up VBox, though the concept is nice.

Phil1966
Joined: 14 Jun 14
Posts: 39
Credit: 1,185,758
RAC: 0
    
Message 5499 - Posted: 14 Oct 2016, 16:20:25 UTC

Same here :(
Almost all WUs ended in errors today :(((
Not fun to spend so much money on crunching for nothing.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5500 - Posted: 14 Oct 2016, 16:43:31 UTC - in response to Message 5499.

A few minutes ago, I re-tried two of the single-core WUs.
Still same problem: they get uploaded after a few minutes, and then, of course, show a "validation error".

As far as I could see from the project page, both many single-core and also multi-core WUs are provided and downloaded by crunchers.
What I am wondering is whether they work well on many systems, or whether a large number of crunchers still have not noticed what is going on.

Would be nice if a cruncher with working WUs could post here just a few words.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 5501 - Posted: 14 Oct 2016, 17:05:42 UTC

Hi all,

Today there is a general infrastructure problem with some ATLAS services that our WUs use (as do all the tasks running on the Grid). Experts are currently repairing the problem.

rbpeake
Joined: 27 Jun 14
Posts: 86
Credit: 8,794,961
RAC: 0
    
Message 5502 - Posted: 14 Oct 2016, 17:06:23 UTC - in response to Message 5500.

...
What I am wondering is whether they work well on many systems, or whether a large number of crunchers still have not noticed what is going on.

Would be nice if a cruncher with working WUs could post here just a few words.

I am still running some bad work units because I do not have immediate access to some machines. And I might not have checked today at all.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 5503 - Posted: 14 Oct 2016, 17:45:48 UTC

I think things are fixed now: I have one WU that has been running for 15 minutes and it's using 100% CPU. Sorry for the inconvenience. Many ATLAS experts are at a conference in San Francisco this week, so it took some time to resolve the issue.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5504 - Posted: 14 Oct 2016, 18:03:50 UTC - in response to Message 5501.

Hi all,
Today there is a general infrastructure problem with some ATLAS services that our WUs use (as do all the tasks running on the Grid). Experts are currently repairing the problem.

thanks, David, for the information :-)

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5505 - Posted: 14 Oct 2016, 19:22:57 UTC

I have now tried two single-core WUs again. They have been running for some 70 minutes already, with full CPU usage and full RAM usage, so I am optimistic.
Later on, I will try the multicore WUs on the other machine and see what happens.

47an
Joined: 14 Jan 15
Posts: 24
Credit: 10,732,116
RAC: 0
    
Message 5506 - Posted: 14 Oct 2016, 20:41:04 UTC
Last modified: 14 Oct 2016, 20:43:02 UTC

looks fine here with (mt)

Thanks David

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5507 - Posted: 15 Oct 2016, 6:39:32 UTC - in response to Message 5505.

Later on, I will try the multicore WUs on the other machine and see what happens.

The multicore WUs that were running last night turned out to be okay.
So it seems that we're back to "normal" :-)

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 0
    
Message 5513 - Posted: 18 Oct 2016, 18:18:51 UTC

On my proxy I currently observe a lot of malformed requests generated by the ATLAS VMs:

GET /frontierATLAS/Frontier/type=frontier_request...


Normal requests have the following format:

GET http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier/type=frontier_request...


The WUs with validation errors from last week were also preceded by a couple of those malformed requests.
This may be an indicator of upcoming errors.
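
For anyone who wants to watch for this on their own proxy, here is a minimal sketch in Python. It assumes only that each log line contains the request as "GET <target>", as in the two examples above; the actual log layout depends on the proxy, and the function names are made up for illustration. It counts Frontier requests whose target is a bare path (the malformed form) versus a full http:// URL (the normal form).

```python
import re

# Assumption: the request appears somewhere in the log line as "GET <target>".
REQUEST_RE = re.compile(r"\bGET\s+(\S+)")

def classify_request(line):
    """Return 'malformed', 'normal', or None for a single proxy log line."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None  # no GET request on this line
    target = match.group(1)
    if "/frontierATLAS/Frontier/" not in target:
        return None  # not a Frontier request
    if target.startswith(("http://", "https://")):
        return "normal"  # absolute URL, as a proxy expects
    if target.startswith("/"):
        return "malformed"  # bare path without scheme and host
    return None

def count_frontier_requests(lines):
    """Count normal and malformed Frontier requests in an iterable of lines."""
    counts = {"normal": 0, "malformed": 0}
    for line in lines:
        kind = classify_request(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Feeding it the lines of an access log (for example via `open(path)`) gives two counters. A sudden rise in the "malformed" count would be the warning sign described above, though whether it reliably predicts validation errors is not established.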

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 0
    
Message 5515 - Posted: 18 Oct 2016, 19:23:10 UTC

The validation errors are back.
https://lhcathome.cern.ch/ATLAS/result.php?resultid=7448893
https://lhcathome.cern.ch/ATLAS/result.php?resultid=7449019

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 5516 - Posted: 19 Oct 2016, 10:52:29 UTC

I, too, have had a few WUs with the validation error since yesterday.

What also seems to be back: multicore WUs that proceed "normally" until the percentage reaches some value above 90%, at which point the processing speed drops dramatically. In fact, as I have seen in previous cases, they finally reach 99.99% after more than a day (sometimes after 2 days), but they never reach 100%.

What's going wrong?
