Report never ending tasks here


Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3025 - Posted: 23 Sep 2015, 9:33:30 UTC

In addition to the clearly detectable 'kernel panic' issue and WUs that get invalidated on machines which otherwise return useful results, a third class of error involves WUs that show no kernel panic but never end.

Please let us systematically collect exactly those WU types here:

http://atlasathome.cern.ch/result.php?resultid=1927407

Michael.
____________
President of Rechenkraft.net

lkraider
Joined: 22 May 15
Posts: 17
Credit: 81,631
RAC: 0
    
Message 3037 - Posted: 25 Sep 2015, 3:33:24 UTC
Last modified: 25 Sep 2015, 3:35:03 UTC

I just noticed an EXT4 filesystem mount error:



The task elapsed time was 133h40m.

http://atlasathome.cern.ch/result.php?resultid=2244520

Francis Butts
Joined: 8 Nov 14
Posts: 3
Credit: 244,837
RAC: 0
    
Message 3067 - Posted: 29 Sep 2015, 14:47:32 UTC

The following WU is showing 100% completion on the computer in use. However, it will not "end"; the accumulated time has been increasing for about 4 days now.
Name: 2TXMDm7sZzmnDDn7oo6G73TpABFKDmABFKDm2fNKDmABFKDmIo6dOo_0
Workunit: 1737059
Created: 25 Sep 2015, 4:34:39 UTC
Sent: 25 Sep 2015, 7:06:24 UTC
Report deadline: 9 Oct 2015, 7:06:24 UTC
Received: ---
Server state: In progress
Outcome: ---
Client state: New
Exit status: 0 (0x0)
Computer ID: 8457
Any idea what may be happening?

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
    
Message 3071 - Posted: 29 Sep 2015, 16:30:40 UTC

Talk about never ending tasks!!!

Even after hours with no tasks available, there are still 5222 tasks out in the field, according to the server status page!

Can the server abort all those tasks so that they become available for the rest of us?

Rasputin42
Joined: 7 Jul 14
Posts: 25
Credit: 80,709
RAC: 0
    
Message 3072 - Posted: 29 Sep 2015, 17:21:29 UTC - in response to Message 3071.

Well, some people are still crunching them.
Would you like your tasks to be taken away from you?

JSE
Joined: 14 Feb 15
Posts: 137
Credit: 9,505,329
RAC: 0
    
Message 3073 - Posted: 29 Sep 2015, 17:33:07 UTC - in response to Message 3072.

Tom, that is a bit harsh. Think about the other participants. Everyone tries to do their best to participate in the project. You can't suddenly abort their WUs.

It would be better if the Atlas@home team had new WUs ready earlier. They should set a trigger for when the 'to be sent' queue is about to run dry, instead of waiting for a message from the crowd saying we are out of WUs.

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3140 - Posted: 20 Oct 2015, 8:45:43 UTC

http://atlasathome.cern.ch/result.php?resultid=2541657

Michael.
____________
President of Rechenkraft.net

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3143 - Posted: 21 Oct 2015, 13:44:00 UTC

Today, two more:

http://atlasathome.cern.ch/result.php?resultid=2548460
http://atlasathome.cern.ch/result.php?resultid=2551831

Michael.
____________
President of Rechenkraft.net

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3144 - Posted: 21 Oct 2015, 13:48:44 UTC - in response to Message 3140.

http://atlasathome.cern.ch/result.php?resultid=2541657

How can this task now even be validated with only one WU being returned and another one aborted?

Michael.
____________
President of Rechenkraft.net

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3146 - Posted: 22 Oct 2015, 9:49:34 UTC

Another one: http://atlasathome.cern.ch/result.php?resultid=2555783

Michael.
____________
President of Rechenkraft.net

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3147 - Posted: 22 Oct 2015, 9:54:54 UTC - in response to Message 3143.

http://atlasathome.cern.ch/result.php?resultid=2548460

The corresponding task has now been validated, but I do not understand why: there are only three tasks, of which one is mine (which I had to abort), one was NOT validated, and the third was then validated:

http://atlasathome.cern.ch/workunit.php?wuid=1909106

But on what basis? Where is the second valid task?

Michael.
____________
President of Rechenkraft.net

PDW
Joined: 7 Feb 15
Posts: 78
Credit: 2,842,304
RAC: 0
    
Message 3148 - Posted: 22 Oct 2015, 10:48:39 UTC - in response to Message 3147.

http://atlasathome.cern.ch/result.php?resultid=2548460

The corresponding task has now been validated, but I do not understand why: there are only three tasks, of which one is mine (which I had to abort), one was NOT validated, and the third was then validated:

http://atlasathome.cern.ch/workunit.php?wuid=1909106

But on what basis? Where is the second valid task?

Michael.

These tasks have a minimum quorum of 1.
They are not validated against other users' results; the first one to get 'it right', according to whatever the project thinks 'is right', is all that is needed.

A job will fail if no-one gets 'it right' after the specified number of task attempts/errors.

Rasputin42 asked in the other thread, but got no answer: what is the point of reporting these various errors? Who is doing anything with them?

I would hope that the admins are looking at errors, their causes and possible fixes on an ongoing basis. I see no evidence that they are doing anything with these reported tasks. They might be, but they are very quiet on the matter. Only when it was catastrophic and all tasks failed did anything happen.
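
For anyone trying to picture the quorum-of-1 rule described above, here is a minimal sketch in Python. It is not the BOINC server code: the outcome labels, the function name and the error limit of 3 are assumptions chosen only to match the three-task workunit discussed in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    host: str
    outcome: str  # hypothetical labels: "success", "error", "aborted", "in_progress"


def workunit_state(results: List[Result],
                   min_quorum: int = 1,
                   max_error_results: int = 3) -> str:
    """Return 'validated', 'failed' or 'pending' for one workunit.

    With min_quorum=1 a single successful result is enough; no second
    result is needed for comparison. The limit of 3 is an assumption.
    """
    successes = sum(1 for r in results if r.outcome == "success")
    failures = sum(1 for r in results if r.outcome in ("error", "aborted"))
    if successes >= min_quorum:
        return "validated"
    if failures >= max_error_results:
        return "failed"
    return "pending"
```

Under these assumptions, a workunit with one aborted task, one invalid task and one successful task comes out as "validated", which is the pattern seen in workunit 1909106.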

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3151 - Posted: 23 Oct 2015, 7:19:04 UTC - in response to Message 3148.

These tasks have a minimum quorum of 1.
They are not validated against other users' results; the first one to get 'it right', according to whatever the project thinks 'is right', is all that is needed.

A job will fail if no-one gets 'it right' after the specified number of task attempts/errors.

Thank you - I did not know that. But one more question: What are the criteria to decide whether a job 'is right' or not?

Rasputin42 asked in the other thread, but got no answer: what is the point of reporting these various errors? Who is doing anything with them?

I would hope that the admins are looking at errors, their causes and possible fixes on an ongoing basis. I see no evidence that they are doing anything with these reported tasks. They might be, but they are very quiet on the matter. Only when it was catastrophic and all tasks failed did anything happen.

Well, the only thing I can contribute is to sort these errors into categories to make them more accessible for analysis by the project team.
If they do not do anything with it, then, sooner or later, participant numbers might decline. That in turn might ultimately become a 'boomerang' if the funding for the CERN compute cluster is reduced or cancelled.

But, anyway, here is a new set of 'never ending WUs':

http://atlasathome.cern.ch/result.php?resultid=2563712
http://atlasathome.cern.ch/result.php?resultid=2566037

Michael.
____________
President of Rechenkraft.net

lancone
Project administrator
Project developer
Project tester
Project scientist
Joined: 26 May 14
Posts: 219
Credit: 53,963
RAC: 0
    
Message 3152 - Posted: 23 Oct 2015, 8:23:33 UTC - in response to Message 3151.

Hello,

Thank you - I did not know that. But one more question: What are the criteria to decide whether a job 'is right' or not?


A job is validated if all events assigned to that job are processed and the result shipped back to the BOINC server.

There are large variations in job duration for various reasons: power/configuration of the host computers, fluctuations in the processing of events, etc.

Regards

Daykay
Joined: 14 Oct 15
Posts: 4
Credit: 33,392
RAC: 0
    
Message 3156 - Posted: 25 Oct 2015, 13:46:23 UTC

I was so looking forward to seeing the results of my first completed tasks for this project but alas I've discovered two more never ending tasks:

http://atlasathome.cern.ch/result.php?resultid=2493839
http://atlasathome.cern.ch/result.php?resultid=2494096

PDW
Joined: 7 Feb 15
Posts: 78
Credit: 2,842,304
RAC: 0
    
Message 3157 - Posted: 25 Oct 2015, 15:01:52 UTC - in response to Message 3156.

I was so looking forward to seeing the results of my first completed tasks for this project but alas I've discovered two more never ending tasks:

http://atlasathome.cern.ch/result.php?resultid=2493839
http://atlasathome.cern.ch/result.php?resultid=2494096

Have you enabled VT-x in your BIOS?
Your CPU does have it, but it needs to be turned on in the BIOS.

Also, 4 GB of memory is probably not enough to run one Atlas task by itself, let alone two at the same time.
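
For Linux hosts, a rough way to see whether the CPU advertises hardware virtualization is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) flag in `/proc/cpuinfo`. The sketch below is only that, a sketch: the function name is made up for illustration, and finding the flag does not by itself prove that the feature is switched on in the BIOS.

```python
def advertises_hw_virtualization(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists vmx (Intel) or svm (AMD).

    cpuinfo_text is expected to be the contents of /proc/cpuinfo on Linux.
    This reports what the CPU advertises; the BIOS setting still has to be
    checked separately.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            if {"vmx", "svm"} & set(value.split()):
                return True
    return False
```

On a Linux machine it could be called as `advertises_hw_virtualization(open("/proc/cpuinfo").read())`. On Windows hosts this file does not exist, so a different check is needed there.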

Tom*
Joined: 28 Jun 14
Posts: 118
Credit: 8,761,428
RAC: 0
    
Message 3158 - Posted: 25 Oct 2015, 17:33:36 UTC

Also, Oracle Support says that VERR_LDR_MISMATCH_NATIVE is usually a permission problem and/or an anti-virus program interfering with VirtualBox.

The other constraints PDW mentioned also apply.

Michael H.W. Weber
Joined: 10 Jan 15
Posts: 108
Credit: 1,552,848
RAC: 0
    
Message 3159 - Posted: 26 Oct 2015, 9:28:31 UTC

Another one:

http://atlasathome.cern.ch/result.php?resultid=2597142

Almost 10 hrs of processing where 3-5 hrs is normal.
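
A simple rule of thumb for spotting such tasks can be sketched as follows. The 5-hour upper bound comes from the "3-5 hrs is normal" figure above; the safety factor of 1.5 and both function names are assumptions, and given the large variations in job duration the threshold would need tuning per host.

```python
import re


def parse_elapsed_hours(text: str) -> float:
    """Convert an elapsed-time string such as '133h40m' into hours."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised elapsed time: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours + minutes / 60.0


def looks_never_ending(elapsed_hours: float,
                       normal_max_hours: float = 5.0,
                       factor: float = 1.5) -> bool:
    """Flag a task whose runtime exceeds factor * normal_max_hours.

    Both defaults are assumptions for illustration, not project settings.
    """
    return elapsed_hours > normal_max_hours * factor
```

With these defaults, a task that has been running for almost 10 hours is flagged, and so is the 133h40m task reported earlier in this thread, while a 4-hour task is not.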

Michael.
____________
President of Rechenkraft.net

Daykay
Joined: 14 Oct 15
Posts: 4
Credit: 33,392
RAC: 0
    
Message 3161 - Posted: 26 Oct 2015, 13:00:54 UTC - in response to Message 3157.

Thanks PDW, virtualization has now been activated in BIOS.

I have a new unit started now so results will hopefully be forthcoming, potential memory shortcomings notwithstanding. One issue at a time ;)

Hand
Joined: 27 Oct 15
Posts: 1
Credit: 32,850
RAC: 0
    
Message 3194 - Posted: 30 Oct 2015, 0:09:58 UTC

I'm having the same issue.
BOINC has been showing 100% since yesterday:
http://atlasathome.cern.ch/result.php?resultid=2604253
