CRITICAL error TransformValidationException



Message boards : Number crunching : CRITICAL error TransformValidationException

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6216 - Posted: 12 Mar 2017, 3:43:31 UTC

I have started to have a TransformValidationException exception, similar to a problem being reported at LHC@Home with the new ATLAS Simulator 1.01:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4146&postid=29198#29198

On ATLAS@Home, the same problem occurred on my 2 machines:
http://atlasathome.cern.ch/result.php?resultid=8503011
http://atlasathome.cern.ch/result.php?resultid=8500584

Since the exception has occurred on both sites, it may not be related to the new version 1.01 recently introduced at LHC@Home, but to some other issue. Several people have had the same error at LHC@Home.

Most users seem to have the same Exception:
MPI für Physik: http://atlasathome.cern.ch/result.php?resultid=8503656
WLCG Performance-Test Cluster: http://atlasathome.cern.ch/result.php?resultid=8503444
kane: http://atlasathome.cern.ch/result.php?resultid=8503597
Yeti: http://atlasathome.cern.ch/result.php?resultid=8503848
____________
We are the product of random evolution.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6217 - Posted: 12 Mar 2017, 5:24:40 UTC

When I woke up this morning, I noticed that for the past 8 hours or so all ATLAS tasks had been running for only about 10 minutes, then stopping and showing "validation error".

What's going on?

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6218 - Posted: 12 Mar 2017, 9:23:12 UTC - in response to Message 6216.

I have started to have a TransformValidationException exception

However, I have now checked the stderr of my faulty tasks, and in none of them did I find what you are seeing:

Guest Log: PyJobTransforms.transform.execute 2017-03-12 06:30:12,098 CRITICAL Transform executor raised TransformValidationException ...

So my problem must be a different one?
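
For anyone who wants to repeat this check, here is a minimal Python sketch. It assumes the stderr text of a task has been saved or copied locally; the function name and the second sample line are illustrative only and are not part of any ATLAS or BOINC tool.

```python
def find_transform_errors(stderr_text):
    """Return the log lines that report a CRITICAL TransformValidationException."""
    matches = []
    for line in stderr_text.splitlines():
        if "CRITICAL" in line and "TransformValidationException" in line:
            matches.append(line.strip())
    return matches


# Sample input: the first line is the one quoted in this thread,
# the second line is made up for the example.
sample = (
    "Guest Log: PyJobTransforms.transform.execute 2017-03-12 06:30:12,098 "
    "CRITICAL Transform executor raised TransformValidationException ...\n"
    "Guest Log: some unrelated line\n"
)
hits = find_transform_errors(sample)
print(len(hits), "matching line(s)")
```

An empty result only means that this particular message is absent; the task may still have failed for a related reason.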

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 0
    
Message 6219 - Posted: 12 Mar 2017, 9:45:24 UTC

Correlated with this error, the following pattern can be observed:

VM request:
"GET http://ccsqfatlasli01.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RUHJxincMCQnydAoNcY33c-R1VVJwDMYmrIMmGOboE4pNMVTcLcjfV0HJMcTHMTje2d-fx9-NJ97Tz8U1RElPCcgOMDIwiEfWF6ykgEMcAHp8LmA_ HTTP/1.0"

HTTP_Status: 408 (timeout)

"GET /frontierATLAS/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RUHJxincMCQnydAoNcY33c-R1VVJwDMYmrIMmGOboE4pNMVTcLcjfV0HJMcTHMTje2d-fx9-NJ97Tz8U1RElPCcgOMDIwiEfWF6ykgEMcAHp8LmA_ HTTP/1.1"

HTTP_Status: 400 (bad request; no wonder as protocol and host are missing); repeated 3x

Workunit stops and reports an error.
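
The difference between the two request lines can be made explicit with a small Python sketch. It only classifies the request target as absolute-form (scheme and host present, as a proxy expects) or origin-form (path only). The function name is made up for this example, and the queries are shortened for readability.

```python
from urllib.parse import urlsplit


def request_target_form(request_line):
    """Classify the target of an HTTP request line.

    Returns 'absolute-form' when scheme and host are present,
    'origin-form' when only a path is given, and 'other' otherwise.
    """
    method, target, version = request_line.split(" ", 2)
    parts = urlsplit(target)
    if parts.scheme and parts.netloc:
        return "absolute-form"
    if target.startswith("/"):
        return "origin-form"
    return "other"


# Shortened versions of the two request lines from the log (p1 parameter omitted).
first = ("GET http://ccsqfatlasli01.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/"
         "type=frontier_request:1:DEFAULT&encoding=BLOBzip5 HTTP/1.0")
second = ("GET /frontierATLAS/Frontier/"
          "type=frontier_request:1:DEFAULT&encoding=BLOBzip5 HTTP/1.1")

print(request_target_form(first))
print(request_target_form(second))
```

Whether the origin-form request is what actually triggers the 400 response is my assumption; the sketch only shows that the second request carries no scheme or host in its target.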

A request to ccsqfatlasli01.in2p3.fr:23128 is very unusual for ATLAS but it happens from time to time (always with this error).

Nevertheless:
nc -z -v -w 5 ccsqfatlasli01.in2p3.fr 23128
nc: connect to ccsqfatlasli01.in2p3.fr port 23128 (tcp) timed out: Operation now in progress

It seems that traffic is blocked somewhere at CERN as my firewall accepts packets to this system/port.
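
The same reachability test can be done without nc. The following Python sketch makes a plain TCP connection attempt with a 5 second timeout, which is roughly what `nc -z -w 5` does. The function name is hypothetical, and the result says nothing about where on the path the traffic is dropped.

```python
import socket


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


# Example call (performs a real connection attempt when run):
# print(port_is_reachable("ccsqfatlasli01.in2p3.fr", 23128))
```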

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6220 - Posted: 12 Mar 2017, 19:33:03 UTC - in response to Message 6219.

So let's wait till Monday then, and hopefully some explanation will be shared and we can put our machines back to work :).

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6221 - Posted: 13 Mar 2017, 8:24:30 UTC - in response to Message 6220.

Hi,

We are suffering infrastructure problems due to tasks running on the ATLAS grid which are overloading database servers.

This host ccsqfatlasli01.in2p3.fr is one of the servers that WUs contact while running to get conditions data (basically data describing the geometry and status of the ATLAS detector), and this service is not working at the moment. I'm checking whether we should wait for this service to be recovered or whether there is an alternative one we can use.

Fuzzy Duck
Joined: 3 Dec 15
Posts: 33
Credit: 5,074,231
RAC: 0
    
Message 6222 - Posted: 13 Mar 2017, 8:50:27 UTC - in response to Message 6221.

So should we set Atlas to no new tasks and crunch another project?

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 0
    
Message 6223 - Posted: 13 Mar 2017, 9:00:15 UTC - in response to Message 6221.

Hi,

We are suffering infrastructure problems due to tasks running on the ATLAS grid which are overloading database servers.

This host ccsqfatlasli01.in2p3.fr is one of the servers that WUs contact while running to get conditions data (basically data describing the geometry and status of the ATLAS detector), and this service is not working at the moment. I'm checking whether we should wait for this service to be recovered or whether there is an alternative one we can use.
Requests to this server/port are very rare according to my logs.
If this is a regular server/port that WUs contact, why is it not mentioned in the FAQ portlist?

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6224 - Posted: 13 Mar 2017, 9:29:48 UTC - in response to Message 6223.

If this is a regular server/port that WUs contact, why is it not mentioned in the FAQ portlist?


Connections to this host should be rare and in fact should only happen when the primary squid server at db-atlas-squid.ndgf.org is down (which happened this weekend). Can you remind me where the FAQ portlist is and I will correct it?

gyllic
Joined: 9 Dec 14
Posts: 15
Credit: 272,319
RAC: 0
    
Message 6225 - Posted: 13 Mar 2017, 9:47:13 UTC - in response to Message 6224.

If this is a regular server/port that WUs contact, why is it not mentioned in the FAQ portlist?


Connections to this host should be rare and in fact should only happen when the primary squid server at db-atlas-squid.ndgf.org is down (which happened this weekend). Can you remind me where the FAQ portlist is and I will correct it?


http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use

computezrmle
Joined: 29 Oct 14
Posts: 54
Credit: 1,137,404
RAC: 0
    
Message 6226 - Posted: 13 Mar 2017, 10:02:06 UTC - in response to Message 6224.

I refer to this FAQ which is linked from the consolidated server:
http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use

Connections to this host should be rare and in fact should only happen when the primary squid server at db-atlas-squid.ndgf.org is down
This is more and more confusing, as I cannot find any request to db-atlas-squid.ndgf.org in my logs.
Shouldn't this be handled by CVMFS (inside the VM)?
Perhaps through atlas-condb.cern.ch as it is described here:
https://cernvm.cern.ch/portal/cvmfs/examples

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6227 - Posted: 13 Mar 2017, 11:43:37 UTC

From what I could see from a random check of some members' PCs, many of them are still downloading new WUs and uploading faulty WUs ("validation error") at intervals of about 5-10 minutes.
So there must have been thousands, if not several hundred thousand, faulty WUs returned to CERN since Saturday night (when the problem came up).
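
A rough sanity check of that estimate, using only the timestamps in this thread. It assumes the failures began about 8 hours before Message 6217 and that a task slot fails continuously every 5 to 10 minutes. The 100,000 target and the idea of a "task slot" are assumptions made for illustration; no real host counts are used.

```python
from datetime import datetime
import math

# Assumed start: about 8 hours before Message 6217 (12 Mar 2017, 5:24:40 UTC).
start = datetime(2017, 3, 11, 21, 24, 40)
# End: the time of Message 6227.
end = datetime(2017, 3, 13, 11, 43, 37)
elapsed_s = (end - start).total_seconds()

# Failures per continuously failing task slot at 10 and 5 minute intervals.
fails_low = int(elapsed_s // (10 * 60))
fails_high = int(elapsed_s // (5 * 60))

# How many such slots would be needed to reach 100,000 failed WUs.
target = 100_000
slots_min = math.ceil(target / fails_high)
slots_max = math.ceil(target / fails_low)

print(f"failures per slot: {fails_low} to {fails_high}")
print(f"slots needed for {target} failures: {slots_min} to {slots_max}")
```

Under these assumptions a few hundred continuously failing task slots would already be enough to reach six figures.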

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6228 - Posted: 13 Mar 2017, 14:11:13 UTC - in response to Message 6226.

I refer to this FAQ which is linked from the consolidated server:
http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use


Thanks, I will try to get this page updated with ATLAS info.

Connections to this host should be rare and in fact should only happen when the primary squid server at db-atlas-squid.ndgf.org is down
This is more and more confusing, as I cannot find any request to db-atlas-squid.ndgf.org in my logs.
Shouldn't this be handled by CVMFS (inside the VM)?
Perhaps through atlas-condb.cern.ch as it is described here:
https://cernvm.cern.ch/portal/cvmfs/examples


This is not the squid for CVMFS; it's a squid in front of the ATLAS conditions databases. If the squid fails or doesn't have the data cached, the task tries to read from another service called Frontier, which is an HTTP frontend to the databases. Over the weekend both the squids and the Frontier services were hit hard and were brought down.
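
The fallback order described above can be sketched in Python as follows. This is not the actual client code running inside the VM. The function name, the source labels and the stand-in fetch function are all hypothetical, and the run below is simulated without any network access.

```python
def fetch_with_fallback(query, sources, fetch):
    """Try each source in order; return (source, payload) from the first that answers.

    `fetch(source, query)` is supplied by the caller and must either return
    the payload or raise OSError when the source cannot be reached.
    """
    errors = []
    for source in sources:
        try:
            return source, fetch(source, query)
        except OSError as exc:
            errors.append(f"{source}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Simulated run: the squid is down, so the Frontier service answers.
SOURCES = ["squid", "Frontier"]   # hypothetical labels, tried in this order
DOWN = {"squid"}


def fake_fetch(source, query):
    """Stand-in for a real HTTP request."""
    if source in DOWN:
        raise OSError("timed out")
    return f"conditions data for {query!r}"


used, payload = fetch_with_fallback("geometry", SOURCES, fake_fetch)
print("answered by:", used)
```

When every source in the list is unavailable, as happened over the weekend, the sketch raises an error, which corresponds to the task giving up.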

I think the situation has improved now - I have one WU which has been running for almost 1 hour using full CPU.

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6229 - Posted: 13 Mar 2017, 16:04:41 UTC - in response to Message 6227.

From what I could see from a random check of some members' PCs, many of them are still downloading new WUs and uploading faulty WUs ("validation error") at intervals of about 5-10 minutes.
So there must have been thousands, if not several hundred thousand, faulty WUs returned to CERN since Saturday night (when the problem came up).

This has now obviously led to a situation where no new tasks are available, neither for single-core nor for multi-core.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6231 - Posted: 14 Mar 2017, 3:19:04 UTC

The Server Status says 0 Unsent for both applications. Should we still wait before resuming crunching ATLAS tasks?

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6232 - Posted: 14 Mar 2017, 8:05:05 UTC - in response to Message 6231.

The Server Status says 0 Unsent for both applications. Should we still wait before resuming crunching ATLAS tasks?

Well, in fact you cannot crunch ATLAS tasks anyway as long as none are available for download.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 13 May 14
Posts: 252
Credit: 2,028,556
RAC: 1
    
Message 6233 - Posted: 14 Mar 2017, 8:18:56 UTC - in response to Message 6232.

We have a small issue submitting WUs to both LHC and ATLAS at the same time, so they are all going to LHC at the moment. I'll try to get some more WUs here, but in the meantime you can try ATLAS@LHC@Home.

The infrastructure problems affecting all the WUs in the last 2 days have been fixed, so WUs should work now if you get any.

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6234 - Posted: 14 Mar 2017, 13:35:06 UTC

I think there may still be some other issue with the ATLAS tasks over at LHC@Home.
The following 2 WUs that failed on my machine also failed on another machine:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60483240
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521

Erich
Joined: 18 Dec 15
Posts: 253
Credit: 1,942,248
RAC: 0
    
Message 6236 - Posted: 14 Mar 2017, 16:48:45 UTC - in response to Message 6233.

We have a small issue submitting WUs to both LHC and ATLAS at the same time, so they are all going to LHC at the moment. I'll try to get some more WUs here ...

David, any chance that new WUs will be available here today?

HerveUAE
Joined: 18 Dec 16
Posts: 44
Credit: 509,829
RAC: 0
    
Message 6237 - Posted: 14 Mar 2017, 18:04:59 UTC

David, any chance that new WUs will be available here today?

I am now getting and running multi-core WUs from ATLAS@Home.
