Long running application failed to init containers due to authentication errors

Paul Lam
Hi,

I’m running Flink applications on YARN 2.6.0-cdh5.6.0 and have run into a problem. After an application has been running for a while (could be longer than 7 days),
it may need to scale up or recover from a node failure, but it is not able to allocate new containers. Every incoming container fails to localize resources
and to create log aggregation dirs for lack of credentials, so the Flink application never gets the containers it requested. It seems that the credentials in the
container launch context somehow disappear.

This looks very similar to FLINK-6376[1] and YARN-2704[2], but both of them should already have been fixed. The Flink AM gets the HDFS delegation token from
the client, puts it into the container launch context, and does not refresh it afterwards. But IMHO, if the token had expired, the exception should be “token expired”
or “token not found in cache”, whereas what I actually get is “client cannot authenticate via [token, kerberos]”.
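
The distinction between these three messages can be sketched as a small log classifier. This is only an illustration: the patterns are the phrases quoted above rather than verified Hadoop log strings, the function name `classify` is made up, and the label for the third case rests on the assumption that this message means no usable credentials were presented at all.

```python
import re

# Phrases as quoted in this post; real Hadoop log lines may be worded differently.
PATTERNS = [
    (re.compile(r"token.*expired", re.IGNORECASE), "token expired"),
    (re.compile(r"token.*not found in cache", re.IGNORECASE), "token not in cache"),
    # Assumed interpretation: neither a token nor a Kerberos ticket was usable.
    (re.compile(r"client cannot authenticate via", re.IGNORECASE), "no usable credentials"),
]


def classify(line):
    """Return a label for the first matching pattern, or None if nothing matches."""
    for pattern, label in PATTERNS:
        if pattern.search(line):
            return label
    return None


# The message observed in this case.
print(classify("client cannot authenticate via [token, kerberos]"))
```

Running this over the NodeManager logs would show whether the failures are consistently of the third kind or mixed with genuine expiry errors.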

This happens quite randomly, and I have been struggling with it for a couple of days. Any help would be greatly appreciated. Thanks a lot!

[1] https://issues.apache.org/jira/browse/FLINK-6376
[2] https://issues.apache.org/jira/browse/YARN-2704

Best,
Paul Lam




Attachment: failed_to_init_containers.log.md (13K)
Re: Long running application failed to init containers due to authentication errors

Billy Watson
Just a hunch, because we’ve been dealing with something similar: when the failure occurs, has the resource manager also failed over recently, say within the previous 24 hours?

One thing to try: catch this exception and manually fail over to the new master/resource manager.
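
A minimal sketch of that control flow, under loud assumptions: it does not use any YARN or Flink API, and `call_with_failover`, `request_container`, `fail_over_to_standby`, and `looks_like_auth_error` are hypothetical names with toy stand-ins for the real cluster calls.

```python
def call_with_failover(request, fail_over, is_auth_error, max_attempts=2):
    """Run request(); on an authentication error, fail over and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception as err:
            # Unrelated errors, or a failure on the last attempt, propagate.
            if not is_auth_error(err) or attempt == max_attempts:
                raise
            fail_over()


# Toy stand-ins so the sketch runs without a cluster (all hypothetical).
state = {"active": "rm1"}


def request_container():
    if state["active"] == "rm1":
        raise RuntimeError("client cannot authenticate via [token, kerberos]")
    return "container from " + state["active"]


def fail_over_to_standby():
    state["active"] = "rm2"


def looks_like_auth_error(err):
    return "cannot authenticate" in str(err).lower()


result = call_with_failover(request_container, fail_over_to_standby, looks_like_auth_error)
print(result)
```

Errors that do not look like authentication failures are re-raised untouched, so the retry only kicks in for the case discussed here.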

- Billy Watson

On Thu, Nov 29, 2018 at 21:16 Paul Lam <[hidden email]> wrote:
> [...]


--
William Watson