Question about Yarn rolling upgrade


Question about Yarn rolling upgrade

Aihua Xu
Hi all,

I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop 2.9.1. I'm upgrading the ResourceManager (RM) first and then the NodeManagers (NM). When I submit a YARN job, the RM fails with the following exception:

 Application application_1549408943468_0001 failed 2 times due to Error launching appattempt_1549408943468_0001_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.io.EOFException; Host Details : local host is: "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is: "hadoopbencha22-sjc1.prod.uber.internal":8041;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
at org.apache.hadoop.ipc.Client.call(Client.java:1439)
at org.apache.hadoop.ipc.Client.call(Client.java:1349)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy88.startContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.io.EOFException
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
at org.apache.hadoop.ipc.Client.call(Client.java:1385)
... 20 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
... 23 more

and the NM logs this warning:
2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring password) with true cause: (null)

Is this a known issue, and does anybody have any insight into it?

Thanks,
Aihua



Re: Question about Yarn rolling upgrade

Rohith Sharma K S-2
Hi Aihua,

Could you clarify when the job was submitted: a) before starting the upgrade, b) after the RM upgrade and before the NM upgrade, or c) after the YARN upgrade was fully complete?
Typically, the suggested upgrade order is NMs first and RM second.

Regarding the NM warning messages, you might be hitting https://issues.apache.org/jira/browse/HADOOP-11692.

Do any subsequent jobs succeed after the upgrade?
-Rohith Sharma K S

In reply to Aihua Xu's message of Thu, 7 Feb 2019 at 03:20.

Re: Question about Yarn rolling upgrade

Aihua Xu
Hi Rohith,

Thanks for your suggestion. I traced the issue and found that it is caused by an incompatibility introduced by the following two changes, which changed the token format.

YARN-668. Changed NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use protobuf object as the payload. Contributed by Junping Du.
YARN-2615. Changed ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use protobuf as payload. Contributed by Junping Du

I was testing a new RM with an old NM.
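The general failure mode here (a reader that expects one token payload layout receiving bytes written in another) can be illustrated with a minimal sketch. This is not Hadoop's actual token code and it does not reproduce the exact trace above: the class name, the two fields, and both layouts are hypothetical, and the "compact" writer is only a simplified protobuf-like encoding (one tag byte plus a varint per field).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TokenFormatMismatch {

    // Hypothetical "old" layout: fixed-width fields (4-byte int, 8-byte long).
    static byte[] writeLegacy(int appId, long expiry) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(appId);
            out.writeLong(expiry);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Hypothetical "new" layout: tag byte + varint per field (protobuf-like, simplified).
    static byte[] writeCompact(int appId, long expiry) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(0x08);
        writeVarint(buf, appId);
        buf.write(0x10);
        writeVarint(buf, expiry);
        return buf.toByteArray();
    }

    // Encodes a non-negative value 7 bits at a time, low bits first.
    static void writeVarint(ByteArrayOutputStream buf, long v) {
        while ((v & ~0x7FL) != 0) {
            buf.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        buf.write((int) v);
    }

    // "Old" reader: only understands the fixed-width layout.
    static String readLegacy(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            int appId = in.readInt();
            long expiry = in.readLong();
            return "ok appId=" + appId + " expiry=" + expiry;
        } catch (EOFException e) {
            // The payload ended before the fixed-width fields were fully read.
            return "EOFException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    public static void main(String[] args) {
        // Same layout on both sides: the reader succeeds.
        System.out.println(readLegacy(writeLegacy(1, 1549408943468L)));
        // New writer, old reader: the shorter payload cannot be parsed.
        System.out.println(readLegacy(writeCompact(1, 1549408943468L)));
    }
}
```

In this sketch the old reader cannot parse the payload produced by the new writer, which is the kind of mismatch a new RM talking to an old NM would run into if the two sides disagree on the token identifier encoding.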

A follow-up on the order of the YARN upgrade: I checked the HWX blog about rolling upgrades, and it suggests upgrading the RM first. But you are saying we should upgrade the NMs first and the RM second? Can you confirm?

Thanks,
Aihua
 


In reply to Rohith Sharma K S's message of Wed, Feb 6, 2019 at 8:26 PM.

Re: Question about Yarn rolling upgrade

Rohith Sharma K S-2
The JIRAs mentioned above did break compatibility, but those breaks were fixed in 2.6 itself. The only other JIRA I see is YARN-8310, which is fixed in 2.10. Judging from the stack trace you posted, it doesn't seem related to your issue. Maybe try applying that patch and running a job.
Otherwise, let's create a JIRA and discuss it there in detail.

-Rohith Sharma K S

In reply to Aihua Xu's message of Thu, 7 Feb 2019 at 22:52.

Re: Question about Yarn rolling upgrade

Aihua Xu
Hi Rohith,

I should have mentioned that we were using CDH5.7.2-2.6, which has those two patches reverted, and that is what causes the incompatibility. And yes, I have to backport YARN-8310 to fix another issue.

BTW, should we upgrade the NMs first, as you mentioned before?

Thanks,
Aihua



In reply to Rohith Sharma K S's message of Thu, Feb 7, 2019 at 9:41 PM.