Fwd: namenode question consultation and advice

Fwd: namenode question consultation and advice

白 瑶瑶


From: 白 瑶瑶 on behalf of 白 瑶瑶 <[hidden email]>
Sent: October 18, 2018 10:33
Subject: namenode question consultation and advice
 

Hi,

   In my production Hadoop cluster (HA), both NameNodes have recently been crashing frequently with errors that I have not been able to resolve. The active NameNode crashes with the error below, and the standby NameNode then fails to take over as active. The error is as follows:


    

2018-10-18 15:51:36,311 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.117.29.24
2018-10-18 15:51:36,311 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2018-10-18 15:51:36,311 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 3420935
2018-10-18 15:51:38,738 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 19 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 10 SyncTimes(ms): 180 2525 
2018-10-18 15:51:38,765 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/hadoop/tmp/dfs/name/current/edits_inprogress_0000000000003420935 -> /data/hadoop/tmp/dfs/name/current/edits_0000000000003420935-0000000000003420953
2018-10-18 15:51:38,765 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 3420954
2018-10-18 15:51:44,767 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:45,768 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:46,769 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:47,770 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:48,771 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10005 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:49,771 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11006 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:50,773 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12007 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:51,774 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13008 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:52,774 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14009 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:53,776 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15010 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:54,777 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16011 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:55,778 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:56,780 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:57,781 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:58,767 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 3420954 failed for required journal (JournalAndStream(mgr=QJM to [10.117.29.25:8485, 10.117.29.24:8485, 10.117.29.23:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1237)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1206)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-10-18 15:51:58,768 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2018-10-18 15:51:58,773 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at kvmserver25/10.117.29.25
************************************************************/
2018-10-18 16:04:13,143 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = kvmserver25/10.117.29.25
 
     
 Under what circumstances does this error occur, and do you have any suggestions for fixing it?

   Thank you.

  

 BAI


Re: Fwd: namenode question consultation and advice

Gurmukh Singh

Your disks seem to be the issue; slow disk I/O would cause the JournalNode timeout you are seeing.
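To see which JournalNodes are lagging, the log excerpt can be checked mechanically. The following is only a rough sketch: it compares the JournalNode list in the FATAL line with the "Succeeded so far" list, using two lines copied from the log in the original post. The function names are made up for illustration. QJM needs a majority of JournalNodes to acknowledge a write, so one response out of three is not enough. The log does not say why the other two were silent, so this only narrows down where to look.

```python
import re

# Two representative lines copied from the NameNode log in the original post.
LOG = """\
2018-10-18 15:51:57,781 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:58,767 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 3420954 failed for required journal (JournalAndStream(mgr=QJM to [10.117.29.25:8485, 10.117.29.24:8485, 10.117.29.23:8485], stream=null))
"""

ADDR = r"\d+\.\d+\.\d+\.\d+:\d+"


def silent_journal_nodes(log_text):
    """Return (all_nodes, responded, silent) as sorted lists of host:port strings."""
    all_nodes, responded = set(), set()
    for line in log_text.splitlines():
        # The FATAL line lists every JournalNode the NameNode writes to.
        m = re.search(r"mgr=QJM to \[([^\]]*)\]", line)
        if m:
            all_nodes.update(re.findall(ADDR, m.group(1)))
        # The "Waited ..." lines list the JournalNodes that have answered so far.
        m = re.search(r"Succeeded so far: \[([^\]]*)\]", line)
        if m:
            responded.update(re.findall(ADDR, m.group(1)))
    return sorted(all_nodes), sorted(responded), sorted(all_nodes - responded)


def has_quorum(total, responded):
    """A write quorum is a strict majority of the JournalNodes."""
    return responded > total // 2


if __name__ == "__main__":
    nodes, ok, silent = silent_journal_nodes(LOG)
    print("responded:", ok)
    print("silent:", silent)
    print("quorum reached:", has_quorum(len(nodes), len(ok)))
```

On the log above this reports 10.117.29.23:8485 and 10.117.29.24:8485 as silent, so those two hosts are the ones to look at first.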


Run benchmarks on the disks used by the NameNode, ZooKeeper, and the JournalNodes (QJM).
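If no dedicated benchmarking tool is at hand, a quick first check is to time small synchronous writes, since edit-log writes depend on sync latency. This is a minimal sketch, not a substitute for a proper disk benchmark. The function names, write count, and record size are arbitrary choices for illustration, and the target directory is an assumption: point it at the filesystem that holds dfs.journalnode.edits.dir (or the NameNode's dfs.namenode.name.dir) on each host.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory, writes=50, size=4096):
    """Append small records to a temp file in `directory`, fsync after each
    write, and return the per-write latencies in milliseconds."""
    payload = b"\0" * size
    latencies = []
    # The temp file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(writes):
            start = time.perf_counter()
            f.write(payload)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # force the OS to write to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile, and maximum latency."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: replace this with the journal or name directory to test.
    target = tempfile.gettempdir()
    print(summarize(fsync_latencies_ms(target)))
```

Run it on each of the three JournalNode hosts and compare. Sync times in the range of seconds on 10.117.29.23 or 10.117.29.24 would be consistent with the timeout above.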


On 3/12/18 2:08 pm, 白 瑶瑶 wrote:

