NameNode timeout to writing to QJM after using G1GC in a large heap

NameNode timeout to writing to QJM after using G1GC in a large heap

Quanlong Huang
Hi all,

Our HDFS uses an HA setup that depends on QuorumJournalNodes. We previously ran the NameNode with CMS and a 120GB heap. When we increased the heap size to 180GB and enabled G1GC on one of the NameNodes, it crashed about 10 minutes after becoming active.

The crash is due to a FATAL error: the NameNode timed out writing to all of the JournalNodes. However, no GC pauses were detected at that time, and resource usage (CPU, disk/network IO) was low. For group mapping we use ShellBasedUnixGroupsMapping, so the bottleneck is unlikely to be connections to an LDAP server.

Here are the NameNode logs:
  2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
  2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
  java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
      at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
      at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
      at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
      at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
      at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
  2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
  2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
  2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
  ************************************************************/

Only one JvmPauseMonitor complaint was logged, and it was well before the timeout:

  2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
  No GCs detected

On the QuorumJournalNodes, no warning logs were found in that period. It seems the NameNode failed to connect to any of the JournalNodes.

Here are the GC options we set for G1GC:

  -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose:gc
Could anyone give some thoughts?

Thanks,
Quanlong Huang

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Erik Krogen

Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log) is usually a better idea than relying on the pause monitor in my experience.


That being said, given that the NN continued to produce log messages during this time, it doesn't seem to be a GC pause issue.
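
If it helps, one rough way to scan the rotated GC logs for long stop-the-world phases is to pull out the "real=" times that -XX:+PrintGCDetails appends to each collection. This is only a sketch, and it assumes the usual HotSpot 8 log format and the gc.log* file naming that follows from your -Xloggc and rotation settings:

  # print any GC event whose wall-clock ("real") time exceeded one second
  grep -o 'real=[0-9.]*' /var/log/hadoop-hdfs/gc.log* | awk -F= '$NF > 1.0'

Anything much over a second there would be worth correlating with the QJM timeout window.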


Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Quanlong Huang
Hi Erik,

Thanks for your response! Unfortunately, the GC log was overwritten when the NameNode restarted... We should have set the GC log file name to "gc-%t.log" instead of "gc.log".
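
For the record, something like the following in hadoop-env.sh should keep one GC log per NameNode start. This is just a sketch: it assumes a HotSpot 8 JVM (where %t in -Xloggc expands to the JVM start timestamp) and that the NameNode JVM options are supplied via HADOOP_NAMENODE_OPTS, which may not match how a CDH deployment passes them:

  # keep a separate, rotated GC log for every NameNode start
  export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
    -Xloggc:/var/log/hadoop-hdfs/gc-%t.log \
    -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M"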

One of my suspects is the GC option "MaxGCPauseMillis". A Hortonworks tech article suggests setting it to 4000 for large clusters (>50M files): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html

Our cluster is well above that threshold (50M files), but we set MaxGCPauseMillis to 400. I'm not sure how that could cause the NameNode to time out writing to the JournalNodes. Do you have any suggestions?

Thanks,
Quanlong


Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Wei-Chiu Chuang
In reply to this post by Quanlong Huang
The JvmPauseMonitor warning "No GCs detected" usually suggests the host is running out of resources. I've seen it happen on hosts that run other master services in addition to NameNodes.

I've also seen network disconnections (somebody restarting a network switch) result in QJM timeout messages like that. I've also seen cases where everything works perfectly fine, but the Linux kernel decides not to deliver the response from the JN back to the NN process.

If this happens again, try bumping the QJM timeouts from 20 seconds to 40 seconds.
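
For reference, I believe the timeout in that log message is controlled by dfs.qjournal.write-txns.timeout.ms (20000 ms by default). A sketch of the change in hdfs-site.xml, assuming the write path is the only timeout you need to raise:

  <!-- raise the QJM edit-write quorum timeout from the 20s default to 40s -->
  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>40000</value>
  </property>

There are sibling dfs.qjournal.*.timeout.ms settings for the other QJM operations if those ever time out as well.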


Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Erik Krogen
In reply to this post by Quanlong Huang

In my personal experience, MaxGCPauseMillis did not have a strong effect on GC pause behavior, but that was on a slightly older JVM (I believe 8u31). I found that the most important setting for large clusters (in our case, hundreds of millions of files/blocks) was G1RSetRegionEntries, which we had to bump to about 2-3x the default to avoid lots of huge (>20 sec) pauses during periods of high churn like startup and failover. I still don't understand how the NameNode could have continued to produce logs if a GC pause really was the cause of your issue, but maybe you can try adjusting this.
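
If you want to experiment with it, a sketch is below. The value is purely illustrative rather than a tuned recommendation, and if I recall correctly G1RSetRegionEntries may be treated as an experimental option on JDK 8, hence the unlock flag (which should be harmless if it turns out not to be needed):

  # illustrative only; size this relative to the default your JVM computes
  export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
    -XX:+UnlockExperimentalVMOptions -XX:G1RSetRegionEntries=4096"

Since the default is derived ergonomically from the G1 region size, you may want to check what your JVM actually ends up with (e.g. via -XX:+PrintFlagsFinal) before deciding what 2-3x of it means for a 180GB heap.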


Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Quanlong Huang
Hi Wei-Chiu, Sajid and Erik,

Many thanks for all your responses! Please forgive me for replying in one email so I can summarize our environment.

We're using Hadoop 2.6-cdh5.7.3 and Oracle JDK 1.8.0_121 in a cluster with >300M blocks. The NameNodes work normally using CMS with a 120GB heap and never time out writing to the JournalNodes, so I think the current QJM timeout setting (20s) should be OK for G1GC with a 180GB heap. (@Wei-Chiu, @Sajid) Please correct me if it's possible for G1GC with a 180GB heap to perform worse than CMS with a 120GB heap.

Thanks to Erik for the GC configs; we'll investigate them. However, this is a production cluster with heavy workloads, so we want to understand the crash before trying G1GC or any new GC configs again.

I have some follow-up questions:

a) Are there any tools in the community for benchmarking the NameNode, so we can simulate the heavy workloads and reproduce the crash?

b) Is it possible that the timeout is due to RPC resource starvation? Do the code paths that write to the JournalNodes share anything (e.g. a thread pool) with the code paths that process client/DataNode RPC requests?
The NameNode is deployed on a dedicated machine. Hardware resource usage (e.g. CPU, network/disk IO) was low during the timeout period. Only the RPC metrics look abnormal, though that may be either the cause or the result. Here's a snapshot of the RPC metrics: https://drive.google.com/file/d/1_ZuJnURIVv-nb5B9_Lj7bcxt8oPabRn6/view?usp=sharing

c) Are there any retry policies for the NameNode when writing to the JournalNodes?
In the logs, it looks like the NameNode does not retry but just waits for 20 seconds.

We'd appreciate any answers to these questions!

Thanks,
Quanlong


Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Clay Baenziger (BLOOMBERG/ 731 LEX)
In reply to this post by Quanlong Huang
Hi Quanlong,

I have appreciated the HDFS NameNode testing tool presented by LinkedIn at https://github.com/linkedin/dynamometer. I have not tried it yet, so I look forward to hearing whether you are able to use it for your GC tuning. It is designed to replay queries, as opposed to exercising NN start-up or fail-over.

-Clay
