HDFS DirectoryScanner is bothering me


HDFS DirectoryScanner is bothering me

Yen-Onn Hiu
hi all,

I am running Hadoop HDFS version 2.6.0-cdh5.8.0. I have discovered that the DirectoryScanner keeps triggering Java GC and slowing down the Hadoop nodes. Digging into the log file, I found this:

2018-11-29 13:34:37,995 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3896197, missing metadata files:214, missing block files:214, missing blocks in memory:103, mismatched blocks:0

From what I have read online, this line comes from the DirectoryScanner, which runs every 6 hours. The directory scan causes a GC hiccup on every node and degrades performance across the cluster.

Question: when I run hdfs dfsadmin -report, it does not report any corrupted files. I also ran hdfs fsck / and it did not surface any problems either. How can I find out which block files, in-memory blocks, and metadata files are missing?
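For reference, these are the checks I ran, plus the per-block lookup I plan to try next. The -blockId option is my assumption here: I believe it needs a build that includes HDFS-6663, and the block id below is only a placeholder.

hdfs dfsadmin -report                  # cluster health summary; showed no corrupt blocks
hdfs fsck / -files -blocks -locations  # namespace check with per-file block listing
hdfs fsck -blockId blk_1073741825      # placeholder id; maps one block back to its file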


Thanks!

--
Hiu Yen Onn



Re: HDFS DirectoryScanner is bothering me

Wei-Chiu Chuang-3
Do you have a heap dump? Without a heap dump it's not easy to definitively point to DirectoryScanner for GC issues.
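If you can, capture one while the scanner is running. Something along these lines against the DataNode process should do; the pid and output path are placeholders:

jmap -dump:live,format=b,file=/tmp/datanode.hprof <datanode-pid>
jmap -histo <datanode-pid> | head -n 30   # quick class histogram if a full dump is too heavy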

That said, I did notice DirectoryScanner holding the global lock for quite a few seconds periodically, but that's unrelated to GC.




Re: HDFS DirectoryScanner is bothering me

Yen-Onn Hiu
hi Wei-Chiu,

Good to hear from you!

I have gathered some facts in the hope of showing you something solid.

Screen Shot 2561-12-05 at 08.41.42.png

Screen Shot 2561-12-05 at 09.21.17.png

These are the Graphite metrics I gathered from the JvmMetrics/GcTimeMillis metric. Every 6 hours I can see a GC hiccup of 3 to 4 seconds on every datanode, which matches dfs.datanode.directoryscan.interval (default 21600 seconds). Tracing the log file around each hiccup, I can clearly see this line:

 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3894663, missing metadata files:122, missing block files:122, missing blocks in memory:57, mismatched blocks:0 
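To rule out a per-node misconfiguration, I also checked the effective value on the datanodes with the stock getconf tool (I expect 21600 back):

hdfs getconf -confKey dfs.datanode.directoryscan.interval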

Looking at the GC log, which I have enabled, I can clearly see that a great deal of time was spent on object copy:

[GC cleanup 4452M->2474M(8192M), 0.0124395 secs]
 [Times: user=0.05 sys=0.00, real=0.01 secs]
2018-11-29T01:53:23.204+0700: 40161.800: [GC concurrent-cleanup-start]
2018-11-29T01:53:23.214+0700: 40161.810: [GC concurrent-cleanup-end, 0.0104121 secs]
2018-11-29T01:53:29.107+0700: 40167.702: [GC pause (G1 Evacuation Pause) (young) 40167.702: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 65846, predicted base time: 30.37 ms, remaining time: 169.63 ms, target pause time: 200.00 ms]
 40167.702: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 131 regions, survivors: 31 regions, predicted young region time: 296.21 ms]
 40167.702: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 131 regions, survivors: 31 regions, old: 0 regions, predicted pause time: 326.58 ms, target pause time: 200.00 ms]
 40168.874: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 138 regions, reclaimable: 496648080 bytes (5.78 %), threshold: 10.00 %]
, 1.1718634 secs]
   [Parallel Time: 1166.4 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 40167703.2, Avg: 40167703.3, Max: 40167703.3, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.8, Max: 1.0, Diff: 0.4, Sum: 6.2]
      [Update RS (ms): Min: 68.1, Avg: 68.2, Max: 68.3, Diff: 0.2, Sum: 545.9]
         [Processed Buffers: Min: 24, Avg: 52.6, Max: 74, Diff: 50, Sum: 421]
      [Scan RS (ms): Min: 23.9, Avg: 24.2, Max: 24.7, Diff: 0.8, Sum: 193.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 1013.7, Avg: 1015.8, Max: 1020.4, Diff: 6.7, Sum: 8126.1]
      [Termination (ms): Min: 51.8, Avg: 56.8, Max: 58.8, Diff: 7.0, Sum: 454.6]
         [Termination Attempts: Min: 53698, Avg: 56791.9, Max: 58670, Diff: 4972, Sum: 454335]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [GC Worker Total (ms): Min: 1165.8, Avg: 1165.9, Max: 1166.0, Diff: 0.2, Sum: 9327.0]
      [GC Worker End (ms): Min: 40168869.1, Avg: 40168869.2, Max: 40168869.2, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 5.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.4 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.8 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.8 ms]
   [Eden: 524.0M(284.0M)->0.0B(356.0M) Survivors: 124.0M->52.0M Heap: 2809.1M(8192.0M)->2387.1M(8192.0M)]
 [Times: user=8.90 sys=0.19, real=1.17 secs]


These are my Java opts:


-XX:+UseG1GC -Xmx8g -Xms8g -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=8 -XX:+UnlockExperimentalVMOptions -XX:+ParallelRefProcEnabled -XX:InitiatingHeapOccupancyPercent=50 -XX:G1MixedGCLiveThresholdPercent=70 -XX:G1HeapWastePercent=10 -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log -XX:G1LogLevel=finest -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintAdaptiveSizePolicy -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8004


Together with this, I have generated a heap dump, but since the dump file is too big (357 MB) I don't think I can attach it to this email. Instead I have attached some screenshots. I am not sure whether my guess is correct: the big objects referenced in the GC log might be FsDatasetImpl and ReplicaMap. I am only speculating about those Java objects.
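To sanity-check that speculation without sending the full dump, I also took a live class histogram on one datanode. The pid is a placeholder, and the class names are just the ones I expect from the HDFS source:

jmap -histo:live <datanode-pid> | egrep 'ReplicaMap|FinalizedReplica|FsDatasetImpl'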


Screen Shot 2561-12-05 at 08.53.07.png

Screen Shot 2561-12-05 at 08.53.35.png
Screen Shot 2561-12-05 at 09.08.10.png

Screen Shot 2561-12-05 at 08.58.20.png


Please let me know if there is anything else I can do. I have gone through this link (https://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/) many times to understand the behaviour of the DirectoryScanner, BlockScanner, and VolumeScanner. Here are the questions I still have:


1. > INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3894663, missing metadata files:122, missing block files:122, missing blocks in memory:57, mismatched blocks:0

I would like to know what causes the missing metadata files, missing block files, and missing blocks in memory.


2. What would be the effect of prolonging dfs.datanode.directoryscan.interval from 6 hours to 12 hours (see the snippet after question 3 for the change I have in mind)? Would it do any harm to the system?


3. Would block scanning fix this? It is set to run only once every 3 weeks by default. Can I trigger the scan explicitly from the command line?
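For question 2, this is the hdfs-site.xml change I have in mind, together with the block scanner knob I found while looking at question 3. The property names are from hdfs-default.xml; the values are only my candidates, not tested recommendations:

<property>
  <name>dfs.datanode.directoryscan.interval</name>
  <!-- seconds between directory scans; 43200 = 12 hours, default is 21600 -->
  <value>43200</value>
</property>
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <!-- block scanner period; the default 504 hours is the 3 weeks mentioned above -->
  <value>504</value>
</property>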


Thank you so much for your attention. 



 







--
Hiu Yen Onn





Attachments: Screen Shot 2561-12-05 at 08.41.42.png (46K), GCeasy-report-datanode-gc.pdf (1M)