What do you think about HDFS using GFS2 (shared disk file system) or GPFS (parallel filesystem) rather than local file system?


Daegyu Han
Hi all,

As far as I know, HDFS is designed to run on top of a local file system such as ext4 or XFS.

Is it a bad approach to use SAN technology as storage for HDFS?

Thank you,
Daegyu

Wei-Chiu Chuang
I'm not familiar with GPFS, but looking at IBM's website, GPFS has a client that emulates the Hadoop RPC interface, so you can use GPFS as if it were HDFS. That may be the quickest way to approach this use case, and it is supported.
I'm not sure about the performance, though.
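For context on how such a client plugs in: Hadoop's FileSystem layer is pluggable, and an alternative backend is typically wired in through core-site.xml. The sketch below uses the real Hadoop properties `fs.defaultFS` and `fs.<scheme>.impl`, but the `gpfs://` scheme and the implementation class name are hypothetical illustrations; consult IBM's documentation for the actual values.

```xml
<!-- core-site.xml: wiring Hadoop to an alternative FileSystem backend.
     The gpfs:// scheme and class name are hypothetical; check IBM's
     docs for the real ones. -->
<configuration>
  <property>
    <!-- Default file system URI used by Hadoop clients -->
    <name>fs.defaultFS</name>
    <value>gpfs://cluster-head:8020/</value>
  </property>
  <property>
    <!-- Maps the URI scheme to a FileSystem implementation class -->
    <name>fs.gpfs.impl</name>
    <value>com.example.gpfs.GPFSFileSystem</value>
  </property>
</configuration>
```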


High-throughput Storage Area Network (SAN) and other shared storage solutions can present remote block devices to virtual machines in a flexible and performant manner that is often indistinguishable from a local disk. An Apache Hadoop workload presents a uniquely challenging I/O profile to these storage solutions, and this can have a negative impact on the utility and stability of the Cloudera Enterprise cluster, and on other work that is utilizing the same storage backend.

Warning: Running CDH on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Enterprise and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O.
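The data-locality point in that warning can be illustrated with a toy scheduler: given the set of hosts holding replicas of an HDFS block, a task is preferentially placed on a node that already stores a replica. This is an illustrative sketch, not Hadoop's actual scheduler code; all names are made up.

```python
# Toy illustration of locality-aware task placement (not Hadoop's
# actual scheduler): prefer a worker that already stores a replica
# of the block, falling back to any free worker otherwise.
def assign_task(block_replicas, free_workers):
    """block_replicas: set of hosts holding the block's replicas.
    free_workers: list of hosts with a free task slot."""
    for worker in free_workers:
        if worker in block_replicas:
            return worker, "node-local"  # fast local-disk read
    # On SAN-backed storage every read crosses the network anyway,
    # so this fallback (a remote read) becomes the common case and
    # the locality optimization above buys nothing.
    return free_workers[0], "remote"

print(assign_task({"dn1", "dn3"}, ["dn2", "dn3", "dn4"]))  # ('dn3', 'node-local')
```

With direct-attached disks the "node-local" branch is what makes Hadoop fast; with shared storage the two branches cost roughly the same, which is the point of the warning above.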

On Sat, Aug 17, 2019 at 2:12 AM Daegyu Han <[hidden email]> wrote:

Daegyu Han
Thank you for your response,

My question was about the kernel-level file system underneath HDFS: can only local file systems (ext4, XFS) be used as the backing store, or could a shared-disk or parallel file system work as well?

Thank you,
Daegyu

On Sat, Aug 17, 2019 at 7:28 PM, Wei-Chiu Chuang <[hidden email]> wrote: