EOFException when using S3AFileSystem with random input policy


Dave Christianson

I'm seeing a problem using the S3AFileSystem with the ParquetInputFormat that causes a non-transient EOFException for certain files. I have traced what looks like the source of the problem to the "random" input policy, which is used to support the seek behavior Parquet requires.

I've written a sample program that reproduces the problem for a given path in S3. It does not use Parquet and works on any file > 1024K:

    final Configuration conf = new Configuration();
    conf.set("fs.s3a.readahead.range", "1K");
    conf.set("fs.s3a.experimental.input.fadvise", "random");

    final FileSystem fs = FileSystem.get(path.toUri(), conf);

    // forward seek reading across readahead boundary
    try (FSDataInputStream in = fs.open(path)) {
        final byte[] temp = new byte[5];
        in.readByte();
        in.readFully(1023, temp); // <-- works
    }

    // forward seek reading from end of readahead boundary
    try (FSDataInputStream in = fs.open(path)) {
        final byte[] temp = new byte[5];
        in.readByte();
        in.readFully(1024, temp); // <-- throws EOFException
    }
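
The two cases differ only in where the positioned read starts relative to the 1K readahead range. Below is a minimal sketch of that arithmetic. It assumes the first readByte() at offset 0 opens a window covering bytes [0, 1024); that window and the helper name insideWindow are illustrative assumptions, not taken from S3AInputStream, whose actual logic may differ.

```java
// Illustration only: plain arithmetic on offsets, no Hadoop classes involved.
final class ReadaheadBoundary {

    /** True if offset lies in the half-open window [windowStart, windowStart + windowLength). */
    static boolean insideWindow(long windowStart, long windowLength, long offset) {
        return offset >= windowStart && offset < windowStart + windowLength;
    }

    public static void main(String[] args) {
        final long readahead = 1024; // "1K", as set for fs.s3a.readahead.range above
        final long windowStart = 0;  // assumption: the first readByte() at offset 0 opens the window

        // The two positions used by readFully(...) in the sample program.
        for (long target : new long[] {1023, 1024}) {
            System.out.println("readFully at " + target
                    + ": starts inside window = "
                    + insideWindow(windowStart, readahead, target));
        }
    }
}
```

Under this assumption, offset 1023 is the last byte inside the window and 1024 is the first byte past it, which matches the split between the working and failing cases.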

I'm wondering two things:
- Is this a known problem for which I simply haven't found a ticket or question? If not, what are the steps to discuss or contribute a fix? (I have a potential solution in S3AInputStream.seekInStream.)
- Is the random input policy not expected to work fully? As it stands, seek, especially backwards seek, against S3 seems different, although for certain use cases it could avoid having to download the entire file to local storage.

Regards, Dave