Finding the average over a set of values that are created and deleted

Daniel Santos
Hello,

I have the following task:

I have an application that stores files and lets users add and delete them. Whenever a file is added, I append the following record to a file in HDFS:

userid image-uuid size_in_bytes

and the following when a file is removed:

-userid image-uuid size_in_bytes
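The two record shapes above could be parsed along the lines of the following minimal sketch. It assumes the fields are separated by whitespace and that a leading `-` on the user id is the only marker of a removal; the names `FileEvent` and `parse_record` are hypothetical and not part of the application described here.

```python
from typing import NamedTuple


class FileEvent(NamedTuple):
    user_id: str
    image_uuid: str
    size_in_bytes: int
    removed: bool  # True when the record starts with '-'


def parse_record(line: str) -> FileEvent:
    """Parse one log line (assumption: whitespace-separated fields)."""
    user_id, image_uuid, size = line.split()
    removed = user_id.startswith("-")
    if removed:
        user_id = user_id[1:]  # strip the removal marker
    return FileEvent(user_id, image_uuid, int(size), removed)
```

For example, `parse_record("-u1 abc-123 2048")` yields an event for user `u1` with `removed` set to `True`.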

When calculating the average in the reducer, I will have to subtract the size of each removed file from the total and decrease the count, so that the average is computed without that file.
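One way to picture this correction is a running signed sum: an added file contributes `+size` and `+1`, a removed file contributes `-size` and `-1`. The sketch below is only an illustration under two assumptions that are not stated in this thread: every removal record has a matching addition in the same input, and the size written on removal equals the size written on addition. The function name `average_size` is hypothetical.

```python
def average_size(records):
    """Average size over files that were added and not removed.

    Assumes each removal record matches exactly one addition record
    in the same input and carries the same size.
    """
    total = 0
    count = 0
    for line in records:
        user_id, _image_uuid, size = line.split()
        sign = -1 if user_id.startswith("-") else 1
        total += sign * int(size)  # add or subtract the file size
        count += sign              # add or subtract one file
    return total / count if count > 0 else None
```

With three additions of 100, 300 and 200 bytes and one removal of the 300-byte file, the total is 300 over 2 files, so the average is 150.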

Deletions are infrequent events.

My idea is to keep a hash map in memory in the reducer that tracks deletions while I iterate over the value list, so that I can correct the final total and count at the end of the iteration.

Oh, and this reminds me that I will have only one reducer, since the mapper emits a single 'avg' key.

What do you think?
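A rough sketch of that idea, written as plain Python rather than against the Hadoop API, might look like this. It assumes the values arrive as the raw record lines, that each removed file also appears as an addition in the same value list, and that keying the map by image-uuid is enough to ignore a removal that was logged twice. The function name `reduce_average` is hypothetical.

```python
def reduce_average(values):
    """Average file size, correcting for deletions after the iteration.

    Memory use grows only with the number of removed files, which is
    expected to be small because deletions are infrequent.
    """
    total = 0
    count = 0
    deletions = {}  # image-uuid -> size of the removed file

    for line in values:
        user_id, image_uuid, size = line.split()
        if user_id.startswith("-"):
            # Remember the deletion; a repeated record overwrites itself.
            deletions[image_uuid] = int(size)
        else:
            total += int(size)
            count += 1

    # Correct the total and count once the whole list has been seen.
    for size in deletions.values():
        total -= size
        count -= 1

    return total / count if count > 0 else None
```

Because the correction happens only at the end, the order in which additions and removals appear in the value list does not matter.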

Regards
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Finding the average over a set of values that are created and deleted

Vinod Kumar Vavilapalli
How big are your images? Depending on that, one of the following could be a better solution:
 (1) Put both images and the image meta-data in HBase
 (2) Put the images on HDFS and track the image meta-data in HBase.

Thanks
+Vinod

> On Aug 9, 2019, at 7:33 AM, Daniel Santos <[hidden email]> wrote: