Computation time with Hadoop for kmeans



Jérémy C

Good afternoon,


I programmed kmeans with Hadoop in R, using RHadoop, on a cluster of 3 machines (3 virtual machines on a single physical host; each VM has one core and 1.5 GB of RAM).

The purpose is to compare the computation time of kmeans on a Hadoop cluster with that on a local machine (without Hadoop).


I simulated data from a Gaussian distribution. With 2 million data points, the computation time with Hadoop is still much higher than the time taken without Hadoop. Can the computation time with Hadoop ever be lower than the time without Hadoop?

If yes, how can I do it? As I am working on a single machine with 3 VMs, I am wondering whether it is even possible to see the advantages of computing with Hadoop in this setup.
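
For reference, here is a simplified sketch of one iteration of my job, using the rmr2 package from RHadoop (the helper name kmeans.iter, the 2-D data, and k = 3 are just for illustration; my actual script differs in the details):

library(rmr2)

# simulated Gaussian data: one row per point, written to HDFS once
X <- matrix(rnorm(2e6 * 2), ncol = 2)
points.dfs <- to.dfs(X)
centers <- X[sample(nrow(X), 3), ]   # initial centres

# one kmeans iteration: assign each point to its nearest centre,
# then recompute each centre as the mean of its assigned points
kmeans.iter <- function(points.dfs, centers) {
  result <- mapreduce(
    input = points.dfs,
    map = function(., P) {
      # squared Euclidean distance from every point to every centre
      D <- apply(centers, 1, function(c) rowSums(sweep(P, 2, c)^2))
      keyval(max.col(-D), P)          # key = index of the nearest centre
    },
    reduce = function(k, P) keyval(k, t(colMeans(P)))
  )
  values(from.dfs(result))            # matrix of updated centres
}

# the comparison in question: local base-R kmeans vs one Hadoop iteration
system.time(kmeans(X, centers = 3))
system.time(kmeans.iter(points.dfs, centers))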


Thank you.

Jeremy 


Re: Computation time with Hadoop for kmeans

Jeff Hubbs
Jérémy -

Much of the point of Hadoop is that with each worker node added to the cluster you widen the disk I/O and CPU-to-RAM pipelines and increase the number of cores operating at once. What you've essentially done is take a single machine and add a lot of Hadoop overhead and some VM overhead to get in its way.

Another point behind Hadoop is to move the computation to where the data is. That idea is pretty much stuffed as far as your rig is concerned because all the data and all the computation have nowhere else to move to.

Also, in my experience Hadoop cluster worker nodes aren't useful until each has at least 8 GiB of RAM; I can't imagine what you can get out of just 1.5 GiB.

I'd advise you to grab some real hardware and try this again.
