Help !! Hadoop installation to One machine has 24 CPU 16 disk (Each one 2 TB)


Help !! Hadoop installation to One machine has 24 CPU 16 disk (Each one 2 TB)

dgoker

Hi

I installed Hadoop on a single server with the following configuration:

 24 CPUs,
 72 GB RAM,
 17 disks (2 TB each)

All Hadoop and Pig configuration is at the default settings. In order to run jobs efficiently, what should the following configuration settings be? The settings I find on forums are usually for 4-CPU machines and clustered
systems.


What values would you suggest for the following settings?


mapred.tasktracker.reduce.tasks.maximum   ?
mapred.map.tasks ?
mapred.reduce.tasks ?
dfs.datanode.handler.count ?



--
View this message in context: http://old.nabble.com/Help-%21%21-Hadoop-installation-to-One-machine-has-24-CPU-16-disk-%28Each-one-2-TB%29-tp26322618p26322618.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Is the job tracker a master node?

Raymond Jennings III
I am running with the NameNode and JobTracker on separate machines.  Does the JobTracker node need to be specified in the conf/masters file?  I am not running it as a slave node, so I do not have it in the conf/slaves file.  Thanks!


     

Re: Help !! Hadoop installation to One machine has 24 CPU 16 disk (Each one 2 TB)

Edward Capriolo
In reply to this post by dgoker
On Thu, Nov 12, 2009 at 12:17 PM, dgoker <[hidden email]> wrote:

> [original message quoted in full; snipped]

I see that no one has taken a stab at this one yet, so I will give it a go.
You have a very interesting configuration here, one that is not often
seen with Hadoop.

Firstly, there are some subtle issues with single-node deployments.
One issue is memory contention (another is the inability to
replicate data).

The Hadoop NameNode wants to keep its i-node table in main memory. Your
TaskTracker wants to chew up lots of memory for its map/reduce tasks.
These processes (and others) will fight for memory, and everyone loses.

Your first consideration may be ulimit or some other resource-control
mechanism. You would want to apply it to the DataNode and
TaskTracker to try to keep them under control.
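As a rough sketch of that idea, per-daemon memory caps can go in conf/hadoop-env.sh (the variable names below are the standard 0.20-era ones; the specific heap sizes and the ulimit line are assumptions, not recommendations):

```shell
# conf/hadoop-env.sh -- cap each daemon's JVM memory so they cannot
# starve one another on a single box (HADOOP_HEAPSIZE is in MB)
export HADOOP_HEAPSIZE=1000                # default heap for all daemons
export HADOOP_NAMENODE_OPTS="-Xmx4g"       # give the NameNode a larger, fixed heap
export HADOOP_DATANODE_OPTS="-Xmx1g"

# Optionally cap virtual memory (in KB) for processes started from this shell
ulimit -v 8388608
```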

You could get more fine-grained control with Linux-VServer
(http://www.linux-vserver.org/), Solaris Zones, or (insert virtualization
product here).

(I am guessing that by 24 CPUs you mean 24 cores?)
From there you can do some off-hand calculations like:

2 cores NameNode, 2-8 GB RAM
2 cores JobTracker, 2-8 GB RAM
2 cores SecondaryNameNode, 2-8 GB RAM

24 cores - 6 cores = 18 cores
72 GB - (6 GB to 24 GB) = 66 GB to 48 GB

So you have roughly 18 cores and 48 GB of RAM to dedicate to the DataNode
and TaskTracker. You can punch those numbers into the tuning guides
on the internet.

Usually
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum

are calculated from the number of cores the system has. The exact
formula varies with the guide you are reading.
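As a sketch of one such formula (the 1.5-slots-per-core ratio and the 2:1 map/reduce split below are common rules of thumb, not the only possibilities):

```python
# Rule-of-thumb slot calculation for a single fat node.
# Assumes ~6 cores are reserved for NameNode/JobTracker/SecondaryNameNode.
total_cores = 24
reserved_cores = 6
worker_cores = total_cores - reserved_cores  # cores left for TaskTracker work

# A common heuristic: about 1.5 task slots per worker core,
# split roughly 2:1 between map and reduce slots.
total_slots = int(worker_cores * 1.5)        # 27 slots
map_slots = (total_slots * 2) // 3           # mapred.tasktracker.map.tasks.maximum
reduce_slots = total_slots - map_slots       # mapred.tasktracker.reduce.tasks.maximum

print(map_slots, reduce_slots)               # 18 9
```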

You can also just plug the values into Cloudera's configurator and see
what it comes up with:
http://www.cloudera.com/hadoop-config-faq

http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

Configuration is an iterative process, and your workload shapes it.
Depending on how much development time you have, you should start as
simply as possible and evolve your configuration from there.
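Putting numbers in that ballpark into the properties the original post asked about might look like this (the values are illustrative starting points derived from the rough core count above, not recommendations):

```xml
<!-- mapred-site.xml: illustrative starting values for ~18 worker cores -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>18</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>9</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>9</value>
  </property>
</configuration>
```

(dfs.datanode.handler.count belongs in hdfs-site.xml; its default is 3, and with 17 disks on one node you may want to raise it after measuring.)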

RE: Is the job tracker a master node?

Jeff Zhang
In reply to this post by Raymond Jennings III
The conf/masters file contains the secondary namenode, not the master node
(the file name is a bit confusing).

You configure your namenode in core-site.xml and your job
tracker in mapred-site.xml.
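For example (0.20-style property names; the host names and ports are placeholders):

```xml
<!-- core-site.xml: where the NameNode lives -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>
</property>

<!-- mapred-site.xml: where the JobTracker lives -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>
```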


Jeff Zhang



-----Original Message-----
From: Raymond Jennings III [mailto:[hidden email]]
Sent: November 13, 2009 9:05
To: [hidden email]
Subject: Is the job tracker a master node?

I am running with the NameNode and JobTracker on separate machines.  Does
the JobTracker node need to be specified in the conf/masters file?  I am not
running it as a slave node, so I do not have it in the conf/slaves file.
Thanks!


     


I thought map and reduce could not overlap?

Raymond Jennings III
I thought there was a barrier that ensured the map phase would finish before the reduce phase started, but I see on the sample Hadoop word count app:

09/11/14 10:58:50 INFO mapred.JobClient:  map 79% reduce 18%
09/11/14 10:58:54 INFO mapred.JobClient:  map 79% reduce 19%
09/11/14 10:58:55 INFO mapred.JobClient:  map 80% reduce 19%
09/11/14 10:58:58 INFO mapred.JobClient:  map 80% reduce 20%
09/11/14 10:59:00 INFO mapred.JobClient:  map 81% reduce 20%
09/11/14 10:59:04 INFO mapred.JobClient:  map 82% reduce 20%
09/11/14 10:59:05 INFO mapred.JobClient:  map 82% reduce 21%
09/11/14 10:59:08 INFO mapred.JobClient:  map 82% reduce 22%

That looks like they are overlapping?



     

Re: I thought map and reduce could not overlap?

David Howell
The first 2/3 of the reduce phase (as reported by the progress meters)
are all about getting the map results from the map tasktracker to the
reduce tasktracker and sorting them. The real reduce happens in the
last third, and that part won't start until all of the maps are done.
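A toy model of how the reported percentage combines the three sub-phases (the equal one-third weighting follows the description above; the function is hypothetical, not actual Hadoop code):

```python
def reported_reduce_progress(copy_frac, sort_frac, reduce_frac):
    """Each sub-phase (copy, sort, reduce) contributes one third of the
    reported 'reduce' percentage, so the meter moves during the map phase."""
    return (copy_frac + sort_frac + reduce_frac) / 3.0

# Maps still running: 60% of map output copied, sort/reduce not started.
print(round(reported_reduce_progress(0.60, 0.0, 0.0) * 100))  # 20

# All output copied and sorted, actual reduce about to begin: ~67%.
print(round(reported_reduce_progress(1.0, 1.0, 0.0) * 100))   # 67
```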

On Sat, Nov 14, 2009 at 10:05 AM, Raymond Jennings III
<[hidden email]> wrote:

> [quoted message snipped]

Re: I thought map and reduce could not overlap?

timrobertson100
In reply to this post by Raymond Jennings III
My understanding is the following:
As map tasks finish, their output starts to be copied to the
reducer machines, but the reduce itself does not run yet. During this
stage, if you look at the running reducers, you will see something
like "copying 4 of 45". Once all the maps have finished and their
output has been copied, you will see reduce at 33%; then the
sorting runs, and then the actual reduce starts.

Basically this overlap is just it beginning to copy the data that is
ready onto the reducer machines.

Cheers

Tim


On Sat, Nov 14, 2009 at 5:05 PM, Raymond Jennings III
<[hidden email]> wrote:

> [quoted message snipped]

Re: I thought map and reduce could not overlap?

Kevin Weil
In reply to this post by Raymond Jennings III
The first third of the reduce phase is really the shuffle, where map
outputs get sent to and collected at their respective reducers. You'll
see this transfer happening, and the "reduce" figure creeping up towards 33%,
towards the end of your map phase. The 33% mark is where the real
barrier is.

Kevin

On Nov 14, 2009, at 8:05 AM, Raymond Jennings III  
<[hidden email]> wrote:

> [quoted message snipped]