To: pcp@xxxxxxxxxxx
Subject: [DRAFT] PCP and Hadoop (or any other large cluster of commodity machines)
From: Paul Smith <psmith@xxxxxxxxxx>
Date: Mon, 17 Jan 2011 12:20:21 +1100
Hi all,

I was talking to Nathan today at work about Aconex's need to move to frameworks 
such as the Hadoop MapReduce stack (see [1] if anyone is interested) and its 
child projects (particularly HBase, a column-oriented store).  The conversation 
reminded me that I/we hadn't discussed with the broader PCP community how best 
to use PCP in such an environment.

So, I thought I'd outline my own thoughts about how I see this working; sort of 
a 'vision'.

To give some background context, a Hadoop cluster is made from many commodity 
machines, each providing a portion of the service.  Redundancy is handled at 
the node level rather than at the level of individual components (e.g. RAID-ed 
disks).

Each commodity node usually has a dual role, providing both a DataNode (DN) 
layer (the logical block manager for that node) and a TaskTracker (TT, which 
runs the individual MapReduce task segments on that node).  There is also a 
master NameNode (NN), which distributes data blocks across all these machines 
via the DNs and ensures they are replicated (this is known as the HDFS layer), 
and a JobTracker (JT), which splits jobs into individual tasks and distributes 
them around all the TTs.
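
To make the split of roles concrete, here is a toy sketch of the daemon layout 
(the names and structure are my own shorthand for illustration, not Hadoop 
code):

    # Illustrative only: a toy model of the daemon layout described above,
    # not Hadoop code.
    MASTERS = {
        "NameNode":   "hdfs-master",    # tracks block -> DataNode placement
        "JobTracker": "mapred-master",  # hands tasks out to TaskTrackers
    }

    # every worker node runs both a DataNode and a TaskTracker
    WORKERS = {
        f"/datacentre1/rack{r}/node{n}": ("DataNode", "TaskTracker")
        for r in (1, 2)
        for n in (1, 2, 3)
    }

    for node, daemons in sorted(WORKERS.items()):
        print(node, "->", ", ".join(daemons))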

There's obviously a crapload of overhead here, but when you scale this sucker 
up, things get interesting.  Facebook, Twitter and Yahoo, to name a few, all 
have massive clusters of these machines - I think Facebook has the largest 
Hadoop cluster (see [2]) in terms of data size (2000 machines providing 21 PB 
of storage), with Yahoo's larger in terms of actual # of nodes.  The network 
layer becomes this massive stream of data.  Hadoop has sorted 100 TB in under 
3 hours (see [3]).

Aconex will be nowhere near this league in terms of # of nodes, but many people 
are doing very interesting things with even a handful (half a dozen or more) of 
nodes.  Scaling is very easy: just keep adding cheap commodity hardware.

This is where PCP comes in, and I wanted to think about a vision where Facebook 
etc. would consider using PCP to monitor their cluster.  Right now, in the Java 
space, Ganglia seems to be the main provider for this sort of thing, but I do 
think that PCP is superior (it just needs marketing, more on that in a future 
post).  As I understand it, PCP is mostly used for a smaller number of hosts 
working together, so the 1-host-per-archive rule could be tricky here with so 
many hosts?

I will start off by proposing some questions and then discuss, strawman-style, 
how it might/should/may-need-to-be-changed-to-work.  I would value people's 
thoughts on these.

How would one collect PCP metrics from such a large # nodes?
------------------------------------------------------------

I discussed with Nathan the idea of getting pmcd to advertise itself via 
ZeroConf (see [4]).  New nodes that kickstart themselves just broadcast "Hey, 
I'm here" and pmlogger starts logging them automatically.  In the Hadoop stack, 
each node is given a logical node name in a network-distance format 
("/datacentre1/rack1/node3") so that the NN can replicate blocks first _within_ 
a rack, then across racks, and then, ideally, outside the data centre for fault 
tolerance.  As more nodes/racks/data centres appear, this gets even more 
interesting.
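
To make the idea concrete, here is a rough Python sketch of what that discovery 
loop could look like, using the python-zeroconf library.  The "_pmcd._tcp.local." 
service type and the pmlogger arguments are assumptions on my part for 
illustration - pmcd doesn't advertise itself like this today:

    # Rough sketch only: assumes pmcd advertised itself over ZeroConf as
    # "_pmcd._tcp.local." (it does not today), and starts one pmlogger per
    # discovered host, keeping the current one-archive-per-host convention.
    import subprocess
    from zeroconf import Zeroconf, ServiceBrowser, ServiceListener

    class PmcdListener(ServiceListener):
        def add_service(self, zc, type_, name):
            info = zc.get_service_info(type_, name)
            if info is None:
                return
            host = info.server.rstrip(".")
            archive = f"/var/log/pcp/pmlogger/{host}/auto"  # assumed layout
            subprocess.Popen(["pmlogger", "-h", host, "-c", "config.default",
                              "-t", "30sec", archive])

        def remove_service(self, zc, type_, name):
            print("lost pmcd on", name)

        def update_service(self, zc, type_, name):
            pass

    zc = Zeroconf()
    browser = ServiceBrowser(zc, "_pmcd._tcp.local.", PmcdListener())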

Nathan mentioned that SGI has already had to deal with something like this 
using a hierarchy of PMDAs, although that's about as deep as our conversation 
got.  I could imagine a higher-level PMDA pulling in all the data for an 
individual rack, summarising it, and providing it upstream to a cluster-level 
PMDA.  I would really like to understand more about this approach.
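
As a strawman for what the rack-level summarisation might look like (assuming 
the per-node values have already been fetched by whatever means; the rack 
parsing and the statistics chosen are just placeholders):

    # Placeholder sketch: roll per-node samples up into per-rack aggregates,
    # keyed off the "/datacentre/rack/node" naming scheme described above.
    from collections import defaultdict

    def rack_of(node_path):
        # "/datacentre1/rack1/node3" -> "/datacentre1/rack1"
        return node_path.rsplit("/", 1)[0]

    def summarise(samples):
        """samples: {node_path: value} for one metric, one interval."""
        per_rack = defaultdict(list)
        for node, value in samples.items():
            per_rack[rack_of(node)].append(value)
        return {rack: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
                for rack, v in per_rack.items()}

    print(summarise({"/datacentre1/rack1/node1": 0.7,
                     "/datacentre1/rack1/node2": 0.9,
                     "/datacentre1/rack2/node1": 0.2}))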

Where in the cluster would a pmlogger-host live?
-------------------------------------------------

Imagine the 2000-node cluster: could a single host running pmlogger monitor 
this?  When the cluster is running flat out and the lights are dimming, network 
bandwidth may be limited, so dropped or laggy connections to specific hosts are 
possible.  Perhaps the hierarchical pmlogger idea works better here, with a 
pmlogger host per rack (although that's eating up rack space).  Anyone got good 
ideas on how to approach this?
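
To put some rough numbers on the single-pmlogger-host question, here is a 
back-of-envelope estimate (every figure below is an assumption picked for 
illustration, not a measurement):

    # Back-of-envelope only; every number here is an assumption.
    hosts            = 2000
    metrics_per_host = 500      # metrics actually logged, not all pmcd offers
    bytes_per_value  = 20       # rough per-value overhead on the wire/on disk
    interval_sec     = 30

    bytes_per_interval = hosts * metrics_per_host * bytes_per_value
    print("inbound to pmlogger host: %.1f MB every %d sec (%.2f MB/s)"
          % (bytes_per_interval / 1e6, interval_sec,
             bytes_per_interval / 1e6 / interval_sec))
    print("per day on disk: ~%.0f GB"
          % (bytes_per_interval * (86400 / interval_sec) / 1e9))

With those (made-up) numbers the bandwidth itself looks survivable; it's the 
fan-in of 2000 connections and 2000 archives that worries me more.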

In a sea of nodes, how could one use PCP to visualize the cluster?
-----------------------------------------------------------------

Nathan managed to hook up a basic prototype of the clusterviz (?) tool that 
allowed the topology to be mapped out in 3D.  The NameNode can enumerate all 
the nodes currently participating (or the ZeroConf method could be used), and 
combined with the network-distance name format it would be pretty simple to 
create a 3D mapping of a set of racks joined together with this tool.  I would 
have thought that visualising 'heatmap'-style connections between the nodes and 
the switches in their rack, and the upstream link to the switch joining the 
racks together, would be useful.  I don't have enough background on how this 
clusterviz thing was used at SGI, but I'd sure think it's useful somehow in 
this context.
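
Feeding such a tool would mostly be a matter of turning the network-distance 
names into a nested structure; a trivial sketch, nothing PCP-specific:

    # Trivial sketch: build a datacentre -> rack -> [nodes] tree from the
    # network-distance names, ready to hand to whatever does the 3D layout.
    from collections import defaultdict

    def build_topology(node_paths):
        topo = defaultdict(lambda: defaultdict(list))
        for path in node_paths:
            _, dc, rack, node = path.split("/")
            topo[dc][rack].append(node)
        return topo

    topo = build_topology(["/datacentre1/rack1/node1",
                           "/datacentre1/rack1/node2",
                           "/datacentre1/rack2/node1"])
    for dc, racks in topo.items():
        for rack, nodes in racks.items():
            print(dc, rack, nodes)
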

pmie across so many machines?
-----------------------------

So you've got a theory that such and such a metric is important to why 
something is wrong, but trying to find out where within the cluster the problem 
is coming from, given each node has its own archive, is tricky.  Imagine a 
scenario where we have a theory that a particular node, or a small set of 
nodes, has dodgy network cards and is causing lots of M/R task aborts.  Running 
pmie across the archives for the cluster to locate which hosts have a higher 
incidence of failure for such and such a metric could prove interesting in the 
2000-node (and therefore 2000-archives-per-day) case?  This is where the 
current PCP archive format of 1-host-per-archive is a bit limiting, unless I'm 
misreading things.
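
One crude way to attack that today is to sweep the per-host archives from the 
outside.  A rough sketch of the idea (the archive layout, the metric chosen 
and the parsing of pmlogsummary output are all assumptions for illustration):

    # Crude sketch: rank hosts by average error rate by sweeping one archive
    # per host.  Assumes a one-directory-per-host archive layout and that
    # pmlogsummary prints "metric value ..." lines; adjust parsing to taste.
    import glob
    import subprocess

    METRIC = "network.interface.in.errors"   # example metric only

    def average_from_archive(archive):
        out = subprocess.run(["pmlogsummary", archive, METRIC],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            parts = line.split()
            if parts and parts[0].startswith(METRIC) and len(parts) > 1:
                try:
                    return float(parts[1])
                except ValueError:
                    pass
        return 0.0

    scores = {path.split("/")[-2]: average_from_archive(path)
              for path in glob.glob("/var/log/pcp/pmlogger/*/20110117")}
    for host, avg in sorted(scores.items(), key=lambda kv: kv[1],
                            reverse=True)[:10]:
        print(host, avg)

That works, but it is a lot of forking and file I/O for what is really one 
query across the whole cluster.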

This is where exporting PCP data from archives into a column-oriented store 
such as HBase may have real advantages for querying data from a large # of 
hosts in one spot.
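
As a strawman, the export side could look something like this using the 
happybase HBase client - the table name, column family and row-key scheme are 
all invented for illustration:

    # Strawman only: push (host, timestamp, metric, value) samples into HBase.
    # Table, column family and row-key layout are invented for illustration.
    import happybase

    connection = happybase.Connection("hbase-master")
    table = connection.table("pcp_metrics")

    def store_sample(host, timestamp_ms, metric, value):
        # row key groups a host's samples together and sorts them by time
        row = f"{host}:{timestamp_ms:013d}".encode()
        table.put(row, {f"m:{metric}".encode(): str(value).encode()})

    store_sample("node3.rack1.dc1", 1295224821000,
                 "network.interface.in.errors", 0.0)

With that layout, "all samples for one host over a time range" becomes a 
single row scan, and a MapReduce job over the table can answer the 
whole-cluster questions above.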


[1] Hadoop - hadoop.apache.org
[2] Facebook's Hadoop cluster - 
http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
[3] 100 TB sort in <3hours - http://sortbenchmark.org/
[4] ZeroConf - http://www.zeroconf.org/