Hi all,
I was talking to Nathan today at work about Aconex's need to move to using
frameworks such as the Hadoop MapReduce stack (see [1] if anyone is
interested) and its child projects (particularly HBase - a column-oriented
store). The conversation reminded me that I/we hadn't discussed with the
broader PCP community how best to use PCP in such an environment.
So, I thought I'd outline my own thoughts about how I see this sort of working;
sort of a 'vision'.
To put some background context here, a Hadoop cluster is made from many
commodity machines, each providing a portion of the service. Redundancy is
done at the node level rather than at individual components such as RAID-ed
disks. Each commodity node usually has a dual role, providing both a
DataNode (DN) layer (the logical block manager for that node) and a
TaskTracker (TT - runs the individual MapReduce task segments on that node).
There is also a master NameNode (NN) - a global master distributing and
ensuring replication of data blocks across all these machines via the DNs
(this is known as the HDFS layer) - and a JobTracker (JT - splits jobs into
individual tasks and distributes these around all the TTs).
There's obviously a crapload of overhead here, but when you scale this sucker
up, things get interesting. Facebook, Twitter and Yahoo, to name a few, all
have massive clusters of these machines - Facebook I think has the largest
Hadoop cluster (see [2]) in terms of data size (2000 machines providing 21 PB
of storage), with Yahoo's larger in terms of actual node count. The network
layer becomes this massive stream of data. Hadoop has sorted 100 TB in under
3 hours (see [3]).
Aconex will be nowhere near this league in terms of node count, but many people
are doing some very interesting things with even a handful (a half-dozen or
more) of nodes. Scaling is very easy - just keep adding cheap commodity
hardware.
This is where PCP comes in, and I wanted to think about a vision where Facebook
etc. would consider using PCP to monitor their cluster. Right now, in the Java
space, Ganglia seems to be the main provider for this sort of thing, but I do
think that PCP is superior (it just needs marketing - more on that in a future
post). As I understand it, PCP is mostly used for smaller numbers of hosts
working together; the one-host-per-archive rule could be tricky here with so
many hosts?
I will start off by proposing some questions and then discuss, strawman-style,
how it might/should/may-need-to-be-changed-to-work. I would value people's
thoughts on these.
How would one collect PCP metrics from such a large # nodes?
------------------------------------------------------------
I had discussed with Nathan the idea of getting pmcd to advertise itself via
ZeroConf (see [4]). New nodes that kickstart themselves just broadcast "Hey,
I'm here" and pmlogger starts logging them automatically. In the Hadoop stack,
each node is given a logical node name in a network-distance format
("/datacentre1/rack1/node3") so that the NN can replicate blocks _within_ a
rack first, then across racks, and then ideally outside the data centre for
fault tolerance. As more nodes/racks/data centres appear, this only gets more
interesting.
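
To make that concrete, here's a rough sketch of the kind of discovery daemon
I'm imagining, assuming pmcd were taught to advertise a hypothetical
"_pmcd._tcp" mDNS service - which it doesn't do today - and using the
python-zeroconf library on the pmlogger host. Hostnames, paths and the
service type are all made up:

    # Sketch only: assumes pmcd advertises itself as "_pmcd._tcp" via mDNS,
    # which is hypothetical - PCP doesn't do this out of the box today.
    import subprocess
    from zeroconf import Zeroconf, ServiceBrowser

    class PmcdListener:
        def __init__(self):
            self.loggers = {}            # host -> running pmlogger process

        def add_service(self, zc, type_, name):
            info = zc.get_service_info(type_, name)
            if info is None:
                return
            host = info.server.rstrip('.')   # e.g. node3.rack1.datacentre1
            archive = "/var/log/pcp/pmlogger/%s/auto" % host   # illustrative path
            # Start logging the new node with the stock pmlogger config.
            self.loggers[host] = subprocess.Popen(
                ["pmlogger", "-h", host, "-t", "30sec",
                 "-c", "config.default", archive])

        def remove_service(self, zc, type_, name):
            pass                         # could stop that node's pmlogger here

        def update_service(self, zc, type_, name):
            pass

    zc = Zeroconf()
    browser = ServiceBrowser(zc, "_pmcd._tcp.local.", PmcdListener())

The nice part is that a freshly kickstarted node needs no per-node logging
configuration at all - it just needs to announce itself.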
Nathan mentioned that SGI has already had to deal with something like this
using a hierarchy of pmdas, although that's about as deep as the conversation
got. I could imagine a higher-level pmda pulling in all the data for an
individual rack, summarising it, and providing it upstream to a cluster-level
pmda. I would really like to understand more about this approach.
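
To illustrate the flavour of the roll-up I mean (not the actual pmda API -
I'd defer to the SGI folk on that), here's a hedged sketch that just shells
out to pmprobe for each node in a rack and summarises one metric. The
hostnames and the choice of metric are invented:

    # Rough sketch of the kind of roll-up a rack-level pmda (or even a cron
    # script) could do: pull one metric from every node in the rack and
    # publish a rack-wide summary.  Uses pmprobe rather than the pmda API
    # just to show the idea; hostnames are made up.
    import subprocess

    RACK_NODES = ["node%d.rack1.datacentre1" % i for i in range(1, 21)]
    METRIC = "kernel.all.load"

    def fetch_load(host):
        # pmprobe -v prints: <metric> <numval> <val1> <val2> ...
        out = subprocess.check_output(
            ["pmprobe", "-v", "-h", host, METRIC]).decode()
        fields = out.split()
        return float(fields[2])          # the 1-minute load average

    loads = []
    for node in RACK_NODES:
        try:
            loads.append(fetch_load(node))
        except Exception:
            pass                         # node down/unreachable - skip it

    if loads:
        print("rack1: nodes=%d avg_load=%.2f max_load=%.2f"
              % (len(loads), sum(loads) / len(loads), max(loads)))

A cluster-level pmda would then only ever see one set of numbers per rack
rather than per node, which is what keeps the fan-in manageable.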
Where in the cluster would a pmlogger-host live?
-------------------------------------------------
Imagine the 2000-node cluster: could a single host running pmlogger monitor
this? When the cluster is running flat out and the lights are dimming, network
bandwidth may be limited, so dropped or laggy connections to specific hosts
are possible. Perhaps the hierarchical pmlogger idea works better here, with a
pmlogger host per rack (although that's eating up rack space). Anyone got good
ideas on a way to do this?
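
For the pmlogger-host-per-rack idea, the existing pmlogger control file seems
a natural fit - something just needs to generate one line per node in the
rack. A rough sketch (node names invented, and I'm assuming the usual
control-file field layout of host / primary / socks / directory / options):

    # Sketch: generate a pmlogger "control" file for the pmlogger host that
    # owns rack1, one line per node in the rack.  The node list here is made
    # up; in practice it could come from the NameNode's topology or ZeroConf.
    RACK = "rack1"
    NODES = ["node%d.%s.datacentre1" % (i, RACK) for i in range(1, 21)]

    HEADER = "# pmlogger control file for %s (generated)\n" % RACK
    # Fields: host  primary?  socks?  archive-directory  pmlogger-options
    LINE = "%s\tn\tn\tPCP_LOG_DIR/pmlogger/%s\t-r -T24h10m -c config.default\n"

    with open("control.%s" % RACK, "w") as ctl:
        ctl.write(HEADER)
        for node in NODES:
            ctl.write(LINE % (node, node))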
In a sea of nodes, how could one use PCP to visualize the cluster?
-----------------------------------------------------------------
Nathan managed to hook up a basic prototype of the clusterviz (?) tool that
allowed the topology to be mapped out in 3D. The NameNode can enumerate all
the nodes currently participating (or the ZeroConf method could be used), and
combined with the network-distance name format it would be pretty simple to
create a 3D mapping of a set of racks joined together with this tool. I would
have thought that visualising 'heatmap'-style connections between the nodes
and the switches in their rack, plus the upstream link to the switch joining
the racks together, would be useful. I don't have enough background on how
this clusterviz thing was used at SGI, but I'd sure think it's useful somehow
in this context.
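
As a trivial illustration, the network-distance names already carry the
topology, so grouping nodes per rack (say, to colour a heatmap or hand rack
groupings to a visualiser) is almost free - a quick sketch with made-up node
names:

    # Sketch: the network-distance names already encode the topology, so
    # grouping nodes for a rack-by-rack heatmap is a one-liner per level.
    # The node list is invented for illustration.
    from collections import defaultdict

    nodes = [
        "/datacentre1/rack1/node1", "/datacentre1/rack1/node2",
        "/datacentre1/rack2/node1", "/datacentre2/rack1/node1",
    ]

    racks = defaultdict(list)
    for n in nodes:
        _, dc, rack, node = n.split("/")
        racks[(dc, rack)].append(node)

    for (dc, rack), members in sorted(racks.items()):
        # e.g. feed each group to the visualiser as one rack object,
        # coloured by some per-rack summary metric (the heatmap part).
        print("%s/%s: %d nodes" % (dc, rack, len(members)))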
pmie across so many machines?
-----------------------------
So you've got a theory that such-and-such a metric is important to why
something is wrong, but trying to find out where within the cluster the
problem lies, given each node has its own archive, is tricky. Imagine a
scenario where we have a theory that a particular node, or a small set of
nodes, has dodgy network cards and is causing lots of M/R task aborts.
Running pmie across the archives for the cluster to locate which hosts have a
higher incidence of failure for such-and-such a metric could prove interesting
in the 2000-node (and therefore 2000-archives-per-day) case? This is where the
current PCP archive format of one host per archive is a bit limiting, unless
I'm misreading things.
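
Something like the following sketch is what I have in mind - replay each
host's archive through pmie with whatever rule file encodes the theory being
tested (the "netcard.pmie" name, the paths and the date here are all
invented):

    # Sketch: hunt for the dodgy nodes by replaying yesterday's archives
    # through pmie.  The rule file (netcard.pmie) would hold whatever
    # "NIC errors vs task aborts" theory is being tested; paths are
    # illustrative only.
    import glob
    import subprocess

    suspects = []
    for meta in sorted(glob.glob("/var/log/pcp/pmlogger/*/20110101.*.meta")):
        archive = meta[:-len(".meta")]   # pmie wants the archive base name
        host = archive.split("/")[-2]
        out = subprocess.check_output(
            ["pmie", "-a", archive, "-c", "netcard.pmie"]).decode()
        if out.strip():                  # a rule fired for this host
            suspects.append((host, out.strip()))

    for host, evidence in suspects:
        print(host, "->", evidence)

That works, but it's 2000 separate archive replays to answer one question,
which is exactly the pain point I'm getting at.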
This is where perhaps exporting PCP data from archives into a column-oriented
store such as HBase may have real advantages for querying data from a large
number of hosts in one spot.
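
As a hedged sketch of what that export might look like: flatten an archive
with pmdumptext into host+timestamp keyed rows that an HBase client
(happybase, say) or a bulk load could then take. The metric names, paths,
row-key layout and pmdumptext options used here are all assumptions on my
part:

    # Sketch: flatten one day's archive into rows a column store could take.
    # Row key = host + timestamp, one column per metric - the usual HBase
    # "wide row" layout.  Metrics, paths and the load step are illustrative.
    import subprocess

    HOST = "node3.rack1.datacentre1"
    ARCHIVE = "/var/log/pcp/pmlogger/%s/20110101" % HOST
    METRICS = ["kernel.all.cpu.user", "mem.util.free"]   # single-valued metrics

    # -f %s asks for epoch-second timestamps (relies on strftime's %s extension)
    out = subprocess.check_output(
        ["pmdumptext", "-a", ARCHIVE, "-t", "1min",
         "-f", "%s", "-d", ","] + METRICS).decode()

    rows = []
    for line in out.splitlines():
        if not line or not line[0].isdigit():
            continue                     # skip header/blank lines
        ts, values = line.split(",", 1)
        rowkey = "%s:%s" % (HOST, ts)
        rows.append((rowkey, dict(zip(METRICS, values.split(",")))))

    # From here a client such as happybase could put() each row, or the rows
    # could be written out for an HBase bulk load.
    print("%d rows ready for %s" % (len(rows), HOST))

Once the data for all 2000 hosts is in one store, the "which nodes have the
dodgy NICs" question becomes a single scan rather than 2000 archive replays.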
[1] Hadoop - http://hadoop.apache.org/
[2] Facebook's Hadoop cluster -
http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
[3] 100 TB sort in < 3 hours - http://sortbenchmark.org/
[4] ZeroConf - http://www.zeroconf.org/