Hadoop, a new beginning

23Aug10

I’m going to start this blog the way the majority of blogs start out.

Well, it’s been a while since I last posted, I’ve been busy, blah blah blah… And now to the good stuff.

In my new position as a Server Lead for a startup, I have come up against some interesting challenges. I decided to break away from my love of the RDBMS and go with something that scales horizontally… without breaking. I’ve spent the last week reading up and watching videos on Hadoop. I think it would be interesting for others to see how my transition is going. I’ll try to keep this blog updated, not only for others, but as a reference for myself on useful links, tips, and explanations.

To start off, I would like to explain exactly what I plan to accomplish. Basically, I have a RESTful web service that takes in protobuf-serialized data. Currently I use a Django web service with a few mappers to massage and validate the data a bit and throw it right into Postgres. I think I will need to access this data for historical purposes, something similar to web analytics reports, as well as take a few looks at the freshest data available.

In the Hadoop world, I thought HBase would be a good candidate for the fresh data access I need, whereas Hive seems to be more of the conventional data warehousing and report-building framework I would need.

Because I’m so new to Hadoop, I’m sure that in the days ahead, if not hours, I’ll look back at the horrible mistakes I’ve made and correct these entries accordingly.

Over the weekend I read ‘Hadoop: The Definitive Guide’ by Tom White. I highly recommend it. It is actually a fairly easy read, and you can easily flip around between chapters. I was more interested in HBase than in the details of MapReduce, so I read that chapter first.

Tom White is a consultant at Cloudera. Their site has the best Hadoop reference material I have seen. They also offer a distribution you can download, which is a nice collection of various Hadoop packages. Even better, they have a VM you can just spin up and start testing Hadoop within minutes of installing it.

Today I hope to get the VM installed here on my work MacBook, use Osmosis to get some real data into HDFS, and write some code to mimic some of the aggregation queries I’m currently doing in SQL. I’m not sure whether I will use Hive for this today, but we shall see. I’m now going through this blog post (http://blog.lars-francke.de/2010/07/22/processing-openstreetmap-data-with-hive/), which has already pointed me to a few useful tools for getting data into HDFS from Postgres.
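For a sense of what mimicking a SQL aggregation in MapReduce terms might look like, here is a minimal sketch in plain Python with no Hadoop involved. It imitates something like SELECT event, COUNT(*) ... GROUP BY event by splitting the work into map, shuffle, and reduce steps. The record fields (device, event) and all function names are made up for illustration and are not the real schema or any Hadoop API.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical records; the field names are invented for this example.
RECORDS = [
    {"device": "a", "event": "open"},
    {"device": "b", "event": "open"},
    {"device": "a", "event": "close"},
    {"device": "a", "event": "open"},
]


def map_record(record):
    """Map step: emit one (key, 1) pair per input record."""
    yield (record["event"], 1)


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def count_by_event(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_record(r) for r in records)
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_by_event(RECORDS))
```

In a real Hadoop job the shuffle is handled by the framework and the map and reduce steps run on different machines, so this only shows the shape of the computation, not how it would be deployed.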

This is an exciting time for me. I love learning new things.


