andrewmccall.com: Hbase for storing Users?

  • Jonathan Gray · 6 months ago
    Basic secondary indexing on HBase is done as you describe: create an additional table for each index, where the row id is the indexed field. This is also available as an integrated feature in TransactionalHBase, which takes care of managing the secondary tables for you. It uses OCC (optimistic concurrency control) for safety.

    In my own usage, I manage the secondary tables at the application level. This is faster but less safe.

    I plan to add a less safe but fast server-side implementation of this in the future for my own purposes. But I've also heard there's a chance OCC will be pluggable in the current implementation, in which case I'd just use that. Sign up to the mailing list; the 0.20.0 release is coming up soon, and that will be decided for that release.
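A minimal sketch of the application-level approach described above, for readers following along. It does not use the HBase client API: two in-memory dicts stand in for the primary table and the index table, and the names `UserStore`, `put_user`, and `get_by_email` are hypothetical. It illustrates why this route is fast but less safe: the index is maintained by separate, non-atomic writes, so reads have to tolerate a stale index entry.

```python
class UserStore:
    """Toy stand-in for a primary table plus one secondary index table.

    Assumption: plain dicts replace real HBase tables; nothing here is
    the HBase API.
    """

    def __init__(self):
        self.users = {}           # primary "table": row key = user id
        self.users_by_email = {}  # index "table": row key = email, value = user id

    def put_user(self, user_id, email, name):
        old = self.users.get(user_id)

        # Write 1: the primary row.
        self.users[user_id] = {"email": email, "name": name}

        # Write 2: the index row. In a real cluster this is a separate
        # write; a crash between writes 1 and 2 leaves the index behind.
        self.users_by_email[email] = user_id

        # Write 3: remove the stale index row if the indexed field changed,
        # but only if that row still points at this user.
        if old is not None and old["email"] != email:
            if self.users_by_email.get(old["email"]) == user_id:
                del self.users_by_email[old["email"]]

    def get_by_email(self, email):
        user_id = self.users_by_email.get(email)
        if user_id is None:
            return None
        row = self.users.get(user_id)
        # Re-check the primary row so a stale index entry is not trusted.
        if row is None or row["email"] != email:
            return None
        return user_id, row
```

The re-check in `get_by_email` is the cheap substitute for the safety that a transactional implementation would provide; it costs one extra read per lookup.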
  • andrewmccall · 6 months ago
    Thanks for that, Jonathan; good to know I'm more or less on the right path. Now that you mention it, I remember reading about it in the docs but had since forgotten. I looked again and saw this:

    http://hadoop.apache.org/hbase/docs/current/api...

    Which I'll look into and post about if it's useful.
  • Tim Sell · 5 months ago
    Just an idea.
    Have you considered using a search server for the indexing/searching and hbase for storing?
    You'd have to keep them in sync of course, but Solr is quite useful. You can optimize for just returning ids by indexing fields and not storing them, and there is progress on sharding if you really need it. It doesn't scale to billions of rows of course, but it's unlikely that will be a problem for users. You can do exact matches on any of the fields, and of course utilise full text searches where appropriate.
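As a rough illustration of the split suggested above (index for finding ids, store for the rows), here is a toy sketch. It is not Solr or Lucene: `IdOnlyIndex` is a hypothetical in-memory inverted index that keeps only ids per token, which mirrors the idea of indexing fields without storing them, and a plain dict stands in for the row store. Deletes and updates, which are the hard part of keeping the two in sync, are left out.

```python
from collections import defaultdict


class IdOnlyIndex:
    """Hypothetical inverted index: fields are indexed, never stored."""

    def __init__(self):
        # (field, token) -> set of row ids containing that token
        self._postings = defaultdict(set)

    def add(self, row_id, fields):
        for field, value in fields.items():
            for token in value.lower().split():
                self._postings[(field, token)].add(row_id)

    def search(self, field, query):
        """Return the ids of rows whose field contains every query token."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        # .get() avoids creating empty postings for unknown tokens.
        result = set(self._postings.get((field, tokens[0]), set()))
        for token in tokens[1:]:
            result &= self._postings.get((field, token), set())
        return result


def fetch_rows(row_store, ids):
    """Look up full rows by id in the store (a dict here), skipping missing ids."""
    return [row_store[i] for i in sorted(ids) if i in row_store]
```

A lookup is then two steps: `ids = index.search("name", "ann smith")` followed by `fetch_rows(row_store, ids)`. Skipping missing ids in `fetch_rows` covers the case where the index is briefly ahead of or behind the store.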
  • andrewmccall · 5 months ago
    Tim, that sounds interesting. I've not played with Solr, but I am creating a Lucene index using some of the fields in some of the tables - the cluster is running some highly customised Nutch jobs based on the code here: http://github.com/andrewmccall/nutchbase. I considered putting the user ids in a Lucene index and using that to find users, but I was a bit reluctant to implement it because I felt there was too much I didn't know.

    Thinking about it again, I may just look at both implementations in more depth, because it may be a better way to go, especially as indexes start to pile up.