Varud

Socially Proximate Predictions

Archive for January, 2010

Pros and Cons of MongoDB

I was recently asked by somebody to answer some questions regarding MongoDB.  Unfortunately, I have yet to use it in production, but Ara, Zach and I have put it through quite a few paces at this point …
  • Nature of Use:
    • It would be useful if you could mention the nature of the application (e.g. reporting or analytics) you are using MongoDB for.
    • We use MongoDB for high volume logging.  After what we need is logged, we use Python/PyMongo to transform the data into chunks suitable for Postgres.  Postgres is our central data store used for our Django application and all its associated models.

    • Which other NoSQL storage solutions were evaluated, and why was MongoDB chosen over them?
    • Cassandra was the other one that we got pretty far with.  In terms of maturity and scalability, Cassandra appeared to be the winner.  However, Cassandra has extremely limited query capabilities that weren’t sufficient for us.  In addition, MongoDB has plans to focus on scalability which suited our needs fine.

  • Robustness:
    • How long have you been running MongoDB in production?
    • We have not run it in production yet.

    • Did you encounter any issues on the stability front (any crashes or restarts needed)?
    • One issue is how best to keep it ‘living’ without human intervention.  So far, the tools have been very straightforward and simpler than solutions for other products.  However, we haven’t tested the quality of backups under high load, nor have we really pressured the system in the wild.  We architected MongoDB in our system so that we could lose it and all we would lose is incoming data while it was down, not historical data or reporting capabilities (which is ok for us for a few hours).

  • Performance:
    • What has been your experience on performance side like (queries/sec for the hardware configuration being used)?
    • We hit 30 inserts per second on a High-CPU (the lowest 64-bit) Amazon EC2 instance.  However, the bottleneck was in our Python listener, so we don’t know how much higher MongoDB could go.  We suspect quite a lot, as the load average was under 0.2 during this test.

    • Did performance degrade as the data size grew?
    • We haven’t sufficiently tested this yet.

  • Scalability:
    • What is the rough data size (number of records, number of collections, size on disk) that MongoDB is being used for?
    • The goal is to hit 1k inserts/second with real time processing (i.e. using their upsert functionality which is something like INSERT … ELSE UPDATE) and to hold onto 10M+ records in a collection.  If we weren’t confident in that being possible, we would not have chosen MongoDB.

    • Does all the data sit in one MongoDB server, or are you using MongoDB in a clustered environment? If you are using a sharded environment, I would like to hear about your experience, because MongoDB does not support auto-sharding out of the box.
    • We are using sharding, but again, we have not pushed it to the limit.  Although it does not support auto-sharding, manually setting up a shard is pretty straightforward.  Auto-sharding is one of the advantages Cassandra has.

  • DataReplication/Persistence:
    • Did you use data-replication in Mongo? What has been the general experience with it?
    • We are planning to use replication but are not using it yet.  As referenced above, we have the option of losing MongoDB for a few hours and not incurring a major business penalty.

    • Regarding persistence of data, did you encounter any issues given that MongoDB does lazy writes to the file system?
    • No, but again, it has not been pushed enough for me to feel confident that this is a non-issue.  We are planning to use XFS, however, which does have journaling to account for problems at the file block level.

  • Search:
    • Did your application require text searches on the documents stored in Mongo? Since MongoDB does not support text search out of the box, how did you take care of search?
    • We aren’t using full text search.  Our goal with regard to that is to set up Sphinx or something similar when we need it.  That seems like the right architectural solution.

  • Support:
    • When resolving issues related to Mongo, did you rely on the open-source community or sign up for paid support? What has been your experience?
    • Community.

  • Client-side tools:
    • Which libraries did you use to talk to the MongoDB server? Our web app will run in Python, and there are two libraries available for Python.
    • PyMongo.

    • It would be great if you could share pointers to the client-side tools you are using with MongoDB.
    • The Mongo interface is a bit chunky (the way it uses JSON for everything), so often I just use PyMongo since all of our real code uses that anyway.  Our plan is to only have a small number of collections so any necessary queries would happen through our code, not in an ad hoc way requiring a client gui or something like that.
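
The upsert-based logging and the chunked hand-off to Postgres mentioned in the answers above could be sketched roughly as follows.  This is a minimal sketch and not our actual code: it assumes PyMongo 3 or later (where Collection.update_one exists; older releases used a different update call), and the function names, the counter fields and the chunk size are all made up for illustration.

```python
from itertools import islice


def record_event(collection, key, when):
    """Insert a counter document for `key`, or update it if it already exists.

    `collection` is assumed to be a PyMongo 3+ Collection (or any object with
    a compatible `update_one` method). The field names are hypothetical.
    """
    return collection.update_one(
        {"_id": key},                    # match on the document key
        {
            "$inc": {"count": 1},        # bump the counter
            "$set": {"last_seen": when}, # remember the latest timestamp
        },
        upsert=True,                     # create the document if it is missing
    )


def chunked(documents, size):
    """Yield lists of at most `size` documents from any iterable or cursor."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each list that chunked yields could then be transformed and written to Postgres in a single batch, so the relational side never has to keep up with the raw insert rate.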

Django 1.2 Alpha – Template Threading

Django 1.2 Alpha 1 was recently released to developers worldwide.  I haven’t been able to play around with it yet but I am reading through the announced changes and plan to write a series of articles for people making the leap from 1.1 to 1.2 – since that’s what I’ll be doing this Spring.

First of all, this is a giant release.  I don’t expect it to go smoothly, and I can pretty much guarantee that some major third-party packages will be broken even when Django 1.2 is released as stable.  One of the major changes is that compiled template nodes can now be cached in memory by the new cached template loader (I think; at least that’s how I understand it).  Most people will say, ‘cached is faster than not cached … this is great’.

Unfortunately, what this really means is that the web server will keep the compiled template nodes in memory and share them across all the threads in that process.  Now, imagine you are on a default Apache installation on a modern OS.  These days, that Apache will likely be running in a multithreaded configuration.  That means that each thread (each serving an end user) will hit those nodes in a shared fashion.  If you (or a third-party developer) have written any custom template tags, this can be a problem.

Thankfully, the fantastic Django docs point this out and explain why this matters.  For the lazy, I’ll reproduce the Django 1.1-compatible template tag code that could drive the cycle tag:

{% cycle 'row1' 'row2' %}
And the Python code behind it:
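import itertools
from django.template import Node

class CycleNode(Node):
    def __init__(self, cyclevars):
        # The iterator lives on the node itself, so every render of this node shares it.
        self.cycle_iter = itertools.cycle(cyclevars)
    def render(self, context):
        # Python 2 iterator call; on Python 2.6+ or 3 this would be next(self.cycle_iter).
        return self.cycle_iter.next()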

To take their example, if you write a tag that cycles different styles for list items, and two threads hit the same cached tag node, you might get cycling that crosses the thread boundaries.  Typically, each thread is serving a different client’s request.  In that example, one client would get two odd styles and the other would get two even styles.  In Django 1.1, since the template tag node was not cached, each user would get an odd, even cycle of their styles, which is the expected behavior.
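
To make the interleaving concrete, here is a small pure-Python simulation.  It is only a sketch: StatefulCycleNode is a stand-in that mimics a tag node keeping its iterator on itself without importing Django, and render_interleaved is a hypothetical helper that alternates render calls between two requests the way two threads sharing one cached node might.

```python
import itertools


class StatefulCycleNode:
    """Stand-in for a Django Node subclass that keeps state on itself."""

    def __init__(self, cyclevars):
        self.cycle_iter = itertools.cycle(cyclevars)

    def render(self, context):
        return next(self.cycle_iter)


def render_interleaved(node, requests=2, rows=2):
    """Render `rows` rows per request, alternating between the requests."""
    output = {r: [] for r in range(requests)}
    for _ in range(rows):
        for r in range(requests):
            output[r].append(node.render(context=None))
    return output


shared = StatefulCycleNode(["row1", "row2"])  # one cached node, shared by both
print(render_interleaved(shared))
# {0: ['row1', 'row1'], 1: ['row2', 'row2']}
```

Request 0 only ever sees 'row1' and request 1 only ever sees 'row2', which is exactly the crossed cycling described above.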

What does this mean for people with custom tags?  In plain English, this is only really an issue if your tag keeps state on the node that must not be shared between renders.  Keep in mind that the variables passed in through the context will still be thread-safe (they are stored with each render in its thread, not on the template node).  If you are using a template tag that depends on the state of the template at a given moment, then you need to worry and follow their advice of storing that state in render_context; if not, it’s ok.
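
Following the approach the Django docs recommend, a thread-safe version of the tag might look like this.  It is a sketch under assumptions: FakeContext is a hypothetical stand-in for Django’s Context (with a plain dict as render_context) so that the example runs without Django, and in a real project CycleNode would subclass django.template.Node.

```python
import itertools


class FakeContext:
    """Hypothetical stand-in for django.template.Context (one per render)."""

    def __init__(self):
        self.render_context = {}


class CycleNode:
    """In Django this would subclass django.template.Node."""

    def __init__(self, cyclevars):
        # Only immutable configuration is kept on the (possibly shared) node.
        self.cyclevars = cyclevars

    def render(self, context):
        # Per-render state is keyed by this node inside the render's own
        # render_context, so two renders never share an iterator.
        if self not in context.render_context:
            context.render_context[self] = itertools.cycle(self.cyclevars)
        return next(context.render_context[self])
```

Rendering the same node alternately against two separate contexts now gives each of them its own row1, row2 sequence, because the iterator is no longer stored on the node.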

Stay tuned for a discussion on the Messaging API next.