Custom Fields on South 0.7

If you've upgraded to South 0.7, you'll notice that custom model fields are no longer supported out of the box - South now refuses to freeze fields it doesn't have introspection rules for.

There's a long, convoluted discussion about supporting custom fields with introspection rules, but that level of detail is unnecessary for most custom fields. If you're just extending a standard field like CharField, follow their tutorial example.

Add the import from south.modelsinspector import add_introspection_rules to your fields.py file (the file holding the custom field MyCustomField - in my case, in the util app).

Then, at the bottom, under your field definition, put: add_introspection_rules([], ["^util.fields.MyCustomField"])
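
Putting it all together, here's a minimal sketch of what util/fields.py might look like - the CharField subclass itself is hypothetical, just to show where the rule goes:

from django.db import models
from south.modelsinspector import add_introspection_rules

class MyCustomField(models.CharField):
    # hypothetical field: a CharField with a fixed default max_length
    def __init__(self, *args, **kwargs):
        kwargs.setdefault('max_length', 100)
        super(MyCustomField, self).__init__(*args, **kwargs)

# no custom rules needed - just tell South this pattern is safe to introspect
add_introspection_rules([], ["^util.fields.MyCustomField"])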

ackrc file

Great tip: add this to your ~/.ackrc file for ack:

--ignore-dir=migrations 

Then, when you run Ack in Project from TextMate, you won't get hits on your migration directories - which typically aren't what you're looking for anyway.

Pros and Cons of MongoDB

I was recently asked by somebody to answer some questions regarding MongoDB.  Unfortunately, I have yet to use it in production, but Ara, Zach, and I have put it through quite a few paces at this point ... Nature of use: What kind of application (e.g. reporting or analytics?) are you using MongoDB for?
We use MongoDB for high-volume logging.  Once what we need is logged, we use Python/PyMongo to transform the data into chunks suitable for Postgres.  Postgres is our central data store, used for our Django application and all its associated models.
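
To make that concrete, here's a rough sketch of the kind of transform we run - the database, collection, and table names are all made up for illustration:

from collections import defaultdict

import psycopg2
from pymongo import Connection

# pull raw log documents out of MongoDB and roll them up
mongo = Connection('localhost', 27017)
counts = defaultdict(int)
for doc in mongo['logdb']['raw_logs'].find():
    counts[doc['label']] += 1

# push the aggregated chunks into Postgres for reporting
pg = psycopg2.connect('dbname=reporting')
cur = pg.cursor()
cur.executemany("INSERT INTO log_rollups (label, hits) VALUES (%s, %s)",
                counts.items())
pg.commit()
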
What other NoSQL storage solutions were evaluated, and why was MongoDB chosen over the others?
Cassandra was the other one that we got pretty far with.  In terms of maturity and scalability, Cassandra appeared to be the winner.  However, Cassandra has extremely limited query capabilities that weren't sufficient for us.  In addition, MongoDB has plans to focus on scalability which suited our needs fine.
Robustness: How long have you been running MongoDB in production?
Have not run it in production yet.
Did you encounter any issues on the stability front (any crashes or restarts needed)?
One issue is how best to keep it 'living' without human intervention.  So far, the tools have been very straightforward and simpler than the equivalents for other products.  However, we haven't tested the quality of backups under high load, nor have we really pressured the system in the wild.  We architected our system so that if we lose MongoDB, all we lose is incoming data while it is down - not historical data or reporting capabilities - which is OK for us for a few hours.
Performance: What has your experience been on the performance side (queries/sec for the hardware configuration being used)?
We hit 30 inserts per second on a High-CPU (the lowest 64-bit) Amazon EC2 instance.  However, the bottleneck was in our Python listener, so we don't know how much higher MongoDB could go.  We suspect quite a lot, as the load average was under 0.2 during this test.
Did performance degrade as the data size grew?
We haven't sufficiently tested this yet.
Scalability: What is the rough data size (number of records, number of collections, size on disk) Mongo is being used for?
The goal is to hit 1k inserts/second with real-time processing (i.e. using their upsert functionality, which is something like INSERT ... ELSE UPDATE) and to hold onto 10M+ records in a collection.  If we weren't confident that was possible, we would not have chosen MongoDB.
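
For anyone who hasn't seen an upsert, here's roughly what it looks like in PyMongo - db is an open database handle, and the collection and field names are made up:

# increment a counter for this label, creating the document if it doesn't exist
db.page_counts.update({'label': 'pagedisplay_violation'},
                      {'$inc': {'hits': 1}},
                      upsert=True)
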
Does all the data sit on one MongoDB server, or are you using MongoDB in a clustered environment? If it's being used in a sharded environment, I'd like to know your experience, since MongoDB does not support auto-sharding out of the box.
We are using sharding, but again, we have not pushed it to the limit.  Although MongoDB does not support auto-sharding, manually setting up a shard is pretty straightforward.  Auto-sharding is one of the advantages Cassandra has.
Data replication/persistence: Did you use data replication in Mongo? What has been the general experience with it?
We are planning to use replication but are not yet.  As referenced above, we have the option of losing MongoDB for a few hours without incurring a major business penalty.
Regarding persistence of data, did you encounter any issues given that MongoDB does lazy writes to the file system?
No, but again, it has not been pushed hard enough for me to feel confident that this is a non-issue.  We are planning to use XFS, however, which does have journaling to account for problems at the file-block level.
Search: Did your application require text searches on the documents stored in Mongo? Since MongoDB does not support text search out of the box, how did you take care of search?
We aren't using full-text search.  Our plan is to set up Sphinx or something similar when we need it.  That seems like the right architectural solution.
Support: For resolving issues related to Mongo, did you rely on the open-source community or sign up for paid support? What has been your experience?
Community.
Client-side tools: Which libraries did you use to talk to the MongoDB server? Our web app will be running in Python, and there are two libraries available for Python.
PyMongo.
It would be great if you could share pointers to the client-side tools you are using with MongoDB.
The Mongo interface is a bit chunky (the way it uses JSON for everything), so often I just use PyMongo, since all of our real code uses that anyway.  Our plan is to have only a small number of collections, so any necessary queries would happen through our code, not in an ad hoc way requiring a client GUI or something like that.

Django 1.2 Alpha - Template Threading

Django 1.2 Alpha 1 was recently released to developers worldwide.  I haven't been able to play around with it yet, but I am reading through the announced changes and plan to write a series of articles for people making the leap from 1.1 to 1.2 - since that's what I'll be doing this spring.

First of all, this is a giant release.  I don't expect it to go smoothly, and I can pretty much guarantee that some major third-party packages will be broken even when Django 1.2 ships as stable.  One of the major changes is that template node bytecode will now be cached in memory and shared (I think - at least that's how I understand it).  Most people will say, 'cached is faster than not cached ... this is great'.

Unfortunately, what this really means is that the web server will cache the compiled template code and run it across all the threads that share that process's memory pool.  Now, imagine you are on a default Apache installation on a modern OS.  These days, that Apache will be running in a multithreaded configuration, which means each thread (each end user) hits that cached bytecode in a shared fashion.  If you (or a third-party developer) have written any custom template tags, this can be a problem.

Thankfully, the fantastic Django docs point this out and explain why it matters.  For the lazy, I'll reproduce the 1.1-compatible template tag code that could drive the cycle tag:
{% cycle 'row1' 'row2' %}
And the Python code behind it:
import itertools

from django.template import Node

class CycleNode(Node):
    def __init__(self, cyclevars):
        # shared state: this iterator lives on the cached node itself
        self.cycle_iter = itertools.cycle(cyclevars)

    def render(self, context):
        return self.cycle_iter.next()
To take their example: if you write a tag that cycles through different styles for list items and two threads hit that tag node, you might get cycling that crosses thread boundaries - one client's request pulls one value from the shared iterator, and another client's request pulls the next.  One client in that example would get two odd styles and the other two even styles, whereas in Django 1.1, since the template tag node was not cached, each user would get the expected odd, even cycle of their styles.

What does this mean for people with custom tags?  In plain English, it's only really an issue if your tag stores state on the node itself.  Keep in mind that the variables passed in are still thread-safe (they are stored with the thread's context, not in the template node code).  If your template tag depends on state from the template at a given moment, then you need to worry and follow their advice of using render_context; if not, it's OK.
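
For reference, the thread-safe rewrite from the Django docs stores only the raw values on the shared node and keeps the iterator in render_context, which is per-render rather than shared:

class CycleNode(Node):
    def __init__(self, cyclevars):
        # only immutable configuration lives on the shared node
        self.cyclevars = cyclevars

    def render(self, context):
        # the iterator is created per render, keyed on this node instance
        if self not in context.render_context:
            context.render_context[self] = itertools.cycle(self.cyclevars)
        cycle_iter = context.render_context[self]
        return cycle_iter.next()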

Keep posted for a discussion on the Messaging API next.

Cassandra learning

I've been reading up on NoSQL databases for our new deployment.  We're down to two: Cassandra and MongoDB.  I wish I'd been more thorough about the decision to get it down to those two, but suffice it to say that we only eliminated the others (Voldemort, CouchDB, etc...) if they didn't support sharding, didn't have Python libraries, weren't 'mature', or for a few other specific reasons.  We didn't just dismiss solutions out of hand.

I'm just focused on Cassandra right now because my colleague, Ara, is focused on MongoDB.  We will be jousting later on about which software is best.

Pros:
  • Shards can handle datasets larger than the memory available (unlike Redis which can't handle more data than it has RAM).  This is a pro only in our case where we're expecting many GB of data.
  • Favors Availability and Partitioning over Consistency - although it is Eventually Consistent.
  • Fully supports replication, partitioning, self-repair, etc... without application-level logic.
  • Supports asynchronous writes: the node accepts the write and returns control to the client while the node takes care of forwarding the write appropriately.  The write is logged locally for fault tolerance.
  • Data is split locally between a Memtable (in RAM) and SSTables (on disk) for low latency and low volatility.
  • A 'Bloom Filter' allows very fast checking of whether a key exists without having to touch the data file (see the sketch after the cons list).
  • Supported Python client library maintained by Digg.
  • Write is non-blocking - no read required.
  • Writes are atomic within a ColumnFamily.
  • 'Remove' functionality uses 'Tombstones' to mark a record as ready for deletion so that deletes are asynchronous.
Cons:
  • 'Schema' changes require restarting the service.
  • No commercial support.
  • Writes are favored over reads (which is good for typical scenarios, but worth considering for some people).
  • Loss of Libido
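
Since the Bloom filter point begs for an illustration, here's a toy version in Python - just to show the idea, not Cassandra's actual implementation:

import hashlib

class BloomFilter(object):
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive several bit positions from the key (key assumed to be a str)
        for i in range(self.num_hashes):
            digest = hashlib.md5(str(i) + key).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means 'go check the data file'
        return all(self.bits[pos] for pos in self._positions(key))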

Speeding up rarely used Leopard laptop

I have an old G4 laptop that comes out sometimes when my fiancée is using the shiny new MacBook Air.

Unfortunately, every time I use it, the hard drive thrashes and the CPU goes to 90%.  This is not good when you're simply trying to read nytimes.com.

What you'll probably notice if you look at Activity Monitor or the output of ps from the terminal is the locate update running.  It's rebuilding the locate database - which makes running 'locate' from the command line faster and, I'm presuming, is the backend for Spotlight?

Just pop into the terminal and from /etc/periodic run:

sudo mv weekly/310.locate monthly/ 

Now, the locate database will only be rebuilt monthly - fine for a machine that only comes out once in a while.  If you don't care about locate, you can also just delete the file entirely; if it's just a browser machine with no new files, it doesn't matter anyway.

Inglourious Basterds

To even conceive of this movie strikes one as deeply contingent: a revisionist historical plot about an anti-Nazi band of assassins bent on revenge for the atrocities committed during WWII punctuated by low comedy and grand action.

Quentin Tarantino does a masterful job of creating an homage to 30s film, a movie divided into vignettes painting characters in broad strokes, men - good and bad, a classic setting of occupied France, and a grand finale so full of deep satisfaction for the viewer it's hard not to grin with a thorough sense of exaltation at the meting out of deeply deserved retribution.

The characters are: the Apache/Appalachian mongrel ready to lead his Jewish charges into battle (Brad Pitt), the stunning Film Noir heroine whose family was murdered by the Gestapo, a Negro projectionist reminiscent of Sidney Poitier, two Jewish machine gunners/head bashers (among other heroes), the evil Nazi 'Hunter', a German turncoat actress, and the Nazi high command. [Some might find these descriptions offensive but I'm trying to stay true to the dialogue]

It's an affecting movie that allows the viewer to feel a sense of schadenfreude.  The opening introduces us early on to the very real, very straightforward atrocities committed by the Nazis on a regular basis.  All of the action takes place in France - allowing one to feel a sense of normalcy unavailable to other theaters of the war closed to thorough mental examination.  The Concentration Camps, Normandy, Dresden - these are zones of European slaughter on such a scale that it renders the feeling person's senses numb.

At this point, we are introduced to a team of Americans sent behind enemy lines to terrorize the Nazi troops.  The critical mark of their attacks is scalping the dead to instill fear in the other units.  Early on we are desensitized to the murder - scalps are taken, swastikas are carved into foreheads, skulls are crushed with baseball bats.  All of this is a reminder of the base anger of war - death comes on a wave of righteous hate.  We are red in tooth and claw regardless of who killed first.

We are then moved forward to the main storyline.  A plot has been hatched by the Allies and independently by the Jewish Heroine who is dropped into the fortuitous situation of hosting a gala cinema opening at her theater for the German high command.

Twists and turns bring us to the final conclusion, which really does leave the viewer feeling something close to revelation at the outcome.

My main concern is how the early violence is used to prepare people for happiness at murder.  The villains are so beyond reform that there isn't a scintilla of restraint in the viewer's vicarious thrill at the slaughtering of all who are hated.  This is unlike a horror movie or a drama - the viewer is deeply invested in the outcome - and roots for it.

Is this war?  I now see that we cannot live in a state without war.  It is the final act of hatred and condemnation and it will be with us as long as humanity knows right and wrong.

defaultdict to count items in Django

One of the great things I discovered today is collections.defaultdict.   It allows you to build a dictionary with a count for each item in a very compact and powerful way.  Look at this:
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for item in ExampleModel.objects.values('label','other_info'):
...   d[item['label']] += 1
...
>>> d.items()
[(u'key_1', 4),(u'another_key',2)]
>>>
This will give you the number of times each label appears in ExampleModel.objects.values('label', 'other_info').
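
For contrast, here's the same count without defaultdict - every key has to be initialized by hand with get():
>>> d = {}
>>> for item in ExampleModel.objects.values('label', 'other_info'):
...   d[item['label']] = d.get(item['label'], 0) + 1
...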

Boycott of Fox

One of the happy things I saw today is that the boycott of Fox (and in particular Glenn Beck) is actually picking up steam.  Companies like WalMart, UPS Stores, etc... have decided to join the campaign.

I don't think you need to sign up here:

http://foxnewsboycott.com/

But perhaps just contact one or two companies with which you do business and let them know that you're dissatisfied with their support of this low-brow vitriol.  And yes, it is their responsibility to vet their advertising outlets.

I remember seeing Glenn Beck when he was on CNN Airport, while waiting to re-enter the country (yes, real Americans explore the world), as he was telling Arabs to get out of the country and blaming them en masse for all of America's ills.

Sadly, Glenn Beck reaches the most vulnerable people in our society - those with limited IQ.  We need to help them.