Recently I've been working on a web 2.0ish community site written in Django. As is frequently the case with such sites I often need to create lists or collections of psuedo-randomly selected items. For example on a user's profile page there may be a box showing a few of the user's friends, another box with some of the user's
pictures, or a box with random comments the user has made.
Ideally in this type of situation I would simply preform a query with what ever rules I need and a random "order by" statement such as the following:
Which generates SQL similar to:
In an ideal world this works great, I only get back as much data as I need and the site stays interesting because what the user sees is constantly changing/rotating. Unfortunately unless you are using PostgreSQL or have less than a few hundred rows in the pictures table you are not in an ideal world. Under both MySQL 4 and 5 random ordering is horrifyingly inefficient. Adding that "ORDER BY rand()" can make a query that took a few milliseconds to run instead take tens of seconds. The problem gets even worse when you are working with a table containing millions of rows and thousands of rows which meet the criteria of your query. What is one to do?
pictures, or a box with random comments the user has made.
Ideally in this type of situation I would simply preform a query with what ever rules I need and a random "order by" statement such as the following:
pictures = User.picture_set.filter(private=False).order_by('?')[:6]
Which generates SQL similar to:
SELECT * FROM `pictures`
WHERE private = 0 AND user_id = 5
ORDER BY rand()
LIMIT 6
In an ideal world this works great, I only get back as much data as I need and the site stays interesting because what the user sees is constantly changing/rotating. Unfortunately unless you are using PostgreSQL or have less than a few hundred rows in the pictures table you are not in an ideal world. Under both MySQL 4 and 5 random ordering is horrifyingly inefficient. Adding that "ORDER BY rand()" can make a query that took a few milliseconds to run instead take tens of seconds. The problem gets even worse when you are working with a table containing millions of rows and thousands of rows which meet the criteria of your query. What is one to do?