How to store smart-list rules in a relational database - sql

The system I'm building has smart groups. By smart groups, I mean groups that update automatically based on these rules:
Include all people that are associated with a given client.
Include all people that are associated with a given client and have these occupations.
Include a specific person (i.e., by ID)
Each smart groups can combine any number of these rules. So, for example, a specific smart list might have these specific rules:
Include all people that are associated with client 1
Include all people that are associated with client 5
Include person 6
Include all people associated with client 10, and who have occupations 2, 6, and 9
These rules are OR'ed together to form the group. I'm trying to think about how to best store this in the database given that, in addition to supporting these rules, I'd like to be able to add other rules in the future without too much pain.
The solution I have in mind is to have a separate model for each rule type. The model would have a method on it that returns a queryset that can be combined with other rules' querysets to, ultimately, come up with a list of people. The one downside of this that I can see is that each rule would have its own database table. Should I be concerned about this? Is there, perhaps, a better way to store this information?

Why not use Q objects?
rule1 = Q(client = 1)
rule2 = Q(client = 5)
rule3 = Q(id = 6)
rule4 = Q(client = 10) & (Q(occupation = 2) | Q(occupation = 6) | Q(occupation = 9))
people = Person.objects.filter(rule1 | rule2 | rule3 | rule4)
and then store their pickled strings into the database.
rule = rule1 | rule2 | rule3 | rule4
pickled_rule_string = pickle.dumps(rule)
Rule.objects.create(pickled_rule_string=pickled_rule_string)

Here are the models we implemented to deal with this scenario.
class ConsortiumRule(OrganizationModel):
BY_EMPLOYEE = 1
BY_CLIENT = 2
BY_OCCUPATION = 3
BY_CLASSIFICATION = 4
TYPES = (
(BY_EMPLOYEE, 'Include a specific employee'),
(BY_CLIENT, 'Include all employees of a specific client'),
(BY_OCCUPATION, 'Include all employees of a speciified client ' + \
'that have the specified occupation'),
(BY_CLASSIFICATION, 'Include all employees of a specified client ' + \
'that have the specified classifications'))
consortium = models.ForeignKey(Consortium, related_name='rules')
type = models.PositiveIntegerField(choices=TYPES, default=BY_CLIENT)
negate_rule = models.BooleanField(default=False,
help_text='Exclude people who match this rule')
class ConsortiumRuleParameter(OrganizationModel):
""" example usage: two of these objects one with "occupation=5" one
with "occupation=6" - both FK linked to a single Rule
"""
rule = models.ForeignKey(ConsortiumRule, related_name='parameters')
key = models.CharField(max_length=100, blank=False)
value = models.CharField(max_length=100, blank=False)
At first I was resistant to this solution as I didn't like the idea of storing references to other objects in a CharField (CharField was selected, because it is the most versatile. Later on, we might have a rule that matches any person whose first name starts with 'Jo'). However, I think this is the best solution for storing this kind of mapping in a relational database. One reason this is a good approach is that it's relatively easy to clean hanging references. For example, if a company is deleted, we only have to do:
ConsortiumRuleParameter.objects.filter(key='company', value=str(pk)).delete()
If the parameters were stored as serialized objects (e.g., Q objects as suggested in a comment), this would be a lot more difficult and time consuming.

Related

Pandas: Subset of subset with multiple conditions

I need to grab a subset of the following using multiple conditions:
Event Type must contain the string 'Outreach'
AND any other field can contain the string 'STEM' - case insensitive.
Data Sample:
Title Event Type Presenter Description Tags
STEM event STEM Gloria Bubbles Craft
Robots Outreach STEM - John EV3 Bots
School STEM Outreach Billy Robots Craft
Code:
cond = df['Event Type'].str.contains('Outreach')
stemA = df[cond]
This gets me all the outreach events.
cond = df['Event Type'].str.contains('Outreach') & (df['Presenter'].str.contains('STEM') | df['Tags'].str.contains('STEM') | df['Description'].str.contains('STEM') | df['Title'].str.contains('STEM'))
stem[cond]
I was hoping for a grep-like solution. The above gets me less than grep does on the command line and I know this result is wrong from looking at the data.
IIUC, this should work for you
cols_to_include = df.columns[df.columns != 'Event Type']
a = df[cols_to_include].astype(str).sum(axis=1)
df[df['Event Type'].str.contains('Outreach') & (a.str.contains('STEM', regex=True))]

Selecting related model: Left join, prefetch_related or select_related?

Considering I have the following relationships:
class House(Model):
name = ...
class User(Model):
"""The standard auth model"""
pass
class Alert(Model):
user = ForeignKey(User)
house = ForeignKey(House)
somevalue = IntegerField()
Meta:
unique_together = (('user', 'property'),)
In one query, I would like to get the list of houses, and whether the current user has any alert for any of them.
In SQL I would do it like this:
SELECT *
FROM house h
LEFT JOIN alert a
ON h.id = a.house_id
WHERE a.user_id = ?
OR a.user_id IS NULL
And I've found that I could use prefetch_related to achieve something like this:
p = Prefetch('alert_set', queryset=Alert.objects.filter(user=self.request.user), to_attr='user_alert')
houses = House.objects.order_by('name').prefetch_related(p)
The above example works, but houses.user_alert is a list, not an Alert object. I only have one alert per user per house, so what is the best way for me to get this information?
select_related didn't seem to work. Oh, and surely I know I can manage this in multiple queries, but I'd really want to have it done in one, and the 'Django way'.
Thanks in advance!
The solution is clearer if you start with the multiple query approach, and then try to optimise it. To get the user_alerts for every house, you could do the following:
houses = House.objects.order_by('name')
for house in houses:
user_alerts = house.alert_set.filter(user=self.request.user)
The user_alerts queryset will cause an extra query for every house in the queryset. You can avoid this with prefetch_related.
alerts_queryset = Alert.objects.filter(user=self.request.user)
houses = House.objects.order_by('name').prefetch_related(
Prefetch('alert_set', queryset=alerts_queryset, to_attrs='user_alerts'),
)
for house in houses:
user_alerts = house.user_alerts
This will take two queries, one for houses and one for the alerts. I don't think you require select related here to fetch the user, since you already have access to the user with self.request.user. If you want you could add select_related to the alerts_queryset:
alerts_queryset = Alert.objects.filter(user=self.request.user).select_related('user')
In your case, user_alerts will be an empty list or a list with one item, because of your unique_together constraint. If you can't handle the list, you could loop through the queryset once, and set house.user_alert:
for house in houses:
house.user_alert = house.user_alerts[0] if house.user_alerts else None

Filtering model with HABTM relationship

I have 2 models - Restaurant and Feature. They are connected via has_and_belongs_to_many relationship. The gist of it is that you have restaurants with many features like delivery, pizza, sandwiches, salad bar, vegetarian option,… So now when the user wants to filter the restaurants and lets say he checks pizza and delivery, I want to display all the restaurants that have both features; pizza, delivery and maybe some more, but it HAS TO HAVE pizza AND delivery.
If I do a simple .where('features IN (?)', params[:features]) I (of course) get the restaurants that have either - so or pizza or delivery or both - which is not at all what I want.
My SQL/Rails knowledge is kinda limited since I'm new to this but I asked a friend and now I have this huuuge SQL that gets the job done:
Restaurant.find_by_sql(['SELECT restaurant_id FROM (
SELECT features_restaurants.*, ROW_NUMBER() OVER(PARTITION BY restaurants.id ORDER BY features.id) AS rn FROM restaurants
JOIN features_restaurants ON restaurants.id = features_restaurants.restaurant_id
JOIN features ON features_restaurants.feature_id = features.id
WHERE features.id in (?)
) t
WHERE rn = ?', params[:features], params[:features].count])
So my question is: is there a better - more Rails even - way of doing this? How would you do it?
Oh BTW I'm using Rails 4 on Heroku so it's a Postgres DB.
This is an example of a set-iwthin-sets query. I advocate solving these with group by and having, because this provides a general framework.
Here is how this works in your case:
select fr.restaurant_id
from features_restaurants fr join
features f
on fr.feature_id = f.feature_id
group by fr.restaurant_id
having sum(case when f.feature_name = 'pizza' then 1 else 0 end) > 0 and
sum(case when f.feature_name = 'delivery' then 1 else 0 end) > 0
Each condition in the having clause is counting for the presence of one of the features -- "pizza" and "delivery". If both features are present, then you get the restaurant_id.
How much data is in your features table? Is it just a table of ids and names?
If so, and you're willing to do a little denormalization, you can do this much more easily by encoding the features as a text array on restaurant.
With this scheme your queries boil down to
select * from restaurants where restaurants.features #> ARRAY['pizza', 'delivery']
If you want to maintain your features table because it contains useful data, you can store the array of feature ids on the restaurant and do a query like this:
select * from restaurants where restaurants.feature_ids #> ARRAY[5, 17]
If you don't know the ids up front, and want it all in one query, you should be able to do something along these lines:
select * from restaurants where restaurants.feature_ids #> (
select id from features where name in ('pizza', 'delivery')
) as matched_features
That last query might need some more consideration...
Anyways, I've actually got a pretty detailed article written up about Tagging in Postgres and ActiveRecord if you want some more details.
This is not "copy and paste" solution but if you consider following steps you will have fast working query.
index feature_name column (I'm assuming that column feature_id is indexed on both tables)
place each feature_name param in exists():
select fr.restaurant_id
from
features_restaurants fr
where
exists(select true from features f where fr.feature_id = f.feature_id and f.feature_name = 'pizza')
and
exists(select true from features f where fr.feature_id = f.feature_id and f.feature_name = 'delivery')
group by
fr.restaurant_id
Maybe you're looking at it backwards?
Maybe try merging the restaurants returned by each feature.
Simplified:
pizza_restaurants = Feature.find_by_name('pizza').restaurants
delivery_restaurants = Feature.find_by_name('delivery').restaurants
pizza_delivery_restaurants = pizza_restaurants & delivery_restaurants
Obviously, this is a single instance solution. But it illustrates the idea.
UPDATE
Here's a dynamic method to pull in all filters without writing SQL (i.e. the "Railsy" way)
def get_restaurants_by_feature_names(features)
# accepts an array of feature names
restaurants = Restaurant.all
features.each do |f|
feature_restaurants = Feature.find_by_name(f).restaurants
restaurants = feature_restaurants & restaurants
end
return restaurants
end
Since its an AND condition (the OR conditions get dicey with AREL). I reread your stated problem and ignoring the SQL. I think this is what you want.
# in Restaurant
has_many :features
# in Feature
has_many :restaurants
# this is a contrived example. you may be doing something like
# where(name: 'pizza'). I'm just making this condition up. You
# could also make this more DRY by just passing in the name if
# that's what you're doing.
def self.pizza
where(pizza: true)
end
def self.delivery
where(delivery: true)
end
# query
Restaurant.features.pizza.delivery
Basically you call the association with ".features" and then you use the self methods defined on features. Hopefully I didn't misunderstand the original problem.
Cheers!
Restaurant
.joins(:features)
.where(features: {name: ['pizza','delivery']})
.group(:id)
.having('count(features.name) = ?', 2)
This seems to work for me. I tried it with SQLite though.

Left join query in Django with multiple joins against same table

Not sure how to accomplish this in Django.
Models:
class LadderPlayer(models.Model):
player = models.ForeignKey(User, unique=True)
position = models.IntegerField(unique=True)
class Match(models.Model):
date = models.DateTimeField()
challenger = models.ForeignKey(LadderPlayer)
challengee = models.ForeignKey(LadderPlayer)
Would like to query to get all info about a player in one shot, including any challenges they have issued or challenges against them. This SQL works:
select lp.position,
lp.player_id,
sc1.challengee_id challenging,
sc2.challenger_id challenged_by
from ladderplayer lp left join challenge sc1 on lp.player_id = sc1.challenger_id
left join challenge sc2 on lp.player_id = sc2.challengee_id
Which returns something like this, if player 3 has challenged player 2:
position player_id challenging challenged_by
---------- ---------- ----------- -------------
1 1
2 2 3
3 3 2
No idea how to do in Django ORM....any way to do this?
Actually you should probably change your models a bit, since there's a many-to-many relation from LadderPlayer to itself using Match as an intermediate table. Check out django's documentation on this topic. Then you should be able to make the queries you want using django's orm! Also have a look at symmetrical/asymmetrical many-to-many relationships!
Well, I did more digging and it looks like in Django 1.2 this is doable via the "raw()" method on the Query Manager thing. So this is the code using my query above:
ladder_players = LadderPlayer.objects.raw("""select lp.id, lp.position,lp.player_id,
sc1.challengee_id challenging,
sc2.challenger_id challenged_by
from ladderplayer lp left join challenge sc1 on lp.player_id = sc1.challenger_id
left join challenge sc2 on lp.player_id = sc2.challengee_id order by position""")
And in the template, you can refer to the "calculated" join fields:
{% for p in ladder_players %}
{{p.challenging}} {{p.challenged_by}}
...
etc.
Seems to work as I needed....
#lazerscience is absolutely correct. You should tweak your models, since you are setting up a de facto many-to-many relationship; doing so will allow you to leverage more features of the admin interface & so forth.
Additionally, regardless, there is no need to go to raw(), since this can be done entirely via normal usage of the Django ORM.
Something like:
class LadderPlayer(models.Model):
player = models.ForeignKey(User, unique=True)
position = models.IntegerField(unique=True)
challenges = models.ManyToManyField("self", symmetrical=False, through='Match')
class Match(models.Model):
date = models.DateTimeField()
challenger = models.ForeignKey(LadderPlayer)
challengee = models.ForeignKey(LadderPlayer)
should be all you need to change in the models. You then should be able to do a query like
player_of_interest = LadderPlayer.objects.filter(pk=some_id)
matches_of_interest = \
Match.objects.filter(Q(challenger__pk=some_id)|Q(challengee__pk=some_id))
to get all the information of interest about the player in question. Note that you'll need to have from django.db.models import Q to use that.
If you want exactly the same info you're presenting with your example query, I believe it'd be easiest to split the queries into separate ones for getting the challenger & challengee lists -- for example, something like:
challengers = LadderPlayer.objects.filter(challenges__challengee__pk=poi_id)
challenged_by = LadderPlayer.objects.filter(challenges__challenger__pk=poi_id)
will get the two relevant query sets for the player of interest (w/ a primary key of poi_id).
If there's some particular reason you don't want the de facto many-to-many relationship to become a de jure one, you can change those to something along the lines of
challenger = LadderPlayer.objects.filter(match__challengee__pk=poi_id)
challenged_by = LadderPlayer.objects.filter(match__challenger_pk=poi_id)
So the suggestion for the model change is merely to help leverage existing tools, and to make explicit a relationship which you are currently having occur implicitly.
Based on how you want use it, you might want to do something like
pl_tuple = ()
for p in LadderPlayer.objects.all():
challengers = LadderPlayer.objects.filter(challenges__challengee__pk=p.id)
challenged_by = LadderPlayer.objects.filter(challenges__challenger__pk=p.id)
pl_tuple += (p.id, p.position, challengers, challenged_by)
context_dict['ladder_players'] = pl_tuple
in your view to prepare the data for your template.
Regardless, you should probably be doing your query through the Django ORM instead of using raw() in this case.

Django - Count a subset of related models - Need to annotate count of active Coupons for each Item

I have a Coupon model that has some fields to define if it is active, and a custom manager which returns only live coupons. Coupon has an FK to Item.
In a query on Item, I'm trying to annotate the number of active coupons available. However, the Count aggregate seems to be counting all coupons, not just the active ones.
# models.py
class LiveCouponManager(models.Manager):
"""
Returns only coupons which are active, and the current
date is after the active_date (if specified) but before the valid_until
date (if specified).
"""
def get_query_set(self):
today = datetime.date.today()
passed_active_date = models.Q(active_date__lte=today) | models.Q(active_date=None)
not_expired = models.Q(valid_until__gte=today) | models.Q(valid_until=None)
return super(LiveCouponManager,self).get_query_set().filter(is_active=True).filter(passed_active_date, not_expired)
class Item(models.Model):
# irrelevant fields
class Coupon(models.Model):
item = models.ForeignKey(Item)
is_active = models.BooleanField(default=True)
active_date = models.DateField(blank=True, null=True)
valid_until = models.DateField(blank=True, null=True)
# more fields
live = LiveCouponManager() # defined first, should be default manager
# views.py
# this is the part that isn't working right
data = Item.objects.filter(q).distinct().annotate(num_coupons=Count('coupon', distinct=True))
The .distinct() and distinct=True bits are there for other reasons - the query is such that it will return duplicates. That all works fine, just mentioning it here for completeness.
The problem is that Count is including inactive coupons that are filtered out by the custom manager.
Is there any way I can specify that Count should use the live manager?
EDIT
The following SQL query does exactly what I need:
SELECT data_item.title, COUNT(data_coupon.id) FROM data_item LEFT OUTER JOIN data_coupon ON (data_item.id=data_coupon.item_id)
WHERE (
(is_active='1') AND
(active_date <= current_timestamp OR active_date IS NULL) AND
(valid_until >= current_timestamp OR valid_until IS NULL)
)
GROUP BY data_item.title
At least on sqlite. Any SQL guru feedback would be greatly appreciated - I feel like I'm programming by accident here. Or, even better, a translation back to Django ORM syntax would be awesome.
In case anyone else has the same problem, here's how I've gotten it to work:
Items = Item.objects.filter(q).distinct().extra(
select={"num_coupons":
"""
SELECT COUNT(data_coupon.id) FROM data_coupon
WHERE (
(data_coupon.is_active='1') AND
(data_coupon.active_date <= current_timestamp OR data_coupon.active_date IS NULL) AND
(data_coupon.valid_until >= current_timestamp OR data_coupon.valid_until IS NULL) AND
(data_coupon.data_id = data_item.id)
)
"""
},).order_by(order_by)
I don't know that I consider this a 'correct' answer - it completely duplicates my custom manager in a possibly non portable way (I'm not sure how portable current_timestamp is), but it does work.
Are you sure your custom manager actually get's called? You set your manager as Model.live, but you query the normal manager at Model.objects.
Have you tried the following?
data = Data.live.filter(q)...