Creating a simple join on a table with django - sql

I have a data model in django called MainData which was created on top of a table called "my_data".
I want to perfom a simple calculation on top of this table via django API. The query is as follow:
select main.id,
sum(main.num - secondary.num) as result
from (select * from my_data
where some_value > 10) as main,
my_data as secondary
where
main.id != secondary.id and
main.a > secondary.a
group by main.id
The MainData model has all the relevant fields (num, id, a and some_value).
How can I implement this query via django? (I'm trying to avoid using direct SQL)
Thanks for the help

Try this question, it seems similar:
Django equivalent for count and group by

That's a tough one. Django doesn't really provide an easy way to join a table with itself without resorting to SQL. Here's the best I can come up with without using SQL. The results returned in the results variable are equivalent to those you would get with the SQL query you provided; however, the execution is somewhat less efficient.
from django.db.models import Sum, Count
from your_app.models import MyData
results = []
for m in MyData.objects.filter(some_value__gt=10):
result = MyData.objects.filter(a__lt=m.a).aggregate(
count=Count('num'), sum=Sum('num'))
if result['count']:
results.append({
'id': m.id,
'result': result['count'] * m.num - result['sum']})
print results
You should also have a look at the extra() and raw() methods Django provides for QuerySets, which would allow you to make complex queries such as yours efficiently while still keeping things Djangoesque.

Related

Can I leverage BigQuery (BQ) partition via a join?

I am a Tableau designer, and we are building some views that get filtered by category a lot. Because of this, we tried to create a category_id that would serve as partition. The problem seems to be that if I filter data category only, the partition doesn't get used and the total table GB and cost gets hit.
Our team is trying to see if this could be minimized by using a nested query as follows:
SELECT *
FROM table a
INNER JOIN (
SELECT DISTINCT category_id, category
FROM table
) b
ON a.category_id = b.category_id
WHERE b.category = 'Category A'
The idea is that we could show the user b.category, they select it in Tableau and then the inner join would kick off the partition and limit the bytes returned. When I try this in the BQ interface, the estimated returned size comes back the same.
You'll need to filter on the partitioned field before you make the inner join.
I haven't used tableau before so don't know if this is possible but just an idea. You could create a parameter which is set by the chosen category in tableau, which could be referenced in the where statement of the partitioned table?
SELECT *
FROM table a
INNER JOIN (
SELECT DISTINCT category_id, category
FROM table
Where category = #chosen_category
) b
ON a.category_id = b.category_id;
When you say that your attempts to filter only by category, the partition isn't used, have you actually tested querying the table from the console to test whether the partition is being used or not. If it isn't then you need to look at the partition, but if it is, then you would need to take another look at your Tableau query.
VizQL (Viz query language) is Tableau's sql parser that converts your Tableau viz into SQL for execution, so whilst you cannot really modify the outgoing SQL, you can at least capture it and test which enables you to identify poor performing calculations and/or vizzes, as well as optimise the backend for the queries that Tableau will send.
I've written an article about this here: https://datawonders.atlassian.net/wiki/spaces/TABLEAU/pages/1290600449/Let+s+Talk+Errors+Tuning+6+minute+read
The thing about Tableau is that it treats the source as a derived table, with all filters being placed at the upper-level of the query immediately before the stream,
so your query:
Select *
From table a
Join (
Select Distinct Category_ID, Category
From table
)b On a.category_id = b.category_id
Where b.category = 'Category A'
Will actually look like this (assuming you just select everything):
Select a1.*
From (
Select *
From table a
Join (
Select Distinct Category_ID, Category
From table
)b On a.category_id = b.category_id
)a1
Where a1.category = <your selected category>
So you can see from here that being two-levels deep, your Category table just won't be hit, instead everything shall be read into the spool, the join taking place in tempdb, and only the complete set is filtered immediately before streaming to Tableau.
Bad, underperforming sql it most certainly is.
And this is where the relational method of v2020.2 comes into play, as this has been designed to treat each table as a separate exclusive entity, joins are only made at execution time, so you could build a view that uses data from table a where you are using table b to provide the filtering.
As an alternative, and my preferred overall method is to switch entirely to Custom SQL, utilising this with parameters, as this will enable you to craft and test your own sql to create your own high-performance, low-loading query, but as parameters are parsed before the query is executed, you can place the filtering deep down in the query without the need for a secondary look-up table or filtered derived statement - a select distinct as you are currently using it is still going to produce a large plan, as unless the category column is indexed, the engine shall still need to read every record from the table.
So using parameters, your new query will look something like:
Select a1.*
From (
Select *
From table a
Join lookup_table b On On a.category_id = b.category_id
And b.category = <parameters.pCategory>
)a1
(I've placed the filter condition directly onto the join as this can improve performance in some circumstances, though this actually shouldn't make much difference)
And when used in conjunction with the Set parameter action, you can now use parameters as in/out updateable variables which shall update as the user interacts directly with the viz, instead of the user needing to manually update as they go. If you haven't used these before, I wrote an article about it here: https://community.tableau.com/s/news/a0A4T00000313S0UAI/psst-have-you-had-a-go-with-variables-in-tableau-yet
Steve

How can I express joining to a grouped subquery using NHibernate?

I'm trying to express a SQL query using NHibernate's Criteria API, and I'm running into difficulty because I'm thinking in a database-centric way while NHibernate is object-centric.
SQL (works great):
select outerT.id, outerT.col1, outerT.col2, outerT.col3
from tbl outerT
inner join
(select max(innerT.id)
from tbl innerT
group by innerT.col1) grpT
on outerT.id = grpT.id
Essentially, this is a self-join of a table against a subset of itself. I suppose I could try turning the self-join into a restriction:
select outerT.id, outerT.col1, outerT.col2, outerT.col3
from tbl outerT
where outerT.id in (select max(innerT.id) from tbl innerT group by innerT.col1)
But I'm not sure how to express that using NHibernate either; I'm fighting with the DetachedCriteria's ProjectionList and wanting to select only max(id) while grouping by col1.
Thanks so much for your suggestions!
I don't know if I should post this as a new answer or to add it as a comment on the original question, but I think I've resolved a similar problem in this thread:
Selecting on Sub Queries in NHibernate with Critieria API
AFAIK you cannot join to subqueries at all in NHibernate but you can re-organise the query to use either an EXISTS or IN clause to replicate the same functionality.
I realise the question asks for this to be done using the Criteria API but I thought I would post a HQL version which may give someone else some ideas.
var results = session.CreateQuery("from Product p where p.Id in (
select max(p2.id)
from Product p2
group by p2.col1
)")
I also found this JIRA issue surrounding the Criteria API and not including group by columns in the select. Currently it looks like what you want cannot be achieved using the Criteria API at all.
Group By Property without adding it to the select clause
UPDATE
Using the example from Monkey Coders post looks like you can do this:
var subquery = DetachedCriteria.For<Product>("p")
.SetProjection(Projections.ProjectionList()
.Add(Projections.GroupProperty("p.Col1"))
.Add(Restrictions.EqProperty("p2.Id", Projections.Max("p.Id"));
var query = DetachedCriteria.For<Product>("p2")
.Add(Subqueries.Exists(subquery));
Which would produce the following SQL
select *
from Product p2
where exists (
select p.col1
from Product p
group by p.col1
having p2.Id=max(p.Id)
)

SQL queries with views and subqueries

select nid, avg, std from sView1
where sid = 4891
and nid in (select distinct nid from tblref where rid = 799)
and oidin (select distinct oid from tblref where rid = 799)
and anscount > 3
This is a query I'm currently trying to run. And running it like this takes about 3-4 seconds. However, if I replace the "4891" value with a subquery saying (select distinct sid from tblref where rid = 799) the procedure just hangs, even though the subquery only returns one sid.
The query is supposed to return a dataset with averages (avg) and standard deviations (std) over a resultset which is calculated through nested views in sView1. This dataset is then run through another view to get some top-level averages and stdevs.
The averages may need to include more than 1 sid (sid identifies a dataset).
It's difficult describing it more without revealing codebase and codestructure that shouldn't be revealed ;)
Can anyone suggest why the query hangs when trying to use the subquery? (The code is rebuilt from originally using nested cursors, since I have been told that cursors are the work of the devil, and nested cursors may make me sterile)
Try this. Exists returns as soon as it finds a matching condition, select distinct will require going through the dataset and optionally sorting it to remove the duplicates.
SELECT nid,avg,std from sView1 AS SV
WHERE EXISTS (SELECT * FROM TblRef AS TR WHERE sv.sid = Tr.sid AND Sv.nid = tr.nid AND sv.oid = tr.oid AND tr.rid = 799)
AND ansCount>3
Also, it is pretty difficult to provide a meaningful answer without access to query plans and table structures. So DDL and sample data will definitely help.

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (eg, a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use an normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
Explain analyze and time query to see if there is a problem.
Aso you could try expressing the query as a union
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
To find out if there is a performance problem measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.

How to implement paging in NHibernate with a left join query

I have an NHibernate query that looks like this:
var query = Session.CreateQuery(#"
select o
from Order o
left join o.Products p
where
(o.CompanyId = :companyId) AND
(p.Status = :processing)
order by o.UpdatedOn desc")
.SetParameter("companyId", companyId)
.SetParameter("processing", Status.Processing)
.SetResultTransformer(Transformers.DistinctRootEntity);
var data = query.List<Order>();
I want to implement paging for this query, so I only return x rows instead of the entire result set.
I know about SetMaxResults() and SetFirstResult(), but because of the left join and DistinctRootEntity, that could return less than x Orders.
I tried "select distinct o" as well, but the sql that is generated for that (using the sqlserver 2008 dialect) seems to ignore the distinct for pages after the first one (I think this is the problem).
What is the best way to accomplish this?
In these cases, it's best to do it in two queries instead of one:
Load a page of orders, without joins
Load those orders with their products, using the in operator
There's a slightly more complex example at http://ayende.com/Blog/archive/2010/01/16/eagerly-loading-entity-associations-efficiently-with-nhibernate.aspx
Use SetResultTransformer(Transformers.AliasToBean()) and get the data that is not the entity.
The other solution is that you change the query.
As I see you're returning Orders that have products which are processing.
So you could use exists statement. Check nhibernate manual at 13.11. Subqueries.