Django select only rows with duplicate field values - sql

suppose we have a model in django defined as follows:
class Literal:
name = models.CharField(...)
...
Name field is not unique, and thus can have duplicate values. I need to accomplish the following task:
Select all rows from the model that have at least one duplicate value of the name field.
I know how to do it using plain SQL (may be not the best solution):
select * from literal where name IN (
select name from literal group by name having count((name)) > 1
);
So, is it possible to select this using django ORM? Or better SQL solution?

Try:
from django.db.models import Count
Literal.objects.values('name')
.annotate(Count('id'))
.order_by()
.filter(id__count__gt=1)
This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:
dupes = Literal.objects.values('name')
.annotate(Count('id'))
.order_by()
.filter(id__count__gt=1)
Literal.objects.filter(name__in=[item['name'] for item in dupes])

This was rejected as an edit. So here it is as a better answer
dups = (
Literal.objects.values('name')
.annotate(count=Count('id'))
.values('name')
.order_by()
.filter(count__gt=1)
)
This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:
Literal.objects.filter(name__in=dups)
The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.

try using aggregation
Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)

In case you use PostgreSQL, you can do something like this:
from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value
duplicate_ids = (Literal.objects.values('name')
.annotate(ids=ArrayAgg('id'))
.annotate(c=Func('ids', Value(1), function='array_length'))
.filter(c__gt=1)
.annotate(ids=Func('ids', function='unnest'))
.values_list('ids', flat=True))
It results in this rather simple SQL query:
SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1

Ok, so for some reason none of the above worked for, it always returned <MultilingualQuerySet []>. I use the following, much easier to understand but not so elegant solution:
dupes = []
uniques = []
dupes_query = MyModel.objects.values_list('field', flat=True)
for dupe in set(dupes_query):
if not dupe in uniques:
uniques.append(dupe)
else:
dupes.append(dupe)
print(set(dupes))

If you want to result only names list but not objects, you can use the following query
repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat='true')

Related

Trying to explode an array with unnest() in Presto and failing due to extra column

I have data from a query that looks like this:
SELECT
model_features
FROM some_db
which returns:
{
"food1": 0.65892159938812,
"food2": 0.90786880254745,
"food3": 0.88357985019684,
"food4": 0.99999821186066,
"food5": 0.99237471818924,
"food6": 0.62127977609634
}
{
"food4": 0.9999965429306,
"text1": 0.82206630706787
}
...
etc.
What I am eventually trying to do is simply get a count of each of the "food1", "food2" features,
but to do so (i think) I need to trim out the unnecessary numeric data. I'm at a loss as to how to do this, as everytime I try to simply unnest
SELECT
t.concepts
FROM some_db
CROSS JOIN UNNEST(model_features) AS t(concepts)
I get this error:
Column alias list has 1 entries but 't' has 2 columns available
Anyone mind pointing me in the right direction?
Solved this for myself: the issue was I needed to avoid dropping the second column of information in order for the query to execute. This may not be the canonical best way to approach, but it worked:
SELECT
t.concepts,
t.probabilities
FROM some_db
CROSS JOIN UNNEST(model_features) AS t(concepts,probabilities)

Passing Optional List argument from Django to filter with in Raw SQL

When using primitive types such as Integer, I can without any problems do a query like this:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(pk)s ISNULL OR id %(pk)s''', params={'pk': 1})
Which would either return row with id = 1 or it would return all rows if pk parameter was equal to None.
However, when trying to use similar approach to pass a list/tuple of IDs, I always produce a SQL syntax error when passing empty/None tuple, e.g. trying:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(ids)s ISNULL OR id IN %(ids)s''', params={'ids': (1,2,3)})
works, but passing () produces SQL syntax error:
psycopg2.ProgrammingError: syntax error at or near ")"
LINE 1: SELECT count(*) FROM account WHERE () ISNULL OR id IN ()
Or if I pass None I get:
django.db.utils.ProgrammingError: syntax error at or near "NULL"
LINE 1: ...LECT count(*) FROM account WHERE NULL ISNULL OR id IN NULL
I tried putting the argument in SQL in () - (%(ids)s) - but that always breaks one or the other condition. I also tried playing around with pg_typeof or casting the argument, but with no results.
Notes:
the actual SQL is much more complex, this one here is a simplification for illustrative purposes
as a last resort - I could alter the SQL in Python based on the argument, but I really wanted to avoid that.)
At first I had an idea of using just 1 argument, but replacing it with a dummy value [-1] and then using it like
cursor.execute(sql='''SELECT ... WHERE -1 = any(%(ids)s) OR id = ANY(%(ids)s)''', params={'ids': ids if ids else [-1]})
but this did a Full table scan for non empty lists, which was unfortunate, so a no go.
Then I thought I could do a little preprocessing in python and send 2 arguments instead of just the single list- the actual list and an empty list boolean indicator. That is
cursor.execute(sql='''SELECT ... WHERE %(empty_ids)s = TRUE OR id = ANY(%(ids)s)''', params={'empty_ids': not ids, 'ids': ids})
Not the most elegant solution, but it performs quite well (Index scan for non empty list, Full table scan for empty list - but that returns the whole table anyway, so it's ok)
And finally I came up with the simplest solution and quite elegant:
cursor.execute(sql='''SELECT ... WHERE '{}' = %(ids)s OR id = ANY(%(ids)s)''', params={'ids': ids})
This one also performs Index scan for non empty lists, so it's quite fast.
From the psycopg2 docs:
Note You can use a Python list as the argument of the IN operator using the PostgreSQL ANY operator.
ids = [10, 20, 30]
cur.execute("SELECT * FROM data WHERE id = ANY(%s);", (ids,))
Furthermore ANY can also work with empty lists, whereas IN () is a SQL syntax error.

Check existence of given text in a table

I have a course code name COMP2221.
I also have a function finder(int) that can find all codes matching a certain pattern.
Like:
select * from finder(20004)
will give:
comp2211
comp2311
comp2411
comp2221
which match the pattern comp2###.
My question is how to express "whether comp2221 is in finder(20004)" in a neat way?
How to express "whether comp2221 is in finder(20004)" in a neat way?
Use an EXISTS expression and put the test into the WHERE clause:
SELECT EXISTS (SELECT FROM finder(20004) AS t(code) WHERE code = 'comp2221');
Returns a single TRUE or FALSE. Never NULL and never more than one row - even if your table function finder() returns duplicates.
Or fork the function finder() to integrate the test directly. Probably faster.

Filter results if not contained in another column

Is there an equivalent way to do the following SQL command with Django's QuerySet API?
select id, childid from mysite_nodetochild
where childid NOT IN (Select "Nodeid" from mysite_nodetochild)
I would prefer not to use raw sql if possible but I can't get a clean working version using Django's Queryset.
Try
nodetochild.objects.exclude(childid=nodetochild.objects.values_list('Nodeid', flat=True)).only('id', 'childid')
This should evaluate to, more or less:
SELECT "mysite_nodetochild"."id", "mysite_nodetochild"."childid" FROM "mysite_nodetochild" WHERE NOT ("mysite_nodetochild"."childid" = (SELECT U0."nodeid" FROM "mysite_nodetochild" U0))
Or, if you need the IN condition:
nodetochild.objects.exclude(childid__in=nodetochild.objects.values_list('Nodeid', flat=True)).only('id', 'childid')
Would evaluate to:
SELECT "mysite_nodetochild"."id", "mysite_nodetochild"."childid" FROM "mysite_nodetochild" WHERE NOT ("mysite_nodetochild"."childid" IN (SELECT U0."nodeid" FROM "mysite_nodetochild" U0))

Linq Lambda expression for below sql in vb.net

I have this existing Sql statement:
Select Count(ordid),isnull(prcsts,'NOT STARTED')
from lwp
where lwp in( Select max(Id) from lwp group by ordid)
group by prcsts
I want to convert to use linq-to-sql, but I'm can't figure out how to handle the group by expression in the sub query. How can I do this?
I am using Entity Framework where I have a method to get the list of lwp. I did only part of it.
Entitydb.lwpmethod
.GetList
.Where(Function(F) F.ID = **Max(Function(O) O.ordid**)
.GroupBy(Function(F) F.prcsts)
.Select(Function(F) New With {.A = F.Count, .B = F.Key})
.ToList
I am unable to write the group by subquery in the max function.
First off, that's not an in, that's an = since max() returns a single element. Also your sql query has lwp in the where clause, you probably typo'd id. With that in mind, what you want is something like:
.Where(row=>row.ID=Entitydb.lwpmethod.GetList()
.Where(r=>r.ordid=row.ordid)
.Max(r=>r.ID))
C# code, but you get the idea.
By the way this looks like it's selecting the last row. Why not just sort by id descendently and take the first element?