Remove duplicates in a Django query - sql

Is there a simple way to remove duplicates in the following basic query:
email_list = Emails.objects.order_by('email')
I tried using duplicate() but it was not working. What is the exact syntax for doing this query without duplicates?

This query will not give you duplicates - ie, it will give you all the rows in the database, ordered by email.
However, I presume what you mean is that you have duplicate data within your database. Adding distinct() here won't help, because even if you have only one field, you also have an automatic id field - so the combination of id+email is not unique.
Assuming you only need one field, email, de-duplicated, you can do this:
email_list = Email.objects.values_list('email', flat=True).distinct()
However, you should really fix the root problem, and remove the duplicate data from your database.
Example, deleting duplicate Emails by email field:
for email in Email.objects.values_list('email', flat=True).distinct():
    Email.objects.filter(pk__in=Email.objects.filter(email=email).values_list('id', flat=True)[1:]).delete()
Or books by name:
for name in Book.objects.values_list('name', flat=True).distinct():
    Book.objects.filter(pk__in=Book.objects.filter(name=name).values_list('id', flat=True)[1:]).delete()

To check for duplicates, you can do a GROUP BY and HAVING in Django as below, using annotations.
from django.db.models import Count
from app.models import Email
duplicate_emails = Email.objects.values('email').annotate(email_count=Count('email')).filter(email_count__gt=1)
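Now loop through the above data and delete all emails except the first one (which one to keep depends on your requirements). Django does not allow delete() on a sliced queryset, so collect the primary keys first.
for data in duplicate_emails:
    email = data['email']
    # pks of every row with this email except the lowest one
    pks = list(Email.objects.filter(email=email).order_by('pk').values_list('pk', flat=True)[1:])
    Email.objects.filter(pk__in=pks).delete()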
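The loop above runs one query per duplicated email. The underlying idea (keep the lowest id for each email, delete the rest) can also be expressed as a single SQL statement. Below is a minimal sketch using Python's built-in sqlite3 module rather than the Django ORM; the table name `email` and its columns are assumptions made for illustration.

```python
import sqlite3


def delete_duplicate_emails(conn):
    """Delete every row whose email already appears on a row with a lower id."""
    cur = conn.execute(
        """
        DELETE FROM email
        WHERE id NOT IN (SELECT MIN(id) FROM email GROUP BY email)
        """
    )
    return cur.rowcount  # number of rows removed


# Hypothetical sample data: a@example.com appears three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO email (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("a@example.com",)],
)

deleted = delete_duplicate_emails(conn)
remaining = conn.execute("SELECT id, email FROM email ORDER BY id").fetchall()
```

Some databases (MySQL, for example) do not allow a DELETE to select from the table it is deleting from, so this exact statement may need adapting.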

You can chain .distinct() on the end of your queryset to filter duplicates. Check out: http://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct

You may be able to use the distinct() function, depending on your model. If you only want to retrieve a single field from the model, you could do something like:
email_list = Emails.objects.values_list('email').order_by('email').distinct()
which should give you an ordered list of emails.

You can also use set()
email_list = set(Emails.objects.values_list('email', flat=True))
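Note that set() de-duplicates in Python after all rows have been fetched, and it discards ordering. A small sketch of two ways to get an ordered result back, using a plain list in place of the queryset (the sample addresses are made up):

```python
def unique_in_order(values):
    """Return values with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(values))


# Stand-in for Emails.objects.values_list('email', flat=True)
emails = ["b@example.com", "a@example.com", "b@example.com"]

unique_sorted = sorted(set(emails))          # alphabetical order
unique_first_seen = unique_in_order(emails)  # original order of first appearance
```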

Use a self-referencing subquery with annotate():
from django.db.models import Subquery, OuterRef
email_list = Emails.objects.filter(
    pk__in=Emails.objects.values('email').distinct().annotate(
        pk=Subquery(
            Emails.objects.filter(
                email=OuterRef("email")
            )
            .order_by("pk")
            .values("pk")[:1])
    )
    .values_list("pk", flat=True)
)
This queryset generates the following query:
SELECT `email`.`id`,
       `email`.`title`,
       `email`.`body`,
       ...
       ...
FROM `email`
WHERE `email`.`id` IN (
    SELECT DISTINCT (
        SELECT U0.`id`
        FROM `email` U0
        WHERE U0.`email` = V0.`email`
        ORDER BY U0.`id` ASC
        LIMIT 1
    ) AS `pk`
    FROM `email` V0
)
Cheat sheet:
from django.db.models import Subquery, OuterRef
group_by_duplicate_col_queryset = Models.objects.filter(
    pk__in=Models.objects.values('duplicate_col').distinct().annotate(
        pk=Subquery(
            Models.objects.filter(
                duplicate_col=OuterRef('duplicate_col')
            )
            .order_by("pk")
            .values("pk")[:1])
    )
    .values_list("pk", flat=True)
)

I used the following to actually remove the duplicate entries from the database; hopefully this helps someone else.
adds = Address.objects.all()
# distinct() with field names is only supported on PostgreSQL
d = adds.distinct('latitude', 'longitude')
for address in adds:
    if address not in d:
        address.delete()

You can use this raw query: your_model.objects.raw("select * from appname_Your_model group by column_name")

Related

SAP Query IMRG Measure documents

I'm learning SAP queries.
I want to get all the measurement documents for a piece of equipment.
To do that, I use 3 tables:
EQUI, IMPTT, IMRG
The query works, but it returns all documents, whereas I only want the most recent one by date, and I can't get that to work. I'm sure I have to add a custom field, but none of the ones I have tried works.
For example, my last code :
select min( IMRG~INVTS ) IMRG~RECDV
from IMRG inner join IMPTT on
IMRG~POINT = IMPTT~POINT into (INVTS, IMRGVAL)
where IMRG~POINT = IMPTT-POINT AND
IMPTT~MPOBJ = EQUI-OBJNR
and IMRG~CANCL = '' group by IMRG~MDOCM IMRG~RECDV.
ENDSELECT.
Thanks for your help.
You need the date from IMRG, and because INVTS is an inverted timestamp, the MIN() of it gives the most recent document, so that part looks correct.
However, your GROUP BY looks wrong. You should be grouping on the IMPTT~POINT field so that you get one record per measurement point. Note that one measurement point (IMPTT) can have many measurement documents (IMRG), so something like this:
SELECT EQUI-OBJNR, IMPTT~POINT, MIN(IMRG~IMRC_INVTS)
...
GROUP BY EQUI-OBJNR, IMPTT~POINT
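To illustrate the grouping logic outside of ABAP, here is a rough sketch in Python with sqlite3. It assumes INVTS is stored so that smaller values mean more recent documents; the table layout is a simplified stand-in for IMPTT and IMRG, and all sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE imptt (point TEXT PRIMARY KEY, mpobj TEXT);
    CREATE TABLE imrg (mdocm TEXT PRIMARY KEY, point TEXT,
                       invts INTEGER, recdv REAL, cancl TEXT);
    """
)
conn.executemany("INSERT INTO imptt VALUES (?, ?)", [("P1", "EQ1"), ("P2", "EQ1")])
conn.executemany(
    "INSERT INTO imrg VALUES (?, ?, ?, ?, ?)",
    [
        ("D1", "P1", 300, 10.0, ""),
        ("D2", "P1", 100, 12.5, ""),   # smallest invts for P1: most recent
        ("D3", "P2", 200, 7.0, ""),
        ("D4", "P2", 50, 9.0, "X"),    # cancelled document, ignored
    ],
)

# One row per measurement point: find MIN(invts) per point, then join back
# to imrg to pick up the reading that belongs to that timestamp.
LATEST_PER_POINT = """
SELECT g.point, g.recdv
FROM imrg AS g
JOIN (
    SELECT r.point AS point, MIN(r.invts) AS min_invts
    FROM imrg AS r
    JOIN imptt AS p ON p.point = r.point
    WHERE p.mpobj = ? AND r.cancl = ''
    GROUP BY r.point
) AS latest ON latest.point = g.point AND latest.min_invts = g.invts
WHERE g.cancl = ''
ORDER BY g.point
"""

latest = conn.execute(LATEST_PER_POINT, ("EQ1",)).fetchall()
```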
If I understood you correctly, you are trying to get the most recent measurement of the equipment regardless of measurement point. You can try this query, which is not very pretty, but it works.
SELECT objnr COUNT(*) MIN( invts )
  FROM equi AS eq
  JOIN imptt AS tt
    ON tt~mpobj = eq~objnr
  JOIN imrg AS ig
    ON ig~point = tt~point
  INTO (wa_objnr, count, wa_invts)
  WHERE ig~cancl = ''
  GROUP BY objnr.
  SELECT SINGLE recdv FROM imrg JOIN imptt ON imptt~point = imrg~point INTO wa_imrgval WHERE invts = wa_invts AND imptt~mpobj = wa_objnr.
  WRITE: / wa_objnr, count, wa_invts, wa_imrgval.
ENDSELECT.

Selecting rows from Parent Table only if multiple rows in Child Table match

I'm building a program that learns tic-tac-toe by saving information in a database.
I have two tables, Games(ID,Winner) and Turns(ID,Turn,GameID,Place,Shape).
I want to find a parent row by conditions on multiple child rows.
For Example:
SELECT GameID FROM Turns WHERE
GameID IN (WHEN Turn = 1 THEN Place = 1) AND GameID IN (WHEN Turn = 2 THEN Place = 4);
Is something like this possible?
I'm using MS Access.
Turn - game turn
GameID - game ID
Place - place on the matrix (1 = top right, 9 = bottom left)
Shape - X or circle
Thanks in advance
This very simple query will do the trick in a single scan, and doesn't require you to violate First Normal Form by storing multiple values in a string (shudder).
SELECT T.GameID
FROM Turns AS T
WHERE
(T.Turn = 1 AND T.Place = 1)
OR (T.Turn = 2 AND T.Place = 4)
GROUP BY T.GameID
HAVING Count(*) = 2;
There is no need to join to determine this information, as is suggested by other answers.
Please use proper database design principles in your database, and don't violate First Normal Form by storing multiple values together in a single string!
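As a rough illustration of the GROUP BY / HAVING approach, the sketch below runs the same kind of query from Python against an in-memory SQLite database instead of Access. The helper name `games_matching` and the sample rows are hypothetical, and it assumes each game has at most one row per turn and that the list of moves is non-empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Turns (ID INTEGER PRIMARY KEY, Turn INTEGER, "
    "GameID INTEGER, Place INTEGER, Shape TEXT)"
)
conn.executemany(
    "INSERT INTO Turns (Turn, GameID, Place, Shape) VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "X"), (2, 1, 4, "O"),  # game 1 matches both moves
        (1, 2, 1, "X"), (2, 2, 5, "O"),  # game 2 matches only the first
        (1, 3, 3, "X"), (2, 3, 4, "O"),  # game 3 matches only the second
    ],
)


def games_matching(conn, moves):
    """Return GameIDs that contain every (turn, place) pair in moves."""
    condition = " OR ".join("(Turn = ? AND Place = ?)" for _ in moves)
    params = [value for move in moves for value in move]
    sql = (
        f"SELECT GameID FROM Turns WHERE {condition} "
        "GROUP BY GameID HAVING COUNT(*) = ? ORDER BY GameID"
    )
    # A game qualifies only if it matched as many rows as there are moves.
    return [row[0] for row in conn.execute(sql, params + [len(moves)])]


matches = games_matching(conn, [(1, 1), (2, 4)])
```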
The general solution to your problem can be accomplished by using a sub-query that contains a self-join between two instances of the Turns table:
SELECT * FROM Games
WHERE ID IN
(
    SELECT Turns1.GameID
    FROM Turns AS Turns1
    INNER JOIN Turns AS Turns2
        ON Turns1.GameID = Turns2.GameID
    WHERE (
        (Turns1.Turn=1 AND Turns1.Place = 1)
        AND
        (Turns2.Turn=2 AND Turns2.Place = 4))
);
The Self Join between Turns (aliased Turns1 and Turns2) is key, because if you just try to apply both sets of conditions at once like this:
WHERE (
(Turns.Turn=1 AND Turns.Place = 1)
AND
(Turns.Turn=2 AND Turns.Place = 4))
you will never get any rows back. This is because in your table there is no way for an individual row to satisfy both conditions at the same time.
My experience using Access is that to do a complex query like this you have to use the SQL View and type the query in on your own, rather than use the Query Designer. It may be possible to do in the Designer, but it's always been far easier for me to write the code myself.
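The difference between the single-row condition and the self-join can be checked with a small experiment. This is a sketch using Python's sqlite3 module rather than Access, with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Turns (Turn INTEGER, GameID INTEGER, Place INTEGER)")
conn.executemany(
    "INSERT INTO Turns VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 4), (1, 2, 1), (2, 2, 5)],
)

# Both conditions applied to the same row: no row can have Turn = 1 and Turn = 2.
single_row = conn.execute(
    "SELECT GameID FROM Turns "
    "WHERE (Turn = 1 AND Place = 1) AND (Turn = 2 AND Place = 4)"
).fetchall()

# Self-join: t1 and t2 are two different rows of the same game.
self_join = conn.execute(
    """
    SELECT t1.GameID
    FROM Turns AS t1
    JOIN Turns AS t2 ON t1.GameID = t2.GameID
    WHERE t1.Turn = 1 AND t1.Place = 1
      AND t2.Turn = 2 AND t2.Place = 4
    """
).fetchall()
```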
select GameID from Games g where exists (select * from turns t where
t.gameid = g.gameId and ((turn =1 and place = 1) or (turn =2 and place =5)))
This will select all the games that have at least one turn matching the corresponding criteria.
More info on EXISTS:
http://www.techonthenet.com/sql/exists.php
I bypassed this problem by adding a column that holds the turns as a string (for example, "154728") and searching on it instead. I think this solution is also less demanding on the database.

SQL IN and AND clause output

I have written one small query like below. It is giving me output.
select user_id
from table tf
where tf.function_id in ('1001051','1001060','1001061')
But when I run the query below, it returns no rows, even though I have verified manually that there are user_ids for which all three function_ids are present.
select user_id
from table tf
where tf.function_id='1001051'
and
tf.function_id='1001060'
and
tf.function_id='1001061'
It looks very simple to use the AND clause, but I am not getting the desired output. Am I doing something wrong?
Thanks in advance
Is this what you want to do?
select tf.user_id
from table tf
where tf.function_id in ('1001051', '1001060', '1001061')
group by tf.user_id
having count(distinct tf.function_id) = 3;
This returns users that have all three functions.
EDIT:
This is the query in your comment:
select tu.dealer_id, tu.usr_alias, tf.function_nm
from t_usr tu, t_usr_function tuf, t_function tf
where tu.usr_id = tuf.usr_id and tuf.function_id = tf.function_id and
tf.function_id = '1001051' and tf.function_id = '1001060' and tf.function_id = '1001061' ;
First, you should learn proper join syntax. Simple rule: Never use commas in the from clause.
I think the query you want is:
select tu.dealer_id, tu.usr_alias
from t_usr tu join
t_usr_function tuf
on tu.usr_id = tuf.usr_id
where tuf.function_id in ('1001051', '1001060', '1001061')
group by tu.dealer_id, tu.usr_alias
having count(distinct tuf.function_id) = 3;
This doesn't give you the function name. I'm not sure why you need such detail if all three functions are there for each "user" (or at least dealer/user alias combination). And, the original question doesn't request this level of detail.
Using AND means that a single row must satisfy all of the conditions.
In your case, you need to return a row when function_id='1001051' OR function_id='1001060' OR function_id='1001061'.
So in brief, you need to replace AND with OR.
select user_id from table tf
where tf.function_id='1001051' OR tf.function_id='1001060' OR tf.function_id='1001061'
That's what IN does: it matches a row if the value equals any of them.
As I pointed out in the comment, AND is not the right operator since all three conditions together will not be met. Use OR instead,
select user_id from table tf
where tf.function_id='1001051' OR tf.function_id='1001060' OR tf.function_id='1001061'
You're asking for the value to be three different values at the same time. A better use would be to use OR instead of AND:
select user_id from table tf
where tf.function_id='1001051' or tf.function_id='1001060' or tf.function_id='1001061'
If all of these things are true:
tf.function_id='1001051'
tf.function_id='1001060'
tf.function_id='1001061'
Then simple algebra tells us this must also be true:
'1001051'='1001060'='1001061'
Since that clearly can't ever be true, your SQL statement's where clause will always resolve to false.
What you want to say is that any of those conditions is true (which is equivalent to in), which means you need to use or:
SELECT user_id
FROM table tf
WHERE tf.function_id = '1001051'
OR tf.function_id = '1001060'
OR tf.function_id = '1001061'
The where clause applies to each row returned by the query. In order to gather data across rows, you either need to join the table to itself enough times to create a single row that satisfies the condition you're looking for or use aggregate functions to consolidate several rows into a single row.
Self-join solution:
SELECT tf1.user_id
FROM table tf1
JOIN table tf2 ON tf1.user_id = tf2.user_id
JOIN table tf3 ON tf1.user_id = tf3.user_id
WHERE tf1.function_id = '1001051'
AND tf2.function_id = '1001060'
AND tf3.function_id = '1001061'
Aggregate solution:
SELECT user_id
FROM table tf
WHERE tf.function_id IN ('1001051', '1001060', '1001061')
GROUP BY user_id
HAVING COUNT (DISTINCT tf.function_id) = 3
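The three behaviours discussed in this thread (AND on one row, IN/OR, and the aggregate form) can be compared side by side. Below is a minimal sketch with Python's sqlite3 module; the table name `user_function` is a stand-in for the question's unnamed table, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_function (user_id INTEGER, function_id TEXT)")
conn.executemany(
    "INSERT INTO user_function VALUES (?, ?)",
    [
        (1, "1001051"), (1, "1001060"), (1, "1001061"),  # user 1 has all three
        (2, "1001051"), (2, "1001060"),                  # user 2 has two
        (3, "1001061"),                                  # user 3 has one
    ],
)
WANTED = ("1001051", "1001060", "1001061")

# AND: one row would have to equal three different values at once.
with_and = conn.execute(
    "SELECT user_id FROM user_function "
    "WHERE function_id = ? AND function_id = ? AND function_id = ?",
    WANTED,
).fetchall()

# IN (equivalent to OR): users holding at least one of the functions.
with_in = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_id FROM user_function "
        "WHERE function_id IN (?, ?, ?) ORDER BY user_id",
        WANTED,
    )
]

# IN plus GROUP BY / HAVING: users holding all three functions.
with_having = [
    row[0]
    for row in conn.execute(
        "SELECT user_id FROM user_function "
        "WHERE function_id IN (?, ?, ?) "
        "GROUP BY user_id HAVING COUNT(DISTINCT function_id) = 3",
        WANTED,
    )
]
```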
Try this, using SQL IN:
select function_id, user_id from table tf
where tf.function_id in ('1001051','1001060','1001061')

Include column names in Grails SQL query results

I have a query that looks like this...
def data = session.createSQLQuery("""SELECT
a.id AS media_id,
a.uuid,
a.date_created,
a.last_updated,
a.device_date_time,
a.ex_time,
a.ex_source,
a.sequence,
a.time_zone,
a.time_zone_offset,
a.media_type,
a.size_in_bytes,
a.orientation,
a.width,
a.height,
a.duration,
b.id AS app_user_record_id,
b.app_user_identifier,
b.application_id,
b.date_created AS app_user_record_date_created,
b.last_updated AS app_user_record_last_updated,
b.instance_id,
b.user_uuid
FROM media a, app_user_record b
WHERE a.uuid = b.user_uuid
LIMIT :firstResult, :maxResults """)
.setInteger("firstResult", cmd.firstResult)
.setInteger("maxResults", cmd.maxResults)
.list()
The problem is that the .list method returns an array that has no column names. Does anybody know of a way to include the column names in the results of a Grails native SQL query? I could obviously transform the results into a map and hard-code the column names myself.
Use setResultTransformer(Criteria.ALIAS_TO_ENTITY_MAP) for the query. Each result row is then returned as a map keyed by column alias.
import org.hibernate.Criteria
def query = """Your query"""
def data = session.createSQLQuery(query)
.setInteger("firstResult", cmd.firstResult)
.setInteger("maxResults", cmd.maxResults)
.setResultTransformer(Criteria.ALIAS_TO_ENTITY_MAP)
.list()
data.each{println it.UUID}
I tested it and realized that I previously fetched each field by column number instead of by column name.
NOTE
Keys are upper case, so ex_source would be EX_SOURCE in the result map.
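For comparison only, the same idea (each row returned as a map keyed by column name or alias) looks like this in Python with the built-in sqlite3 module. This is not Grails or Hibernate code; the trimmed-down `media` table and its values are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3.Row lets each result row be addressed by column name.
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, uuid TEXT, ex_source TEXT)")
conn.execute("INSERT INTO media (uuid, ex_source) VALUES ('u-1', 'camera')")

# Convert each Row to a plain dict keyed by column alias.
rows = [
    dict(row)
    for row in conn.execute("SELECT id AS media_id, uuid, ex_source FROM media")
]
```

Here the keys keep the case used in the query, unlike the upper-case keys described in the note above.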

Django 1.0/1.1 rewrite of self join

Is there a way to rewrite this query using the Django QuerySet object:
SELECT b.created_on, SUM(a.vote)
FROM votes a JOIN votes b ON a.created_on <= b.created_on
WHERE a.object_id = 1
GROUP BY 1
Here votes is a table, object_id is an int that occurs multiple times (a foreign key, although that doesn't matter here), and created_on is a datetime.
FWIW, this query allows one to get a score at any time in the past by summing up all previous votes on that object_id.
I'm pretty sure that query cannot be created with the Django ORM. The new Django aggregation code is pretty flexible, but I don't think it can do exactly what you want.
Are you sure that query works? You seem to be missing a check that b.object_id is 1.
This code should work, but it's more than one line and not that efficient.
from django.db.models import Sum
v_list = votes.objects.filter(object__id=1)
for v in v_list:
    v.previous_score = votes.objects.filter(object__id=1, created_on__lte=v.created_on).aggregate(Sum('vote'))["vote__sum"]
Aggregation is only available in trunk, so you might need to update your django install before you can do this.
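Since the loop above issues one aggregate query per vote, another option is to fetch the votes once and compute the running total in Python. A minimal sketch, assuming the votes are available as (created_on, vote) pairs and that timestamps are distinct (with tied timestamps the SQL version would report the same total for both rows, while this running sum would not); the function name is hypothetical.

```python
from itertools import accumulate


def scores_over_time(votes):
    """Return [(created_on, running_score)] in chronological order.

    votes is an iterable of (created_on, vote) pairs.
    """
    ordered = sorted(votes, key=lambda pair: pair[0])
    totals = accumulate(vote for _, vote in ordered)
    return [(created_on, total) for (created_on, _), total in zip(ordered, totals)]


# Integers stand in for datetimes to keep the example short.
history = scores_over_time([(3, 1), (1, 1), (2, -1)])
```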
Aggregation isn't the issue; the problem here is that Django's ORM simply doesn't do joins on anything that isn't a ForeignKey, AFAIK.
This is what I'm using now. Ironically, the sql is broken but this is the gist of it:
def get_score_over_time(self, obj):
    """
    Get a dictionary containing the score and number of votes
    at all times historically
    """
    ctype = ContentType.objects.get_for_model(obj)
    try:
        query = """SELECT b.created_on, SUM(a.vote)
            FROM %s a JOIN %s b
            ON a.created_on <= b.created_on
            WHERE a.object_id = %s
            AND a.content_type_id = %s
            GROUP BY 1""" % (
            connection.ops.quote_name(self.model._meta.db_table),
            connection.ops.quote_name(self.model._meta.db_table),
            obj.id,  # the object whose score history is requested
            ctype.id,
        )
        cursor = connection.cursor()
        cursor.execute(query)
        result_list = []
        for row in cursor.fetchall():
            result_list.append(row)
    except models.ObjectDoesNotExist:
        result_list = None
    return result_list