How do I remove rows within a group, based on the other columns for a particular ID, such that:
ID   Att  Comp Att.  Inc. Att
aaa  2    0          2
aaa  2    0          2
bbb  3    1          2
bbb  3    1          2
bbb  3    0          2
becomes:
ID   Att  Comp Att.  Inc. Att
aaa  2    0          2
bbb  3    1          2
I need to discard rows that are not just exact duplicates, but that carry the same information in these columns for a given ID.
Use drop_duplicates -- check out the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description which columns you want to consider for duplicates, but you can tell drop_duplicates which column(s) to look at.
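For instance, a minimal sketch against the sample data above (the column names are my assumption from the question's layout):

import pandas as pd

# Sample data from the question; column names are assumed.
df = pd.DataFrame({
    "ID":        ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "Att":       [2, 2, 3, 3, 3],
    "Comp Att.": [0, 0, 1, 1, 0],
    "Inc. Att":  [2, 2, 2, 2, 2],
})

# Compare all columns: this still keeps the third bbb row,
# because its "Comp Att." value differs.
exact = df.drop_duplicates()

# Compare only a subset of columns: keeping the first row per ID
# reproduces the expected output in the question.
per_id = df.drop_duplicates(subset=["ID"])
print(per_id)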
I have encountered a problem like this. There is a table A that I want to aggregate using group by x order by diff (where diff is abs(x - y)), in increasing order.
Both x and y always increase, and when two different x values can be paired with the same y, the x with the smaller value has priority.
x    y  diff
1    2  1
1    4  3
1    6  5
3    2  1
3    4  1
3    6  3
4.5  2  3.5
4.5  4  0.5
4.5  6  1.5
The aggregate function I want is:
take the y in each group that has the smallest difference from x (the smallest diff value).
BUT a y that has been taken cannot be reused (for example, y=2 is taken by the x=1 group, so it cannot be reused in the x=3 group).
Expected result:
x    y  diff
1    2  1
3    4  1
4.5  4  0.5
This seems to be very tricky in plain SQL. I am using PostgreSQL. The real data will be much more complicated and longer than this illustrative example.
If I have understood your question properly:
test=# select * from A;
x | y | diff
---+---+------
1 | 2 | 1
1 | 4 | 3
1 | 6 | 5
3 | 2 | 1
3 | 4 | 1
3 | 6 | 3
5 | 2 | 3
5 | 4 | 1
5 | 6 | 1
(9 rows)
test=# SELECT MIN(x) AS x, y FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y
---+---
1 | 2
3 | 4
5 | 6
(3 rows)
SELECT MIN(x) AS x, y, MIN(diff) FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y | min
---+---+-----
1 | 2 | 1
3 | 4 | 1
5 | 6 | 1
(3 rows)
I added MIN(diff); if it is not needed, it can be removed.
Try it like this (t1 is the table name, d is the diff column):
with cte as (
    select x, y, d
    from t1
    where d = (select min(d) from t1)
    order by x
)
select t1.x, min(t1.y), min(t1.d)
from t1
inner join cte
    on t1.x = cte.x
   and t1.y not in (select y from cte where cte.x < t1.x)
group by t1.x
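As an aside (a sketch, reusing the same t1/d names): without the "each y used only once" restriction, taking the closest y for each x is a one-liner with PostgreSQL's DISTINCT ON:

-- closest y for each x, ignoring the no-reuse rule
select distinct on (x) x, y, d
from t1
order by x, d;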
This is more of a comment.
This problem is essentially a graph problem: finding the closest pairing between two discrete sets (x and y in this case). Technically, this is a maximum matching of a weighted bipartite graph (see [here][1]). I don't think this problem is NP-complete, but it can still be hard to solve, particularly in SQL.
Regardless of whether it is hard in the theoretical sense (NP-complete is what counts as "theoretically hard"), it is hard to do in SQL. One issue is that greedy algorithms don't work: the same y value might be closest to all the x values. Which one to choose? The algorithm has to look further.
The only way that I can think of to do this accurately in SQL is an exhaustive approach: generate all possible combinations and then check for the one that meets your conditions. Finding all possible combinations requires generating N-factorial orderings of the x's (or y's), which in turn requires a lot of computation. My first thought would be to use recursive CTEs for this; however, that would only work on small problems.
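For what it's worth, here is a sketch of just the greedy reading (walk the x values in ascending order, each time taking the closest y not used so far) as a recursive CTE. The table a(x, y, diff) is assumed from the question, and, as noted above, greedy does not solve the general matching problem; on the question's data it would pick y=6 for x=4.5, because y=4 is already taken:

with recursive pick(x, y, diff, used) as (
    (   -- anchor: closest y for the smallest x
        select a.x, a.y, a.diff, array[a.y]
        from a
        where a.x = (select min(x) from a)
        order by a.diff, a.y
        limit 1
    )
    union all
    -- step: closest still-unused y for the next larger x
    select nxt.x, nxt.y, nxt.diff, p.used || nxt.y
    from pick p
    cross join lateral (
        select a.x, a.y, a.diff
        from a
        where a.x = (select min(x) from a where x > p.x)
          and a.y <> all (p.used)
        order by a.diff, a.y
        limit 1
    ) nxt
)
select x, y, diff from pick;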
Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
 foo | bar
-----+-----
 1   | a
 2   | b
 3   | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] AS foo, arr[subscript][2] AS bar
FROM (
    SELECT generate_subscripts(arr, 1) AS subscript, arr
    FROM (SELECT '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] AS arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
 foo | bar
-----+-----
 1   | a
 2   | b
 3   | c
Docs
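Note that json_object interprets a two-dimensional text array as a list of key/value pairs, so this form scales to any number of rows; it does assume exactly two entries per inner array, and the values come back as text.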
I'm not sure what exactly you mean by "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before executing it (so before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] as a_i_j
from (
    select i, generate_subscripts(arr, 2) as j, arr
    from (
        select generate_subscripts(arr, 1) as i, arr
        from (select '{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][] as arr) input
    ) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be quite usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but the size of each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using with recursive one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation imposed by the relational data model: the exact set of columns must be defined at query parse time and cannot be deferred to execution time. So we are forced to store all the indices in one column, using another array.
Here is the query that extracts all elements from an arbitrary multi-dimensional array along with their zero-based indices (stored in another, one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
    -- number the elements in row-major order (k = 0 .. count-1)
    select (row_number() over()) - 1 as k, idx, elem, arr, n
    from (
        select array[]::bigint[] as idx, unnest(arr) as elem,
               arr, array_ndims(arr) as n
        from (select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] as arr) input
    ) plain_indexed
    union all
    -- peel off the innermost remaining dimension (dimension n):
    -- its index is k modulo that dimension's length
    select k / array_length(arr, n)::bigint as k,
           array_prepend(k % array_length(arr, n), idx) as idx,
           elem, arr, n - 1 as n
    from extract_index
    where n != 1
)
select array_prepend(k, idx) as idx, elem
from extract_index
where n = 1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what real practical use one could make of it :)
I am sure my question is very simple for some, but I cannot figure it out, and it is one of those things that is difficult to search for. I hope you can help.
In a table in SQL I have the following (simplified data):
UserID  UserIDX  Number  Date
aaa     bbb      1       21.01.2000
aaa     bbb      5       21.01.2010
ppp     ggg      9       21.01.2009
ppp     ggg      3       15.02.2020
xxx     bbb      99      15.02.2020
And I need a view that gives me the same number of records, but for every combination of UserID and UserIDX there should be only one value in the Number field, namely the highest value found in that combination's rows. The Date field needs to remain unchanged. So the above would be transformed to:
UserID  UserIDX  Number  Date
aaa     bbb      5       21.01.2000
aaa     bbb      5       21.01.2010
ppp     ggg      9       21.01.2009
ppp     ggg      9       15.02.2020
xxx     bbb      99      15.02.2020
So, for all instances of the aaa+bbb combination the unique value in Number should be 5, and for ppp+ggg the unique number is 9.
Thank you very much.
Leo
select a.userid, a.useridx, b.maxnum, a.date
from yourtable a
inner join (
    select userid, useridx, max(number) as maxnum
    from yourtable
    group by userid, useridx
) b
    on a.userid = b.userid
   and a.useridx = b.useridx
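If your database supports window functions, a hedged alternative sketch (yourtable is a placeholder name, as above) computes the per-combination maximum without the self-join:

select userid,
       useridx,
       max(number) over (partition by userid, useridx) as number,
       date
from yourtable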
I have written a query involving joins which finally returns the result below:
Name  ID
AAA   1
BBB   1
BBB   6
CCC   1
CCC   6
DDD   6
EEE   1
But I want the result filtered further, so that for duplicate values in the first column the row with the lesser value is ignored; i.e., the BBB and CCC rows with value 1 should be removed. The result should be:
AAA  1
BBB  6
CCC  6
DDD  6
EEE  1
Note: I have a condition Where (ID = '6' or ID = '1'); is there any way to improve this condition to say "Where ID = 6, or ID = 1 if no 6 is available in that table"?
You will likely want to add:
GROUP BY name
to the bottom of your query and change ID to MAX(ID) in your SELECT statement
It is hard to give a more specific answer without seeing the query you've already written.
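As a minimal sketch of that suggestion (joined_result is a hypothetical stand-in for your existing join query):

select name, max(id) as id
from (
    -- ... your existing query with joins goes here ...
) joined_result
group by name
order by name;

Since MAX(ID) returns 6 for a name whenever a 6 exists and 1 otherwise, this also covers the condition in your note.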