If there are two rows for the same ID, I need to check Col2, keep the row whose value is N or Q, and skip the row with U. If an ID has only a single record with Col2 = U, it should be kept as is. So for IDs 123 and 555, the output rows are the ones with Col2 = N and Q, respectively.
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
000 AAA N true
222 CCC U false
555 FIC Q false
555 VAN U true
Expected output:
ID Col1 Col2 Col3
123 AAA N true
000 AAA N true
222 CCC U false
555 FIC Q false
How can I do this in pandas?
In SQL, I tried HAVING COUNT(*) > 1 and then picked these columns.
You can use this code:
df.drop_duplicates('ID')
The code above always keeps the first record for each ID. You can keep the last record instead of the first by changing the keep argument:
df.drop_duplicates(subset='ID', keep="first")
df.drop_duplicates(subset='ID', keep="last")
Alternatively, sort by a column first and then call drop_duplicates. Depending on whether the sort is ascending or descending, keep="first" then picks the row with the minimum or maximum value of that column for each ID.
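The sort-then-drop_duplicates idea can be sketched on the question's data as follows. This is only an illustration: the helper column is_u is a hypothetical name introduced here, and the sketch assumes that 'U' is the only value that should lose against other rows with the same ID.

```python
import pandas as pd

# Sample data copied from the question (IDs kept as strings to preserve "000").
df = pd.DataFrame({
    "ID": ["123", "123", "000", "222", "555", "555"],
    "Col1": ["AAA", "BBB", "AAA", "CCC", "FIC", "VAN"],
    "Col2": ["N", "U", "N", "U", "Q", "U"],
    "Col3": [True, True, True, False, False, True],
})

result = (
    # is_u is a hypothetical helper column: False sorts before True,
    # so the 'U' rows end up after all other rows.
    df.assign(is_u=df["Col2"].eq("U"))
      .sort_values("is_u", kind="mergesort")       # stable sort keeps the original order within ties
      .drop_duplicates(subset="ID", keep="first")  # the first row per ID is a non-'U' row if one exists
      .drop(columns="is_u")
      .sort_index()                                # restore the original row order
)
print(result)
```

An ID whose only row has Col2 = 'U' (222 here) survives, because its single row is still the first one for that ID.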
One simple approach is to sort your dataframe by Col2 ensuring that 'U' will end up last. There are several possibilities:
pandas.Categorical
This sets an ordered categorical type on Col2
categories = np.append(np.setdiff1d(df['Col2'], ['U']), ['U'])
df['Col2'] = pd.Categorical(df['Col2'], categories=categories, ordered=True)
df.sort_values(by='Col2').groupby('ID').first()
Split dataframe
This splits the dataframe in two based on the values of Col2 (not-U and U), and concatenates the two parts to ensure the U are at the end
pd.concat([df.query('Col2 != "U"'), df.query('Col2 == "U"')]).groupby('ID').first()
Custom sort order
This manually defines the sorting order from a list
custom_order = ['N', 'Q', 'Z', 'U']
custom_order_dict = dict(zip(custom_order, range(len(custom_order))))
df.sort_values(by='Col2', key=lambda x: x.map(custom_order_dict)).groupby('ID').first()
input
ID Col1 Col2 Col3
0 123 AAA N True
1 123 BBB U True
2 0 AAA N True
3 222 CCC U False
4 555 FIC Q False
5 555 VAN U True
6 777 UUU U False
7 777 ZZZ Z True
8 999 UUU U False
9 999 NNN N True
output
Col1 Col2 Col3
ID
000 AAA N True
123 AAA N True
222 CCC U False
555 FIC Q False
777 ZZZ Z True
999 NNN N True
I tried a solution with multiple steps. This might not be the best way to do it, but I did not find any other solution.
First step:
Separate out the rows whose ID occurs more than once
df_multiple_record=pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
Output:
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
555 FIC Q false
555 VAN U true
Second Step:
Drop the records with Col2='U'
df_drop_U=df_multiple_record[df_multiple_record['Col2']!='U']
output:
ID Col1 Col2 Col3
123 AAA N true
555 FIC Q false
Third step:
Drop every duplicated ID from the main extract to get the records whose ID occurs only once
df_single_record=df.drop_duplicates(subset=['ID'],keep=False)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
Fourth step:
Concatenate the single-record df with the df from which the U rows were dropped
df_final=pd.concat([df_single_record,df_drop_U],ignore_index=True)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
123 AAA N true
555 FIC Q false
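The four steps above can probably be collapsed into a single boolean mask. The sketch below is one possible way to do that, not the only one; the variable names has_duplicates and is_u are made up for illustration. Like the multi-step version, it would drop every row of an ID that occurs several times with only 'U' values.

```python
import pandas as pd

# Sample data copied from the question (IDs kept as strings to preserve "000").
df = pd.DataFrame({
    "ID": ["123", "123", "000", "222", "555", "555"],
    "Col1": ["AAA", "BBB", "AAA", "CCC", "FIC", "VAN"],
    "Col2": ["N", "U", "N", "U", "Q", "U"],
    "Col3": [True, True, True, False, False, True],
})

# True for every row whose ID occurs more than once.
has_duplicates = df.duplicated(subset="ID", keep=False)

# True for every row with Col2 == 'U'.
is_u = df["Col2"].eq("U")

# Drop a row only when it is a 'U' row AND its ID has other rows.
df_final = df[~(has_duplicates & is_u)]
print(df_final)
```

Unlike the concatenation in the fourth step, this keeps the rows in their original order.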
I want to calculate frequencies row by row. For instance:
column_nameA | column_nameB | column_nameC | title | content
AAA company | AAA | Ben Simons | AAA company has new product lanuch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned.......
BBB company | BBB | Alex Wong | AAA company has new product lanuch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions…....
The result I expect is as follows.
If AAA company appears once in the title, the count is 1; if it appears twice in the title, the count should be 2.
The same idea applies to the content: if AAA company appears once, the count is 1; if it appears twice in the content, the count should be 2.
However, each row is matched only against its own names: if AAA company appears in the second row, it is not counted, because that row only considers BBB company or BBB, not AAA company or AAA.
So the result would look like:
nameA_appear_in_title | nameB_appear_in_title | nameC_appear_in_title | nameA_appear_in_content | nameB_appear_in_content | nameC_appear_in_content
1 | 1 | 0 | 2 | 1 | 1
0 | 0 | 0 | 1 | 1 | 0
All the data is stored in a dataframe, and I hope this can be done using pandas.
One more thing to highlight: the title and content cannot be tokenized to count the frequency.
Use itertools.product to iterate over all combinations of the two lists of column names and create new columns with the counts; finally, remove the original columns if necessary:
from itertools import product

cols = df.columns
L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

for a, b in product(L2, L1):
    # count how often the name in column b occurs in the text in column a
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)

# keep only the new count columns
df = df.drop(cols, axis=1)
print (df)
column_nameA_title column_nameB_title column_nameC_title \
0 1 1 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 3 1
1 1 2 0
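Finally, if necessary, subtract the column_nameA counts from the column_nameB counts, so that a column_nameB match inside a column_nameA match is not counted twice: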
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')
df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print (df)
column_nameA_title column_nameB_title column_nameC_title \
0 1 0 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 1 1
1 1 1 0
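For reference, here is a self-contained sketch of the same counting idea that builds the count columns with a list comprehension instead of DataFrame.apply. The sample texts are shortened versions of the ones in the question (the elided endings are simply left out), and the names name_cols, text_cols, and counts are chosen here for illustration.

```python
from itertools import product

import pandas as pd

# Shortened sample data based on the question; the elided text endings are omitted.
df = pd.DataFrame({
    "column_nameA": ["AAA company", "BBB company"],
    "column_nameB": ["AAA", "BBB"],
    "column_nameC": ["Ben Simons", "Alex Wong"],
    "title": ["AAA company has new product lanuch."] * 2,
    "content": [
        "AAA company has released new product. AAA claims that the product X "
        "has significant changed than before. Ben Simons, who is AAA company CEO, "
        "also mentioned",
        "AAA company has released new product. BBB claims that the product X "
        "has significant changed than before, and BBB company has invested "
        "around 1 millions",
    ],
})

name_cols = ["column_nameA", "column_nameB", "column_nameC"]
text_cols = ["title", "content"]

# For every (text column, name column) pair, count the non-overlapping
# occurrences of each row's own name in that row's own text.
counts = pd.DataFrame(
    {
        f"{name}_{text}": [t.count(n) for t, n in zip(df[text], df[name])]
        for text, name in product(text_cols, name_cols)
    },
    index=df.index,
)
print(counts)
```

Because str.count works on raw substrings, no tokenization is involved, which matches the constraint stated in the question.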
I want to remove 'hi', 'by', and 'dy' from col2 in one shot in SQL. I'm very new to SQL Server; if anyone could outline how such problems are solved, that would be really helpful.
Col1 col2 col3
A hi!abcd 123
B bypython 678
C norm 888
D dupty dy 999
output:
Col1 col2 col3
A abcd 123
B python 678
C norm 888
D dupty 999
This question may look like a repeat of ones answered before, but it is a bit tricky.
Let us say I have the following data frame.
Id Col_1
1 aaa
1 ccc
2 bbb
3 aa
Based on the Id and Col_1 columns, I want to create a new column and assign a value to it depending on whether aa occurs in Col_1. The value should be applied per Id, i.e. to all rows that share the same Id.
The expected result:
Id Col_1 New_Column
1 aaa aa
1 ccc aa
2 bbb
3 aa aa
I tried it with this:
df['New_Column'] = ((df['Id']==1) | df['Col_1'].str.contains('aa')).map({True:'aa', False:''})
and the result is
Id Col_1 New_Column
1 aaa aa
1 ccc
2 bbb
3 aa aa
But as I mentioned above, I also want to assign aa in the new column to the other rows with the same Id.
Can anyone help on this?
Use GroupBy.transform with GroupBy.any to get a mask for all groups with at least one value containing aa:
mask = df['Col_1'].str.contains('aa').groupby(df['Id']).transform('any')
An alternative uses Series.isin, filtering the Id values of the rows that contain aa:
mask = df['Id'].isin(df.loc[df['Col_1'].str.contains('aa'), 'Id'])
df['New_Column'] = np.where(mask, 'aa','')
print (df)
Id Col_1 New_Column
0 1 aaa aa
1 1 ccc aa
2 2 bbb
3 3 aa aa
EDIT:
mask1 = df['Id'].isin(df.loc[df['Col_1'].str.contains('aa'), 'Id'])
mask2 = df['Id'].isin(df.loc[df['Col_1'].str.contains('bb'), 'Id'])
df['New_Column'] = np.select([mask1, mask2], ['aa','bb'],'')
print (df)
Id Col_1 New_Column
0 1 aaa aa
1 1 ccc aa
2 2 bbb bb
3 3 aa aa
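As a quick check, the following sketch rebuilds the question's sample frame and compares the GroupBy.transform mask with the Series.isin mask. The variable names contains_aa, mask_transform, and mask_isin are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Id": [1, 1, 2, 3], "Col_1": ["aaa", "ccc", "bbb", "aa"]})

# Row-level test: does Col_1 contain the substring 'aa'?
contains_aa = df["Col_1"].str.contains("aa")

# Variant 1: broadcast "any row in this Id group matches" back to every row.
mask_transform = contains_aa.groupby(df["Id"]).transform("any")

# Variant 2: collect the matching Ids, then test membership.
mask_isin = df["Id"].isin(df.loc[contains_aa, "Id"])

# Both masks select the same rows on this data.
same = bool((mask_transform == mask_isin).all())

df["New_Column"] = np.where(mask_transform, "aa", "")
print(same)
print(df)
```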
I do not have much knowledge of databases.
To study, I am reading MariaDB's index documentation.
But there are parts that I do not understand.
Document
Algorithm, step 2b (GROUP BY)
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
My understanding: the order of the columns in the index matters, but aaa and bbb may come in either order, regardless of the order of the conditions in the WHERE clause. So the aaa and bbb parts of the index are used for the WHERE clause, and ccc is already sorted within the matched aaa and bbb.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
Does (no WHERE) mean that the query has no WHERE clause?
What if I use it like this?
WHERE x > 1 GROUP BY x, y
My thinking:
(1) from table
(2) where x > 1 -> using index
(3) group by x, y -> using index..? because (2) already sorted..? or sort again?
(4) having -> if i did not enter this keyword, is it not used?
(5) select -> print data(?)
(6) order by -> group by already order by(?)
Algorithm, step 2b (GROUP BY)
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
Suppose there is a table like the one below:
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
------------------
123 | 2 | 27
------------------
125 | 1 | 11
------------------
125 | 3 | 29
------------------
125 | 3 | 40
------------------
The result of the WHERE aaa = 123 AND bbb = 1 clause is:
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
Look at the ccc column.
Among the rows with the same aaa and bbb values, the ccc column is already sorted.
So the GROUP BY clause can group quickly, because the ccc values arrive in sorted order.
**CAUTION**
Consider the clause WHERE aaa >= 123 AND bbb = 1 GROUP BY ccc.
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
------------------
125 | 1 | 11
------------------
Here the ccc column is not sorted.
The order of the ccc column is meaningful only among rows that have the same aaa and bbb values.
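The point above can be illustrated with a toy model in which a composite index is simply a list of tuples sorted by (aaa, bbb, ccc). This is a simplified sketch and not how MariaDB stores or scans an index internally; it only shows why ccc comes out sorted for the equality test but not for the range test.

```python
# Rows of the example table as (aaa, bbb, ccc) tuples.
rows = [
    (123, 1, 30),
    (123, 1, 48),
    (123, 2, 27),
    (125, 1, 11),
    (125, 3, 29),
    (125, 3, 40),
]

# Toy stand-in for INDEX(aaa, bbb, ccc): the rows sorted by that column order.
index = sorted(rows)

# WHERE aaa = 123 AND bbb = 1: ccc values in index order.
eq_scan = [ccc for aaa, bbb, ccc in index if aaa == 123 and bbb == 1]

# WHERE aaa >= 123 AND bbb = 1: ccc values in index order.
range_scan = [ccc for aaa, bbb, ccc in index if aaa >= 123 and bbb == 1]

print(eq_scan, eq_scan == sorted(eq_scan))           # [30, 48] True
print(range_scan, range_scan == sorted(range_scan))  # [30, 48, 11] False
```

With the range condition, the scan crosses from aaa = 123 to aaa = 125, and the ccc order restarts there, so a GROUP BY ccc would need extra work.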
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
The same reasoning applies here.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
should probably say "(if there is no WHERE)". If there is a WHERE, then that index may or may not be useful. You should (usually) build the INDEX based on the WHERE, and only if you get past it, consider the GROUP BY.
WHERE x > 1 GROUP BY x, y
OK, that can use INDEX(x,y), in that order. First, it will filter, and that leaves the rest of the index still in a good order for the grouping. Similarly:
WHERE x > 1 ORDER BY x, y
WHERE x > 1 GROUP BY x, y ORDER BY x, y
No sorting should be necessary.
So, here are the steps I might take:
1. WHERE x > 1 ... --> INDEX(x) (or any index _starting_ with `x`)
2. ... GROUP BY x, y --> INDEX(x,y)
3. recheck that I did not mess up the WHERE.
This has no really good index:
WHERE x > 1 AND y = 4 GROUP BY x,y
1. WHERE x > 1 AND y = 4 ... --> INDEX(y,x) in this order!
2. ... GROUP BY x,y --> can use that index
However, flipping to GROUP BY y,x has the same effect (ignoring the order of display).
(4) having -> if i did not enter this keyword, is it not used?
HAVING, if present, is applied after things for which INDEXes are useful. Having no HAVING does mean there is no HAVING.
(6) order by -> group by already order by(?)
That has become a tricky question. Until very recently (MySQL 8.0; don't know when or if MariaDB changed), GROUP BY implied the equivalent ORDER BY. That was non-standard and potentially interfered with optimization. With 8.0, GROUP BY does not imply any order; you must explicitly request the order (if you care).
(I updated the source document in response to this discussion.)
I am sure my question is very simple for some, but I cannot figure it out and it is one of those things difficult to search an answer for. I hope you can help.
In a table in SQL I have the following (simplified data):
UserID UserIDX Number Date
aaa bbb 1 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 3 15.02.2020
xxx bbb 99 15.02.2020
And I need a view which will give me the same amount of records, but for every combination of UserID and UserIDX, there should be only 1 value under the Number field, i.e. the highest value found in the combination data set. The Date field needs to remain unchanged. So the above would be transformed to:
UserID UserIDX Number Date
aaa bbb 5 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 9 15.02.2020
xxx bbb 99 15.02.2020
So, for all instances of aaa+bbb combination the unique value in Number should be 5 and for ppp+ggg the unique number is 9.
Thank you very much.
Leo
select a.userid, a.useridx, b.maxnum, a.date
from table a
inner join (
    select userid, useridx, max(number) maxnum
    from table
    group by userid, useridx) b
on a.userid = b.userid and a.useridx = b.useridx
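For readers working in pandas rather than SQL, roughly the same result can be obtained with GroupBy.transform. This is only an analogue of the join above, sketched on the question's sample data, with the dates kept as plain strings.

```python
import pandas as pd

# Sample data copied from the question; dates are kept as strings.
df = pd.DataFrame({
    "UserID": ["aaa", "aaa", "ppp", "ppp", "xxx"],
    "UserIDX": ["bbb", "bbb", "ggg", "ggg", "bbb"],
    "Number": [1, 5, 9, 3, 99],
    "Date": ["21.01.2000", "21.01.2010", "21.01.2009", "15.02.2020", "15.02.2020"],
})

# Replace Number by the maximum within each (UserID, UserIDX) combination;
# transform returns one value per original row, so the row count is unchanged.
df["Number"] = df.groupby(["UserID", "UserIDX"])["Number"].transform("max")
print(df)
```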