pandas: get top n including the duplicates of a sorted column - pandas

I have some data like
This is a table sorted by score column and also then by cat column
score cat
18 B
18 A
17 A
16 B
16 A
15 B
14 B
13 A
12 A
10 B
9 B
I want to get the top 5 of score including the duplicates and also add the rank
i.e
rank score cat
1 18 B
1 18 A
2 17 A
3 16 B
3 16 A
4 15 B
5 14 B
How can i get this using pandas

Since the data frame is ordered, try factorize
df['rnk'] = df.score.factorize()[0]+1
out = df[df['rnk'] <= 5]
out
score cat rnk
0 18 B 1
1 18 A 1
2 17 A 2
3 16 B 3
4 16 A 3
5 15 B 4
6 14 B 5

Related

Add entries of a table into rows of another table

I have two tables
table a:
ID VALUE_z
1 41
2 32
3 51
table b:
ID TYPE z
1 a 10
1 b 15
1 c 20
2 a 12
2 b 8
2 c 5
3 a 21
3 b 4
3 c 2
I want to add the rows from table a to the column VALUE in table b based on the ID. The result should look like this
table result:
ID TYPE VALUE
1 a 10
1 b 15
1 c 20
1 z 41
2 a 12
2 b 8
2 c 5
2 z 32
3 a 21
3 b 4
3 c 2
3 z 51
Try the following using INSERT INTO SELECT Statement:
insert into tableB
select ID, 'z', VALUE_z
from tableA
See demo

Stack multiple columns into single column while maintaining other columns in Pandas?

Given pandas multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the column cl_c cl_d cl_e into a single column with the name ax. But, please note that, the columns cl_a cl_b were maintained.
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
df = pd.DataFrame ( {'cl_a': [1,2,3,4,5], 'cl_b': ['a','b','c','d','e'],
'cl_c': [5,4,3,2,1],'cl_d': [6,7,8,9,10],
'cl_e': [20,21,22,23,24]})
df_new = pd.DataFrame()
for col_name in ['cl_c','cl_d','cl_e']:
df_new=df_new.append (df [['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
However, I am curious whether there is Pandas build-in approach that can do the trick
Edit:
Upon Quong answer, I realise of the need to include another column (i.e., from_col) beside the ax. The from_col indicate the origin of ax previous column name.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a','cl_b']).stack()
.reset_index(level=-1, drop=True)
.reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24

Aggregating counts of different columns subsets

I have a dataset with a tree structure and for each path in the tree, I want to compute the corresponding counts at each level. Here is a minimal reproducible example with two levels.
import pandas as pd
data = pd.DataFrame()
data['level_1'] = np.random.choice(['1', '2', '3'], 100)
data['level_2'] = np.random.choice(['A', 'B', 'C'], 100)
I know I can get the counts on the last level by doing
counts = data.groupby(['level_1','level_2']).size().reset_index(name='count_2')
print(counts)
level_1 level_2 count_2
0 1 A 10
1 1 B 12
2 1 C 8
3 2 A 10
4 2 B 10
5 2 C 10
6 3 A 17
7 3 B 12
8 3 C 11
What I would like to have is a dataframe with one row for each possible path in the tree with the counts at each level in that path. For the example above, it would be something like
level_1 level_2 count_1 count_2
0 1 A 30 10
1 1 B 30 12
2 1 C 30 8
3 2 A 30 10
4 2 B 30 10
5 2 C 30 10
6 3 A 40 17
7 3 B 40 12
8 3 C 40 11
This is an example with only two levels, which is easy to solve, but I would like to have a way to get those counts for an arbitrary number of levels.
This will be the transform
counts['count_1']=counts.groupby(['level_1']).count_2.transform('sum')
counts
Out[445]:
level_1 level_2 count_2 count_1
0 1 A 7 30
1 1 B 13 30
2 1 C 10 30
3 2 A 7 30
4 2 B 7 30
5 2 C 16 30
6 3 A 9 40
7 3 B 10 40
8 3 C 21 40
You can make do from your original data:
groups = data.groupby('level_1').level_2
pd.merge(groups.value_counts(),
groups.size(),
left_index=True,
right_index=True)
which gives:
level_2_x level_2_y
level_1 level_2
1 A 14 39
B 14 39
C 11 39
2 C 13 34
A 12 34
B 9 34
3 B 12 27
C 9 27
A 6 27

Split a column by element and create new ones with pandas

Goal: I want to split one single column by elements (not the strings cells) and, from that division, create new columns, where the element is the title of the new column and the other values from another columns compose the respective column.
There is a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if 2 column input data:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for counter, then reshape by DataFrame.set_index with Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If 2 columns input data:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18

See if all records in the same group are of accepted types

Consider the following table. Each document (id) belongs to a group (group_id).
-----------------------
id group_id value
-----------------------
1 1 A
2 1 B
3 1 D
4 2 A
5 2 B
6 3 C
7 4 A
8 4 B
9 4 B
10 4 B
11 4 C
12 5 A
13 5 A
14 5 A
15 6 B
16 6 NULL
17 6 NULL
18 6 D
19 7 NULL
20 8 B
1/ Each document has a value NULL, A, B, C or D
2/ If the documents in the same group all have either A or B as value, the group is completed
3/ In this case, the desired output would read:
---------------------
group_id completed
---------------------
1 0 <== because document 3 = D
2 1 <== all documents have either A or B as a value
3 0 <== only one document in the group, value C
4 1 <== all documents have either A or B as a value
5 1 <== all documents have value A
6 0 <== because of NULL values and value D
7 0 <== NULL
8 1 <== only one document, value B
IS it possible to query this resultset?
As I am not very experienced in SQL, any help would be appreciated!
Try this
SELECT [group_id],
CASE
WHEN Count(CASE WHEN [value] IN ( 'A', 'B' ) THEN 1 END) = Count(*) THEN 1
ELSE 0
END AS COMPLETED
FROM yourtable
GROUP BY [group_id]