Aggregate symmetric pairs pandas - pandas

How do I aggregate symmetric pairs in pandas?
I have a dataframe which looks like this:
X Y count
A B 2
B A 1
C D 5
D C 3
My output should look like this:
X Y count
A B 3
C D 8
Thank you!

I used to have the same problem before , And this is my solution
df1=df[['X','Y']].apply(sorted,1)
df.groupby([df1.X,df1.Y])['count'].sum().reset_index(name='count')
Out[400]:
X Y count
0 A B 3
1 C D 8

Related

Map Dictionary of Lists to pd Dataframe Column and Create Repeat Rows Based on n Number of List Contents

I am trying to use the following two components 1) a dictionary of lists and 2) a dataframe column composed of the dictionary keys. I would like to to map n number of values to their corresponding key in the existing pandas column, and create duplicate rows based on the number of list contents. I would like to maintain this as a df and not convert to series.
ex. dictionary
d = {a:['i','ii'],b:['iii','iv'],c:['v','vi','vii']}
ex. dataframe columns
Column1 Column2
0 g a
1 h b
2 i c
desired output:
Column1 Column2 Column3
0 g a i
1 g a ii
2 h b iii
3 h b iv
4 i c v
5 i c vi
6 i c vii
What if another dictionary had to be mapped similarly to these three columns from the output? Say, with the following dictionary:
d2 = {'i':['A'],'ii':['B'],'iii':['C','D'],'iv':['E'],'v':['F'];'vi':[G];'vii':['H','I','J']}
What if the dictionary was in df format?
Any help would be much appreciated! Thank you!
use map to create a new column and then explode the list into rows
df=df.assign(Column3 = df['Column2'].map( d))
df.explode('Column3')
Column1 Column2 Column3
0 g a i
0 g a ii
1 h b iii
1 h b iv
2 i c v
2 i c vi
2 i c vii
follow the same to map to Column 3
df=df.assign(Column4 = df['Column3'].map( d2))
df=df.explode('Column4')
df
Column1 Column2 Column3 Column4
0 g a i A
0 g a ii B
1 h b iii C
1 h b iii D
1 h b iv E
2 i c v F
2 i c vi G
2 i c vii H
2 i c vii I
2 i c vii J

Conditional frequency of elements within lists in pandas data frame

I have a data frame in pandas like this:
STATUS FEATURES
A [x,y,z]
A [t, y]
B [x,p,t]
B [x,p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS FEATURES FREQUENCY
A x 1
A y 2
A z 1
A t 1
B x 2
B t 1
B p 2
Let us do explode , the groupby size
s=df.explode(['FEATURES']).groupby(['STATUS','FEATURES']).size().reset_index()
Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
.groupby('STATUS')['FEATURES']
.value_counts()
.reset_index(name='FRECUENCY'))
print(new_df)
Output
STATUS FEATURES FRECUENCY
0 A y 2
1 A t 1
2 A x 1
3 A z 1
4 B p 2
5 B x 2
6 B t 1

Append two pandas dataframe with different shapes and in for loop using python or pandasql

I have two dataframe such as:
df1:
id A B C D
1 a b c d
1 e f g h
1 i j k l
df2:
id A C D
2 x y z
2 u v w
The final outcome should be:
id A B C D
1 a b c d
1 e f g h
1 i j k l
2 x y z
2 u v w
These tables are generated using for loop from json files. So have to keep on appending these tables one below another.
Note: Two dataframes 'id' column is always different.
My approach:
data is a dataframe in which column 'X' has json data and has and "id" column also.
df1=pd.DataFrame()
for i, row1 in data.head(2).iterrows():
df2= pd.io.json.json_normalize(row1["X"])
df2.columns = df2.columns.map(lambda x: x.split(".")[-1])
df2["id"]=[row1["id"] for i in range(df2.shape[0])]
if len(df1)==0:
df1=df2.copy()
df1=pd.concat((df1,df2), ignore_index=True)
Error: AssertionError: Number of manager items must equal union of block items # manager items: 46, # tot_items: 49
How to solve this using python or pandas sql.
You can use pd.concat to concatenate two dataframes like
>>> pd.concat((df,df1), ignore_index=True)
id A B C D
0 1 a b c d
1 1 e f g h
2 1 i j k l
3 2 x NaN y z
4 2 u NaN v w

Pandas groupby sort each group values and order dataframe groups based on max of each group

I have a dataset containing 3 columns, I’m trying to group them and print each group in sorted fashion (based on highest value in each group). The records in each group also have to be in sorted fashion.
Dataset looks like below.
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform as it works per group basis and then sorting the values.
Above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But, the same code fails for below dataset.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add column key1 to sort_values because in second DataFrame are multiple maximum values 10 per groups, so sorting cannot distingush groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2

How can I run a vlookup function in SQL within the same table?

I'm fairly new to SQL and struggling to find a good way to run the following query.
I have a table that looks something like this:
NAME JOB GRADE MANAGER NAME
X 7 O
Y 6 X
Z 5 X
A 4 Z
B 3 Z
C 2 Z
In this table, it shows that Y and Z report into X, and A, B and C report into Z.
I want to create a computed column showing the grade each person's most senior direct report or "n/a" if they don't manage anyone. So that would look something like this:
NAME JOB GRADE MANAGER NAME GRADE OF MOST SENIOR REPORT
X 7 O 6
Y 6 X N/A
Z 5 X 4
A 4 Z N/A
B 3 Z N/A
C 2 Z N/A
How would I do this?
SELECT g.*,isnull(convert(nvarchar, (SELECT max(g2.GRADE)
FROM dbo.Grade g2 WHERE
g2.manager =g.NAME AND g2.NAME!=g.NAME )),'N/A') as most_graded
FROM dbo.Grade g
The max will find out the topmost graded
Input
X 7 O
y 6 X
Z 5 X
A 6 Z
C 2 Z
Output
X 7 O 6
y 6 X N/A
Z 5 X 6
A 6 Z N/A
C 2 Z N/A
Something like this:
select name, job_grade, manager_name,
(select max(job_grade) from grades g2
where g2.manager_name = g1.name) as grade_of_most_recent_senior
from grades g1
order by name;
The above is ANSI SQL and should work on any DBMS.
SQLFiddle example: http://sqlfiddle.com/#!15/e0806/1