Is there a way to rearrange two data sets into one in pandas?

I have data read as follows
K a b c
p1 x y z
p2 e r x
p3 v w q
...........
..............
p9 Y..........
K d e f
p9 x y z
p8 o p q
p7 h i j
..............
............
I would like to rearrange the data as follows:
K a b c.......d e f
p1 x y z ..............
p2 e r x ..............
p3 v w q.......
*
*
*
p7............h i j
p8........... o p q
p9 Y..........x y z
The index name is K, but in one of the blocks the index is named 'k' (lowercase).
Can I rename one index so both have the same name, and then merge them into one?
Thank you in advance.
I divided them into 2 dfs, then tried
pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
but it didn't work.
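One approach that usually works, assuming the two blocks were split into df1 and df2 and the second block's index is named 'k' (lowercase): give both indexes the same name with rename_axis, then do an outer merge on the index. A minimal sketch with made-up stand-in data:

```python
import pandas as pd

# Hypothetical stand-ins for the two blocks from the question
df1 = pd.DataFrame({'a': ['x', 'e'], 'b': ['y', 'r'], 'c': ['z', 'x']},
                   index=pd.Index(['p1', 'p2'], name='K'))
df2 = pd.DataFrame({'d': ['o'], 'e': ['p'], 'f': ['q']},
                   index=pd.Index(['p2'], name='k'))  # note: lowercase 'k'

# Give both indexes the same name, then merge on the index.
# Rows present in only one frame get NaN in the other frame's columns.
df2 = df2.rename_axis('K')
merged = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
```

If the merge still misbehaves, compare the index dtypes and check for stray whitespace in the index values; a mismatch there is the usual cause of an "outer merge that didn't work".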

Related

How can I map a list of values to a dataframe

I'm trying to map some values to a dataframe. I used some looping methods, but it seems there must be a simpler way to achieve the result I'm looking for.
input_df :
A B C
x y z
i j t
f g h
list of values to map:
list = [1,2,3]
result_df :
A B C D
x y z 1
i j t 1
f g h 1
x y z 2
i j t 2
f g h 2
x y z 3
i j t 3
f g h 3
Try a cross join (i.e. Cartesian product):
tmp = pd.DataFrame(list, columns=["D"])
tmp.merge(input_df, how="cross")
Requires pandas >= 1.2.0; see pd.merge.
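A self-contained version of the answer, with the list renamed so it doesn't shadow the built-in list:

```python
import pandas as pd

input_df = pd.DataFrame({'A': ['x', 'i', 'f'],
                         'B': ['y', 'j', 'g'],
                         'C': ['z', 't', 'h']})
values = [1, 2, 3]  # renamed from `list` to avoid shadowing the built-in

# Cross join: every value paired with every row of input_df
tmp = pd.DataFrame(values, columns=["D"])
result_df = tmp.merge(input_df, how="cross")  # pandas >= 1.2.0

# The left frame's column comes first; reorder to match the expected output
result_df = result_df[["A", "B", "C", "D"]]
```

Putting the values on the left of the merge makes each value repeat over the whole frame, which is exactly the row order shown in result_df above.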

Pandas Dataframe transformation - Understanding problems with functions I should use and logic I should opt for

I've got a hard problem with transforming a dataframe into another one.
I don't know what functions I should use to do what I want. I had some ideas that didn't work at all.
For example, I do not understand how I should use append (or if I should use it or something else) to do what I want.
Here is my original dataframe:
df1 = pd.DataFrame({
'Key': ['K0', 'K1', 'K2'],
'X0': ['a','b','a'],
'Y0': ['c','d','c'],
'X1': ['e','f','f'],
'Y1': ['g','h','h']
})
Key X0 Y0 X1 Y1
0 K0 a c e g
1 K1 b d f h
2 K2 a c f h
This dataframe represents all the links attached to an ID in column Key. Links follow each other: X0->Y0 is the parent of X1->Y1.
It's easy to read, and the real dataframe I'm working with has 6500 rows by 21 columns representing a tree of links. So this dataframe follows an end-to-end link logic.
I want to transform it into another one that follows a unitary-link and ID logic (because it's a tree of links, some unitary links may be part of multiple end-to-end links).
So I want to get each individual link into X->Y, and I need the list of the Keys attached to each unitary link in Key.
And here is what I want :
df3 = pd.DataFrame({
'Key':[['K0','K2'],'K1','K0',['K1','K2']],
'X':['a','b','e','f'],
'Y':['c','d','g','h']
})
Key X Y
0 [K0, K2] a c
1 K1 b d
2 K0 e g
3 [K1, K2] f h
To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column. At the same time I need to keep the keys attached to the links. This first transformation leads to a new dataframe containing all the original information with duplicates, which I will deal with afterwards to obtain df3.
Here is the transition dataframe :
df2 = pd.DataFrame({
'Key':['K0','K1','K2','K0','K1','K2'],
'X':['a','b','a','e','f','f'],
'Y':['c','d','c','g','h','h']
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Transition from df1 to df2
For now, I did this to put X0,X1 and Y0,Y1 into X and Y :
Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)
0 K0
1 K1
2 K2
dtype: object
0 a
1 b
2 a
3 e
4 f
5 f
dtype: object
0 c
1 d
2 c
3 g
4 h
5 h
dtype: object
But I do not know how to keep the keys lined up with the right links.
Also, I do this to construct df2, but it feels wrong and I do not understand how I should do it:
df2 = pd.DataFrame({
'Key':Key,
'X':X,
'Y':Y
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 NaN e g
4 NaN f h
5 NaN f h
I tried to use append to combine the X0,X1 and Y0,Y1 columns directly into df2, but it turned out to be a complete mess, not filling df2's columns with df1's content. I also tried to use append to put the Series Key, X and Y directly into df2, but it gave me X and Y in rows instead of columns.
In short, I'm quite lost. I know there may be a lot to program to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.
To convert df1 to df2 you want to look into pandas.wide_to_long.
https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2
X Y
Key num
K0 0 a c
K1 0 b d
K2 0 a c
K0 1 e g
K1 1 f h
K2 1 f h
You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:
>>> df2 = (
pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
.droplevel(level='num')
.reset_index()
)
>>> df2
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.
>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d [K1]
2 e g [K0]
3 f h [K1, K2]
If you don't want single keys to be wrapped in lists, you can do this instead:
>>> df3 = (
df2.groupby(['X','Y'], as_index=False)
.agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d K1
2 e g K0
3 f h [K1, K2]
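Putting the whole df1 -> df2 -> df3 pipeline from this answer together as one runnable script:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', 'f'],
    'Y1': ['g', 'h', 'h'],
})

# Wide to long: gather X0/X1 into X and Y0/Y1 into Y, keyed by Key,
# then drop the suffix level and turn Key back into a column
df2 = (pd.wide_to_long(df1, stubnames=['X', 'Y'], i='Key', j='num')
         .droplevel(level='num')
         .reset_index())

# Group the unitary links and collect the attached keys into lists
df3 = df2.groupby(['X', 'Y'], as_index=False)['Key'].agg(list)
```

The same recipe scales to the real 21-column frame: wide_to_long only needs the stub names ('X', 'Y') and consistent numeric suffixes on the columns.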

Pandas Merging Data Frames Repeated Values and Values Missing

So I've created three data frames from 3 separate files (csv and xls). I want to combine the three of them into a single data frame that is 20 columns and 15 rows. I've managed to do this using the code at the bottom (the final part of my code, where I merge the data frames I created). However, an odd thing is happening: the highest-ranking country is duplicated 3 times, and two rows that should be among the 15 are missing, and I'm not exactly sure why.
I've set the index to be the same in each data frame!
So essentially my issue is that there are duplicate values showing up and other values being eliminated after I merge the data frames.
If someone could explain the mechanics of why this issue is occurring, I'd really appreciate it :)
merged = pd.merge(pd.merge(df_ScimEn, df_energy[ListEnergy], left_index=True, right_index=True),
                  df_GDP[ListOfGDP], left_index=True, right_index=True)
merged = merged[ListOfColumns]
merged = merged.sort_values('Rank')
merged = merged[merged['Rank'] < 16]
final = pd.DataFrame(merged)
Example: a shorter version of what is happening
expected:
A B C D J K L R
1 x y z j a e c d
2 b c d l a l c d
3 j k e k a m c d
4 d k c k a n h d
5 d k j l a h c d
generated after I run the code above: (the 1 is repeated and the 3 is missing)
A B C D J K L R
1 x y z j a b c d
1 x y z j a b c d
1 x y z j a b c d
4 d k c k a b h d
5 d k j l a h c d
Example input:
df1 = {[1:A,B,C],[2:A,B,C],[3:A,B,C],[4:A,B,C],[5:A,B,C]}
df2 = {[1:J,K,L,M],[2:J,K,L,M],[3:J,K,L,M],[4:J,K,L,M],[5:J,K,L,M]}
df3 = {[1:R,E,T],[2:R,E,T],[3:R,E,T],[4:R,E,T],[5:R,E,T]}
So the indexes are all the same for each data frame; some have a different number of rows and columns, but I've edited them to form the final data frame. Each capital letter stands for a column name with different values in each column.
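No answer is recorded here, but both symptoms in the example point at a non-unique index: pd.merge on a duplicated key produces one output row per pair of matches (a within-key Cartesian product), and the default how='inner' silently drops keys that are absent from either frame. A minimal sketch with made-up frames reproducing both symptoms:

```python
import pandas as pd

# Key 1 appears three times on the right; key 3 is missing from the right
left = pd.DataFrame({'A': ['x', 'b', 'j']}, index=[1, 2, 3])
right = pd.DataFrame({'J': ['a', 'a', 'a', 'a']}, index=[1, 1, 1, 2])

merged = pd.merge(left, right, left_index=True, right_index=True)
# Key 1 is repeated three times; key 3 disappears (inner join)

# validate catches the duplication instead of silently multiplying rows:
try:
    pd.merge(left, right, left_index=True, right_index=True,
             validate='one_to_one')
except pd.errors.MergeError:
    pass  # raised because the right index is not unique
```

Deduplicating the offending index (e.g. with drop_duplicates on the key) before merging, or passing validate='one_to_one' to fail fast, usually resolves this kind of repeated-and-missing-rows output.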

Append two pandas dataframes with different shapes in a for loop using python or pandasql

I have two dataframes such as:
df1:
id A B C D
1 a b c d
1 e f g h
1 i j k l
df2:
id A C D
2 x y z
2 u v w
The final outcome should be:
id A B C D
1 a b c d
1 e f g h
1 i j k l
2 x y z
2 u v w
These tables are generated in a for loop from json files, so I have to keep appending these tables one below another.
Note: the two dataframes' 'id' columns are always different.
My approach:
data is a dataframe whose column 'X' holds json data; it also has an 'id' column.
df1 = pd.DataFrame()
for i, row1 in data.head(2).iterrows():
    df2 = pd.io.json.json_normalize(row1["X"])
    df2.columns = df2.columns.map(lambda x: x.split(".")[-1])
    df2["id"] = [row1["id"] for i in range(df2.shape[0])]
    if len(df1) == 0:
        df1 = df2.copy()
    df1 = pd.concat((df1, df2), ignore_index=True)
Error: AssertionError: Number of manager items must equal union of block items # manager items: 46, # tot_items: 49
How can I solve this using python or pandas sql?
You can use pd.concat to concatenate the two dataframes:
>>> pd.concat((df1, df2), ignore_index=True)
id A B C D
0 1 a b c d
1 1 e f g h
2 1 i j k l
3 2 x NaN y z
4 2 u NaN v w
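For the loop itself, a more robust pattern is to collect the frames in a list and concatenate once at the end; pd.concat aligns differing columns by name and fills the gaps with NaN. A sketch using the two frames from the question in place of the json-parsing step:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1],
                    'A': ['a', 'e', 'i'], 'B': ['b', 'f', 'j'],
                    'C': ['c', 'g', 'k'], 'D': ['d', 'h', 'l']})
df2 = pd.DataFrame({'id': [2, 2],
                    'A': ['x', 'u'], 'C': ['y', 'v'], 'D': ['z', 'w']})

frames = []
for df in (df1, df2):       # stand-in for the per-json-file loop body
    frames.append(df)
# One concat at the end: no quadratic copying, no special first-iteration case
result = pd.concat(frames, ignore_index=True, sort=False)
```

This also sidesteps the `if len(df1) == 0` special case in the question's loop, which appends the first frame twice.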

How to aggregate a column by a value on another column?

Suppose I have the following df.
df = pd.DataFrame({
'A':['x','y','x','y'],
'B':['a','b','a','b'],
'C':[1,10,100,1000],
'D':['w','v','v','w']
})
A B C D
0 x a 1 w
1 y b 10 v
2 x a 100 v
3 y b 1000 w
I want to group by columns A and B, sum column C, and keep the value of D from the row holding the group's maximum value of C. Like this:
A B C D
x a 101 v
y b 1010 w
So far, I have this:
df.groupby(['A','B']).agg({'C':sum})
A B C
x a 101
y b 1010
What function should I use to aggregate column D?
You can use DataFrameGroupBy.idxmax to get the indices of the max values of C, then select the matching D values with loc:
#unique index
df.reset_index(drop=True, inplace=True)
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax'])
df1['idxmax'] = df.loc[df1['idxmax'], 'D'].values
df1 = df1.rename(columns={'idxmax':'D','sum':'C'}).reset_index()
Similar solution with map:
df1 = df.groupby(['A','B'])['C'].agg(['sum', 'idxmax']).reset_index()
df1['idxmax'] = df1['idxmax'].map(df['D'])
df1 = df1.rename(columns={'idxmax':'D','sum':'C'})
print (df1)
A B C D
0 x a 101 v
1 y b 1010 w
Another option is to set_index('D') before you group by:
df.set_index('D').groupby(['A','B']).C.agg(['sum','idxmax']).\
reset_index().rename(columns={'idxmax':'D','sum':'C'})
Out[407]:
A B C D
0 x a 101 v
1 y b 1010 w
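On pandas >= 0.25 the same result can be written with named aggregation, which avoids the rename step (a sketch, not from the original answers):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'y'],
    'B': ['a', 'b', 'a', 'b'],
    'C': [1, 10, 100, 1000],
    'D': ['w', 'v', 'v', 'w'],
})

# Named aggregation: C becomes the group sum, D temporarily holds the
# row index of the group's max C
out = df.groupby(['A', 'B'], as_index=False).agg(C=('C', 'sum'),
                                                 D=('C', 'idxmax'))
# Replace the row index with the D value from that row
out['D'] = df.loc[out['D'], 'D'].values
```

The output columns already carry their final names, so no rename(columns=...) call is needed.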