Reshape dataframe using melt, stack and MultiIndex? - pandas

I am new to Python and I have a dataframe that needs a somewhat complicated reshaping. It is best described with an example using dummy data.
The original dataframe is:
import pandas as pd

testdata = [('State', ['CA', 'FL', 'ON']),
            ('Country', ['US', 'US', 'CAN']),
            ('a1', [0.059485629, 0.968962817, 0.645435903]),
            ('b2', [0.336665658, 0.404398227, 0.333113735]),
            ('Test', ['Test1', 'Test2', 'Test3']),
            ('d', [20, 18, 24]),
            ('e', [21, 16, 25]),
            ]
df = pd.DataFrame(dict(testdata))  # DataFrame.from_items was removed in pandas 1.0
The dataframe I am after is:
testdata2 = [('State', ['CA', 'CA', 'FL', 'FL', 'ON', 'ON']),
             ('Country', ['US', 'US', 'US', 'US', 'CAN', 'CAN']),
             ('Test', ['Test1', 'Test1', 'Test2', 'Test2', 'Test3', 'Test3']),
             ('Measurements', ['a1', 'b2', 'a1', 'b2', 'a1', 'b2']),
             ('Values', [0.059485629, 0.336665658, 0.968962817, 0.404398227, 0.645435903, 0.333113735]),
             ('Steps', [20, 21, 18, 16, 24, 25]),
             ]
dfn = pd.DataFrame(dict(testdata2))
It looks like the solution likely requires melt, stack and a MultiIndex, but I am not sure how to bring them all together.
Any suggested solutions will be greatly appreciated.
Thank you.

Let's try:
df1 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['a1', 'b2'],
              value_name='Values', var_name='Measurements')
df2 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['d', 'e'],
              value_name='Steps').drop('variable', axis=1)
# both melts keep the same row order, so join on the index
# (merge cannot combine on= with left_index/right_index)
df1.merge(df2[['Steps']], left_index=True, right_index=True)
Output:
  State Country   Test Measurements    Values  Steps
0    CA      US  Test1           a1  0.059486     20
1    FL      US  Test2           a1  0.968963     18
2    ON     CAN  Test3           a1  0.645436     24
3    CA      US  Test1           b2  0.336666     21
4    FL      US  Test2           b2  0.404398     16
5    ON     CAN  Test3           b2  0.333114     25
Or use @JohnGalt's solution:
pd.concat([pd.melt(df, id_vars=['State', 'Country', 'Test'], value_vars=x) for x in [['d', 'e'], ['a1', 'b2']]], axis=1)

There is a way to do this using pd.wide_to_long, but you must rename your columns so that the Measurements column gets the correct values:
df1 = df.rename(columns={'a1': 'Values_a1', 'b2': 'Values_b2',
                         'd': 'Steps_a1', 'e': 'Steps_b2'})
pd.wide_to_long(df1,
                stubnames=['Values', 'Steps'],
                i=['State', 'Country', 'Test'],
                j='Measurements',
                sep='_',
                suffix=r'\w+').reset_index()  # suffix is a regex; '.' matches a single character and would miss 'a1'
  State Country   Test Measurements    Values  Steps
0    CA      US  Test1           a1  0.059486     20
1    CA      US  Test1           b2  0.336666     21
2    FL      US  Test2           a1  0.968963     18
3    FL      US  Test2           b2  0.404398     16
4    ON     CAN  Test3           a1  0.645436     24
5    ON     CAN  Test3           b2  0.333114     25
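Since the question title mentions stack and a MultiIndex, here is a minimal sketch of that route as well (assuming the same df as above; the column-to-measurement pairing in the tuples is read off the example data):
# label each value column with the quantity it holds and the measurement it belongs to
tmp = df.set_index(['State', 'Country', 'Test'])
tmp.columns = pd.MultiIndex.from_tuples(
    [('Values', 'a1'), ('Values', 'b2'), ('Steps', 'a1'), ('Steps', 'b2')],
    names=[None, 'Measurements'])
# stacking the Measurements level yields one row per (State, Country, Test, measurement)
out = tmp.stack('Measurements').reset_index()
The MultiIndex does the pairing here: columns that share the second-level label ('a1' or 'b2') end up in the same row after stacking.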

Related

Unable to run string functions on pandas Series values

I am trying to filter some rows from a dataframe.
import numpy as np
import pandas as pd

students = [('jack', 34, 'Sydney', 'Australia'),
            ('Riti', 30, 'Delhi', 'India'),
            ('Vikas', 31, 'Mumbai', 'India'),
            ('Neelu', 32, 'Bangalore', 'India'),
            ('John', 16, 'New York', 'US'),
            ('Mike', 17, 'las vegas', 'US')]
df = pd.DataFrame(students,
                  columns=['Name', 'Age', 'City', 'Country'],
                  index=['a', 'b', 'c', 'd', 'e', 'f'])
I am trying to filter records for which Country starts with 'I'. When I try to run this
print(df.loc[lambda x: np.char.startswith(x['Country'], 'I')])
it says
TypeError: string operation on non-string array
I even tried converting the column to string with
df.astype({'Country': str})
Please point out the mistake I am making.
Use str accessor:
>>> df[df['Country'].str.startswith('I')]
    Name  Age       City Country
b   Riti   30      Delhi   India
c  Vikas   31     Mumbai   India
d  Neelu   32  Bangalore   India
# OR df[df['Country'].str[0] == 'I']
You can read "Testing for strings that match or contain a pattern" in the pandas documentation to learn more.
Update
To fix your code, you have to convert the Country Series to an array with a string or unicode dtype (not object):
>>> df[np.char.startswith(df['Country'].to_numpy(str), 'I')]
    Name  Age       City Country
b   Riti   30      Delhi   India
c  Vikas   31     Mumbai   India
d  Neelu   32  Bangalore   India
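For reference, a minimal sketch of the underlying issue: the np.char routines only operate on arrays with a string or unicode dtype, and a pandas object column reaches them as dtype=object:
import numpy as np

arr_obj = np.array(['India', 'US'], dtype=object)  # what an object column yields
arr_str = np.array(['India', 'US'])                # dtype '<U5'
np.char.startswith(arr_str, 'I')    # array([ True, False])
# np.char.startswith(arr_obj, 'I') # TypeError: string operation on non-string array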

pandas.Series.mode returns ndarray instead of single value

I need to get a pd.Series of single values from the Series.mode function.
Example code:
df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
                   'key': [0, 1, 2, 3, 3, 3]})
modes = df.groupby('key')['A'].agg(pd.Series.mode)
key
0                  A0
1                  A1
2                  A2
3    ['A3' 'A4' 'A5']
Name: A, dtype: object
The problem is row 3: it returns a numpy.ndarray.
How should I modify my script to get single values in all rows? Any of the mode values (A3, A4, A5) would work for me.
You could explode the output to get duplicated indices:
modes = df.groupby('key')['A'].agg(pd.Series.mode).explode()
Output:
key
0    A0
1    A1
2    A2
3    A3
3    A4
3    A5
Name: A, dtype: object
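If you specifically want a single value per key (any of the tied modes, as the question allows), a minimal sketch is to take the first mode instead:
# mode() returns the tied modes in sorted order; iloc[0] picks one per group
modes = df.groupby('key')['A'].agg(lambda s: s.mode().iloc[0])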

Pandas conditional average by groups

Updated:
In this post, the first answer is very close to solving this problem too; however, it does not take columns A and C into account:
Pandas Average If in Python : Combining groupby mean with conditional statement
There is a DataFrame with 3 columns. I would like to add 2 new columns:
the rolling average of B by A and C (a window of 2 over the current and previous row that satisfy the condition, i.e. share the same A and C)
the rolling average of B by A and C (a window of 2 over the previous 2 rows that satisfy the condition, i.e. share the same A and C)
For the second part, I have a date and a sequence that could be used as the basis of the rolling-average calculation.
Any ideas?
df = pd.DataFrame({'A': ['t1', 't1', 't1', 't1', 't2', 't2', 't2', 't2', 't1'],
                   'B': [100, 104, 108, 110, 102, 110, 98, 100, 200],
                   'C': ['h', 'a', 'a', 'a', 'a', 'h', 'h', 'h', 'h'],
                   'expected1': [100, 104, 106, 109, 102, 110, 104, 99, 150],
                   'expected2': [0, 0, 104, 106, 0, 0, 110, 104, 100]},
                  columns=['A', 'B', 'C', 'expected1', 'expected2'])
df
Use one lazy groupby object for both new columns:
grp = df.groupby(['A', 'C'], sort=False)['B']
df['mean'] = grp.transform('mean')
# rolling returns a result indexed by (A, C, original row); drop the group
# levels so the assignment aligns on the original index
df['mean_avg'] = grp.rolling(2, min_periods=1).mean().droplevel(['A', 'C'])
Output:
>>> df
    A    B  C  expected1  expected2        mean  mean_avg
0  t1  100  h        100          0  150.000000     100.0
1  t1  104  a        104          0  107.333333     104.0
2  t1  108  a        106        104  107.333333     106.0
3  t1  110  a        109        106  107.333333     109.0
4  t2  102  a        102          0  102.000000     102.0
5  t2  110  h        110          0  102.666667     110.0
6  t2   98  h        104        110  102.666667     104.0
7  t2  100  h         99        104  102.666667      99.0
8  t1  200  h        150        100  150.000000     150.0
Note that mean_avg reproduces expected1.
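For completeness, here is a transform-based sketch that reproduces both expected columns from the question (same df as above; avg_current and avg_previous are hypothetical column names, and fillna(0) mirrors the zeros in expected2):
g = df.groupby(['A', 'C'], sort=False)['B']
# current + previous row within each (A, C) group
df['avg_current'] = g.transform(lambda s: s.rolling(2, min_periods=1).mean())
# previous two rows only: shift first, then roll; missing history becomes 0
df['avg_previous'] = g.transform(
    lambda s: s.shift().rolling(2, min_periods=1).mean()
).fillna(0)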

Merge two DataFrames on two columns and keep the original index order in the result

I have two pandas data frames. Both have two key columns and one value column. I want the merged result to keep the same order as the original indexes.
The keys and values might be missing or changed in the other data frame.
The order of the data is important; you can't just sort the merged result by keys or values.
It should look like this (df1_index / df2_index / results are just used for demonstration):
I tried to use merge with outer:
df1 = pd.DataFrame({
    "key1": ['K', 'K', 'A1', 'A2', 'B1', 'B9', 'C3'],
    "key2": ['a5', 'a4', 'a7', 'a9', 'b2', 'b8', 'c1'],
    "Value1": ['apple', 'guava', 'kiwi', 'grape', 'banana', 'peach', 'berry'],
})
df2 = pd.DataFrame({
    "key1": ['K', 'A1', 'A3', 'B1', 'C2', 'C3'],
    "key2": ['a9', 'a7', 'a9', 'b2', 'c7', 'c1'],
    "Value2": ['apple', 'kiwi', 'grape', 'banana', 'guava', 'orange'],
})
merged_df = pd.merge(df1, df2, how="outer", on=['key1', 'key2'])
but it just appended the unmatched keys at the end:
How do I merge and align them?
When constructing the merged dataframe, reset the index on each frame first so the original index values come through as columns (they become index_x and index_y after the merge):
merged_df = pd.merge(df1.reset_index(), df2.reset_index(), how="outer", on=['key1', 'key2'])
Use combine_first to combine index_x and index_y:
merged_df['combined_index'] = merged_df.index_x.combine_first(merged_df.index_y)
Sort by combined_index and index_x, drop the helper columns, and reset the index:
output = merged_df.sort_values(
    ['combined_index', 'index_x']
).drop(
    ['index_x', 'index_y', 'combined_index'], axis=1
).reset_index(drop=True)
This results in the following output:
  key1 key2  Value1  Value2
0    K   a5   apple     NaN
1    K   a9     NaN   apple
2    K   a4   guava     NaN
3   A1   a7    kiwi    kiwi
4   A3   a9     NaN   grape
5   A2   a9   grape     NaN
6   B1   b2  banana  banana
7   C2   c7     NaN   guava
8   B9   b8   peach     NaN
9   C3   c1   berry  orange
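The same steps condense into one method chain (a sketch over the frames above; order is a hypothetical helper column name):
out = (df1.reset_index()
          .merge(df2.reset_index(), how='outer', on=['key1', 'key2'])
          .assign(order=lambda d: d['index_x'].combine_first(d['index_y']))
          .sort_values(['order', 'index_x'])
          .drop(columns=['index_x', 'index_y', 'order'])
          .reset_index(drop=True))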

Is there a way to separate a column containing multiple data sets?

I'm new to this, and the dataframe I'm currently working with has four columns, all with object dtype. The last column contains multiple data points.
For example, the first row, last column contains:
[{"year":"1901","a":"A","b":"B"}] #printed in this format
Is there a way I can create a new column containing just the year, i.e. isolate this data?
Thanks in advance
With pandas, you can add a new column the same way you add a value to a dictionary, so this should work for you (each cell is a one-element list holding a dict, hence the i[0]):
df['year'] = [i[0]['year'] for i in df['last_column']]
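An equivalent vectorized sketch uses the .str accessor, which can index into lists and dicts element-wise ('last_column' is the placeholder name from the answer above; if the cells are actually JSON strings rather than Python objects, they need parsing first, as the commented line assumes):
import json

# list element 0, then dict key 'year'
df['year'] = df['last_column'].str[0].str['year']
# if the cells are JSON strings (assumption), parse them first:
# df['year'] = df['last_column'].map(json.loads).str[0].str['year']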
You can use df.apply() to get the dictionary value and assign it to a new column.
import pandas as pd

df = pd.DataFrame({'col1': ['Jack', 'Jill', 'Moon', 'Wall', 'Hill'],
                   'col2': [100, 200, 300, 400, 500],
                   'col3': [{"year": "1901", "a": "A", "b": "B"},
                            {"year": "1902", "c": "C", "d": "D"},
                            {"year": "1903", "e": "E", "f": "F"},
                            {"year": "1904", "g": "G", "h": "H"},
                            {"year": "1905", "i": "I", "j": "J"}]})
print(df)
df['year'] = df['col3'].apply(lambda x: x['year'])
print(df)
Output for the above code:
Original DataFrame:
   col1  col2                                  col3
0  Jack   100  {'year': '1901', 'a': 'A', 'b': 'B'}
1  Jill   200  {'year': '1902', 'c': 'C', 'd': 'D'}
2  Moon   300  {'year': '1903', 'e': 'E', 'f': 'F'}
3  Wall   400  {'year': '1904', 'g': 'G', 'h': 'H'}
4  Hill   500  {'year': '1905', 'i': 'I', 'j': 'J'}
Updated DataFrame:
   col1  col2                                  col3  year
0  Jack   100  {'year': '1901', 'a': 'A', 'b': 'B'}  1901
1  Jill   200  {'year': '1902', 'c': 'C', 'd': 'D'}  1902
2  Moon   300  {'year': '1903', 'e': 'E', 'f': 'F'}  1903
3  Wall   400  {'year': '1904', 'g': 'G', 'h': 'H'}  1904
4  Hill   500  {'year': '1905', 'i': 'I', 'j': 'J'}  1905
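If you need more than just the year, a sketch with pd.json_normalize expands every dict key into its own column at once (same df as above):
# one column per key; keys missing from a row become NaN
expanded = pd.json_normalize(df['col3'].tolist())
df['year'] = expanded['year']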