Updated:
In this post, the first answer comes very close to solving this problem too. However, it does not take columns A and C into account.
Pandas Average If in Python: Combining groupby mean with conditional statement
There is a DataFrame with 3 columns. I would like to add 2 new columns:
the rolling average of B by A and C (a rolling window of 2 over the current row and the previous row that satisfies the condition, i.e. the same A and C)
the rolling average of B by A and C (a rolling window of 2 over the previous 2 rows that satisfy the condition, i.e. the same A and C)
For the second part, I have a date and a sequence that could be used as the basis of the rolling average calculation.
Any ideas?
import pandas as pd

df = pd.DataFrame({'A': ['t1', 't1', 't1', 't1', 't2', 't2', 't2', 't2', 't1'],
                   'B': [100, 104, 108, 110, 102, 110, 98, 100, 200],
                   'C': ['h', 'a', 'a', 'a', 'a', 'h', 'h', 'h', 'h'],
                   'expected1': [100, 104, 106, 109, 102, 110, 104, 99, 150],
                   'expected2': [0, 0, 104, 106, 0, 0, 110, 104, 100]},
                  columns=['A', 'B', 'C', 'expected1', 'expected2'])
df
Use a lazy groupby object:
grp = df.groupby(['A', 'C'], sort=False)['B']
df['mean'] = grp.transform('mean')
df['mean_avg'] = grp.rolling(2, min_periods=1).mean().values
Output:
>>> df
A B C mean mean_avg
0 t1 100 h 100.000000 100.0
1 t1 104 a 107.333333 104.0
2 t1 108 a 107.333333 106.0
3 t1 110 a 107.333333 109.0
4 t2 102 a 102.000000 110.0
5 t2 110 h 102.666667 104.0
6 t2 98 h 102.666667 99.0
7 t2 100 h 102.666667 102.0
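One caveat with the .values assignment: the group-wise rolling result comes back in group order, so on the full 9-row frame from the question (where the 't1'/'h' group is split across rows 0 and 8) the values land on the wrong rows. A sketch that keeps index alignment and also produces the second expected column; roll1 and roll2 are placeholder names:
grp = df.groupby(['A', 'C'], sort=False)['B']

# current row + previous row with the same A and C;
# droplevel restores the original row index so the assignment stays aligned
df['roll1'] = grp.rolling(2, min_periods=1).mean().droplevel(['A', 'C'])

# previous two rows with the same A and C: shift inside each group, then roll;
# the question's expected2 fills the missing leading values with 0
df['roll2'] = grp.transform(lambda s: s.shift().rolling(2, min_periods=1).mean()).fillna(0)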
I'm pretty new to numpy arrays and was not able to find a good explanation / example for my issue. I saw things like take() or take_along_axis() but I didn't understand what was going on...
I have this 2D numpy array, which may contain N sub-arrays of 5 values each (h, s, i, x, y):
import numpy as np

values = np.array([
    [1, 2, 3, 4, 5],
    [1, 22, 33, 44, 55],
    [1, 22, 333, 444, 555],
    [1, 22, 333, 4444, 5555],
    [1, 222, 33, 44, 55],
    [1, 222, 330, 440, 550],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
])
As you can see, values can be repeated for the same index.
I want to regroup the sub-arrays by their index values, such as:
1, 2, 3, 4, 5
22, 33, 44, 55
333, 444, 555
4444, 5555
222, 33, 44, 55
330, 440, 550
10, 20, 30, 40, 50
100, 200, 300, 400, 500
The goal is to obtain a regular array like:
array = [1, 2, 3, 4 , 5, 22, 33, 44, 55, 333, 444, 555, 4444, 5555, 222, 33, 44, 55, 330, 440, 550, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500]
Thank you very much for your support.
You can use the flatten method:
list(values.flatten())
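Note that flatten() keeps every value, including the repeated leading indexes (1, 1, 22, ...), so it returns 40 numbers rather than the 31 in the expected array. If the intent is to drop, for each row, the prefix it shares with the previous row (which is what the expected output suggests), a small loop can do that; a minimal sketch under that assumption:
result = values[0].tolist()
for prev, row in zip(values, values[1:]):
    # skip the prefix this row shares with the previous row
    keep_from = 0
    while keep_from < len(row) and row[keep_from] == prev[keep_from]:
        keep_from += 1
    result.extend(row[keep_from:].tolist())

print(result)
# [1, 2, 3, 4, 5, 22, 33, 44, 55, 333, 444, 555, 4444, 5555,
#  222, 33, 44, 55, 330, 440, 550, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500]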
I'm new to this, and the dataframe I'm currently working with has four columns whose data are all stored as the object dtype. The last column contains multiple data points...
i.e. the first row, last column contains:
[{"year":"1901","a":"A","b":"B"}] #printed in this format
Is there a way I can create a new column containing just the year, i.e. isolate this data?
Thanks in advance
With pandas, you can add a new column the same way you add a value to a dictionary.
So this should work for you.
df['year'] = [i[0]['year'] for i in df['last_column']]
You can use df.apply() to get the dictionary value and assign it to a new column.
import pandas as pd
df = pd.DataFrame({'col1':['Jack','Jill','Moon','Wall','Hill'],
'col2':[100,200,300,400,500],
'col3':[{"year":"1901","a":"A","b":"B"},
{"year":"1902","c":"C","d":"D"},
{"year":"1903","e":"E","f":"F"},
{"year":"1904","g":"G","h":"H"},
{"year":"1905","i":"I","j":"J"}] })
print (df)
df['year'] = df['col3'].apply(lambda x: x['year'])
print (df)
Output for the above code:
Original DataFrame:
col1 col2 col3
0 Jack 100 {'year': '1901', 'a': 'A', 'b': 'B'}
1 Jill 200 {'year': '1902', 'c': 'C', 'd': 'D'}
2 Moon 300 {'year': '1903', 'e': 'E', 'f': 'F'}
3 Wall 400 {'year': '1904', 'g': 'G', 'h': 'H'}
4 Hill 500 {'year': '1905', 'i': 'I', 'j': 'J'}
Updated DataFrame:
col1 col2 col3 year
0 Jack 100 {'year': '1901', 'a': 'A', 'b': 'B'} 1901
1 Jill 200 {'year': '1902', 'c': 'C', 'd': 'D'} 1902
2 Moon 300 {'year': '1903', 'e': 'E', 'f': 'F'} 1903
3 Wall 400 {'year': '1904', 'g': 'G', 'h': 'H'} 1904
4 Hill 500 {'year': '1905', 'i': 'I', 'j': 'J'} 1905
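Both answers above assume each cell already holds a plain dict. In the question, the cell is shown as [{"year":"1901","a":"A","b":"B"}], i.e. a one-element list of dicts (or possibly still a raw JSON string). A hedged sketch for those two layouts, reusing the placeholder column name last_column from the first answer:
import json

# cell is a one-element list of dicts, e.g. [{"year": "1901", "a": "A", "b": "B"}]
df['year'] = df['last_column'].apply(lambda x: x[0]['year'])

# cell is still a raw JSON string, e.g. '[{"year":"1901","a":"A","b":"B"}]'
df['year'] = df['last_column'].apply(lambda x: json.loads(x)[0]['year'])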
I would like to know the best approach for the following pandas dataframe comparison task:
Two dataframes df_a and df_b, both having columns ['W','X','Y','Z']:
import pandas as pd
df_a = pd.DataFrame([
['a', 2, 2, 3],
['b', 5, 3, 5],
['b', 7, 6, 44],
['c', 3, 12, 19],
['c', 7, 13, 45],
['c', 3, 13, 45],
['d', 5, 11, 90],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
df_b = pd.DataFrame([
['a', 2, 2, 3],
['a', 4, 3, 15],
['b', 5, 12, 24],
['b', 7, 6, 44],
['c', 3, 12, 19],
['d', 3, 23, 45],
['d', 6, 11, 91],
['d', 9, 33, 44]
], columns=['W','X','Y','Z'])
Extract those rows from df_a that do not have a match in columns ['W','X'] in df_b
Extract those rows from df_b that do not have a match in columns ['W','X'] in df_a
Since I am kind of a newbie to pandas (and could not find any other source with information on this task), help is very much appreciated.
Thanks in advance.
The basic way is a left outer merge with indicator=True, selecting left_only with query:
cols = ['W', 'X']
df_a_only = (df_a.merge(df_b[cols], on=cols, indicator=True, how='left')
.query('_merge=="left_only"')[df_a.columns])
Out[87]:
W X Y Z
4 c 7 13 45
6 d 5 11 90
df_b_only = (df_b.merge(df_a[cols], on=cols, indicator=True, how='left')
.query('_merge=="left_only"')[df_b.columns])
Out[89]:
W X Y Z
1 a 4 3 15
6 d 3 23 45
7 d 6 11 91
Note: if your dataframes are huge, it is better to do one full outer merge rather than the 2 left outer merges above, selecting left_only and right_only accordingly. However, with a full outer merge you need to post-process the NaNs, convert floats back to integers, and rename columns.
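A sketch of that single full outer merge, with the post-processing the note mentions (restoring the column names and integer dtypes); the _a/_b suffixes are arbitrary:
cols = ['W', 'X']
full = df_a.merge(df_b, on=cols, how='outer', indicator=True, suffixes=('_a', '_b'))

# rows whose (W, X) only occur in df_a: keep the _a value columns,
# rename them back and restore the integer dtype lost to the NaNs
df_a_only = (full.query('_merge == "left_only"')
                 .rename(columns={'Y_a': 'Y', 'Z_a': 'Z'})[df_a.columns]
                 .astype({'Y': int, 'Z': int}))

# rows whose (W, X) only occur in df_b
df_b_only = (full.query('_merge == "right_only"')
                 .rename(columns={'Y_b': 'Y', 'Z_b': 'Z'})[df_b.columns]
                 .astype({'Y': int, 'Z': int}))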
I have two dataframes:
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
and
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
[31, 4, 5, 6, 75],
[73, 7, 8, 9, 23],
[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Now what I want is to intersect the two and only take the rows in df_large that do not match any row of df_small, hence the result should be:
df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Use DataFrame.merge with indicator=True and a left join; because duplicated rows in df_small would duplicate rows in the merged output, remove them first with DataFrame.drop_duplicates:
m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print (df)
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
Another, very similar solution filters with query and then removes the _merge column:
df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
.query('_merge != "both"')
.drop('_merge', axis=1))
Use DataFrame.merge:
df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
Output:
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
You can avoid merging and make your code a bit more readable; it's really not that clear what happens when you merge and drop duplicates.
Indexes and MultiIndexes were made for intersections and other set operations.
common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)

df_result = (df_large.set_index(common_columns)
                     .drop(index=df_small_as_Multiindex)  # drop the common rows
                     .reset_index())  # not needed if the a, b, c columns are meaningful indexes
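A closely related sketch that uses a boolean mask instead of set_index/drop; it keeps df_large's original index and does not raise if df_small happens to contain rows that are absent from df_large:
common_columns = df_large.columns.intersection(df_small.columns).to_list()

# True for every row of df_large whose (a, b, c) combination appears in df_small
in_small = pd.MultiIndex.from_frame(df_large[common_columns]).isin(
    pd.MultiIndex.from_frame(df_small))
df_result = df_large[~in_small]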
I am new to Python and I have a dataframe that needs a bit of complicated reshaping. It is best described with an example using dummy data.
The original dataframe is:
import pandas as pd

testdata = [('State', ['CA', 'FL', 'ON']),
('Country', ['US', 'US', 'CAN']),
('a1', [0.059485629, 0.968962817, 0.645435903]),
('b2', [0.336665658, 0.404398227, 0.333113735]),
('Test', ['Test1', 'Test2', 'Test3']),
('d', [20, 18, 24]),
('e', [21, 16, 25]),
]
df = pd.DataFrame(dict(testdata))  # pd.DataFrame.from_items was removed in pandas 1.0
The dataframe I am after is:
testdata2 = [('State', ['CA', 'CA', 'FL', 'FL', 'ON', 'ON']),
('Country', ['US', 'US', 'US', 'US', 'CAN', 'CAN']),
('Test', ['Test1', 'Test1', 'Test2', 'Test2', 'Test3', 'Test3']),
('Measurements', ['a1', 'b2', 'a1', 'b2', 'a1', 'b2']),
('Values', [0.059485629, 0.336665658, 0.968962817, 0.404398227, 0.645435903, 0.333113735]),
('Steps', [20, 21, 18, 16, 24, 25]),
]
dfn = pd.DataFrame(dict(testdata2))
It looks like the solution likely requires use of melt, stack and multiindex but I am not sure how to bring all those together.
Any suggested solutions will be greatly appreciated.
Thank you.
Let's try:
df1 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['a1', 'b2'],
              value_name='Values', var_name='Measurements')
df2 = df.melt(id_vars=['State', 'Country', 'Test'], value_vars=['d', 'e'],
              value_name='Steps').drop('variable', axis=1)
df1.merge(df2[['Steps']], left_index=True, right_index=True)  # the two melts share the same row order
Output:
State Country Test Measurements Values Steps
0 CA US Test1 a1 0.059486 20
1 FL US Test2 a1 0.968963 18
2 ON CAN Test3 a1 0.645436 24
3 CA US Test1 b2 0.336666 21
4 FL US Test2 b2 0.404398 16
5 ON CAN Test3 b2 0.333114 25
Or use @JohnGalt's solution:
pd.concat([pd.melt(df, id_vars=['State', 'Country', 'Test'], value_vars=x) for x in [['d', 'e'], ['a1', 'b2']]], axis=1)
There is a way to do this using pd.wide_to_long, but you must rename your columns so that the Measurements column ends up with the correct values:
df1 = df.rename(columns={'a1': 'Values_a1', 'b2': 'Values_b2', 'd': 'Steps_a1', 'e': 'Steps_b2'})
pd.wide_to_long(df1,
                stubnames=['Values', 'Steps'],
                i=['State', 'Country', 'Test'],
                j='Measurements',
                sep='_',
                suffix='.+').reset_index()
State Country Test Measurements Values Steps
0 CA US Test1 a1 0.059486 20
1 CA US Test1 b2 0.336666 21
2 FL US Test2 a1 0.968963 18
3 FL US Test2 b2 0.404398 16
4 ON CAN Test3 a1 0.645436 24
5 ON CAN Test3 b2 0.333114 25
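For completeness, a sketch of the stack-based route the question hints at; pairing a1 with d and b2 with e is taken from the expected output, and the intermediate name wide is arbitrary:
wide = df.set_index(['State', 'Country', 'Test'])

# pair the value columns with the step columns under a shared 'Measurements' level
wide.columns = pd.MultiIndex.from_tuples(
    [('Values', 'a1'), ('Values', 'b2'), ('Steps', 'a1'), ('Steps', 'b2')],
    names=[None, 'Measurements'])

dfn = (wide.stack('Measurements')
           .reset_index()[['State', 'Country', 'Test', 'Measurements', 'Values', 'Steps']])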