How to bin data from multiple columns? - pandas

I have the following df:
   1     2     3
0.11  0.25  0.74
0.32  0.93  0.26
0.44  0.28  0.76
0.15  0.29  0.79
etc.
I'm using bins:
bins = [0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1]
I created 3 bin columns and ran value_counts() on them, so I now know how many values fall into each bin for each of the 3 columns. But I'm having trouble plotting this as a bar plot; I'm looking for a triple bar graph (one bar per column in each bin):
df['Bin1'] = pd.cut(df['1'], bins)
df['Bin2'] = pd.cut(df['2'], bins)
df['Bin3'] = pd.cut(df['3'], bins)
Bin1_count = df['Bin1'].value_counts().values
Bin2_count = df['Bin2'].value_counts().values
Bin3_count = df['Bin3'].value_counts().values
x_axis = df['Bin1'].value_counts().index
sns.barplot(x=x_axis, y=[Bin1_count, Bin2_count, Bin3_count])

You can use melt first, then pd.cut and pd.crosstab, and plot directly from pandas:
meltdf = df.melt()
meltdf['value'] = pd.cut(meltdf['value'], bins)
pd.crosstab(meltdf['variable'], meltdf['value']).plot(kind='bar')
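As a usage note, if you want the bins on the x-axis with one bar per source column (the "triple bar graph" asked for), swap the crosstab arguments. A minimal, self-contained sketch using the sample values from the question:
import pandas as pd
import matplotlib.pyplot as plt

# Toy data shaped like the question's df (values copied from the question)
df = pd.DataFrame({'1': [0.11, 0.32, 0.44, 0.15],
                   '2': [0.25, 0.93, 0.28, 0.29],
                   '3': [0.74, 0.26, 0.76, 0.79]})
bins = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]

meltdf = df.melt()                             # wide -> long: (variable, value) pairs
meltdf['value'] = pd.cut(meltdf['value'], bins)

# Bins on the x-axis, one bar per original column within each bin
pd.crosstab(meltdf['value'], meltdf['variable']).plot(kind='bar')
plt.show()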

Related

return list by dataframe linear interpolation

I have a dataframe with, let's say, a few entries:
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order [moment, stress, strain], based on linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method from pandas, but it is used when you have NaN entries to fill in.
How do I accomplish a similar task in my case?
Thank you
One method is to append a new row with NaN values to your dataframe, sort it by strain, set strain as the index, and interpolate by index:
import numpy as np
import pandas as pd

# add the query point as a row of NaNs (df.append was removed in pandas 2.0)
df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")  # linear interpolation using the index values
print(df)
Prints:
        moment  stress
strain
0.11    0.1200   13.00
0.12    0.2300   14.00
0.45    0.4775   14.75
0.56    0.5600   15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
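If you only need the interpolated values rather than the modified frame, numpy's np.interp is a lighter alternative; a minimal sketch (it assumes strain is sorted ascending, which holds for the example data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'moment': [0.12, 0.23, 0.56],
                   'stress': [13, 14, 15],
                   'strain': [0.11, 0.12, 0.56]})

target = 0.45
# np.interp(x, xp, fp) requires xp (here: strain) to be increasing
result = [np.interp(target, df['strain'], df[col]) for col in ['moment', 'stress']]
result.append(target)
print(result)  # [0.4775, 14.75, 0.45]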

Python: How to do conditional rounding on dataframe column values?

import pandas as pd

data = {'Name': ['tom', 'bul', 'zack', 'doll', 'viru'],
        'price': [.2012, .05785, 2.03, 5.89, .029876]}
df = pd.DataFrame(data)
I want to round to 0 decimal places if the 'price' value is more than 1, and to 4 decimal places if the value is less than 1. Please suggest.
If there are many conditions, I prefer using numpy.select, as in the following (shown here with 2 decimal places to keep the printout short):
import numpy as np

df['price2'] = np.select(
    [df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2)],
)
# df
# Name price price2
# 0 tom 0.201200 0.20
# 1 bul 0.057850 0.06
# 2 zack 2.030000 2.00
# 3 doll 5.890000 6.00
# 4 viru 0.029876 0.03
With more conditions, we could do something like this:
df['price3'] = np.select(
    [df.price >= 3, df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2), df.price.round(3)],
)
# df
# Name price price2 price3
# 0 tom 0.201200 0.20 0.201
# 1 bul 0.057850 0.06 0.058
# 2 zack 2.030000 2.00 2.030
# 3 doll 5.890000 6.00 6.000
# 4 viru 0.029876 0.03 0.030
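For just two complementary conditions, numpy.where is a simpler alternative to np.select. A minimal sketch following the question's exact spec (0 decimals above 1, 4 decimals below; the column name price_rounded is my own):
import numpy as np
import pandas as pd

data = {'Name': ['tom', 'bul', 'zack', 'doll', 'viru'],
        'price': [.2012, .05785, 2.03, 5.89, .029876]}
df = pd.DataFrame(data)

# 0 decimal places when price >= 1, otherwise 4 decimal places
df['price_rounded'] = np.where(df['price'] >= 1,
                               df['price'].round(0),
                               df['price'].round(4))
print(df)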

Merging 2 or more data frames and transposing the result

I have several DFs derived from a pandas resampling (binning) process using the code below;
from datetime import timedelta

df2 = df.resample(rule=timedelta(milliseconds=250))['diffA'].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))['diffB'].mean().dropna()
...etc.
Every DF has a 'time' column in Datetime format (e.g. 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DFs have different 'time' bins. I am trying to concatenate all DFs into one, where some elements of the resulting DF may contain NaN.
The Datetime format of the DFs gives an error when using:
method 1) df4 = pd.merge(df2, df3, left_on='time', right_on='time')
method 2) pd.pivot_table(df2, values='diffA', index=['time'], columns='time').reset_index()
Once the DFs have been combined, I also want to transpose the resulting DF, where:
Rows are 'DiffA', 'DiffB', etc.
Columns are the time bins, accordingly.
I have tried the transpose() method on individual DFs, just to experiment, but I get an error because my time index is in Datetime format.
Once that is in place, I am looking for a way to extract rows of the resulting transposed DF as individual data series.
Please advise how I can achieve the above; I appreciate any feedback. Thank you for your help.
Data frames (two, for example):
time DiffA
2019-11-25 08:18:01.250 0.06
2019-11-25 08:18:01.500 0.05
2019-11-25 08:18:01.750 0.04
2019-11-25 08:18:02.000 0
2019-11-25 08:18:02.250 0.22
2019-11-25 08:18:02.500 0.06
time DiffB
2019-11-26 08:18:01.250 0.2
2019-11-27 08:18:01.500 0.05
2019-11-25 08:18:01.000 0.6
2019-11-25 08:18:02.000 0.01
2019-11-25 08:18:02.250 0.8
2019-11-25 08:18:02.500 0.5
The resulting merged DF should be as follows (text only):
time (first row):
2019-11-25 08:18:01.000,
2019-11-25 08:18:01.250,
2019-11-25 08:18:01.500,
2019-11-25 08:18:01.750,
2019-11-25 08:18:02.000,
2019-11-25 08:18:02.250,
2019-11-25 08:18:02.500,
2019-11-26 08:18:01.250,
2019-11-27 08:18:01.500
(second row)
diffA  nan  0.06  0.05  0.04  0     0.22  0.06  nan  nan
(third row)
diffB  0.6  nan   nan   nan   0.01  0.8   0.5   0.2  0.05
Solution
The core logic: outer-join each of the resampled dataframes on the column 'time' to merge them together. Setting 'time' as the index and transposing completes the solution.
I will use the dummy data created below for a reproducible solution.
Note: I have used df as the final dataframe and df0 as the original dataframe. My df0 is your df.
column_names = ['A', 'B', 'C', 'D', 'E']  # one label per sampled frame

df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    if i == 0:
        df = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    else:
        df_other = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
        df = pd.merge(df, df_other, on='time', how='outer')
print(df.set_index('time').T)
Dummy Data
import numpy as np
import pandas as pd

# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
time data
0 2020-02-01 6
1 2020-02-02 1
2 2020-02-03 7
3 2020-02-04 0
4 2020-02-05 8
5 2020-02-06 8
6 2020-02-07 1
7 2020-02-08 6
8 2020-02-09 2
9 2020-02-10 6
10 2020-02-11 8
11 2020-02-12 3
12 2020-02-13 0
13 2020-02-14 1
14 2020-02-15 0
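As a design note, because each resampled frame already carries time as its index, pd.concat with axis=1 performs the same outer join across any number of frames in one call; a minimal self-contained sketch with two stand-in series (timestamps taken from the question's example):
import pandas as pd

# Two small series indexed by time (stand-ins for the resampled output)
t1 = pd.to_datetime(['2019-11-25 08:18:01.250', '2019-11-25 08:18:01.500'])
t2 = pd.to_datetime(['2019-11-25 08:18:01.500', '2019-11-25 08:18:02.000'])
df2 = pd.Series([0.06, 0.05], index=t1, name='diffA')
df3 = pd.Series([0.05, 0.01], index=t2, name='diffB')

# axis=1 concat outer-joins on the shared time index in one call
merged = pd.concat([df2, df3], axis=1)

# Transpose: rows become diffA/diffB, columns become the time bins
transposed = merged.T
print(transposed)

# Pull out one row as an individual data series
diffA_series = transposed.loc['diffA']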

Pandas applying condition across columns for a large dataframe

I have a dataframe df which has data as follows:
Date Var Avg Run_1 Run_2 Run_3
2019-01-01 V1 3.16 3.41 3.84 3.17
2019-01-02 V2 66024 0 1 287
2019-01-03 V1 3.16 3.41 3.84 3.17
2019-01-04 V2 66024 0 1 287
The columns Run_1, Run_2, ... extend all the way to Run_500. Date is the index column.
I am trying to create another dataframe df2 from above which has the following:
Date        V1_M  K=Avg_V1*V1_M  Val1  Val2  Val3
2019-01-01 1.00 3.16 0.25 0 0
2019-01-02 1.01 3.19 0.22 0 0
2019-01-03 1.02 3.22 0.19 0 0
2019-01-04 1.03 3.25 0.16 0 0
The formula to get Val1, Val2, Val3, ..., Val500 is:
=MAX(Run_1_V1 - K, 0)*IF(Run_1_V2 > 0, 0, 1)
Avg_V1 refers to V1 variable from Avg column in df
Run_1_V1 refers to V1 from Run_1 column in df
My current approach gets stuck after this: with 500 Run_ columns, I am not sure how to apply the formula above to all of them without writing an explicit for loop:
v1 = df[df['Var'] == 'V1']
v2 = df[df['Var'] == 'V2']
Edit:
Formula for Val500:
=MAX(Run_500_V1 - K, 0)*IF(Run_500_V2 > 0, 0, 1)
Run_1_V2 refers to V2 from Run_1 column in df
You can try the numpy way. First, extract your runs matrix:
runs = df[[col for col in df.columns if col.startswith('Run_')]].values
Then, zero out all the values you don't want with a binary mask:
var_col = df.Var.str[1:].astype(int).values
mask = np.zeros((var_col.size, var_col.max()))
mask[np.arange(len(var_col)), var_col - 1] = 1
And apply the mask and the K factor:
values = runs * mask * new_df.K.to_numpy().reshape(-1, 1)
Then you can wrap the resulting array in the pd.DataFrame constructor.
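Alternatively, here is a rough vectorized sketch of the Val formula itself. It assumes the V1 and V2 rows pair up positionally and that K is already computed per V1 row (both assumptions, since the question leaves the pairing implicit), and uses 3 runs instead of 500:
import numpy as np
import pandas as pd

# Toy frame shaped like the question's df, with 3 runs instead of 500
df = pd.DataFrame({
    'Var':   ['V1', 'V2', 'V1', 'V2'],
    'Run_1': [3.41, 0.0, 3.41, 0.0],
    'Run_2': [3.84, 1.0, 3.84, 1.0],
    'Run_3': [3.17, 287.0, 3.17, 287.0],
})

run_cols = [c for c in df.columns if c.startswith('Run_')]
runs_v1 = df.loc[df['Var'] == 'V1', run_cols].to_numpy()  # one row per V1 date
runs_v2 = df.loc[df['Var'] == 'V2', run_cols].to_numpy()  # matching V2 rows

K = np.array([3.16, 3.19])  # hypothetical per-row K = Avg_V1 * V1_M values

# Val_i = max(Run_i_V1 - K, 0) * (0 if Run_i_V2 > 0 else 1), for all i at once
vals = np.maximum(runs_v1 - K[:, None], 0.0) * (runs_v2 <= 0)
df2 = pd.DataFrame(vals, columns=[f'Val{i + 1}' for i in range(len(run_cols))])
print(df2)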

How to access results from extractall on a dataframe

I have a dataframe df in which the column df.Type has dimension information about physical objects. The numbers appear inside a text string which I have successfully extracted using this code:
dftemp = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
But now the problem is that the results appear as:
              0
Unit match
5    0      0.02
     1      0.03
6    0      0.02
     1      0.02
7    0      0.02
...
How can I multiply these successive numbers (e.g. 0.02 * 0.03 = 0.0006) and insert the result into the original dataframe df as a new column, say df.Area, for each value of df.Type?
Thanks for your ideas!
I think you can do it with unstack and then prod along axis=1:
print(dftemp.unstack().prod(axis=1))
Then, if I'm not mistaken, Unit is the name of the index in df, so
df['Area'] = dftemp.unstack().prod(axis=1)
should create the column you are looking for.
With an example:
df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))
df['Area'] = (df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
                .unstack().prod(axis=1))
print(df)
Type Area
Unit
5 bla 0.03 dddd 0.02 jjk 0.0006
6 bli 0.02 kjhg 0.02 wait 0.0004
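An equivalent formulation, in case the unstack step feels opaque: group the extractall result by the original index level and take the product per group (the level name 'Unit' comes from the example above):
import pandas as pd

df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))

matches = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
# Collapse the (Unit, match) MultiIndex by multiplying all matches per Unit
df['Area'] = matches[0].groupby(level='Unit').prod()
print(df)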