Return a list from a dataframe by linear interpolation

I have a dataframe that has, let's say 5 entries.
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order of [moment, stress, strain], based on the linear interpolation of strain = 0.45
I have read a couple of threads about the interpolate() method from pandas, but it seems to be meant for filling in NaN entries that are already in the data.
How do I accomplish a similar task with my case?
Thank you

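One method is to add a new row to your dataframe with NaN values, sort by strain, and interpolate on the index:
import numpy as np
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
new_row = pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])
df = pd.concat([df, new_row], ignore_index=True)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)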
Prints:
        moment  stress
strain
0.11    0.1200   13.00
0.12    0.2300   14.00
0.45    0.4775   14.75
0.56    0.5600   15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
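If you would rather not add a row to the dataframe, the same numbers can probably be obtained with np.interp. This is only a sketch: the helper name interpolate_at_strain is made up for illustration, and it assumes the strain values are unique so they can be sorted into increasing order.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "moment": [0.12, 0.23, 0.56],
        "stress": [13, 14, 15],
        "strain": [0.11, 0.12, 0.56],
    }
)


def interpolate_at_strain(frame, strain_value):
    """Hypothetical helper: return [moment, stress, strain] at strain_value."""
    # np.interp needs increasing x values, so sort by strain first
    ordered = frame.sort_values("strain")
    x = ordered["strain"].to_numpy()
    moment = float(np.interp(strain_value, x, ordered["moment"].to_numpy()))
    stress = float(np.interp(strain_value, x, ordered["stress"].to_numpy()))
    return [moment, stress, float(strain_value)]


result = interpolate_at_strain(df, 0.45)
print(result)
```

Note that np.interp clamps to the end values outside the strain range, so it does not extrapolate.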

Related

Plotting interval of data in dataframe

A bit new to python so maybe code could be improved.
I have a txt file with x and y values, separated by some NaN in between.
Data goes from -x to x and then comes back (x to -x) but with somewhat different values of y, say:
x=np.array([-0.02,-0.01,0,0.01,0.02,NaN,1,NaN,0.02,0.01,0,-0.01,-0.02])
I would like to plot (matplotlib) the data up to the first NaN with one format, x=1 with another format, and the last set of data with a third format (color, marker, linewidth...).
Of course the data I have is a bit more complex, but I guess this is a useful simple approximation.
Any idea?
I'm using pandas as my data manipulation tool
You can create a group label taking the cumsum of where x is null. Then you can define a dictionary keyed by the label with values being a dictionary containing all of the plotting parameters. Use groupby to plot each group separately, unpacking all the parameters to set the arguments for that group.
Sample Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.array([-0.02,-0.01,0,0.01,0.02,np.nan,1,np.nan,0.02,0.01,0,-0.01,-0.02])
df = pd.DataFrame({'x': x})
Code
df['label'] = df.x.isnull().cumsum().where(df.x.notnull())
plot_params = {0: {'lw': 2, 'color': 'red', 'marker': 'o'},
               1: {'lw': 6, 'color': 'black', 'marker': 's'},
               2: {'lw': 9, 'color': 'blue', 'marker': 'x'}}
fig, ax = plt.subplots(figsize=(3,3))
for label, gp in df.groupby('label'):
    gp.plot(y='x', **plot_params[label], ax=ax, legend=None)
plt.show()
This is what df looks like for reference after defining the group label
print(df)
       x  label
0  -0.02    0.0
1  -0.01    0.0
2   0.00    0.0
3   0.01    0.0
4   0.02    0.0
5    NaN    NaN
6   1.00    1.0
7    NaN    NaN
8   0.02    2.0
9   0.01    2.0
10  0.00    2.0
11 -0.01    2.0
12 -0.02    2.0

Normalizing and denormalizing rows in a dataframe

I have a dataframe with 20k rows and 100 columns. I am trying to normalize my data across rows. Scikit's MinMaxScaler doesn't allow me to do this by rows. It has something called minmax_scale that allows row normalization but I cannot denormalize it later. At least, I don't see how to do it. How would you guys do it?
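The formula from the sklearn.preprocessing.minmax_scale documentation can be applied directly in pandas. If you store the min and max vectors, the scaling can be reversed later. Note that the code below scales each column: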
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 5],
                   'B': [88, 300, 200]})
# Find and store the min and max vectors (one value per column)
min_values = df.min()
max_values = df.max()
normalized_df = (df - min_values) / (max_values - min_values)
denormalized_df = normalized_df * (max_values - min_values) + min_values
df:
   A    B
0  1   88
1  2  300
2  5  200
normalized_df:
      A         B
0  0.00  0.000000
1  0.25  1.000000
2  1.00  0.528302
denormalized_df:
     A      B
0  1.0   88.0
1  2.0  300.0
2  5.0  200.0
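The code above scales per column, while the question asks about rows. A row-wise variant might look like the sketch below, which takes the min and max along axis=1 and broadcasts them with axis=0. The extra column C is made up so that each row has more than two values.

```python
import pandas as pd

# Illustrative data; column C is an assumption added for this sketch
df = pd.DataFrame({'A': [1, 2, 5],
                   'B': [88, 300, 200],
                   'C': [10, 20, 30]})

# One min and one max per row; keep them to undo the scaling later
row_min = df.min(axis=1)
row_max = df.max(axis=1)
row_range = row_max - row_min

# axis=0 aligns the per-row vectors with the row index
normalized_df = df.sub(row_min, axis=0).div(row_range, axis=0)
denormalized_df = normalized_df.mul(row_range, axis=0).add(row_min, axis=0)

print(normalized_df)
print(denormalized_df)
```

A row whose values are all equal has a range of zero and would come out as NaN, so that case needs separate handling.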

Merging 2 or more data frames and transposing the result

I have several DFs derived from a pandas binning process using the code below:
df2 = df.resample(rule=timedelta(milliseconds=250))[('diffA')].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))[('diffB')].mean().dropna()
.. etc
Every DF has a column containing 'time' in Datetime format (example: 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DFs will have different 'time' bins. I am trying to concatenate all DFs into one, where some elements of the resulting DF may be NaN.
The Datetime format of the DFs gives an error when using:
method 1) df4=pd.merge(df2,df3,left_on='time',right_on='time')
method 2) pd.pivot_table(df2, values = 'diffA', index=['time'], columns = 'time').reset_index()
Once the DFs have been combined, I also want to transpose the resulting DF, where:
Rows: are 'DiffA', 'DiffB', etc.
Columns: are the time bins accordingly.
I have tried the transpose() method on individual DFs, just to try, but I get an error because my time/index is in Datetime format.
Once that is in place, I am looking for a method to extract rows from the transposed DF as individual data series.
Please advise how I can achieve the above. I appreciate any feedback, thank you so much for your help.
Data frames ( 2 - for example )
time DiffA
2019-11-25 08:18:01.250 0.06
2019-11-25 08:18:01.500 0.05
2019-11-25 08:18:01.750 0.04
2019-11-25 08:18:02.000 0
2019-11-25 08:18:02.250 0.22
2019-11-25 08:18:02.500 0.06
time DiffB
2019-11-26 08:18:01.250 0.2
2019-11-27 08:18:01.500 0.05
2019-11-25 08:18:01.000 0.6
2019-11-25 08:18:02.000 0.01
2019-11-25 08:18:02.250 0.8
2019-11-25 08:18:02.500 0.5
resulting merged DF should be as follows ( text only);
time ( first row )
2019-11-25 08:18:01.000,
2019-11-25 08:18:01.250,
2019-11-25 08:18:01.500,
2019-11-25 08:18:01.750,
2019-11-25 08:18:02.000,
2019-11-25 08:18:02.250,
2019-11-25 08:18:02.500,
2019-11-26 08:18:01.250,
2019-11-27 08:18:01.500
(second row)
diffA nan 0.06 0.05 0.04 0 0.22 0.06 nan nan
(third row)
diffB 0.6 nan nan nan 0.01 0.8 0.5 0.2 0.05
Solution
The core logic: use an outer join on the column 'time' to merge the sampled dataframes together. Setting the index to the 'time' column and transposing then completes the solution.
I will use the dummy data I created below to create a reproducible solution.
Note: I have used df as the final dataframe and df0 as the original dataframe. My df0 is your df.
column_names = ['A', 'B', 'C', 'D', 'E']  # assumed labels; not defined in the original answer
df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    # each sample stands in for one resampled dataframe
    sampled = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    if i == 0:
        df = sampled
    else:
        # the outer join keeps every time stamp; missing values become NaN
        df = pd.merge(df, sampled, on='time', how='outer')
print(df.set_index('time').T)
Dummy Data
import numpy as np
import pandas as pd
# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
time data
0 2020-02-01 6
1 2020-02-02 1
2 2020-02-03 7
3 2020-02-04 0
4 2020-02-05 8
5 2020-02-06 8
6 2020-02-07 1
7 2020-02-08 6
8 2020-02-09 2
9 2020-02-10 6
10 2020-02-11 8
11 2020-02-12 3
12 2020-02-13 0
13 2020-02-14 1
14 2020-02-15 0
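Applied to the two example frames from the question, the outer join could be sketched as follows. This assumes each resampled result is a Series indexed by time, which is what resample(...)[col].mean() normally returns. The variable names are made up for illustration. pd.concat with axis=1 performs an outer join on the index by default.

```python
import pandas as pd

# Example data from the question, rebuilt as time-indexed Series (assumed shape)
diff_a = pd.Series(
    [0.06, 0.05, 0.04, 0.0, 0.22, 0.06],
    index=pd.to_datetime([
        "2019-11-25 08:18:01.250", "2019-11-25 08:18:01.500",
        "2019-11-25 08:18:01.750", "2019-11-25 08:18:02.000",
        "2019-11-25 08:18:02.250", "2019-11-25 08:18:02.500",
    ]),
    name="diffA",
)
diff_b = pd.Series(
    [0.2, 0.05, 0.6, 0.01, 0.8, 0.5],
    index=pd.to_datetime([
        "2019-11-26 08:18:01.250", "2019-11-27 08:18:01.500",
        "2019-11-25 08:18:01.000", "2019-11-25 08:18:02.000",
        "2019-11-25 08:18:02.250", "2019-11-25 08:18:02.500",
    ]),
    name="diffB",
)

# Outer join on the time index; bins missing from one Series become NaN
merged = pd.concat([diff_a, diff_b], axis=1).sort_index()

# Rows are diffA/diffB, columns are the time bins
transposed = merged.T

# Each row can be pulled out as its own Series
row_a = transposed.loc["diffA"]
row_b = transposed.loc["diffB"]
print(transposed)
```

If the frames keep 'time' as a regular column instead, calling set_index('time') on each one first should give the same starting point.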

How to bin data from multiple columns?

I have the following df
| 1 | 2 | 3 |
-------------------------
0.11 0.25 0.74
0.32 0.93 0.26
0.44 0.28 0.76
0.15 0.29 0.79
etc.
I'm using bins:
bins = [0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1]
I created 3 bin columns and ran a value_counts() on them. So now I know how many values are in each bin for each of these 3 columns. But I'm having trouble plotting this into a barplot. Looking for a triple bar graph
df['Bin1'] = pd.cut(df['1'], bins)
df['Bin2'] = pd.cut(df['2'], bins)
df['Bin3'] = pd.cut(df['3'], bins)
Bin1_count = df['Bin1'].value_counts().values
Bin2_count = df['Bin2'].value_counts().values
Bin3_count = df['Bin3'].value_counts().values
x_axis = df['Bin1'].value_counts().index
sns.barplot(x = x_axis, y = [Bin1_count,Bin2_count,Bin3_count])
You can use melt first, then pd.crosstab, and then try the plot method from pandas:
meltdf = df.melt()
meltdf['value'] = pd.cut(meltdf['value'], bins)
pd.crosstab(meltdf['variable'], meltdf['value']).plot(kind='bar')
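One reason the original attempt is hard to plot is that value_counts() sorts by frequency by default, so the counts from the three columns do not line up bin for bin. Here is a small self-contained sketch of the melt and crosstab idea, using the four sample rows from the question. It puts the bins on the rows so that each bin gets one bar per column. The names melted and counts are illustrative only.

```python
import pandas as pd

# The four sample rows shown in the question
df = pd.DataFrame({'1': [0.11, 0.32, 0.44, 0.15],
                   '2': [0.25, 0.93, 0.28, 0.29],
                   '3': [0.74, 0.26, 0.76, 0.79]})
bins = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]

# Long format: one row per (variable, value) pair
melted = df.melt()
melted['bin'] = pd.cut(melted['value'], bins)

# Count how many values of each column fall in each bin
counts = pd.crosstab(melted['bin'], melted['variable'])
print(counts)

# With matplotlib installed, a grouped bar chart would be:
# counts.plot(kind='bar')
```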

How to access results from extractall on a dataframe

I have a dataframe df in which the column df.Type has dimension information about physical objects. The numbers appear inside a text string which I have successfully extracted using this code:
dftemp = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
But now, the problem is that results appear as:
               0
Unit match
5    0      0.02
     1      0.03
6    0      0.02
     1      0.02
7    0      0.02
...
How can I multiply these successive numbers (e.g. 0.02 * 0.03 = 0.0006) and insert the result into the original dataframe df as a new column, say df.Area, for each value of df.Type?
Thanks for your ideas!
I think you can do it with unstack and then prod along axis=1 like
print (dftemp.unstack().prod(axis=1))
then if I'm not mistaken, Unit is the name of the index in df, so I would say that
df['Area'] = dftemp.unstack().prod(axis=1)
should create the column you look for.
With an example:
df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))
# raw string so that \d is not treated as a string escape
df['Area'] = (df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
                .unstack().prod(axis=1))
print(df)
                         Type    Area
Unit
5      bla 0.03 dddd 0.02 jjk  0.0006
6     bli 0.02 kjhg 0.02 wait  0.0004
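An alternative that skips the unstack step is to group the extracted numbers by the first index level and take the product. This is a sketch only: the third row without numbers is made up to show what happens when nothing matches.

```python
import pandas as pd

# Example from the answer plus one made-up row that contains no numbers
df = pd.DataFrame(
    {'Type': ['bla 0.03 dddd 0.02 jjk',
              'bli 0.02 kjhg 0.02 wait',
              'no numbers here']},
    index=pd.Index([5, 6, 7], name='Unit'))

# extractall returns one row per match, indexed by (Unit, match);
# column 0 holds the single capture group
numbers = df['Type'].str.extractall(r"([-+]?\d*\.\d+|\d+)")[0].astype(float)

# Multiply all matches belonging to the same Unit
df['Area'] = numbers.groupby(level=0).prod()
print(df)
```

Rows with no numeric match do not appear in the grouped result, so they get NaN in Area after index alignment.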