Is there a way using pandas functions to add values/rows by a particular increment?
For example:
This is what I have:
df = pd.DataFrame([1.1,2,2.8])
df
value other1 other2
zebra 0.3 250
bunny 0.7 10
rat 1.0 35
cat 1.1 100
dog 2.0 150
mouse 2.8 125
EDIT 1:
This is what I want, where ideally the inserted rows' index are whatever is easiest but the previous row names are preserved.
df_goal = pd.DataFrame([1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8])
df_goal
value other1 other2
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
1 1.2
2 1.3
3 1.4
4 1.5
5 1.6
6 1.7
7 1.8
8 1.9
dog 2.0 150
10 2.1
11 2.2
12 2.3
13 2.4
14 2.5
15 2.6
16 2.7
mouse 2.8 125
EDIT 2:
Also I would like to keep the values of other columns that were there previously and any new rows are simply empty or zero.
I think you can use reindex by numpy.arange:
#create index by value column
df = df.reset_index().set_index('value')
#reindex floatindex
s = 0.1
a = np.arange(df.index.min(),df.index.max() + s, step=s)
df = df.reindex(a, tolerance=s/2., method='nearest')
#replace NaN in another columns as index
cols = df.columns.difference(['index'])
df[cols] = df[cols].fillna('')
#replace NaN by range
s = pd.Series(np.arange(len(df.index)), index=df.index)
df['index'] = df['index'].combine_first(s)
#swap column with index
df = df.reset_index().set_index('index')
print (df)
value other1 other2
index
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
9 1.2
10 1.3
11 1.4
12 1.5
13 1.6
14 1.7
15 1.8
16 1.9
dog 2.0 150
18 2.1
19 2.2
20 2.3
21 2.4
22 2.5
23 2.6
24 2.7
mouse 2.8 125
Related
I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit1.value then 1
if value > limit1.value and <= limit2.value then 2
if value > limit2.value and <= limit3.value then 3
if value > limit3.value then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could work to get these 'if's to work but my dataframe has 2kk rows and I need the fasted way to perform such classification.
I tried to use .cut function but the result was not what I expected/wanted
Thanks
We can use the rank method over the column axis (axis=1):
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limi_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select:
import numpy as np
conditions = [df["value"]<df["limit_1"],
df["value"].between(df["limit_1"], df["limit_2"]),
df["value"].between(df["limit_2"], df["limit_3"]),
df["value"]>df["limit_3"]]
df["CLASS"] = np.select(conditions, [1,2,3,4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
I have a dataframe with different id and possible overlapping time with the time step of 0.4 second. I would like to resample the average speed for each id with the time step of 0.8 second.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example can be created by the following code
x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10)*0.4
b = np.arange(15)*0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = pd.DataFrame(np.arange(25) * 3)
The time column is transferred to datetime type by
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8//0.4)
blocks = df.groupby('id').cumcount() // block_size
df.groupby(['id',blocks]).agg({'time':'first', 'speed':'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
I have two dataframes: (i) One has two indexes and two headers, and (ii) the other one has one index and one header. The second level of each axis in the first dataframe relates to each axis of the second dataframe. I need to multiply both dataframes based on that relation between the axis.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9,10,2,1,6,5],
[4, 0,3,4,6,6],
[9, 3,9,1,2,3],
[3, 5,9,3,9,0],
[4,4,8,5,10,5],
[5, 3,1,8,5,6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3+[2021]*3,[1,2,3,1,2,3]])
df.index = pd.MultiIndex.from_arrays([[1]*3+[2]*3,[1,2,3,1,2,3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
I have to find the number of cycles within a column in my data frame (A cycle is defined when the variable goes from initial to some max value and again starts from some initial value). Whenever the variable has repeated values, I just average over them. In the desired data frame, I am appending the filter cycle number to that SNo as a suffix to know which cycle the given SNo is in. I need to get the min and the max for a given cycle and SNo (It is not predefined)
An example of the data frame and the desired data frame are as follows:
SNo VarPer Value
1000 0 1.2
1000 1 2.2
1000 2 3.2
1000 3 4.2
1000 4 5.2
1000 4 6.2
1000 5 7.2
1000 5 8.2
1000 0 0.9
1000 1 1.9
1000 2 2.9
1000 3 3.9
1000 3 4.9
1000 4 5.9
1001 0 0.5
1001 1 1.5
1001 2 2.5
1001 2 3.5
1001 0 1
1001 1 1
1001 2 1
SNo VarPer Value
1000_1 0 1.2
1000_1 1 2.2
1000_1 2 3.2
1000_1 3 4.2
1000_1 4 5.7
1000_1 5 7.7
1000_2 0 0.9
1000_2 1 1.9
1000_2 2 2.9
1000_2 3 4.4
1000_2 4 5.9
1001_1 0 0.5
1001_1 1 1.5
1001_1 2 3
1001_2 0 1
1001_2 1 1
1001_2 2 1
I have already tried the following:
y = dat.groupby(['SNo','VarPer'], as_index=False)['Value'].mean()
But this is grouping the entire thing without considering the cycles. I have about 70000 rows of data, so I need something that isn't terribly slow. Please help!
As #Peter Leimbigler noted, I'm also not clear about the logic for how the suffix is generated. I would think 1000_3 through 1000_6 should all be 1000_2.
To use a groupby, you will need to create a new grouping with something like this:
for _, values in df.groupby('SNo'):
group_label = 0
for row in values.index:
if df.loc[row, 'VarPer'] !=0:
df.loc[row, 'group'] = group_label
else:
group_label+=1
df.loc[row, 'group'] = group_label
EDIT: You probably shouldn't use a loop for writing directly to the dataframe. Instead, you can create a list and then create a new column using that list. This will be faster.
new_grouping = []
for _, values in df.groupby('SNo'):
label = 0
group = []
for row in values.index:
if df.loc[row, 'VarPer'] !=0:
group.append(label)
else:
label+=1
group.append(label)
new_grouping.extend(group)
df['group'] = new_grouping
That won't be fast but perhaps you (or someone else) can vectorize it.
Then you can use a groupby to get your averaged values:
df = df.groupby(['SNo','group'],as_index = False])["VarPer"].mean().reset_index()
If your suffixes are actually supposed to be as I describe above, you can do:
df['SNo'] = df['SNo'].map(str) +'_' + df['group'].map(lambda x: str(int(x)).zfill(2))
This will give you:
SNo group VarPer Value
1000_1 1.0 0 1.2
1000_1 1.0 1 2.2
1000_1 1.0 2 3.2
1000_1 1.0 3 4.2
1000_1 1.0 4 5.7
1000_1 1.0 5 7.7
1000_2 2.0 0 0.9
1000_2 2.0 1 1.9
1000_2 2.0 2 2.9
1000_2 2.0 3 4.4
1000_2 2.0 4 5.9
1001_1 1.0 0 0.5
1001_1 1.0 1 1.5
1001_1 1.0 2 3.0
1001_2 2.0 0 1.0
1001_2 2.0 1 1.0
1001_2 2.0 2 1.0
Hi I would like to unify columns in the same dataframe to one column such as:
col1 col2
1 1.4 1.5
2 2.3 2.6
3 3.6 6.7
to
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Thanks for your help
Use stack, then remove level by reset_index and last create one column DataFrame by to_frame:
df = df.stack().reset_index(level=1, drop=True).to_frame('col1&2')
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Or:
df = pd.DataFrame({'col1&2': df.values.reshape(1,-1).ravel()}, index=np.repeat(df.index, 2))
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7