Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-05', inclusive='left')
values = np.arange(len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)
            A  B
2017-01-01  0  0
2017-01-02  1  2
2017-01-03  2  4
2017-01-04  3  6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
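window_size = 3  # two input time steps plus one target
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()  # A and B become rows, labelled in an "index" column
        .assign(other_index=i)  # number of the window
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]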
# Output
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
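For longer frames, the same windows can probably be built without a Python-level loop. The following is only a sketch, assuming NumPy 1.20 or later (for sliding_window_view) and that all columns share one numeric dtype; the names span, windows and samples are made up for this illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Same sample frame as in the question
timestamps = pd.date_range(start="2017-01-01", periods=4, freq="D")
values = np.arange(len(timestamps))
df = pd.DataFrame({"A": values, "B": values * 2}, index=timestamps)

window_size = 2          # number of input time steps
span = window_size + 1   # inputs plus one target step

# Resulting shape is (n_windows, n_columns, span): one slice per window and column
windows = sliding_window_view(df.to_numpy(), span, axis=0)
n_windows, n_cols, _ = windows.shape

# One row per (window number, column name), in the same order as the reshape below
index = pd.MultiIndex.from_product([range(n_windows), df.columns])
labels = [f"timestep_{j}" for j in range(1, span)] + ["target"]
samples = pd.DataFrame(
    windows.reshape(n_windows * n_cols, span), index=index, columns=labels
)
print(samples)
```

The reshape copies the data, so samples does not share memory with df.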
I want to count the number of occurrences of one specific value (string) in one column and write it down in another column cumulatively.
For example, counting the cumulative number of Y values here:
col_1 new_col
Y 1
Y 2
N 2
Y 3
N 3
I wrote this code but it gives me the final number instead of cumulative frequencies.
df['new_col'] = 0
df['new_col'] = df.loc[df.col_1 == 'Y'].count()
To count both values cumulatively you can use:
df['new_col'] = (df
If you want to focus on 'Y':
df['new_col'] = (df
.fillna(0, downcast='infer')
col_1 new_col
0 Y 1
1 Y 2
2 N 2
3 Y 3
4 N 3
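As a minimal sketch of the same idea, assuming the column only needs a running count of 'Y', a boolean mask can be summed cumulatively. The frame below just rebuilds the example from the question; per_value is a hypothetical extra column showing a running count for each value separately.

```python
import pandas as pd

# Rebuild the example column from the question
df = pd.DataFrame({"col_1": ["Y", "Y", "N", "Y", "N"]})

# True counts as 1 and False as 0, so cumsum gives the running number of 'Y'
df["new_col"] = df["col_1"].eq("Y").cumsum()

# Running count within each value (Y and N counted independently)
df["per_value"] = df.groupby("col_1").cumcount().add(1)
print(df)
```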
As you can see, I have a dataframe with several columns that share the same name but are prefixed with 0., 1., and so on up to 27.
How can I take all the values of and have it under
Thank you very much!
Assuming that for all 0<=n<=27 the column names' suffixes are the same, one solution can be:
import pandas as pd
import re
# pattern to extract the column name suffix (\d+ also matches two-digit prefixes such as 27)
pattern = re.compile(r'^\d+\.([\w\.]+)')
# getting all the distinct column names/fields
fields = set([pattern.match(colname).group(1) for colname in df.columns])
# max prefix number, for you 27
n = 27
partitions = []
for i in range(0, n + 1):
    # creating column selector for partitions
    columns_for_partition = list(map(lambda field: str(i) + f'.{field}', fields))
    # get partition from dataframe and renaming column to field name (removing n. prefix)
    partition = df[columns_for_partition].rename(lambda x: x.split('.', 1)[1], axis=1)
    # collect the partition so it can be concatenated below
    partitions.append(partition)
new_df = pd.concat(partitions)
With an initial dataframe df
  0.name  0.something 1.name  1.something
0      a            1      d            4
1      b            2      e            5
2      c            3      f            6
The resulting dataframe new_df will look like:
  name  something
0    a          1
1    b          2
2    c          3
0    d          4
1    e          5
2    f          6
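Here is a self-contained sketch of the same approach on the two-prefix example, with one variation that is not part of the answer above: the prefixes and field names are read from the column names instead of hard-coding n. The names parsed, prefixes and fields are only illustrative, and every column is assumed to follow the number.field pattern.

```python
import re
import pandas as pd

# Small stand-in for the real frame (only prefixes 0 and 1)
df = pd.DataFrame({
    "0.name": ["a", "b", "c"], "0.something": [1, 2, 3],
    "1.name": ["d", "e", "f"], "1.something": [4, 5, 6],
})

# Split every column name into its numeric prefix and its field name
pattern = re.compile(r"^(\d+)\.(.+)$")
parsed = [pattern.match(col) for col in df.columns]

prefixes = sorted({int(m.group(1)) for m in parsed})
# dict.fromkeys drops duplicates while keeping first-seen order
fields = list(dict.fromkeys(m.group(2) for m in parsed))

# One partition per prefix, with the prefix stripped from the column names
partitions = [
    df[[f"{p}.{f}" for f in fields]].set_axis(fields, axis=1)
    for p in prefixes
]
new_df = pd.concat(partitions)
print(new_df)
```

Passing ignore_index=True to pd.concat would give a fresh 0..n-1 index instead of repeating the original row labels.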
I have 3 data frames:
Using pandas and numpy, I want to get this output in df1 from the multiplication:
The conditions are:
The value of the new column will be:
# this is not code, just the formula
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- A new column is added to df1, named after the column in df3.
- For each row of df1, the values of a, b and c are multiplied by the same-named q values from df2 and summed together with the corresponding value of df3.
- Only df1 columns whose names match a df2 column are multiplied. Unmatched columns, such as df1["k"], are not.
- However, if there is a 0 in df1["a"], the corresponding output should be zero.
I am struggling with this. It was tough to explain also. My attempts are very silly. I know this attempt will not work. However, I have added this:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df1["w_ave"] = (
    df3.loc["v", "w_ave"]
    + df1["a"].mul(df2.loc["q", "a"])
    + df1["b"].mul(df2.loc["q", "b"])
    + df1["c"].mul(df2.loc["q", "c"])
)
   id  k  a  b  c  w_ave
0   1  2  1  5  1     16
1   2  3  0  1  0      4
2   3  6  1  1  0      5
3   4  1  0  5  0     12
4   5  1  1  5  0     13
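The formula could also be generalised so that the matching columns are found by name rather than written out. This is a sketch only: the three frames below are hypothetical stand-ins for the CSV files, rebuilt from the values visible in the output above, and w_ave_zeroed is a made-up column name showing one possible reading of the "zero in df1["a"]" condition from the question.

```python
import pandas as pd

# Hypothetical stand-ins for Sample_data1/2/3.csv
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "k": [2, 3, 6, 1, 1],
    "a": [1, 0, 1, 0, 1],
    "b": [5, 1, 1, 5, 5],
    "c": [1, 0, 0, 0, 0],
})
df2 = pd.DataFrame({"a": [1], "b": [2], "c": [3]}, index=["q"])
df3 = pd.DataFrame({"w_ave": [2]}, index=["v"])

# Columns present in both frames (a, b, c); id and k are left out automatically
shared = df1.columns.intersection(df2.columns)

# Multiply each shared column by its q weight, then add up row by row
weighted = df1[shared].mul(df2.loc["q", shared], axis=1).sum(axis=1)
df1["w_ave"] = df3.loc["v", "w_ave"] + weighted

# Assumed reading of the extra condition: force 0 wherever df1["a"] is 0
df1["w_ave_zeroed"] = df1["w_ave"].where(df1["a"].ne(0), 0)
print(df1)
```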
I have a pandas dataframe with column names like this:
id ColNameOrig_x ColNameOrig_y
There are many such columns. The 'x' and 'y' suffixes came about because 2 datasets with similar column names were merged.
What I need to do:
df.ColName = df.ColNameOrig_x + df.ColNameOrig_y
I am now manually repeating this line for many columns (close to 50). Is there a wildcard way of doing this?
You can use DataFrame.filter to select the merged columns, then DataFrame.groupby with axis=1 and sum to aggregate per column name. The grouping key can come from a lambda function, from text functions like Series.str.split with indexing, or from plain string slicing:
df1 = df.filter(like='_').groupby(lambda x: x.split('_')[0], axis=1).sum()
print(df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
df1 = df.filter(like='_').groupby(df.columns.str.split('_').str[0], axis=1).sum()
print(df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
df1 = df.filter(like='_').groupby(df.columns.str[:12], axis=1).sum()
print(df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
You can use the subscripting syntax to access column names dynamically:
col_groups = ['ColName1', 'ColName2']
for grp in col_groups:
    df[grp] = df[f'{grp}Orig_x'] + df[f'{grp}Orig_y']
Or you can aggregate by column group. For example:
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])
# Here's your opportunity to define the wildcard
col_groups = df.columns.str.extract('(.+)Orig_[xy]')[0]
df.columns = [col_groups, df.columns]
df.groupby(level=0, axis=1).sum()
Input:
   ColName1Orig_x  ColName1Orig_y  ColName2Orig_x  ColName2Orig_y
0               1               2               3               4
1               5               6               7               8
Result:
   ColName1  ColName2
0         3         7
1        11        15
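As far as I know, groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later), so the calls above may emit a warning or stop working on newer versions. A sketch of a possible workaround is to transpose, group the rows, and transpose back. The sample frame is the one from the example above; summed is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=["ColName1Orig_x", "ColName1Orig_y", "ColName2Orig_x", "ColName2Orig_y"],
)

# After .T the column names are row labels, so a plain row-wise groupby works.
# The key function strips the "Orig_x" / "Orig_y" tail from each label.
summed = (
    df.filter(like="_")
    .T.groupby(lambda name: name.rsplit("Orig_", 1)[0])
    .sum()
    .T
)
print(summed)

# Optionally attach the sums to the original frame
df = df.join(summed)
```

Transposing copies the data and can change dtypes when the columns are of mixed types, so this is best suited to all-numeric column groups like the ones here.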