SettingWithCopyError when scaling subset of columns with StandardScaler - pandas

I have a dataframe df with 100 columns:
index | col1 | col2 | col3 | ...
2021-04-01 | qwe | 1 | 1.1 | ...
2021-04-02 | asd | 2 | 2.2 | ...
2021-04-03 | yxc | 3 | 3.3 | ...
dtypes:
col1: category
col2: int32
col3: float64
I want to scale all columns that are not of type "category" AND return it as a dataframe, not a numpy array.
My code so far:
y_feature = "col2"
y = df[[y_feature]] # Set predictor y
X = df.drop(
[
y_feature,
],
axis=1,
)
days = (
pd.date_range(start=df.index.min(), end=df.index.max())
.to_frame(name="date")
.reset_index()
.drop("index", axis=1)
)
limit_training_days = int(len(days.index) * 0.85)
X_train_limit = days.iloc[limit_training_days, 0]
print(f"Date for training: {X_train_limit}")
X_train, y_train = (
X.query("date <= #X_train_limit").squeeze(),
y.query("date <= #X_train_limit").squeeze(),
)
X_test, y_test = (
X.query("date > #X_train_limit").squeeze(),
y.query("date > #X_train_limit").squeeze(),
)
categorical_feature = X_train.select_dtypes("category").columns.tolist()
num_cols = X.drop(categorical_feature, axis=1).columns.tolist()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_sc[num_cols] = scaler.transform(X_test[num_cols])
After updating my packages, the last two lines now throw this error:
SettingWithCopyError: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I scale only non-category columns (keeping category columns untouched) AND return it as a dataframe?

The problem is that you are trying to modify X_train and X_test, which are slices of a bigger dataframe. Try:
X_train, X_test = X_train.copy(), X_test.copy()
before scaling.
You can also do:
X_train, y_train = (
    X.query("date <= @X_train_limit").squeeze().copy(),  # here
    y.query("date <= @X_train_limit").squeeze(),
)
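For completeness, a minimal sketch of the whole scaling step (reusing the scaler and num_cols from the question; the X_train_sc/X_test_sc names are just illustrative). Taking explicit copies and assigning through .loc keeps the result a DataFrame, leaves the category columns untouched, and avoids the SettingWithCopyError:
# work on independent copies so the assignments below do not
# write into a view of the original frame
X_train_sc = X_train.copy()
X_test_sc = X_test.copy()

# .loc with an explicit column list writes back into the copies
# and leaves the category columns as they are
X_train_sc.loc[:, num_cols] = scaler.fit_transform(X_train_sc[num_cols])
X_test_sc.loc[:, num_cols] = scaler.transform(X_test_sc[num_cols])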

Related

Comparison of values in Dataframes with different size

I have a DataFrame in which I want to compare the speed of certain IDs under different conditions.
Boundary conditions:
IDs do not have to be present in every condition,
an ID is not present in every condition with the same frequency.
My goal is to assign, per condition, whether the speed became
larger (speed > speed in Cond_A + 10%),
smaller (speed < speed in Cond_A - 10%), or
stayed the same ((speed < speed in Cond_A + 10%) & (speed > speed in Cond_A - 10%)).
The data
import numpy as np
import pandas as pd
data1 = {
    'ID': [1, 1, 1, 2, 3, 3, 4, 5],
    'Condition': ['Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A'],
    'Speed': [1.2, 1.05, 1.2, 1.3, 1.0, 0.85, 1.1, 0.85],
}
df1 = pd.DataFrame(data1)

data2 = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Condition': ['Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B'],
    'Speed': [0.8, 0.55, 0.7, 1.15, 1.2, 1.4],
}
df2 = pd.DataFrame(data2)

data3 = {
    'ID': [1, 2, 3, 4, 6],
    'Condition': ['Cond_C', 'Cond_C', 'Cond_C', 'Cond_C', 'Cond_C'],
    'Speed': [1.8, 0.99, 1.7, 131, 0.2],
}
df3 = pd.DataFrame(data3)

lst_of_dfs = [df1, df2, df3]
# creating a DataFrame object
data = pd.concat(lst_of_dfs)
My goal is to achieve a result like this:
Condition ID Speed Category
0 Cond_A 1 1.150 NaN
1 Cond_A 2 1.300 NaN
2 Cond_A 3 0.925 NaN
3 Cond_A 4 1.100 NaN
4 Cond_A 5 0.850 NaN
5 Cond_B 1 0.800 faster
6 Cond_B 2 0.550 slower
7 Cond_B 3 0.700 slower
8 Cond_B 4 1.150 equal
...
My attempt:
Calculate average of speed for each ID per condition
data = data.groupby(["Condition", "ID"]).mean()["Speed"].reset_index()
Definition of thresholds, assuming I want thresholds of up to 10 percent around the Cond_A values:
threshold_upper = data.loc[(data.Condition == 'CondA')]['Speed'] + (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
threshold_lower = data.loc[(data.Condition == 'CondA')]['Speed'] - (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
Mapping strings 'faster', 'equal', 'slower' based on condition using numpy select.
conditions = [
    (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_upper), #check whether Speed of each ID in CondB is faster than Speed in CondA+10%
    (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_upper), #check whether Speed of each ID in CondC is faster than Speed in CondA+10%
    ((data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_lower)), #check whether Speed of each ID in CondB is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
    ((data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_lower)), #check whether Speed of each ID in CondC is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
    (data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper), #check whether Speed of each ID in CondB is slower than Speed in CondA-10%
    (data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper), #check whether Speed of each ID in CondC is faster than Speed in CondA-10%
]
values = [
    'faster',
    'faster',
    'equal',
    'equal',
    'slower',
    'slower',
]
data['Category'] = np.select(conditions, values)
This produces the error: ValueError: Length of values (0) does not match length of index (16)
My data frames unfortunately have different lengths (since not all IDs performed trials in every condition). I appreciate any hint. Many thanks in advance.
# Dataframe created
data
ID Condition Speed
0 1 Cond_A 1.20
1 1 Cond_A 1.05
2 1 Cond_A 1.20
# Reset the index
data = data.reset_index(drop=True)
# Creating a group number based on ID
data['group'] = data.groupby(['ID']).ngroup()
# Creating functions which return the upper and lower limits of speed
def lowlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 0.9

def upperlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 1.1
# Calculate the upperlimit and lowerlimit for the groups
df = pd.DataFrame()
df['ul'] = data.groupby('group').apply(lambda x: upperlimit(x))
df['ll'] = data.groupby('group').apply(lambda x: lowlimit(x))
# reseting the index
# So that we can merge the values of 'group' column
df = df.reset_index()
# Merging the data and df dataframe
data_new = pd.merge(data,df,on='group',how='left')
data_new
ID Condition Speed group ul ll
0 1 Cond_A 1.20 0 1.2650 1.0350
1 1 Cond_A 1.05 0 1.2650 1.0350
2 1 Cond_A 1.20 0 1.2650 1.0350
3 2 Cond_A 1.30 1 1.4300 1.1700
Now we have to apply the conditions
data_new.loc[(data_new['Speed'] >= data_new['ul']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'larger'
data_new.loc[(data_new['Speed'] <= data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'smaller'
data_new.loc[(data_new['Speed'] < data_new['ul']) & (data_new['Speed'] > data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'Same'
Here is the output
You can drop the other columns now, if you want
data_new = data_new.drop(columns=['group','ul','ll'])
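A more compact variation on the same idea (a sketch, assuming data has been reset with reset_index(drop=True) as above) uses groupby().transform() to broadcast the Cond_A mean per ID, so no merge is needed:
# mean Cond_A speed per ID, broadcast to every row of that ID
base = (data['Speed']
        .where(data['Condition'] == 'Cond_A')
        .groupby(data['ID']).transform('mean'))

mask = data['Condition'] != 'Cond_A'
data.loc[mask & (data['Speed'] >= base * 1.1), 'Category'] = 'larger'
data.loc[mask & (data['Speed'] <= base * 0.9), 'Category'] = 'smaller'
data.loc[mask & (data['Speed'] < base * 1.1) & (data['Speed'] > base * 0.9), 'Category'] = 'Same'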

How to update a pandas column

Given the following dataframe
col1 col2
1 ('A->B', 'B->C')
2 ('A->D', 'D->C', 'C->F')
3 ('A->K', 'K->M', 'M->P')
...
I want to convert this to the following format
col1 col2
1 'A-B-C'
2 'A-D-C-F'
3 'A-K-M-P'
...
Each sequence shows an arc within a path. Hence, the sequence is like (a,b), (b,c), (c,d) ...
def merge_values(val):
    val = [x.split('->') for x in val]
    out = []
    for char in val:
        out.append(char[0])
    out.append(val[-1][1])
    return '-'.join(out)

df['col2'] = df['col2'].apply(merge_values)
print(df)
Output:
col1 col2
0 1 A-B-C
1 2 A-D-C-F
2 3 A-K-M-P
Given
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [
        ('A->B', 'B->C'),
        ('A->D', 'D->C', 'C->F'),
        ('A->K', 'K->M', 'M->P'),
    ],
You can do:
def combine(t, old_sep='->', new_sep='-'):
    if not t:
        return ''
    if isinstance(t, str):
        t = [t]
    tokens = [x.partition(old_sep)[0] for x in t]
    tokens.append(t[-1].partition(old_sep)[-1])
    return new_sep.join(tokens)

df['col2'] = df['col2'].apply(combine)
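If the tuples are guaranteed to be well-formed chains of single arcs, an equivalent one-liner (a sketch under that assumption) is:
df['col2'] = df['col2'].apply(
    lambda t: '-'.join([a.split('->')[0] for a in t] + [t[-1].split('->')[1]])
)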

outliers with groupby in pandas

I have data that look like this (toy data):
import pandas as pd
import numpy as np
N = 5
dfi = pd.DataFrame()
for i in range(5):
    df = pd.DataFrame(index=pd.date_range("20100101", periods=N, freq='M'))
    df['price'] = np.random.randint(0, N, size=(len(df)))
    df['quantity'] = np.random.randint(0, N, size=(len(df)))
    df['type'] = 'P' + str(i)
    dfi = pd.concat([df, dfi], axis=0)
dfi
From this I would like to compute a corrected price per type, i.e. something like:
new_price(t) = (1 + perf(t)) * new_price(t-1)
with:
new_price(0) = price(0)
and
perf(t) = price(t)/price(t-1) - 1 if abs(price(t)/price(t-1) - 1) < s else 0
I tried:
dfi['prix_corr'] = (dfi
    .sort_index()
    .groupby('type').price
    .apply(lambda x: x.pct_change() if x.pct_change().abs() <= 0.5 else 0)
)
but get an error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I would like to correct for outliers in each group's time series. Any suggestions?
Given your input, you could try using a custom function in your lambda expression such as:
def compute_price_change(x):
    mask = x.pct_change().abs() > 0.5
    x = x.pct_change()
    x[mask] = 0
    return x

dfi['prix_corr'] = (dfi
    .groupby('type').price
    .apply(compute_price_change)
)
Output:
price quantity type prix_corr
2010-01-31 3 0 P4 NaN
2010-02-28 3 2 P4 0.0
2010-03-31 0 2 P4 -0.5
2010-04-30 2 4 P4 0.5
2010-05-31 2 2 P4 0.0
2010-01-31 1 2 P3 NaN
2010-02-28 4 3 P3 0.0
2010-03-31 0 0 P3 0.0
2010-04-30 4 0 P3 0.0
2010-05-31 2 2 P3 0.0
. . . . .
. . . . .
. . . . .
Since .pct_change() returns NaN for the first entry, you might want to handle that in some way as well.
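If you also want the corrected price itself (the new_price recursion from the question), one way to get it is a grouped cumulative product of the filtered returns. This is a sketch that assumes the rows of each type are already in chronological order, prix_corr has been computed as above, and the 'new_price' column name is just illustrative:
# starting price of each type, broadcast over its rows
first_price = dfi.groupby('type')['price'].transform('first')
# new_price(t) = (1 + perf(t)) * new_price(t-1); the NaN at each
# group start is treated as a zero return
growth = (1 + dfi['prix_corr'].fillna(0)).groupby(dfi['type']).cumprod()
dfi['new_price'] = first_price * growth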

Limitation of Keras/Tensorflow for solving Linear Regression tasks

I was trying to implement linear regression in Keras/TensorFlow and was very surprised how difficult it is. The standard examples work great on random data. However, if we change the input data a little bit, all the examples stop working correctly.
I am trying to find the coefficients of y = 0.5 * x1 + 0.5 * x2.
import numpy as np
from sklearn import preprocessing
from tensorflow import keras  # or: import keras

np.random.seed(1443)
n = 100000
x = np.zeros((n, 2))
y = np.zeros((n, 1))
x[:, 0] = sorted(preprocessing.scale(np.random.poisson(1000000, (n))))
x[:, 1] = sorted(preprocessing.scale(np.random.poisson(1000000, (n))))
y = (x[:, 0] + x[:, 1]) / 2

model = keras.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), dtype="float32"))
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(x, y, epochs=1000, batch_size=64)
print(model.get_weights())
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
| 1000 | 64 | -5.83E-05 | 0.90410435 | 0.09594361
| 1000 | 1024 | -5.71E-06 | 0.98739249 | 0.01258729
| 1000 | 10000 | -3.07E-07 | -0.2441376 | 1.2441349
My first thought was that it is a bug in Keras. So, I tried R/Tensorflow library:
floatType <- "float32"
p <- 2L
X <- tf$placeholder(floatType, shape = shape(NULL, p), name = "x-data")
Y <- tf$placeholder(floatType, name = "y-data")
W <- tf$Variable(tf$zeros(list(p, 1L), dtype=floatType))
b <- tf$Variable(tf$zeros(list(1L), dtype=floatType))
Y_hat <- tf$add(tf$matmul(X, W), b)
cost <- tf$reduce_mean(tf$square(Y_hat - Y))
generator <- tf$train$GradientDescentOptimizer(learning_rate=0.01)
optimizer <- generator$minimize(cost)
session <- tf$Session()
session$run(tf$global_variables_initializer())
set.seed(1443)
n <- 10^5
x <- matrix( replicate(p, sort(scale((rpois(n, 10^6))))) , nrow = n )
y <- matrix((x[,1]+x[,2])/2)
i <- 1
batch_size <- 10000
epoch_number <- 1000
iterationNumber <- n*epoch_number / batch_size
while (iterationNumber > 0) {
  feed_dict <- dict(X = x[i:(i+batch_size-1), , drop = F], Y = y[i:(i+batch_size-1), , drop = F])
  session$run(optimizer, feed_dict = feed_dict)
  i <- i + batch_size
  if (i > n - batch_size)
    i <- i %% batch_size
  iterationNumber <- iterationNumber - 1
}
r_model <- lm(y ~ x)
tf_coef <- c(session$run(b), session$run(W))
r_coef <- r_model$coefficients
print(rbind(tf_coef, r_coef))
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
|2000 | 64 | -1.33E-06 | 0.500307 | 0.4996932
|1000 | 1000 | 2.79E-08 | 0.5000809 | 0.499919
|1000 | 10000 | -4.33E-07 | 0.5004921 | 0.499507
|1000 | 100000 | 2.96E-18 | 0.5 | 0.5
TensorFlow finds the correct result only when the batch size equals the number of samples and the optimizer is SGD. With "adam" or "adagrad" the errors were much larger.
For obvious reasons I cannot choose the hyperparameter batch_size = n. Could you recommend any approach to solve this problem with precision 1E-07 for Keras or TensorFlow?
Why does TensorFlow find better solutions than Keras?
Comment 1.
Based on the answer by "today" below: shuffling the training dataset significantly improves the performance of the TensorFlow version:
shuffledIndex<-sample(1:(nrow(x)))
x <- x[shuffledIndex,]
y <- y[shuffledIndex,,drop=FALSE]
For batch size = 2000:
|(Intercept) | x1 | x2
|----------------+-----------+----------
|-1.130693e-09 | 0.5000004 | 0.4999989
The problem is that you are sorting the generated random numbers for each feature, so the two feature columns end up very close to each other:
>>> np.mean(np.abs(x[:,0]-x[:,1]))
0.004125721684553685
As a result we would have:
y = (x1 + x2) / 2
~= (x1 + x1) / 2
= x1
= 0.5 * x1 + 0.5 * x1
= 0.3 * x1 + 0.7 * x1
= -0.3 * x1 + 1.3 * x1
= 10.1 * x1 - 9.1 * x1
= thousands of other possible combinations
In this case the solution that Keras would converge to would really depend on the initial value of the weights and bias of Dense layer. With different initial values you would get different results (and possibly for some of them, it may not converge at all):
# set the initial weight of Dense layer
model.layers[0].set_weights([np.array([[0], [1]]), np.array([0])])
# fit the model ...
# the final weights
model.get_weights()
[array([[0.00203656],
[0.9981099 ]], dtype=float32),
array([4.5520876e-05], dtype=float32)] # because: y = 0 * x1 + 1 * x1 = x1 ~= (x1 + x2) / 2
# again set the weights to something different
model.layers[0].set_weights([np.array([[0], [0]]), np.array([1])])
# fit the model...
# the final weights
model.get_weights()
[array([[0.49986306],
[0.50013727]], dtype=float32),
array([1.4176634e-08], dtype=float32)] # the one you were looking for!
However, if you don't sort the features (i.e. just remove sorted) it is very likely that the converged weights to be very close to [0.5, 0.5].
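As a quick sanity check (a sketch, not code from the original answer; it assumes a tf.keras installation), dropping sorted so the two features stay independent lets the same model recover weights close to [0.5, 0.5]:
import numpy as np
from sklearn import preprocessing
from tensorflow import keras

np.random.seed(1443)
n = 100000
x = np.zeros((n, 2))
# no sorted(...) here, so x1 and x2 stay (nearly) independent
x[:, 0] = preprocessing.scale(np.random.poisson(1000000, n))
x[:, 1] = preprocessing.scale(np.random.poisson(1000000, n))
y = (x[:, 0] + x[:, 1]) / 2

model = keras.Sequential([keras.layers.Dense(1, input_shape=(2,))])
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(x, y, epochs=10, batch_size=64, verbose=0)
print(model.get_weights())  # weights should land close to [[0.5], [0.5]], bias near 0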

Pandas custom file format

I have a huge Pandas DataFrame that I need to write out in a format that RankLib can understand. An example with a target, a query ID and 3 features is this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5m rows this is obviously very slow. Is there a better approach using the I/O of Pandas itself?
You can do it this way:
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score', 'srch_id'] + feature_columns]
 .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
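If the per-column apply is still too slow at this scale, a fully vectorized variant (a sketch, assuming the same df and feature_columns; the output filename is just an example) builds each "i:value" piece as a string Series and joins them row-wise:
# build each field as a string Series, then concatenate them with spaces
parts = [df['score'].astype(str), 'qid:' + df['srch_id'].astype(str)]
parts += [f'{i+1}:' + df[col].astype(str) for i, col in enumerate(feature_columns)]

lines = parts[0].str.cat(parts[1:], sep=' ')
with open('out.ranklib', 'w') as fh:
    fh.write('\n'.join(lines) + '\n')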