I want to use TimeSeriesSplit from sklearn on the following dataframe to predict sum:
So to prepare X and y I do the following:
X = df.drop(['sum'],axis=1)
y = df['sum']
and then feed these two to:
for train_index, test_index in tscv.split(X):
    X_train01, X_test01 = X[train_index], X[test_index]
    y_train01, y_test01 = y[train_index], y[test_index]
by doing so, I get the following error:
KeyError: '[ 0 1 2 ...] not in index'
Here X is a dataframe, and apparently this is what causes the error, because if I convert X to an array as follows:
X = X.values
Then it will work. However, for later evaluation of the model I need X as a dataframe. Is there any way that I can keep X as a dataframe and feed it to tscv without converting it to an array?
As @Jarad rightly said, if you have an up-to-date version of pandas, it will not automatically fall back to integer-based indexing as was possible in previous versions. You need to explicitly use .iloc for integer-based slicing.
for train_index, test_index in tscv.split(X):
    X_train01, X_test01 = X.iloc[train_index], X.iloc[test_index]
    y_train01, y_test01 = y.iloc[train_index], y.iloc[test_index]
See https://pandas.pydata.org/pandas-docs/stable/indexing.html
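For completeness, here is a minimal runnable sketch (using made-up data, since the original dataframe isn't shown) that keeps X as a DataFrame throughout:

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the original dataframe.
df = pd.DataFrame({'a': range(10), 'b': range(10, 20), 'sum': range(20, 30)})

X = df.drop(['sum'], axis=1)   # X stays a DataFrame
y = df['sum']                  # y stays a Series

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # split() yields positional integer indices, so .iloc is the right accessor
    X_train01, X_test01 = X.iloc[train_index], X.iloc[test_index]
    y_train01, y_test01 = y.iloc[train_index], y.iloc[test_index]
    print(type(X_train01))     # still <class 'pandas.core.frame.DataFrame'>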
I'm dealing with a CSV file that consists of 2 columns and 51 rows in total.
data = pd.read_csv("data.csv", sep = ',')
data.columns=['x_column', 'y_column']
Then I perform linear regression:
from sklearn.linear_model import LinearRegression

X = data.iloc[:, 0].values.reshape(-1, 1)
y = data.iloc[:, 1].values.reshape(-1, 1)
lr = LinearRegression()
The next thing I need to perform is the Tukey method.
X = data.iloc[[0], :].values
y = data.iloc[[1], :].values
Then I plotted the boxes and found that my range is between -40 and 10.
import matplotlib.pyplot as plt

data.boxplot(return_type='dict')
plt.plot()
I need to assign my outliers to a value in order to remove them before training my dataset again. And this is where I have a problem.
y_column = X[:, 1]
data_outliers = (y_column > 0.0)
data[data_outliers]
When I run this last part I get the error "Item wrong length 1 instead of 50." and I don't know how to solve it. Any help is appreciated.
Try:
data_outliers = (y_column > 0.0).ravel()
The problem was that data_outliers was a two-dimensional numpy array (shape (1, 50)), and a mask of that shape cannot be used to index the dataframe; ravel() simply flattens it to one dimension.
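A small self-contained sketch (with synthetic data standing in for the CSV) of why flattening the mask matters:

import numpy as np
import pandas as pd

# Synthetic stand-in for the 50 data rows.
data = pd.DataFrame({'x_column': np.arange(50),
                     'y_column': np.random.uniform(-40, 10, 50)})

y_2d = data.iloc[:, 1].values.reshape(-1, 1)   # shape (50, 1): two-dimensional
mask = (y_2d > 0.0)                            # also (50, 1); data[mask] fails
mask_flat = (y_2d > 0.0).ravel()               # shape (50,): a valid boolean mask

outliers = data[mask_flat]                     # rows above the cutoff
cleaned = data[~mask_flat]                     # rows kept for retraining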
In the example below (Tensorflow 2.0), we have a dummy tensorflow dataset with three elements. We map a function on it (replace_with_float) that returns a randomly generated value in two copies. As we expect, when we take elements from the dataset, the first and second coordinates have the same value.
Now, we create two "slice" datasets from the first coordinates and the second coordinates, respectively, and we zip the two datasets back together. The slicing and the zipping operations seem to be inverses of each other, so I would expect the resulting dataset to be equivalent to the previous one. However, as we see, the first and second coordinates are now different randomly generated values.
Maybe even more interestingly, if we zip the "same" dataset with itself by
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: x))), the two coordinates will also have different values.
How can this behavior be explained? Perhaps two different graphs are constructed for the two datasets to be zipped and they are run independently?
import tensorflow as tf

def replace_with_float(element):
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)

print('Before zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())

df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: y)))

print('After zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())
Sample output:
Before zipping:
0.08801079 0.08801079
0.638958 0.638958
0.800568 0.800568
After zipping:
0.9676769 0.23045003
0.91056764 0.6551999
0.4647777 0.6758332
The short answer is that datasets don't cache intermediate values between full iterations, unless you explicitly request that using df.cache(), and they don't deduplicate common inputs either.
So in the second loop, the entire pipeline runs again.
Similarly, in the second instance, the two df.map calls cause df to run twice.
Adding a tf.print helps explain what happens:
def replace_with_float(element):
    rand = tf.random.uniform([])
    tf.print('replacing', element, 'with', rand)
    return (rand, rand)
I've also pulled the lambdas onto separate lines to avoid the autograph warning:
first = lambda x, y: x
second = lambda x, y: y
df = tf.data.Dataset.zip((df.map(first), df.map(second)))
Before zipping:
replacing 0 with 0.624579549
0.62457955 0.62457955
replacing 0 with 0.471772075
0.47177207 0.47177207
replacing 0 with 0.394005418
0.39400542 0.39400542
After zipping:
replacing 0 with 0.537954807
replacing 0 with 0.558757305
0.5379548 0.5587573
replacing 0 with 0.839109302
replacing 0 with 0.878996611
0.8391093 0.8789966
replacing 0 with 0.0165234804
replacing 0 with 0.534951568
0.01652348 0.53495157
To avoid the duplicate input problem, you can use a single map call:
swap = lambda x, y: (y, x)
df = df.map(swap)
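As a sketch, the single-map pipeline would look like this (reusing replace_with_float from above and the dummy dataset from the question); since there is only one pipeline over df, each element draws its random value only once per pass:

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)   # (rand, rand)
df = df.map(swap)                 # still one pipeline, so one draw per element

for x in df:
    print(x[0].numpy(), x[1].numpy())   # the two coordinates match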
Or you can use df = df.cache() to avoid both effects:
df = df.map(replace_with_float)
df = df.cache()
Before zipping:
replacing 0 with 0.728474379
0.7284744 0.7284744
replacing 0 with 0.419658661
0.41965866 0.41965866
replacing 0 with 0.911524653
0.91152465 0.91152465
After zipping:
0.7284744 0.7284744
0.41965866 0.41965866
0.91152465 0.91152465
I have a dataframe with 1k rows and 14 columns, where each cell contains a numpy array, as shown below.
Here is a subset of 2 rows and 3 columns:
[5,4,74,-12] [ 78,1,2,-9] [5 ,1,1,2]
[10,4,4,-1] [ 8,15,21,-19] [1,1,0,0]
where each cell is a numpy array of shape (4,1).
I couldn't find the right placeholder to feed my whole dataframe into, as it needs to be processed in row batches. I tried this to find the proper placeholder, but it's not correct:
x = tf.placeholder(tf.int32, [None, 14], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
It gives ValueError: setting an array element with a sequence.
Does anyone have an idea, please?
You did not specify which format your data is in, so I assume it is a numpy array. In that case, you can do it like this:
n_columns = 14
n_elements_per_column = 4
x = tf.placeholder(tf.int32, [None, n_columns, n_elements_per_column], name='x')
with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data}))
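Note that the dataframe itself can't be fed directly: its cells are (4, 1) object-dtype arrays, which is exactly what triggers "setting an array element with a sequence". A hedged sketch of the conversion, assuming Data is the dataframe of (4, 1) arrays described in the question:

import numpy as np

# Each row of Data holds 14 cells, each a (4, 1) numpy array;
# ravel() turns every cell into shape (4,), giving an (n_rows, 14, 4) array.
Data_array = np.array([[cell.ravel() for cell in row]
                       for row in Data.to_numpy()])

with tf.Session() as sess:
    print(sess.run(x, feed_dict={x: Data_array}))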
I am trying to build a random forest classifier for binary classification. My data is imbalanced, hence I am performing undersampling.
train = data.drop(['Co_Name','Cust_ID','Phone','Shpr_ID','Resi_Cnt','Buz_Cnt','Nearby_Cnt','parseNumber','removeString','Qty','bins','Adj_Addr','Resi','Weight','Resi_Area','Lat','Lng'], axis=1)
Y = data['Resi']
from sklearn import metrics
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train, Y)
I am getting the error below:
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
How can I fix this?
Can you share the dataframe, or a sample of it? This error can have many causes. For example, if you try:
import numpy as np

np.asarray(
    [
        [1, 2],
        [2, 3, 4]
    ],
    dtype=float)  # np.float is deprecated in recent numpy; plain float is equivalent
You will get:
ValueError: setting an array element with a sequence.
This is because the nested lists have inconsistent lengths: the second row has three elements while the first has two, so numpy cannot build a rectangular array from them.
But your error is probably related to the shapes of train and Y, or to the dtypes inside train. The undersampler's fit function performs a conversion to float64 (visible in the traceback) that throws this error. Confirm that train contains only numeric, scalar values before running RandomUnderSampler.
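A quick way to check that, assuming train is the feature frame from the question:

# Columns with object dtype are the usual culprits: they often hold strings
# or nested sequences that cannot be cast to float64.
print(train.dtypes)

bad_cols = train.columns[train.dtypes == object]
print(list(bad_cols))   # encode or drop these before undersampling, e.g.:
# train = pd.get_dummies(train, columns=list(bad_cols))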
In the case of an n x n matrix mat, I can do the following:
sym = 0.5 * (mat + mat.T)
The operation gives the desired result: sym[i,j] = sym[j,i].
Suppose we have a 3D array ndarr[i,j,k], where i, j, k = 0, 1, ..., n-1, so that ndarr is n x n x n. The idea is to obtain the following "symmetric" form from ndarr: nsym[i,j,k] = nsym[j,i,k]. I tried this:
import numpy as np
# Generate some random matrix, n = 5
ndarr = np.random.beta(0.1,1,(5,5,5))
# First attempt to symmetrize
sym1 = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(5)])
The problem here is that sym1[i,j,k] != sym1[j,i,k] as required. In fact, I obtain sym1[i,j,k] = sym1[i,k,j], i.e. symmetry under exchange of the last two indices!
# Second attempt
sym2 = 0.5*(ndarr+ndarr.T)
Same problem here: sym2 is symmetric under exchange of the first and third indices, sym2[i,j,k] = sym2[k,j,i].
To summarize, the goal is to symmetrize a 3D array in its first two indices, for each fixed value of the third index, while preserving the original diagonal values ndarr[i,i,i].
The problem here is that you're not using the correct transpose:
sym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))
By default, np.transpose and the .T property will reverse the order of the axes. In your case, we want to only flip the first two axes: (0,1,2) -> (1,0,2).
EDIT: The reason your first attempt failed is that np.array stacks each symmetrized slice along a new first axis, not the last. It's clearer if you make ndarr with shape (5, 5, 3):
In [16]: sym = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(3)])
In [17]: sym.shape
Out[17]: (3L, 5L, 5L)
In any case, the version above with np.transpose is cleaner and more efficient.
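As a quick sanity check of both requirements (symmetry in the first two indices, unchanged diagonal):

import numpy as np

ndarr = np.random.beta(0.1, 1, (5, 5, 5))
nsym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))

# Symmetric under exchange of the first two indices: nsym[i,j,k] == nsym[j,i,k]
assert np.allclose(nsym, np.transpose(nsym, (1, 0, 2)))
# The diagonal ndarr[i,i,i] is preserved:
assert np.allclose(np.einsum('iii->i', nsym), np.einsum('iii->i', ndarr))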