How to import a CSV file, split it 70/30 and then use first column as my 'y' value? - pandas

I am having an issue at the moment; I think I'm making it far more complicated than it needs to be. My CSV file is 31 columns by 500 rows. I need to import it, split it in a 70/30 ratio, and then be able to use the first column as my 'y' value for a neural network, with the remaining 30 columns as my 'x' values.
I've implemented the code below to do this, but when I run it through my basic sigmoid and testing functions, it produces results in a weird format, e.g. [6.54694655e-06].
I believe this is due to how I import and split the data, which I think I have done wrong. I need to import the data into arrays that are readable by my functions, and be able to separate the first column out as a 'y' value. How do I go about this?
import pandas as pd

df = pd.read_csv(r'data.csv', header=None)
df.to_numpy()
#splitting data 70/30
trainingdata= df[:329]
testingdata= df[:141]
# converting data to separate arrays for training and testing
training_features= trainingdata.loc[:, trainingdata.columns != 0].values.reshape(329,30)
training_labels = trainingdata[0]
training_labels = training_labels.values.reshape(329,1)
testing_features = testingdata[0]
testing_labels = testingdata.loc[:, testingdata.columns != 0]

Usually, for splitting a dataframe into train and test data, I use sklearn.model_selection.train_test_split. Documentation here.
Some other methods are described here. Hope this helps!

Make your train/test split easy by using sklearn.model_selection.train_test_split.
If you don't have sklearn installed, first install it by running pip install -U scikit-learn.
Then
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(r'data.csv', header=None)
# X is your features, y is your target column
X = df.loc[:,1:]
y = df.loc[:,0]
# Use train_test_split function with test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
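If your network functions expect plain NumPy arrays (as in the code in the question), you can convert the split pieces afterwards; a minimal sketch:
# Convert the pandas objects to NumPy arrays for the network
X_train_arr = X_train.to_numpy()                 # shape (n_train, 30)
y_train_arr = y_train.to_numpy().reshape(-1, 1)  # labels as a column vector
X_test_arr = X_test.to_numpy()
y_test_arr = y_test.to_numpy().reshape(-1, 1)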

import pandas as pd

df = pd.read_csv(r'data.csv', header=None)
arr = df.to_numpy()  # to_numpy() returns a new array; it does not modify df in place
print(arr)

Related

What is the sequence for preprocessing text df with tensorflow?

I have a pandas data frame containing two columns, sentences and annotations:
Col 0 | Sentence                  | Annotation
1     | [This, is, sentence]      | [l1, l2, l3]
2     | [This, is, sentence, too] | [l1, l2, l3, l4]
There are several things I need to do:
- split into features and labels
- split into train-val-test data
- vectorize the train data using:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=maxlen,
    standardize='lower',
    split='whitespace',
    ngrams=(1, 3),
    output_mode='tf-idf',
    pad_to_max_tokens=True,
)
I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information in the tensors. Specifically, at what point do I have to split into features and labels, and how do I access one or the other? Also, should I split into features and labels before splitting into train-val-test (I want to do it right and not use sklearn's train_test_split when I work with tensorflow), or is it the opposite?
You can split your dataset before creating the model. After splitting, you need to tokenize your sentences using
tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
After tokenizing, you need to pad the sequences using
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
Then you can train your model with the data. For more details, please refer to this working code example. Thank you.
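Putting the two steps together, a minimal sketch (assuming training_sentences is your list of training texts, and vocab_size, oov_tok, max_length and trunc_type are hyperparameters you define):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build the vocabulary from the training sentences only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

# Turn each sentence into a sequence of integer ids, then pad/truncate to a fixed length
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)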

getting important features in dataframe

I would like to ask how to get the most important features into a dataframe.
from xgboost import XGBClassifier

# fit model to training data
xgb_model = XGBClassifier(random_state=0)
xgb_model.fit(X_train, y_train)
print("Feature Importances : ", xgb_model.feature_importances_)
I know how to plot them, but I want to know how to put the 20 most important features into a dataframe or a list.
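One way to do this (a minimal sketch, assuming X_train is a DataFrame whose columns are the feature names):
import pandas as pd

# Pair each feature name with its importance score
importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_,
})

# Sort by importance and keep the top 20
top20 = importances.sort_values('importance', ascending=False).head(20)
print(top20)
# top20['feature'].tolist() gives the same thing as a plain list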

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same results for the intercept and coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# if I use the dataframe, train[['ENGINESIZE']] for 'x' and train[['CO2EMISSIONS']]
# for 'y' below, I get the same result
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a one-column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
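To see this concretely, a minimal sketch with made-up numbers (the toy values below just stand in for the fuel-consumption data):
import numpy as np
import pandas as pd
from sklearn import linear_model

# Toy stand-in for the fuel-consumption columns (values are made up)
train = pd.DataFrame({'ENGINESIZE': [1.6, 2.0, 3.0, 3.5],
                      'CO2EMISSIONS': [150.0, 180.0, 240.0, 260.0]})

# Fit once with the DataFrame directly...
regr_df = linear_model.LinearRegression().fit(train[['ENGINESIZE']], train[['CO2EMISSIONS']])
# ...and once with explicit NumPy arrays
regr_np = linear_model.LinearRegression().fit(train[['ENGINESIZE']].to_numpy(),
                                              train[['CO2EMISSIONS']].to_numpy())

# check_array converts both inputs the same way, so the fits agree
print(np.allclose(regr_df.coef_, regr_np.coef_))  # True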

Why Tensorflow error: `failed to convert object of type <class 'dict'> to Tensor` happens and How can I solve it?

I am doing a task on traffic analysis and I am stymied by an error in my code. My data rows look like this:
quarter | DOW (day of week) | hour | density | speed | label (predicted speed for the next half hour)
The values are like this:
1, 6, 19, 23, 53.32, 45.23
This means that on some specific street, during the 1st quarter of 19 o'clock on a Friday, the measured traffic density is 23 and the current speed is 53.32; the predicted speed would be 45.23.
The task is to predict the speed for the next half hour from the predictors given above.
I am using this code to build a TensorFlow DNNRegressor for data:
import pandas as pd
data = pd.read_csv('dataset.csv')
X = data.iloc[:,:5].values
y = data.iloc[:, 5].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(data=scaler.transform(X_train),columns = ['quarter','DOW','hour','density','speed'])
X_test = pd.DataFrame(data=scaler.transform(X_test),columns = ['quarter','DOW','hour','density','speed'])
y_train = pd.DataFrame(data=y_train,columns = ['label'])
y_test = pd.DataFrame(data=y_test,columns = ['label'])
import tensorflow as tf
speed = tf.feature_column.numeric_column('speed')
hour = tf.feature_column.numeric_column('hour')
density = tf.feature_column.numeric_column('density')
quarter= tf.feature_column.numeric_column('quarter')
DOW = tf.feature_column.numeric_column('DOW')
feat_cols = [quarter, DOW, hour, density, speed]
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=10, num_epochs=1000, shuffle=False)
model = tf.estimator.DNNRegressor(hidden_units=[5,5,5],feature_columns=feat_cols)
model.train(input_fn=input_func,steps=25000)
predict_input_func = tf.estimator.inputs.pandas_input_fn(
    x=X_test,
    batch_size=10,
    num_epochs=1,
    shuffle=False)
pred_gen = model.predict(predict_input_func)
predictions = list(pred_gen)
final_preds = []
for pred in predictions:
    final_preds.append(pred['predictions'])
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,final_preds)**0.5
when I run this code, It throws an error with this ending:
TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'label': <tf.Tensor 'fifo_queue_DequeueUpTo:6' shape=(?,) dtype=float64>}. Consider casting elements to a supported type.
First of all, what is the concept behind this error? I couldn't find a source explaining the reason for the error to help me deal with it. And how can I modify the code to fix it?
Secondly, does it improve model performance to use tensorflow's categorical_column_with_identity instead of numeric_column for DOW, which indicates the day of the week?
I also want to know whether it's useful to merge quarter and hour into a single column like daytime (quarter is the minutes within an hour, which are going to be normalized between 0 and 1)?
First of all, what is the concept behind this error? I couldn't find a source explaining the reason for the error to help me deal with it. And how can I modify the code to fix it?
Let me first talk about the solution to the problem. You need to change the parameter y in pandas_input_fn as follows.
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train['label'], batch_size=10, num_epochs=1000, shuffle=False)
It seems that the parameter y in pandas_input_fn doesn't support the DataFrame type by the time model.train() runs. In this case, pandas_input_fn parses each sample of y into a form like {column_name: value}, but model.train() can't recognize that, so you need to pass a Series instead.
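(Equivalently, y_train.squeeze() would work here: squeezing a one-column DataFrame yields the underlying Series.)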
Secondly, does it improve model performance to use tensorflow's categorical_column_with_identity instead of numeric_column for DOW, which indicates the day of the week?
This comes down to choosing between categorical and numeric columns in feature engineering. A very simple rule: choose numeric if there is a meaningful big/small ordering within the feature's values; if bigger or smaller has no significance for the feature, choose categorical. So I would tend to choose categorical_column_with_identity for the DOW feature.
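For illustration, a minimal sketch in the TF 1.x feature-column API used above (note that DOW must then hold integer ids in [0, 7), so it should not be passed through the MinMaxScaler):
# Treat day-of-week as 7 identity categories instead of a number
DOW = tf.feature_column.categorical_column_with_identity('DOW', num_buckets=7)
# DNNRegressor needs dense inputs, so wrap the categorical column
DOW_indicator = tf.feature_column.indicator_column(DOW)
feat_cols = [quarter, DOW_indicator, hour, density, speed]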
I also want to know whether it's useful to merge quarter and hour into a single column like daytime (quarter is the minutes within an hour, which are going to be normalized between 0 and 1)?
Cross features may bring some benefits, as with latitude and longitude features. I recommend you use tf.feature_column.crossed_column (link) here. It returns a column for performing crosses of categorical features. You can also keep the individual quarter and hour features in the model at the same time.
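A minimal sketch of what that could look like (the bucket sizes here are illustrative assumptions, and the inputs must again be integer ids rather than scaled floats):
# Cross quarter-of-hour and hour-of-day into a single "time of day" feature
quarter_cat = tf.feature_column.categorical_column_with_identity('quarter', num_buckets=4)
hour_cat = tf.feature_column.categorical_column_with_identity('hour', num_buckets=24)
time_cross = tf.feature_column.crossed_column([quarter_cat, hour_cat], hash_bucket_size=96)
# Wrap it densely for the DNNRegressor
time_cross_indicator = tf.feature_column.indicator_column(time_cross)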
A similar error occurred to me:
Failed to convert object of type <class 'tensorflow.python.autograph.operators.special_values.Undefined'> to Tensor.
It occurred inside a tf.function when I tried to use a variable that I had not assigned before.
To debug this, temporarily remove the tf.function decorator from the method ;-)
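For illustration, a minimal way to trigger this class of error (a sketch; the exact message varies by TF version):
import tensorflow as tf

@tf.function
def double_if_positive(x):
    if x > 0:
        y = x * 2
    # y is never assigned when x <= 0, so autograph reports it as undefined
    return y

double_if_positive(tf.constant(-1))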

Reading EMNIST dataset

I am building a CNN using tensorflow in python, but I am having a problem loading the data from the EMNIST dataset. Can anyone please show me sample code for retrieving each image in a batch and passing it in during the training session?
There are a couple of formats of the EMNIST dataset. The one I've found easiest to understand is the CSV version on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image and there are 785 columns: the first column is the class label, and each column after that is one pixel value (784 in total for a 28 x 28 image).
You can check out one of my implementations of an EMNIST CNN using Keras, where your dataset loading can be similar:
import pandas as pd
from sklearn.model_selection import train_test_split

raw_data = pd.read_csv("data/emnist-balanced-train.csv")
train, validate = train_test_split(raw_data, test_size=0.1)  # change this split however you want
x_train = train.values[:,1:]
y_train = train.values[:,0]
x_validate = validate.values[:,1:]
y_validate = validate.values[:,0]
from https://github.com/Josh-Payne/cs230/blob/master/Alphanumeric-Augmented-CNN/augmented-cnn.py
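If you then want image-shaped input for a CNN, a minimal follow-up sketch (one assumption worth checking: the EMNIST CSV stores each image transposed relative to the usual MNIST orientation, hence the axis swap):
import numpy as np

# Reshape the flat 784-pixel rows into 28x28x1 images and scale to [0, 1]
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
# Undo the transposed storage so the characters are upright
x_train = np.swapaxes(x_train, 1, 2)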