Selecting columns of a dataframe in pandas to minimize a model score - pandas

I have n dataframes in a list:
df = [df_1, df_2, df_3, ..., df_n]
where each df_i is a pandas (Python) DataFrame and each dataframe is one variable of my Keras model.
X_train = [df_1_1, df_2_1, ..., df_n_1]
where df_1_1 is the first column of the first dataframe in the list (the first variable); each dataframe has m columns.
Each column of a dataframe is the same variable with a different type of smoothing or filter applied.
I have 100 columns in each dataframe and I want to select the combination of columns (one from each dataframe), the X_train, that gives the minimum value for the score of my model:
score = model.evaluate(X_test, Y_test)
X_test and Y_test are the last n occurrences of the selected columns.
Is there a library for selecting these columns (neural networks, genetic algorithms, ant colony optimization, ...)? How can I implement it?

What is your prediction task? Do you need a neural network at all? You are essentially looking at a feature selection problem here. You could use simpler models such as the lasso, which selects columns via L1 regularization, or an ensembling technique such as random forests, using the relative feature importances to select your columns. Perhaps have a look at scikit-learn.
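A rough sketch of that suggestion, assuming the candidate columns from all dataframes have been stacked into one feature matrix (the data below is synthetic, not the asker's):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 30 candidate columns, only two of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 3] - X[:, 17] + rng.normal(scale=0.1, size=200)

# L1 regularization drives the coefficients of unhelpful columns to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("lasso keeps columns:", selected)

# Alternatively, rank columns by random-forest feature importance.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top RF columns:", top)
```

Either ranking can then drive a greedy search over "one column per dataframe" combinations, scoring each candidate X_train with model.evaluate as in the question.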

Related

label encoder unable to convert a range of categorical columns into numerical columns

I have a 50-column dataset that is mostly categorical; only 5 of the columns are numerical. I would like to apply a label encoder to turn the categorical columns into numerical ones. The categorical columns are basically nominal columns in my dataset. I need to convert columns 0 to 4 and columns 9 to 50 to numerical values.
I used the command
le = LabelEncoder()
df.iloc[:,0:4]=le.fit_transform(df.iloc[:,0:4])
where df is the name of the dataframe.
Error: ValueError: y should be a 1d array
How can I fix this problem? Thank you.
LabelEncoder.fit_transform expects a single 1-D column, which is why passing it a 2-D slice raises ValueError: y should be a 1d array. Use the .apply() method of DataFrame to run the encoder column by column.
In your particular case it will be something like this: df.iloc[:, 0:4] = df.iloc[:, 0:4].apply(le.fit_transform)
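A self-contained sketch of the per-column approach, on invented nominal data rather than the real 50-column dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented nominal data standing in for the asker's categorical columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size":  ["S", "M", "L", "M"],
})

# fit_transform handles one 1-D column at a time, so apply it per column;
# a fresh encoder per column keeps each column's integer codes independent.
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)
```

Each column's categories are coded in sorted order (e.g. blue=0, green=1, red=2), so the codes are nominal labels, not an ordering.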

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file in which I want to compare each row with all the other rows. For each pair I want to do a linear regression, get the r^2 value for the regression line, and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows, but I can't think of a way to refer to the other rows once I have my primary row with this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
from itertools import combinations
from sklearn.metrics import r2_score

def bad_pairs(df, limit):
    # score every unordered pair of rows
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        final[fluor] = r2_score(df.xs(fluor[0]), df.xs(fluor[1]))
    # keep only the pairs whose r^2 exceeds the limit
    bad_final = {}
    for i in final:
        if final[i] > limit:
            bad_final[i] = final[i]
    return bad_final
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0 and 1 for each detector (220 columns).
I'm still working on a way to make a new pandas DataFrame from a dictionary with all the values (final in the code above), not just those over the limit.
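For that follow-up, one possible sketch (on made-up stand-in data, not the real detector readings): the dict keys are (row, row) tuples, so a Series built from the dict gets a MultiIndex that unstacks straight into a square matrix.

```python
import pandas as pd
from itertools import combinations
from sklearn.metrics import r2_score

# Made-up stand-in: rows are colors, columns are detector readings.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.1, 0.2, 0.35], [0.9, 0.1, 0.4]],
    index=["red", "green", "blue"],
)

# All pairwise r^2 scores, keyed by (row, row) tuples as in the question.
final = {pair: r2_score(df.xs(pair[0]), df.xs(pair[1]))
         for pair in combinations(df.index, 2)}

# A Series with tuple keys has a MultiIndex; unstack pivots the second
# index level into columns, giving one matrix of pairwise scores.
matrix = pd.Series(final).unstack()
print(matrix)
```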

Unable to Perform NULL value analysis in pandas dataframe

I want to perform NULL-value analysis here.
Here are the first 2 rows of the dataset:
Shop_name      Bikes_available             Shop_location  Average_price_of_bikes  Rating_of_shop
NYC Velo       Ninja, hbx                  Salida         5685$                   4.2
Bike Gallery   dtr, mtg, Harley Davidson   Portland       6022$                   4.8
Except for Shop_name, every column has some NULL values.
Earlier I used mean-based imputation and frequency-based imputation to replace the NULL values, but I have been told to follow a model-based imputation technique to replace them instead.
Can anyone suggest how to do that?
I guess "model-based imputation" refers to using statistical models to predict the missing values. For example, you can use a k-nearest-neighbours model: train it on the rows where no value is missing and predict the missing values for the rows that have them. You would have to apply one-hot encoding to the categorical columns first.
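A minimal sketch with scikit-learn's KNNImputer, assuming the categorical columns have already been one-hot encoded; the numbers below are invented, not from the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Invented numeric frame with a few NULLs.
df = pd.DataFrame({
    "Average_price_of_bikes": [5685.0, 6022.0, np.nan, 5900.0],
    "Rating_of_shop":         [4.2, 4.8, 4.5, np.nan],
})

# Each missing entry is filled with the mean of that feature over the
# k nearest rows, with distances computed on the features that are present.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum().sum())  # 0: no NULLs remain
```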

iloc using scikit learn random forest classifier

I am trying to build a random forest classifier to determine the 'type' of an object based on different attributes. I am having trouble understanding iloc and separating the predictors from the classification. If the 50th column is the 'type' column, I am wondering why the iloc (commented out) line does not work, but the line y = dataset["type"] does. I have attached the code below. Thank you!
X = dataset.iloc[:, 0:50].values
y = dataset["type"]
#y = dataset.iloc[:,50].values
Let's assume that the first column in your dataframe is named 0 and the following columns are named consecutively, like the result of the following lines:
last_col = 50
tab = pd.DataFrame([[x for x in range(last_col)] for c in range(10)])
Now try tab.iloc[:, 0:50]: it works, because you used a slice to select column indexes.
But tab.iloc[:, 50] does not work, because there is no column with index 50: with 50 columns, the valid positions run from 0 to 49, so the 50th column sits at position 49.
Slicing and selecting a column by its index are just a bit different. From the pandas documentation:
.iloc[] is primarily integer position based (from 0 to length-1 of the axis)
I hope this helps.
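A quick demonstration of the difference, on the same toy frame:

```python
import pandas as pd

# A small frame with 50 columns, positions 0 through 49.
tab = pd.DataFrame([[x for x in range(50)] for _ in range(10)])

sliced = tab.iloc[:, 0:50]   # slice: fine, stops at the last column
print(sliced.shape)          # (10, 50)

last = tab.iloc[:, 49]       # single position: must be at most 49
print(last.name)             # 49

try:
    tab.iloc[:, 50]          # position 50 does not exist
except IndexError as e:
    print("IndexError:", e)
```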

Working with dataframe / matrix to create an input for sklearn & Tensorflow

I am working with pandas / Python / numpy / Datalab / BigQuery to generate an input table for machine learning processing. The data is genomic, and right now I am working with a small subset of it:
174 rows
12430 columns
The column names are extracted from BigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard', use_cache=True)), and the row names are extracted the same way: samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')
What would be the easiest way to create a dataframe / matrix with the extracted row and column names?
I explored the dataframes in pandas and could not find a way to pass the names as parameters.
For an empty array I was able to find the following (numpy), but with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns
I know R very well (if there is no other way, I hope I can use it with Datalab).
Any ideas?
Many thanks!
If you have your column names and row names stored in lists, then you can just use .loc to select the exact rows and columns you want. Just make sure the row names are in the index: you might need df = df.set_index('sample_id') to put the correct row names there.
Assuming the rows and columns are in the variables row_names and col_names, do this:
df.loc[row_names, col_names]
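To build the named frame from scratch, pandas does accept the names as parameters: index= for the rows and columns= for the columns. A sketch with invented names standing in for the lists pulled from BigQuery:

```python
import numpy as np
import pandas as pd

# Invented names in place of the sample_ids and feature names from BigQuery.
row_names = ["sample_1", "sample_2", "sample_3"]
col_names = ["feat_a", "feat_b"]

# An all-NaN frame with named rows and columns, ready to be filled in.
df = pd.DataFrame(
    np.full((len(row_names), len(col_names)), np.nan),
    index=row_names,
    columns=col_names,
)
print(df.shape)        # (3, 2)
print(list(df.index))  # ['sample_1', 'sample_2', 'sample_3']
```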