Trying to use XGBoost after applying SMOTENC on df, getting ValueError: A given column is not a column of the dataframe - xgboost

Since my df is not balanced (per the response column) I am trying to balance the the response column and than apply XGBoost to find the best predictors from the df.
After running my code, I'm getting ValueError: A given column is not a column of the data frame (it's the 2nd column of my df).
The df has 5 first columns (0-4) with categorial data and 6 more numeric columns.
> df= pd.read_csv('Avis test6.csv')
> #df['Binary'] = np.where(df['status']=='closed', 0,1)
> df['Binary'].value_counts()
> define dataset
> y=df['Binary']
> df.shape
> X=df.drop(columns=['Binary'])
> sm = SMOTENC(random_state=42, categorical_features=[0,1,2,3,4])
> X_res, y_res = sm.fit_resample(X, y)
> X_res_train,y_res_train = df.iloc[:2000, :-1], df.iloc[:2000, -1]
> X_res_test, y_res_test = df.iloc[2000:, :-1], df.iloc[2000:, -1]
> cat_attribs = ['category_list', 'market', 'country_code', 'region', 'city']
> full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')
> encoder = full_pipeline.fit(X_res_train)
> X_res_train = encoder.transform(X_res_train)
> X_res_test = encoder.transform(X_res_test)
> train the model
> model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
> model.fit(X_res_train, y_res_train)
> extract the training set predictions
> model.predict(X_res_train)
> extract the test set predictions
> model.predict(X_res_test)

Related

Using 'not is in' in PySpark and getting an empty dataframe back

I'm trying to use filter to find those 'title' that are not in list_A.
A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()
B.filter(~B.title.isin(list_A)).count()
However, I get an empty dataframe back (count is 0)
It works well when I use 'is in':
Why this happened and how can I solve this?
I tried:
B=B.na.drop(subset=["title"])
B.filter(~B.title.isin(list_A)).count()
print(B.filter(~B.title.isin(list_A) | B.title.isNull()).count())
It still returns 0.
It may be because other "title" values are null.
B = spark.createDataFrame([('x',), ('x',), (None,)], ['title'])
A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()
print(B.filter(~B.title.isin(list_A)).count())
# 0
print(B.filter(B.title.isin(list_A)).count())
# 2
If you really need list_A, you shouldn't go to Pandas for it.
You can either use collect
A = B.groupBy("title").count().filter(F.col('count') > 1)
list_A = [x.title for x in A.collect()]
print(list_A)
# ['x', None]
or collect_set
list_A = (B
.groupBy("title").count()
.groupBy((F.col('count') > 1).alias('_c')).agg(
F.collect_set('title').alias('_t')
).filter('_c')
.head()[1]
)
print(list_A)
# ['x']
Finally, to translate your current query to PySpark, you should use window functions.
Input:
from pyspark.sql import functions as F, Window as W
B = spark.createDataFrame(
[('x', 'Example'),
('x', 'Example'),
('x', 'not_example'),
('y', 'not_example'),
(None, 'not_example'),
(None, 'Example')],
['title', 'journal'])
Your current script:
A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()
B.filter(((B.title.isin(list_A))&(B.journal!="Example"))|(~B.title.isin(list_A)))
Suggestion:
B_filtered = (B
.withColumn('A_cnt', F.count('title').over(W.partitionBy('title')))
.filter("(A_cnt > 1 and journal != 'Example') or A_cnt <= 1")
.drop('A_cnt')
)
B_filtered.show()
# +-----+-----------+
# |title| journal|
# +-----+-----------+
# | null|not_example|
# | null| Example|
# | x|not_example|
# | y|not_example|
# +-----+-----------+

Pd['Column1] > Pd['Column2'] Key Error: 0

I am trying to write a function that loops through a pandas dataframe and compares if column1 > column2, if so appends 1 to a list, which is then returned.
Importing finance data from yahoo finance, calculating 2 Std and assigning to column upper and lower, and the moving average.
import pandas as pd
import pandas_datareader as web
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import numpy as np
end = datetime(2021, 1, 16)
start = datetime(2020 , 1, 16)
symbols = ['ETH-USD']
stock_df = web.get_data_yahoo(symbols, start, end)
period = 20
# Simple Moving Average
stock_df['SMA'] = stock_df['Close'].rolling(window=period).mean()
# Standard deviation
stock_df['STD'] = stock_df['Close'].rolling(window=period).std()
# Upper Bollinger Band
stock_df['Upper'] = stock_df['SMA'] + (stock_df['STD'] * 2)
# Lower Bollinger Band
stock_df['Lower'] = stock_df['SMA'] - (stock_df['STD'] * 2)
# List of columns
column_list = ['Close', 'SMA', 'Upper', 'Lower']
stock_df[column_list].plot(figsize=(12.2,6.4)) #Plot the data
plt.title('ETH-USD')
plt.ylabel('USD Price ($)')
plt.show();
#Create a new data frame, Period for calculation, removes NAN's
bolldf = stock_df[period-1:]
#Show the new data frame
bolldf
Function, Loop through column rows and compare, append df['Close'][0] to buy/sell signal if condition is met.
def signal(df):
buy_signal = []
sell_signal = []
for i in range(len(df['Close'])):
if df['Close'][i] > df['Upper'][i]:
buy_signal.append(1)
return buy_signal
buy_signal = signal(bolldf)
buy_signal
Info about the Error:
KeyError: 0
During handling of the above exception, another exception occurred:
---> 12 buy_signal = greaterthan(stock_df)
KeyError: 0
During handling of the above exception, another exception occurred:
11
---> 12 buy_signal = greaterthan(stock_df)
13
----> 8 if bolldf['Close'][i] > bolldf['Open'][i]:
KeyError: 0
When i attempt this function on the columns df['Upper'] > df['Lower], or df['SMA'] < df['Lower'] for example it works as expected, its only when using the columns from the original data that it does not work.
Any help would be amazing. Thank you.
Since it is a multi-index, the columns must also be specified in the multi-index format. You can check the contents of the columns with bolldf.columns, and you can get them by modifying as follows.
bolldf.columns
MultiIndex([('Adj Close', 'ETH-USD'),
( 'Close', 'ETH-USD'),
( 'High', 'ETH-USD'),
( 'Low', 'ETH-USD'),
( 'Open', 'ETH-USD'),
( 'Volume', 'ETH-USD'),
( 'SMA', ''),
( 'STD', ''),
( 'Upper', ''),
( 'Lower', '')],
names=['Attributes', 'Symbols'])
def signal(df):
buy_signal = []
sell_signal = []
for i in range(len(df[('Close','ETH-USD')])):
if df[('Close','ETH-USD')][i] > df[('Upper','')][i]:
buy_signal.append(1)
return buy_signal
buy_signal = signal(bolldf)

Pandas accumulate data for linear regression

I try to adjust my data so total_gross per day is accumulated. E.g.
`Created` `total_gross` `total_gross_accumulated`
Day 1 100 100
Day 2 100 200
Day 3 100 300
Day 4 100 400
Any idea, how I have to change my code to have total_gross_accumulated available?
Here is my data.
my code:
from sklearn import linear_model
def load_event_data():
df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
df['created'] = pd.to_datetime(df.created)
return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
List comprehension is the most pythonic way to do this.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated =[event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
OR faster
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
LONG answer:
Full code using your data:
import pandas as pd
def load_event_data():
df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
df['created'] = pd.to_datetime(df.created)
return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated =[event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()

Time Difference between Time Period and Instant

I have some time periods (df_A) and some time instants (df_B):
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
# Data
df_A = pd.DataFrame({'A1': [dt.datetime(2017,1,5,9,8), dt.datetime(2017,1,5,9,9), dt.datetime(2017,1,7,9,19), dt.datetime(2017,1,7,9,19), dt.datetime(2017,1,7,9,19), dt.datetime(2017,2,7,9,19), dt.datetime(2017,2,7,9,19)],
'A2': [dt.datetime(2017,1,5,9,9), dt.datetime(2017,1,5,9,12), dt.datetime(2017,1,7,9,26), dt.datetime(2017,1,7,9,20), dt.datetime(2017,1,7,9,21), dt.datetime(2017,2,7,9,23), dt.datetime(2017,2,7,9,25)]})
df_B = pd.DataFrame({ 'B': [dt.datetime(2017,1,6,14,45), dt.datetime(2017,1,4,3,31), dt.datetime(2017,1,7,3,31), dt.datetime(2017,1,7,14,57), dt.datetime(2017,1,9,14,57)]})
I can match these together:
# Define an Extra Margin
M = dt.timedelta(days = 10)
df_A["A1X"] = df_A["A1"] + M
df_A["A2X"] = df_A["A2"] - M
# Match
Bv = df_B .B .values
A1 = df_A .A1X.values
A2 = df_A .A2X.values
i, j = np.where((Bv[:, None] >= A1) & (Bv[:, None] <= A2))
df_C = pd.DataFrame(np.column_stack([df_B .values[i], df_A .values[j]]),
columns = df_B .columns .append (df_A.columns))
I would like to find the time difference between each time period and the time instant matched to it. I mean that
if B is between A1 and A2
then dT = 0
I've tried doing it like this:
# Calculate dt
def time(A1,A2,B):
if df_C["B"] < df_C["A1"]:
return df_C["A1"].subtract(df_C["B"])
elif df_C["B"] > df_C["A2"]:
return df_C["B"].subtract(df_C["A2"])
else:
return 0
df_C['dt'] = df_C.apply(time)
I'm getting "ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
So, I found two fixes:
You are adding M to the lower value and subtracting from the higher one. Change it to:
df_A['A1X'] = df_A['A1'] - M
df_A['A2X'] = df_A['A2'] + M
You are only passing one row of your dataframe at a time to your time function, so it should be something like:
def time(row):
if row['B'] < row['A1']:
return row['A1'] - row['B']
elif row['B'] > row['A2']:
return row['B'] - row['A2']
else:
return 0
And then you can call it like this:
df_C['dt'] = df_C.apply(time, axis=1) :)

Calculating np.mean predict with percent filter

I need to find np.mean of clf.predict only for rows where one of predicted values percent more then 80%
My current code:
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, Y)
dropIndexes = []
for i in range(len(X)):
proba = clf.predict_proba ([X.values[i]])
if (proba[0][0] < 80 and proba[0][1] < 80):
dropIndexes.append(i)
#delete all rows where predicted values less then 80
X.drop(dropIndexes, inplace=True)
Y.drop(dropIndexes, inplace=True)
#Returns the average of the array elements
print ("ERR:", np.mean(Y != clf.predict(X)))
Is it possible to make this code more quickly?
Your loop is unnecessary, as predict_proba works on matrices. You can replace it with
prd = clf.predict_proba(X)
dropIndexes = (prd[:, 0] < 0.8) & (prd[:, 1] < 0.8)