Is plotting with Koalas using TopN has any statistic meaning? - pandas

I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns our that they use either sampling or TopN - selecting a given number of records.
I understand the meaning of sampling and internally it uses spark.DataFrame.sample to do it. For TopN, however, they simply take the first max_rows number of records from Koalas' DataFrame using data = data.head(max_rows + 1).to_pandas().
This seems strange and I wonder whether it's correctly reflecting the statistical properties of the dataset doing the data selection in this way.
Koalas DataFrame's plot accessor:
class KoalasPlotAccessor(PandasObject):
pandas_plot_data_map = {
"pie": TopNPlotBase().get_top_n,
"bar": TopNPlotBase().get_top_n,
"barh": TopNPlotBase().get_top_n,
"scatter": SampledPlotBase().get_sampled,
"area": SampledPlotBase().get_sampled,
"line": SampledPlotBase().get_sampled,
}
_backends = {} # type: ignore
...
class TopNPlotBase:
def get_top_n(self, data):
from databricks.koalas import DataFrame, Series
max_rows = get_option("plotting.max_rows")
# Simply use the first 1k elements and make it into a pandas dataframe
# For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
if isinstance(data, (Series, DataFrame)):
data = data.head(max_rows + 1).to_pandas()
...

Related

Predefined feature selection from pandas dataframe as part of a sklearn gridsearch crossvalidation

TLDR: How to iterate through predefined feature subsets as part of a scikit-learn gridsearchcv pipeline?
For a regression task, I have set up a (nested) CV to chose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms, instead I simply want to try different predefined subsets of the available columns and get them tested against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using Functiontransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
return model_X.loc[:, feature_sets[feature_set]]
CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
And here is how I put all together in a pipeline and param grid (this is a reduced example for only the relevant steps):
PreProcessor = ColumnTransformer([
('selector', CustomFeatureSelector, model_X.columns),
('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])
pipe = Pipeline(steps=[
('preprocessor', PreProcessor),
('regressor', DummyRegressor()) # just a dummy here, as it can't be empty (actual regressor see regressor_params)
])
preprocessor_params = [
{
'preprocessor__selector__kw_args': [{'feature_set':i} for i in range(len(feature_candidates_list))],
'preprocessor__scaler__with_mean': [True, False],
'preprocessor__scaler__with_std': [True, False],
},
]
regressor_params = [
{
'regressor': [TweedieRegressor(max_iter=1000)],
'regressor__power': [0, 1],
'regressor__alpha': [0, 1],
'regressor__link': ['log'],
'regressor__fit_intercept': [True, False],
},
]
params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
'R2': 'r2',
'MAPE': 'neg_mean_absolute_percentage_error',
'MedAE': 'neg_median_absolute_error',
'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'
with parallel_backend('loky', n_jobs=-1):
innerCV = GridSearchCV(
pipe,
params,
scoring= scoring,
refit= refit_scorer,
cv=10,
verbose=1,
)
outerCV = cross_validate(
innerCV,
model_X,
model_y,
scoring=scoring,
cv=10,
return_estimator=True,
verbose=1,
)
I am not sure if this pipeline actually selects the features as intended.
model_X has m columns
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n but I expected n (also tested it for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems like it is taking all original columns of model_X and concatenates them with the chosen set of columns instead of replacing. When I switch off the scaler the output of the above is in deed equal to n, however, I still cannot see which features were chosen for a respective estimator because calling .feature_names_in_ on them raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in gridsearchcv is not smart and I should go a different route? Any hints welcome!
Update:
I switched to sklearn nightly (v1.2.dev0) where I can use set_config(transform_output='pandas') to avoid getting my dataframe converted to a numpy array by transformers. This helps to get feature names when calling the .feature_names_in_ on one of the estimators but it only works when I have just the scaler activated.
When I also activate my custom selector the fitting fails for all folds. But when I turn off the set_config(...) again, it works just like in the stable versions v1.1.2 and v1.1.3 without ability to get feature names.

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. this data frame as shown above.
Does anyone know how to do that?
The solution is :
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() returns a pandas dataframe, containing the information of the list of models. Hence, you only need to save a dataframe, which can be for example achieved with best.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to pandas i/o documentation.

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utililize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data that is formatted as each row is a single instance of a ID/time/data. This means that the rows don't correspond 1 to 1 for each ID. Each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time since they don't affect each other). To do this I am first looping through each of the ID tags, pulling/locating all the data from that ID (with .loc in pandas, so it is a separate "mini" df), then delaying the function call on the mini df, adding a column with the delayed values and adding it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini-dfs at once but for some reason the mini df's values are still delayed. Below I will post some pseudocode about what I just tried to explain.
I have a feeling that this may not be the best way to go about this but it's what made sense at the time and I can't understand whats wrong so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
curr_df = big_df.loc[big_df['id'] == id]
curr_df['new value 1'] = dask.delayed(myfunc)(args1)
curr_df['new value 2'] = dask.delayed(myfunc)(args2) #same func as previous line
list_of_mini_dfs.append(curr_df)
list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
Concat all mini dfs into new big df.
As you can see by the code I have to reach into my big/overall dataframe to pull out each ID's sequence of data since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working, when I concat all the mini dataframes back together the two values I have delayed are still delayed, which leads me to think that it is due to the way I am delaying a function within a df and trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np
# generate big_df
dates = pd.date_range(start='2019-01-01',
end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
df = pd.DataFrame({"ID":[i]*l,
"date": dates,
"data0": np.random.randn(l),
"data1": np.random.randn(l)})
out.append(df)
big_df = pd.concat(out, ignore_index=True)\
.sample(frac=1)\
.reset_index(drop=True)
Now you want to apply your function fun on columns data0 and data1
Pandas
out = big_df.groupby("ID")[["data0","data1"]]\
.apply(fun)\
.reset_index()
df_pd = pd.merge(big_df, out, how="left", on="ID" )
Dask
df = dd.from_pandas(big_df, npartitions=4)
out = df.groupby("ID")[["data0","data1"]]\
.apply(fun, meta={'data0':'f8',
'data1':'f8'})\
.rename(columns={'data0': 'new_values0',
'data1': 'new_values1'})\
.compute() # Here you need to compute otherwise you'll get NaNs
df_dask = dd.merge(df, out,
how="left",
left_on=["ID"],
right_index=True)
The dask version is not necessarily faster than the pandas one. In particular if your df fits in RAM.

Randomly selecting different percentages of data in Python

Python beginner, here. I have a dataset with 101 rows which I have imported into Python (as a csv file) using Pandas. I essentially want to randomly generate a number between 0 and 1 and, based on the result, randomly select the percent equivalent from the dataset. So, for instance, a randomly generated number of 0.89 would require 89% of the data to be selected.
I also want to specify different percentages such that I have, for instance, 89%, 8% and 3% of the data randomly selected at once. This is so I can make different assumptions based on X% of data that has been selected (for instance, for 3% of rows selected print('A'), etc.). I finally want to simulate the whole thing several times and store the results.
I have been experimenting with different types of code, such as the df.sample(frac=0.89) and etc. but I'm not sure how to extend this to select different percentages at the same time.
My current code is:
import random
import pandas import pandas as pd
df = pd.read_csv(r'R_100.csv', encoding='cp1252')
df_1 = df['R_MD'].sample(frac=0.8889)
Total = df['PR_MD'].sum()
print(df_1, 'Total=', Total)
Any advice is much appreciated. Thanks in advance.
Here is something you can do, you need a function to do this every time.
import pandas as pd
df = pd.read_csv(r'R_100.csv', encoding='cp1252')
After you read the dataframe
def frac(dataframe, fraction, other_info=None):
"""Returns fraction of data"""
return dataframe.sample(frac=fraction)
here other_info can be specific column name and then call the function however many times you want
df_1 = frac(df, 0.3)
it will return you a new dataframe that you can use for anything you want, you can use this something like this as I infer from your example you are taking sum of a column
import random
def random_gen():
"""generates random number"""
return random.randint(0,1)
def print_sum(column_name):
"""Prints sum"""
# call the random_gen() to give out a number
rand_num = random_gen()
# pass the number as fraction parameter to frac()
df_tmp = frac(df, rand_num)
print(df_tmp[str(column_name)].sum())
Or if you want
but I'm not sure how to extend this to select different percentages at the same time.
Then just change the print_sum as follows
def print_sum(column_name):
"""returns result for 10 iterations"""
# list to store all the result
results = []
# selecting different percentage fraction
# for 10 different random fraction or you can have a list of all the fractions you want
# and then for loop over that list
for i in range(1,10):
# generate random number
fracr = random_gen()
# pass the number as fraction parameter to frac()
df_tmp = frac(df, fracr)
result.append(df_tmp[str(column_name)].sum())
return result
Hope this helps! Feedback is much appreciated :)