Tensorflow classifier.predict write to csv - pandas

I haven't used TensorFlow for a while, and when I updated it, it seemed to break my old code, since many of the old functions are deprecated. I replaced them with the new API and everything seems to run, except for writing out the results:
y_predicted = classifier.predict(X_test)
There is an as_iterable option as well, which I don't think I need.
I used to write out the prediction results using:
pandas.DataFrame(y_predicted).to_csv("/dir/")
but now I am getting an error that not all elements can be converted into String type. Is there a class in y_predicted I am supposed to be calling instead of the whole thing?
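If the classifier is one of the newer tf.estimator-style classifiers, predict() typically returns a generator of per-example dicts rather than a plain array, which could explain why the DataFrame conversion fails. A minimal, hedged sketch of materializing those predictions first (the 'class_ids' key is an assumption and depends on the classifier; the call signature is kept as in the question):

import pandas as pd

# predict() may yield one dict per example; materialize the generator first.
predictions = list(classifier.predict(X_test))

# Pull a single scalar out of each dict ('class_ids' is a typical but assumed key).
predicted_classes = [int(p['class_ids'][0]) for p in predictions]

pd.DataFrame({'prediction': predicted_classes}).to_csv('predictions.csv', index=False)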

Anyway, I found a solution using a NumPy array instead of a pandas DataFrame:
import numpy as np

result = np.asarray(y_predicted)
formatInt = result.astype(int)  # np.int is deprecated in newer NumPy; plain int works here
np.savetxt("dir", formatInt, delimiter=",")

You can also try:
df = pandas.DataFrame({'Prediction':list(y_predicted)})
df.to_csv('filename.csv')

Related

How to setup a batched matrix multiplication in Numba with np.dot() using contiguous arrays

I am trying to speed up a batched matrix multiplication problem with numba, but it keeps telling me that np.dot() is faster on contiguous arrays.
Note: I'm using numba version 0.55.1, and numpy version 1.21.5
Here's the problem:
import numpy as np
import numba as nb

@nb.njit(parallel=True)  # decorator assumed here; the performance warning below only appears under JIT compilation
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        result[n, :] = np.dot(vec[n, :], mat[n, :, :])
    return result

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))
result = numbaFastMatMult(mat, vec)

n = 0  # check one sample slice
print(mat.data.contiguous)
print(vec.data.contiguous)
print(mat[n, :, :].data.contiguous)
print(vec[n, :].data.contiguous)
Clearly all the relevant data is contiguous (run the code snippet above and see the results of the print() calls).
But, when I run this code, I get the following warning:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 1d, C), array(float64, 2d, A))
result[n,:] = np.dot(vec[n,:], mat[n,:,:])
Two extra comments:
This is just a toy problem for replication. I'm actually using something with many more data points, so I'm hoping this will speed things up.
I think the "right" way to solve this is with np.tensordot; a plain-NumPy batched alternative along those lines is sketched after the list of things I've tried below. However, I want to understand what's going on for future reference. For example, this discussion addresses a similar issue, but as far as I can tell, it doesn't address why the warning shows up directly.
I've tried adding a decorator:
nb.float64[:,::1](nb.float64[:,:,::1],nb.float64[:,::1]),
I've tried reordering the arrays so the batch index is first (n in the above code)
I've tried printing whether the "mat" variable is contiguous from inside the function
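For reference, the batched product itself can also be written without an explicit loop in plain NumPy. This is just an illustrative sketch (not code from the question) of the einsum-style alternative mentioned above:

import numpy as np

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))

# result[n] = vec[n] @ mat[n] for every n, expressed as a single einsum call
result = np.einsum('nd,nde->ne', vec, mat)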
I'll leave this up, but I figured it out:
Outside of a numba function:
mat[n,:,:].data.contiguous==True
but inside numba, mat[n,:,:] is no longer typed as contiguous (it becomes an 'A'-layout array, as shown in the warning).
Changing my code above to np.dot(vec[n], mat[n]) removed the warning.
I'm making this the "correct" answer since it solved my problem. However, according to max9111's response, this behavior may be a bug!
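A minimal sketch of the fixed function (assuming the same parallel-jitted setup as in the question); indexing with mat[n] and vec[n] lets numba keep the contiguous layout information, so the warning disappears:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        # mat[n] / vec[n] keep their C-contiguous typing, unlike mat[n, :, :]
        result[n, :] = np.dot(vec[n], mat[n])
    return result

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))
result = numbaFastMatMult(mat, vec)  # runs without the NumbaPerformanceWarning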

SkLearn - Using RegressorChain with ColumnTransformer in Pipelines?

I'm having problems using sklearn's RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), and unfortunately there doesn't seem to be a lot of documentation/examples about this.
The documentation states indirectly (through the set_params method) that it can be used with Pipelines. My pipeline has:
ct = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), numerical_columns),
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough'
)
cv = TimeSeriesSplit(n_splits=groups.nunique())  # groups by date
pipeline = make_pipeline(ct, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
and then I do:
chain_regressor = RegressorChain(base_estimator=target_transform_output, order=[1, 0, 2])
chain_regressor.fit(X, y)
In the above, both X and y are pandas DataFrames, and y has 3 target columns.
When I run the code, I get a Python stack trace caused by the fit() call, starting in __init__.py in _get_column_indices(X, key) when doing all_columns = X.columns. The error is:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
and further down at the end of the stack trace:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I assume this is because the ColumnTransformer returns ndarrays, a well-known problem. Does this mean that the RegressorChain can't be used with the ColumnTransformer?
After this, I removed the column transformer step from the pipeline and tried again, and without the ColumnTransformer everything works fine (even the TransformedTargetRegressor).
Any help, ideas or workaround appreciated.
You have the issue the wrong way around: it's not that ColumnTransformer outputs an array and RegressorChain expected a dataframe; rather, the RegressorChain converts your input to an array before calling your pipeline, and so your ColumnTransformer doesn't get a dataframe as input and cannot use your column-name specifications.
You could just specify the columns by index or callable in the ColumnTransformer. But I think in this case, you have two unfortunate side-effects:
For each target, you are re-encoding day_of_week and re-scaling each independent variable (not wrong, just a little wasteful), and
you never scale the targets, even when they are used as independent variables for "later" targets' regressions (not wrong for a tree-based model like your lightGBM [in fact, for LGBM, why bother scaling at all?], but other models might suffer from not scaling those).
(1) can be fixed by preprocessing as a pipeline step before RegressorChain. (2) can be fixed by changing the scaler's column specification to a callable, below using the helper make_column_selector. Doing that fix for (2) does end up re-calculating the scalings at each step (hurting (1) again), but I think in the end that's a bigger deal (if you wanted to use something other than a tree model at some point).
So I would suggest instead:
import numpy as np
import lightgbm as lgb
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PowerTransformer

encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), ['day_of_week']),
    ],
    remainder='passthrough',
)
scale_nums = ColumnTransformer(
    transformers=[
        ('scaler', MinMaxScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder='passthrough',
)
modeling_pipe = make_pipeline(scale_nums, lgb.LGBMRegressor(random_state=42))
target_transform_output = TransformedTargetRegressor(
    regressor=modeling_pipe,
    transformer=PowerTransformer(),
)
final_pipeline = make_pipeline(encoder, target_transform_output)

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big dataframe that I want to write to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers leads to data type int32 and stores without problems.
Setting one element to "abc" changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent this? The only way I found was to go through all columns, check if they are dtype('O') and explicitly convert them to str.
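A minimal sketch of that per-column workaround (this just restates the conversion described above as code; it does not fix the underlying type inference):

import pandas as pd

def object_columns_to_str(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every object-dtype column is explicitly cast to str."""
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype(str)
    return out

object_columns_to_str(df).to_hdf("df.hdf5", key="data", format="table")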
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Object dtype dtype('O') has no native HDF5 equivalent

Well, it seems like a couple of similar questions were asked here on Stack Overflow, but none of them seems to have been answered correctly or properly, nor do they describe the exact examples.
I have a problem with saving an array or list into hdf5 ...
I have several files, each containing an array of shape (n, 35), where n may be different in each file. Each of them can be saved in hdf5 with the code below.
hdf = hf.create_dataset(fname, data=d)
However, if I want to merge them into a 3D array, the error below occurs.
Object dtype dtype('O') has no native HDF5 equivalent
I have no idea why it turns into dtype object, since all I have done is this:
all_data = list()
for fname in file_list:
    d = np.load(fname)
    all_data.append(d)
hdf = hf.create_dataset('all_data', data=all_data)
How can I save such data?
I tried a couple of tests, and it seems like all_data turns into dtype object when I change it with
all_data = np.array(all_data)
which looks like it has the same problem when saving to hdf5.
Again, how can I save such data in hdf5?
I was running into a similar issue with h5py, and changing the type of the NumPy array using array.astype worked for me (I believe this changes the type from dtype('O') to the data type you specify). Please see the code snippet below:
import numpy as np
print(X.dtype)
--> dtype('O')
print(X.astype(np.float64).dtype)
--> dtype('float64')
When I ran h5.create_dataset with this data type conversion, I was able to successfully create an h5 dataset. Hope this helps!
ONE ADDITIONAL UPDATE: I believe the NumPy object type 'O' is created when the NumPy array itself has mixed element types (e.g. np.int8 and np.float32).
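A minimal sketch of that conversion step around create_dataset (assuming, as in the update above, that the object dtype comes from mixed numeric element types rather than from rows of different lengths; the file and dataset names are illustrative):

import numpy as np
import h5py

X = np.array(all_data)     # comes out as dtype('O') when the element types are mixed
X = X.astype(np.float64)   # works for mixed numeric types; fails if the rows are ragged

with h5py.File('all_data.h5', 'w') as hf:
    hf.create_dataset('all_data', data=X)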
dtype('O') stands for object. In my case I had a list of lists where the lengths were different, and I got the same error. If you convert it to a NumPy array, NumPy warns about creating an ndarray from ragged nested sequences. h5 files can't handle this type of data; for more info see this post.
This error comes when I use:
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X)
    pfile['peakX'] = np.array(Y)
However, when I specified a dtype when saving the arrays, the problem went away... I guess h5py is not able to create datasets from undefined data types.
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X, dtype=np.float32)
    pfile['peakX'] = np.array(Y, dtype=np.float32)
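If the arrays genuinely have different lengths (the ragged case mentioned above), a hedged workaround is to store each one as its own dataset instead of forcing them into a single array; this sketch is not from the answers above and the group/dataset names are made up:

import numpy as np
import h5py

with h5py.File('per_item.h5', 'w') as hf:
    grp = hf.create_group('all_data')
    for i, fname in enumerate(file_list):
        # one dataset per (n, 35) array, so differing n values are no problem
        grp.create_dataset(f'item_{i}', data=np.load(fname))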

Python: What's the proper way to convert between python bool and numpy bool_?

I kept getting errors with a NumPy ndarray of booleans not being accepted as a mask by a pandas structure, when it occurred to me that I may have the 'wrong' booleans. Edit: it was not a raw numpy array but a pandas.Index.
While I was able to find a solution, the only one that worked was quite ugly:
mymask = mymask.astype(np.bool_) #ver.1 does not work, elements remain <class 'bool'>
mymask = mymask==True #ver.2, does work, elements become <class 'numpy.bool_'>
mypdstructure[mymask]
What's the proper way to typecast the values?
Ok, I found the problem. My original post was not fully correct: my mask was a pandas.Index.
It seems that pandas.Index.astype is behaving unexpectedly (for me), as I get different behavior for the following:
mask = pindex.map(myfun).astype(np.bool_) # doesn't cast
mask = pindex.map(myfun).astype(np.bool_,copy=False) # doesn't cast
mask = pindex.map(myfun).values.astype(np.bool_) # does cast
Maybe it is actually a pandas bug? This result is surprising to me because I was under the impression that pandas usually just calls the functions of the numpy arrays that it is based on. This is clearly not the case here.
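For completeness, a small self-contained sketch of the variant that does cast (a toy example of my own; myfun and the index labels are made up), producing numpy.bool_ elements that a pandas object accepts as a mask:

import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=pd.Index(['a', 'b', 'c']))
myfun = lambda label: label != 'b'   # hypothetical predicate on index labels

mask = s.index.map(myfun).values.astype(np.bool_)  # the variant that casts
print(type(mask[0]))   # <class 'numpy.bool_'>
print(s[mask])         # selects rows 'a' and 'c'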