Subclass pandas DataFrame with required argument - pandas

I'm working on a new data structure that subclasses pandas DataFrame. I want to enforce my new data structure to have new_property, so that it can be processed safely later on.
However, I'm running into error when using my new data structure, because the constructor gets called by some internal pandas function without the required property.
Here is my new data structure.
import pandas as pd
class MyDataFrame(pd.DataFrame):
#property
def _constructor(self):
return MyDataFrame
_metadata = ['new_property']
def __init__(self, data, new_property, index=None, columns=None, dtype=None, copy=True):
super(MyDataFrame, self).__init__(data=data,
index=index,
columns=columns,
dtype=dtype,
copy=copy)
self.new_property = new_property
Here is an example that causes error
data1 = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [15, 25, 30], 'd': [1, 1, 2]}
df1 = MyDataFrame(data1, new_property='value')
df1[['a', 'b']]
Here is the error message
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-
packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-33-b630fbf14234>", line 1, in <module>
df1[['a', 'b']]
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2053, in __getitem__
return self._getitem_array(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2098, in _getitem_array
return self.take(indexer, axis=1, convert=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1670, in take
result = self._constructor(new_data).__finalize__(self)
TypeError: __init__() missing 1 required positional argument: 'new_property'
Is there a fix to this or an alternative way to design this to enforce my new data structure to have new_property?
Thanks in advance!

This question has been answered by a brilliant pandas developer. See this issue for more details. Pasting the answer here.
class MyDataFrame(pd.DataFrame):
#property
def _constructor(self):
return MyDataFrame._internal_ctor
_metadata = ['new_property']
#classmethod
def _internal_ctor(cls, *args, **kwargs):
kwargs['new_property'] = None
return cls(*args, **kwargs)
def __init__(self, data, new_property, index=None, columns=None, dtype=None, copy=True):
super(MyDataFrame, self).__init__(data=data,
index=index,
columns=columns,
dtype=dtype,
copy=copy)
self.new_property = new_property
data1 = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [15, 25, 30], 'd': [1, 1, 2]}
df1 = MyDataFrame(data1, new_property='value')
df1[['a', 'b']].new_property
Out[121]: 'value'
MyDataFrame(data1)
TypeError: __init__() missing 1 required positional argument: 'new_property'

I know this is an old issue, but I wanted to extend on hlu's answer.
When implementing the answer described by hlu, I was getting the following error when just trying to print the subclassed DataFrame: AttributeError: 'internal_constructor' object has no attribute '_from_axes'
To fix this, I have used an object instead of the function used in hlu's answer to be able to implement the _from_axes method on the callable.
There is no classmethod type decorator for the _internal_constructor class, so instead we instantiate it with the callers class so it can be used when the _internal_constructor is called.
class MyDataFrame(pd.DataFrame):
#property
def _constructor(self):
return MyDataFrame._internal_constructor(self.__class__)
class _internal_constructor(object):
def __init__(self, cls):
self.cls = cls
def __call__(self, *args, **kwargs):
kwargs['my_required_argument'] = None
return self.cls(*args, **kwargs)
def _from_axes(self, *args, **kwargs):
return self.cls._from_axes(*args, **kwargs)

Related

Python Pandas: itemsize functions throws AttributeError: 'Series' object has no attribute 'itemsize'

I am new to Python Pandas, and, as part of that, I wanted to create a pandas.Series and print the itemsize of the Series. However, while I was doing that, it returned an attribute error.
Following is my code:
import pandas as pd
data = [1, 2, 3, 4]
s = pd.Series(data)
print(s.itemsize)
This however does not return the itemsize and throws the following error:
Traceback (most recent call last):
File "c:\Users\USER\filename.py", line 32, in <module>
print(s.itemsize)
^^^^^^^^^^
File "C:\Users\USER\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\generic.py", line 5902, in __getattr__
return object.__getattribute__(self, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Series' object has no attribute 'itemsize'```
What might the reason for this be?
You probably confused it with numpy.ndarray.itemsize in numpy.
In pandas, it's pandas.Series.dtype.itemsize :
import pandas as pd
data = [1, 2, 3, 4]
s = pd.Series(data)
Output :
print(s.dtype.itemsize)
#8

Cannot update dictionary

I have a dict_ class that attempts to copy the built-in dict class.
Here is that class (the new function returns the original object):
class dict_:
def __init__(self, *args, **kwargs):
self.kv = kwargs
if not self.kv:
for kv in args:
for k, v in kv:
self.kv.update({k: v})
def __str__(self):
return "%s" % self.kv
def __getitem__(self, item):
return self.kv[item]
def update(self, *args):
self.kv.update(args)
I've called it like this:
from Dodger.dodger import *
term = new(System())
a = new(dict_(a=1, b=2))
a.update(new(dict_(c=3)))
term.println(a)
This is supposed to modify a to {"a": 1, "b": 2, "c": 3} but instead it gives me this error:
Traceback (most recent call last):
File "C:/free_time/Dodger/dodger_test.py", line 5, in <module>
File "C:\free_time\Dodger\dodger.py", line 176, in update
File "C:\free_time\Dodger\dodger.py", line 173, in __getitem__
KeyError: 0
Why is it giving a KeyError? What does the 0 mean? (I am using python 3.8.2)
I figured out how to solve this problem. I just have to implement __setitem__ too:
def __setitem__(self, key, value):
"""Set self[key] to value"""
self.kv[key] = value

model.predict(sample) returning TypeError: cannot perform reduce with flexible type

I keep running into this error when I try to predict based on fitted model.
training, testing = train_test_split(gesture, test_size = 0.2, random_state = 0)
x = training.drop('CLASS', axis = 1) # remove the Class column from Training dataframe
y = testing.drop('CLASS', axis = 1) # remove the Class column from Testing dataframe
f_train = x.values.tolist()
l_train = training['CLASS'].values.tolist() # make a list of class identifiers from Training dataframe
f_test = y.values.tolist()
knn = KNeighborsRegressor(n_neighbors = 5)
knn.fit(f_train, l_train)
predictions = knn.predict(f_test)
The error occurs in the last line of the above code and the error message is given below:
Traceback (most recent call last):
File "C:\Users\Umair Khan\Dropbox\`Shift betweeen PCs\Work\EMG Hand Gesture\Codes\ML_on_CSV.py", line 39, in <module>
predictions = knn.predict(f_test)
File "C:\Users\Umair Khan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\neighbors\_regression.py", line 185, in predict
y_pred = np.mean(_y[neigh_ind], axis=1)
File "<__array_function__ internals>", line 6, in mean
File "C:\Users\Umair Khan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\fromnumeric.py", line 3335, in mean
out=out, **kwargs)
File "C:\Users\Umair Khan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\_methods.py", line 151, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
f_test is a list of lists like such [[16, 30, 35, 250, -1, 0.5, 35, 0.03, 0.02], [16, 30, 35, 250, -1, 0.5, 35, 0.03, 0.02]]
I have also tried passing an array in predict(sample) but the issue still remains.
predictions = knn.predict(np.array(f_test).astype(np.float))
We need to see more of the error traceback. And info on the function inputs, particularly shape and dtype.
I've seen this error message when working with structured arrays. But it's not obvious where those might arise in your code.
In [15]: np.ones((2,), dtype='i,i')
Out[15]: array([(1, 1), (1, 1)], dtype=[('f0', '<i4'), ('f1', '<i4')])
In [16]: np.sum(np.ones((2,), dtype='i,i'))
....
---> 87 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
88
89
TypeError: cannot perform reduce with flexible type
Solved:
changed dtype of l_train from string to float and the error disappeared. f_train and f_test were already of type float.

Python 3.5 Trying to plot PCA with sklearn and matplotlib

Using the following code generates the error: TypeError: float() argument must be a string or a number, not 'Pred':
I am struggling to figure out what is causing this error to be thrown.
self.features is a list composed of three floats ex. [1.1, 1.2, 1.3]
an example of self.features:
[array([-1.67191985, 0.1 , 9.69981494]), array([-0.68486623, 0.05 , 9.99085024]), array([ -1.36 , 0.1 , 10.44720459]), array([-2.46918915, 0. , 3.5483372 ]), array([-0.835 , 0.1 , 4.02740479])]
This is the method where the error is being thrown.
def pca(self):
pca = PCA(n_components=2)
x_np = np.asarray(self.features)
pca.fit(x_np)
X_reduced = pca.transform(x_np)
plt.figure(figsize=(10, 8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdBu')
plt.xlabel('First component')
plt.ylabel('Second component')
The full trace back is:
Traceback (most recent call last):
File "/Users/user/PycharmProjects/Post-Translational-Modification-
Prediction/pred.py", line 244, in <module>
y.generate_pca()
File "/Users/user/PycharmProjects/Post-Translational-Modification-
Prediction/pred.py", line 222, in generate_pca
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdBu')
File "/usr/local/lib/python3.5/site-packages/matplotlib/pyplot.py",
line 3435, in scatter
edgecolors=edgecolors, data=data, **kwargs)
File "/usr/local/lib/python3.5/site-packages/matplotlib/__init__.py",
line 1892, in inner
return func(ax, *args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/matplotlib/axes/_axes.py", line 3976, in scatter
c_array = np.asanyarray(c, dtype=float)
File "/usr/local/lib/python3.5/site-packages/numpy/core/numeric.py", line 583, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
TypeError: float() argument must be a string or a number, not 'Pred'
The suggested fix by #WhoIsJack is to add np.arange(len(self.features))
The functional code for those who run into similar issues is:
def generate_pca(self):
y= np.arange(len(self.features))
pca = PCA(n_components=2)
x_np = np.asarray(self.features)
pca.fit(x_np)
X_reduced = pca.transform(x_np)
plt.figure(figsize=(10, 8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdBu')
plt.xlabel('First component')
plt.ylabel('Second component')
plt.show()

Shape must be rank 0 but is rank 1, parse_single_sequence_example

For the past few days I have been having an issue with serializing data to tfrecord format and then subsequently deserializing it using parse_single_sequence example. I am attempting to retrieve data for use with a fairly standard RNN model, however this is my first attempt at using the tfrecords format and the associated pipeline that goes with it.
Here is a toy example to reproduce the issue I am having:
import tensorflow as tf
import tempfile
from IPython import embed
sequences = [[1, 2, 3], [4, 5, 1], [1, 2]]
label_sequences = [[0, 1, 0], [1, 0, 0], [1, 1]]
def make_example(sequence, labels):
ex = tf.train.SequenceExample()
sequence_length = len(sequence)
ex.context.feature["length"].int64_list.value.append(sequence_length)
fl_tokens = ex.feature_lists.feature_list["tokens"]
fl_labels = ex.feature_lists.feature_list["labels"]
for token, label in zip(sequence, labels):
fl_tokens.feature.add().int64_list.value.append(token)
fl_labels.feature.add().int64_list.value.append(label)
return ex
writer = tf.python_io.TFRecordWriter('./test.tfrecords')
for sequence, label_sequence in zip(sequences, label_sequences):
ex = make_example(sequence, label_sequence)
writer.write(ex.SerializeToString())
writer.close()
tf.reset_default_graph()
file_name_queue = tf.train.string_input_producer(['./test.tfrecords'], num_epochs=None)
reader = tf.TFRecordReader()
context_features = {
"length": tf.FixedLenFeature([], dtype=tf.int64)
}
sequence_features = {
"tokens": tf.FixedLenSequenceFeature([], dtype=tf.int64),
"labels": tf.FixedLenSequenceFeature([], dtype=tf.int64)
}
ex = reader.read(file_name_queue)
# Parse the example (returns a dictionary of tensors)
context_parsed, sequence_parsed = tf.parse_single_sequence_example(
serialized=ex,
context_features=context_features,
sequence_features=sequence_features
)
context = tf.contrib.learn.run_n(context_parsed, n=1, feed_dict=None)
print(context[0])
sequence = tf.contrib.learn.run_n(sequence_parsed, n=1, feed_dict=None)
print(sequence[0])
The associated stack trace is:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/common_shapes.py", line 594, in call_cpp_shape_fn
status)
File "/usr/lib/python3.5/contextlib.py", line 66, in exit
next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors.py", line 463, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InvalidArgumentError: Shape must be rank 0 but is rank 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "my_test.py", line 51, in
sequence_features=sequence_features
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/parsing_ops.py", line 640, in parse_single_sequence_example
feature_list_dense_defaults, example_name, name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/parsing_ops.py", line 837, in _parse_single_sequence_example_raw
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_parsing_ops.py", line 285, in _parse_single_sequence_example
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2382, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1783, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/common_shapes.py", line 596, in call_cpp_shape_fn
raise ValueError(err.message)
ValueError: Shape must be rank 0 but is rank 1
I posted this as a potential issue over on github though it seems I may just be using it incorrectly: Tensorflow Github Issue
So with the background information out of the way, I'm just wondering if I am in fact making an error here? Any help in the right direction would be greatly appreciated, its been a few days and my poking around hasn't panned out. Thanks all!
Got it and it was a bad assumption on my part. The tf.TFRecordReader.read(queue, name=None) returns a tuple when I assumed it would have returned just the value not (key, value) which I was directly passing into the example parser.