AttributeError: 'Series' object has no attribute 'pipe' - pandas

I am getting the following error when I run my Python 3 Keras code on my EC2 instance. The code works fine in an Azure Jupyter Notebook.
Code:
import numpy
import pandas

numpy.random.seed(7)
dataframe = pandas.read_csv("some_data.csv", header=None)
df = dataframe
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
for c in char_cols:
    df[c] = pandas.factorize(df[c])[0]
dataframe = df
Error:
Traceback (most recent call last):
File "pi_8_1st_year.py", line 12, in <module>
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 1815, in __getattr__
(type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'pipe'
My configuration:
ubuntu#ipxxxx:~$ python3
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
_frozen_importlib:321: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
/usr/lib/python3.4/importlib/_bootstrap.py:321: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
return f(*args, **kwds)
>>> pandas.__version__
'0.13.1'
>>> import numpy
>>> numpy.__version__
'1.15.0'
>>> import sklearn
/usr/lib/python3.4/importlib/_bootstrap.py:321: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
return f(*args, **kwds)
/usr/lib/python3.4/importlib/_bootstrap.py:321: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
return f(*args, **kwds)
>>> sklearn.__version__
'0.19.2'
>>> import keras
Using TensorFlow backend.
/usr/lib/python3.4/importlib/_bootstrap.py:321: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
return f(*args, **kwds)
>>> keras.__version__
'2.2.2'
>>>

As the error suggests, you can't call pipe on a Series object in pandas 0.13.1 (Series.pipe was only added in a later pandas release, so upgrading pandas would also fix this). df.dtypes returns a Series, hence the error.
If you want to find columns of object types you can do it by:
s = (df.dtypes == 'object')
cols = s[s].index
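For completeness, here is how that selection slots back into the factorize loop from the question (a minimal sketch reusing the question's variable names):
char_cols = s[s].index  # the object-typed columns, no Series.pipe needed
for c in char_cols:
    df[c] = pandas.factorize(df[c])[0]  # integer-encode each text column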

Related

Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting a pandas DataFrame to a Polars DataFrame, but pyarrow throws an error.
My code:
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ],
        axis=0,
    )
    df_pl = pl.from_pandas(df)
Error:
File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
I tried changing the pandas DataFrame dtype to str and the problem is solved, but I don't want to change the dtypes. Is it a bug in pyarrow, or am I missing something?
Edit: Polars 0.13.42 and later
Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.
Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).
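For example (a minimal sketch; the file name comes from the question, and the exact keyword arguments may vary between Polars versions):
import polars as pl

# Reads the first sheet by default; pass sheet_name=... to pick a specific sheet.
df_pl = pl.read_excel("test.xlsx")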
Polars: prior to 0.13.42
I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.
For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
pandas_to_pydf(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
arrow_dict = {
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
str(col): _pandas_series_to_arrow(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
There are several lengthy discussions on this issue, such as these:
to_parquet can't handle mixed type columns #21228
pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349
This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column:
https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116
How to approach this depends on your data and its use, so I can't recommend a blanket solution (i.e., fixing your source Excel file, or changing the dtype to str).
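If you want to see which columns are affected before deciding, a hypothetical check like the following (not part of the original answer) lists the columns of the concatenated pandas DataFrame whose values mix Python types:
# 'df' is the concatenated pandas DataFrame from the question
mixed_cols = [c for c in df.columns if df[c].map(type).nunique() > 1]
print(mixed_cols)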
My problem was solved by saving the pandas DataFrame to CSV format and then importing the CSV file into Polars.
import os
import polars as pl
import pandas as pd

if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ],
        axis=0,
    )
    df.to_csv("temp.csv", index=False)
    df_pl = pl.read_csv("temp.csv")  # read eagerly; a lazy scan_csv would fail once the file is removed below
    os.remove("temp.csv")

TypeError: _any() missing 1 required keyword-only argument: 'where'

I am trying to read a file using pandas, but it is showing me a TypeError. I cannot discern why. Can someone help me?
Below is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#prepare the files
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
Traceback (most recent call last):
File "", line 1, in
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
low_memory=_c_parser_defaults["low_memory"],
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
iterator = kwds.get("iterator", False)
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1148, in read
names : iterable of names
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in init
d = {'col1': [1, 2], 'col2': [3, 4]}
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 233, in init_dict
datelike_vals = maybe_infer_to_datetimelike(values)
TypeError: _any() missing 1 required keyword-only argument: 'where'
It could be that the read_csv method has trouble parsing your file without any further hints.
Try passing additional keyword arguments such as sep, usecols, etc.
Refer to the documentation for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
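For instance (a minimal sketch; the separator and column names here are assumptions about your CSV, not taken from the question):
import pandas as pd

# Spell out the delimiter and restrict to the columns you need (hypothetical names).
df = pd.read_csv("~/Downloads/Boston.csv", sep=",", usecols=["CRIM", "ZN", "MEDV"])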

Assertion error when making an MP4 video out of numpy arrays with OpenCV

I have this python code that should make a video:
import cv2
import numpy as np
out = cv2.VideoWriter("/tmp/test.mp4",
                      cv2.VideoWriter_fourcc(*'MP4V'),
                      25,
                      (500, 500),
                      True)
data = np.zeros((500, 500, 3))
for i in xrange(500):
    out.write(data)
out.release()
I expect a black video but the code throws an assertion error:
$ python test.py
OpenCV(3.4.1) Error: Assertion failed (image->depth == 8) in writeFrame, file /io/opencv/modules/videoio/src/cap_ffmpeg.cpp, line 274
Traceback (most recent call last):
File "test.py", line 11, in <module>
out.write(data)
cv2.error: OpenCV(3.4.1) /io/opencv/modules/videoio/src/cap_ffmpeg.cpp:274: error: (-215) image->depth == 8 in function writeFrame
I tried various fourcc values but none seem to work.
Following @jeru-luke's and @dan-masek's comments:
import cv2
import numpy as np

out = cv2.VideoWriter("/tmp/test.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'),
                      25,
                      (1000, 500),
                      True)
# uint8 frames shaped (height, width, channels) = (500, 1000, 3) to match the (1000, 500) frame size
data = np.transpose(np.zeros((1000, 500, 3), np.uint8), (1, 0, 2))
for i in xrange(500):
    out.write(data)
out.release()
The problem is that you did not specify the data type of elements when calling np.zeros. As the documentation states, by default numpy will use float64.
>>> import numpy as np
>>> np.zeros((500,500,3)).dtype
dtype('float64')
However, the VideoWriter implementation only supports 8 bit image depth (as the "(image->depth == 8)" part of the error message suggests).
The solution is simple -- specify the appropriate data type, in this case uint8.
data = np.zeros((500,500,3), dtype=np.uint8)

hstack csr matrix with pandas array

I am doing an exercise on Amazon Reviews; below is the code.
Basically, I am not able to add a column (pandas array) to the CSR matrix that I got after applying BoW.
Even though the number of rows in both matrices matches, I am not able to get through.
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE
#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')
filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape
display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final['Score'].value_counts()
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)
final_counts.shape
type(final_counts)
positive_negative = final['Score']
#Below is giving error
final_counts = hstack((final_counts,positive_negative))
sparse.hstack combines the coo format matrices of the inputs into a new coo format matrix.
final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.
positive_negative is a column of a DataFrame. Look at
sparse.coo_matrix(positive_negative)
It probably is a (1,n) sparse matrix. But to combine it with final_counts it needs to be (n,1) shaped.
Try creating the sparse matrix, and transposing it:
sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
I used the below but am still getting an error:
merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Below is the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in bmat
dtype = upcast(*all_dtypes) if all_dtypes else None
File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))
Even I was facing the same issue with sparse matrices. You can convert the CSR matrix to dense with todense() and then use np.hstack((dataframe.values, converted_dense_matrix)); that will work, but you can't handle sparse matrices with numpy.hstack.
However, for a very large dataset, converting to a dense matrix is not a good idea. In your case scipy's hstack won't work because the data types differ: hstack(int, object).
Try positive_negative = final['Score'].values and scipy.sparse.hstack it. If that doesn't work, can you give me the output of positive_negative.dtype?
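A minimal sketch of that idea, with the extra assumption (not in the answer above) of mapping the 'positive'/'negative' labels to integers first, since the object dtype is what makes scipy's upcast fail:
import scipy.sparse

# Map the string labels to integers; dtype('O') is what upcast() rejects.
labels = final['Score'].map({'negative': 0, 'positive': 1}).values

# Reshape to a single column and stack it next to the bag-of-words counts.
merged_data = scipy.sparse.hstack(
    (final_counts, scipy.sparse.coo_matrix(labels.reshape(-1, 1)))
)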

ImportError: No module named Image when importing ironpython dll

I have a python package called CoreCode which I have compiled using clr.CompileModules() in IronPython 2.7.5. This generated a file called CoreCode.dll. I then import this dll into my IronPython module by using clr.AddReference(). I know the dll works because I have successfully tested some of the classes as shown below. However, my problem lies with the Base_Slice_Previewer class. This class makes use of Image and ImageDraw from PIL in order to generate and save a bitmap file.
I know the problem doesn't lie with PIL because the package works perfectly well when run in Python 2.7. I'm assuming that this error is coming up because IronPython can't find PIL but I'm not sure how to work around this problem. Any help will be much appreciated.
Code to create the dll
import clr
clr.CompileModules("CoreCode.dll", "CoreCode\AdvancedFileHandlers\ScannerSliceWriter.py", "CoreCode\AdvancedFileHandlers\__init__.py", "CoreCode\MarcamFileHandlers\MTTExport.py", "CoreCode\MarcamFileHandlers\MTTImporter.py", "CoreCode\MarcamFileHandlers\__init__.py", "CoreCode\Visualizer\SlicePreviewMaker.py", "CoreCode\Visualizer\__init__.py", "CoreCode\Timer.py", "CoreCode\__init__.py")
Test for Timer.py
>>> import clr
>>> clr.AddReference('CoreCode.dll')
>>> from CoreCode.Timer import StopWatch
>>> stop_watch = StopWatch()
>>> print stop_watch.__str__()
0:00:00:00 0:00:00:00
>>>
Test for MTTExport.py
>>> from CoreCode.MarcamFileHandlers.MTTExport import MTT_Layer_Exporter
>>> mttlayer = MTT_Layer_Exporter()
>>> in_val = (2**20)+ (2**16) + 2
>>> bytes = mttlayer.write_lf_int(in_val, force_full_size=True)
>>> print "%s = %s" %(bytes, [hex(ord(x)) for x in bytes])
à ◄ ☻ = ['0xe0', '0x0', '0x0', '0x0', '0x0', '0x11', '0x0', '0x2']
>>>
Test for SlicePreviewMaker.py
>>> from CoreCode.Visualizer.SlicePreviewMaker import Base_Slice_Previewer
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "CoreCode\Visualizer\SlicePreviewMaker", line 1, in <module>
ImportError: No module named Image
>>>