hstack csr matrix with pandas array

I am doing an exercise on Amazon Reviews; below is the code.
Basically, I am not able to add a column (a pandas Series) to the CSR matrix I got after applying bag-of-words.
Even though the number of rows in both objects matches, I am not able to get through.
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE
#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')
filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape
display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final['Score'].value_counts()
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)
final_counts.shape
type(final_counts)
positive_negative = final['Score']
# The line below raises an error
final_counts = hstack((final_counts,positive_negative))

sparse.hstack combines the COO-format versions of its inputs into a new COO-format matrix.
final_counts is a CSR matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.
positive_negative is a column of a DataFrame. Look at
sparse.coo_matrix(positive_negative)
It is probably a (1, n) sparse matrix, but to combine it with final_counts it needs to be shaped (n, 1).
Try creating the sparse matrix and transposing it:
sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
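For illustration, a minimal self-contained sketch with toy data (counts and labels are stand-ins, not names from the post):
import numpy as np
import pandas as pd
from scipy import sparse

counts = sparse.csr_matrix(np.arange(12).reshape(4, 3))  # stands in for final_counts, shape (4, 3)
labels = pd.Series([1, 0, 1, 0])                         # stands in for positive_negative

print(sparse.coo_matrix(labels.values).shape)  # (1, 4) -- one row, the wrong orientation
merged = sparse.hstack((counts, sparse.coo_matrix(labels.values).T))  # transpose to (4, 1)
print(merged.shape)  # (4, 4)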

I used the suggestion below but am still getting an error:
merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Below is the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in bmat
    dtype = upcast(*all_dtypes) if all_dtypes else None
  File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))

I was facing the same issue with sparse matrices. You can convert the CSR matrix to dense with todense() and then use np.hstack((dataframe.values, converted_dense_matrix)); that will work, but you can't handle sparse matrices with numpy.hstack.
However, for a very large data set, converting to a dense matrix is not a good idea. In your case scipy's hstack won't work because the data types differ: hstack(int, object).
Try positive_negative = final['Score'].values and scipy.sparse.hstack it. If that doesn't work, can you give me the output of positive_negative.dtype?
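A hedged sketch of that suggestion: Score holds the strings 'positive'/'negative' (dtype object), which scipy cannot upcast next to the int64 counts, hence the TypeError above. Encoding the labels as integers first (the 1/0 mapping here is an illustrative choice, not from the original post) should let hstack succeed:
import scipy.sparse

# Score contains 'positive'/'negative' strings (dtype object); map them to ints
# first so scipy can upcast them alongside the int64 counts (1/0 is an assumed encoding)
positive_negative = final['Score'].map({'positive': 1, 'negative': 0}).values
merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))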

Related

ValueError: could not convert string to float: in python colab

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("quotation_test_data.csv")
print(df)
import seaborn as sns
df.describe()  # describe the data: count, mean, and so on
df.isnull().sum()  # check whether any column is missing data
df["Net profit"].value_counts()
sns.countplot(df['Net profit'])
# split the data into x and y, then into train and test sets
x = df.iloc[:, :-1]  # all rows, every column except the last
y = df.iloc[:, -1]   # the last column as the target
# now split the data into train and test sets using sklearn
from sklearn.model_selection import train_test_split
x_train , x_test,y_train,y_test = train_test_split(x,y,random_state=100)
x_train.shape
y_train.shape
from sklearn.tree import DecisionTreeClassifier  # initialize the decision tree
clf = DecisionTreeClassifier(criterion="gini", max_depth=7, min_samples_split=10, random_state=10)
import pandas as pd
data = pd.read_csv('quotation_test_data.csv')
dataconver = data.replace(r'[^\d.]', '', regex=True).astype(float)  # strip non-numeric characters
I was trying to build a decision tree, and then this was shown to me:
ValueError Traceback (most recent call last)
<ipython-input-75-df55a55b03a4> in <module>
----> 1 dataconver = data.replace('[^\d.]', regex=True).astype(float)
2
7 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
1199 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
1200 # Explicit copy, or required since NumPy can't view from / to object.
-> 1201 return arr.astype(dtype, copy=True)
1202
1203 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: 'Kellie Scott'

"im2col_out_cpu" not implemented for 'Byte'

I am trying to generate overlapping patches from an image of size (112, 112), but I am unable to do so. I have already tried a lot, but it didn't work out.
Code:
import torch
import numpy as np
import torch.nn as nn
from torch import nn
from PIL import Image
import cv2
import os
import math
import torch.nn.functional as F
import torchvision.transforms as T
from timm import create_model
from typing import List
import matplotlib.pyplot as plt
from torchvision import io, transforms
from utils_torch import Image, ImageDraw
from torchvision.transforms.functional import to_pil_image
IMG_SIZE = 112
# PATCH_SIZE = 64
resize = transforms.Resize((IMG_SIZE, IMG_SIZE))
img = resize(io.read_image("Adam_Brody_233.png"))
img = img.to(torch.float32)
image_size = 112
patch_size = 28
ac_patch_size = 12
pad = 4
img = img.unsqueeze(0)
soft_split = nn.Unfold(kernel_size=(ac_patch_size, ac_patch_size), stride=(patch_size, patch_size), padding=(pad, pad))
patches = soft_split(img).transpose(1, 2)
fig, ax = plt.subplots(16, 16)
for i in range(16):
    for j in range(16):
        sub_img = patches[:, i, j]
        ax[i][j].imshow(to_pil_image(sub_img))
        ax[i][j].axis('off')
plt.show()
Traceback
Traceback (most recent call last):
File "/home/cvpr/Documents/OPVT/unfold_ours.py", line 32, in <module>
patches = soft_split(img).transpose(1, 2)
File "/home/cvpr/anaconda3/envs/OPVT/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cvpr/anaconda3/envs/OPVT/lib/python3.7/site-packages/torch/nn/modules/fold.py", line 295, in forward
self.padding, self.stride)
File "/home/cvpr/anaconda3/envs/OPVT/lib/python3.7/site-packages/torch/nn/functional.py", line 3831, in unfold
_pair(dilation), _pair(padding), _pair(stride))
RuntimeError: "im2col_out_cpu" not implemented for 'Byte'
Yes, this is an open issue in PyTorch. A simple fix is to convert your image tensor from ints to floats; you can do it like this:
img = img.to(torch.float32)
This should solve your problem
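A minimal self-contained sketch (random uint8 data standing in for the image read by io.read_image), with the cast applied right before unfolding:
import torch
import torch.nn as nn

# random byte data standing in for the (3, 112, 112) image
img = torch.randint(0, 256, (1, 3, 112, 112), dtype=torch.uint8)
soft_split = nn.Unfold(kernel_size=(12, 12), stride=(28, 28), padding=(4, 4))

# Unfold's im2col kernel has no Byte implementation on CPU, so cast to float first
patches = soft_split(img.to(torch.float32)).transpose(1, 2)
print(patches.shape)  # torch.Size([1, 16, 432])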

import pandas error : Traceback (most recent call last) and expected string or bytes-like object

I installed GIS-Pro and use Jupyter; I believe Jupyter is included in the GIS-Pro package, and I use it to write Python code. Since yesterday, I've been getting the following error when executing import pandas as pd:
TypeError Traceback (most recent call last)
C:\Users\AppData\Local\Temp\2/ipykernel_23172/4080736814.py in <module>
----> 1 import pandas as pd
C:\ArcGISPro28\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\__init__.py in <module>
# numpy compat
from pandas.compat import (
np_version_under1p18 as _np_version_under1p18,
is_numpy_dev as _is_numpy_dev,
C:\ArcGISPro28\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\compat\__init__.py in <module>
np_version_under1p20)
from pandas.compat.pyarrow import (
pa_version_under1p0,
pa_version_under2p0,
C:\ArcGISPro28\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\compat\pyarrow.py in <module>
_pa_version = pa.__version__
_palv = Version(_pa_version)
pa_version_under1p0 = _palv < Version("1.0.0")
pa_version_under2p0 = _palv < Version("2.0.0")
C:\ArcGISPro28\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\util\version\__init__.py in __init__(self, version)
# Validate the version and parse it into pieces
match = self._regex.search(version)
if not match:
raise InvalidVersion(f"Invalid version: '{version}'")
TypeError: expected string or bytes-like object

Couldn't load pyspark data frame to decision tree algorithm. It says can't work with pyspark data frame

I was working on IBM's data platform. I was able to load data into a pyspark data frame and created a Spark SQL table. After splitting the data set, I fed it into the classification algorithm. It raises errors saying the Spark SQL data can't be loaded and that ndarrays are required.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
RM = RandomForestRegressor()
RM.fit(X_train.reshape(1,-1),y_train)
Error:
TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>
After this error, I did something like this:
x = spark.sql('select Id,YearBuilt,MoSold,YrSold,Fireplaces FROM Train').toPandas()
y = spark.sql('Select SalePrice FROM Train where SalePrice is not null').toPandas()
Error:
AttributeError Traceback (most recent call last)
in ()
5 X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
6 RM = RandomForestRegressor()
----> 7 RM.fit(X_train.reshape(1,-1),y_train)
/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065         if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066             return self[name]
-> 5067         return object.__getattribute__(self, name)
   5068
   5069     def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'reshape'
As the sklearn documentation says:
"""
X : array-like or sparse matrix, shape = [n_samples, n_features]
"""
regr = RandomForestRegressor()
regr.fit(X, y)
So firstly, you're passing a pandas.DataFrame as the X argument instead of an array.
Secondly, the reshape() method is not an attribute of DataFrame; it belongs to numpy arrays:
import numpy as np
x = np.array([[2,3,4], [5,6,7]])
np.reshape(x, (3, -1))
Hope this helps.
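Putting both points together, a hedged sketch of how the call could look (note that train_test_split returns X_train, X_test, y_train, y_test in that order, which also differs from the post):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# .values converts the pandas DataFrames produced by toPandas() into numpy arrays
X = x.values
Y = y.values.ravel()  # flatten the single-column target to 1-D

# note the return order: both X splits come before the y splits
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

RM = RandomForestRegressor()
RM.fit(X_train, y_train)  # 2-D feature array: no reshape needed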

Assertion error when making an MP4 video out of numpy arrays with OpenCV

I have this python code that should make a video:
import cv2
import numpy as np
out = cv2.VideoWriter("/tmp/test.mp4",
                      cv2.VideoWriter_fourcc(*'MP4V'),
                      25,
                      (500, 500),
                      True)
data = np.zeros((500, 500, 3))
for i in xrange(500):
    out.write(data)
out.release()
I expect a black video but the code throws an assertion error:
$ python test.py
OpenCV(3.4.1) Error: Assertion failed (image->depth == 8) in writeFrame, file /io/opencv/modules/videoio/src/cap_ffmpeg.cpp, line 274
Traceback (most recent call last):
File "test.py", line 11, in <module>
out.write(data)
cv2.error: OpenCV(3.4.1) /io/opencv/modules/videoio/src/cap_ffmpeg.cpp:274: error: (-215) image->depth == 8 in function writeFrame
I tried various fourcc values but none seem to work.
According to jeru-luke's and dan-masek's comments:
import cv2
import numpy as np
out = cv2.VideoWriter("/tmp/test.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'),
                      25,
                      (1000, 500),
                      True)
data = np.transpose(np.zeros((1000, 500, 3), np.uint8), (1, 0, 2))
for i in xrange(500):
    out.write(data)
out.release()
The problem is that you did not specify the data type of elements when calling np.zeros. As the documentation states, by default numpy will use float64.
>>> import numpy as np
>>> np.zeros((500,500,3)).dtype
dtype('float64')
However, the VideoWriter implementation only supports 8 bit image depth (as the "(image->depth == 8)" part of the error message suggests).
The solution is simple -- specify the appropriate data type, in this case uint8.
data = np.zeros((500,500,3), dtype=np.uint8)
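Applied to the original snippet, a minimal corrected version (range is used here, since xrange exists only in Python 2):
import cv2
import numpy as np

out = cv2.VideoWriter("/tmp/test.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'),
                      25,
                      (500, 500),
                      True)

data = np.zeros((500, 500, 3), dtype=np.uint8)  # 8-bit depth, as VideoWriter expects
for i in range(500):
    out.write(data)
out.release()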