numpy concatenate dimension mismatch - numpy

I've been running into an issue with numpy's concatenate that I've been unable to make sense of, and I was hoping someone had encountered and resolved the same problem. I'm attempting to join two arrays created by scikit-learn's TfidfVectorizer and LabelBinarizer, but I get the error "arrays must have same number of dimensions", despite the fact that the inputs are a (77946, 12157) array and a (77946, 1000) array, respectively. (As requested in the comments, a reproducible example is at the bottom.)
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()
tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape
print B.shape
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)
I initially thought that B being encoded as a sparse matrix might be causing the issue, but converting it to a dense matrix didn't resolve it. I had the same issue when using hstack instead.
Even more peculiar is that adding in a third LabelBinarizer matrix results in no error:
TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))
Here is the error message:
Traceback (most recent call last):
File "smallerdimensions.py", line 49, in <module>
Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions
Thank you for any help you can provide. Here is a reproducible example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np
tweets=["Jazz for a Rainy Afternoon","RT: #mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]
DS=pd.DataFrame({"tweet":tweets,"location":location})
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
DS['location']=DS['location'].fillna("none")
tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)
Numpy is v1.6.1, pandas is v0.12.0, and scikit-learn is v0.14.1.
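For anyone hitting the same error: np.concatenate (and np.hstack) do not understand scipy sparse matrices, which is one likely contributor to the dimension error here. A workaround that typically works is scipy.sparse.hstack, which accepts a mix of dense arrays and sparse matrices. A minimal sketch, reusing the A and B from above (the expected combined width, 12157 + 1000 = 13157 columns, is an assumption based on the shapes quoted in the question):
from scipy.sparse import hstack

# hstack handles both dense ndarrays and sparse matrices, so the TF-IDF
# matrix B can stay sparse; the result holds the columns of A followed
# by the columns of B.
Xtrain = hstack([A, B]).tocsr()
print(Xtrain.shape)  # expected: (77946, 13157)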

Related

TensorFlow:Failed to convert a NumPy array to a Tensor (Unsupported object type int)

I am practicing on this Kaggle dataset on car price prediction (https://www.kaggle.com/hellbuoy/car-price-prediction). I don't know why I am receiving this error.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras import layers,models
cars_data=pd.read_csv('/content/CarPrice_Assignment.csv')
cars_data.head()
cars_data.info()
cars_data.describe()
train_data=cars_data.iloc[:103]
train_data=train_data.drop('price',axis=1)
train_data=np.asarray(train_data.values)
train_targets=cars_data.price.iloc[:103]
train_targets=np.asarray(train_targets)
test_data=cars_data.iloc[103:165]
test_data=test_data.drop('price',axis=1)
test_data=np.asarray(test_data.values)
test_targets=cars_data.price.iloc[103:165]
test_targets=np.asarray(test_targets)
val_data=cars_data.iloc[165:]
val_data=val_data.drop('price',axis=1)
val_data=np.asarray(val_data.values)
val_targets=cars_data.price.iloc[165:]
val_targets=np.asarray(val_targets)
model=models.Sequential()
model.add(layers.Dense(10,activation='relu',input_shape=(25,)))
model.add(layers.Dense(8,activation='relu'))
model.add(layers.Dense(6,activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop',loss='mse',metrics=['mae'])
model.fit(train_data,train_targets,epochs=20,batch_size=1)
There are 2 things you need to address in your code.
Categorical Variables
By printing the value of train_data, I can see there are still some categorical variables in the form of strings. TensorFlow cannot process that kind of data directly, so you need to deal with the categorical variables first. See the answer to Best way to deal with categorical variables in regression problem - python as a starting point.
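As an illustration of one common approach (a sketch, not the only option; it assumes the string columns in CarPrice_Assignment.csv should simply be one-hot encoded):
import pandas as pd

cars_data = pd.read_csv('/content/CarPrice_Assignment.csv')

# One-hot encode every remaining string (object) column so the frame
# becomes purely numeric before it is converted with np.asarray
categorical_cols = cars_data.select_dtypes(include='object').columns
cars_data = pd.get_dummies(cars_data, columns=categorical_cols)
Note that one-hot encoding changes the number of columns, so input_shape=(25,) in the first Dense layer would have to be updated to the new width.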
target shape
Your train_targets has shape (103,), which means it is a 1D array. The expected target shape for TensorFlow (for a simple regression problem) is (103, 1). Modify your code like this to reshape the value:
train_targets=np.asarray(train_targets).reshape(-1,1)
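For illustration, the effect of that reshape on a 1D array of the same length (103 is just the row count implied by iloc[:103] above):
import numpy as np

train_targets = np.zeros(103)                  # stand-in for the sliced price column
print(train_targets.shape)                     # (103,)
train_targets = train_targets.reshape(-1, 1)   # -1 keeps the row count, 1 adds a column axis
print(train_targets.shape)                     # (103, 1)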

TypeError: _parse_args() got an unexpected keyword argument 'size'

I got an error. I found that a similar error has been posted before, but it did not lead me to a conclusion. Here is the code; the error is related to size. My friend used the same code and it worked for him, but I got the error mentioned above.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
n=100
alpha=5
alpha1=2
np.random.seed(1)
x=10*ss.uniform.rvs(size=n)
y=alpha+alpha1*x+ss.norm(loc=0, scale=1, size=n)
plt.figure()
plt.plot(x,y,"o", ms=10)
xx=np.array([0,10])
plt.plot(xx, alpha+alpha1*x)
ss.norm does not accept the size argument you passed to it.
Did you mean:
y=alpha+alpha1*x+ss.norm.rvs(loc=0, scale=1, size=n)
Output of the first plot: (image omitted)
I am guessing there is a typo in your last line too, did you mean:
plt.plot(xx, alpha+alpha1*xx)
Output of this: (image omitted)
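Putting the two corrections together, the full script might look like this (a sketch: only the two lines discussed above are changed, everything else is kept from the question):
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

n = 100
alpha = 5
alpha1 = 2
np.random.seed(1)
x = 10 * ss.uniform.rvs(size=n)
# draw the noise with .rvs(), which does accept size
y = alpha + alpha1 * x + ss.norm.rvs(loc=0, scale=1, size=n)

plt.figure()
plt.plot(x, y, "o", ms=10)
xx = np.array([0, 10])
# plot the line over xx, not x
plt.plot(xx, alpha + alpha1 * xx)
plt.show()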

How to write text to a file using TextLineDataset

I am trying to read text from a file Shakespear.txt line by line using TensorFlow's TextLineDataset, split the words in each line, and write the words to another file txt.txt, one word per line. Here is my code:
import tensorflow as tf
tf.enable_eager_execution()
BATCH_SIZE=2
#from tensorflow.keras.model import Sequential
dataset_in_lines=tf.data.TextLineDataset("Shakespear.txt")
dataset=dataset_in_lines.map(lambda string: tf.string_split([string]).values)
with open("txt.txt","w") as f:
for k in dataset.take(2):
for x in k:
f.write("\n".join(x))
When I run it, it gives the error Cannot iterate over a scalar tensor at the f.write line. Please help me figure out the issue.
It would be helpful if you could share the Shakespear.txt file, but based on your error, it seems like it is receiving the tensor, not the actual value.
So, you first need to get the value from tensor k, you can use k.numpy().
Replace for x in k: with for x in k.numpy():
Let us know if it works.
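For context, here is one way that suggestion might be fleshed out in the full snippet (a sketch assuming TF 1.x with eager execution; note that the values coming out of k.numpy() are byte strings, so each word is decoded before writing, which also replaces the "\n".join call):
import tensorflow as tf
tf.enable_eager_execution()

dataset_in_lines = tf.data.TextLineDataset("Shakespear.txt")
dataset = dataset_in_lines.map(lambda s: tf.string_split([s]).values)

with open("txt.txt", "w") as f:
    for k in dataset.take(2):
        # k.numpy() is a 1-D array of byte strings, one per word
        for word in k.numpy():
            f.write(word.decode("utf-8") + "\n")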
I found a better way: replace dataset=dataset_in_lines.map(lambda string: tf.string_split([string]).values) with tokenizer.tokenize. The following code achieves the objective (see https://www.tensorflow.org/tutorials/load_data/text for more details):
import tensorflow as tf
tf.enable_eager_execution()
import tensorflow_datasets as tfds
tokenizer = tfds.features.text.Tokenizer()
dataset_in_lines=tf.data.TextLineDataset("Shakespear.txt")
vocabulary_set = set()
for x in dataset_in_lines:
    k=tokenizer.tokenize(x.numpy())
    vocabulary_set.update(k)
with open("txt.txt","w") as f:
    for x in vocabulary_set:
        f.write(x+"\n")

TypeError when setting y-axis range in pyplot

I am getting a TypeError like this for the following code.
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
import numpy as np
import matplotlib.pyplot as plt
import math
plt.axis([1,5000,100,10**20])
plt.xscale('log')
plt.yscale('log')
plt.savefig('test.png')
plt.close()
When I set the maximum value for the y-axis to 10^19, there is no error, but from 10^20 onward I start getting the TypeError described above.
Could you help me understand this error and set the y-axis range from 1 to 10^20?
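For what it's worth, a likely cause is that 10**20 is too large for NumPy's 64-bit integer types, so the axis limits end up in an object array that np.isfinite cannot process (10**19 still fits in an unsigned 64-bit integer, which would explain why it works). Passing the upper limit as a float is a simple workaround; a minimal sketch:
import matplotlib.pyplot as plt

# 1e20 is a float, so the limits stay in a numeric dtype that isfinite accepts
plt.axis([1, 5000, 100, 1e20])
plt.xscale('log')
plt.yscale('log')
plt.savefig('test.png')
plt.close()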

NameError: name 'pd' is not defined

I am attempting to run the following in Jupyter:
import pandas as pd
import matplotlib.pyplot as plt # plotting
import numpy as np # dense matrices
from scipy.sparse import csr_matrix # sparse matrices
%matplotlib inline
However, when loading the dataset with
wiki = pd.read_csv('people_wiki.csv')
# add id column
wiki['id'] = range(0, len(wiki))
wiki.head(10)
the following error persists
NameError Traceback (most recent call last)
<ipython-input-1-56330c326580> in <module>()
----> 1 wiki = pd.read_csv('people_wiki.csv')
2 # add id column
3 wiki['id'] = range(0, len(wiki))
4 wiki.head(10)
NameError: name 'pd' is not defined
Any suggestions appreciated
Select Restart & Clear Output and run the cells again from the beginning.
I had the same issue, and as Ivan suggested in the comment, this resolved it.
If you came here from a duplicate, notice also that your code needs to contain
import pandas as pd
in the first place. If you are using a notebook like Jupyter and it's already there, or if you just added it, you probably need to re-evaluate the cell, as suggested in the currently top-voted answer by martin-martin.
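For example, keeping the import in the same cell (or re-running the import cell before this one) ensures pd is defined when it is used:
import pandas as pd  # must run before pd.read_csv is called

wiki = pd.read_csv('people_wiki.csv')
wiki['id'] = range(0, len(wiki))  # add id column
wiki.head(10)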
Your Python version needs to be 3.6 or above; I think you have been using Python 2.7. Please select your Python environment version from the top right.
Be sure to load / import Pandas first
When stepping through the Anaconda Navigator demo, I found that pressing "play" on the first line before inputting the second line resolved the issue.