How to apply pandas data on word2vec - pandas

I am trying to use W2V.
I saved my preprocessed data as a pandas dataframe, and I want to apply the word2vec algorithm to my preprocessed data.
This is my data.
http://naver.me/IFjLAHld
This is my code.
from gensim.models.word2vec import Word2Vec
import pandas as pd
import numpy as np
df = pd.read_excel('re_nlp0820.xlsx')
model = Word2Vec(df['nlp'],
sg=1,
window=3,
min_count=1,
workers=4,
iter=1)
model.init_sims(replace=True)
model_result1 = model.wv.most_similar('국민', topn =20)
print(model_result1)
Please, help me

First you need to convert the data you are passing to the Word2Vec instance into a nested list where each list contains the tokenized form of the text. You can do so by:
from gensim.models.word2vec import Word2Vec
import pandas as pd
import numpy as np
import nltk
df = pd.read_excel('re_nlp0820.xlsx')
nlp = [nltk.word_tokenize(i) for i in df['nlp']]
model = Word2Vec(nlp,
sg=1,
window=3,
min_count=1,
workers=4,
iter=1)
model.init_sims(replace=True)
model_result1 = model.wv.most_similar('국민', topn =20)
print(model_result1)

Gensim's Word2Vec needs as its training corpus a re-iterable sequence, where each item is a list-of-words.
Your df['nlp'] is probably just a sequence of strings, so it's not in the right format. You should make sure each of its items is broken into a Python list that has your desired words as individual strings.
(Separately: min_count=1 is almost always a bad idea with this algorithm, which gives better results if rare words with few usage examples are discarded. And, you shouldn't need to call .init_sims() at all.)
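As a rough illustration of the corpus format described above, here is a minimal sketch that turns a column of strings into a list of token lists. The toy dataframe and the plain whitespace split are assumptions made for the example; real Korean text would likely need a proper tokenizer.

```python
import pandas as pd

# Hypothetical stand-in for df['nlp']: one whitespace-separated string per row.
df = pd.DataFrame({"nlp": ["국민 경제 정책", "국민 복지 정책 확대", None]})

# Drop missing rows, then split every string into a list of word strings.
# The result is a list of lists, which can be iterated over more than once.
sentences = [text.split() for text in df["nlp"].dropna().astype(str)]

# Sanity check: every item is a list, and every token is a str.
all_tokenized = all(
    isinstance(sentence, list) and all(isinstance(tok, str) for tok in sentence)
    for sentence in sentences
)
print(sentences)
print(all_tokenized)
```

A list like `sentences` can be passed as the first argument to Word2Vec. Note that in gensim 4.x the `iter` parameter was renamed to `epochs` (and `size` to `vector_size`), so code written for gensim 3.x may need adjusting.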

Related

Numpy broadcasting comparison report "'bool' object has no attribute 'sum'" error when dealing with large dataframe

I use NumPy broadcasting to get a difference matrix from a pandas dataframe. With a large dataframe, it reports the error "'bool' object has no attribute 'sum'", while with a small dataframe it runs fine.
I post the two csv files in the following links:
large file
small file
import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matrix)
When running this, I get the difference matrix.
When I switch to the large file, it reports the error above. Does anybody know why this happens?
EDIT: The NumPy version is 1.19.5
np.__version__
'1.19.5'
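For readers following along, here is a small sketch of what the broadcast comparison in the question computes, using a made-up 3x3 array. The intermediate boolean array has n * n * m elements, which grows quickly with the number of rows. Whether that is what triggers the error on the large file is an assumption; the post does not confirm it. The row-by-row variant is one possible way to get the same counts without the 3D intermediate.

```python
import numpy as np

# Toy data (hypothetical): 3 rows, 3 columns.
a = np.array([[1, 2, 3],
              [1, 0, 3],
              [4, 2, 3]])

# a[:, None] has shape (n, 1, m); comparing it with a (n, m)
# broadcasts to a boolean array of shape (n, n, m).
diff = a != a[:, None]
counts = diff.sum(2)  # number of differing columns for each pair of rows

# Same counts, one row at a time: only an (n, m) temporary exists per step.
counts_rowwise = np.array([(a != row).sum(1) for row in a])

print(diff.shape)
print(counts)
```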

TensorFlow:Failed to convert a NumPy array to a Tensor (Unsupported object type int)

I am practicing on this Kaggle dataset for car price prediction (https://www.kaggle.com/hellbuoy/car-price-prediction). I don't know why I am receiving this error.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras import layers,models
cars_data=pd.read_csv('/content/CarPrice_Assignment.csv')
cars_data.head()
cars_data.info()
cars_data.describe()
train_data=cars_data.iloc[:103]
train_data=train_data.drop('price',axis=1)
train_data=np.asarray(train_data.values)
train_targets=cars_data.price.iloc[:103]
train_targets=np.asarray(train_targets)
test_data=cars_data.iloc[103:165]
test_data=test_data.drop('price',axis=1)
test_data=np.asarray(test_data.values)
test_targets=cars_data.price.iloc[103:165]
test_targets=np.asarray(test_targets)
val_data=cars_data.iloc[165:]
val_data=val_data.drop('price',axis=1)
val_data=np.asarray(val_data.values)
val_targets=cars_data.price.iloc[165:]
val_targets=np.asarray(val_targets)
model=models.Sequential()
model.add(layers.Dense(10,activation='relu',input_shape=(25,)))
model.add(layers.Dense(8,activation='relu'))
model.add(layers.Dense(6,activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop',loss='mse',metrics=['mae'])
model.fit(train_data,train_targets,epochs=20,batch_size=1)
There are two things you need to address in your code.
Categorical variables
Printing train_data shows that some columns still hold categorical values as strings. TensorFlow cannot convert that kind of data to a tensor directly, so you need to encode the categorical variables first. See the answers to "Best way to deal with categorical variables in regression problem" as a starting point.
Target shape
Your train_targets has shape (103,), which means it is a 1D array. For a simple regression model whose last layer is Dense(1), a target shape of (103, 1) matches the model output. Modify your code like this to reshape the values:
train_targets=np.asarray(train_targets).reshape(-1,1)
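The two points above can be sketched together as follows. This is a minimal example under assumptions: the three-row dataframe is made up, and one-hot encoding with pd.get_dummies is only one of several ways to handle categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the car data: one string column, one numeric column.
cars = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "horsepower": [111, 154, 102],
    "price": [13495.0, 16500.0, 13950.0],
})

# One-hot encode the string columns so that every feature is numeric.
features = pd.get_dummies(cars.drop("price", axis=1))

# Cast to a single float dtype; an object-dtype array is what TensorFlow rejects.
x = features.to_numpy(dtype="float32")

# Reshape the 1D target into a column vector.
y = cars["price"].to_numpy(dtype="float32").reshape(-1, 1)

print(x.dtype, x.shape)
print(y.shape)
```

With real data, the `input_shape` of the first Dense layer would need to match the number of columns after encoding, which is usually larger than the original column count.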

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have gone days trying to figure this out, hopefully someone can help.
I am uploading a .mat file into python using scipy.io, placing the struct into a dataframe, which will then be used in Tensorflow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is 43249x6. Each cell in the 'A' column is a matrix of a different size, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own cell. Each cell in 'types' contains 12 columns of numbers. I have no issues with 'nz' and 'np'.
I want to put all columns into a dataframe and use column 'A', 'Ln', or 'types' as the labels and 'nz' and 'np' as the features; again, I have no issues with the latter. Can anyone help with this or suggest a workaround?
The end goal is to have TensorFlow train on nz and np and give me either a matrix, Ln, or types.
What type of data does your .mat file contain? Is your application very time-critical?
If you can collect all your data in a struct, you could give jsonencode a try: write the struct to a JSON file in MATLAB and load it back into Python via json (see the json documentation on loading data).
Then you can create a pandas dataframe via
pd.DataFrame.from_dict()
Of course, this would only be a workaround. You would still have to ensure that the data in the MATLAB struct is correctly ordered before it is imported and transferred to a dataframe.
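A minimal sketch of the Python side of that workaround, assuming the JSON holds one equal-length array per field. The JSON text here is made up and merely stands in for a file written by MATLAB.

```python
import json

import pandas as pd

# Hypothetical JSON text with one equal-length array per struct field.
json_text = '{"nz": [2, 3, 1], "np": [4, 4, 2]}'

# Parse the JSON into a plain dict of lists.
records = json.loads(json_text)

# Each dict key becomes a column of the dataframe.
df = pd.DataFrame.from_dict(records)

print(df)
print(df.dtypes)
```

Fields that hold matrices of varying size would not fit this flat layout directly and would need separate handling.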
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow, because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs and then load them into stellargraph.

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why would I need to convert my dataframe to ndarray when doing a regression, since I get the same result for intercept and coef when I do not convert it?
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# If I use the dataframes train[['ENGINESIZE']] for 'x' and train[['CO2EMISSIONS']] for 'y'
# below, I get the same result.
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
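To make the comparison above concrete, here is a small sketch with a made-up three-row frame. It shows that the three conversion spellings produce the same array, and that a one-column dataframe and a Series differ in shape.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame.
train = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 3.5],
    "CO2EMISSIONS": [196, 221, 255],
})

as_frame = train[["ENGINESIZE"]]  # one-column DataFrame, 2D
as_series = train["ENGINESIZE"]   # Series, 1D

# Three ways to get the underlying array from the one-column DataFrame.
a1 = np.asanyarray(as_frame)
a2 = as_frame.values
a3 = as_frame.to_numpy()

print(a1.shape, as_series.to_numpy().shape)
print(np.array_equal(a1, a2), np.array_equal(a1, a3))
```

The 2D shape matters for the feature matrix, since scikit-learn estimators expect X to have one row per sample and one column per feature.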

Too many values to unpack using NLTK and Pandas in Python

I am trying out different things to make the NLTK's naive bayes work using the NLTK and Pandas modules, but I am getting the "too many values to unpack" error.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import nltk
### Remove cases with missing name or missing ethnicity information
def read_file():
    data = pd.read_csv("C:\sample.csv", encoding="utf-8")
    frame = DataFrame(data)
    frame.columns = ["Name", "Gender"]
    return frame
#read_file()
def gender_features(word):
    return {'last_letter': word[-1]}
#gender_features()
frame = read_file()
featuresets = [(gender_features(n), gender) for (n, gender) in frame]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
I suspect you are trying to do something bigger than name classification, given that you are using pandas.DataFrame. A DataFrame is an in-memory tabular structure, described in the pandas documentation as:
a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table,
or a dict of Series objects. It is generally the most commonly used
pandas object. Like Series, DataFrame accepts many different kinds of
input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
I suggest you go through the pandas tutorial to learn about the library first: http://pandas.pydata.org/pandas-docs/dev/tutorials.html
And then learn about the NLTK classification from http://www.nltk.org/book/ch06.html
Firstly, there are several things wrong in how you access pandas.DataFrame object.
To iterate through the rows of the dataframe, you should do this:
# Read file into pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']
for index, row in df.iterrows():
    print(row['name'], row['gender'])
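To see where the "too many values to unpack" error comes from, here is a minimal sketch with a made-up two-row frame. Iterating a DataFrame directly yields its column labels, so the loop in the question tries to unpack the string "Name" into two variables. The itertuples call is shown as one alternative to iterrows for getting (name, gender) pairs.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as in the question.
frame = pd.DataFrame({"Name": ["Neo", "Trinity"], "Gender": ["m", "f"]})

# Iterating a DataFrame directly yields column labels, not rows.
labels = list(frame)

# Unpacking each label (a string) into two names raises ValueError.
try:
    pairs = [(n, g) for (n, g) in frame]
    error = None
except ValueError as exc:
    error = str(exc)

# Plain tuples, one per row, without the index.
rows = list(frame.itertuples(index=False, name=None))

print(labels)
print(error)
print(rows)
```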
Next, to train a classifier, you should do this:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from nltk.corpus import names
from nltk.classify import NaiveBayesClassifier as nbc
# Create a sample.csv file
male_names = [','.join([i,'m']) for i in names.words('male.txt')]
female_names = [','.join([i,'f']) for i in names.words('female.txt')]
with open('sample.csv', 'w') as fout:
    fout.write('\n'.join(male_names+female_names))
# Feature extractor function.
def gender_features(word):
    return {'last_letter': word[-1]}
# Read file into pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']
# Extract features.
featuresets = [(gender_features(name), gender) for index, (name, gender) in df.iterrows()]
# Split train and test set
train_set, test_set = featuresets[500:], featuresets[:500]
# Train a classifier
classifier = nbc.train(train_set)
# Test classifier on "Neo"
print(classifier.classify(gender_features('Neo')))
[out]:
m