I have a dataframe with two columns (review and sentiment). I am using PyTorch and the torchtext library for preprocessing the data.
Is it possible to use a dataframe as the source to read data from in torchtext?
I am looking for something similar to, but not exactly,
data.TabularDataset.splits(path='./data')
I have performed some operations (cleaning, converting to the required format) on the data, and the final data is in a dataframe.
If not torchtext, what other package would you suggest for preprocessing text data held in a dataframe? I could not find anything online. Any help would be great.
Adapting the Dataset and Example classes from torchtext.data
from torchtext.data import Field, Dataset, Example
import pandas as pd
class DataFrameDataset(Dataset):
    """Class for using pandas DataFrames as a datasource"""

    def __init__(self, examples, fields, filter_pred=None):
        """
        Create a dataset from a pandas dataframe of examples and Fields
        Arguments:
            examples pd.DataFrame: DataFrame of examples
            fields {str: Field}: The Fields to use in this tuple. The
                string is a field name, and the Field is the associated field.
            filter_pred (callable or None): use only examples for which
                filter_pred(example) is true, or use all examples if None.
                Default is None.
        """
        self.examples = examples.apply(SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
        if filter_pred is not None:
            self.examples = list(filter(filter_pred, self.examples))
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]
class SeriesExample(Example):
    """Class to convert a pandas Series to an Example"""

    @classmethod
    def fromSeries(cls, data, fields):
        return cls.fromdict(data.to_dict(), fields)

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        for key, field in fields.items():
            if key not in data:
                raise ValueError("Specified key {} was not found in "
                                 "the input data".format(key))
            if field is not None:
                setattr(ex, key, field.preprocess(data[key]))
            else:
                setattr(ex, key, data[key])
        return ex
Then define your fields using torchtext.data Fields. For example:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)
fields = { 'sentiment' : LABEL, 'review' : TEXT }
before simply loading the dataframes:
train_ds = DataFrameDataset(train_df, fields)
valid_ds = DataFrameDataset(valid_df, fields)
Thanks Geoffrey.
From looking at the source code for torchtext.data.Field
https://pytorch.org/text/_modules/torchtext/data/field.html
it looks like the 'train' parameter needs to be either a Dataset already, or some iterable source of text data. But given that we haven't created a dataset at this point, I am guessing you passed in just the column of text from the dataframe.
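One way to avoid that ordering problem is to build the DataFrameDataset first and then build the vocabularies from it. A minimal sketch, assuming the legacy torchtext API (torchtext <= 0.8, or torchtext.legacy in later releases) and dataframes named train_df and valid_df:
import torch
from torchtext import data

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
fields = {'sentiment': LABEL, 'review': TEXT}

# Wrap the dataframes first, then build the vocab from the resulting Datasets
train_ds = DataFrameDataset(train_df, fields)
valid_ds = DataFrameDataset(valid_df, fields)
TEXT.build_vocab(train_ds, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train_ds)

# Batching then works as with any other torchtext Dataset
train_iter, valid_iter = data.BucketIterator.splits(
    (train_ds, valid_ds), batch_size=64,
    sort_key=lambda ex: len(ex.review), sort_within_batch=False)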
I have a dataframe column 'review' with content like 'Food was Awesome', and I want a new column which counts the number of repetitions of each word. A sample row looks like this:
name      The First Years Massaging Action Teether
review    A favorite in our house!
rating    5
Name: 269, dtype: object
I am expecting output like {'Food': 1, 'was': 1, 'Awesome': 1}.
I tried with a for loop, but it's taking too long to execute:
for row in range(products.shape[0]):
    try:
        count_vect.fit_transform([products['review_without_punctuation'][row]])
        products['word_count'][row] = count_vect.vocabulary_
    except:
        print(row)
I would like to do it without for loop.
I found a solution for this.
I defined a function like this:
def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except:
        return -1
and applied the above function:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
This solution worked, and I got the vocabulary in a new column.
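Note that CountVectorizer's vocabulary_ attribute maps each word to a column index, not to its count, so if you actually want per-review word counts, a collections.Counter over the split text may be closer to the expected output. A minimal sketch, assuming the reviews are already lower-cased and punctuation-free:
from collections import Counter

# One {word: count} dict per review, without an explicit loop over rows
products['word_count'] = products['review_without_punctuation'].apply(
    lambda text: dict(Counter(str(text).split())))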
You can get the count vectors for all docs like this:
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])
To get the count vector in array format for a particular document by index, say, the 1st doc,
count_vectors[0].toarray()
The vocabulary is in
cv.vocabulary_
To get the words that make up a count vector, say, for the 1st doc, use
cv.inverse_transform(count_vectors[0])
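If you want the counts keyed by word for a given document, you can combine the feature names with that row of the count matrix. A minimal sketch (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):
row = count_vectors[0].toarray().ravel()    # counts for the 1st doc
words = cv.get_feature_names_out()          # column index -> word
word_counts = {w: int(c) for w, c in zip(words, row) if c > 0}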
This is so similar to other posts on SO (e.g. here), but I just can't see what I'm doing wrong.
I want to scrape the box labelled 'activity' on this page, and I want the output to look like this:
You can see the two main features of interest compared to the original webpage: (1) combining multiple tables into one table, creating a new column if the column has not already been seen, and (2) extracting the actual href for that column as opposed to just the name (e.g. 'Jacobsen et al'), because I want to eventually extract the PMID value (an integer) from the href.
These are my two goals. I wrote this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

for i in range(23, 24):
    # try:
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # this can print the references
        print(each_table.a)
        # this can print the data frames
        df = pd.read_html(str(each_table))
        print(df)
        # how to combine the two?
Can someone tell me the correct way to print the href individually for each row of each table, so that it essentially adds an extra column to each table with the actual href? (It should print out three tables, each with an extra href column.)
Then I can focus on how to combine the tables; I've just mentioned the ultimate goal here in case someone can think of a more Pythonic way of killing two birds with one stone, but I think they're different issues.
You can initialise a final dataframe. Then, as you iterate, store the href as a string and add it as a column to the sub-table dataframe. Then keep appending those dataframes to the final dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Initialise an empty "final" dataframe
final_df = pd.DataFrame()

for i in range(20, 24):
    # try:
    res = requests.get("http://www.conoserver.org/index.php?page=card&table=protein&id=" + str(i))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table', {'class': 'activitytable'})
    for each_table in table:
        # Store the href
        href = each_table.a['href']
        # Get the table
        df = pd.read_html(str(each_table))[0]
        # Put that href in the column 'ref'
        df['ref'] = href
        # Append that dataframe to the final dataframe, and repeat
        # (DataFrame.append was removed in pandas 2.0; use pd.concat([final_df, df]) there)
        final_df = final_df.append(df, sort=True).reset_index(drop=True)
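Since the ultimate goal is the PMID, one option is to pull the integer out of the stored href afterwards; the exact pattern depends on how the site formats the link, so the regex below is only an assumption:
# Hypothetical: grab a trailing run of digits (e.g. a PubMed ID) from each stored href
final_df['pmid'] = final_df['ref'].str.extract(r'(\d+)\s*$', expand=False)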
I am trying to learn some classification in Scikit-learn. However, I couldn't figure out what this error means.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data_frame = pd.read_csv('data.csv', header=0)
data_in_numpy = data_frame.values
c = CountVectorizer()
c.fit_transform(data_in_numpy.data)
This throws an error:
NotImplementedError: multi-dimensional sub-views are not implemented
How can I go around this issue? One record from my csv file looks like:
Time   Directors   Actors    Rating   Label
123    Abc, Def    A, B,c    7.2      1
I suppose this error is due to the fact that there is more than one value under the Directors or Actors columns.
Any help would be appreciated.
Thanks,
According to the docstring, sklearn.feature_extraction.text.CountVectorizer will:
Convert a collection of text documents to a matrix of token counts
So then why, I wonder, are you inputting numerical values?
Try transforming only the strings (directors and actors):
data_frame['X'] = data_frame[['Directors', 'Actors']].apply(lambda x: ' '.join(x), axis=1)
data_in_numpy = data_frame['X'].values
First though, you might want to clean the data up by removing the commas.
data_frame['Directors'] = data_frame['Directors'].str.replace(',', ' ')
data_frame['Actors'] = data_frame['Actors'].str.replace(',', ' ')
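Putting it together, a minimal sketch of the whole pipeline (assuming the same data.csv layout shown above):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data_frame = pd.read_csv('data.csv', header=0)

# Clean up the comma-separated names, then join the two text columns into one string per row
data_frame['Directors'] = data_frame['Directors'].str.replace(',', ' ')
data_frame['Actors'] = data_frame['Actors'].str.replace(',', ' ')
text = data_frame[['Directors', 'Actors']].apply(lambda x: ' '.join(x), axis=1)

c = CountVectorizer()
X = c.fit_transform(text)   # sparse matrix of token counts, one row per record
y = data_frame['Label']     # the classification target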
I am just learning OOP in Python 2.7 and I want to put together a class that handles frequency data with the following structure:
"Freq" "Device 1" "Device 2" etc.....
100 90 95
500 95 100
. . .
. . .
My first thought was to use composition: a class wrapping a DataFrame to represent the data above, along with a units attribute to track the units (e.g. Volts vs. millivolts).
I want to be able to use all of the built in functionality of pandas, like being able to merge DataFrames together, slicing, indexing, plotting etc. I also want to keep track of the units of the data so that when I merge, plot, etc. I can adjust values so they are congruent.
If I create a class based on composition, I find merging is difficult:
import pandas as pd
from pandas import DataFrame

class FreqData(object):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False, units=None):
        self.df = DataFrame(data, index, columns, dtype, copy)
        self.units = units
Then my merge method looks something like this:
def combine(self, fds, axis=1):
    try:
        chk_units = self.units == fds.units
        fds = [fds]
    except AttributeError:
        chk_units = all([self.units == f.units for f in fds])
    combined_fd = FreqData()
    if chk_units:
        combined_fd.units = self.units
        df_list = [f.df for f in fds]
        df_list.insert(0, self.df)
        combined_fd.df = pd.concat(df_list, axis=axis)
        return combined_fd
    else:
        raise TypeError("One or more of the FreqData objects' measurement "
                        "types does not match")
Is there an easier way to do this using composition or should I try using inheritance?
Also, I want to use slicing methods like df[], but I'd like to be able to do this on the FreqData object rather than having to write a method like:
def __getitem__(self, key):
    fd = FreqData()
    fd.df = self.df.__getitem__(key)
    fd.units = self.units
    return fd
This seems like a bunch of redundant code. How can I improve this?
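For comparison, here is a minimal sketch of the inheritance route, using pandas's documented subclassing hooks (_metadata and _constructor). How far the units attribute propagates varies by pandas version and operation, so treat this as a starting point rather than a drop-in replacement:
import pandas as pd

class FreqData(pd.DataFrame):
    # pandas carries names listed in _metadata through many operations via __finalize__
    _metadata = ['units']

    @property
    def _constructor(self):
        # Slicing and similar operations return FreqData instead of a plain DataFrame
        return FreqData

fd = FreqData({'Freq': [100, 500], 'Device 1': [90, 95], 'Device 2': [95, 100]})
fd.units = 'Volts'
sliced = fd[['Freq', 'Device 1']]   # still a FreqData; units is propagated where __finalize__ runs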
I need to create a Pandas DataFrame from a large file with space-delimited values and a row structure that depends on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
The values may span one line or several; line breaks are ignored.
The end result looks like this, if the number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.
Currently I read the file into one big list and then split it into rows:
from itertools import izip_longest  # itertools.zip_longest on Python 3

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
(the code is from the itertools recipes)
The problem is that I end up with multiple copies of the data in memory. With 500MB+ files this eats up memory fast, and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re

filename = 'raw_data.csv'

class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    Note: this subclasses the built-in `file` type, so it is Python 2 only.
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []

    def next(self):
        if not self.buffer:
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        if self.buffer:
            line = self.buffer.pop(0)   # take the chunks in their original order
        return line

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re

filename = 'raw_data.csv'

for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to separate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s) // 3, 3))
Out[3]:
         0        1       2
0  2008231  4891866  383842
1  2036693  4924388  375170
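If rewriting the file to disk is not desirable, another option is to let NumPy read the flat stream of numbers and reshape it, which avoids building intermediate Python lists. A minimal sketch, assuming the file holds only whitespace-separated numeric values and n is the known column count:
import numpy as np
import pandas as pd

n = 3
# np.fromfile in text mode treats line breaks like any other whitespace,
# so the values come back as one flat 1-D array
values = np.fromfile('raw_data.csv', sep=' ')
df = pd.DataFrame(values.reshape(-1, n))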