I am just learning OOP in Python 2.7 and I want to put together a class that handles frequency data with the following structure:
"Freq" "Device 1" "Device 2" etc.....
100 90 95
500 95 100
. . .
. . .
My first thought was to use composition: wrap a DataFrame that represents the data above together with a units attribute to track the units (e.g. volts vs. millivolts).
I want to be able to use all of the built in functionality of pandas, like being able to merge DataFrames together, slicing, indexing, plotting etc. I also want to keep track of the units of the data so that when I merge, plot, etc. I can adjust values so they are congruent.
If I create a class based on composition, I find merging is difficult
import pandas as pd

class FreqData(object):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False, units=None):
        self.df = pd.DataFrame(data, index, columns, dtype, copy)
        self.units = units
Then my merge method looks something like this
def combine(self, fds, axis=1):
    try:
        chk_units = self.units == fds.units
        fds = [fds]
    except AttributeError:
        chk_units = all([self.units == f.units for f in fds])
    combined_fd = FreqData()
    if chk_units:
        combined_fd.units = self.units
        df_list = [f.df for f in fds]
        df_list.insert(0, self.df)
        combined_fd.df = pd.concat(df_list, axis=axis)
        return combined_fd
    else:
        raise TypeError("One or more of the FreqData objects' "
                        "measurement types does not match")
Is there an easier way to do this using composition or should I try using inheritance?
Also, I want to use slicing methods like df[], but I'd like to be able to do this on the FreqData object rather than having to write a method like:
def __getitem__(self, key):
    fd = FreqData()
    fd.df = self.df.__getitem__(key)
    fd.units = self.units
    return fd
This seems like a bunch of redundant code. How can I improve this?
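One pattern I've been experimenting with to cut down the boilerplate (just a sketch, I'm not sure it's idiomatic) is to delegate unknown attribute access to the wrapped DataFrame and only re-wrap results where the units need to travel along:
import pandas as pd

class FreqData(object):
    def __init__(self, df=None, units=None):
        self.df = df if df is not None else pd.DataFrame()
        self.units = units

    def __getattr__(self, name):
        # Anything FreqData doesn't define itself (plot, describe, merge, ...)
        # falls through to the wrapped DataFrame.
        return getattr(self.df, name)

    def __getitem__(self, key):
        # Re-wrap the result so the units travel with the sliced data.
        # (Note: a single-column key returns a Series wrapped in FreqData.)
        return FreqData(self.df[key], units=self.units)
I have also seen pandas' docs mention subclassing DataFrame and propagating custom attributes via _metadata, but I don't know whether that plays well with merging.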
I want to get the numpy vectors created using the "Embedding Column" in TensorFlow.
For example, creating a sample DataFrame:
import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds
Converting the pandas DataFrame to a TensorFlow Dataset object:
import tensorflow as tf

# A utility method to create a tf.data dataset from a pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
Creating an embedding column:
from tensorflow import feature_column

tf_ds = df_to_dataset(ds)

# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)
Is there any way to get the embeddings as numpy vectors from the 'col_a_embedding' object?
For example, the category "Apple" will be embedded into a vector of size 8:
[a1 a2 a3 a4 a5 a6 a7 a8]
Can we fetch that vector?
I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column), because the result from a feature column seems to be a fixed-length tensor. Feature columns achieve that by using a combiner to aggregate individual embedding vectors (sum, mean, sqrtn, etc.), so the dimension along the sequence of categories is actually lost.
But it's totally doable if you use lower-level APIs.
First you could construct a lookup table to convert categorical strings to ids.
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt":
# apple
# mango
# banana
# unknown
Now you could initialize the embedding as a 2d variable. Its shape is [number of categories, embedding dimension].
num_categories = 3
embedding_dim = 64
category_emb = tf.get_variable(
    "embedding_table", [num_categories, embedding_dim],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
You could then look up the category embeddings like below:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)
Note that the result in ids_embeddings is a concatenated long tensor. Feel free to reshape it to the shape you want.
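To actually get the numpy vectors out, you can evaluate the lookup in a session. A minimal sketch, assuming TF 1.x graph mode and the table and variable defined above:
with tf.Session() as sess:
    # Both the vocabulary table and the embedding variable need initializing.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    emb_matrix, looked_up = sess.run([category_emb, ids_embeddings])

print(emb_matrix.shape)  # (num_categories, embedding_dim) numpy array
print(looked_up.shape)   # one embedding_dim-sized row per input string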
I suggest the easiest and fastest way is to do it like this, which is what I am doing in my own app:
1. Use pandas to read_csv your file into a string column of type "category" in pandas using the dtype parameter. Let's call it field "f". This is the original string column, not a numerical column yet.
2. Still in pandas, create a new column and copy the original column's pandas cat.codes into the new column. Let's call it field "f_code". Pandas automatically encodes this into a compactly represented numerical column. It will have the numbers you need for passing to neural networks.
3. Now, in an Embedding layer in your Keras functional API neural network model, pass f_code to your model's Input layer. The value in f_code will be a number now, like int8. The Embedding layer will process it correctly now. Don't pass the original column to the model.
Below are some sample code lines copied out of my project doing exactly the steps above.
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}
<some code omitted>
d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)
<some code omitted>
from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.
cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)
<some code omitted>
# Actually add _code column for the selected columns
for cn in add_code_columns:
    codecolname = cn + "_code"
    if codecolname not in d.columns:
        d[codecolname] = d[cn].cat.codes
You can see the numeric codes pandas made for you:
d.info()
d.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid 99991 non-null int32
itemid 99991 non-null int32
rating 99991 non-null float32
job 99991 non-null category
job_code 99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
Finally, you can omit the job column and retain the job_code column, in this example, for passing into your Keras neural network model. Here is some of my model code:
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
By the way, please also wrap np.array() around all pandas dataframes when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas dataframes cannot be safely passed in. You get massive memory allocs otherwise which crash hosts.
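For instance, roughly like this (model, x_df and y_df here are placeholders for your own model and frames):
import numpy as np

# Pass plain numpy arrays to model.fit rather than the pandas objects themselves.
model.fit(np.array(x_df), np.array(y_df), epochs=10, batch_size=32)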
I have this annoying problem and I don't know how to solve it.
I am reading in batches of data from a CSV using a dataset reader and want to gather certain columns. The reader returns a tuple of tensors and, depending on which reader I use, columns are indexed either via integer or string.
I can easily enough do a for loop in Python and slice the columns I want; however, I want to do this in a tf.while_loop to take advantage of parallel execution.
This is where my issue lies: the iterator in the while loop is tensor-based and I cannot use it to index into my dataset. If I try to evaluate it, I get an error about the session not being the same, etc.
How can I use a while loop (or a map function) and have the function index into a Python list/dict without evaluating or running the iterator tensor?
Simple example:
some_data = [1,2,3,4,5]
x = tf.constant(0)
y = len(some_data)
c = lambda x: tf.less(x, y)
b = lambda x: some_data[x]  # <-- You cannot index like this!
tf.while_loop(c, b, [x])
Does this fit your requirement somewhat? It does nothing apart from printing the value.
import tensorflow as tf
from tensorflow.python.framework import tensor_shape

some_data = [11,222,33,4,5,6,7,8]

def func(v):
    print(some_data[v])
    return some_data[v]

with tf.Session() as sess:
    r = tf.while_loop(
        lambda i, v: i < 4,
        lambda i, v: [i + 1, tf.py_func(func, [i], [tf.int32])[0]],
        [tf.constant(0), tf.constant(2, tf.int32)],
        [tensor_shape.unknown_shape(), tensor_shape.unknown_shape()])
    r[1].eval()
It prints
11
4
222
33
The order changes every time, but I guess tf.control_dependencies may be useful to control that.
I have a dataframe which has two columns (review and sentiment). I am using the PyTorch and torchtext libraries for preprocessing the data.
Is it possible to use dataframe as source to read data from, in torchtext?
I am looking for something similar to, but not
data.TabularDataset.splits(path='./data')
I have performed some operation (clean, change to required format) on data and final data is in a dataframe.
If not torchtext, what other package would you suggest that would help in preprocessing text data present in a dataframe? I could not find anything online. Any help would be great.
Adapting the Dataset and Example classes from torchtext.data
from torchtext.data import Field, Dataset, Example
import pandas as pd

class DataFrameDataset(Dataset):
    """Class for using pandas DataFrames as a datasource"""

    def __init__(self, examples, fields, filter_pred=None):
        """
        Create a dataset from a pandas dataframe of examples and Fields

        Arguments:
            examples pd.DataFrame: DataFrame of examples
            fields {str: Field}: The Fields to use in this tuple. The
                string is a field name, and the Field is the associated field.
            filter_pred (callable or None): use only examples for which
                filter_pred(example) is true, or use all examples if None.
                Default is None.
        """
        self.examples = examples.apply(SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
        if filter_pred is not None:
            self.examples = list(filter(filter_pred, self.examples))
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]


class SeriesExample(Example):
    """Class to convert a pandas Series to an Example"""

    @classmethod
    def fromSeries(cls, data, fields):
        return cls.fromdict(data.to_dict(), fields)

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        for key, field in fields.items():
            if key not in data:
                raise ValueError("Specified key {} was not found in "
                                 "the input data".format(key))
            if field is not None:
                setattr(ex, key, field.preprocess(data[key]))
            else:
                setattr(ex, key, data[key])
        return ex
Then, first define fields using torchtext.data fields. For example:
import torch
from torchtext import data

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)

fields = { 'sentiment' : LABEL, 'review' : TEXT }
before simply loading the dataframes:
train_ds = DataFrameDataset(train_df, fields)
valid_ds = DataFrameDataset(valid_df, fields)
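From there you can batch them as usual; a rough sketch, assuming the legacy torchtext.data iterators and the field names used above:
# Bucket examples of similar review length together to minimise padding.
train_iter, valid_iter = data.BucketIterator.splits(
    (train_ds, valid_ds),
    batch_size=64,
    sort_key=lambda ex: len(ex.review),
    sort_within_batch=False)

for batch in train_iter:
    text = batch.review        # numericalised review field
    labels = batch.sentiment   # label field
    break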
Thanks Geoffrey.
From looking at the source code for torchtext.data.Field (https://pytorch.org/text/_modules/torchtext/data/field.html), it looks like the 'train' parameter needs to be either a Dataset already, or some iterable source of text data. But given that we haven't created a dataset at this point, I am guessing you have passed in just the column of text from the dataframe.
The current TensorFlow dataset interleave functionality is basically an interleaved flat-map taking a single dataset as input. Given the current API, what's the best way to interleave multiple datasets together? Say they have already been constructed and I have a list of them. I want to produce elements from them alternately, and I want to support lists with more than 2 datasets (i.e., stacked zips and interleaves would be pretty ugly).
Thanks! :)
@mrry might be able to help.
EDIT 2: See tf.contrib.data.choose_from_datasets. It performs deterministic dataset interleaving.
EDIT: See tf.contrib.data.sample_from_datasets. Even though it performs random sampling I guess it can be useful.
Even though this is not "clean", it is the only workaround I came up with.
datasets = [tf.data.Dataset...]

def concat_datasets(datasets):
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0

ds = tf.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *args: concat_datasets(args)
)
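To make the workaround concrete, here is a small self-contained sketch (TF 1.x graph mode, with three tiny in-memory datasets that are not part of the original post):
import tensorflow as tf

ds_a = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds_b = tf.data.Dataset.from_tensor_slices([4, 5, 6])
ds_c = tf.data.Dataset.from_tensor_slices([7, 8, 9])

def concat_datasets(elements):
    # Turn each element of the zipped tuple back into a one-element dataset
    # and chain them, so every "round" of the zip is emitted back to back.
    ds0 = tf.data.Dataset.from_tensors(elements[0])
    for e in elements[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(e))
    return ds0

interleaved = tf.data.Dataset.zip((ds_a, ds_b, ds_c)).flat_map(
    lambda *args: concat_datasets(args))

next_element = interleaved.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_element))  # 1 4 7 2 5 8 3 6 9
    except tf.errors.OutOfRangeError:
        pass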
Expanding on user2781994's answer (with edits), here is how I implemented it:
import tensorflow as tf

ds11 = tf.data.Dataset.from_tensor_slices([1,2,3])
ds12 = tf.data.Dataset.from_tensor_slices([4,5,6])
ds13 = tf.data.Dataset.from_tensor_slices([7,8,9])
all_choices_ds = [ds11, ds12, ds13]

choice_dataset = tf.data.Dataset.range(len(all_choices_ds)).repeat()
ds14 = tf.contrib.data.choose_from_datasets(all_choices_ds, choice_dataset)

# alternatively:
# ds14 = tf.contrib.data.sample_from_datasets(all_choices_ds)

iterator = ds14.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            value = sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break
        print(value)
The output is:
1
4
7
2
5
8
3
6
9
In TensorFlow 2.0:
tot_imm_dataset1 = 105
tot_imm_dataset2 = 55
e = tf.data.Dataset.from_tensor_slices(tf.cast([1, 0, 1], tf.int64)).repeat(int(tot_imm_dataset1 / 2))
f = tf.data.Dataset.range(1).repeat(int(tot_imm_dataset2 - tot_imm_dataset1 / 2))
choice = e.concatenate(f)
datasets = [dataset2, dataset1]
dataset_rgb_compl__con_patch = tf.data.experimental.choose_from_datasets(datasets, choice)
That works for me
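Since dataset1 and dataset2 above come from that answer's own pipeline, here is a self-contained sketch of the same idea in TensorFlow 2.x with made-up toy datasets:
import tensorflow as tf

ds_a = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds_b = tf.data.Dataset.from_tensor_slices([10, 20, 30])

# 0 picks the next element from ds_a, 1 from ds_b; repeating [0, 1] alternates them.
choice = tf.data.Dataset.from_tensor_slices(tf.cast([0, 1], tf.int64)).repeat(3)
interleaved = tf.data.experimental.choose_from_datasets([ds_a, ds_b], choice)

print(list(interleaved.as_numpy_iterator()))  # [1, 10, 2, 20, 3, 30]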
I pull historical data for a large universe of stocks and ETFs daily. Quandl has pretty good free coverage of US Equities, but they do not have historical data for ETFs so I use the Google API as a backup for Quandl.
The recent Google Finance "renovation" hasn't left me with a great alternative, so I am trying to apply Brad Solomon's work (thanks Brad, link below) to a list of symbols. I assume this is unlikely to work without a loop, given that he is creating URLs. Any clever ideas are welcome.
Related question: How come pandas_datareader for google doesn't work?
Thanks.
Under the hood, pandas-datareader is looping through each symbol that you pass and making http requests one by one.
Here's the function that does that in the base class, from which the google- and yahoo-related classes inherit: base._DailyBaseReader._dl_mult_symbols.
The magic is that these are appended to a list and then aggregated into a pandas Panel.
I would note, however, that Panel is deprecated and you can get the same functionality in a DataFrame with a MultiIndex, a structure that's technically 2-dimensional but replicates higher dimensionalities in practice.
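As a quick illustration of that MultiIndex alternative (a sketch with made-up numbers), concatenating a dict of per-symbol DataFrames along axis=1 puts the symbol on the outer column level:
import pandas as pd

frames = {
    'AAPL': pd.DataFrame({'Open': [30.5, 30.6], 'Close': [30.6, 30.1]}),
    'GE':   pd.DataFrame({'Open': [15.4, 15.5], 'Close': [15.5, 15.4]}),
}

panel_like = pd.concat(frames, axis=1)           # (symbol, field) MultiIndex columns
print(panel_like.xs('Close', axis=1, level=1))   # Close prices per symbol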
So, here's the barebones of what you could do, below. Please note I'm skipping a lot of the functionality embedded within the package itself, such as parsing string dates to datetime.
import datetime
from io import StringIO

import requests
import pandas as pd
from pandas.io.common import urlencode

BASE = 'http://finance.google.com/finance/historical'


def get_params(sym, start, end):
    params = {
        'q': sym,
        'startdate': start.strftime('%Y/%m/%d'),
        'enddate': end.strftime('%Y/%m/%d'),
        'output': "csv"
    }
    return params


def build_url(sym, start, end):
    params = get_params(sym, start, end)
    return BASE + '?' + urlencode(params)


def get_one_data(sym, start=None, end=None):
    if not start:
        start = datetime.datetime(2010, 1, 1)
    if not end:
        end = datetime.datetime.today()
    url = build_url(sym, start, end)
    data = requests.get(url).text
    return pd.read_csv(StringIO(data), index_col='Date',
                       parse_dates=True).sort_index()


def get_multiple(sym, start=None, end=None, return_type='Panel'):
    if isinstance(sym, str):
        return get_one_data(sym, start=start, end=end)
    elif isinstance(sym, (list, tuple, set)):
        res = {}
        for s in sym:
            res[s] = get_one_data(s, start, end)
        # The actual module also implements a 'passed' and 'failed'
        # check here and also uses chunking to get around
        # data retrieval limits (I believe)
        if return_type.lower() == 'panel':
            return pd.Panel(res).swapaxes('items', 'minor')
        elif return_type.lower() == 'mi':  # MultiIndex DataFrame
            return pd.concat(res, axis=1)
An example:
syms = ['AAPL', 'GE']
data = get_multiple(syms, return_type='mi')
# Here's how you would filter down to Close prices
# on MultiIndex columns
data.xs('Close', axis=1, level=1)
AAPL GE
Date
2010-01-04 30.57 15.45
2010-01-05 30.63 15.53
2010-01-06 30.14 15.45
2010-01-07 30.08 16.25
2010-01-08 30.28 16.60
...