derive features from date string in TensorFlow - tensorflow

I am trying to parse a CSV file which contains a date string (format "2018-03-30 09:30:05").
It should be turned into one-hot encoded features in the form of day / hour / minute / second.
One obvious way to do this would be to use pandas and store the result in a separate file or an HDF store.
But in order to simplify the workflow (and leverage the GPU), I would like to do this directly in TensorFlow.
Assuming the date string is at position -2, I thought something like tf.int32(tf.substr(row[-2],0,4)) should work to get the year, but it raises TypeError: 'DType' object is not callable.
with tf.python_io.TFRecordWriter("train_sample_sorted.tfrecords") as tf_writer:
    i = 0
    for row in myArray:
        i += 1
        if (i % 10000 == 0):
            print(row[-2])
        #timefeatures = int(row[-2][0:4]) ## TypeError: Value must be iterable
        #timefeatures = tf.int32(tf.substr(row[-2],0,4)) ## TypeError: 'DType' object is not callable
        features, label = row[:-2], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["timefeatures"].float_list.value.extend(timefeatures)
        example.features.feature["label"].int64_list.value.append(label)
        tf_writer.write(example.SerializeToString())
What is the best practice to handle date strings as input features? Is there a way around pre-processing?
Thanks

The first version, int(row[-2][0:4]), fails for two reasons: indexing cannot be used to slice the characters of a string tensor, and even if it could, you cannot convert a tensor to an int with Python's int() like that.
The second version, tf.int32(tf.substr(row[-2], 0, 4)), is almost there: it does the string slicing fine, but to convert strings to numbers you have to use tf.string_to_number; you cannot simply cast a string tensor to a number like that.
Without access to the data you use I couldn't test it, but this should work:
tf.string_to_number(tf.substr(row[-2], 0, 4), out_type=tf.int32)
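If you also need the hour / minute / second components mentioned in the question, a sketch along the same lines (untested; it assumes the fixed-width "YYYY-MM-DD HH:MM:SS" format, the TF 1.x string ops used above, and that row[-2] is a string tensor) would be:

date_str = row[-2]
hour   = tf.string_to_number(tf.substr(date_str, 11, 2), out_type=tf.int32)
minute = tf.string_to_number(tf.substr(date_str, 14, 2), out_type=tf.int32)
second = tf.string_to_number(tf.substr(date_str, 17, 2), out_type=tf.int32)
# one-hot encode each component (24 / 60 / 60 classes respectively)
hour_oh   = tf.one_hot(hour, depth=24)
minute_oh = tf.one_hot(minute, depth=60)
second_oh = tf.one_hot(second, depth=60)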

Related

TypeError: '<' not supported between instances of 'int' and 'Timestamp'

I am trying to change the product name when the period between the expiry date and today is less than 6 months. When I try to add the color, the following error appears:
TypeError: '<' not supported between instances of 'int' and 'Timestamp'.
Validade is the column where the products' expiry dates are. How do I solve it?
epi1 = pd.read_excel('/content/timadatepandasepi.xlsx')
epi2 = epi1.dropna(subset=['Validade'])
pd.DatetimeIndex(epi2['Validade'])
today = pd.to_datetime('today').normalize()
epi2['ate_vencer'] = (epi2['Validade'] - today) / np.timedelta64(1, 'M')

def add_color(x):
    if 0 < x < epi2['ate_vencer']:
        color = 'red'
        return f'background = {color}'

epi2.style.applymap(add_color, subset=['Validade'])
Looking at your data, it seems that you're subtracting two dates and using this result inside your comparison. The problem is likely occurring because df['date1'] - today returns a pandas.Series with values of type pandas._libs.tslibs.timedeltas.Timedelta, and this type of object does not allow you to make comparisons with integers. Here's a possible solution:
epi2['ate_vencer'] = (epi2['Validade'] - today).dt.days
# Now you can compare values from "ate_vencer" with integers. For example:
def f(x):  # dummy function for demonstration purposes
    return 0 < x < 10

epi2['ate_vencer'].apply(f)  # this works
Example 1: a similar error to yours occurs when subtracting dates and calling function f without .dt.days.
Example 2: the same code works when instead using .dt.days.
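A minimal end-to-end sketch of the fix, reusing the names from the question (the ~182-day threshold standing in for 6 months and the CSS string are my own assumptions):

import pandas as pd

epi1 = pd.read_excel('/content/timadatepandasepi.xlsx')
epi2 = epi1.dropna(subset=['Validade'])
today = pd.to_datetime('today').normalize()
epi2['ate_vencer'] = (epi2['Validade'] - today).dt.days   # whole days as integers

def add_color(days):
    # highlight anything expiring within roughly 6 months (~182 days)
    return 'background-color: red' if 0 < days < 182 else ''

epi2.style.applymap(add_color, subset=['ate_vencer'])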

pd.to_timedelta('1D') works while pd.to_timedelta('D') fails. Issue when converting frequency to timedelta

I'm encountering a silly problem I guess.
Currently, I'm using pd.infer_freq to get the frequency of a dataframe index. Afterwards, I use pd.to_timedelta() to convert this frequency to a timedelta object, to be added to another date.
This works quite well, except when the dataframe index has a frequency that can be expressed as a single time unit (e.g. 1 day or 1 minute).
To be more precise,
freq = pd.infer_freq(df)
# let's say it gives '2D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
works, while
freq = pd.infer_freq(df)
# let's say it gives 'D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
fails and returns
ValueError: unit abbreviation w/o a number
It would work if I supplied '1D' instead of 'D', though.
I could check whether the first character of the freq string is numeric and prepend '1' otherwise, but that seems quite cumbersome.
Is anyone aware of a better approach?
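For reference, a minimal sketch of the workaround described above (assuming df has a DatetimeIndex; pd.infer_freq is called on the index here):

import pandas as pd

freq = pd.infer_freq(df.index)        # e.g. 'D' or '2D'
if freq and not freq[0].isdigit():
    freq = '1' + freq                 # 'D' -> '1D'
timedelta = pd.to_timedelta(freq)     # now parses without the ValueError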

Octave keyboard input function to filter concatenated string and integer?

If the user types 12wkd3, how can we filter out 123 as an integer in Octave?
Example in Octave:
A = input("A?\n")
A?
12wkd3
A = 123
where 12wkd3 is the user's keyboard input and A = 123 is the expected result.
Assuming that the general form you're looking for is taking an arbitrary string from user input, removing anything non-numeric, and storing the result as an integer:
A = input("A?\n", 's');
A = int32(str2num(A(isdigit(A))));
example output:
A?
324bhtk.p89u34
A = 3248934
To clarify what's written above:
In the input statement, the 's' argument causes the answer to be stored as a string; otherwise it is evaluated by Octave first, in which case most inputs would produce errors and others might be interpreted as functions or variables.
isdigit(A) produces a logical array with a 1 for every character of A that is a 0-9 digit, and 0 otherwise.
isdigit('a1 3 b.') = [0 1 0 1 0 0 0]
A(isdigit(A)) produces a substring from A using only those characters corresponding to a 1 in the logical array above.
A(isdigit(A)) = 13
That still returns a string, so you need to convert it into a number using str2num(). That, however, outputs a double-precision number, so finally, to get an integer, you can use int32().

Numpy maximum(arrays)--how to determine the array each max value came from

I have numpy arrays representing the July temperature for each year since 1950.
I would like to use something like numpy.maximum(temp1950, temp1951, temp1952, ..., temp2014)
to determine the maximum July temperature at each cell, but numpy.maximum(array1, array2) only compares two arrays at a time.
How do I also determine the year that each max value came from?
Thanks to Praveen, the following works fine:
array1 = numpy.array( ([1,2],[3,4]) )
array2 = numpy.array( ([3,4],[1,2]) )
array3 = numpy.array( ([9,1],[1,9]) )
all_arrays = numpy.dstack((array1,array2,array3))
#maxvalues = numpy.maximum(all_arrays)#will not work
all_arrays.max(axis=2) #this returns the max from each cell location
max_indexes = numpy.argmax(all_arrays,axis=2)#this returns correct indexes
The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2D array using something like np.vstack and then take the argmax along the year axis (axis=0).
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2))  # Note the double parentheses: you need to pass
                                 # a tuple to the function
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.
You need to use Numpy's argmax
It would give you the index of the largest element in the array, which you can map to the year.
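A minimal sketch of that mapping step, reusing the three small arrays from the question and assuming they are stacked in chronological order:

import numpy as np

array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[3, 4], [1, 2]])
array3 = np.array([[9, 1], [1, 9]])

years = np.array([1950, 1951, 1952])              # one label per stacked array
all_arrays = np.dstack((array1, array2, array3))  # shape (2, 2, 3)
max_indexes = np.argmax(all_arrays, axis=2)       # index of the max at each cell
max_year = years[max_indexes]                     # year each max value came from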

Dynamically creating variables, while doing map/apply on a dataframe in pandas to get key names for the values in Series object returned

I am writing code for a Naive Bayes model (I know there's a standard implementation in sklearn, but I want to code it anyway). For this I have upwards of 30 features, and for each of them I have the corresponding click & impression counts (treat them as True/False flags).
What I need, then, is to calculate
P(Click | F1, F2, ..., F30) = (P(Click) * P(F1|Click) * P(F2|Click) * ... * P(F30|Click)) / P(F1, F2, ..., F30), and
P(NoClick | F1, F2, ..., F30) = (P(NoClick) * P(F1|NoClick) * P(F2|NoClick) * ... * P(F30|NoClick)) / P(F1, F2, ..., F30),
where I will disregard the denominator, as it affects the Click and NoClick scores equally.
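As a side note, a minimal sketch of that scoring step (the helper name and the log-space trick are my own, not from the question; products of ~30 probabilities underflow quickly in plain floating point):

import numpy as np

def click_score(p_click, p_features_given_click):
    # log of P(Click) * P(F1|Click) * ... * P(F30|Click); denominator dropped
    return np.log(p_click) + sum(np.log(p) for p in p_features_given_click)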
For example, for two features, day_custom & is_tablet_phone, I have:
is_tablet_phone click impression
FALSE 375417 28291280
TRUE 17743 4220980
day_custom click impression
Fri 77592 7029703
Mon 43576 3773571
Sat 65950 5447976
Sun 66460 5031271
Thu 74329 6971541
Tue 55282 4575114
Wed 51555 4737712
My approach to the problem: assuming I read the individual files into a data frame, one after another, I want the ability to calculate & store the corresponding probabilities back in a file, which I will then use for real-time prediction of the probability of click vs no click.
One possible structure of the "processed file" would thus be:
Here's my entire code:
In the full-blown example, I am traversing the entire directory structure (30 txt files, one at a time, from the base path), which is why I need the ability to create "names" at runtime.
for base_path in base_paths:
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
For tractability, follow from here, taking the 2 txt files as sample input:
file_paths = ['/home/ekta/Desktop/NB/day_custom.txt', '/home/ekta/Desktop/NB/is_tablet_phone.txt']
flag = 0
for filehandle in file_paths:
    feature_name = filehandle.split("/")[-1].split(".")[0]
    df = pd.read_csv(filehandle, skiprows=0, encoding='utf-8', sep='\t', index_col=False,
                     dtype={feature_name: object, 'click': int, 'impression': int})
    df2 = df[(df.impression - df.click > 0) & (df.click > 0)]
    if flag == 0:
        MySumC, MySumNC, Mydict = 0, 0, collections.defaultdict(dict)
        MySumC = sum(df2['click'])
        MySumNC = sum(df2['impression'])
        P_C = float(MySumC) / float(MySumC + MySumNC)
        P_NC = 1 - P_C
    for feature_value in df2[feature_name]:
        Mydict[feature_name+'_'+feature_value] = {'P_'+feature_name+'_'+feature_value+'_C': (df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
                                                  'P_'+feature_name+'_'+feature_value+'_NC': (df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
    flag = 1  # set the flag to 1 because we don't need to compute MySumC, MySumNC, P_C & P_NC again
Question:
It looks like THIS loop is the killer here. Also, intuitively, looping over a dataframe is bad practice. How can I rewrite this, perhaps using map/apply?
for feature_value in df2[feature_name]:
    Mydict[feature_name+'_'+feature_value] = {'P_'+feature_name+'_'+feature_value+'_C': (df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
                                              'P_'+feature_name+'_'+feature_value+'_NC': (df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
What I need in Mydict, which is a hash storing each feature name and each feature value, is something like:
{'day_custom_Mon':{'P_day_custom_Mon_C':0.787,'P_day_custom_Mon_NC': 0.556},
 'day_custom_Tue':{'P_day_custom_Tue_C':0.887,'P_day_custom_Tue_NC': 0.156},
 'day_custom_Wed':{'P_day_custom_Wed_C':0.087,'P_day_custom_Wed_NC': 0.167},
 'day_custom_Thu':{'P_day_custom_Thu_C':0.947,'P_day_custom_Thu_NC': 0.196},
 'is_tablet_phone_True':{'P_is_tablet_phone_True_C':0.787,'P_is_tablet_phone_True_NC': 0.066},
 'is_tablet_phone_False':{'P_is_tablet_phone_False_C':0.787,'P_is_tablet_phone_False_NC': 0.077},
 ... and so on ...}
(PS: I just made up those float numbers, but you get the point.)
Also, I will later serialize this file & pass it to Redis directly, for other systems to feed on it in a cron-job manner, so I need to preserve some sort of dynamic naming.
What I tried:
Since I am reading feature_name as
feature_name = filehandle.split("/")[-1].split(".")[0]  # thereby abstracting & creating variables dynamically

def funct1(row):
    return row[feature_name]

def funct2(row):
    return row['click']

def funct3(row):
    return row['impression']

then
(df2.apply(funct2, axis=1) * float(P_C)) / MySumC and (df2.apply(funct3, axis=1) * float(P_NC)) / MySumNC
give me both the values I need for a feature_value (say Mon, Tue, Wed, and so on) of a feature_name (say day_custom).
I also know that df2.apply(funct1, axis=1) contains part of my custom "names" (i.e. the feature values), but how would I then build these names using map/apply?
I.e. I will have the values, but how would I create the "key" 'P_'+feature_name+'_'+feature_value+'_C', since the feature value after apply is returned as a Series object?
Check out the following recipe, which does exactly what you want using only data frame manipulations. I also simplified the actual frequency calculation a bit ;)
#set the feature name values as the index of the data frame
df2.set_index(feature_name, inplace=True)
#This is what df2.set_index() looks like:
# click impression
#day_custom
#Fri 9917 3163
#Mon 2566 3818
#Sat 8725 7753
#Sun 6938 8642
#Thu 6136 2556
#Tue 5234 2356
#Wed 9463 9433
#rename the index of your data frame
df2.rename(index=lambda x:"%s_%s"%('day_custom', x), inplace=True)
#compute the total sum of your data frame entries
totsum = float(df2.values.sum())
#use apply to multiply every data frame element by the total sum
df2 = df2.applymap(lambda x:x/totsum)
#transpose the data frame to have the following shape
#day_custom day_custom_Fri day_custom_Mon ...
#click 0.102019 0.037468 ...
#impression 0.087661 0.045886 ...
#
#
dftranspose = df2.T
# template kw for formatting
templatekw = {'click':"P_%s_C", 'impression':"P_%s_NC"}
# build a list of small data frames with correct index names P_%s_NC etc
dflist = [dftranspose[[col]].rename(lambda x:templatekw[x]%col) for col in dftranspose]
#use the concatenate function to produce a sparse dictionary
MyDict= pd.concat(dflist).to_dict()
Instead of assigning to MyDict at the end, you can use the update method during the loop.
For context on the comments below, here is my original answer:
Try to use a pivot_table:
def clickfunc(x):
    return np.sum(x) * P_C / MySumC

def impressionfunc(x):
    return np.sum(x) * P_NC / MySumNC

newtable = df2.pivot_table(['click', 'impression'], feature_name,
                           aggfunc=[clickfunc, impressionfunc])
#transpose the table for the dictionary to have the right form
newtable = newtable.T
#to_dict functionality already gives the correct result
MyDict = newtable.to_dict()
#rename by copying
copydict = {}
for feature_value, subdict in MyDict.items():
    word = feature_name + "_" + feature_value
    copydict[word] = {'P_' + word + '_C': subdict['click'],
                      'P_' + word + '_NC': subdict['impression']}
This gives you the result you want in copydict
itertuples() is what worked for me (and worked at light speed), though it still doesn't use the map/apply approach that I so much wanted to see. itertuples on a pandas dataframe returns the whole row, so I no longer have to do df2[df2[feature_name]==feature_value]['click']. Be aware that this matching by value is not only expensive but also undesirable, since it may return a Series if there are duplicate rows. itertuples solves that problem elegantly, though I then need to access the individual columns by integer indexes, which means less reusable code. I could abstract this, but it won't be like accessing by column names, the status quo.
for row in df2.itertuples():
    Mydict[feature_name+'_'+str(row[1])] = {'P_'+feature_name+'_'+str(row[1])+'_C': (row[2]*float(P_C))/MySumC, \
                                            'P_'+feature_name+'_'+str(row[1])+'_NC': (row[3]*float(P_NC))/MySumNC}
Note that I am accessing each column in the row as row[1], row[2], and so on. For example, row looks like (0, u'Fri', 77592, 7029703).
Post this I get
dict(Mydict)
{'day_custom_Thu': {'P_day_custom_Thu_NC': 0.18345372640838162, 'P_day_custom_Thu_C': 0.0019559423132143377}, 'day_custom_Mon': {'P_day_custom_Mon_C': 0.0011466875948906617, 'P_day_custom_Mon_NC': 0.099300235316209587}, 'day_custom_Sat': {'P_day_custom_Sat_NC': 0.14336163246883712, 'P_day_custom_Sat_C': 0.0017354517827023852}, 'day_custom_Tue': {'P_day_custom_Tue_C': 0.001454726996987919, 'P_day_custom_Tue_NC': 0.1203925662982053}, 'day_custom_Sun': {'P_day_custom_Sun_NC': 0.13239618235343156, 'P_day_custom_Sun_C': 0.0017488722589598259}, 'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_NC': 0.11107365073163174, 'P_is_tablet_phone_TRUE_C': 0.00046690100046229593}, 'day_custom_Wed': {'P_day_custom_Wed_NC': 0.12467127727567069, 'P_day_custom_Wed_C': 0.0013566522616712882}, 'day_custom_Fri': {'P_day_custom_Fri_NC': 0.1849842396242351, 'P_day_custom_Fri_C': 0.0020418070466026303}, 'is_tablet_phone_FALSE': {'P_is_tablet_phone_FALSE_NC': 0.74447539516197614, 'P_is_tablet_phone_FALSE_C': 0.0098789704610580936}}
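As a side note on the column-name concern: in recent pandas versions itertuples() yields namedtuples, so a sketch like the following (reusing the variables from the loop above; getattr looks the feature column up by name, which works as long as the column name is a valid Python identifier, as day_custom and is_tablet_phone are) keeps name-based access:

for row in df2.itertuples(index=False):
    value = str(getattr(row, feature_name))   # e.g. 'Fri'
    Mydict[feature_name + '_' + value] = {
        'P_' + feature_name + '_' + value + '_C': (row.click * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + value + '_NC': (row.impression * float(P_NC)) / MySumNC}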