How to work with symbols, numbers, and letters in TensorFlow?

I'm working on my first TensorFlow model, and when I was training on the dataset, my accuracy dropped to 25% from around 60% when using scikit-learn. A friend told me it might have to do with some of the data, for example, "781C376B-E380-C052-448B-B4AB6F3D". How do I deal with symbols (dashes here), numbers, and letters in my data when running my models?
Currently I am looking into text vectorization so the model can read my data more easily.

You can use tf.strings.unicode_decode(), which converts an encoded string scalar to a vector of code points. It provides a unique number for each character in the string.
For example:
import tensorflow as tf

# A batch of Unicode strings, each represented as a UTF-8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'781C376B-E380-C052-448B-B4AB6F3D']]
# Decode each string into a ragged tensor of Unicode code points.
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)
Output: [55, 56, 49, 67, 51, 55, 54, 66, 45, 69, 51, 56, 48, 45, 67, 48, 53, 50, 45, 52, 52, 56, 66, 45, 66, 52, 65, 66, 54, 70, 51, 68]
For more details, please refer to this document. Thank you.
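Since you mention looking into text vectorization: as an alternative, a character-level tf.keras.layers.TextVectorization layer can turn each character (dashes included) into an integer token. A minimal sketch, assuming TensorFlow 2.x:

import tensorflow as tf

# A minimal sketch, assuming TF 2.x: character-level vectorization that
# keeps the dashes as tokens instead of stripping them as punctuation.
vectorize = tf.keras.layers.TextVectorization(
    standardize=None,    # keep case and punctuation such as '-'
    split='character',   # one token per character
    output_mode='int')

ids = tf.constant(['781C376B-E380-C052-448B-B4AB6F3D'])
vectorize.adapt(ids)     # build the character vocabulary from the data
print(vectorize(ids))    # one integer per character, ready for an Embedding layer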

Related

Recurrent neural network LSTM problem solved using 1 epoch

Looking at a solved problem whose goal is to predict stock prices, I found that only 1 epoch is used to train the model. The data consists of a little under 1500 points, each corresponding to a daily closing price. So we have a dataset of dates (days) and prices.
Using the LSTM approach, the X_train training set is generated as follows:
Original dataset:
Date        Price
1-1-2010    100
2-1-2010    80
3-1-2010    50
4-1-2010    40
5-1-2010    70
...
30-10-2012  130
...
X_train:
[[100, 80, 50, 40, 70, ...],
 [80, 50, 40, 70, 90, ...],
 [50, 40, 70, 90, 95, ...],
 ...
 [..., 78, 85, 72, 60, 105],
 [..., 85, 72, 60, 105, 130]]
Each window is 60 days long and is shifted forward by one day each time, until a set fraction of the total dataframe (the training set) is covered, as sketched below. Please don't consider things like normalization, etc.; this is just an example.
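For reference, a minimal sketch of this sliding-window construction (hypothetical toy data; the question uses windows of length 60):

import numpy as np

# Hypothetical toy prices; the real dataset has ~1500 daily closes.
prices = np.array([100, 80, 50, 40, 70, 90, 95])
window = 3   # 60 in the question

# Each row is `window` consecutive prices, shifted forward by one day.
X_train = np.array([prices[i:i + window]
                    for i in range(len(prices) - window)])
y_train = prices[window:]   # the next day's price for each window
print(X_train)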
The thing is that in the training part of the problem the epochs are set to 1. This is the first time I've seen the approach of making just a single pass through the data to train the model. I've searched for it, but to no avail.
Does anyone know what this technique is called (if it has a name), so I can search more about it?

How to eliminate rows in a dataframe that share values with another dataframe? (RStudio)

I need to eliminate the rows in a dataframe that have, in a given column, values in common with the same column of a second dataframe.
The column the code has to take into account contains IDs of subjects, while the rest contain data referring to those subjects.
Example dataframes (RStudio):
df1<-data.frame(ID=c(13, 16, 25, 36, 25, 17, 50, 63, 61, 34, 65, 17), AnyData=round(runif(12, 1, 5)))
df2<-data.frame(ID=c(89, 57, 13, 17, 18, 21, 51, 50, 72, 84), AnyData=round(runif(10, 1, 5)))
I have tried two approaches:
df1<- filter(df1, ID!=df2[ID])
df1<- df1[-c(which(df1[ID]==df2[ID]))]
The result should be:
df1 <- data.frame(ID=c(16, 25, 36, 25, 63, 61, 34, 65), AnyData=(...))
AnyData depends on the values assigned by runif, so it will vary, but the values must be the same as in the original df1.
What you need is an anti_join(), which keeps only the rows of df1 whose ID has no match in df2:
library(dplyr)
df1 %>%
  anti_join(df2, by = "ID")

postgresql expression where id is in array

I have an array of ids such as:
a = [13, 51, 99, 143, 225, 235, 873]
What is the most efficient way of getting the records where the id is in the array?
I don't really want to chain ORs, such as WHERE id = 13 OR id = 92, as the array could be extremely long. I've tried this:
select * from authors where id <# [11, 8, 51, 62, 7];
but that's not correct.
Thanks
Use any():
select *
from authors
where id = any (array[11, 8, 51, 62, 7]);
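If the ids start out as a client-side list (as a = [...] above suggests), most drivers let you pass the whole list as a single array parameter. A minimal sketch, assuming Python with psycopg2, which adapts a Python list to a PostgreSQL array (the connection string is hypothetical):

import psycopg2

a = [13, 51, 99, 143, 225, 235, 873]

conn = psycopg2.connect('dbname=mydb')   # hypothetical connection string
cur = conn.cursor()
# psycopg2 sends the Python list as a PostgreSQL array, so the whole
# list becomes one any(...) parameter regardless of its length.
cur.execute('select * from authors where id = any(%s)', (a,))
rows = cur.fetchall()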

matplotlib plot data with nans

I'm surprised by how few posts relate to this problem. Anyway, here it is:
I have CSV data files containing X values in the first column and several Y-value columns thereafter. But for a given X value, not all Y series have a corresponding value. Here is an example:
0, 16, 96, 99
10, 88, 45, 85
20, 85, 61, 10
30, 30, --, 45
40, 82, 28, 82
50, 23, 9, 61
60, 40, 77, 0
70, 26, 21, --
80, --, 58, 99
90, 1, 14, 30
When this CSV data is loaded with numpy.genfromtxt, the '--' strings are read as NaN, which is good. But when plotting, the lines are interrupted with gaps wherever there is a NaN. Is there an option to make pyplot.plot() ignore both the NaN and the corresponding X value?
Not sure if matplotlib has such functionality built in, but you could home-brew it by doing the following:
idx = ~numpy.isnan(Y)        # mask of the positions where Y is not NaN
pyplot.plot(X[idx], Y[idx])  # plot only those points, dropping their X values too
Look at this post.
As proposed in my answer there, I'd recommend using np.isfinite instead of np.isnan. There might be other reasons for your plot to have discontinuities, e.g., inf.
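For completeness, a minimal self-contained sketch using np.isfinite, with hypothetical data standing in for the CSV columns:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for one X column and one Y column; the Y series
# contains a nan (from '--') and an inf to show both cases.
X = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
Y = np.array([16.0, 88.0, np.nan, 30.0, np.inf])

mask = np.isfinite(Y)        # True only where Y is a real, finite value
plt.plot(X[mask], Y[mask])   # non-finite points are skipped instead of leaving gaps
plt.show()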

sql - incorrect total using SUM of DATEDIFF results

When I run DATEDIFF(dd, value1, value2), it returns a list of values:
56, 90, 61, 61, 58, 70, 193, 28, 143, 143, 0, 146, which totals 1360 (I've checked in Excel).
However, when I use SUM(DATEDIFF(dd, value1, value2)), I get the value 1459.
I have tried running the SUM query on a few different lists and comparing against the totals obtained manually by the first method. Sometimes the SUM value is higher, sometimes lower, and there doesn't seem to be any relationship/ratio.
Any ideas? Thanks.