I'm surprised how few posts relate to this problem. Anyway, here it is:
I have CSV data files containing X values in the first column and several columns of Y values after that. But for a given X value, not all Y series have a corresponding value. Here is an example:
0, 16, 96, 99
10, 88, 45, 85
20, 85, 61, 10
30, 30, --, 45
40, 82, 28, 82
50, 23, 9, 61
60, 40, 77, 0
70, 26, 21, --
80, --, 58, 99
90, 1, 14, 30
When this CSV data is loaded with numpy.genfromtxt, the '--' strings are taken as nan, which is good. But when plotting, the plots are interrupted with blanks wherever there is a nan. Is there an option to make pyplot.plot() ignore both the nan and the corresponding X value whenever a nan appears?
Not sure if matplotlib has such functionality built in, but you could home-brew it by doing the following:
idx = ~numpy.isnan(Y)          # boolean mask: True where Y holds a real value
pyplot.plot(X[idx], Y[idx])    # plot only the points where Y is not nan
Look at this post
As proposed in my answer there, I'd recommend using np.isfinite instead of np.isnan. There might be other reasons for your plot to have discontinuities, e.g., inf.
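For completeness, here is a minimal runnable sketch of that masking approach (the data below is made up):
import numpy as np
import matplotlib.pyplot as plt

# Toy series containing both a nan and an inf
X = np.array([0., 10., 20., 30., 40.])
Y = np.array([16., 88., np.nan, 30., np.inf])

# np.isfinite drops both nan and +/-inf in one mask
mask = np.isfinite(Y)
plt.plot(X[mask], Y[mask])
plt.show()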
Looking at a solved problem in which the goal is to predict stock prices, I found that only 1 epoch is used to solve it. The data consists of a little fewer than 1500 points, each corresponding to a daily closing price. So we have a dataset of dates (days) and prices.
Using an LSTM approach, the X_train training set is generated as:
Original dataset:
Date Price
1-1-2010 100
2-1-2010 80
3-1-2010 50
4-1-2010 40
5-1-2010 70
...
30-10-2012 130
...
X_train:
[[100, 80, 50, 40, 70, ...],
[80, 50, 40, 70, 90, ...],
[50, 40, 70, 90, 95, ...],
...
[..., 78, 85, 72, 60, 105],
[..., 85, 72, 60, 105, 130]]
Each training window is 60 values long and is shifted by one day every time, until a fraction of the total dataframe (the training set) is covered. Please don't consider things like normalization, etc. This is just an example.
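For reference, a minimal sketch of how such windows can be built from a 1-D array of prices (the function name and default window length here are my own, just to make the construction concrete):
import numpy as np

def make_windows(prices, window=60):
    # Each row holds `window` consecutive closing prices,
    # shifted by one day relative to the previous row.
    X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
    y = prices[window:]  # the price that follows each window
    return X, y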
The thing is that in the training part of the problem the epochs are set to 1. This is the first time I have seen this approach of considering just one pass through the data to train the model, and I've searched about it to no avail.
Does anyone know what this technique is called (if it has a name), so I can search for more about it?
I'm working on my first TensorFlow model, and when I was training on the dataset, my accuracy dropped to 25% from around 60% with scikit-learn. A friend told me it might have to do with some of the data, for example, "781C376B-E380-C052-448B-B4AB6F3D". How do I deal with symbols (dashes here), numbers, and letters in my data when running my models?
Currently I am looking into text vectorization so the model can consume my data more easily.
You can use tf.strings.unicode_decode(), which converts an encoded string scalar to a vector of code points. It provides a unique number for each character in the string.
For example:
import tensorflow as tf

# A batch of Unicode strings, each represented as a UTF-8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'781C376B-E380-C052-448B-B4AB6F3D']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)
Output:
[55, 56, 49, 67, 51, 55, 54, 66, 45, 69, 51, 56, 48, 45, 67, 48, 53, 50, 45, 52, 52, 56, 66, 45, 66, 52, 65, 66, 54, 70, 51, 68]
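If the model needs a fixed-shape input, one option (just a sketch; it assumes zero-padding is acceptable for your model) is to pad the ragged result to a dense tensor:
# Pad the ragged code-point tensor into a rectangular, zero-filled tensor
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=0)
print(batch_chars_padded.shape)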
For more details, please refer to this document. Thank you.
A lucky number is found by listing all numbers up to n.
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
And then remove every second number so we get: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Now the next number after 1 here is 3 so now remove every third number:
1,3,7,9,13,15,19,21,25,27,31
Now the next number after 3 is 7, so now remove every seventh number:
1,3,7,9,13,15,21,25,27,31
And the next number after 7 in our list is 9 so now remove every ninth number.
etc
The remaining numbers are lucky numbers: 1,3,7,9,13,15,21,25,31
Hello, I am a relatively new Python programmer who is trying to figure this out.
I did not even come close to solving this, and I want the lucky numbers up to 100 billion, so any advice on the best way to go about this is welcome. Here is my best try to get this done in NumPy:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
b = a[::2] #using skip of 2 on our array
c = b[::3] #using skip of 3 on our array
d = c[::7] #using skip of 7 on our array
e = d[::9] #using skip of 9 on our array
print(e)
It returns only 1, so this requires more advanced programming to find the lucky numbers.
I also need some clever way to find the next skip automatically, since I can't input millions of skips like I have done here with the skips of 2, 3, 7 & 9.
IIUC, one way is a while loop with a set tracking the steps already used:
def find_lucky(n):
    arr = list(range(1, n+1))
    done = set()
    ind = 1
    while len(arr) >= (i := arr[ind]):
        if i in done:
            ind += 1
        else:
            del arr[i-1::i]
            done.add(i)
    return arr
Output:
find_lucky(32)
# [1, 3, 7, 9, 13, 15, 21, 25, 31]
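Since you asked about NumPy, here is a sketch of the same sieve using np.delete (fine for moderate n; note that materializing all numbers up to 100 billion will not fit in memory, so that scale would need a different, segmented approach):
import numpy as np

def lucky_numbers(n):
    # After the first pass only the odd numbers survive.
    arr = np.arange(1, n + 1, 2)
    i = 1  # index of the next surviving number to use as a step
    while i < len(arr) and arr[i] <= len(arr):
        step = arr[i]
        # Drop every step-th survivor (positions step-1, 2*step-1, ...).
        arr = np.delete(arr, np.arange(step - 1, len(arr), step))
        i += 1
    return arr

lucky_numbers(32)
# array([ 1,  3,  7,  9, 13, 15, 21, 25, 31])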
I need to eliminate the rows in a dataframe whose values in one column also appear in the same column of a second dataframe.
The column the code has to take into account contains IDs of subjects, while the rest contain data referring to those subjects.
Example of dataframes (RStudio):
df1<-data.frame(ID=c(13, 16, 25, 36, 25, 17, 50, 63, 61, 34, 65, 17), AnyData=round(runif(12, 1, 5)))
df2<-data.frame(ID=c(89, 57, 13, 17, 18, 21, 51, 50, 72, 84), AnyData=round(runif(10, 1, 5)))
I have tried two approaches:
df1<- filter(df1, ID!=df2[ID])
df1<- df1[-c(which(df1[ID]==df2[ID]))]
The result should be:
df1 <- data.frame(ID=c(16, 25, 36, 25, 63, 61, 34, 65), AnyData=(...)
AnyData depends on the values assigned with runif, so it will vary, but the values must be the same as in the original df1.
What you need is an anti_join(), which keeps the rows of df1 whose ID has no match in df2:
library(dplyr)
df1 %>%
  anti_join(df2, by = "ID")
I have a CSV that looks like the image below. I want to calculate the percentiles (10, 50, 90) of each row from B2 to X2 and add each resulting percentile as a new column. Essentially, I want to find the 10th percentile of the average (std, cv, sp_tim, ...) values over the entire period of record available.
So far, I have written the following line to read it into Python as a dataframe:
da = pd.read_csv('Project/11433300_annual_flow_matrix.csv', index_col=0, parse_dates=True)
If I have understood your question correctly, then the code below might be helpful for you.
I have used some dummy data and given it the kind of treatment you are looking for:
import numpy as np
import pandas as pd

aq = [1, 2, 2, 3, 3, 4, 4, 5, 7, 8, 10, 11]
aw = [91, 25, 13, 53, 95, 94, 75, 35, 57, 88, 111, 12]
df = pd.DataFrame({'aq': aq, 'aw': aw})

n = df.shape[0]
p = 0.1  # for the 10th percentile
position = int(np.ceil(n * p))

# Nearest-rank percentile: sort each column before picking
# the value at the computed position.
df.apply(lambda col: col.sort_values().iloc[position])
Kindly have a look and let me know if this works for you.
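For the row-wise case you describe (one percentile per row across the B2:X2 columns), pandas can also do this directly with DataFrame.quantile along axis=1. A sketch, reusing the dummy frame above and assuming all value columns are numeric:
# 10th/50th/90th percentile of each row, computed across the columns
pcts = df.quantile([0.1, 0.5, 0.9], axis=1).T
pcts.columns = ['p10', 'p50', 'p90']
result = df.join(pcts)  # appends the percentiles as new columns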