Select rows in matrix using list in TensorFlow

Somehow I ended up with a list that looks like this: [ 1 36 2 72 37 74] instead of [1, 36, 2, 72, 37, 74]. How can I convert it so that I can use these values to select rows of matrix A, which in my case is a 5266 x 441 matrix? The output should be a 6 x 441 matrix.

The comma-less form is simply how NumPy prints an array, so you most likely already have a NumPy array rather than a Python list, and you can index with it directly. Either way, you can use the tf.gather function to end up with the matrix you want: https://www.tensorflow.org/api_docs/python/tf/gather
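A minimal sketch of the tf.gather approach, assuming the shapes from the question (the random matrix here is just a stand-in for your A):
import numpy as np
import tensorflow as tf

A = tf.constant(np.random.rand(5266, 441))    # stand-in for your matrix
indices = np.array([1, 36, 2, 72, 37, 74])    # works as a list or an array
rows = tf.gather(A, indices)                  # gathers along axis 0 by default
print(rows.shape)                             # (6, 441)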

Related

Pandas: make new column after nth row

I have the following table as a data frame.
0
8
990
15
70
85
36
2
43
5
68
61
62
624
65
82
523
98
I want to create a new column after every third row, so the data should look like this:
0    8    990
15   70   85
36   2    43
5    68   61
62   624  65
82   523  98
Thanks in advance.
Looks like your column can be converted into a list. If this is the case, you can slice that list into three interleaved sub-lists, build a dataframe from them, and transpose it.
The code might look something like this:
import pandas as pd

listofitems = [...]
## create new dataframe based on list index jumps of 3
newdf = pd.DataFrame([listofitems[i::3] for i in range(3)])
## transpose dataframe to a 3-column dataframe
newdf = newdf.T
For the example given above, 4139055 rows is not big data. If you do have big and complex data, take a look at PySpark, specifically at Spark DataFrames. It is one of the big data frameworks that helps optimize data transformations over large dataframes.
import pandas as pd
import numpy as np
numbers = [0, 8, 990, 15, 70, 85, 36, 2, 43,
           5, 68, 61, 62, 624, 65, 82, 523, 98]
pd.DataFrame(np.reshape(numbers, (6, 3)))

I can't pass multiple tuples of index positions where a pattern occurs into df.iloc[] to show only the rows at those positions

Alright, I'll explain the project in more detail to give a good understanding of my problem and what I want to achieve.
I have one Python script extracting data at time intervals and another Python script reading the resulting CSV file.
The first script extracts the State values over a period of seconds, puts them into a dataframe along with the time each state was measured, and saves it to .csv.
The second script reads the .csv file generated by the first script, as I'll show below.
My main Dataframe is this:
   Datetime  State  G_val  start  ssi
1  11:02:32      2      1   True    0
2  11:02:50      2      1   True    1
3  11:03:19      3      1   True    2
4  11:03:49      1      1   True    3
5  11:04:21      2      1   True    4
When my second script reads the .csv file, I define the above formats on a dataframe:
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Datetime            1361 non-null   object
 1   State               1361 non-null   int64
 2   Gval                1361 non-null   int64
 3   start               1361 non-null   bool
 4   script_start_index  1361 non-null   int64
I found a way to find the patterns that I need, but because each pattern appears across a sequence of rows of the State column, I need to convert that column into one sequence.
Example of pattern that I need:
Pattern = "2,2,3"
The pattern appears at index positions 0-2 of the State column:
   Datetime  State  G_val  start  ssi
0  11:02:32      2      1   True    0
1  11:02:50      2      1   True    1
2  11:03:19      3      1   True    2
Code that I used to put the State column into one sequence so I can search for the pattern:
#Input
df2 = df['State']
growth = df2.astype(str).str.cat()
print(growth)
#Output
223121111231221232....
Then I used the re library to match the pattern; each match gives a tuple with the start position and the exclusive end position of the pattern:
from re import finditer

for match in finditer("223", growth):
    data = tuple(match.span())
    print(data)
#output
(0, 3)
(117, 120)
(195, 198)
(247, 250)
(339, 342)
(416, 419)
(423, 426)
(427, 430)
(433, 436)
(517, 520)
(545, 548)
(562, 565)
....
Note: this file keeps updating and new patterns keep being generated, which is why I need to use variables.
My goal is to show the Datetime column for the rows located at the index positions given by the data tuples mentioned above.
I found this expression, which works when the values are given directly:
#input (partial code solution)
x, y = data
df.iloc[x:y, 0:3]
example:
x = 117
y = 120
print(df.iloc[x:y, 0:3])
#output
Datetime State G-val
117 11:16:23 2 1
118 11:16:53 2 1
119 11:17:23 3 1
But I need a way to feed all of the tuples into the iloc expression as variables, so it extracts the rows for every tuple, like the abstraction below.
For all tuple values found by the code below:
for match in finditer("223", growth):
    data = tuple(match.span())
    print(data)
#output
(0, 3)
(117, 120)
(195, 198)
(247, 250)
(339, 342)
(416, 419)
.....
My goal is to plug all the tuple values into the expression below so it gives me the desired output for all tuples:
df.iloc[x:y, 0:3]
so that it behaves like:
df.iloc[(first tuple value, dynamically updated):(second tuple value, dynamically updated), 0:3]
#desired output
Datetime State G-val
0 11:02:32 2 1
1 11:02:50 2 1
2 11:03:19 3 1
117 11:16:23 2 1
118 11:16:53 2 1
119 11:17:23 3 1
195 11:28:34 2 1
196 11:30:34 2 1
197 11:37:23 3 1
247 11:48:44 2 1
248 11:49:14 2 1
249 11:50:00 3 1
...
I already tried to supply the values dynamically by transforming the tuples into a dict, a list, or another dataframe, but I could not find a way to pass all the positions as variables; the furthest I got returned only the last position's value instead of working over all of them.
I need a way to split the tuple values so that each pair of indices goes into df.iloc[] for all of these tuples (the number of tuples keeps growing as the first script updates the csv file that the second script reads in a loop).
If there is a way to do this with df.iloc[], I think that is the best approach for what I need.
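A minimal sketch of that idea, assuming df and growth are built exactly as in the question: collect every span from re.finditer and concatenate the resulting iloc slices (span ends are exclusive, which matches iloc's end-exclusive slicing):
import re
import pandas as pd

spans = [m.span() for m in re.finditer("223", growth)]
# one iloc slice per match, stitched into a single dataframe
result = pd.concat(df.iloc[start:end, 0:3] for start, end in spans)
print(result)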
One additional doubt, if anyone can help: is an LSTM the best model to predict when these patterns appear over time (the state can only assume 3 values)?
The patterns occur at unknown intervals in time, but seem predictable from the last measured values.

How to select "column" in nested list?

Let's say I have a nested array like this:
[
['2020-06-17 00:10:00' 2345 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' 1233 134 29334 ]
['2020-06-17 00:14:00' 3352 135 28234 ]
]
How can I select a specific "column" from that if:
A) it's a list of lists
B) it's a NumPy array of NumPy arrays
and how can I replace it with the values of a 1-D list/array of the same length, for example [100, 200, 300, 400, 500]?
C) Plus, how can I drop one specific column?
A) You can do this as follows. Note that I have stored the data in the variable alist, and it is not the same array as the one shown above.
alist = [[0, 1, 2, 3, 4, 5],
[6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]]
If you wanted to get, say, the 3rd column, this is the code (note: zero-based indexing applies):
[row[2] for row in alist]
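The question also asks about replacing that column; for a plain list of lists, a short in-place loop is enough (a sketch reusing the same alist, which has 4 rows):
new_col = [100, 200, 300, 400]
for row, value in zip(alist, new_col):
    row[2] = value    # overwrite the 3rd column of each row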
B) For NumPy it is even easier. We have the same list turned into an array, but now we just specify the indices to get the 3rd column.
import numpy as np
npalist = np.array(alist)
npalist[:,2]
Basically, I just imported NumPy, converted the 2-D list into a 2-D NumPy array, and indexed it. A colon in the first position selects all rows, and a colon in the second position selects all columns; here we specified one particular column (the 3rd). You can run this locally to verify.
We can replace the third column with array [100,200,300,400] by doing this:
npalist[:,2] = [100,200,300,400]
Even though it is not the same shape as your array, it is the same logic.
C) For the third question I will use NumPy. We can use the np.delete() function (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.delete.html).
As you can see, we need to pass three parameters: the array, the row or column index, and the axis. You will see what I mean below. If we want to drop the third column, we can run the following code.
np.delete(npalist, 2, 1)
If you don't know what that means, here it is. The first parameter is, obviously, the array. The second parameter is the row or column index, so with zero-based indexing the 3rd row and the 3rd column are both index 2. What's the catch?
It's the last parameter: if it is 0, we are dropping rows; if it is 1, we are dropping columns. The snippet above uses axis 1 since we are dropping a column. Note that np.delete() returns a new array and leaves npalist unchanged.

Pandas Get Two Previous and Two Next non-NaN Values in Column

I have a Pandas DataFrame with a column containing some missing data. I want to try an imputation strategy where, for a NaN point, I impute the point with a value drawn from a distribution centered at a weighted average of the previous two and next two non-NaN data points. I am not sure how to get those values in Pandas, ideally as an np.ndarray.
For example if I had this DataFrame:
0    12
1    33
2   NaN
3    22
4   NaN
5     7
I would like a function called on row 2 to return [12, 33, 22, 7], and called on row 4 to return [33, 22, 7, None].
This post deals with a similar problem, but not exactly what I am looking for.
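A minimal sketch of such a function, assuming a default integer index like the frame shown; the function name is made up, and sides with fewer than two non-NaN neighbours are padded with None:
import numpy as np
import pandas as pd

def nearest_non_nan(s, idx, k=2):
    valid = s.dropna()
    before = valid[valid.index < idx].tail(k).tolist()
    after = valid[valid.index > idx].head(k).tolist()
    # pad to exactly k values on each side
    before = [None] * (k - len(before)) + before
    after = after + [None] * (k - len(after))
    return np.array(before + after, dtype=object)

s = pd.Series([12, 33, np.nan, 22, np.nan, 7])
print(nearest_non_nan(s, 2))   # [12.0 33.0 22.0 7.0]
print(nearest_non_nan(s, 4))   # [33.0 22.0 7.0 None]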

pandas Selecting/sampling at different interval frequencies

I have a 2d array/DF
>>> X_df.shape
Out[36]: (138, 2164)
Each row of this array has values up to a given column (different for each row), followed by NaNs:
>>> X_df.head(1).T
Out[28]:
Row1
0 1208.380
1 1207.600
2 1207.400
... ...
247 1213.030
248 1212.950
249 1213.000
... ...
1914 nan
1915 nan
1916 nan
I need to create another array/DF Y of shape (138, n), which has n (= 3, 5, or 10) values from each row of X_df, selected equally spaced. So if X_df row 1 has 100 elements and row 2 has 50, then for n=10, Y row 1 = every 10th element of X_df row 1 and Y row 2 = every 5th element of X_df row 2.
I created a function to get an array of the number of valid values in each row:
>>> last= X_df.apply(last_index,axis=1)
>>> last
Out[34]:
Row1 360
... ...
Row45 1438
What might be the best way of getting the required Y array/DF? I want to avoid loops here. I tried X_df[::last], but it gave the error "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".
I looked into np.meshgrid, but that didn't seem relevant. I also looked at DataFrame.sample, but that seems useful only for random sampling.
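A loop-free sketch of one way to build Y, assuming NaNs only ever appear as a trailing block in each row (the function name is made up):
import numpy as np
import pandas as pd

def sample_equally_spaced(X_df, n):
    arr = X_df.to_numpy()
    # number of valid (non-NaN) values per row
    lengths = (~np.isnan(arr)).sum(axis=1)
    # n equally spaced positions within each row's valid range
    cols = (np.linspace(0, 1, n, endpoint=False) * lengths[:, None]).astype(int)
    rows = np.arange(arr.shape[0])[:, None]
    return pd.DataFrame(arr[rows, cols])

Y = sample_equally_spaced(X_df, 10)   # shape (138, 10)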