Let's say I have a nested array like this:
[
['2020-06-17 00:10:00', 2345, 145, 27245],
['2020-06-17 00:11:00', 8999, 189, 28999],
['2020-06-17 00:12:00', 8492, 192, 28492],
['2020-06-17 00:13:00', 1233, 134, 29334],
['2020-06-17 00:14:00', 3352, 135, 28234]
]
How can I select a specific "column" from that if:
A) it's a list of lists
B) it's a NumPy array of NumPy arrays
and how can I replace that column with the values of a 1D list/array of the same length, for example [100, 200, 300, 400, 500]?
C) Also, how can I drop one specific column?
A) You can do this with a list comprehension. Note that I have stored the data in a variable alist, and it is not the same array as the one shown above.
alist = [[0, 1, 2, 3, 4, 5],
[6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]]
If you wanted to get, say, the third column, this is the code (note: zero-indexing applies):
[row[2] for row in alist]
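The question also asked about replacing a column. For a plain list of lists, a minimal sketch (assuming the replacement list has one value per row) is to assign into each row in a loop:
new_col = [100, 200, 300, 400]  # one value per row of alist
for row, value in zip(alist, new_col):
    row[2] = value  # overwrite the third column in place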
B) For NumPy, it is even easier. We turn the same list into an array, and then just index it to get the third column.
import numpy as np
npalist = np.array(alist)
npalist[:,2]
Basically, I imported NumPy and converted the 2D list into a 2D NumPy array, then indexed it. A colon in the first position selects all rows; a colon in the second position selects all columns. Here we select all rows of one specific column (the third one).
We can replace the third column with the array [100, 200, 300, 400] like this:
npalist[:, 2] = [100, 200, 300, 400]
Even though my example is not the same shape as your array, the logic is the same.
C) For the third question, I will use NumPy. We can use the np.delete() function (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.delete.html).
As you can see from the docs, we need to pass three parameters: the array, the row or column index, and the axis. If we want to drop the third column, we can run the following code.
np.delete(npalist, 2, 1)
The first parameter is the array. The second parameter is the row or column index: if we want the 3rd row, we type in 2 (zero-indexing), and if we want the 3rd column, we still type in 2. What's the catch?
It's the last parameter, the axis. If it is 0, we are dropping rows; if it is 1, we are dropping columns. The snippet above uses axis 1 because we are dropping a column.
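One thing worth adding: np.delete does not modify the array in place, it returns a new array, so you typically reassign the result:
npalist = np.delete(npalist, 2, 1)  # the original array is left untouched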
Related
Alright, I'll explain the project in more detail for a better understanding of my problem and what I want to achieve.
I have one Python script extracting data at time intervals and another Python script reading the csv file.
The first script extracts the State values every few seconds, puts them into a dataframe together with the time each state was measured, and saves it to a .csv file.
The second script reads the .csv file generated by the first script, as I'll show below.
My main Dataframe is this:
   Datetime  State  G_val  start  ssi
1  11:02:32      2      1   True    0
2  11:02:50      2      1   True    1
3  11:03:19      3      1   True    2
4  11:03:49      1      1   True    3
5  11:04:21      2      1   True    4
When my second script reads the .csv file, I end up with a dataframe in the following format:
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Datetime 1361 non-null object
1 State 1361 non-null int64
2 Gval 1361 non-null int64
3 start 1361 non-null bool
4 script_start_index 1361 non-null int64
I found a way to find the patterns that I need, but because my pattern appears across a sequence of rows of the State column, I need to convert my dataframe into a single sequence.
Example of pattern that I need:
Pattern = "2,2,3"
The pattern appears at index positions 0-2 in the State column:
   Datetime  State  G_val  start  ssi
0  11:02:32      2      1   True    0
1  11:02:50      2      1   True    1
2  11:03:19      3      1   True    2
Code that I used to concatenate the State column into a single string so I can search it for the pattern:
#Input
df2 = df['State']
growth = df2.astype(str).str.cat()
print(growth)
#Output
223121111231221232....
Then I used the re library to find the pattern matches; each match gives me a tuple with the start position and the end position (exclusive) of the pattern:
from re import finditer

for match in finditer("223", growth):
    data = match.span()  # (start, end) positions of the match
    print(data)
#output
(0, 3)
(117, 120)
(195, 198)
(247, 250)
(339, 342)
(416, 419)
(423, 426)
(427, 430)
(433, 436)
(517, 520)
(545, 548)
(562, 565)
....
Note: this file keeps updating and new patterns keep being generated, which is why I need to work with variables rather than hard-coded positions.
My goal is to display the Datetime column for the locations of the states, based on the index tuples shown above.
I found this expression that works when the values are given directly:
#Input (partial solution)
x = data
df.iloc[x:y, 0:3]
example:
x = 117
y = 120  # iloc's end is exclusive, so this includes row 119
print(df.iloc[x:y, 0:3])
#output
Datetime State G-val
117 11:16:23 2 1
118 11:16:53 2 1
119 11:17:23 3 1
But I need a way to feed all of the tuples into the iloc expression as variables, so it extracts the rows for every tuple, like the abstraction below.
For all tuple values found by this code:
for match in finditer("223", growth):
    data = match.span()
    print(data)
#output
(0, 3)
(117, 120)
(195, 198)
(247, 250)
(339, 342)
(416, 419)
.....
My goal is to plug all the tuple values into the expression below so it gives me the desired output for every tuple:
df.iloc[x:y, 0:3]
so that it behaves like:
df.iloc[(first tuple value, dynamically updated):(second tuple value, dynamically updated), 0:3]
#desired output
Datetime State G-val
0 11:02:32 2 1
1 11:02:50 2 1
2 11:03:19 3 1
117 11:16:23 2 1
118 11:16:53 2 1
119 11:17:23 3 1
195 11:28:34 2 1
196 11:30:34 2 1
197 11:37:23 3 1
247 11:48:44 2 1
248 11:49:14 2 1
249 11:50:00 3 1
...
I already tried to supply the values dynamically by transforming the tuples into a dict, a list, or another dataframe, but I didn't find a way to pass all the positions as variables; the furthest I got returned only the last tuple's positions instead of working with all the values.
I need a way to split the tuple values so I can put each pair of indices into the df.iloc[] expression for all of these tuples (the number of tuples keeps growing as the first script updates the csv file that the second script reads in a loop).
If there is a way to do this using the df.iloc[] expression, I think that would be the best fit for what I need.
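To make the abstraction concrete, this is a rough, untested sketch of the kind of loop I imagine (assuming the slices could be stitched together with pandas.concat):
import re
import pandas as pd

pieces = []
for match in re.finditer("223", growth):
    start, end = match.span()  # end is exclusive, just like iloc's end
    pieces.append(df.iloc[start:end, 0:3])
result = pd.concat(pieces)
print(result)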
One additional question, if anyone can help: is an LSTM the best model to predict when these patterns (the state can only take 3 values) appear over time?
The patterns occur at unknown points in time, but they seem predictable from the last measured values.
Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns at those positions in the list. This is where I have problems. Do you have any ideas? Thanks.
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
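For reference, the same selection can likely be written in one line with a boolean mask over the last row (a sketch, not tested against your data):
df_positive = df.loc[:, df.iloc[-1] > 0]  # keep only columns whose last value is positive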
Here you are ;)
sample df1:
a b c
date
2017-04-01 0.5 -0.7 -0.6
2017-04-02 1.0 1.0 1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
c
date
2017-04-02 1.3
The logic will be the same for your case.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
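For example, a minimal sketch with .loc (the column names here are made up for illustration):
column_names = ['AAPL_MSFT', 'GOOG_AMZN']  # hypothetical pair names
df_subset = df.loc[:, column_names].copy()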
I'm struggling with the numpy library.
I have a tensor of the shape (batch_size, timestep, feature).
For example, let's create a dummy:
x = np.arange(42).reshape(2, 7, 3)
# now make some rows have homogeneous values
x[:, ::3, :] = 0
x[:, ::5, :] = 2
Now I need a NumPy-ish way (one that can be reproduced in TensorFlow) to remove the rows (axis=-2) where all values are the same. In the end the tensor should look like this:
[[[ 3 4 5]
[ 6 7 8]
[12 13 14]]
[[24 25 26]
[27 28 29]
[33 34 35]]]
Thanks.
P.S. This is not the same question as "remove all zero rows": here we are talking about rows with homogeneous values, which is a bit trickier.
If you are okay with losing one dimension (so that your array remains homogeneous), then you can do:
x[~np.all(x == x[:, :, 0, np.newaxis], axis=-1)]
# out:
[[ 3 4 5]
[ 6 7 8]
[12 13 14]
[24 25 26]
[27 28 29]
[33 34 35]]
Credit: @unutbu's answer to a similar problem, here adapted to one more dimension.
Why is the 3rd dimension removed? Imagine your conditions were such that you wanted to keep 2 rows from your first array and 3 from your second: the result would then be ragged, and would have to be stored as a masked array or as a list of arrays.
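Since the question asks for something reproducible in TensorFlow, the same mask can presumably be built with TF ops; a minimal sketch (assuming TF 2.x eager execution):
import tensorflow as tf

x_tf = tf.constant(x)
# mark rows where not all features equal that row's first feature
mask = ~tf.reduce_all(x_tf == x_tf[:, :, :1], axis=-1)
out = tf.boolean_mask(x_tf, mask)  # flattens the first two axes, like the NumPy version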
There might be a more clever way using only numpy. However, you could just iterate over the 2nd dimension and do a comparison.
not_same = []
for n in range(x.shape[1]):  # iterate over the 2nd dimension
    # test if it is homogeneous, i.e. the first value equals all values
    not_same.append(~np.all(x[:, n, :] == x[0, n, 0]))
out = x[:, not_same, :]
This gives you:
array([[[ 3, 4, 5],
[ 6, 7, 8],
[12, 13, 14]],
[[24, 25, 26],
[27, 28, 29],
[33, 34, 35]]])
Somehow I ended up with a list that looks like this: [ 1 36 2 72 37 74], instead of [1, 36, 2, 72, 37, 74]. How can I convert it so that I can use these values to select the rows of matrix A, which is a 5266 x 441 matrix in my case? The output should be a 6 x 441 matrix.
Although I don't see the difference between your lists (a list printed without commas is usually just the repr of a NumPy array), I think you can use the tf.gather function to end up with the matrix you want: https://www.tensorflow.org/api_docs/python/tf/gather
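A minimal sketch of how that could look (assuming A is already a tensor or NumPy array):
import tensorflow as tf

indices = [1, 36, 2, 72, 37, 74]
rows = tf.gather(A, indices)  # picks those rows; shape (6, 441) if A is (5266, 441)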
I am looking for a fast formulation to do numerical binning of a 2D numpy array. By binning I mean calculating submatrix averages or cumulative values. For example, x = numpy.arange(16).reshape(4, 4) would be split into 4 submatrices of 2x2 each, giving numpy.array([[2.5, 4.5], [10.5, 12.5]]), where 2.5 = numpy.average([0, 1, 4, 5]), etc.
How can I perform such an operation efficiently? I don't really have any idea how to do this.
Many thanks...
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
In general, if you want bins of shape (a, b) for an array of shape (rows, cols), the reshape should be .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean calls matters: a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) only has three dimensions. a_view.mean(axis=1).mean(axis=2) would work fine, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
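Wrapped up as a small helper under the same divisibility assumption (the function name bin2d is mine):
import numpy as np

def bin2d(arr, a, b):
    # average over non-overlapping (a, b) blocks; assumes a divides rows and b divides cols
    rows, cols = arr.shape
    return arr.reshape(rows // a, a, cols // b, b).mean(axis=(1, 3))

# bin2d(np.arange(16).reshape(4, 4), 2, 2) -> [[2.5, 4.5], [10.5, 12.5]]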
See the SciPy Cookbook on rebinning, which provides this snippet:
def rebin(a, *args):
    '''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
    are factors of the original dimensions. eg. An array with 6 columns and 4 rows
    can be reduced to have 6, 3, 2 or 1 columns and 4, 2 or 1 rows.
    example usages:
    >>> a = rand(6, 4); b = rebin(a, 3, 2)
    >>> a = rand(6); b = rebin(a, 2)
    '''
    from numpy import asarray
    shape = a.shape
    lenShape = len(shape)
    factor = asarray(shape) // asarray(args)  # block size along each axis
    # build an expression like 'a.reshape(args[0],factor[0],...).sum(1).sum(2).../factor[0]/...'
    evList = ['a.reshape('] + \
             ['args[%d],factor[%d],' % (i, i) for i in range(lenShape)] + \
             [')'] + ['.sum(%d)' % (i + 1) for i in range(lenShape)] + \
             ['/factor[%d]' % i for i in range(lenShape)]
    print(''.join(evList))
    return eval(''.join(evList))
I assume that you just want to know how to build a function that performs well and operates on arrays, like numpy.reshape in your example. If performance really matters and you're already using numpy, you can write your own C code for that, like numpy does. For example, the implementation of arange is written entirely in C. Almost everything in numpy that matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in Python and see if the performance is good enough. Try to make the Python code as efficient as possible. If it still doesn't meet your performance needs, go the C way.
You may read about that in the docs.