I have a 3-D array with dimensions (14,3,5), which correspond to (event, color, taste).
I want to select, for all events, the second color option and the third taste option.
Could someone tell me which is the correct format?
[:,2,3] vs [:,2][:,3]
Are they the same, or different?
If they are different, how are they different?
Do a test:
In [256]: arr = np.arange(2*3*5).reshape(2,3,5)
In [257]: arr
Out[257]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
One way:
In [258]: arr[:,2,3]
Out[258]: array([13, 28])
The other is evaluated in 2 steps:
In [259]: arr[:,2]
Out[259]:
array([[10, 11, 12, 13, 14],
[25, 26, 27, 28, 29]])
In [260]: arr[:,2][:,3]
Out[260]: array([13, 28])
The [:,3] is applied to the result of the [:,2]. Each [] is translated by the interpreter into a __getitem__() call (or a __setitem__() call if followed by an =). [:,2,3] is just one call, __getitem__((slice(None),2,3)).
With scalar indices like this, they are the same.
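To see the keys the interpreter actually builds, here is a minimal sketch. `ShowKey` is a hypothetical helper written for this illustration (it is not part of numpy); its `__getitem__` simply returns whatever key it receives.

```python
import numpy as np

class ShowKey:
    """Hypothetical helper (not a numpy class): echoes the key it is indexed with."""
    def __getitem__(self, key):
        return key

show = ShowKey()
one_call_key = show[:, 2, 3]   # one __getitem__ call with a 3-element tuple
first_step_key = show[:, 2]    # the first of the two chained calls

print(one_call_key)
print(first_step_key)

# With scalar indices both spellings select the same elements.
arr = np.arange(2 * 3 * 5).reshape(2, 3, 5)
same = np.array_equal(arr[:, 2, 3], arr[:, 2][:, 3])
print(same)
```

The chained form builds two separate keys, and the second one is applied to whatever array the first call returned.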
But what if one (or both) index is a list or array?
In [261]: arr[:,[1,2],3]
Out[261]:
array([[ 8, 13],
[23, 28]])
In [262]: arr[:,[1,2]]
Out[262]:
array([[[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
In [263]: arr[:,[1,2]][:,3]
Traceback (most recent call last):
Input In [263] in <cell line: 1>
arr[:,[1,2]][:,3]
IndexError: index 3 is out of bounds for axis 1 with size 2
In [264]: arr[:,[1,2]][:,:,3]
Out[264]:
array([[ 8, 13],
[23, 28]])
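The difference gets larger when both indices are lists. The following sketch (the index lists `[1, 2]` and `[3, 4]` are chosen only for illustration) compares the single call with the two-step form:

```python
import numpy as np

arr = np.arange(2 * 3 * 5).reshape(2, 3, 5)

# One call: the two lists are broadcast together and paired up,
# selecting arr[:, 1, 3] and arr[:, 2, 4].
paired = arr[:, [1, 2], [3, 4]]

# Two steps: each list is applied on its own, one axis at a time.
stepwise = arr[:, [1, 2]][:, :, [3, 4]]

print(paired.shape)    # (2, 2)
print(stepwise.shape)  # (2, 2, 2)
```

In the single call the lists act like coordinates, so only two elements are picked per event. In the two-step form every row in `[1, 2]` is combined with every column in `[3, 4]`.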
At least you aren't making the common novice mistake of attempting:
In [265]: arr[:][2][3]
Traceback (most recent call last):
Input In [265] in <cell line: 1>
arr[:][2][3]
IndexError: index 2 is out of bounds for axis 0 with size 2
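The reason arr[:][2][3] fails is that arr[:] is just a view of the whole array, so the next [] indexes axis 0 again. A small sketch of that, using indices that are in bounds:

```python
import numpy as np

arr = np.arange(2 * 3 * 5).reshape(2, 3, 5)

whole = arr[:]              # a view of the entire array, still shape (2, 3, 5)
print(whole.shape)

# Each chained [] indexes axis 0 of the previous result:
chained = arr[:][1][2]      # same as arr[1][2], i.e. arr[1, 2]
print(chained)
```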
In the long run you need to read and understand (most of)
https://numpy.org/doc/stable/user/basics.indexing.html
I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But, the lefthand of my equation
B[1,:,:,ind].shape
is of dimension Ox1xMxN and therefore I can not broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug: when slices sit between advanced indices (the scalar 1 counts as one here, alongside ind), the dimensions produced by the advanced indices come first and the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3,4 dimensions have been placed after the ind, 5.
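As a rough illustration of when the reordering happens, this sketch compares three index patterns on the same array. The variable names are only for this example:

```python
import numpy as np

B = np.zeros((2, 3, 4, 5), int)
ind = [0, 1, 2, 3, 4]

# Scalar and list separated by slices: the indexed dimension goes first.
separated = B[1, :, :, ind].shape
print(separated)    # (5, 3, 4)

# Only one advanced index (the list): the dimension stays in place.
single = B[:, :, :, ind].shape
print(single)       # (2, 3, 4, 5)

# Scalar and list next to each other: the result replaces them in place.
adjacent = B[:, :, 1, ind].shape
print(adjacent)     # (2, 3, 5)
```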
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
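A small sketch of that caveat, using a second array `C` (a name introduced here) so the two cases stay separate:

```python
import numpy as np

ind = [0, 1, 2, 3, 4]
values = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Scalar first index: B[1] is a view, so the assignment reaches B.
B = np.zeros((2, 3, 4, 5), int)
B[1][:, :, ind] = values
print(B[1].sum())     # 1770, the sum of 0..59

# List first index: C[[1]] is a copy, so the assignment is lost.
C = np.zeros((2, 3, 4, 5), int)
C[[1]][:, :, :, ind] = values
print(C.sum())        # 0
```

No error is raised in the second case. The values are written into a temporary array that is then discarded.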
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It's come up in other SO questions, though not recently, for example:
"Weird result when using both slice indexing and boolean indexing on a 3d array"
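Coming back to the original assignment, another option is to leave the index expression alone and reorder A's axes to match the shape of the left-hand side. This is only a sketch: the sizes are made up, and ind is assumed to be a 1-D array of O distinct positions rather than the Ox1 vector from the question.

```python
import numpy as np

K, M, N, O, P = 2, 3, 4, 5, 7        # illustrative sizes, with P > O
A = np.arange(M * N * O).reshape(M, N, O)
ind = np.array([0, 2, 3, 5, 6])      # O distinct positions along the last axis

B = np.zeros((K, M, N, P), int)
# B[1, :, :, ind] has shape (O, M, N), so move A's last axis to the front.
B[1, :, :, ind] = np.moveaxis(A, -1, 0)

# Reference result from the two-step form.
B_ref = np.zeros((K, M, N, P), int)
B_ref[1][:, :, ind] = A

print(np.array_equal(B, B_ref))
```

Unlike the two-step form, this also works when the first index is a list or array, because it is a single `__setitem__` call on B itself.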
I am stuck. I want to read a simple csv file into a Numpy array and seem to have dug myself into a hole. I am new to Numpy and I am SURE I have messed this up somehow, as usually I can read CSV files easily in Python 3.4. I don't want to use Pandas, so I thought I would use Numpy to increase my skillset, but I really am not getting this at all. If someone could tell me whether I am on the right track using genfromtxt, or whether there is an easier way, and give me a nudge in the right direction, I would be grateful.
I want to read in the CSV file, manipulate the datetime column to the form 8/4/2014, then put it in a numpy array together with the remaining columns. Here is what I have so far, and the error which I am having trouble coding around. I can get the date part way there, but don't see how to add the date.strftime("%Y-%m-%d") to the datefunc. Also, I don't see how to format the string for SYM to get round the error. Any help would be appreciated.
the data
2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99
2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75
2015-08-04 02:14:05.415193, AIG, 0.0080808151, 0.0073296055, 0.0076213535, 12.8278962785, 11.635388035, 12.0985236788, -9.2962105215, 3.980405659, -142.8175077335, 71, 42, 33
2015-08-04 02:14:05.486185, AMZN, 0.0235649449, 0.0305828226, 0.0092703502, 37.4081902773, 48.5487257749, 14.7162247572, 29.7810062852, -69.6877219282, -334.0005615016, 2, 92, 10
the "code" sorry still learning
import numpy as np
from datetime import datetime
from datetime import date,time
datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S.%f')
a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv',delimiter = ',',
converters = {0:datefunc},dtype='object,str,float,float,float,float,float,float,float,float,float,float,float,float',
names = ["date","sym","20sd","10sd","5sd","hv20","hv10","hv5","2010hv","105hv","abshv","2010rank","105rank","absrank"])
print(a["date"])
print(a["sym"])
print(a["20sd"])
print(a["hv20"])
print(a["absrank"])
the error
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "copyright", "credits" or "license()" for more information.
>>>
============================================================================== RESTART: /home/dave/3 9 15 my slope.py ===============================================================================
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193) ...,
datetime.datetime(2016, 3, 18, 1, 0, 25, 925754)
datetime.datetime(2016, 3, 18, 1, 0, 26, 26400)
datetime.datetime(2016, 3, 18, 1, 0, 26, 114828)]
Traceback (most recent call last):
File "/home/dave/3 9 15 my slope.py", line 19, in <module>
print(a["sym"])
File "/usr/lib/python3/dist-packages/numpy/core/numeric.py", line 1615, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 454, in array2string
separator, prefix, formatter=formatter)
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 328, in _array2string
_summaryEdgeItems, summary_insert)[:-1]
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 490, in _formatArray
word = format_function(a[i]) + separator
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
So part of your text is
b'2015-08-04 02:14:05.249392 AA 0.0193103612 ...'
(I'm using b because Py3 genfromtxt opens the file as bytestrings.)
But you specify a , delimiter. I don't see any commas.
Let's just try a basic load, no fancy business.
In [97]: txt=b"""2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212 0.0249713335 30.6542480634 30.7195875454 39.640763021 0.2131498442 29.0406746589 13524.5347810182 89 57 99
2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891 0.0105915637 27.0670313481 21.8975963326 16.8135861893 -19.0986405157 -23.2172064279 21.5647072302 33 26 75
"""
In [98]: txt=txt.splitlines()
In [99]: data=np.genfromtxt(txt,dtype=None)
In [100]: data
Out[100]:
array([ (b'2015-08-04', b'02:14:05.249392', b'AA', 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99),
(b'2015-08-04', b'02:14:05.325113', b'AAPL', 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75)],
dtype=[('f0', 'S10'), ('f1', 'S15'), ('f2', 'S4'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])
The datetime information is in 2 fields:
In [101]: data[['f0','f1']]
Out[101]:
array([(b'2015-08-04', b'02:14:05.249392'),
(b'2015-08-04', b'02:14:05.325113')],
dtype=[('f0', 'S10'), ('f1', 'S15')])
Your datefunc does work with a bytestring:
In [102]: datefunc(b'2015-08-04 02:14:05.249392')
Out[102]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
But it requires 2 fields (as defined by the ' ' delimiter). So we need to figure out a way of parsing these 2 substrings as one, rather than split into two fields.
Maybe I'll try changing the sample txt to really use the , delimiter (but not between date and time) and see what works.
With the , delimited text I get:
In [117]: data=np.genfromtxt(txt,delimiter=',',dtype=None,usecols=[0,1,2,3])
In [118]: data.dtype
Out[118]: dtype([('f0', 'S26'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
In [119]: data['f0']
Out[119]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'],
dtype='|S26')
In [120]: [datefunc(d) for d in data['f0']]
Out[120]:
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113),
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193),
datetime.datetime(2015, 8, 4, 2, 14, 5, 486185)]
I used usecols because the full text has 14 fields in the 1st line, and 13 in the others.
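On newer numpy versions the same load can be done without bytestrings by passing an explicit encoding. The sketch below is a rough modern equivalent, not the session above: it assumes a numpy version whose genfromtxt accepts the encoding parameter, and it uses a two-row, four-column excerpt of the sample data held in a string instead of a file.

```python
import io
from datetime import datetime

import numpy as np

# Two rows, cut down to four columns, from the sample data in the question.
csv_text = """2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212
2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891
"""

# encoding="utf-8" yields str fields; autostrip=True drops the space after each comma.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None,
                     encoding="utf-8", autostrip=True)

# Convert the first field after loading, as in the list comprehension above.
stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in data["f0"]]
print(data["f1"])
print(stamps[0])
```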
If I specify the dtype (instead of the easy None), I can replace the strings in the 1st field with these datetime objects:
In [122]: data=np.genfromtxt(txt,delimiter=',',dtype='O,S5,f,f',usecols=[0,1,2,3])
In [123]: data
Out[123]:
array([ (b'2015-08-04 02:14:05.249392', b' AA', 0.01931036077439785, 0.019351521506905556),
(b'2015-08-04 02:14:05.325113', b' AAPL', 0.01705062761902809, 0.01379418931901455),....],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
In [124]: data['f0']
Out[124]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], dtype=object)
....
In [126]: data['f0']=[datefunc(d) for d in data['f0']]
In [127]: data
Out[127]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.01931036077439785, 0.019351521506905556),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.01705062761902809, 0.01379418931901455),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
And with the converter, your call works (more or less):
In [133]: data=np.genfromtxt(txt,dtype='object,S5,float,float',
converters = {0:datefunc},delimiter=',',usecols=[0,1,2,3])
In [134]: data
Out[134]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
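A variant of the converter call for str input is sketched here. It assumes a numpy version whose genfromtxt accepts the encoding parameter. `parse_stamp` is a name made up for this example, and the data is a two-row excerpt held in a string instead of the real file.

```python
import io
from datetime import datetime

import numpy as np

csv_text = """2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212
2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891
"""

def parse_stamp(text):
    # With an explicit encoding the converter receives str, so no .decode() is needed.
    return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S.%f")

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                     dtype="object,U5,float,float",
                     converters={0: parse_stamp},
                     encoding="utf-8", autostrip=True)

print(data["f0"][0])
print(data["f1"])
```

Using U5 instead of S5 gives str values for the symbol column, and autostrip removes the leading space seen in b' AA' above.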
The numpy datetime64 type also works with this string. These types can be used as numpy numbers.
In [154]: datefunc(b'2015-08-04 02:14:05.249392')
Out[154]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
In [155]: np.datetime64(b'2015-08-04 02:14:05.249392')
Out[155]: numpy.datetime64('2015-08-04T02:14:05.249392-0700')
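To illustrate the "used as numbers" point, here is a small sketch that casts two of the timestamp strings and subtracts them. It assumes str input rather than bytes. Note that numpy 1.11 and later treat datetime64 as timezone-naive, so the -0700 offset shown above does not appear there.

```python
import numpy as np

stamps = np.array(["2015-08-04 02:14:05.249392",
                   "2015-08-04 02:14:05.325113"])

# Cast the strings to microsecond-resolution datetime64 values.
times = stamps.astype("M8[us]")

# datetime64 values behave like numbers: differences are timedelta64.
gap = times[1] - times[0]
print(times.dtype)   # datetime64[us]
print(gap)           # 75721 microseconds
```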
From the SO question "Importing csv into Numpy datetime64" I got this to work:
In [175]: data=np.genfromtxt(txt,dtype='M8[us],S5,float,float',
delimiter=',',usecols=[0,1,2,3])
In [176]: data
Out[176]:
array([ (datetime.datetime(2015, 8, 4, 9, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 9, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', '<M8[us]'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
For datetime units, see: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units