numpy UnicodeDecodeError: am I using the right approach with genfromtxt?

I am stuck. I want to read a simple CSV file into a Numpy array and seem to have dug myself into a hole. I am new to Numpy and I am SURE I have messed this up somehow, as I can usually read CSV files easily in Python 3.4. I don't want to use Pandas, so I thought I would use Numpy to increase my skill set, but I really am not getting this at all. If someone could tell me whether I am on the right track using genfromtxt, or whether there is an easier way, and give me a nudge in the right direction, I would be grateful.
I want to read in the CSV file, manipulate the datetime column to 8/4/2014, and then put it in a numpy array together with the remaining columns. Here is what I have so far and the error I am having trouble coding around. I can get the date part of the way there, but I don't see how to add the date.strftime("%Y-%m-%d") to the datefunc. I also don't see how to format the string for SYM to get round the error. Any help would be appreciated.
the data
2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99
2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75
2015-08-04 02:14:05.415193, AIG, 0.0080808151, 0.0073296055, 0.0076213535, 12.8278962785, 11.635388035, 12.0985236788, -9.2962105215, 3.980405659, -142.8175077335, 71, 42, 33
2015-08-04 02:14:05.486185, AMZN, 0.0235649449, 0.0305828226, 0.0092703502, 37.4081902773, 48.5487257749, 14.7162247572, 29.7810062852, -69.6877219282, -334.0005615016, 2, 92, 10
the "code" sorry still learning
import numpy as np
from datetime import datetime
from datetime import date,time
datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S.%f')
a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv', delimiter=',',
                  converters={0: datefunc},
                  dtype='object,str,float,float,float,float,float,float,float,float,float,float,float,float',
                  names=["date","sym","20sd","10sd","5sd","hv20","hv10","hv5","2010hv","105hv","abshv","2010rank","105rank","absrank"])
print(a["date"])
print(a["sym"])
print(a["20sd"])
print(a["hv20"])
print(a["absrank"])
the error
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "copyright", "credits" or "license()" for more information.
>>>
============================================================================== RESTART: /home/dave/3 9 15 my slope.py ===============================================================================
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193) ...,
datetime.datetime(2016, 3, 18, 1, 0, 25, 925754)
datetime.datetime(2016, 3, 18, 1, 0, 26, 26400)
datetime.datetime(2016, 3, 18, 1, 0, 26, 114828)]
Traceback (most recent call last):
  File "/home/dave/3 9 15 my slope.py", line 19, in <module>
    print(a["sym"])
  File "/usr/lib/python3/dist-packages/numpy/core/numeric.py", line 1615, in array_str
    return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 454, in array2string
    separator, prefix, formatter=formatter)
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 328, in _array2string
    _summaryEdgeItems, summary_insert)[:-1]
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 490, in _formatArray
    word = format_function(a[i]) + separator
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

So part of your text is
b'2015-08-04 02:14:05.249392 AA 0.0193103612 ...'
(I'm using b because Py3 genfromtxt opens the file as bytestrings.)
But you specify a , delimiter. I don't see any commas.
Let's just try a basic load first, no fancy business.
In [97]: txt=b"""2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212 0.0249713335 30.6542480634 30.7195875454 39.640763021 0.2131498442 29.0406746589 13524.5347810182 89 57 99
2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891 0.0105915637 27.0670313481 21.8975963326 16.8135861893 -19.0986405157 -23.2172064279 21.5647072302 33 26 75
"""
In [98]: txt=txt.splitlines()
In [99]: data=np.genfromtxt(txt,dtype=None)
In [100]: data
Out[100]:
array([ (b'2015-08-04', b'02:14:05.249392', b'AA', 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99),
(b'2015-08-04', b'02:14:05.325113', b'AAPL', 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75)],
dtype=[('f0', 'S10'), ('f1', 'S15'), ('f2', 'S4'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])
The datetime information is in 2 fields:
In [101]: data[['f0','f1']]
Out[101]:
array([(b'2015-08-04', b'02:14:05.249392'),
(b'2015-08-04', b'02:14:05.325113')],
dtype=[('f0', 'S10'), ('f1', 'S15')])
Your datefunc does work with the full byte string:
In [102]: datefunc(b'2015-08-04 02:14:05.249392')
Out[102]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
But that string spans 2 fields (as defined by the ' ' delimiter). So we need a way of parsing these 2 substrings as one value, rather than letting them be split into two fields.
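One possibility (my addition, not part of the run above) is to simply rejoin the two byte fields after loading and feed the result to the converter:
# rejoin the date and time byte-fields, then parse with datefunc
stamps = [datefunc(row['f0'] + b' ' + row['f1']) for row in data]
# [datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
#  datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)]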
Maybe I'll try changing the sample txt to really use the , delimiter (but not between date and time) and see what works.
With the , delimited text I get:
In [117]: data=np.genfromtxt(txt,delimiter=',',dtype=None,usecols=[0,1,2,3])
In [118]: data.dtype
Out[118]: dtype([('f0', 'S26'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
In [119]: data['f0']
Out[119]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'],
dtype='|S26')
In [120]: [datefunc(d) for d in data['f0']]
Out[120]:
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113),
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193),
datetime.datetime(2015, 8, 4, 2, 14, 5, 486185)]
I used usecols because the full text has 14 fields in the 1st line, and 13 in the others.
If I specify the dtype (instead of the easy None), I can replace the strings in the 1st field with these datetime objects:
In [122]: data=np.genfromtxt(txt,delimiter=',',dtype='O,S5,f,f',usecols=[0,1,2,3])
In [123]: data
Out[123]:
array([ (b'2015-08-04 02:14:05.249392', b' AA', 0.01931036077439785, 0.019351521506905556),
(b'2015-08-04 02:14:05.325113', b' AAPL', 0.01705062761902809, 0.01379418931901455),....],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
In [124]: data['f0']
Out[124]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], dtype=object)
....
In [126]: data['f0']=[datefunc(d) for d in data['f0']]
In [127]: data
Out[127]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.01931036077439785, 0.019351521506905556),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.01705062761902809, 0.01379418931901455),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
And with the converter, your call works (more or less):
In [133]: data=np.genfromtxt(txt,dtype='object,S5,float,float',
converters = {0:datefunc},delimiter=',',usecols=[0,1,2,3])
In [134]: data
Out[134]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
The numpy datetime64 type also works with this string, and these values can be used as numpy numbers.
In [154]: datefunc(b'2015-08-04 02:14:05.249392')
Out[154]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
In [155]: np.datetime64(b'2015-08-04 02:14:05.249392')
Out[155]: numpy.datetime64('2015-08-04T02:14:05.249392-0700')
From this question, Importing csv into Numpy datetime64, I got this to work:
In [175]: data=np.genfromtxt(txt,dtype='M8[us],S5,float,float',
delimiter=',',usecols=[0,1,2,3])
In [176]: data
Out[176]:
array([ (datetime.datetime(2015, 8, 4, 9, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 9, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', '<M8[us]'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
For datetime units, see: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units
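Putting the pieces together, here is a sketch of my own for reading the whole file. The path and field names come from the question; 'M8[us]' for the timestamp, 'S8' for the symbol, and autostrip=True (to drop the space after each comma) are assumptions, not the original call:
import numpy as np

names = ["date", "sym", "20sd", "10sd", "5sd", "hv20", "hv10", "hv5",
         "2010hv", "105hv", "abshv", "2010rank", "105rank", "absrank"]
# datetime64 timestamp, fixed-width byte string for the symbol, 9 floats, 3 ints
dtypes = "M8[us],S8," + ",".join(["f8"] * 9) + ",i8,i8,i8"

a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv',
                  delimiter=',', dtype=dtypes, names=names, autostrip=True)

print(a["date"][:3])      # datetime64 values (older numpy may apply a local-time offset)
print(a["sym"][:3])       # e.g. [b'AA' b'AAPL' b'AIG']
print(a["absrank"][:3])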

Related

2 different ways to index 3D array in Numpy?

I have a 3-D array with dimensions (14,3,5), which correspond to (event, color, taste).
I want to select, for every event, the second color option and the third taste option.
Could someone tell me which is the correct format?
[:,2,3] vs [:,2][:,3]
Are they the same, or different?
If they are different, how are they different?
Do a test:
In [256]: arr = np.arange(2*3*5).reshape(2,3,5)
In [257]: arr
Out[257]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
One way:
In [258]: arr[:,2,3]
Out[258]: array([13, 28])
The other is evaluated in 2 steps:
In [259]: arr[:,2]
Out[259]:
array([[10, 11, 12, 13, 14],
[25, 26, 27, 28, 29]])
In [260]: arr[:,2][:,3]
Out[260]: array([13, 28])
The [:,3] is applied to the result of the [:,2]. Each [] is translated by the interpreter into a __getitem__() call (or a __setitem__ if followed by a =). [:,2,3] is just one call, __getitem__((slice(None),2,3)).
With scalar indices like this, they are the same.
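A quick check of that claim (my addition, not part of the original session):
# both spellings invoke the same __getitem__ with the same index tuple
print(arr[:, 2, 3])                              # [13 28]
print(arr.__getitem__((slice(None), 2, 3)))      # [13 28]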
But what if one (or both) index is a list or array?
In [261]: arr[:,[1,2],3]
Out[261]:
array([[ 8, 13],
[23, 28]])
In [262]: arr[:,[1,2]]
Out[262]:
array([[[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
In [263]: arr[:,[1,2]][:,3]
Traceback (most recent call last):
Input In [263] in <cell line: 1>
arr[:,[1,2]][:,3]
IndexError: index 3 is out of bounds for axis 1 with size 2
In [264]: arr[:,[1,2]][:,:,3]
Out[264]:
array([[ 8, 13],
[23, 28]])
At least you aren't making the common novice mistake of attempting:
In [265]: arr[:][2][3]
Traceback (most recent call last):
Input In [265] in <cell line: 1>
arr[:][2][3]
IndexError: index 2 is out of bounds for axis 0 with size 2
In the long run you need to read and understand (most of)
https://numpy.org/doc/stable/user/basics.indexing.html
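A small sketch (my addition) of the view-versus-copy distinction those docs describe, since it is what makes chained indexing unreliable for assignment:
import numpy as np

arr = np.arange(2*3*5).reshape(2, 3, 5)

v = arr[:, 2]            # basic (slice) indexing returns a view
v[0, 0] = 99
print(arr[0, 2, 0])      # 99 -- writing through the view changed arr

c = arr[:, [1, 2]]       # advanced (list) indexing returns a copy
c[0, 0, 0] = -1
print(arr[0, 1, 0])      # 5 -- arr is unchanged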

Inserting new fields(columns) to mongoDB with pandas

I have existing data in MongoDB where the primary key is set on 'date', with a few fields in it.
I want to insert a new pandas dataframe with new fields (columns) into the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A that I have in MongoDB (I set the index to the 'date' field when reading the data from MongoDB).
And this is the new dataframe B I want to insert into MongoDB.
And this is the final dataframe C, with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window') added on the 'date' index, which is what I want to end up with in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True in a loop; you can't use insert_many() for two reasons: firstly, you're not always inserting, sometimes you are updating; secondly, update_many() (and insert_many()) only work with a single filter, and in your case each filter is different, as each update relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner that you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator will override values if they are already there and set them if they aren't; upsert=True will perform an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd
# Sample data setup
db = MongoClient()['mydatabase']

data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])

data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])

# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)

# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
'date': datetime.datetime(2017, 5, 19, 21, 20),
'std_500_1000window': 96,
'std_50_100window': 8,
'std_50_2000window': 98,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
'date': datetime.datetime(2017, 5, 19, 21, 21),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
'date': datetime.datetime(2017, 5, 19, 21, 22),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
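If the row-by-row loop is slow for larger frames, one alternative (my addition, not from the original answer) is to batch the same upserts with pymongo's bulk_write and UpdateOne, sending one round trip per dataframe instead of one per row; this sketch assumes the df_a/df_b frames from the example above:
from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydatabase']

for df in [df_a, df_b]:
    # one UpdateOne per row, same filter and $set as before
    ops = [UpdateOne({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
           for _, row in df.iterrows()]
    if ops:
        db.mycollection.bulk_write(ops)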

MatPlotLib with custom dictionaries convert to graphs

Problem:
I have a list of ~108 dictionaries named list_of_dictionary and I would like to use Matplotlib to generate line graphs.
The dictionaries have the following format (this is one of 108):
{'price': [59990, 59890, 60990, 62990, 59990, 59690],
 'car': '2014 Land Rover Range Rover Sport',
 'datetime': [datetime.datetime(2020, 1, 22, 11, 19, 26),
              datetime.datetime(2020, 1, 23, 13, 12, 33),
              datetime.datetime(2020, 1, 28, 12, 39, 24),
              datetime.datetime(2020, 1, 29, 18, 39, 36),
              datetime.datetime(2020, 1, 30, 18, 41, 31),
              datetime.datetime(2020, 2, 1, 12, 39, 7)]}
Understanding the dictionary:
The car 2014 Land Rover Range Rover Sport was priced at:
59990 on datetime.datetime(2020, 1, 22, 11, 19, 26)
59890 on datetime.datetime(2020, 1, 23, 13, 12, 33)
60990 on datetime.datetime(2020, 1, 28, 12, 39, 24)
62990 on datetime.datetime(2020, 1, 29, 18, 39, 36)
59990 on datetime.datetime(2020, 1, 30, 18, 41, 31)
59690 on datetime.datetime(2020, 2, 1, 12, 39, 7)
Question:
With this structure how could one create mini-graphs with matplotlib (say 11 rows x 10 columns)?
Where each mini-graph will have:
the title of the graph from car
x-axis from the datetime
y-axis from the price
What I have tried:
df = pd.DataFrame(list_of_dictionary)
df = df.set_index('datetime')
print(df)
I don't know what to do thereafter...
Relevant Research:
Plotting a column containing lists using Pandas
Pandas column of lists, create a row for each list element
I've read these multiple times, but the more I read them, the more confused I get :(.
I don't know if it's sensible to try to fit that many plots on one figure. You'll have to make some choices to be able to fit all the axes decorations on the page (titles, axes labels, tick labels, etc.).
But the basic idea would be this:
import datetime
import matplotlib.pyplot as plt

car_data = [{'price': [59990, 59890, 60990, 62990, 59990, 59690],
             'car': '2014 Land Rover Range Rover Sport',
             'datetime': [datetime.datetime(2020, 1, 22, 11, 19, 26),
                          datetime.datetime(2020, 1, 23, 13, 12, 33),
                          datetime.datetime(2020, 1, 28, 12, 39, 24),
                          datetime.datetime(2020, 1, 29, 18, 39, 36),
                          datetime.datetime(2020, 1, 30, 18, 41, 31),
                          datetime.datetime(2020, 2, 1, 12, 39, 7)]}] * 108

fig, axs = plt.subplots(11, 10, figsize=(20, 22))  # adjust figsize as you please
for car, ax in zip(car_data, axs.flat):
    ax.plot(car["datetime"], car['price'], '-')
    ax.set_title(car['car'])
Ideally, all your axes could share the same x and y axes so you could have the labels only on the left-most and bottom-most axes. This is taken care of automatically if you add sharex=True and sharey=True to subplots():
fig, axs = plt.subplots(11,10, figsize=(20,22), sharex=True, sharey=True) # adjust figsize as you please
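Two more touches worth considering (my addition, not part of the original answer): Figure.autofmt_xdate() rotates the date tick labels so they don't collide, and the unused trailing axes (110 grid slots for ~108 cars) can be hidden:
# after the plotting loop above
for ax in axs.flatten()[len(car_data):]:   # hide the leftover empty axes
    ax.set_visible(False)

fig.autofmt_xdate()    # rotate/align the shared date tick labels
fig.tight_layout()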

How should width be set for a bar in matplotlib?

I'm using Python 2, and the following code just uses some example data; my actual data can be of varying lengths and might not be at minute intervals.
import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x_values = [datetime.datetime(2018, 11, 8, 11, 16),
datetime.datetime(2018, 11, 8, 11, 17),
datetime.datetime(2018, 11, 8, 11, 18),
datetime.datetime(2018, 11, 8, 11, 19),
datetime.datetime(2018, 11, 8, 11, 20),
datetime.datetime(2018, 11, 8, 11, 21),
datetime.datetime(2018, 11, 8, 11, 22),
datetime.datetime(2018, 11, 8, 11, 23),
datetime.datetime(2018, 11, 8, 11, 24),
datetime.datetime(2018, 11, 8, 11, 25),
datetime.datetime(2018, 11, 8, 11, 26),
datetime.datetime(2018, 11, 8, 11, 27),
datetime.datetime(2018, 11, 8, 11, 28),
datetime.datetime(2018, 11, 8, 11, 29),
datetime.datetime(2018, 11, 8, 11, 30),
datetime.datetime(2018, 11, 8, 11, 31)]
y_values = [1392.1017964071857,
1392.2814371257484,
1392.37125748503,
1227.6802721088436,
1083.1,
1317.0461538461539,
1393.059880239521,
1393.4011976047905,
1393.491017964072,
1393.8502994011976,
1318.3461538461538,
1229.4965986394557,
1394.2095808383233,
1394.3892215568862,
1394.6586826347304,
1394.688622754491]
rects1 = ax.bar(x_values, y_values)
fig.tight_layout()
plt.show()
How am I supposed to set the width of the bars automatically? As it is I get the following:
If I set the width to 0.0006 then it looks good for the example data:
From this I've worked out that matplotlib is measuring the x axis in days (since 0.0007 days is almost exactly 1 minute, which matches my time intervals, and 0.0006 leaves gaps between the bars), but that's no good if I get hourly values, or seconds, or weeks, etc. Surely there's an option for handling this automatically?
If you want the bar width to be no larger than the difference between any successive datetimes, you can calculate that number and supply it to the bar's width argument.
import matplotlib.dates as mdates
width = np.min(np.diff(mdates.date2num(x_values)))
ax.bar(x_values, y_values, width=width, ec="k")
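As a follow-up sketch (my addition): the date axis really is measured in days, so one minute is 1/(24*60) ≈ 0.000694 days, which is why 0.0006 looked about right. Scaling the computed minimum gap leaves a visible space between bars whatever the data's spacing:
import numpy as np
import matplotlib.dates as mdates

gaps = np.diff(mdates.date2num(x_values))   # gaps between samples, in days
width = 0.9 * gaps.min()                    # 90% of the smallest gap
rects1 = ax.bar(x_values, y_values, width=width, ec="k")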

Numpy array changes shape when accessing with indices

I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
is of dimension Ox1xMxN, and therefore I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug, that when slices are mixed in the middle of advanced indexing, the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3 and 4 dimensions have been placed after the ind dimension, 5.
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
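Alternatively (my addition, illustrated with the M=3, N=4, O=5 shapes used above), you can keep the one-step index and instead reorder A so that it matches the (O, M, N) shape that B[1,:,:,ind] produces; assignment through advanced indexing works fine, it is only the chained form that can hit an intermediate copy:
import numpy as np

M, N, O = 3, 4, 5
B = np.zeros((2, M, N, O + 2), int)          # P > O
A = np.arange(M * N * O).reshape(M, N, O)
ind = [0, 1, 2, 3, 4]

# B[1,:,:,ind] has shape (O, M, N), so move A's last axis to the front
B[1, :, :, ind] = A.transpose(2, 0, 1)

print(np.array_equal(B[1][:, :, ind], A))    # True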
It's come up in other SO questions, though not recently.
weird result when using both slice indexing and boolean indexing on a 3d array