I figured I'd share this, since I searched for it on SO and couldn't quite find what I needed.
I wanted to dump a pd.DataFrame into a YAML file. Timestamps should be shown nicely, not as the default:
date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
- 1589241600000000000
- null
- null
Also, the output should be valid YAML, i.e., it should be readable back by yaml.load. It should also be reasonably concise, preferring the 'flow' format.
As an example, here is some data:
import pandas as pd

df = pd.DataFrame([
    dict(
        date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
        x=0,
        b='foo',
        c=[1, 2, 3, 4],
        other_t=pd.Timestamp.now(),
    ),
    dict(
        date=pd.Timestamp.now().normalize(),
        x=1,
        b='bar',
        c=list(range(32)),
        other_t=pd.Timestamp.now(),
    ),
]).set_index('date')
Here is what I came up with. It customizes the Dumper to handle Timestamp. The output is more legible and still valid YAML. Upon loading, yaml recognizes the string as a valid datetime (ISO format, I believe) and re-creates it as a datetime. We can therefore read the text back into a DataFrame, where those datetimes are automatically converted to Timestamp. After resetting the index, the new df is identical to the original.
import yaml
from yaml import CDumper  # the fast LibYAML-based dumper; use yaml.Dumper if LibYAML is unavailable
from yaml.representer import SafeRepresenter
import datetime

class TSDumper(CDumper):
    pass

def timestamp_representer(dumper, data):
    # represent a pd.Timestamp as a plain datetime, which SafeRepresenter
    # serializes in ISO format
    return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())

TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
TSDumper.add_representer(pd.Timestamp, timestamp_representer)
With this in place, we can now do:
text = yaml.dump(
    df.reset_index().to_dict(orient='records'),
    sort_keys=False, width=72, indent=4,
    default_flow_style=None, Dumper=TSDumper,
)
print(text)
The output is relatively clean:
-   date: 2020-05-12 00:00:00
    x: 0
    b: foo
    c: [1, 2, 3, 4]
    other_t: 2020-05-13 02:30:23.422589
-   date: 2020-05-13 00:00:00
    x: 1
    b: bar
    c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
    other_t: 2020-05-13 02:30:23.422613
Now, we can load this back:
df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')
And (drum roll, please):
df2.equals(df)
# True
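Since the goal was a YAML file, here is a minimal sketch of the same dump going straight to disk and back; the filename df.yaml is just for illustration:

with open('df.yaml', 'w') as f:
    yaml.dump(
        df.reset_index().to_dict(orient='records'),
        f,
        sort_keys=False, width=72, indent=4,
        default_flow_style=None, Dumper=TSDumper,
    )

with open('df.yaml') as f:
    df2 = pd.DataFrame(yaml.safe_load(f)).set_index('date')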
Related
I have existing data in MongoDB where the primary key is set on 'date', with a few fields in it.
I want to insert a new pandas dataframe with new fields (columns) into the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A that I have in MongoDB (I set the index to the 'date' field when reading the data from MongoDB).
And this is the new dataframe B I want to insert into MongoDB.
And this is the final dataframe C, with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window') added on the 'date' index, which I want to end up with in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True in a loop. You can't use insert_many() for two reasons: first, you're not always inserting; sometimes you're updating. Second, update_many() (and insert_many()) only work with a single filter, and in your case each filter is different, as each update relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the way that you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator overrides values that are already present and sets them if they are not. upsert=True performs an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd

# Sample data setup
db = MongoClient()['mydatabase']

data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])

data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])

# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)

# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
 'date': datetime.datetime(2017, 5, 19, 21, 20),
 'std_500_1000window': 96,
 'std_50_100window': 8,
 'std_50_2000window': 98,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
 'date': datetime.datetime(2017, 5, 19, 21, 21),
 'std_500_1000window': 95,
 'std_50_100window': 8,
 'std_50_2000window': 97,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
 'date': datetime.datetime(2017, 5, 19, 21, 22),
 'std_500_1000window': 95,
 'std_50_100window': 8,
 'std_50_2000window': 97,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
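One caveat worth noting: update_one filters on 'date' with a collection scan unless that field is indexed. A one-line sketch, assuming you want 'date' to behave like the primary key described in the question:

db.mycollection.create_index('date', unique=True)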
I have a large array of datetime objects in a numpy array. I am trying to export them as a JSON object attribute, and they need to be represented as UTC strings.
Here is my array (a small chunk of it):

datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=<UTC>),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=<UTC>),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=<UTC>)]
json = {
    'datetimes': []
}
I know I can iterate over the list and convert them, but I was hoping there was an efficient pandas or numpy technique for this.
I think you can create a DataFrame, convert to ISO format, and save to a dict, because DataFrame.to_json with orient='list' is not implemented yet:
import datetime
import json
import pandas as pd

datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]

df = pd.DataFrame({'datetimes': datetimes})

# native conversion to ISO, but it does not support lists yet
print(df.to_json(date_format='iso'))
{"datetimes":{"0":"2015-07-12T18:33:14.000Z",
"1":"2015-07-12T18:33:32.000Z",
"2":"2015-07-12T18:33:50.000Z"}}
df = pd.DataFrame({'datetimes': datetimes})
df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
print(json.dumps(df.to_dict(orient='l')))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
print(json.dumps({'datetimes': [x.isoformat() for x in datetimes]}))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
I tested it some more, and the list comprehension with isoformat is fastest:
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]*10000

In [116]: %%timeit
     ...: df = pd.DataFrame({'datetimes': datetimes})
     ...: df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
     ...: json.dumps(df.to_dict(orient='l'))
     ...:
1 loop, best of 3: 552 ms per loop

# wrong output format: dictionaries, not lists
In [117]: %%timeit
     ...: df = pd.DataFrame({'datetimes': datetimes})
     ...: df.to_json(date_format='iso')
     ...:
10 loops, best of 3: 104 ms per loop

In [118]: %%timeit
     ...: json.dumps({'datetimes': [x.isoformat() for x in datetimes]})
     ...:
10 loops, best of 3: 67.5 ms per loop
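If you want to stay within pandas, another option (not benchmarked above, and whether it is faster depends on the pandas version) is the vectorized .dt.strftime accessor; note that %z renders the offset as +0000 rather than +00:00:

df = pd.DataFrame({'datetimes': datetimes})
# string-format the whole column at once; offset appears as +0000
iso = df['datetimes'].dt.strftime('%Y-%m-%dT%H:%M:%S%z').tolist()
print(json.dumps({'datetimes': iso}))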
I have a frequency analysis of words said in episodes of my favorite show. I'm plotting it with plot.barh(s1e1_y, s1e1_x), but it's sorting by words instead of values.
The output of s1e1_y is

['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come', 'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need', 'world', "what's"]

and the output of s1e1_x is

[42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]
When the data is actually plotted, the graph's y-axis ticks are sorted alphabetically, even though the plotting list is unsorted...
s1e1_wordlist = []
s1e1_count = []
for word, count in s1e01:
    if word[:-1] not in excluded_words:
        s1e1_wordlist.append(word[:-1])
        s1e1_count.append(int(count))

s1e1_sorted = sorted(list(sorted(zip(s1e1_count, s1e1_wordlist))),
                     reverse=True)

s1e1_20 = []
for i in range(0, 20):
    s1e1_20.append(s1e1_sorted[i])

s1e1_x = []
s1e1_y = []
for count, word in s1e1_20:
    s1e1_x.append(word)
    s1e1_y.append(count)

plot.figure(1, figsize=(20, 20))
plot.subplot(341)
plot.title('Season1 : Episode 1')
plot.tick_params(axis='y', labelsize=8)
plot.barh(s1e1_x, s1e1_y)
From matplotlib 2.1 on, you can plot categorical variables, which allows plotting plt.bar(["apple", "cherry", "banana"], [1, 2, 3]). However, in matplotlib 2.1 the output is sorted by category, hence alphabetically. This was considered a bug and was changed in matplotlib 2.2 (see this PR), so from 2.2 on the bar plot preserves the order of the input.
In matplotlib 2.1, you would instead plot the data as numeric data, as in any version prior to 2.1: plot the numbers against their index and set the tick labels accordingly.
import matplotlib.pyplot as plt

w = ['know', 'go', 'now', 'here', 'gonna', 'can', 'them', 'think', 'come',
     'time', 'got', 'elliot', 'talk', 'out', 'night', 'been', 'then', 'need',
     'world', "what's"]
n = [42, 30, 26, 25, 24, 22, 20, 19, 19, 18, 18, 18, 17, 17, 15, 15, 14, 14, 13, 13]

plt.barh(range(len(w)), n)
plt.yticks(range(len(w)), w)
plt.show()
OK, you seem to have a lot of spurious code in your example that isn't relevant to the problem as you've described it. Assuming you don't want the y-axis sorted alphabetically, you can zip your two lists into a DataFrame and then plot the DataFrame, as follows:
import pandas as pd

df = pd.DataFrame(list(zip(s1e1_y, s1e1_x))).set_index(1)
df.plot.barh()
This then produces a horizontal bar chart with the bars in the original, unsorted order.
In the LeNet tutorial [http://deeplearning.net/tutorial/lenet.html#lenet] it says:
# This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
When I use the flatten function on a numpy 3D array, I get a 1D array, but here it says I get a matrix. How does flatten(2) work in Theano?
A similar example in numpy produces a 1D array:
a = array([[[ 1,  2,  3],
            [ 4,  5,  6],
            [ 7,  8,  9]],

           [[10, 11, 12],
            [13, 14, 15],
            [16, 17, 18]],

           [[19, 20, 21],
            [22, 23, 24],
            [25, 26, 27]]])

a.flatten(2) = array([ 1, 10, 19,  4, 13, 22,  7, 16, 25,  2, 11, 20,  5, 14, 23,  8, 17,
                      26,  3, 12, 21,  6, 15, 24,  9, 18, 27])
numpy doesn't support flattening only some dimensions, but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error, but only because the 2 is passed as the order parameter; older numpy versions treated a truthy non-string order as Fortran order, which is why the output above is column-major rather than a partial flattening.
Theano's flatten does support specifying the number of output dimensions (outdim). The documentation explains how it works:
Parameters:
    x (any TensorVariable (or compatible)) – variable to be flattened
    outdim (int) – the number of dimensions in the returned variable
Return type:
    variable with same dtype as x and outdim dimensions
Returns:
    variable with the same shape as x in the leading outdim-1 dimensions, but with all remaining dimensions of x collapsed into the last dimension.

For example, if we flatten a tensor of shape (2, 3, 4, 5) with flatten(x, outdim=2), then we'll have the same (2-1=1) leading dimensions (2,), and the remaining dimensions are collapsed. So the output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt

def compile():
    x = tt.tensor3()
    return theano.function([x], x.flatten(2))

def main():
    a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
    f = compile()
    print a.shape, f(a).shape

main()
prints
(2L, 3L, 4L) (2L, 12L)
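For comparison, plain numpy has no outdim-style flatten, but the same collapse can be expressed with reshape; a minimal sketch:

import numpy

a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
# keep the leading dimension, collapse the rest into one
b = a.reshape((a.shape[0], -1))
print(b.shape)  # (2, 12)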
The following script computes the R-squared value between two numpy arrays (x and y).
The R-squared value is very low due to outliers in the data. How can I extract the indices of those outliers?
import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats

x = np.random.random_integers(1, 50, 50)
y = np.random.random_integers(1, 50, 50)

# note: linregress returns (slope, intercept, rvalue, pvalue, stderr);
# index 3 is the p-value, while the r value is at index 2
r2 = stats.linregress(x, y)[3]**2
print r2

plt.scatter(x, y)
plt.show()
An outlier is defined as a point where |value - mean| > 2 * standard deviation.
You can do this with the line
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
What it does: it builds a list of those indices of x where the element at that index satisfies the condition described above.
A quick test:
x = np.random.random_integers(1,50,50)
This gives me the array:
array([16,  6, 13, 18, 21, 37, 31,  8,  1, 48,  4, 40,  9, 14,  6, 45, 20,
       15, 14, 32, 30,  8, 19,  8, 34, 22, 49,  5, 22, 23, 39, 29, 37, 24,
       45, 47, 21,  5,  4, 27, 48,  2, 22,  8, 12,  8, 49, 12, 15, 18])
Now I add some outliers manually as there are none initially:
x[4] = 200
x[15] = 178
Let's test:
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
result:
[4, 15]
Is this what you were looking for?
EDIT:
I added the abs() function to the line above, because things might end badly when you are working with negative numbers. The abs() function takes the absolute value.
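As an aside, a vectorized equivalent of that list comprehension in plain numpy (same condition, returning an index array rather than a list):

import numpy as np

# np.where returns a tuple of index arrays; take the first element
outlier_indices = np.where(np.abs(x - np.mean(x)) > 2 * np.std(x))[0]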
I think Sander's approach is the correct one, but if you must see R2 without those outliers before making a decision, here is a way to do it.
Setup data and introduce outlier:
In [1]:
import numpy as np, scipy.stats as stats
np.random.seed(123)
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
y[5] = 100
Calculate R2 taking out one y value at a time (along with matching x value):
# each column of the identity matrix is a one-hot vector whose argmax is the
# index of the observation to leave out on that pass
m = np.eye(y.shape[0])
r2 = np.apply_along_axis(lambda a: stats.linregress(np.delete(x, a.argmax()), np.delete(y, a.argmax()))[3]**2, 0, m)
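If the identity-matrix trick reads as opaque, the same leave-one-out computation can be written as a plain list comprehension (equivalent, just more explicit):

r2 = np.array([stats.linregress(np.delete(x, i), np.delete(y, i))[3]**2
               for i in range(y.shape[0])])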
Get index of the biggest outlier:
r2.argmax()
Out[1]:
5
Get R2 when this outlier is taken out:
In [2]:
r2[r2.argmax()]
Out[2]:
0.85892084723588935
Get the value of the outlier:
In [3]:
y[r2.argmax()]
Out[3]:
100
To get the top n outliers:
In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]
Out[4]:
array([ 5, 27, 34, 0, 17], dtype=int64)
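To actually act on those indices, a minimal closing sketch (k, the number of points to drop, is an illustrative choice), reusing the same [3]**2 expression as above:

k = 1
mask = np.ones(y.shape[0], dtype=bool)
mask[sorted_index[:k]] = False   # drop the k most influential points
r2_clean = stats.linregress(x[mask], y[mask])[3]**2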