Inserting new fields (columns) into MongoDB with pandas

I have existing data in MongoDB where the primary key is set on 'date', with a few fields in it.
I want to insert a new pandas dataframe with new fields (columns) into the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A, which I have in MongoDB (I set the index to the 'date' field when calling the data from MongoDB).
And this is the new dataframe B I want to insert into MongoDB.
And this is the final dataframe C, with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window' added on the 'date' index), which I want to have in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)

The method you need is update_one() with upsert=True in a loop. You can't use insert_many() for two reasons: firstly, you're not always inserting; sometimes you are updating. Secondly, update_many() (and insert_many()) only work with a single filter, and in your case each filter is different, as each update relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner that you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator will override values if they are already there and set them if they are not. upsert=True will perform an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd

# Sample data setup
db = MongoClient()['mydatabase']
data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])
data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])

# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)

# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
 'date': datetime.datetime(2017, 5, 19, 21, 20),
 'std_500_1000window': 96,
 'std_50_100window': 8,
 'std_50_2000window': 98,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
 'date': datetime.datetime(2017, 5, 19, 21, 21),
 'std_500_1000window': 95,
 'std_50_100window': 8,
 'std_50_2000window': 97,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
 'date': datetime.datetime(2017, 5, 19, 21, 22),
 'std_500_1000window': 95,
 'std_50_100window': 8,
 'std_50_2000window': 97,
 'std_50_3000window': 98,
 'std_50_300window': 9,
 'std_50_500window': 10}
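Each update_one() call is a separate round trip to the server, which can be slow for large dataframes. A sketch of the same upserts batched with pymongo's bulk_write (assuming the df_a, df_b and collection from the example above):

from pymongo import UpdateOne

# build one UpdateOne per row and send each dataframe's batch in a single call
for df in [df_a, df_b]:
    requests = [UpdateOne({'date': row['date']}, {'$set': row.to_dict()}, upsert=True)
                for _, row in df.iterrows()]
    db.mycollection.bulk_write(requests)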

Related

Empty dataframe on merging two dataframes

import pandas as pd
df1 = pd.DataFrame({'HPI': [10, 20, 30, 40, 50], 'INT': [1, 2, 3, 4, 5], 'IND': [50, 60, 70, 80, 90]},
                   index=[2001, 2002, 2003, 2004, 2005])
df2 = pd.DataFrame({'HPI': [11, 22, 33, 44, 55], 'INT': [6, 7, 8, 9, 0], 'IND': [51, 62, 73, 84, 95]},
                   index=[2006, 2007, 2008, 2009, 2010])
merge = pd.merge(df1, df2, on=['HPI', 'INT', 'IND'])
print(merge)
The output of the code is:
Empty DataFrame
Columns: [HPI, INT, IND]
Index: []
pd.merge performs an inner join on the columns given in on, and no row of df1 matches any row of df2 on all of 'HPI', 'INT' and 'IND', so the result is empty. You might be looking for concatenate, as BERA pointed out:
concatenated = pd.concat([df1, df2])
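Since the year indices of df1 and df2 don't overlap, this simply stacks the rows of df2 below those of df1 (output shown for reference):
print(concatenated)
      HPI  INT  IND
2001   10    1   50
2002   20    2   60
2003   30    3   70
2004   40    4   80
2005   50    5   90
2006   11    6   51
2007   22    7   62
2008   33    8   73
2009   44    9   84
2010   55    0   95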

pandas compare two data frames and highlight the differences

I'm trying to compare 2 dataframes and highlight the differences in the second one.
I have tried using concat and drop_duplicates, but I am not sure how to check the specific cells, or how to highlight them at the end.
A possible solution is the following:
import pandas as pd
import numpy as np

# set test data
data1 = {"A": [10, 11, 23, 44], "B": [22, 23, 56, 55], "C": [31, 21, 34, 66], "D": [25, 45, 21, 45]}
data2 = {"A": [10, 11, 23, 44, 56, 23], "B": [44, 223, 56, 55, 73, 56], "C": [31, 21, 45, 66, 22, 22], "D": [25, 45, 26, 45, 34, 12]}

# create dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# define function to highlight differences in dataframes
def highlight_diff(data, other, color='yellow'):
    attr = 'background-color: {}'.format(color)
    return pd.DataFrame(np.where(data.ne(other), attr, ''),
                        index=data.index, columns=data.columns)

# apply style using function
df2.style.apply(highlight_diff, axis=None, other=df1)
This returns df2 styled with the cells that differ from df1 highlighted in yellow.
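df2.style produces a Styler, which renders inline in a Jupyter notebook. To keep the highlighting outside a notebook you can export it; a sketch (the file names are just examples, and to_html requires pandas >= 1.3; on older versions use Styler.render() instead):

styled = df2.style.apply(highlight_diff, axis=None, other=df1)

# export the highlighted cells to an Excel file (requires openpyxl)...
styled.to_excel('diff.xlsx', engine='openpyxl')

# ...or to a standalone HTML file
with open('diff.html', 'w') as f:
    f.write(styled.to_html())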

MatPlotLib with custom dictionaries convert to graphs

Problem:
I have a list of ~108 dictionaries named list_of_dictionary and I would like to use Matplotlib to generate line graphs.
The dictionaries have the following format (this is one of 108):
{'price': [59990,
           59890,
           60990,
           62990,
           59990,
           59690],
 'car': '2014 Land Rover Range Rover Sport',
 'datetime': [datetime.datetime(2020, 1, 22, 11, 19, 26),
              datetime.datetime(2020, 1, 23, 13, 12, 33),
              datetime.datetime(2020, 1, 28, 12, 39, 24),
              datetime.datetime(2020, 1, 29, 18, 39, 36),
              datetime.datetime(2020, 1, 30, 18, 41, 31),
              datetime.datetime(2020, 2, 1, 12, 39, 7)]}
Understanding the dictionary:
The car 2014 Land Rover Range Rover Sport was priced at:
59990 on datetime.datetime(2020, 1, 22, 11, 19, 26)
59890 on datetime.datetime(2020, 1, 23, 13, 12, 33)
60990 on datetime.datetime(2020, 1, 28, 12, 39, 24)
62990 on datetime.datetime(2020, 1, 29, 18, 39, 36)
59990 on datetime.datetime(2020, 1, 30, 18, 41, 31)
59690 on datetime.datetime(2020, 2, 1, 12, 39, 7)
Question:
With this structure, how could one create mini-graphs with matplotlib (say, 11 rows x 10 columns)?
Where each mini-graph will have:
the title of the graph from car
the x-axis from the datetime
the y-axis from the price
What I have tried:
df = pd.DataFrame(list_of_dictionary)
df = df.set_index('datetime')
print(df)
I don't know what to do thereafter...
Relevant Research:
Plotting a column containing lists using Pandas
Pandas column of lists, create a row for each list element
I've read these multiple times, but the more I read them, the more confused I get :(.
I don't know if it's sensible to try and plot that many plots in one figure. You'll have to make some choices to be able to fit all the axes decorations on the page (titles, axis labels, tick labels, etc.).
But the basic idea would be this:
import datetime
import matplotlib.pyplot as plt

car_data = [{'price': [59990,
                       59890,
                       60990,
                       62990,
                       59990,
                       59690],
             'car': '2014 Land Rover Range Rover Sport',
             'datetime': [datetime.datetime(2020, 1, 22, 11, 19, 26),
                          datetime.datetime(2020, 1, 23, 13, 12, 33),
                          datetime.datetime(2020, 1, 28, 12, 39, 24),
                          datetime.datetime(2020, 1, 29, 18, 39, 36),
                          datetime.datetime(2020, 1, 30, 18, 41, 31),
                          datetime.datetime(2020, 2, 1, 12, 39, 7)]}] * 108

fig, axs = plt.subplots(11, 10, figsize=(20, 22))  # adjust figsize as you please
for car, ax in zip(car_data, axs.flat):
    ax.plot(car["datetime"], car['price'], '-')
    ax.set_title(car['car'])
Ideally, all your axes could share the same x and y axes so you could have the labels only on the left-most and bottom-most axes. This is taken care of automatically if you add sharex=True and sharey=True to subplots():
fig, axs = plt.subplots(11,10, figsize=(20,22), sharex=True, sharey=True) # adjust figsize as you please
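Since 108 plots don't quite fill an 11x10 grid, and datetime tick labels tend to collide at this density, a short follow-up sketch to tidy the figure up:

# hide the two trailing axes that received no data
for ax in axs.flat[len(car_data):]:
    ax.set_visible(False)

# rotate the date tick labels so they stay readable, then render
fig.autofmt_xdate()
plt.show()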

Numpy array changes shape when accessing with indices

I have a small matrix A with dimensions MxNxO.
I have a large matrix B with dimensions KxMxNxP, with P>O.
I have a vector ind of indices with dimensions Ox1.
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
is of dimension Ox1xMxN, and therefore I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug, that when slices are mixed in the middle of advanced indexing, the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The (3, 4) dimensions have been placed after the ind dimension (5).
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It's come up in other SO questions, though not recently.
weird result when using both slice indexing and boolean indexing on a 3d array
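If you'd rather keep the original one-line assignment, you can instead reorder A's axes to match the (O, M, N) shape that the mixed indexing produces. A sketch with small concrete dimensions (the sizes here are just examples):

import numpy as np

M, N, O, K, P = 3, 4, 5, 2, 7
A = np.arange(M * N * O).reshape(M, N, O)
B = np.zeros((K, M, N, P), int)
ind = [0, 1, 2, 3, 4]

# B[1, :, :, ind] selects with shape (O, M, N), so move A's last axis to the front
B[1, :, :, ind] = np.moveaxis(A, -1, 0)
assert (B[1][:, :, ind] == A).all()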

Convert numpy array of Datetime objects to UTC strings

I have a large array of datetime objects in a numpy array. I am trying to export them as a JSON object attribute, and I need them to be represented as UTC strings.
Here is my array (a small chunk of it):
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=<UTC>)]
json = {
    'datetimes': []
}
I know I can iterate over the list and convert them, but I was hoping there was an efficient pandas or numpy technique for this.
I think you can create a DataFrame, convert it to ISO format and save it to a dict, because DataFrame.to_json with orient='list' is not implemented yet:
import datetime
import json
import pandas as pd

datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]
df = pd.DataFrame({'datetimes': datetimes})

# native conversion to ISO, but it does not support lists yet
print(df.to_json(date_format='iso'))
{"datetimes":{"0":"2015-07-12T18:33:14.000Z",
              "1":"2015-07-12T18:33:32.000Z",
              "2":"2015-07-12T18:33:50.000Z"}}

df = pd.DataFrame({'datetimes': datetimes})
df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
print(json.dumps(df.to_dict(orient='l')))
{"datetimes": ["2015-07-12T18:33:14+00:00",
               "2015-07-12T18:33:32+00:00",
               "2015-07-12T18:33:50+00:00"]}

print(json.dumps({'datetimes': [x.isoformat() for x in datetimes]}))
{"datetimes": ["2015-07-12T18:33:14+00:00",
               "2015-07-12T18:33:32+00:00",
               "2015-07-12T18:33:50+00:00"]}
I tested it some more, and the list comprehension with isoformat is fastest:
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)] * 10000

In [116]: %%timeit
     ...: df = pd.DataFrame({'datetimes': datetimes})
     ...: df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
     ...: json.dumps(df.to_dict(orient='l'))
     ...:
1 loop, best of 3: 552 ms per loop

# wrong output format: dictionaries, not lists
In [117]: %%timeit
     ...: df = pd.DataFrame({'datetimes': datetimes})
     ...: df.to_json(date_format='iso')
     ...:
10 loops, best of 3: 104 ms per loop

In [118]: %%timeit
     ...: json.dumps({'datetimes': [x.isoformat() for x in datetimes]})
     ...:
10 loops, best of 3: 67.5 ms per loop
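If the data already lives in pandas, DatetimeIndex.strftime is another vectorized option; note that the %z directive yields '+0000' rather than isoformat()'s '+00:00', so the strings are not byte-for-byte identical. A sketch:

import datetime
import json
import pandas as pd

datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc)]

# vectorized formatting of the whole index at once
idx = pd.DatetimeIndex(datetimes)
print(json.dumps({'datetimes': list(idx.strftime('%Y-%m-%dT%H:%M:%S%z'))}))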