import pandas as pd
df1 = pd.DataFrame({'HPI': [10, 20, 30, 40, 50],
                    'INT': [1, 2, 3, 4, 5],
                    'IND': [50, 60, 70, 80, 90]},
                   index=[2001, 2002, 2003, 2004, 2005])
df2 = pd.DataFrame({'HPI': [11, 22, 33, 44, 55],
                    'INT': [6, 7, 8, 9, 0],
                    'IND': [51, 62, 73, 84, 95]},
                   index=[2006, 2007, 2008, 2009, 2010])
merge = pd.merge(df1, df2, on=['HPI', 'INT', 'IND'])
print(merge)
The output of this code is:
Empty DataFrame
Columns: [HPI, INT, IND]
Index: []
The merge is empty because pd.merge keeps only rows whose join-key values appear in both frames, and df1 and df2 share no values in HPI, INT, or IND. You might be looking for concatenation, as BERA pointed out:
concatenated = pd.concat([df1, df2])
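Printing concatenated shows both frames stacked along the index (expected output from running the snippet above):
print(concatenated)
      HPI  INT  IND
2001   10    1   50
2002   20    2   60
2003   30    3   70
2004   40    4   80
2005   50    5   90
2006   11    6   51
2007   22    7   62
2008   33    8   73
2009   44    9   84
2010   55    0   95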
import numpy as np
data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
                 [2, 7, 8, 9, 10, 11],
                 [3, 12, 13, 14, 15, 16],
                 [4, 3, 4, 5, 6, 7, 10, 12]], dtype=object)
target = data[:,0]
Running this raises the following error:
IndexError                                Traceback (most recent call last)
Input In [82], in <cell line: 9>()
      data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
                       [2, 7, 8, 9, 10, 11],
                       [3, 12, 13, 14, 15, 16],
                       [4, 3, 4, 5, 6, 7, 10, 12]], dtype=object)
      # Define the target data
----> 9 target = data[:,0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
How can I fix this without changing the elements of the data? When I made all the rows the same length, the error went away, but my real data has rows of variable size.
You have a 1-D array of list objects, so you can't index on axis=1 because there is none (data.shape -> (4,)).
Use a list comprehension:
out = np.array([a[0] for a in data])
Output: array([10, 2, 3, 4])
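If you really need a regular 2-D array, one alternative (my addition, not part of the original answer) is to pad the ragged rows with NaN so ordinary 2-D indexing works again:
# Hedged sketch: pad ragged rows with NaN to build a regular float matrix.
rows = [[10, 20, 30, 40, 50, 60, 70, 80, 90],
        [2, 7, 8, 9, 10, 11],
        [3, 12, 13, 14, 15, 16],
        [4, 3, 4, 5, 6, 7, 10, 12]]
width = max(len(r) for r in rows)
padded = np.full((len(rows), width), np.nan)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r
target = padded[:, 0]   # array([10.,  2.,  3.,  4.])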
I'm trying to compare two dataframes and highlight, in the second one, the cells that differ.
I have tried using concat and drop_duplicates, but I am not sure how to check the specific cells, or how to highlight them at the end.
A possible solution is the following:
import pandas as pd
import numpy as np
# set test data
data1 = {"A": [10, 11, 23, 44], "B": [22, 23, 56, 55], "C": [31, 21, 34, 66], "D": [25, 45, 21, 45]}
data2 = {"A": [10, 11, 23, 44, 56, 23], "B": [44, 223, 56, 55, 73, 56], "C": [31, 21, 45, 66, 22, 22], "D": [25, 45, 26, 45, 34, 12]}
# create dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# define function to highlight differences in dataframes
def highlight_diff(data, other, color='yellow'):
    attr = 'background-color: {}'.format(color)
    return pd.DataFrame(np.where(data.ne(other), attr, ''),
                        index=data.index, columns=data.columns)
# apply style using function
df2.style.apply(highlight_diff, axis=None, other=df1)
Returns df2 rendered with the cells that differ from df1 highlighted in yellow (shown as a styled table in a notebook).
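If you need the highlighted view outside a notebook, the Styler can also be exported; a minimal sketch, assuming a pandas recent enough to have Styler.to_html (1.3+):
# Hedged sketch: write the styled comparison to a file instead of displaying it inline.
styled = df2.style.apply(highlight_diff, axis=None, other=df1)
with open('diff.html', 'w') as f:
    f.write(styled.to_html())
# styled.to_excel('diff.xlsx')  # alternative: keep the highlighting in an Excel file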
I have an existing data in MongoDB where Primary Key is set on 'date' with a few fields in it.
And I want to insert a new pandas dataframe with new fields (columns) into the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is the dataframe A I have in MongoDB (I set the index to the 'date' field when reading the data from MongoDB).
And this is the new dataframe B I want to insert into MongoDB.
And this is the final dataframe C, with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window') added on the 'date' index, which is what I want to end up with in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True in a loop. You can't use insert_many() for two reasons: firstly, you're not always inserting, sometimes you're updating; secondly, update_many() (and insert_many()) only work with a single filter, and in your case each update has a different filter because each one relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the row. The $set operator overrides values that are already there and sets them if they are missing. upsert=True performs an insert if there is no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd
# Sample data setup
db = MongoClient()['mydatabase']
data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])
data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])
# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
# Print the results
for record in db.mycollection.find():
pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
'date': datetime.datetime(2017, 5, 19, 21, 20),
'std_500_1000window': 96,
'std_50_100window': 8,
'std_50_2000window': 98,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
'date': datetime.datetime(2017, 5, 19, 21, 21),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
'date': datetime.datetime(2017, 5, 19, 21, 22),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
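If the dataframes are large, one round trip per row can be slow. A hedged sketch (not part of the original answer) that batches the same upserts with PyMongo's bulk_write and UpdateOne:
from pymongo import UpdateOne
# Batch the per-row upserts into one bulk_write call per dataframe.
for df in [df_a, df_b]:
    ops = [UpdateOne({'date': row['date']}, {'$set': row.to_dict()}, upsert=True)
           for _, row in df.iterrows()]
    if ops:
        db.mycollection.bulk_write(ops)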
I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
is of dimension Ox1xMxN, and therefore I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug, that when slices are mixed in the middle of advanced indexing, the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The sliced dimensions (3, 4) have been placed after the ind dimension (5).
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It's come up in other SO questions, though not recently.
weird result when using both slice indexing and boolean indexing on a 3d array
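Another option (my addition, not from the original answer) is to keep the single-statement assignment and reorder A's axes so they match the (O, M, N) shape that the mixed indexing produces:
# Hedged sketch: move A's last axis (O) to the front so it lines up with
# the shape of B[1,:,:,ind] under mixed basic/advanced indexing.
A = np.arange(3*4*5).reshape(3, 4, 5)    # M=3, N=4, O=5
B = np.zeros((2, 3, 4, 5), int)
ind = [0, 1, 2, 3, 4]
B[1, :, :, ind] = np.moveaxis(A, -1, 0)  # RHS shape (5, 3, 4) matches the LHS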
I have a large array of datetime objects in a numpy array. I am trying to export them as a JSON object attribute and need them represented as UTC strings.
Here is my array (a small chunk of it):
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=<UTC>), datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=<UTC>)]
json = {
    'datetimes': []
}
I know I can iterate over the list and convert them, but I was hoping there was an efficient pandas or numpy technique for this.
I think you can create a DataFrame, convert to ISO format, and save to a dict, because DataFrame.to_json with orient='list' is not implemented yet:
import datetime
import json
import pandas as pd

datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
             datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]
df = pd.DataFrame({'datetimes': datetimes})
# native conversion to ISO, but lists are not supported yet
print (df.to_json(date_format='iso'))
{"datetimes":{"0":"2015-07-12T18:33:14.000Z",
"1":"2015-07-12T18:33:32.000Z",
"2":"2015-07-12T18:33:50.000Z"}}
df = pd.DataFrame({'datetimes': datetimes})
df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
print (json.dumps(df.to_dict(orient='list')))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
print(json.dumps({'datetimes': [x.isoformat() for x in datetimes]}))
{"datetimes": ["2015-07-12T18:33:14+00:00",
"2015-07-12T18:33:32+00:00",
"2015-07-12T18:33:50+00:00"]}
I tested it more, and the list comprehension with isoformat is fastest:
datetimes = [datetime.datetime(2015, 7, 12, 18, 33, 14, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 32, tzinfo=datetime.timezone.utc),
datetime.datetime(2015, 7, 12, 18, 33, 50, tzinfo=datetime.timezone.utc)]*10000
In [116]: %%timeit
...: df = pd.DataFrame({'datetimes': datetimes})
...: df['datetimes'] = df['datetimes'].map(lambda x: x.isoformat())
...: json.dumps(df.to_dict(orient='l'))
...:
1 loop, best of 3: 552 ms per loop
# wrong output format: dictionaries, not lists
In [117]: %%timeit
...: df = pd.DataFrame({'datetimes': datetimes})
...: df.to_json(date_format='iso')
...:
10 loops, best of 3: 104 ms per loop
In [118]: %%timeit
...: json.dumps({'datetimes': [x.isoformat() for x in datetimes]})
...:
10 loops, best of 3: 67.5 ms per loop
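For completeness, a vectorized pandas alternative (my addition, not timed above) that formats the whole column at once; note that %z renders the offset as +0000 rather than isoformat's +00:00:
# Hedged sketch: vectorized formatting through the .dt accessor.
s = pd.Series(datetimes)
iso_strings = s.dt.strftime('%Y-%m-%dT%H:%M:%S%z').tolist()
print(json.dumps({'datetimes': iso_strings[:3]}))
# {"datetimes": ["2015-07-12T18:33:14+0000", "2015-07-12T18:33:32+0000", "2015-07-12T18:33:50+0000"]}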