pandas compare two data frames and highlight the differences - pandas

I'm trying to compare two dataframes and highlight the differences in the second one.
I have tried using concat and drop_duplicates, but I am not sure how to check specific cells or how to highlight them at the end.

A possible solution is the following:
import numpy as np
import pandas as pd
# set test data
data1 = {"A": [10, 11, 23, 44], "B": [22, 23, 56, 55], "C": [31, 21, 34, 66], "D": [25, 45, 21, 45]}
data2 = {"A": [10, 11, 23, 44, 56, 23], "B": [44, 223, 56, 55, 73, 56], "C": [31, 21, 45, 66, 22, 22], "D": [25, 45, 26, 45, 34, 12]}
# create dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# define function to highlight differences in dataframes
def highlight_diff(data, other, color='yellow'):
    attr = 'background-color: {}'.format(color)
    return pd.DataFrame(np.where(data.ne(other), attr, ''),
                        index=data.index, columns=data.columns)
# apply style using function
df2.style.apply(highlight_diff, axis=None, other=df1)
This returns a Styler that renders df2 with the differing cells highlighted in yellow.
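To inspect the same differences as plain data rather than as styling, the boolean mask behind the highlighting can be built directly. This is a minimal sketch reusing the test data from the answer; the names `mask` and `changed` are made up for illustration, and `reindex_like` is used here only to make the alignment of the shorter frame explicit.

```python
import numpy as np
import pandas as pd

data1 = {"A": [10, 11, 23, 44], "B": [22, 23, 56, 55], "C": [31, 21, 34, 66], "D": [25, 45, 21, 45]}
data2 = {"A": [10, 11, 23, 44, 56, 23], "B": [44, 223, 56, 55, 73, 56], "C": [31, 21, 45, 66, 22, 22], "D": [25, 45, 26, 45, 34, 12]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Rows that exist only in df2 become NaN in the aligned copy of df1,
# so every cell in those rows counts as different.
mask = df2.ne(df1.reindex_like(df2))

# List the (row label, column label) pairs of the differing cells.
rows, cols = np.where(mask.to_numpy())
changed = [(mask.index[r], mask.columns[c]) for r, c in zip(rows, cols)]
print(changed[:4])
```

The same `mask` is what `np.where` turns into CSS strings inside `highlight_diff`.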

Related

empty dataframe on merging two dataframe

import pandas as pd
df1 = pd.DataFrame({'HPI': [10, 20, 30, 40, 50],'INT': [1, 2, 3, 4, 5],'IND': [50, 60, 70, 80, 90]},index=[2001, 2002, 2003, 2004, 2005])
df2 = pd.DataFrame({'HPI': [11, 22, 33, 44, 55],'INT': [6, 7, 8, 9, 0],'IND': [51, 62, 73, 84, 95]},index=[2006, 2007, 2008, 2009, 2010])
merge = pd.merge(df1, df2,on=['HPI', 'INT', 'IND'])
print(merge)
The output of the code is:
Empty DataFrame
Columns: [HPI, INT, IND]
Index: []
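pd.merge defaults to an inner join on the listed columns, and no row of df1 matches a row of df2 on all of HPI, INT, and IND, so the result is empty. You might be looking for concatenation, as BERA pointed out: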
concatenated = pd.concat([df1,df2])
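The two behaviours can be compared side by side. This is a small sketch that uses shortened, two-row versions of the frames from the question; the variable names `inner` and `stacked` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [10, 20], 'INT': [1, 2], 'IND': [50, 60]}, index=[2001, 2002])
df2 = pd.DataFrame({'HPI': [11, 22], 'INT': [6, 7], 'IND': [51, 62]}, index=[2006, 2007])

# Inner join on all three columns: no row matches in both frames, so nothing survives.
inner = pd.merge(df1, df2, on=['HPI', 'INT', 'IND'])
print(inner.empty)

# Concatenation stacks the rows and keeps each frame's own index labels.
stacked = pd.concat([df1, df2])
print(stacked)
```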

How to merge classes in multiclass image segmentation

I am performing an image segmentation with a u-net model.
My mask has classes from 0-50.
I also have a text file dictionary with codes representing each class.
For example -
{1: '1234', 2:'5678', 3:'1245'} etc.
How do I combine classes whose codes share the same first two characters? In the example above, keys 1 and 3 should be merged because both codes start with "12".
How can I do this for all classes?
firstTwoCharDict = {}
for key, value in dictionary.items():
    if key == 0:
        # keep the value for class 0 unchanged
        firstTwoCharDict[key] = value
    else:
        # keep only the first two characters of the code
        firstTwoCharDict[key] = value[:2]
# group the class keys by their two-character prefix
newDict = {}
for key, value in firstTwoCharDict.items():
    if value not in newDict:
        newDict[value] = [key]
    else:
        newDict[value].append(key)
This gives:
{'62': [1, 39],
'90': [2, 5, 9, 20, 32, 42, 47, 72, 88, 91, 95],
'97': [3, 49, 55],
'98': [4, 24, 34, 40, 53, 76, 81, 90, 96],
'31': [6, 17, 30, 48, 83],
'69': [7, 13, 15, 16, 27, 44, 51, 54, 56, 75],
'79': [8, 50],
'71': [10, 19, 22, 35, 61, 63, 65],
'99': [11, 12, 21, 46, 52, 69, 78, 84, 89],
'48': [14, 36, 74],
'60': [18],
'64': [23, 38, 66, 97]
Now I have a 2D array of integers. How do I replace its values with the dict keys wherever an array value appears in one of the dict's value lists?
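One way to do this replacement is a NumPy lookup table indexed by the old class id. This is a minimal sketch, not a definitive answer: `groups` and `mask` are small made-up stand-ins for newDict and the real segmentation mask, and it assumes every prefix is numeric so it can be stored as an integer label.

```python
import numpy as np

# Hypothetical stand-in for newDict: prefix -> list of original class ids.
groups = {'62': [1, 3], '90': [2, 5], '97': [4]}
# Hypothetical 2D mask holding original class ids 0..5.
mask = np.array([[0, 1, 2],
                 [3, 4, 5]])

# Identity table: ids that appear in no group (such as 0) keep their value.
lut = np.arange(mask.max() + 1)
for prefix, old_ids in groups.items():
    lut[old_ids] = int(prefix)

# Fancy indexing applies the table to every pixel at once.
merged = lut[mask]
print(merged)
```

If consecutive labels are preferable for training, assigning `enumerate` indices instead of `int(prefix)` in the loop would work the same way.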

Inserting new fields(columns) to mongoDB with pandas

I have existing data in MongoDB where the primary key is set on 'date', with a few fields in each document.
I want to add a new pandas dataframe with new fields (columns) to the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A that I have in MongoDB (I set the index to the 'date' field when loading the data from MongoDB).
This is the new dataframe B that I want to insert into MongoDB.
And this is the final dataframe C with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window', added on the 'date' index), which is what I want to end up with in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True, called in a loop. You can't use insert_many() for two reasons: first, you're not always inserting, sometimes you are updating; second, update_many() applies a single filter to everything it touches, whereas in your case each update needs its own filter because it relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner that you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator overwrites values that are already there and sets them if they are not. upsert=True performs an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd
# Sample data setup
db = MongoClient()['mydatabase']
data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])
data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])
# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
'date': datetime.datetime(2017, 5, 19, 21, 20),
'std_500_1000window': 96,
'std_50_100window': 8,
'std_50_2000window': 98,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
'date': datetime.datetime(2017, 5, 19, 21, 21),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
'date': datetime.datetime(2017, 5, 19, 21, 22),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
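When all the frames are available at once, an alternative is to join them in pandas first so that each date needs only one upsert. This is a sketch under assumptions: the two-row frames are shortened versions of the sample data, `combined` and `docs` are made-up names, and the block only builds the documents. The commented-out loop mirrors the update_one call used above and needs a running MongoDB.

```python
import datetime
import pandas as pd

dates = [datetime.datetime(2017, 5, 19, 21, 20),
         datetime.datetime(2017, 5, 19, 21, 21)]
df_a = pd.DataFrame({'date': dates, 'std_50_100window': [8, 8]})
df_b = pd.DataFrame({'date': dates, 'std_50_300window': [9, 9]})

# Join on 'date' so each row carries the fields of both frames.
combined = df_a.merge(df_b, on='date', how='outer')

# One dict per date, ready to be used as a $set document.
docs = combined.to_dict('records')
print(docs[0])

# for doc in docs:
#     db.mycollection.update_one({'date': doc['date']}, {'$set': doc}, upsert=True)
```

With an outer join, a date present in only one frame gets NaN in the other frame's columns, and `$set` would write those NaN values, so such keys may need to be dropped from each document first.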

Numpy array changes shape when accessing with indices

I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But the left-hand side of my assignment,
B[1,:,:,ind].shape
has shape Ox1xMxN, so I cannot broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug, that when slices are mixed in the middle of advanced indexing, the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3,4 dimensions have been placed after the ind, 5.
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
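Another option is to keep the single indexing expression and reorder the axes of A to match the shape of the left-hand side. This is a minimal sketch, assuming A has shape (M, N, O) and ind is a flat list of O indices; the sizes and values are made up for illustration.

```python
import numpy as np

M, N, O, P = 3, 4, 2, 5
A = np.arange(M * N * O).reshape(M, N, O)
B = np.zeros((2, M, N, P), dtype=int)
ind = [0, 3]

# The advanced-index dimension comes first, followed by the sliced ones.
lhs_shape = B[1, :, :, ind].shape
print(lhs_shape)

# Move A's last axis to the front so it lines up with that layout.
B[1, :, :, ind] = np.moveaxis(A, -1, 0)

# Reading back through the two-step indexing recovers A in its original layout.
print(np.array_equal(B[1][:, :, ind], A))
```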
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It has come up in other SO questions, though not recently, for example:
weird result when using both slice indexing and boolean indexing on a 3d array

Implementation of Hierarchical Agglomerative clustering

I am a newbie and want to implement hierarchical agglomerative clustering for RGB images. To do this, I extract all RGB values from an image and process it. Next I compute the pairwise distances and then build the linkage. Now, from the linkage, I want to extract my original data (i.e. the RGB values) at the specified indices. Here is the code I have so far.
image = Image.open('image.jpg')
image = image.convert('RGB')
im = np.array(image).reshape((-1,3))
rgb = list(im.getdata())
X = pdist(im)
Y = linkage(X)
I = inconsistent(Y)
Based on the 4th column of the inconsistency matrix, I choose a minimal cutoff value in order to get the maximum number of clusters.
cutoff = 0.7
cluster_assignments = fclusterdata(Y, cutoff)
# Print the indices of the data points in each cluster.
num_clusters = cluster_assignments.max()
print "%d clusters" % num_clusters
indices = cluster_indices(cluster_assignments)
ind = np.array(enumerate(rgb))
for k, ind in enumerate(indices):
    print "cluster", k + 1, "is", ind
dendrogram(Y)
I got results like this:
cluster 6 is [ 6 11]
cluster 7 is [ 9 12]
cluster 8 is [15]
This means cluster 6 contains leaves 6 and 11. At this point I am stuck on how to map these indices back to the original data (i.e. the RGB values), since each index corresponds to the RGB value of one pixel in the image. After that I have to generate a codebook to implement agglomerative clustering. I have no idea how to approach this task; I have read a lot of material but found no clue.
Here is my solution:
import numpy as np
from scipy.cluster import hierarchy
im = np.array([[54,101,9],[ 67,89,27],[ 67,85,25],[ 55,106,1],[ 52,108,0],
[ 55,78,24],[ 19,57,8],[ 19,46,0],[ 95,110,15],[112,159,57],
[ 67,118,26],[ 76,127,35],[ 74,128,30],[ 25,62,0],[100,120,9],
[127,145,61],[ 48,112,25],[198,25,21],[203,11,10],[127,171,60],
[124,173,45],[120,133,19],[109,137,18],[ 60,85,0],[ 37,0,0],
[187,47,20],[127,170,52],[ 30,56,0]])
groups = hierarchy.fclusterdata(im, 0.7)
idx_sorted = np.argsort(groups)
group_sorted = groups[idx_sorted]
im_sorted = im[idx_sorted]
split_idx = np.where(np.diff(group_sorted) != 0)[0] + 1
np.split(im_sorted, split_idx)
output:
[array([[203, 11, 10],
[198, 25, 21]]),
array([[187, 47, 20]]),
array([[127, 171, 60],
[127, 170, 52]]),
array([[124, 173, 45]]),
array([[112, 159, 57]]),
array([[127, 145, 61]]),
array([[25, 62, 0],
[30, 56, 0]]),
array([[19, 57, 8]]),
array([[19, 46, 0]]),
array([[109, 137, 18],
[120, 133, 19]]),
array([[100, 120, 9],
[ 95, 110, 15]]),
array([[67, 89, 27],
[67, 85, 25]]),
array([[55, 78, 24]]),
array([[ 52, 108, 0],
[ 55, 106, 1]]),
array([[ 54, 101, 9]]),
array([[60, 85, 0]]),
array([[ 74, 128, 30],
[ 76, 127, 35]]),
array([[ 67, 118, 26]]),
array([[ 48, 112, 25]]),
array([[37, 0, 0]])]
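To go from cluster labels back to pixel indices and RGB values directly, the pixel array can be indexed with the positions that share a label. This is a minimal sketch with assumptions stated up front: `labels` is a made-up label array standing in for the output of fclusterdata, `pixels` holds the first few rows of im, and the "codebook" here is taken to mean the mean colour per cluster, which is only one possible reading of the question.

```python
import numpy as np

# First few RGB rows from the example above.
pixels = np.array([[54, 101, 9], [67, 89, 27], [67, 85, 25], [55, 106, 1], [19, 57, 8]])
# Hypothetical cluster labels, one per pixel row (stand-in for fclusterdata output).
labels = np.array([2, 1, 1, 2, 3])

# Map each cluster label to the row indices that belong to it.
clusters = {int(lab): np.flatnonzero(labels == lab) for lab in np.unique(labels)}

# Index the pixel array with those positions to recover the RGB values,
# and average them to get one representative colour per cluster.
codebook = {lab: pixels[idx].mean(axis=0) for lab, idx in clusters.items()}

for lab, idx in clusters.items():
    print("cluster", lab, "indices", idx, "rgb", pixels[idx].tolist())
```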