How to compare column values in different dataframes?

I found this code, and it works very well:
df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})
set(df1['c2']).intersection(set(df2['c2']))
But I need to compare them on multiple columns at once, something like:
set(df1['c2']['c3']).intersection(set(df2['c2']['c3']))
This does not work. How can I fix it, or is there another way to compare the dataframes and find the matching values?

You can try merge followed by drop_duplicates:
df1.merge(df2,on = ['c2','c3'])[['c2','c3']].drop_duplicates()
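As a rough sketch of what the merge above does with the sample frames from the question: an inner join on both columns keeps only the (c2, c3) pairs that occur in both dataframes. The set-of-tuples variant is an alternative that stays closer to the original single-column code; the names cols, pairs1, pairs2 and common_pairs are chosen here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})

cols = ['c2', 'c3']  # illustrative name for the columns to compare

# Inner join on both columns: only rows whose (c2, c3) pair exists in both frames survive.
common = df1.merge(df2, on=cols)[cols].drop_duplicates()

# Set-based alternative: turn each row into a tuple so the pair is compared as a unit.
pairs1 = set(df1[cols].itertuples(index=False, name=None))
pairs2 = set(df2[cols].itertuples(index=False, name=None))
common_pairs = pairs1 & pairs2

# The sample frames share no complete (c2, c3) pair, so both results are empty,
# even though 5 occurs in c2 of both frames and 3 occurs in c3 of both frames.
print(common.empty, common_pairs)
```

Note that this matches whole rows of (c2, c3). If the goal is to match each column separately, the single-column intersection from the question can simply be repeated per column.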

Related

How to concatenate columns of dataframes in a dictionary to a new dataframe

I have a dictionary of eight very similar dataframes. I'd like to pick the column with the same name from each of these dataframes and concatenate those columns into a new dataframe, where each column is named after the dictionary key of the dataframe it came from.
A small example looks like this:
d1 = {'DE': [1, 2, 3], 'BE': [3, 4, 5], 'AT': [5, 6, 7]}
df1 = pd.DataFrame(data=d1)
d2 = {'DE': [5, 7, 9], 'BE': [4, 6, 2], 'AT': [3, 5, 2]}
df2 = pd.DataFrame(data=d2)
d3 = {'DE': [1, 5, 4], 'BE': [5, 2, 1], 'AT': [3, 6, 1]}
df3 = pd.DataFrame(data=d3)
technology = {'solar' : df1, 'wind_onshore' : df2, 'wind_offshore' : df3}
Now I'd like to pick the 'DE' column of each dataframe and concatenate these columns into a new dataframe, where each column is named after the key it comes from, e.g. solar, wind_onshore, wind_offshore.
I hope this is not a trivial question and I'm just not getting it :D
Thanks everyone :)
Edit: I accidentally constructed a dictionary of dictionaries rather than a dictionary of dataframes

You can first add a technology column to each dataframe and then combine the separate dataframes into a single long one using pd.concat. You can then use DataFrame.pivot to turn the technology values into columns:
d1 = {'DE': [1, 2, 3], 'BE': [3, 4, 5], 'AT': [5, 6, 7]}
df1 = pd.DataFrame(data=d1)
df1['technology'] = 'solar'
d2 = {'DE': [5, 7, 9], 'BE': [4, 6, 2], 'AT': [3, 5, 2]}
df2 = pd.DataFrame(data=d2)
df2['technology'] = 'wind_onshore'
d3 = {'DE': [1, 5, 4], 'BE': [5, 2, 1], 'AT': [3, 6, 1]}
df3 = pd.DataFrame(data=d3)
df3['technology'] = 'wind_offshore'
combined_df = pd.concat((df1,df2,df3))
wide_df = combined_df.pivot(
    values='DE',
    columns='technology',
)
wide_df
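Since the question starts from a dictionary of dataframes, a shorter route may be to hand pd.concat a dictionary of the 'DE' columns directly. This is only a sketch using the sample data from the question; the variable name de is made up for the example.

```python
import pandas as pd

technology = {
    'solar': pd.DataFrame({'DE': [1, 2, 3], 'BE': [3, 4, 5], 'AT': [5, 6, 7]}),
    'wind_onshore': pd.DataFrame({'DE': [5, 7, 9], 'BE': [4, 6, 2], 'AT': [3, 5, 2]}),
    'wind_offshore': pd.DataFrame({'DE': [1, 5, 4], 'BE': [5, 2, 1], 'AT': [3, 6, 1]}),
}

# When pd.concat receives a dict, its keys are used as labels. With axis=1 and
# Series values, those keys become the column names of the result.
de = pd.concat({name: frame['DE'] for name, frame in technology.items()}, axis=1)
print(de)
```

This avoids adding a helper column and pivoting, at the cost of handling one column name at a time.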

How to sum pandas df rows where each cell contains a list?

I'm trying to sum the rows of my dataframe as follows.
Say I have the dataframe below, where each cell in a row contains a vector/list of the same size.
In the real problem I have a large number of columns, and that number can vary, but I do have a list that contains the names of those columns.
df = pd.DataFrame([
[[1,2,3],[1,2,3],[1,2,3]],
[[1,1,1],[1,1,1],[1,1,1]],
[[2,2,2],[2,2,2],[2,2,2]]
], columns=['a','b','c'])
I'm trying to create a new column that contains the element-wise sum of all the vectors in each row, as np.array would do, and get the following vectors as a result:
[3,6,9]
[3,3,3]
[6,6,6]
and not what .sum(axis=1) gives, which concatenates the lists:
[1,2,3,1,2,3,1,2,3]
[1,1,1,1,1,1,1,1,1]
[2,2,2,2,2,2,2,2,2]
Does anyone have an idea? Thanks in advance :)

If all the lists have the same length, create a NumPy array and sum over the column axis, which also improves performance:
df['Sum'] = np.array(df.to_numpy().tolist()).sum(axis=1).tolist()
print (df)
           a          b          c        Sum
0  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [3, 6, 9]
1  [1, 1, 1]  [1, 1, 1]  [1, 1, 1]  [3, 3, 3]
2  [2, 2, 2]  [2, 2, 2]  [2, 2, 2]  [6, 6, 6]

Another way using pd.Series.explode:
df['sum'] = df.apply(pd.Series.explode).sum(axis=1).groupby(level=0).agg(list)
Output:
           a          b          c              sum
0  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [3.0, 6.0, 9.0]
1  [1, 1, 1]  [1, 1, 1]  [1, 1, 1]  [3.0, 3.0, 3.0]
2  [2, 2, 2]  [2, 2, 2]  [2, 2, 2]  [6.0, 6.0, 6.0]
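The question mentions a list holding the names of the vector columns. A minimal sketch of the NumPy approach restricted to such a list might look like this; vector_cols and stacked are hypothetical names, and the sketch assumes every cell holds a list of the same length.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
], columns=['a', 'b', 'c'])

vector_cols = ['a', 'b', 'c']  # hypothetical: the list of column names from the question

# Resulting shape is (rows, columns, vector length); summing over axis 1
# adds the vectors of each row element-wise.
stacked = np.array(df[vector_cols].to_numpy().tolist())
df['Sum'] = stacked.sum(axis=1).tolist()
print(df['Sum'].tolist())
```

Selecting the columns first also keeps the computation safe to rerun after the Sum column has been added to the dataframe.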

How to delete rows from column which have matching values in the list Pandas

I am finding outliers in a column and storing them in a list. Now I want to delete from the column all the values that are present in my list.
How can I achieve this?
This is my function for finding outliers:
outlier=[]
def detect_outliers(data):
    threshold=3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        # calculate the z-score of the value
        z_score=(i-m)/st
        # a value whose absolute z-score exceeds the threshold is an outlier
        if np.abs(z_score)>threshold:
            outlier.append(i)
    return outlier
This is the column in my dataframe:
df_train_11.AMT_INCOME_TOTAL

import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
def detect_outliers(data):
    threshold=0.5
    outlier_list=[]
    for i in data:
        # z-score of every value in column i
        z_score=(data.loc[:,i]- np.mean(data.loc[:,i])) /np.std(data.loc[:,i])
        outliers = np.abs(z_score)>threshold
        # row labels of the outliers in this column
        outlier_list.append(data.index[outliers].tolist())
    return outlier_list
outlier_list = detect_outliers(df)
Example output (the exact rows vary because the data is random):
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. Here outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9], which means that rows 1, 2, etc. are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This filters the DataFrame, keeping only the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
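To remove the rows whose value appears in an outlier list, as the question literally asks, one option is Series.isin with an inverted mask. The following is a sketch under stated assumptions: df_train and its values are made-up stand-ins for df_train_11, and detect_outliers is a vectorised rewrite that uses the population standard deviation (ddof=0), not the asker's exact function.

```python
import pandas as pd

# Hypothetical data: 50 ordinary incomes plus one extreme value.
df_train = pd.DataFrame(
    {'AMT_INCOME_TOTAL': [100.0 + i for i in range(50)] + [10_000.0]}
)

def detect_outliers(data, threshold=3):
    """Return the values whose absolute z-score exceeds the threshold."""
    z_scores = (data - data.mean()) / data.std(ddof=0)
    return data[z_scores.abs() > threshold].tolist()

outliers = detect_outliers(df_train['AMT_INCOME_TOTAL'])

# isin marks the rows whose value is in the list; ~ inverts the mask,
# so only the rows that are NOT outliers are kept.
cleaned = df_train[~df_train['AMT_INCOME_TOTAL'].isin(outliers)]
print(outliers, len(df_train), len(cleaned))
```

Comparing floating-point values by equality works here because the list was taken from the same column. Filtering directly on the z-score mask avoids that comparison altogether.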

Expand dimension to Tensor and assign value

My tensor has shape (32, 4), like this:
input_boxes = [
[1,2,3,4],
[2,2,6,4],
[[1,5,3,4],[1,3,3,8]],#some row has two
[1,2,3,4],#some has one row
[[1,2,3,4],[1,3,3,4]],
[1,7,3,4],
......
[1,2,3,4]
]
I would like to expand it to shape (32, 5) by adding a new first column, along the lines of tf.expand_dims(input_boxes, 0),
and then fill that first column with the row number, like this:
input_boxes = [
[0,1,2,3,4],
[1,2,2,6,4],
[[2,1,5,3,4],[2,1,3,3,8]],#some row has two
[3,1,2,3,4],#some has one row
[[4,1,2,3,4],[4,1,3,3,4]],
[5,1,7,3,4],
......
[31,1,2,3,4]
]
How can I do this in TensorFlow?

Mentioning the Solution here (Answer Section) even though it is present in the Comments Section (thanks to jdehesa) for the benefit of the Community.
For example, we have a Tensor of Shape (7,4) as shown below:
import tensorflow as tf
input_boxes = tf.constant([[1,2,3,4],
[2,2,6,4],
[1,5,3,4],
[1,2,3,4],
[1,2,3,4],
[1,7,3,4],
[1,2,3,4]])
print(input_boxes)
The code below expands it to shape (7, 5) by prepending a first column that holds the respective row numbers:
row_ids = tf.expand_dims(tf.range(tf.shape(input_boxes)[0]), 1)  # shape (7, 1)
row_ids = tf.cast(row_ids, input_boxes.dtype)  # dtypes must match before concatenating
input_boxes = tf.concat([row_ids, input_boxes], axis=1)
print(input_boxes)
Output of the above code is shown below:
<tf.Tensor: shape=(7, 5), dtype=int32, numpy=
array([[0, 1, 2, 3, 4],
[1, 2, 2, 6, 4],
[2, 1, 5, 3, 4],
[3, 1, 2, 3, 4],
[4, 1, 2, 3, 4],
[5, 1, 7, 3, 4],
[6, 1, 2, 3, 4]], dtype=int32)>
Hope this helps. Happy Learning!

numpy custom array element retrieval

I have a question about how to extract certain values from a 2D numpy array:
Foo =
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
Bar =
array([[0, 0, 1],
[1, 2, 3]])
I want to extract elements from Foo using the values of Bar as row indices, such that I end up with a 2D array Baz of the same shape as Bar. The i-th column of Baz corresponds to Foo[Bar[:, i], i]:
Baz =
array([[ 1, 2, 6],
[ 4, 8, 12]])
I could do a couple nested for-loops but I was wondering if there is a more elegant, numpy-ish way to do this.
Sorry if this is a bit convoluted. Let me know if I need to explain further.
Thanks!

You can use Bar as the row index and an array [0, 1, 2] as the column index:
# for easy copy-pasting
import numpy as np
Foo = np.array([[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9], [10, 11, 12]])
Bar = np.array([[0, 0, 1], [1, 2, 3]])
# now use Bar as the `i` coordinate and 0, 1, 2 as the `j` coordinate:
Foo[Bar, [0, 1, 2]]
# array([[ 1, 2, 6],
# [ 4, 8, 12]])
# OR, to automatically generate the [0, 1, 2]
Foo[Bar, np.arange(Bar.shape[1])]
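As an alternative sketch, NumPy also provides np.take_along_axis, which expresses the same lookup without building the column index by hand. It treats each entry of Bar as a row index into the matching column of Foo.

```python
import numpy as np

Foo = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Bar = np.array([[0, 0, 1], [1, 2, 3]])

# For every position (i, j), pick Foo[Bar[i, j], j]: Bar supplies the row,
# and the column is implied by the position within Bar.
Baz = np.take_along_axis(Foo, Bar, axis=0)
print(Baz)
```

Both forms should give the same array; the fancy-indexing version above is shorter, while take_along_axis states the intent (index along axis 0) more explicitly.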
