How do I extract information from nested duplicates in pandas?

I am trying to extract information from duplicates.
data = np.array([[100,1,0, 'GB'],[100,0,1, 'IT'],[101,1,0, 'CN'],[101,0,1, 'CN'],
[102,1,0, 'JP'],[102,0,1, 'CN'],[103,0,1, 'DE'],
[103,0,1, 'DE'],[103,1,0, 'VN'],[103,1,0, 'VN']])
df = pd.DataFrame(data, columns = ['wed_cert_id','spouse_1',
'spouse_2', 'nationality'])
I would like to categorise each wedding as either cross-national or not.
In my actual data set there can be more than 2 spouses to a marriage.
My aim is to obtain a data frame like this:
or like this:
I have tried to filter the data using .duplicated(), and to negate .duplicated() with a not operator, but have not succeeded in working it out:
df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]
Dropping the duplicates drops too many observations. My data set allows for >2 spouses per wedding, creating the potential for duplication:
df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)
How do I do it?
Many thanks in advance.

I believe you need:
# flag every row of a wedding that has more than one distinct nationality
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
.transform('nunique').gt(1).astype(int))
print(df)
Or:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
.transform('nunique').gt(1).astype(int)
.mul(df[['spouse_1','spouse_2']].prod(axis=1)))
print(df)
wed_cert_id spouse_1 spouse_2 nationality cross_national
0 100 1 0 GB 1
1 100 0 1 IT 1
2 101 1 0 CN 0
3 101 0 1 CN 0
4 102 1 0 JP 1
5 102 0 1 CN 1
6 103 0 1 DE 1
7 103 0 1 DE 1
8 103 1 0 VN 1
9 103 1 0 VN 1
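For anyone who wants to run this end to end, here is a minimal sketch of the first variant. It assumes the frame is built from a plain list of rows (so the numeric columns stay integers instead of being coerced to strings by np.array), and the `per_wedding` name is only illustrative:

```python
import pandas as pd

# Build the frame from a list of rows so numeric columns keep an integer dtype.
rows = [
    [100, 1, 0, "GB"], [100, 0, 1, "IT"],
    [101, 1, 0, "CN"], [101, 0, 1, "CN"],
    [102, 1, 0, "JP"], [102, 0, 1, "CN"],
    [103, 0, 1, "DE"], [103, 0, 1, "DE"],
    [103, 1, 0, "VN"], [103, 1, 0, "VN"],
]
df = pd.DataFrame(
    rows, columns=["wed_cert_id", "spouse_1", "spouse_2", "nationality"]
)

# Number of distinct nationalities per wedding, broadcast back to every row.
n_nat = df.groupby("wed_cert_id")["nationality"].transform("nunique")
df["cross_national"] = n_nat.gt(1).astype(int)

# Illustrative alternative: one row per wedding instead of one row per spouse.
per_wedding = (
    df.groupby("wed_cert_id")["nationality"]
    .nunique()
    .gt(1)
    .astype(int)
    .rename("cross_national")
    .reset_index()
)
print(per_wedding)
```

Because the count is taken per wedding_cert_id, this should work unchanged when a wedding has more than two spouses.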

Related

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum the columns by their prefix (C1, H2, K3) and output three new columns with the totals. The final result should be this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
idx = df.columns.str.startswith(i)
for j in lst2:
df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I keep getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help would be appreciated (or a better suggestion as opposed to a loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try splitting the column names, then grouping by the result with axis=1:
out = df.groupby(df.columns.str.split('-').str[0],axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Output:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
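One caveat on the answers above: grouping with axis=1 is deprecated in recent pandas releases (2.1 and later, as far as I know). A rough sketch of the same idea without axis=1, assuming the sample frame from the question, is to transpose, group the former columns by their prefix, and transpose back:

```python
import pandas as pd

df = pd.DataFrame({
    "sn": [1, 2, 3],
    "C1-1": [4, 2, 1], "C1-2": [3, 2, 2], "C1-3": [5, 0, 0],
    "H2-1": [4, 2, 0], "H2-2": [1, 0, 2],
    "K3-1": [4, 1, 1], "K3-2": [2, 2, 2],
})

cols = df.columns[1:]  # every column except 'sn'

# After transposing, the column names become the row index, so a function
# passed to groupby receives each name and returns its prefix ('C1', 'H2', 'K3').
totals = (
    df[cols].T
    .groupby(lambda name: name.split("-")[0])
    .sum()
    .T
    .add_prefix("total_")
)

out_df = df[["sn"]].join(totals)
print(out_df)
```

The transpose is cheap for a handful of columns; for very wide frames with mixed dtypes it may be worth checking the dtypes after the round trip.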

Simple addition of different sizes DataFrames in Pandas

I have two simple addition problems with pandas that I hope you can help me with.
My first question:
Let's say I have the following two dataframes, a_df and b_df:
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
a_df =
0 1 2 3
0 1 1 1 1
1 0 0 0 0
2 1 1 0 0
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df=
0 1 2 3
0 1 1 1 1
I would like to add b_df to a_df to obtain c_df, such that my expected output is as follows:
c_df =
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
My current method is to replicate b_df to the same size as a_df and then carry out the addition, as shown below. However, this is not very efficient if a_df is very large.
a = [[1,1,1,1],[0,0,0,0],[1,1,0,0]]
a_df = pd.DataFrame(a)
b = [1,1,1,1]
b_df = pd.DataFrame(b).T
b_df = pd.concat([b_df]*len(a_df)).reset_index(drop=True)
c_df = a_df + b_df
Is there another way to add b_df (without replicating it) to a_df in order to obtain the c_df I want?
My second question is very similar to my first one:
Let's say I have d_df and e_df as follows:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
d_df=
0
0 1
1 1
2 1
3 1
e = [1]
e_df = pd.DataFrame(e)
e_df=
0
0 1
I want to add e_df to d_df such that I would get the following result:
0
0 2
1 2
2 2
3 2
Again, I am currently replicating e_df using the following method (same as in Question 1) before adding it to d_df:
d = [1,1,1,1]
d_df = pd.DataFrame(d)
e = [1]
e_df = pd.DataFrame(e)
e_df = pd.concat([e_df]*len(d_df)).reset_index(drop=True)
f_df = d_df + e_df
Is there a way without replicating e_df?
Please advise. Thank you so much in advance.
Tommy
Try this :
pd.DataFrame(a_df.to_numpy() + b_df.to_numpy())
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
numpy offers broadcasting, which lets you add arrays the way you want as long as their shapes are compatible (here, both have the same number of columns and b_df has a single row). I feel someone has answered something similar to this before. Once I find it I will reference it here.
This article from numpy explains broadcasting pretty well
For the first, convert the one-row DataFrame to a Series:
c_df = a_df + b_df.iloc[0]
print (c_df)
0 1 2 3
0 2 2 2 2
1 1 1 1 1
2 2 2 1 1
The same principle applies to the second:
c_df = d_df + e_df.iloc[0]
print (c_df)
0
0 2
1 2
2 2
3 2
More information can be found in How do I operate on a DataFrame with a Series for every column.
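To tie the two answers together, here is a small self-contained sketch using the frames from the question. The names `c_np` and the scalar form for the second case are my own additions for illustration:

```python
import pandas as pd

a_df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
b_df = pd.DataFrame([1, 1, 1, 1]).T  # one row, four columns

# A Series taken from the single row aligns on column labels
# and is added to every row of a_df.
c_df = a_df + b_df.iloc[0]

# Same result through NumPy broadcasting: shapes (3, 4) + (1, 4) -> (3, 4).
c_np = pd.DataFrame(
    a_df.to_numpy() + b_df.to_numpy(),
    index=a_df.index,
    columns=a_df.columns,
)

d_df = pd.DataFrame([1, 1, 1, 1])  # four rows, one column
e_df = pd.DataFrame([1])           # a single cell

# With a single value, pulling out the scalar is enough.
f_df = d_df + e_df.iloc[0, 0]
```

The Series version keeps pandas' label alignment, while the NumPy version ignores labels entirely, so the two only agree when the column labels already match.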

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value and then modify it with a mask using isin
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follows:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
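For completeness, the original np.where attempt can also be made to work. The sketch below assumes the same sample frame; the `IND_isin` column name is only there to compare the two results:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"VALUE": [1, 1, 2, 2, 3, 3, 4, 4]})

# `or` asks Python for a single True/False from a whole Series, which raises
# the "truth value is ambiguous" error. `|` combines the two boolean Series
# element by element instead. The parentheses are required because `|`
# binds more tightly than `==`.
df["IND"] = np.where((df["VALUE"] == 1) | (df["VALUE"] == 4), 1, 0)

# Same result with a membership test, which scales better to longer lists.
df["IND_isin"] = df["VALUE"].isin([1, 4]).astype(int)
```

The same rule applies to `and` and `not`: use `&` and `~` on Series.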

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient or elegant approach? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
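On the second part of the question (numbering rows within each group), a possible sketch uses cumcount(), which gives each row its 0-based position inside its group. In the pandas versions I am familiar with, head() on a groupby keeps the original row index rather than a MultiIndex, so the output may look different from the one shown above. The `row_in_group` column name is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
    "value": [1, 2, 3, 1, 2, 3, 4, 1, 1],
})

# First two rows of each id, in their original order.
first_two = df.groupby("id").head(2)

# 0-based position of each row within its group,
# comparable to SQL ROW_NUMBER() OVER (PARTITION BY id) - 1.
df["row_in_group"] = df.groupby("id").cumcount()

# Filtering on that position selects the same rows as head(2).
same_rows = df.loc[df["row_in_group"] < 2, ["id", "value"]]
```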
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data set up front is very time-consuming.
We can group first and then take the top k rows within each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sorting with ascending=False gives a result similar to nlargest, and ascending=True gives a result similar to nsmallest.
The value passed to head plays the same role as the value passed to nlargest: the number of rows to keep for each group.
The reset_index call is optional.
This works for duplicated values
If there are duplicated values among the top-n values and you want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want distinct salaries within each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
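As a quick check of the rank-based mask described above, here is a small sketch on the question's sample data. It compares the mask against sorting followed by head(), which I use here only as a reference; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
    "value": [1, 2, 3, 1, 2, 3, 4, 1, 1],
})
N = 2

# rank(method="first") breaks ties by order of appearance, so at most
# N rows per group receive a rank of N or lower.
msk = df.groupby("id")["value"].rank(method="first", ascending=False) <= N
top_by_rank = df[msk]  # rows stay in their original order

# Reference: sort within id by descending value, then keep the first N per id.
top_by_sort = (
    df.sort_values(["id", "value"], ascending=[True, False])
    .groupby("id")
    .head(N)
)
```

Both select the same rows; they differ only in row order, since the mask preserves the original order while the sorted version lists the largest value of each group first.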

How would you do this task using SQL or R library sqldf?

I need to implement the following function (ideally in R or SQL): given two data frames (each has a column for userid, and the rest of the columns are boolean attributes that can only be 0 or 1), I need to return a new data frame with two columns (userid and count), where count is the number of attributes on which the user's 0's and 1's match across both tables. A user could occur in both data frames or in just one. In the latter case, I need to return NA for that user's count. Here is an example:
DF1
ID c1 c2 c3 c4 c5
1 0 1 0 1 1
10 1 0 1 0 0
5 0 1 1 1 0
20 1 1 0 0 1
3 1 1 0 0 1
6 0 0 1 1 1
71 1 0 1 0 0
15 0 1 1 1 0
80 0 0 0 1 0
DF2
ID c1 c2 c3 c4 c5
5 1 0 1 1 0
6 0 1 0 0 1
15 1 0 0 1 1
80 1 1 1 0 0
78 1 1 1 0 0
98 0 0 1 1 1
1 0 1 0 0 1
2 1 0 0 1 1
9 0 0 0 1 0
My function must return something like this: (the following is a subset)
DF_Return
ID Count
1 4
2 NA
80 1
20 NA
.
.
.
Could you give me any suggestions on how to carry this out? I'm not much of an expert in SQL.
I put the codes in R to generate the experiment I used above.
id1=c(1,10,5,20,3,6,71,15,80)
c1=c(0,1,0,1,1,0,1,0,0)
c2=c(1,0,1,1,1,0,0,1,0)
c3=c(0,1,1,0,0,1,1,1,0)
c4=c(1,0,1,0,0,1,0,1,1)
c5=c(1,0,0,1,1,1,0,0,0)
DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)
Many thanks in advance.
Best Regards!
Here are two approaches. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:
#Merge together using all = TRUE for the equivalent of an outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Calculate the rowSums where the same columns match
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] == DF3[, 7:ncol(DF3)]))
#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
DF3.m <- melt(DF3, id.vars = 1)
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))
#Are they the same?
all.equal(out1, out2)
#[1] TRUE
> head(out1)
ID count
1 1 4
2 2 NA
3 3 NA
4 5 3
5 6 2
6 9 NA
SELECT
COALESCE(DF1.ID, DF2.ID) AS ID,
-- return NULL (rather than 0) when the ID is missing from either table
CASE WHEN DF1.ID IS NULL OR DF2.ID IS NULL THEN NULL ELSE
CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END
END AS count_of_matches
FROM
DF1
FULL OUTER JOIN
DF2
ON DF1.ID = DF2.ID
There's probably a more elegant way, but this works:
x <- merge(DF1,DF2,by="ID",all=TRUE)
pre <- paste("c",1:5,sep="")
x$Count <- rowSums(x[,paste(pre,"x",sep=".")]==x[,paste(pre,"y",sep=".")])
DF_Return <- x[,c("ID","Count")]
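Not R or SQL, but in case it helps anyone reading this alongside the pandas threads above: the same outer-join-then-compare logic can be sketched in pandas. This is only an illustration under my own naming (the `_1`/`_2` suffixes and the `Count` column), using the data from the question:

```python
import pandas as pd

cols = ["c1", "c2", "c3", "c4", "c5"]

DF1 = pd.DataFrame({
    "ID": [1, 10, 5, 20, 3, 6, 71, 15, 80],
    "c1": [0, 1, 0, 1, 1, 0, 1, 0, 0],
    "c2": [1, 0, 1, 1, 1, 0, 0, 1, 0],
    "c3": [0, 1, 1, 0, 0, 1, 1, 1, 0],
    "c4": [1, 0, 1, 0, 0, 1, 0, 1, 1],
    "c5": [1, 0, 0, 1, 1, 1, 0, 0, 0],
})
DF2 = pd.DataFrame({
    "ID": [5, 6, 15, 80, 78, 98, 1, 2, 9],
    "c1": [1, 0, 1, 1, 1, 0, 0, 1, 0],
    "c2": [0, 1, 0, 1, 1, 0, 1, 0, 0],
    "c3": [1, 0, 0, 1, 1, 1, 0, 0, 0],
    "c4": [1, 0, 1, 0, 0, 1, 0, 1, 1],
    "c5": [0, 1, 1, 0, 0, 1, 1, 1, 0],
})

# Full outer join; the indicator column records which table(s) each ID came from.
merged = DF1.merge(DF2, on="ID", how="outer",
                   suffixes=("_1", "_2"), indicator=True)

# Compare the attribute columns pairwise and count matches per row.
left = merged[[c + "_1" for c in cols]].to_numpy()
right = merged[[c + "_2" for c in cols]].to_numpy()
matches = pd.Series((left == right).sum(axis=1), index=merged.index)

# Keep the count only for IDs present in both tables; NaN otherwise.
merged["Count"] = matches.where(merged["_merge"] == "both")
DF_Return = merged[["ID", "Count"]]
```

The explicit indicator check matters because a comparison against a missing value is simply False, which would otherwise show up as a count of 0 instead of NA.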
We could use safe_full_join from my package safejoin, and apply ==
between conflicting columns. This will yield a new data frame with logical
c* columns that we can use rowSums on.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(DF1, DF2, by = "ID", conflict = `==`) %>%
transmute(ID, count = rowSums(.[-1]))
# ID count
# 1 1 4
# 2 10 NA
# 3 5 3
# 4 20 NA
# 5 3 NA
# 6 6 2
# 7 71 NA
# 8 15 1
# 9 80 1
# 10 78 NA
# 11 98 NA
# 12 2 NA
# 13 9 NA
You can use the apply function to handle this. To get the sum of each row, you can use:
sums <- apply(DF1[2:ncol(DF1)], 1, sum)
cbind(DF1[1], sums)
which will return the sum of all but the first column, then bind that to the first column to get the ID back.
You could do that on both data frames. I'm not really clear what the desired behavior is after that, but maybe look at the merge function.