Finding occurrences by comparing 2 columns in a dataframe - pandas

This is my dataframe:
import pandas as pd

d = {'id':   [1, 2, 3, 4, 5, 6, 7, 8],
     'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
     'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H'],
     'data': ['abc', 'def', 'ghk', 'lmn', 'opq', 'rst', 'uvw', 'xyz']}
df = pd.DataFrame(d)
I want to find all values in col2 for each unique value in col1. Think of col1 as a house and col2 as the devices in it.
Output:
col1 col2 data
A    C    abc
          def
     D    ghk
B    E    lmn
     F    opq
          rst
C    G    uvw
D    H    xyz
Update:
Since my original dataset is large (98k rows), it would be great if I could also get a list of the values from col1 which have more than one row in col2. Based on my output above, I would need the list ['A', 'B'].

If you insist on getting exactly that output, here's one way:
import numpy as np

# drop duplicate (col1, col2) pairs, remove the id column and renumber the rows
df = df.drop_duplicates(subset=['col1', 'col2']).drop('id', axis=1).reset_index(drop=True)
# blank out repeated col1 values so each group label appears only once
df['col1'] = np.where(df['col1'].duplicated(), '', df['col1'])
Which produces:
  col1 col2
0    A    C
1         D
2    B    E
3         F
You might even want to go as far as:
df = df.set_index('col1')
Which produces:
     col2
col1
A       C
        D
B       E
        F
To export to CSV or Excel, simply do one of the following:
df.to_csv('filename.csv')
df.to_excel('filename.xlsx')
UPDATE: Based on the update in the question, the list of values from col1 can be obtained as follows:
list(df.groupby('col1').col1.filter(lambda x: len(x) > 1).unique())
Which produces:
['A', 'B']
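Note that filter counts rows rather than distinct col2 values, so exact col1/col2 duplicates (like the two A/C rows) count twice. If the intent is "more than one distinct device per house", a nunique-based variant is one option (a sketch; run it on the original frame, before the deduplication above):
# groups in col1 with more than one distinct col2 value
uniq = df.groupby('col1')['col2'].nunique()
uniq[uniq > 1].index.tolist()  # ['A', 'B']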

Try groupby() with an application of unique():
In [26]: df.groupby('col1').col2.unique()
Out[26]:
col1
A    [C, D]
B    [E, F]
Name: col2, dtype: object
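If the data column should be kept, as in the desired output, grouping by both columns is one option (a minimal sketch; the multi-index display blanks repeated labels, which is close to the layout asked for):
# collect the data values for every (col1, col2) pair
df.groupby(['col1', 'col2'])['data'].apply(list)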

Related

SQL - Add ID column for rows if at least one column has the same value

I have a table with three columns: col1, col2, col3. I want to create a unique ID if at least ONE of the columns has the same value. For example, if col1 equals A in two instances, the ID should be the same regardless of the values of col2 and col3. The same goes for the other columns, so if col2 equals B, the identifier should be the same regardless of the values of col1 or col3.
This is the expected result.
ID col1 col2 col3
1  A    F    G
1  A    T    Y
2  B    E    U
2  T    E    O
3  H    Y    U
3  H    B    L
3  P    B    P
I've tried using the DENSE_RANK function, but it considers the repeated values in all columns.
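This is essentially a connected-components problem (rows are linked whenever they share a value in the same column), which a single window function cannot express. As a sketch of the idea in pandas rather than SQL, using a small union-find over row positions (the frame construction mirrors the sample data); note that linking on all three columns merges transitively connected rows, e.g. through a shared col3 value, which may group more rows together than the sample output shows:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'T', 'H', 'H', 'P'],
                   'col2': ['F', 'T', 'E', 'E', 'Y', 'B', 'B'],
                   'col3': ['G', 'Y', 'U', 'O', 'U', 'L', 'P']})

# union-find over row positions
parent = list(range(len(df)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# link all rows that share a value within the same column
for col in df.columns:
    for positions in df.groupby(col).indices.values():
        for j in positions[1:]:
            union(positions[0], j)

roots = [find(i) for i in range(len(df))]
df['ID'] = pd.factorize(roots)[0] + 1  # dense 1-based group ids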

Exclude rows distinct except for null values

I'm trying to write a query that will return distinct rows while excluding rows that don't have the maximum data (i.e. rows whose non-null values all appear, in the same columns, in a fuller row).
table1
col1  col2 col3 col4 col5
one   a    b    c    d
two   a    b         d
three a    b    c
four  a         c    d
five  a    b
six   a    c
seven a    e
Basically, I want a query that will return the following from the table above
col1  col2 col3 col4 col5
one   a    b    c    d
six   a    c
seven a    e
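No answer is shown here, but as a sketch of the idea in pandas rather than SQL (assuming the table is loaded with NaN for the blanks, as below): a row is dropped when some other row carries all of its non-null values in the same columns plus at least one more.
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
                   'col2': ['a'] * 7,
                   'col3': ['b', 'b', 'b', np.nan, 'b', 'c', 'e'],
                   'col4': ['c', np.nan, 'c', 'c', np.nan, np.nan, np.nan],
                   'col5': ['d', 'd', np.nan, 'd', np.nan, np.nan, np.nan]})
values = df.drop(columns='col1')

def dominated(i, j):
    """True if row i's non-null values all appear in row j (same columns)
    and row j holds strictly more data."""
    a, b = values.loc[i], values.loc[j]
    mask = a.notna()
    return bool((a[mask] == b[mask]).all() and b.notna().sum() > mask.sum())

keep = [i for i in df.index
        if not any(dominated(i, j) for j in df.index if j != i)]
result = df.loc[keep]  # rows one, six, seven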

search values in dataframe and export to new column

I have a large dataset (about 1M rows).
Within this dataset, I want to find some values in one of the columns (or in multiple columns).
For example, df contains:
col1 col2 col3
--------------
a    b    c
d    e    f
g    h    i
j    k    l
m    n    o
What I'm looking for is to search each row and, if the given value exists, output a "YES" in a new col4.
Any help?
Thanks
Scenario 1: search the whole dataframe
We can use DataFrame.eq with any over the column axis (axis=1), i.e. per row. This means that if the value 'a' is in any of the columns of a row, we get True:
df['indicator'] = df.eq('a').any(axis=1)
  col1 col2 col3  indicator
0    a    b    c       True
1    d    e    f      False
2    g    h    i      False
3    j    k    l      False
4    m    n    o      False
Scenario 2: search some columns
We can apply the same logic to a subselection of columns by using iloc to select the first two columns:
df['indicator'] = df.iloc[:, :2].eq('d').any(axis=1)
  col1 col2 col3  indicator
0    a    b    c      False
1    d    e    f       True
2    g    h    i      False
3    j    k    l      False
4    m    n    o      False
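If the literal string "YES" is wanted in a new col4, as the question asks, the boolean indicator can be mapped (a minimal sketch; the empty-string fallback for non-matches is an assumption):
import numpy as np

df['col4'] = np.where(df.eq('a').any(axis=1), 'YES', '')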

Pandas if condition on multiple columns

I have a dataframe
col1 col2 col3 col4
A    F    F    F
B    F    A    B
C    B    A    C
D    S    A    F
I want to say: if A and F appear in any of these columns, then make a new column and enter "Found":
col1 col2 col3 col4 output
A    F    F    F    Found
B    F    A    B    Found
C    B    A    C    0
D    S    A    F    Found
Use:
import numpy as np

df['output'] = np.where(df.eq('A').any(axis=1) & df.eq('F').any(axis=1), 'Found', 0)
Another approach:
df['output'] = (df.eq('A').any(axis=1) & df.eq('F').any(axis=1)).map({True: 'Found', False: 0})
Output:
  col1 col2 col3 col4 output
0    A    F    F    F  Found
1    B    F    A    B  Found
2    C    B    A    C      0
3    D    S    A    F  Found
Try this:
df.loc[df.apply(lambda x: (x == 'F').any() and (x == 'A').any(), axis=1), 'output'] = 'Found'
df = df.fillna(0)
You can use pd.DataFrame.where() to mask everything that is not A or F, though note this keeps rows containing at least one of the two values, not necessarily both:
df.where(lambda x: (x == 'A') | (x == 'F')).dropna(thresh=1)
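Another option that does require both values in the same row is plain set containment (a sketch; the 0 fallback matches the expected output, and the explicit column list avoids re-reading a previously added output column):
df['output'] = df[['col1', 'col2', 'col3', 'col4']].apply(
    lambda row: 'Found' if {'A', 'F'} <= set(row) else 0, axis=1)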

How to extract non-grouped column data?

I have a table A with following data:
A:
colA colB
a    x
b    x
c    y
d    y
e    z
f    z
I want the output as:
colA colA_1
a    b
c    d
e    f
I.e. I want to group the data based on colB and fetch the values from colA. I know that the same value will appear exactly twice in colB.
What I am trying to do is:
SELECT a1.colA, a2.colA
FROM A a1
JOIN A a2
  ON a1.colA != a2.colA AND a1.colB = a2.colB;
But this gives the output as:
colA colA_1
a    b
b    a
c    d
d    c
e    f
f    e
How can I fix this to get the desired output?
No need to join, simply do a GROUP BY:
SELECT MIN(colA) AS colA, MAX(colA) AS colA_1
FROM A
GROUP BY colB;
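For reference, the same reshaping in pandas rather than SQL (a sketch; like the SQL answer, it relies on each colB value appearing exactly twice, as the question states):
import pandas as pd

df = pd.DataFrame({'colA': list('abcdef'), 'colB': list('xxyyzz')})

# one row per colB group: the smaller and the larger colA value
out = (df.groupby('colB')['colA'].agg(['min', 'max'])
         .rename(columns={'min': 'colA', 'max': 'colA_1'})
         .reset_index(drop=True))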