search values in dataframe and export to new column - pandas

I have a large dataset (about 1M rows).
Within this dataset, I want to find certain values in one column (or in multiple columns).
For example, df contains:
col1 col2 col3
-------------------
a b c
d e f
g h i
j k l
m n o
What I'm looking for is to search each row and, if the given value exists, output a "YES" in a new col4.
Any help?
Thanks

Scenario 1: search whole dataframe
We can use DataFrame.eq with any over the column axis, i.e. per row. This means that if the value a is in any of the columns of a row, we get True:
df['indicator'] = df.eq('a').any(axis=1)
col1 col2 col3 indicator
0 a b c True
1 d e f False
2 g h i False
3 j k l False
4 m n o False
Scenario 2: for some columns:
We can apply the same logic to a subselection of columns if we use iloc to select the first two columns:
df['indicator'] = df.iloc[:, :2].eq('d').any(axis=1)
col1 col2 col3 indicator
0 a b c False
1 d e f True
2 g h i False
3 j k l False
4 m n o False
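Since the question asks for a literal "YES" in a new col4, the boolean indicator can be mapped to strings. A minimal sketch (the search value 'a' and the "NO" fallback label are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'col1': list('adgjm'),
                   'col2': list('behkn'),
                   'col3': list('cfilo')})

# Row-wise membership test, mapped to the requested labels
df['col4'] = np.where(df.eq('a').any(axis=1), 'YES', 'NO')
```

On a large frame (~1M rows) this stays fully vectorized, so it should scale well.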

Related

SQL - Add ID column for rows if at least one column have the same value

I have a table with three columns: col1, col2, col3. I want to create a unique ID if at least ONE of the columns has the same value. For example, if col1 equals A in two instances, regardless of the values of col2 and col3, the ID should be the same. The same goes for the other columns, so if col2 equals B, the identifier should be the same regardless of the values of col1 or col3.
This is the expected result:
ID col1 col2 col3
1  A    F    G
1  A    T    Y
2  B    E    U
2  T    E    O
3  H    Y    U
3  H    B    L
3  P    B    P
I've tried using the DENSE_RANK function, but it considers the repeated values across all columns.

Azure data factory - collapse row to one column

I have below input
A C
1 X
1 Y
1 Z
2 D
2 E
2 F
where A & C are headers. I want to collapse C into one column, say "B", like below:
A B
1 X,Y,Z
2 D,E,F
How can we achieve this in Azure Data Factory?
Try with string_agg:
select
A,
string_agg(C, ', ') as B
from myTable
group by
A

Finding occurrences by comparing 2 columns in a dataframe

This is my dataframe:
import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
     'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
     'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H'],
     'data': ['abc', 'def', 'ghk', 'lmn', 'opq', 'rst', 'uvw', 'xyz']}
df = pd.DataFrame(d)
I want to find all values in col2 for each unique value in col1. Think of col1 as being a house and col2 as the devices in it.
Output:
col1 col2 data
A C abc
def
D ghk
B E lmn
F opq
rst
C G uvw
D H xyz
Update:
Since I have a large number of rows in my original dataset (98k rows), it would be great if I could get a list of values from col1 which have more than one row in col2. Based on my output, I would need a list with the values ['A', 'B'].
If you insist on getting exactly that output, here's one way:
import numpy as np

df = df.drop_duplicates(subset=['col1', 'col2']).drop('id', axis=1).reset_index(drop=True)
df['col1'] = np.where(df.col1.duplicated(), '', df.col1)
Which produces:
  col1 col2 data
0    A    C  abc
1         D  ghk
2    B    E  lmn
3         F  opq
4    C    G  uvw
5    D    H  xyz
You might even want to go as far as:
df = df.set_index('col1')
Which produces:
     col2 data
col1
A       C  abc
        D  ghk
B       E  lmn
        F  opq
C       G  uvw
D       H  xyz
To export to csv or excel simply do one of the following:
df.to_csv('filename.csv')
df.to_excel('filename.xlsx')
UPDATE: Based on the update in the question, the list of values from col1 can be obtained as follows:
list(df.groupby('col1').col1.filter(lambda x: len(x)>1).unique())
Which produces:
['A', 'B']
Try groupby() with an application of unique():
In [26]: df.groupby('col1').col2.unique()
Out[26]:
col1
A    [C, D]
B    [E, F]
C       [G]
D       [H]
Name: col2, dtype: object
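For the updated requirement (values of col1 with more than one distinct col2), a nunique-based sketch may be simpler than filtering rows:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                   'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H'],
                   'data': ['abc', 'def', 'ghk', 'lmn', 'opq', 'rst', 'uvw', 'xyz']})

# Count distinct col2 values per col1 and keep the keys with more than one
counts = df.groupby('col1')['col2'].nunique()
result = counts[counts > 1].index.tolist()  # ['A', 'B']
```

Unlike len(x) > 1, nunique ignores the duplicated (A, C) rows, so it counts distinct devices rather than rows.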

Pandas: condition on multiple columns

I have a dataframe
col1 col2 col3 col4
A F F F
B F A B
C B A C
D S A F
I want to say: if both A and F are in any of these columns, then make a new column and enter "Found":
col1 col2 col3 col4 output
A F F F Found
B F A B Found
C B A C 0
D S A F Found
Use:
df['output'] = np.where(df.eq('A').any(axis=1) & df.eq('F').any(axis=1), 'Found', 0)
Another approach:
df['output'] = (df.eq('A').any(axis=1) & df.eq('F').any(axis=1)).map({True: 'Found', False: 0})
Output:
col1 col2 col3 col4 output
0 A F F F Found
1 B F A B Found
2 C B A C 0
3 D S A F Found
Try this:
df.loc[df.apply(lambda x: (x == 'F').any() and (x == 'A').any(), axis=1), 'output'] = 'Found'
df = df.fillna(0)
You can use pd.DataFrame.where() to keep only the matching values and then drop rows without any match:
df.where(lambda x: (x == 'A') | (x == 'F')).dropna(thresh=1)
Note that this keeps rows containing either value, not only rows containing both.
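If the list of required values grows beyond A and F, one way to generalize (a sketch; `required` is an assumed helper variable, not from the answers above) is to AND together one row mask per value:

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({'col1': list('ABCD'),
                   'col2': list('FFBS'),
                   'col3': list('FAAA'),
                   'col4': list('FBCF')})

required = ['A', 'F']
# One boolean Series per required value; a row qualifies only if all of them hit
mask = np.logical_and.reduce([df.eq(v).any(axis=1) for v in required])
df['output'] = np.where(mask, 'Found', 0)
```

This avoids repeating the df.eq(...).any(axis=1) clause once per value by hand.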

Count duplicate records by using linq

Col1 col2
x a
x a
x b
x b
y c
y c
y d
y d
z e
z e
z f
Now I want the count as follows:
x a 2
x b 2
y c 2
y d 2
Can anyone assist me with this in LINQ?
table
    .GroupBy(x => new { x.col1, x.col2 })
    .Select(g => new { g.Key.col1, g.Key.col2, count = g.Count() });
var result =
    from t in table
    group t by new { t.col1, t.col2 } into gt
    select new
    {
        col1 = gt.Key.col1,
        col2 = gt.Key.col2,
        count = gt.Count(),
    };