If a query returns the table:

col  sub_col
A    A
A    B
A    C
A    A
A    B
A    C
B    A
B    B
B    B
B    C
B    A
B    B
B    B
B    C
Output should be like:

col  sub_col  order_by_sub_col
A    A        0
A    B        1
A    C        2
A    A        0
A    B        1
A    C        2
B    A        0
B    B        1
B    B        2
B    C        3
B    A        0
B    B        1
B    B        2
B    C        3
This is my dataframe:

import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
     'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
     'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H'],
     'data': ['abc', 'def', 'ghk', 'lmn', 'opq', 'rst', 'uvw', 'xyz']}
df = pd.DataFrame(d)
I want to find all values in col2 for each unique value in col1. Think of col1 as a house and col2 as the devices in it.
Output:

col1 col2 data
A    C    abc
          def
     D    ghk
B    E    lmn
     F    opq
          rst
C    G    uvw
D    H    xyz
Update:
Since my original dataset is large (98k rows), it would be great if I could also get a list of the values in col1 that have more than one row in col2. Based on my output above, that list would be ['A', 'B'].
If you insist on getting exactly that output, here's one way:
import numpy as np

# Drop duplicate (col1, col2) pairs, drop the id column, and reset the index
df = df.drop_duplicates(subset=['col1', 'col2']).drop('id', axis=1).reset_index(drop=True)
# Blank out repeated col1 values so each group label appears only once
df['col1'] = np.where(df.col1.duplicated(), '', df.col1)
Which produces:

  col1 col2 data
0    A    C  abc
1         D  ghk
2    B    E  lmn
3         F  opq
4    C    G  uvw
5    D    H  xyz
You might even want to go as far as:
df = df.set_index('col1')
Which produces:

     col2 data
col1
A       C  abc
        D  ghk
B       E  lmn
        F  opq
C       G  uvw
D       H  xyz
To export to csv or excel simply do one of the following:
df.to_csv('filename.csv')
df.to_excel('filename.xlsx')
UPDATE: Based on the update in the question, the list of values from col1 can be obtained from the original (unmodified) dataframe as follows:

list(df.groupby('col1').col1.filter(lambda x: len(x) > 1).unique())
Which produces:
['A', 'B']
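If "more than one row" should actually mean more than one distinct col2 value, a nunique-based variant (a sketch on the sample dataframe from the question) avoids counting exact duplicates like the two (A, C) rows:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
     'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
     'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H']}
df = pd.DataFrame(d)

# Number of distinct col2 values for each col1 group
counts = df.groupby('col1')['col2'].nunique()

# Keep only the col1 values with more than one distinct col2
result = counts[counts > 1].index.tolist()
print(result)  # ['A', 'B']
```

On this sample both approaches agree, but they would differ if a group contained several rows with the same col2 value and nothing else.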
Try groupby() with an application of unique():
In [26]: df.groupby('col1').col2.unique()
Out[26]:
col1
A    [C, D]
B    [E, F]
C       [G]
D       [H]
Name: col2, dtype: object
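If a plain Python structure is more convenient downstream than a Series of arrays, the result converts to a dict of lists (a sketch on the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                   'col2': ['C', 'C', 'D', 'E', 'F', 'F', 'G', 'H']})

# Map each col1 value to the list of distinct col2 values in its group
mapping = df.groupby('col1')['col2'].unique().apply(list).to_dict()
print(mapping)  # {'A': ['C', 'D'], 'B': ['E', 'F'], 'C': ['G'], 'D': ['H']}
```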
I'm trying to write a query that will return distinct rows while excluding rows that don't have maximum data.
table1
col1 col2 col3 col4 col5
one a b c d
two a b d
three a b c
four a c d
five a b
six a c
seven a e
Basically, I want a query that will return the following from the table above
col1 col2 col3 col4 col5
one a b c d
six a c
seven a e
I have a large dataset (about 1M rows).
Within this dataset, I want to find certain values in one of the columns (or in multiple columns).
For example, df contains:
col1 col2 col3
-------------------
a b c
d e f
g h i
j k l
m n o
What I'm looking for is to search each row and, if the given value exists, output a "YES" in a new col4.
Any help?
Thanks
Scenario 1: search the whole dataframe
We can use DataFrame.eq with any over the column axis, i.e. per row. This means that if the value a is in any of the columns of a row, we get True:
df['indicator'] = df.eq('a').any(axis=1)
col1 col2 col3 indicator
0 a b c True
1 d e f False
2 g h i False
3 j k l False
4 m n o False
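Since the question asked for a literal "YES" in a new column, the boolean indicator can be mapped to strings with numpy.where (a sketch; the 'NO' label for non-matches is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('adgjm'),
                   'col2': list('behkn'),
                   'col3': list('cfilo')})

# 'YES' where any cell in the row equals 'a', 'NO' everywhere else
df['col4'] = np.where(df.eq('a').any(axis=1), 'YES', 'NO')
print(df['col4'].tolist())  # ['YES', 'NO', 'NO', 'NO', 'NO']
```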
Scenario 2: search only some columns
We can apply the same logic to a sub-selection of columns by using iloc to select, for example, the first two columns:
df['indicator'] = df.iloc[:, :2].eq('d').any(axis=1)
col1 col2 col3 indicator
0 a b c False
1 d e f True
2 g h i False
3 j k l False
4 m n o False
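If several candidate values need to be checked at once, DataFrame.isin follows the same pattern (a sketch; 'a' and 'e' are arbitrary example values):

```python
import pandas as pd

df = pd.DataFrame({'col1': list('adgjm'),
                   'col2': list('behkn'),
                   'col3': list('cfilo')})

# True if any cell in the row matches one of the candidate values
df['indicator'] = df.isin(['a', 'e']).any(axis=1)
print(df['indicator'].tolist())  # [True, True, False, False, False]
```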
Count duplicate records by using linq
Col1 col2
x a
x a
x b
x b
y c
y c
y d
y d
z e
z e
z f
Now I want the counts, like the following:
x a 2
x b 2
y c 2
y d 2
Can anyone assist me with the LINQ, please? My attempt:
table
    .GroupBy(x => new { x.col1, x.col2 })
    .Select(g => new { g.Key.col1, g.Key.col2, Count = g.Count() });
var Result =
from t in table
group t by new
{
t.col1,
t.col2,
} into gt
select new
{
col1 = gt.Key.col1,
col2 = gt.Key.col2,
count = gt.Count(),
};