How to delete repeated rows if column 2 and column 3 match using awk?

I have a file with 4 columns:
ifile.txt
3 5 2 2
1 4 2 1
4 5 7 2
5 5 7 1
0 0 1 1
I would like to delete the repeated rows whose columns 2 and 3 are the same. For instance, rows 3 and 4 have the same values in columns 2 and 3, so I want to keep the 3rd row and delete the 4th row. My output is:
ofile.txt
3 5 2 2
1 4 2 1
4 5 7 2
0 0 1 1

This two-pass version removes every row whose (column 2, column 3) pair is duplicated, so it also drops the first occurrence (the "4 5 7 2" row that the desired output keeps):
awk 'NR==FNR{a[$2,$3]++;next}a[$2,$3]==1' file file
3 5 2 2
1 4 2 1
0 0 1 1
To keep the first occurrence of each pair instead, as the question asks, test and increment in a single pass:
awk '!a[$2,$3]++' file
GNU awk

awk '{a[NR]=$2 SUBSEP $3} a[NR]!=a[NR-1]{print}' file
Save the $2 and $3 values into array a, indexed by NR. If the value for the current line differs from the previous line's, print the line; otherwise ignore it. Note that this removes only consecutive duplicates, and joining the fields with SUBSEP (rather than the bare concatenation $2""$3) keeps pairs like (1, 23) and (12, 3) from colliding.
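For comparison, here is the same keep-first-occurrence logic as a small Python sketch (a hypothetical stand-in for the awk one-liners, assuming the input file is named ifile.txt as in the question):
# keep the first row seen for each (column 2, column 3) pair
seen = set()
with open('ifile.txt') as f:
    for line in f:
        fields = line.split()
        key = (fields[1], fields[2])   # columns 2 and 3 (0-based indices 1, 2)
        if key not in seen:
            seen.add(key)
            print(line, end='')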

Related

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 are duplicated, the row with a B value of 0 is duplicated twice, and the rest have one of their two rows duplicated at random.
Here is an example of the desired output
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby with apply and sample on each group. The x of the lambda is each group of B, so I use repeats - x.shape[0] to find the number of rows to create. Some B groups may already have 3 or more rows, so I use np.clip to force negative values to 0; sampling 0 rows is the same as ignoring the group. Finally, reset_index and concatenate the result back onto df.
import numpy as np
import pandas as pd

repeats = 3
# for each B group, sample max(repeats - group size, 0) extra rows with replacement
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, None)),
                                   replace=True))
         .reset_index(drop=True))
# DataFrame.append was removed in pandas 2.0; concat does the same job
df_final = pd.concat([df, df1]).reset_index(drop=True)
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
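For reference, the example frame can be rebuilt like this (seeding numpy so a run is repeatable; which duplicates get picked is still random, so the exact rows will differ from the listing above):
import numpy as np
import pandas as pd

np.random.seed(0)  # make the sampling reproducible
df = pd.DataFrame({'A': [1, 4, 3, 3, 3, 2, 4, 2, 4, 5],
                   'B': [5, 2, 5, 3, 2, 0, 5, 3, 1, 1]})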

Comparing two dataframes and outputting the index of each duplicated row once

I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 2 2 2 2 2 2
5 5 5 5 5 5 5
6 1 1 1 1 1 1
7 6 6 6 6 6 6
The second dataframe is
df_2 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
May I know if there is a way (without using a for loop) to find the indices of the rows of df_1 that have the same row values as df_2? In the example above, my expected output is below
index =
0
1
2
3
5
7
The "index" output above should have as many rows as df_2 has.
If the same row of df_2 is repeated in df_1 more than once, I only need the index of its first appearance; that's why I don't need indices 4 and 6.
Please help. Thank you so much!
Tommy
Use DataFrame.merge with DataFrame.drop_duplicates, after DataFrame.reset_index has converted the index to a column so the index values aren't lost in the merge; last, select the column called index:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0 0
1 1
2 2
3 3
4 5
5 7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
0 1 2 3 4 5 index
0 1 1 1 1 1 1 0
1 2 2 2 2 2 2 1
2 3 3 3 3 3 3 2
3 4 4 4 4 4 4 3
4 5 5 5 5 5 5 5
5 6 6 6 6 6 6 7
Check the solution
df1 = pd.DataFrame({'0':[1,2,3,4,2,5,1,6],
                    '1':[1,2,3,4,2,5,1,6],
                    '2':[1,2,3,4,2,5,1,6],
                    '3':[1,2,3,4,2,5,1,6],
                    '4':[1,2,3,4,2,5,1,6],
                    '5':[1,2,3,4,2,5,1,6]})
df2 = pd.DataFrame({'0':[1,2,3,4,5,6],
                    '1':[1,2,3,4,5,66],
                    '2':[1,2,3,4,5,6],
                    '3':[1,2,3,4,5,66],
                    '4':[1,2,3,4,5,6],
                    '5':[1,2,3,4,5,6]})
df1[df1.isin(df2)].index.values.tolist()
Output
[0, 1, 2, 3, 4, 5, 6, 7]
Note that this returns every index of df1: df1.isin(df2) is an elementwise check, and indexing with a boolean frame only masks cell values (to NaN), it never drops rows, so the index list is always the full index. The merge approach above is the row-wise comparison you want.
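A self-contained run of the accepted merge approach, rebuilding both frames from the question (the list-comprehension constructor below is just one convenient way to write them):
import pandas as pd

df_1 = pd.DataFrame([[i] * 6 for i in [1, 2, 3, 4, 2, 5, 1, 6]])
df_2 = pd.DataFrame([[i] * 6 for i in [1, 2, 3, 4, 5, 6]])

# drop repeated rows first so each df_2 row matches at most one df_1 row
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print(s.tolist())   # [0, 1, 2, 3, 5, 7]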

which rows are duplicates of each other

I have got a data frame with a lot of columns. Some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False for a mask of all dupes, then filter by boolean indexing, sort with DataFrame.sort_values, and join together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
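To try it end to end, the example frame can be rebuilt like this (column values copied from the question):
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
                   'A':  [1, 2, 1, 1, 2, 5],
                   'B':  [2, 3, 4, 2, 3, 6],
                   'C':  [0, 4, 8, 3, 5, 2]})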

Calculate the total value for each group using a Calculated Column in Spotfire

I have a problem with a sum calculation over rows using a calculated column in Spotfire.
For example, the raw data is as below. The raw table is ordered by id; for each type, the state sequence is 2, 3, 0.
id type value state
1 1 12 2
2 1 7 3
3 1 10 0
4 2 11 2
5 2 6 3
6 3 9 0
7 3 7 2
8 3 5 3
9 2 9 0
10 1 7 2
11 1 3 3
12 1 2 0
For each cycle of states (2, 3, 0), I want to sum the value. The result would then be:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
6 3 7 2
7 3 5 3
8 3 9 0 21
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
Note: only the rows whose state is 0 get the sum value. I think it is easier to see the rule when we order by type:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
6 3 7 2
7 3 5 3
8 3 9 0 21
thanks for your time and help!
Here is a solution for you.
1. Insert a Calculated Column RowId() and name it RowId
2. Insert a Calculated Column If(Mod([RowId],3)=0,[RowId] / 3,Ceiling([RowId] / 3)) and name it Groups
3. Insert a Calculated Column Sum([value]) OVER ([Groups]) and name it RunningSum
4. Insert a Calculated Column If([state] = 0,[RunningSum]) and name it OnlyState=0
The only thing that really needs explaining here is step 2. With the data sorted as in your example, the last row of each cycle has a RowId divisible by 3. We have to group on RowId rather than on type, since your type field can span multiple cycles. RowId 3, 6, 9, 12 and so on have a modulus of 0, marking the last row of each set, and for those we simply take RowId / 3, which yields groups 1, 2, 3, 4 and so on. For the rows that aren't divisible by 3, Ceiling rounds RowId / 3 up to the nearest whole number, i.e. to the group number of the last row in their set.
The last calculated column is the only way I know how to get ONLY the values you care about. If you use the If [state] = 0 logic anywhere else, you negate all other rows.
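Outside Spotfire, the same bucketing can be sketched in pandas (a hypothetical equivalent, with data taken from the raw table above; like the answer, it assumes rows arrive in fixed cycles of three):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 13),
                   'value': [12, 7, 10, 11, 6, 9, 7, 5, 9, 7, 3, 2],
                   'state': [2, 3, 0] * 4})
df['Groups'] = np.ceil(df['id'] / 3).astype(int)              # same as the Mod/Ceiling expression
df['RunningSum'] = df.groupby('Groups')['value'].transform('sum')
df['cycle_time'] = df['RunningSum'].where(df['state'] == 0)   # blank unless state is 0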

AWK - removal, column by column, of fields equal to the first row's value

I have a file1:
6
3
6
9
2
6
This command prints the result:
awk 'NR==1{a=$1};$0!=a' file1
3
9
2
Now I have file2:
6 1 2 3 4 5
3 3 4 4 4 6
6 5 2 2 5 1
9 1 3 5 4 1
2 5 6 4 8 5
6 1 5 2 3 1
I want to do the same thing, but with file2. I want to print out the result:
3 3 4 4 5 6
9 5 3 2 8 1
2 5 6 5 3 1
5 4 1
2
I want to do it in awk. Thank you for your help.
AWK is not really suited for what you are trying to do, since it is made for processing rows one at a time, while you are trying to shift numbers up and down between different rows. That said, this monster should do what you want:
awk '
NR==1 { nc = NF; for (i = 1; i <= nc; i++) a[i] = $i }  # remember row 1, column by column
{
    for (i = 1; i <= nc; i++)
        if ($i != a[i]) {             # keep only values that differ from row 1
            v[m[i]++, i] = $i         # m[i] counts survivors in column i
            if (m[i] > nl) nl = m[i]  # nl tracks the longest surviving column
        }
}
END {                                 # print the shortened columns side by side
    for (l = 0; l < nl; l++) {
        for (i = 1; i <= nc; i++)
            if (l < m[i]) printf("%d ", v[l, i]); else printf(" ")
        printf("\n")
    }
}' file2
If, on the other hand, your matrix of numbers had been transposed, this task would have been far simpler:
awk '{for(i=2;i<=NF;i++)if($i!=$1)printf(" %d",$i);printf("\n")}'
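If you would rather avoid the awk monster, here is a rough Python sketch of the same column-wise idea (hypothetical, assuming the input file is named file2 as above): drop, in each column, every value equal to that column's first value, then print the shortened columns side by side.
from itertools import zip_longest

with open('file2') as f:
    rows = [line.split() for line in f]

cols = zip(*rows)                                   # transpose rows -> columns
kept = [[v for v in col if v != col[0]] for col in cols]
for out_row in zip_longest(*kept, fillvalue=' '):   # pad short columns with blanks
    print(' '.join(out_row))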