grouping and summarizing the rows in a big text file using awk - awk

I have a big text file like this example:
example:
chr11 314980 314981 63 IFITM1 -131
chr11 315025 315026 54 IFITM1 -86
chr5 315085 315086 118 AHRR -53011
chr16 316087 316088 56 ITFG3 -86
chr16 316088 316089 90 ITFG3 -131
chr11 319672 319673 213 IFITM3 -131
chr11 319674 319675 514 IFITM3 -164
I want to group the rows based on the 6th column and sum the values from the 4th column for every group. The new file would have 2 columns: the 1st column would be the group and the 2nd column would be the sum of the values from column 4 for that group. The expected output would look like this:
expected output:
-131 366
-86 110
-53011 118
-164 514
I am trying to do that in awk with the following code:
sort myfile.txt | awk -F'\t' '{ sub(/..$/,"**",$6) }1' OFS='\t' | awk '{print $1 "\t" $2}' > outfile.txt
but it returns an empty file. Do you know how to fix it?

I have no idea what you are thinking with your code: why are you replacing the last 2 characters of $6 with asterisks? Why aren't you doing any addition anywhere? Why do you sort (by column 1) first?
awk -F'\t' '
    { sum[$6] += $4 }
    END { for (key in sum) { print key, sum[key] } }
' file | column -t
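On the sample above (assuming the file is tab-separated, as the -F'\t' in the question implies), this prints one line per group. Note that for (key in sum) visits the keys in no particular order, so the lines can come out in any order, e.g.:
-131    366
-86     110
-53011  118
-164    514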

Use an associative array:
awk '{a[$NF]+=$4}END{for (i in a){print i, a[i]}}' file

If you're ok with sorted output, you don't need arrays:
sort -k6n file |
awk -F'\t' '
    grp != $6 {
        grp = $6
        printf "%s%s%s%s", sum, sep, grp, FS
        sum = 0
        sep = ORS
    }
    { sum += $4 }
    END { print sum }'
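Run against the sample data (again assuming tab-separated input), the groups come out in numeric order of column 6:
-53011  118
-164    514
-131    366
-86     110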

Related

how to fetch the data of column from txt file without column name in unix?

This is student.txt file:
RollNo|Name|Marks
123|Raghu|80
342|Maya|45
561|Gita|56
480|Mohan|71
I want to fetch data from the "Marks" column, and I used this awk command:
awk -F "|" '{print $3}' student.txt
and it gives output like this:
Marks
80
45
56
71
This output includes the column name "Marks", but I only want to fetch the data and show the output like this:
80
45
56
71
Add a condition to your awk script to print the third field only if the input record number is greater than 1:
awk -F'|' 'FNR>1{print $3}' student.txt
Could you please try the following. It does NOT hard-code the field number for the Marks keyword; it looks for that string in the header and prints only the column that sits under Marks, so it will work even if your Marks column is in a different field position. Written and tested at https://ideone.com/Ufq5E2.
awk '
BEGIN{ FS="|" }
FNR==1{
    for(i=1;i<=NF;i++){
        if($i=="Marks"){ field=i }
    }
    next
}
{
    print $field
}
' Input_file
To fetch data from the "Marks" column using awk:
$ awk -F\| '
FNR==1 {                    # header record
    for(i=1;i<=NF;i++)
        if($i=="Marks")
            next            # found it; i keeps its value for the data records
    exit                    # no Marks column at all
}
{
    print $i                # print the remembered column
}' file
80
45
56
71
Another example:
awk -F'|' 'NR != 1 {print $3}' input_file
A non-awk alternative:
$ sed 1d file | cut -d\| -f3
80
45
56
71

shuffle and edit columns in awk

I want to swap the columns so that column 1 moves to column 2, keeping only the part after "-" from column 1.
I tried to shuffle with
awk '{print $2,$1}'
in:
#9-297 TACCTGAGGTAGTAGGTTGTATAGTTCCTC
#10-276 CACAGCGTTGGTGGTATAGTGGTTAGCCACC
out:
TACCTGAGGTAGTAGGTTGTATAGTTCCTC 297
CACAGCGTTGGTGGTATAGTGGTTAGCCACC 276
You can split the first column to an array on the delimiter "-", then print out the part of the array you want:
awk '{split($1, a, "-");print $2, a[2]}' yourfile.txt
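On the sample input this prints:
TACCTGAGGTAGTAGGTTGTATAGTTCCTC 297
CACAGCGTTGGTGGTATAGTGGTTAGCCACC 276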
$ awk -F'[-[:space:]]+' '{print $3, $2}' file
TACCTGAGGTAGTAGGTTGTATAGTTCCTC 297
CACAGCGTTGGTGGTATAGTGGTTAGCCACC 276

How to find the position of a word in a list of strings to use in awk?

Morning guys,
I often have files that I want to grep+awk but that have a lot of fields.
I'm interested in one field in particular (so I'd like to awk '{print $i}'), but how can I find out which position (i.e. "i" here) my column is at, other than counting it manually?
With files of around 70 fields, I'd be saving a lot of time! :)
Thanks a lot,
[Edit]
Following Ian McGowan's suggestion, I'll look for the column number in the file's header:
head -1 myFile | awk '{for (i=1; i<=NF; i++) printf("%3d=%s\n", i, $i); }' | grep -i <the_column_Im_looking_for>
Thanks everyone,
Will1v
There was no sample in the question, so here is a made-up one:
$ cat > file
this is
the sample
$ awk '{
    for(i=1;i<=NF;i++)
        if($i=="sample")
            print NR,i
}' file
2 2
I do this all the time when trying to profile some large text delimited file.
$ head -4 myfile
4A 1 321 537 513.30
4B 0.00
8 592 846 905.66
9B2 39 887 658.77
Transpose or pivot by looping over the columns/fields:
$ awk '{ for (i=1; i<=NF; i++) printf("%4d %3d=%s\n", NR, i, $i); }' < myfile
1 1=4A
1 2=1
1 3=321
1 4=537
1 5=513.30
2 1=4B
2 2=0.00
3 1=8
3 2=592
3 3=846
3 4=905.66
4 1=9B2
4 2=39
4 3=887
4 4=658.77
You can change the format mask of the printf to something like printf("row=%-4d col=%-3d:%s\n", NR, i, $i); and then grep for just the data you care about to find out its column, or, if you already know the column, grep for col=44 to get the 44th column.
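As a hedged sketch of that second use (the file name and column number are just placeholders; the trailing space in the pattern keeps it from matching col=440 and up):
$ awk '{ for (i=1; i<=NF; i++) printf("row=%-4d col=%-3d:%s\n", NR, i, $i) }' myfile | grep 'col=44 '
Each matching line then shows the row number together with the value found in column 44.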
xargs -n1 will print the columns one per line, so you can do this:
head -1 file | xargs -n1 | grep -n "column_name"
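For example, with a hypothetical whitespace-separated header, the column number comes back as the grep line number:
$ echo "id name marks date" | xargs -n1 | grep -n "marks"
3:marks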

awk to print unique latest date & time lines based on column fields

I would like to print unique lines based on the first field, keeping only the occurrence with the latest date and time in the third field and removing the other duplicate occurrences.
The file has around 50 million rows and is not sorted ...
Input.csv
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
Desired Output:
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
I have attempted partial commands, but they are incomplete because of the date and time format of the file and its unsorted order ...
awk -F, '!seen[$1,$3]++' Input.csv
Looking for your suggestions ...
this awk command will do it for you:
awk -F, -v OFS=',' '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] {b[$1] =d; a[$1] = $0}
END{for(x in a)print a[x]}' file
The first line transforms the original $3 into a valid date string and gets the seconds since 1970 via the date command, so that we can compare timestamps later.
The two arrays a and b hold, per key, the final row and the latest date (in seconds).
The END block prints all rows from a.
test with your example data:
kent$ cat f
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] { b[$1] =d;a[$1] = $0 }
END{for(x in a)print a[x]}' f
10 ab 25-SEP-14 08:09:26 abc xxx yyy zzz
20 ab 23-SEP-14 08:09:35 abc xxx yyy zzz
58 ab 22-JUL-14 05:07:07 abc xxx yyy zzz
62 ab 12-SEP-14 03:09:23 abc xxx yyy zzz
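With 50 million rows, piping out to the external date command once per line will be slow. If GNU awk is available (an assumption), here is a sketch of the same idea that does the timestamp conversion inside awk with mktime():
awk -F, '
BEGIN {
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = i          # month name -> number
}
{
    # 25-SEP-14.08:09:26 -> "2014 09 25 08 09 26" -> epoch seconds (assumes 20xx years)
    split($3, t, /[-.:]/)
    d = mktime(sprintf("20%s %02d %s %s %s %s", t[3], mon[t[2]], t[1], t[4], t[5], t[6]))
    if (!($1 in best) || d > best[$1]) { best[$1] = d; line[$1] = $0 }
}
END { for (k in line) print line[k] }' Input.csv
Because $3 is only split into an array rather than rewritten, the output keeps the original comma-separated format.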
This should do it for the sample data (note, though, that sorting the DD-MON-YY.HH:MM:SS strings lexically with sort -k 3 is not a true chronological sort):
sort -t , -k 3 file | awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}'
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz

How to subtract a constant number from a column

Is there a way to subtract the smallest value from all the values of a column? I need to subtract the first number in the 1st column from all other numbers in the first column.
I wrote this script, but it's not giving the right result:
$ awk '{$1 = $1 - 1280449530}' file
1280449530 452
1280449531 2434
1280449531 2681
1280449531 2946
1280449531 1626
1280449532 3217
1280449532 4764
1280449532 4501
1280449532 3372
1280449533 4129
1280449533 6937
1280449533 6423
1280449533 4818
1280449534 4850
1280449534 8980
1280449534 8078
1280449534 6788
1280449535 5587
1280449535 10879
1280449535 9920
1280449535 8146
1280449536 6324
1280449536 12860
1280449536 11612
What you have essentially works, you're just not outputting it. This will output what you want:
awk '{print ($1 - 1280449530) " " $2}' file
You can also be slightly cleverer and not hardcode the shift amount:
awk '{
    if (NR == 1) {
        shift = $1
    }
    print ($1 - shift) " " $2
}' file
You were on the right track:
awk '{$1 = $1 - 1280449530; print}' file
Here is a simplified version of Michael's second example:
awk 'NR == 1 {origin = $1} {$1 = $1 - origin; print}' file
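Run on the sample data above, the first few lines of output would be:
0 452
1 2434
1 2681
1 2946
1 1626
2 3217
and so on, with each first-column value shifted down by 1280449530.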
A bash shell script alternative:
#!/bin/bash
exec 4<"file"
read -r col1 col2 <&4                    # first line; col1 is the value to subtract
while read -r n1 n2 <&4
do
    echo $((n1 - col1))
    # echo "scale=2; $n1 - $col1" | bc   # use bc if you need to handle decimals
done
exec 4<&-                                # close file descriptor 4
In vim you can select the column with Ctrl-v (blockwise visual mode), go to the bottom of the page with G, then press e to go to the end of the number. Then you may type a count such as 56 followed by Ctrl-a; this will add 56 to every number in the column (Ctrl-x subtracts instead).