How do I sum the second column grouped by column 1 in awk? For example, I used the following awk pipeline:

zcat *.gz | awk '{print $1}' |sort| uniq -c | sed 's/^[ ]\+//g' | cut -d' ' -f1 | sort | uniq -c | sort -k1n
I get the following output:
3 648
3 655
3 671
3 673
3 683
3 717
4 18
4 29
4 31
4 34
4 652
5 12
6 24
6 33
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
I don't want repetitions in my 1st column. For example, if my first column is 3, I should add up all the values in the second column that are associated with 3. Thank you.

Solution using arrays in awk:
{
    a[$1] = a[$1] + $2       # accumulate column 2 for each distinct column-1 key
}
END {
    for (i in a)
        printf("%d\t%d\n", i, a[i])
}
Pipe the output through sort -n once more to get it in ascending order:
$ awk -f num.awk numbers | sort -n
3 4047
4 764
5 12
6 57
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
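As a sketch, the same per-key summing can also be done inline, without a separate num.awk file; the sample numbers below are a small illustrative subset:

```shell
# Inline equivalent of num.awk on a tiny illustrative sample
printf '3 648\n3 655\n4 18\n4 29\n' |
awk '{a[$1] += $2} END {for (i in a) printf "%d\t%d\n", i, a[i]}' |
sort -n
```

Since for (i in a) visits keys in an unspecified order, the trailing sort -n is what guarantees ascending output.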

awk 'NF == 1 {c=$1; print $0} NF>1 {if (c==$1) {print "\t" $2} else {c=$1; print $0}}'
can do it, but please note that the indentation may be incorrect, as I simply used a tab (\t) above.
HTH

Related

awk command for string replacement and printing matched and unmatched strings

I want to replace multiple strings (more than a thousand) in File-1 with the matching string from File-2.
File-1:
Geneid Length s1 s2
1_1 6571 7 8
1_2 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
1_5 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
2_4 1056 5 1
File-2 (map):
1_1
1_2 k0002
1_3
1_4
1_5 k0006
2_1
2_2
2_3
2_4 k0528
Expected output:
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
I used the following awk command:
awk '
NR==FNR {
a[$1]=$2
next
}
{
print (($1 in a)?a[$1]:$1, $2, $3, $4)
}' File-2 File-1 > File-3
which gives me this:
Geneid Length s1 s2
6571 7 8
k0002 5041 3 0
1032 7 3
1212 3 5
k0006 1071 3 5
7171 2 7
1038 1 1
9361 0 6
k0528 1056 5 1
How to modify this awk command to keep unmatched strings?
Sorry, I am a newbie to Linux and awk (trying to learn).
The expression ($1 in a)?a[$1]:$1 prints either a[$1] or $1 depending on whether $1 is a key in a. But all your keys are in a, so for example, for the key 1_1, it prints the empty string, which is the value of a["1_1"]. The solution is to only populate a when there is a value to add for the key in $1.
awk 'NR==FNR { if (NF > 1) a[$1]=$2; next }
{ print (($1 in a)?a[$1]:$1, $2, $3, $4) }' File-2 File-1
For debugging a script like yours, it helps to add print statements at various points to see what the script is doing. Here's what I ended up doing to figure out what was wrong with your script.
# STILL BUGGY, DEBUGGING RUN
awk 'NR==FNR { print("a[" $1 "]=" $2); a[$1]=$2; next; }
{ print ($1 in a ? a[$1] : $1), $2, $3, $4, ($1 in a), a[$1], $1, ($1 in a ? "yes" : "no"), "end" }' File-2 File-1
$ awk '
NR==FNR { if (NF>1) a[$1]=$2; next }
$1 in a { $1=a[$1] }
1' file2 file1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
if (NF>1) efficiently ensures you only populate a[] with the values from file2 that you need, i.e. those lines that have a 2nd field.
$1 in a ensures you only change $1 in file1 when an associated entry existed in file2. Do not test a[$1]=="" or anything similar instead, as that would populate a[] for every $1 in file1 and so use up memory and increase execution time.
1 at the end causes the current, possibly just-modified, line from file1 to be printed.
Given that File-2 won't be empty:
awk 'NR==FNR{a[$1]=$2;next}a[$1]!=""{$1=a[$1]}1' File-2 File-1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
If it can be empty, and with GNU awk, you can replace NR==FNR with ARGIND==1 or FILENAME=="File-2".
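As a self-contained sketch of the approach above (the file names map.txt and data.txt, and the two-line samples, are illustrative stand-ins for File-2 and File-1):

```shell
# Build tiny stand-ins for File-2 and File-1, then apply the mapping
printf '1_1\n1_2 k0002\n' > map.txt
printf 'Geneid Length\n1_1 6571\n1_2 5041\n' > data.txt
awk 'NR==FNR { if (NF > 1) a[$1] = $2; next }   # remember only keys that have a value
     $1 in a { $1 = a[$1] }                     # swap the key when a mapping exists
     1' map.txt data.txt
```

The line 1_1 passes through unchanged because it was never stored in a[], while 1_2 is replaced by k0002.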

Transform a 1xA table into a BxC table in awk

I am trying to turn a 1xA table into a BxC table. Let's say A is 15, B is 3 and C is 5, hence after each 5 entries I want it to start a new row in the same table.
I have a rather tedious way that appears to get close, but it misses some values after each 5. I think the issue is with RS: a newline consumes the space that RS needs. I tried changing RS to something else in file.sum, but still no luck. Perhaps there is a better way to do it, but I feel this should work.
awk -v RS=" " '{getline a1; getline a2; getline a3; getline a4; getline a5; print a1,a2,a3,a4,a5}' OFS='\t' file.sum
file.sum (my 1xA):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Expected results (my BxC):
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
Actual results:
1 2 3 4 5
7 8 9 10 11
13 14 15 10 11
This should be one of the simplest solution:
xargs -n5 <file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
To follow up on your awk: I do not like getline, so I always try to avoid it. Loops also slow awk down somewhat.
But using RS=" " you can do it like this:
awk -v RS=" " '{$1=$1} {printf NR%5==0?"%s\n":"%s ",$0}' file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
You can remove the {$1=$1}, but you will then get a blank line at the end.
The NR%5==0 tests whether the record is every 5th one and inserts a newline when needed.
A tab version:
awk -v RS=" " '{$1=$1} {printf NR%5==0?"%s\n":"%s\t",$0}' file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
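The row width can also be made a parameter instead of hardcoding 5; here it is passed with -v n=5 (a sketch, using the same RS=" " idea on inline data with no trailing newline, so {$1=$1} is not needed):

```shell
# Reshape a single space-separated row into rows of n fields
printf '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15' |
awk -v RS=' ' -v n=5 '{printf "%s%s", $0, (NR % n == 0 ? "\n" : " ")}'
```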

Select current and previous line if certain value is found

To solve my problem, I subtract consecutive values of column 3 and create a new column 5 with the differences; then I print the previous and current line whenever the value in column 5 equals 25.
Input file
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
I tried
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' file
output
1 1 35 1 35
2 5 50 1 15
2 6 75 1 25
4 7 85 1 10
5 8 100 1 15
6 9 125 1 25
4 1 200 1 75
Desired Output
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Thanks in advance
You're almost there: in addition to the previous $3, keep the previous $0, and only print when the condition is satisfied.
$ awk '{$5=$3-p3} $5==25{print p0; print} {p0=$0;p3=$3}' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
this can be further golfed to
$ awk '25==($5=$3-p3){print p0; print} {p0=$0;p3=$3}' file
Check whether the newly computed field $5 equals 25. If so, print the previous line and the current line. Save the previous line and the previous $3 for the computations on the next record.
You are close to the answer; just pipe it into another awk and print from there:
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
with Inputs:
$ cat oxxo.txt
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
$ awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
$
Could you please try the following:
awk '$3-prev==25{print line ORS $0,$3} {$(NF+1)=$3-prev;prev=$3;line=$0}' Input_file | column -t
Here's one:
$ awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Explained:
$ awk '{
$5=$3-q # subtract
t=p # previous to temp
p=$0 # store previous for next round
q=$3 # store subtract value for next round
$0=t ORS $0 # prepare record for output
}
$10==25 # output if equals
' file
There is no checking for duplicates, so you might get the same record printed twice. The easiest way to fix that is to pipe the output to uniq.
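The keep-the-previous-line pattern shared by these answers can be sketched on three inline records:

```shell
printf '1 1 35 1\n2 5 50 1\n2 6 75 1\n' |
awk '{$5 = $3 - p3}              # new column: difference of consecutive $3 values
     $5 == 25 {print p0; print}  # on a hit, print the previous and current line
     {p0 = $0; p3 = $3}          # remember this record for the next one'
```

Only the third record yields a difference of 25 (75 - 50), so the second and third lines are printed, each with the computed fifth column appended.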

rearrange columns using awk or cut command

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it does the same with the 1000th column.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give any input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could simply print the fields in the desired order in a loop.
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
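As a sketch of the move-last-to-third idea on a small 5-column tab-separated line (the sub() variant is shown because it works in any POSIX awk, whereas NF-- behavior varies between implementations):

```shell
printf '1\t2\t3\t4\t5\n' |
awk 'BEGIN{FS=OFS="\t"} {$3 = $NF OFS $3; sub(OFS "[^" OFS "]*$", "")}1'
```

The last field is prepended to $3, and the sub() strips the now-duplicated final field along with its leading tab.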
Since many people search for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)

# example usage: cat file | mycut 3 2 1
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    parts = line.split(delimiter)
    print("\t".join([parts[col] for col in columns]))
I am thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would deserve its own page.
A shell wrapper function for awk that uses a simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n
    n="{ print "
    while [ "$1" ]; do
        n="$n\$$1\" \" "
        shift
    done
    n="$n }"
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh

How to sum each individual line, one by one, in Linux?

How do I compute a sum on each individual line in Linux?
I have one file :
Course Name: Math
Credit: 4
12345 1 4 5 1 1 1 1 1 5 10 1 2 2 20
34567 2 3 4 1 10 5 3 2 5 5 10 20 5
Course Name: English
Credit: 4
12345 1 4 5 1 1 1 1 1 5 10 1 20
34567 4 1 10 5 3 2 5 5 10 20 5
Its output will be come:
Course Name: Math
Credit: 4
12345 55
34567 75
Course Name: English
Credit: 4
12345 51
34567 70
I tried this code:
awk '{for (i=2; i<=NF; i++) {tot += $i}; print $1 "\t" tot; tot=0}' file > file2
The output is like this:
Course Name: 0
Credit: 4
12345 55
34567 75
Course Name: 0
Credit: 4
12345 51
34567 70
Actually I need to display the course name too (Math and English). I am trying to fix it, but I couldn't. Can you please help?
Try:
awk '/^[0-9]/{for (i=2; i<=NF; i++) {tot += $i}; print $1 "\t" tot; tot =0} !/^[0-9]/'
This will only sum lines that start with a digit, and simply print those that don't.
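A sketch of this digit-guard on two inline lines (a header line and one record line):

```shell
printf 'Credit: 4\n12345 1 4 5\n' |
awk '/^[0-9]/ {for (i = 2; i <= NF; i++) tot += $i; print $1 "\t" tot; tot = 0}
     !/^[0-9]/'
```

The header does not start with a digit, so it is printed as-is by the second pattern; the record line is reduced to its ID and the sum 1+4+5.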
Just with the shell
while read line; do
case $line in
Course*|Credit*) echo "$line" ;;
*) set -- $line
id=$1
shift 1
sum=$(IFS=+; echo "$*" | bc)
printf "%s\t%d\n" $id $sum
;;
esac
done < filename
This might work for you too!
sed '/:/{s/.*/echo "&"/;b};s/ /+/2g;s/\(\S*\) \(.*\)/echo "\1\t\$((\2))"/' file | sh