zcat *.gz | awk '{print $1}' | sort | uniq -c | sed 's/^[ ]\+//g' | cut -d' ' -f1 | sort | uniq -c | sort -k1n
I get the following output:
3 648
3 655
3 671
3 673
3 683
3 717
4 18
4 29
4 31
4 34
4 652
5 12
6 24
6 33
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
I don't want repetitions in my 1st column. For example, if the first column is 3, I want to add up all the values in the second column that are associated with 3. Thank you.
A solution using arrays in awk:
{
    a[$1] += $2
}
END {
    for (i in a)
        printf("%d\t%d\n", i, a[i])
}
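The same logic can also be run inline as a one-liner; a sketch, here fed a small sample on stdin instead of a file:

```shell
# Sum column 2 per distinct column-1 key, then sort numerically by key
printf '3 648\n3 655\n4 18\n4 29\n' |
awk '{ a[$1] += $2 } END { for (i in a) print i "\t" a[i] }' |
sort -n
# prints:
# 3	1303
# 4	47
```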
Pipe the output through sort -n once more to get it in ascending order:
$ awk -f num.awk numbers | sort -n
3 4047
4 764
5 12
6 57
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
awk 'NF == 1 {c=$1; print $0} NF>1 {if (c==$1) {print "\t" $2} else {c=$1; print $0}}'
can do it too, but note that the indentation may come out wrong, since I used a plain tab (\t) above.
HTH
I want to replace multiple strings (more than thousand) in File-1 with matching string from File-2
File-1:
Geneid Length s1 s2
1_1 6571 7 8
1_2 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
1_5 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
2_4 1056 5 1
File-2 (map):
1_1
1_2 k0002
1_3
1_4
1_5 k0006
2_1
2_2
2_3
2_4 k0528
Expected output:
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
I used the following awk command:
awk '
NR==FNR {
a[$1]=$2
next
}
{
print (($1 in a)?a[$1]:$1, $2, $3, $4)
}' File-2 File-1 > File-3
which gives me this:
Geneid Length s1 s2
6571 7 8
k0002 5041 3 0
1032 7 3
1212 3 5
k0006 1071 3 5
7171 2 7
1038 1 1
9361 0 6
k0528 1056 5 1
How to modify this awk command to keep unmatched strings?
Sorry, I am a newbie to Linux and awk (trying to learn).
The expression ($1 in a)?a[$1]:$1 prints either a[$1] or $1, depending on whether $1 is a key in a. But all your keys are in a, so for example, for the key 1_1, it prints the empty string, which is the value of a["1_1"]. The solution is to populate a only when there is a value to add for the key in $1.
awk 'NR==FNR { if (NF > 1) a[$1]=$2; next }
{ print (($1 in a)?a[$1]:$1, $2, $3, $4) }' File-2 File-1
For debugging a script like yours, it helps to add print statements at various points to see what the script is doing. Here's what I ended up doing to figure out what was wrong with your script.
# STILL BUGGY, DEBUGGING RUN
awk 'NR==FNR { print("a[" $1 "]=" $2); a[$1]=$2; next; }
{ print ($1 in a ? a[$1] : $1), $2, $3, $4, ($1 in a), a[$1], $1, ($1 in a ? "yes" : "no"), "end" }' File-2 File-1
$ awk '
NR==FNR { if (NF>1) a[$1]=$2; next }
$1 in a { $1=a[$1] }
1' file2 file1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
if (NF>1) efficiently ensures you only populate a[] with the values from file2 that you need, i.e. those that have a 2nd field.
$1 in a ensures you only change $1 in file1 when an associated entry existed in file2. Do not test a[$1]=="" or anything similar instead, as that would populate a[] for every $1 in file1 and so use up memory and increase execution time.
1 at the end causes the current, possibly just-modified, line from file1 to be printed.
Given that File-2 won't be empty:
awk 'NR==FNR{a[$1]=$2;next}a[$1]!=""{$1=a[$1]}1' File-2 File-1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
If it can be empty: with GNU awk you can replace NR==FNR with ARGIND==1, or with any awk you can use FILENAME=="File-2" instead.
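A runnable sketch of the FILENAME form, recreating a small sample of the two files (FILENAME is standard awk, so this needs no GNU extensions, and unlike NR==FNR it still distinguishes the files when the map file is empty):

```shell
# Work in a scratch directory and recreate small samples of the files
cd "$(mktemp -d)"
printf '1_1\n1_2 k0002\n1_3\n' > File-2
printf 'Geneid Length s1 s2\n1_1 6571 7 8\n1_2 5041 3 0\n1_3 1032 7 3\n' > File-1

# Only keys with a 2nd field are stored; unmatched $1 values pass through
awk 'FILENAME=="File-2" { if (NF>1) a[$1]=$2; next }
     $1 in a { $1=a[$1] }
     1' File-2 File-1
# prints:
# Geneid Length s1 s2
# 1_1 6571 7 8
# k0002 5041 3 0
# 1_3 1032 7 3
```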
I am trying to turn a 1xA table into a BxC table. Let's say A is 15, B is 3 and C is 5, hence after each 5 entries I want it to start a new row in the same table.
I have a rather tedious way that appears to get close, but it misses some values after each 5. I think the issue is with RS: a newline "forgets" the space needed by RS. I tried changing RS to something else in file.sum, but still no luck. Perhaps there is a better way to do it, but I feel this should work.
awk -v RS=" " '{getline a1; getline a2; getline a3; getline a4; getline a5; print a1,a2,a3,a4,a5}' OFS='\t' file.sum
file.sum (my 1xA):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Expected results (my BxC):
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
Actual results:
1 2 3 4 5
7 8 9 10 11
13 14 15 10 11
This should be one of the simplest solutions:
xargs -n5 <file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
To follow up on your awk: I do not like getline, so I always try to avoid it. Loops also slow awk down somewhat.
But using RS=" " you can do like this:
awk -v RS=" " '{$1=$1} {printf NR%5==0?"%s\n":"%s ",$0}' file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
You can remove the {$1=$1}, but you will then get a blank line at the end.
The NR%5==0 tests whether the record is every 5th one and inserts a newline when needed.
A tab version:
awk -v RS=" " '{$1=$1} {printf NR%5==0?"%s\n":"%s\t",$0}' file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
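Another alternative without awk at all: split the single line into one item per line with tr, then glue them back five per row with paste (a sketch; one "-" per output column):

```shell
# tr turns each run of spaces into a newline; paste reads stdin once
# per "-" operand, cycling, so five dashes give five columns per row
echo '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15' |
tr -s ' ' '\n' | paste -d' ' - - - - -
# prints:
# 1 2 3 4 5
# 6 7 8 9 10
# 11 12 13 14 15
```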
To solve my problem, I subtract successive values of column 3 and store the difference in a new column 5; then I print the previous and current lines whenever the value in column 5 equals 25.
Input file
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
I tried
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' file
output
1 1 35 1 35
2 5 50 1 15
2 6 75 1 25
4 7 85 1 10
5 8 100 1 15
6 9 125 1 25
4 1 200 1 75
Desired Output
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Thanks in advance
You're almost there: in addition to the previous $3, keep the previous $0, and print only when the condition is satisfied.
$ awk '{$5=$3-p3} $5==25{print p0; print} {p0=$0;p3=$3}' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
This can be golfed further to:
$ awk '25==($5=$3-p3){print p0; print} {p0=$0;p3=$3}' file
This checks whether the newly computed field $5 equals 25; if so, it prints the previous line and the current line. It then saves the current line and the current $3 for the computations on the next line.
You are close to the answer; just pipe it to another awk and print:
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
with Inputs:
$ cat oxxo.txt
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
$ awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
$
Could you please try the following.
awk '$3-prev==25{print line ORS $0,$3} {$(NF+1)=$3-prev;prev=$3;line=$0}' Input_file | column -t
Here's one:
$ awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Explained:
$ awk '{
$5=$3-q # subtract
t=p # previous to temp
p=$0 # store previous for next round
q=$3 # store subtract value for next round
$0=t ORS $0 # prepare record for output
}
$10==25 # output if equals
' file
There is no checking for duplicates, so you might get the same record printed twice. The easiest fix is to pipe the output to uniq.
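For instance, with two consecutive differences of 25, the middle record would print twice, and uniq collapses the consecutive duplicates; a sketch with made-up input:

```shell
# 35-10=25 and 60-35=25, so the middle line is printed twice
# (once as "current", once as "previous"); uniq removes the repeat
printf '1 1 10 1\n1 1 35 1\n1 1 60 1\n' |
awk '25==($5=$3-p3){print p0; print} {p0=$0;p3=$3}' |
uniq
# prints:
# 1 1 10 1 10
# 1 1 35 1 25
# 1 1 60 1 25
```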
I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it does the same with the 1000th.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give an input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could simply print the fields in the desired order in a loop.
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
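A quick way to check the NF-- variant, reusing the seq/paste trick from above (note that decrementing NF works in gawk and mawk at least, but is not strictly guaranteed by POSIX):

```shell
# Build a 20-column tab-separated line, then move the last column to 3rd;
# NF-- truncates the record, dropping the now-duplicated last field
seq 20 | paste -s -d'\t' - |
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1'
# prints (tab-separated): 1 2 20 3 4 5 ... 19
```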
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)  # die quietly when the downstream pipe closes

# example usage: cat file | mycut 3 2 1   (columns are 1-based, like cut)
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    parts = line.rstrip("\n").split(delimiter)
    print("\t".join(parts[col - 1] for col in columns))
I am thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would need its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
unset n;
n="{ print ";
while [ "$1" ]; do
n="$n\$$1\" \" ";
shift;
done;
n="$n }";
awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh
How can I sum the numbers on each individual line in Linux?
I have one file :
Course Name: Math
Credit: 4
12345 1 4 5 1 1 1 1 1 5 10 1 2 2 20
34567 2 3 4 1 10 5 3 2 5 5 10 20 5
Course Name: English
Credit: 4
12345 1 4 5 1 1 1 1 1 5 10 1 20
34567 4 1 10 5 3 2 5 5 10 20 5
The output should become:
Course Name: Math
Credit: 4
12345 55
34567 75
Course Name: English
Credit: 4
12345 51
34567 70
I tried this code:
awk '{for (i=2; i<=NF; i++) {tot += $i}; print $1 "\t" tot; tot = 0}' file > file2
The output is like this:
Course Name: 0
Credit: 4
12345 55
34567 75
Course Name: 0
Credit: 4
12345 51
34567 70
Actually I need to display the course name too (Math and English). I have been trying to fix it but couldn't. Can you please help?
Try:
awk '/^[0-9]/{for (i=2; i<=NF; i++) {tot += $i}; print $1 "\t" tot; tot =0} !/^[0-9]/'
This will only sum lines that start with a digit, and simply print those that don't.
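A runnable sketch of that answer against a small sample of the input:

```shell
# Lines starting with a digit are summed from field 2 on;
# the header lines fail /^[0-9]/ and are printed unchanged
printf 'Course Name: Math\nCredit: 4\n12345 1 4 5 1 1 1 1 1 5 10 1 2 2 20\n' |
awk '/^[0-9]/{for (i=2; i<=NF; i++) {tot += $i}; print $1 "\t" tot; tot=0} !/^[0-9]/'
# prints:
# Course Name: Math
# Credit: 4
# 12345	55
```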
Just with the shell
while read line; do
case $line in
Course*|Credit*) echo "$line" ;;
*) set -- $line
id=$1
shift 1
sum=$(IFS=+; echo "$*" | bc)
printf "%s\t%d\n" $id $sum
;;
esac
done < filename
This might work for you too!
sed '/:/{s/.*/echo "&"/;b};s/ /+/2g;s/\(\S*\) \(.*\)/echo "\1\t\$((\2))"/' file | sh