I have a file with 5 columns that looks like this:
15642 G A.aa,, 0.77501 107
15643 G A.a,.A, 0.7570 17
15644 C t.TtTt,.T, 0.7501 10
I'm trying to convert the 3rd column of Aa's and Tt's to just "A" or "T".
Output:
15642 G A 0.77501 107
15643 G A 0.7570 17
15644 C T 0.7501 10
I've tried various awk methods without success. I'd sincerely appreciate any help. Thanks!
The following awk may help you:
awk '$3~/[Aa]/{$3="A"} $3~/[Tt]/{$3="T"} 1' Input_file
There are many possibilities, including:
$ awk '{sub(/\..*/,"",$3)} 1' file
15642 G A 0.77501 107
15643 G A 0.7570 17
15644 C t 0.7501 10
or
$ awk '{$3=substr($3,1,1)} 1' file
15642 G A 0.77501 107
15643 G A 0.7570 17
15644 C t 0.7501 10
or
$ awk '{$3=toupper(substr($3,1,1))} 1' file
15642 G A 0.77501 107
15643 G A 0.7570 17
15644 C T 0.7501 10
This might work for you (GNU sed):
sed -ri 's/(\S)\S*/\U\1/3' file
Convert the first character of the third field to uppercase.
I am having an issue that I almost solved thanks to this post. Using a dataset in the same format:
File 1
32074_32077 1 0.008348 834830 G A
32082_32085 1 0.008349 834928 A G
32085_32088 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
File 2
rs3094315 1 0.020130 752566 G A
rs12124819 1 0.020242 834928 A G
rs28765502 2 0.022137 834928 T C
rs7419119 3 0.022518 846808 T G
I would like to change the 1st column of file 1 only IF $4 and $2 match a line in file 2. If not, I would like to keep the line as it is.
Expected output:
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
Using the answer from the linked post, I cannot get the expected output. I tried this:
awk 'FNR==NR{a[$4]=$1; b[$2]=$1; next} ($4 in a && $2 in b){$1=a[$4]} 1' file1 file2
It doesn't work as expected because the condition $2 in b is always true. I understand why, but I don't know how to work around this.
Thank you.
You may use this awk:
awk 'FNR==NR {a[$2,$4]=$1; next} ($2,$4) in a {$1 = a[$2,$4]} 1' file2 file1 |
column -t
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
This uses array a with the composite key ($2,$4).
column -t is used to format the output as a table.
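For what it's worth, the comma in a[$2,$4] joins the two keys with awk's SUBSEP character, so the same lookup can be written with explicit concatenation. A minimal equivalent sketch (using the same file1/file2 names as above):

```shell
# same logic, but building the composite key explicitly with SUBSEP
awk 'FNR==NR {a[$2 SUBSEP $4] = $1; next}
     ($2 SUBSEP $4) in a {$1 = a[$2 SUBSEP $4]} 1' file2 file1
```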
I have a file first.txt that looks like this :
45
56
74
62
I want to append this file to second.tsv, which looks like this (there are 17 columns):
2 a ...
3 b ...
5 c ...
6 d ...
The desired output is :
2 45 a ...
3 56 b ...
5 74 c ...
6 62 d ...
How can I insert it as the second column?
I've tried
awk -F, '{getline f1 <"first.txt" ;print $1,f1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' second.tsv
but it did not work: it added the column from first.txt after the last column of second.tsv, and the output was not tab-separated.
Thank you.
Your code works if you remove the -F, bit. This tells awk that the file is comma-separated, which it is not.
Another option would be to go for a piped version with paste, e.g.:
paste first.txt second.tsv | awk '{ t=$2; $2=$1; $1=t } 1' OFS='\t'
Output:
2 45 a ...
3 56 b ...
5 74 c ...
6 62 d ...
$ awk 'NR==FNR{a[FNR]=$0;next} {$1=$1 OFS a[FNR]} 1' file1 file2
2 45 a ...
3 56 b ...
5 74 c ...
6 62 d ...
If your files are tab-separated add BEGIN{FS=OFS="\t"} at the front.
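For example, the fully tab-separated version would be (a sketch, assuming first.txt and second.tsv use single tabs between fields):

```shell
# read first.txt into a[], then splice each stored line in after field 1 of second.tsv
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[FNR]=$0; next} {$1=$1 OFS a[FNR]} 1' first.txt second.tsv
```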
The -F option lets you specify the field separator for awk, but using '\n' as the separator doesn't work; that is, it doesn't make $1 the first line of the input, $2 the second line, and so on.
I suspect that this is because awk looks for the field separator within each line. Is there a way to get around this with awk, or some other Linux command? Basically, I want to separate my input by newline characters and put them into an Excel file.
I'm still warming up to Linux and shell scripts, which is the reason for my lack of creativity with this problem.
Thank you!
You may need to override the input record separator (RS), whose default is a newline.
See my example below:
$ cat test.txt
a
b
c
d
$ awk 'BEGIN{ RS = "" ; FS = "\n" }{print $1,$2,$3,$4}' test.txt
a b c d
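One caveat: RS="" is awk's paragraph mode, and the fixed print $1,$2,$3,$4 only covers exactly four lines. A sketch that joins however many lines a record has (assuming the input contains no blank lines, which would start a new record):

```shell
# $1=$1 forces awk to rebuild the record with OFS (a space) between the
# newline-separated fields, regardless of how many lines there are
awk 'BEGIN{RS=""; FS="\n"; OFS=" "} {$1=$1; print}' test.txt
```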
Note that you can change both the input and output record separators, so you can do something like this to achieve a similar result to the accepted answer.
cat test.txt
a
b
c
d
$ awk -v ORS=" " '{print $1}' test.txt
a b c d
One can simplify it to just the following, with the minor caveat of an extra trailing space and no trailing newline:
% echo "a\nb\nc\nd"
a
b
c
d
% echo "a\nb\nc\nd" | mawk 8 ORS=' '
a b c d %
To rectify that, plus handle the edge case of no trailing newline in the input, one can modify it to:
% echo -n "a\nb\nc\nd" | mawk 'NF-=_==$NF' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010
% echo "a\nb\nc\nd" | mawk 'NF -= (_==$NF)' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010
From a tab-delimited file, I'm trying to extract all rows matching a unique value in column 4 and save them as a CSV. However, I would like to do this for all the distinct values in column 4, saving each group as its own CSV, in one go.
I was able to extract one value using this command:
awk -F $'\t' '$4 == "\"C333\"" {print}' dataFile > C333.csv
Let's consider this test file:
$ cat in.csv
a b c d
aa bb cc d
1 2 3 4
12 23 34 4
A B C d
Now, let's write each row to a tab-separated output file that is named after the fourth column:
$ awk -F'\t' '{f=$4".csv"; print>>f; close(f)}' OFS='\t' in.csv
$ cat d.csv
a b c d
aa bb cc d
A B C d
$ cat 4.csv
1 2 3 4
12 23 34 4
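Note that the sample values in the question carry literal double quotes (e.g. "C333"), which would end up in the file names. A variant that strips the quotes before building the output name (a sketch, assuming the real data is tab-separated as in the question):

```shell
# remove embedded double quotes from column 4 before using it as a file name
awk -F'\t' '{f=$4; gsub(/"/,"",f); f=f ".csv"; print>>f; close(f)}' dataFile
```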
I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order (cut always outputs fields in their original order, regardless of how the field list is written).
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it moves the 1000th.
EDIT
If the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give an input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print the fields in a loop.
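That loop could be sketched as follows (printing fields 1 and 2, then the last field, then 3 through NF-1):

```shell
# print $1, $2, $NF first, then the remaining fields in their original order
awk 'BEGIN{FS=OFS="\t"} {printf "%s%s%s%s%s", $1, OFS, $2, OFS, $NF;
     for (i=3; i<NF; i++) printf "%s%s", OFS, $i;
     print ""}' file
```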
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)

# example usage: cat file | mycut 3 2 1  (columns are 1-based, like cut)
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    parts = line.rstrip("\n").split(delimiter)
    print("\t".join(parts[col - 1] for col in columns))
I'm thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would need its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n;
    n="{ print ";
    while [ "$1" ]; do
        n="$n\$$1\" \" ";
        shift;
    done;
    n="$n }";
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh
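For comparison, a pure-awk take on the same idea, passing the desired order as a variable (a sketch; it assumes every requested field number exists on each line):

```shell
# print fields in the order given by the `order` variable
echo foo bar baz | awk -v order="2 3 1" '{
  n = split(order, o, " ")
  for (i = 1; i <= n; i++) printf "%s%s", $(o[i]), (i < n ? OFS : ORS)
}'
```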