exclude sequences depending on description ID in AWK

I have FASTA files whose description lines contain isoform IDs (isoform 2, ..., isoform 9), and I want to exclude those sequences from the files.
I used this command line to see which files contain the isoform 2 to 9 IDs:
for i in `ls *.fasta`; do l=`grep 'isoform X[2-9]' $i | head -1`; echo $i $l; done | awk '(NF==1){print}' | head
Is there a way to add something to my command line to remove them all?
Thanks.

sed 's/isoform [2-9]//g' *.fasta
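Note that this sed only strips the matching text from the description lines; the sequence records themselves stay in the output. If the goal is to drop whole records, here is a minimal awk sketch, assuming each record starts with a > header line and reusing the isoform X[2-9] pattern from the question's grep:

awk '/^>/ {keep = !/isoform X[2-9]/} keep' input.fasta > filtered.fasta

On each header line it decides whether to keep the record, and the bare keep condition prints every line (header and sequence lines) while the flag is set.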

How to display the date of each file as the first element of each line with bash/awk?

I have 7 txt files which are the output of the df -m command on AIX 7.2.
I need to keep only the first and second columns for one filesystem, so I do this:
cat *.txt | grep hd4 | awk '{print $1","$2}' > test1.txt
And the output is:
/dev/hd4,384.00
/dev/hd4,394.00
/dev/hd4,354.00
/dev/hd4,384.00
/dev/hd4,484.00
/dev/hd4,324.00
/dev/hd4,384.00
Each file is created from the crontab, and their filenames are:
df_command-2019-09-03-12:50:00.txt
df_command-2019-08-28-12:59:00.txt
df_command-2019-08-29-12:51:00.txt
df_command-2019-08-30-12:52:00.txt
df_command-2019-08-31-12:53:00.txt
df_command-2019-09-01-12:54:00.txt
df_command-2019-09-02-12:55:00.txt
I would like to keep only the date from the filename. I'm able to do that:
test=df_command-2019-09-03-12:50:00.txt
echo $test | cut -d'-' -f2,3,4
output:
2019-09-03
But I would like to put each date as the first element of each line of my test1.txt:
2019-08-28,/dev/hd4,384.00
2019-08-29,/dev/hd4,394.00
2019-08-30,/dev/hd4,354.00
2019-08-31,/dev/hd4,384.00
2019-09-01,/dev/hd4,484.00
2019-09-02,/dev/hd4,324.00
2019-09-03,/dev/hd4,384.00
Do you have any idea how to do that?
This awk may do:
awk '/hd4/ {split(FILENAME,a,"-");print a[2]"-"a[3]"-"a[4]","$1","$2}' *.txt > test1.txt
/hd4/ finds lines containing hd4
split(FILENAME,a,"-") splits the filename into array a on -
print a[2]"-"a[3]"-"a[4]","$1","$2 prints year-month-day, field 1, and field 2
> test1.txt redirects to the file test1.txt
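To see what the split does with one of the filenames above:

$ awk 'BEGIN{split("df_command-2019-08-28-12:59:00.txt",a,"-"); print a[2]"-"a[3]"-"a[4]}'
2019-08-28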
Alternatively, given the dates in a file dates.txt:
2019-08-20
2019-08-08
2019-08-01
and the filesystem data in fsys.txt:
/dev/hd4,384.00
/dev/hd4,394.00
/dev/hd4,354.00
paste can be used to append the files as columns. Use -d to specify comma as the separator.
paste -d ',' dates.txt fsys.txt
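If the dates file does not exist yet, it can be built from the filenames with the cut command from the question, assuming the df_command-YYYY-MM-DD-HH:MM:SS.txt naming shown above:

for f in df_command-*.txt; do echo "$f" | cut -d'-' -f2,3,4; done > dates.txt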

Awk: group every 12 lines into an insert query

I have the following script:
curl -s 'https://someonepage=5m' | jq '.[]|.[0],.[1],.[2],.[3],.[4],.[5],.[6],.[7],.[8],.[9],.[10],.[11],.[12]' | perl -p -e 's/\"//g' |awk '/^[0-9]/{print; if (++onr%12 == 0) print ""; }'
This is part of result:
1517773500000
0.10250100
0.10275700
0.10243500
0.10256600
257.26700000
1517773799999
26.38912220
1229
104.32200000
10.70579910
0
1517773800000
0.10256600
0.10268000
0.10231600
0.10243400
310.64600000
1517774099999
31.83806883
1452
129.70500000
13.29758266
0
1517774100000
0.10243400
0.10257500
0.10211800
0.10230000
359.06300000
1517774399999
36.73708621
1296
154.78500000
15.84041910
0
I want to insert this data into a MySQL database. For each group of 12 lines I want this result:
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
I need to merge every 12 lines into one; can anyone help me get this result?
Here's an all-jq solution:
.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"
In the sample, all the subarrays have length 12, so you might be able to drop the .[0:12] part of the pipeline. If using jq 1.5 or later, you could use join(",") instead of the @tsv|gsub portion of the pipeline. You might, for example, want to consider:
.[] | join(",") | "(\(.))"   # jq 1.5 or later
Invocation: use the -r command-line option
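For example, assuming the JSON response is saved in a file data.json (a hypothetical name), the full invocation would look like:

$ jq -r '.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"' data.json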
Sample output:
(1517627400000,0.10452300,0.10499000,0.10418200,0.10449400,819.50400000,1517627699999,85.57150693,2340,452.63400000,47.27213035,0)
(1517627700000,0.10435700,0.10449200,0.10366000,0.10370000,717.37000000,1517627999999,74.60582079,1996,321.25500000,33.42273846,0)
(1517628000000,0.10376600,0.10390000,0.10366000,0.10370400,519.59400000,1517628299999,53.88836170,1258,239.89300000,24.88613854,0)
$ awk 'BEGIN {RS=""; OFS=","} {$1=$1; $0="("$0")"}1' file
(1517773500000,0.10250100,0.10275700,0.10243500,0.10256600,257.26700000,1517773799999,26.38912220,1229,104.32200000,10.70579910,0)
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
RS="":
Treat groups of lines separated by one or more blank lines as a record
OFS=","
Set the output separator to be a ","
$1=$1
Reconstitute the line, replacing the input separators with the output separator
$0="("$0")"
Surround the record with parens
1
Print the record
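A quick way to see the record handling, using a small illustrative input with two blank-line-separated records:

$ printf '1\n2\n3\n\n4\n5\n6\n' | awk 'BEGIN {RS=""; OFS=","} {$1=$1; $0="("$0")"}1'
(1,2,3)
(4,5,6)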

base64 decoding from file column

I have a file where every line has 6 columns separated by ",". The last column is gzipped and encoded in base64. The output file should be column 3 and column 6 (decoded/unzipped).
I tried to do this with:
awk -F',' '{"echo "$6" | base64 -di | gunzip" | getline x;print $3,x }' OFS=',' inputfile.csv >outptfile_decoded.csv
The result for the first lines is OK, but after some lines the decoded output is the same as on the line before. It seems that decoding & unzipping hangs, but I don't get an error message.
A single decode/unzip works fine, i.e.:
echo "H4sIAAAAAAAAA7NJTkuxs0lMLrEztNEHUTZAgcy8tHw7m7zSXLuS1BwrbRNjMzMTc3MDAzMDG32QqE1uSWVBqh2QB2HYlCYX2xnb6IMoG324ASCWHQAaafi1YQAAAA==" | base64 -di | gunzip
What can be the reason for this effect? (There are no error messages.)
Is there another way that works reliably?
Without a test case it's difficult to recommend anything. Here is a working script with test input data.
First, create a test data file:
$ while read f; do echo $f,$(echo $f | gzip -f | base64); done < <(seq 5) | tee file.g
1,H4sIAJhBuVkAAzPkAgBT/FFnAgAAAA==
2,H4sIAJhBuVkAAzPiAgCQr3xMAgAAAA==
3,H4sIAJhBuVkAAzPmAgDRnmdVAgAAAA==
4,H4sIAJhBuVkAAzPhAgAWCCYaAgAAAA==
5,H4sIAJhBuVkAAzPlAgBXOT0DAgAAAA==
and decode:
$ awk 'BEGIN {FS=OFS=","}
{cmd="echo "$2" | base64 -di | gunzip"; cmd | getline v; close(cmd); print $1,v}' file.g
1,1
2,2
3,3
4,4
5,5
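The close(cmd) call matters: each distinct command string opens a new pipe, and awk never closes pipes on its own, so on a long input the per-process file-descriptor limit is eventually reached, getline starts failing, and x silently keeps its previous value, which matches the symptom described. Checking the getline return value guards against this as well; a sketch based on the original command:

awk -F',' '{
    cmd = "echo " $6 " | base64 -di | gunzip"
    if ((cmd | getline x) <= 0) x = "DECODE_ERROR"   # do not reuse a stale x
    close(cmd)                                        # release the pipe
    print $3, x
}' OFS=',' inputfile.csv > outptfile_decoded.csv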

Count files that have unique prefixes

I have a set of files that looks like the following. I'm looking for a good way to count all files that have unique prefixes, where "prefix" is defined as all characters before the second hyphen.
0406-0357-9.jpg 0591-0349-9.jpg 0603-3887-27.jpg 59762-1540-40.jpg 68180-517-6.jpg
0406-0357-90.jpg 0591-0349-90.jpg 0603-3887-28.jpg 59762-1540-41.jpg 68180-517-7.jpg
0406-0357-91.jpg 0591-0349-91.jpg 0603-3887-29.jpg 59762-1540-42.jpg 68180-517-8.jpg
0406-0357-92.jpg 0591-0349-92.jpg 0603-3887-3.jpg 59762-1540-5.jpg 68180-517-9.jpg
0406-0357-93.jpg 0591-0349-93.jpg 0603-3887-30.jpg 59762-1540-6.jpg
Depending on what output you actually want, either of these might be what you need:
ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
or
ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
If it's something else, update your question to show the output you're looking for.
This should do it:
ls *.jpg | cut -d- -s -f1,2 | uniq | wc -l
Or if your prefixes are always 4 digits, one dash, 4 digits, you don't need cut:
ls *.jpg | uniq -w9 | wc -l
Parses ls (bad practice, but it doesn't look like it will cause a problem with these filenames),
uses awk with - as the field separator.
!seen[$1,$2]++ uses an associative array keyed on $1,$2 and post-increments it; the condition is true only when the pre-increment value was 0, i.e. the first time a given $1,$2 pair is seen.
print prints on screen :)
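For example, running the counting variant on a few of the sample names (purely illustrative):

$ printf '%s\n' 0406-0357-9.jpg 0406-0357-90.jpg 0591-0349-9.jpg | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
2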
ls | awk 'BEGIN{FS="-" ; printf("%-20s%-10s\n","Prefix","Count")} {seen[$1"-"$2]++} END{ for (k in seen){printf("%-20s%-10i\n",k,seen[k])}}'
Will now count based on prefix with headers :)

How to take a single record in a file and display it vertically with the field number shown before each field

I have a file with a single line in it that holds a record delimited by semicolons. So far I have figured out that I can use tr by issuing:
tr ';' '\n' < t
However, since the record has 140 fields, I'd like to be able to show the field number when displaying, such as the following:
1 23
2 324234
3 AAA
.
.
140 Blah
Help is greatly appreciated!
tr \; '\n' < t | nl
or
awk -v RS=';' '$1=++i" "$1' file
test:
kent$ echo "a;b;c;d"|awk -v RS=';' '$1=++i" "$1'
1 a
2 b
3 c
4 d
You could just run it through cat -n.
tr \; '\n' < t | cat -n
Since this is tagged awk, you could do it that way, too; it's just a little wordier:
awk -F\; '{for (i=1;i<=NF;++i) { print i" "$i }}'
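For example:

$ echo '23;324234;AAA' | awk -F\; '{for (i=1;i<=NF;++i) { print i" "$i }}'
1 23
2 324234
3 AAA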
In shell you can use IFS to specify a field separator like so:
IFS=";"
i=0
for s in $(<file)
do
    ((i++))
    echo $i $s
done
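A sample run, assuming the record from above is saved in file (in an interactive shell you may want to save and restore IFS afterwards):

$ printf '23;324234;AAA\n' > file
$ IFS=";"; i=0; for s in $(<file); do ((i++)); echo $i $s; done
1 23
2 324234
3 AAA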