how to evaluate awk in a sed statement? - awk

For each .fastq file in a folder, I need to append the filename of the file that read is contained on to the header line.
Say the first 8 lines of fastq file read1.with.long.identifier.fastq are:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
I would like them to read:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT read1.with.long.identifier
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT read1.with.long.identifier
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
using:
cat read1.with.long.identifier.fastq | sed "/^#......:/ s/$/
awk "FILENAME" read1.with.long.identifier.fastq/" | tr "\t" "\n" >
read1_new_headers.fastq
However, this yields:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT awk "FILENAME" read1.with.long.identifier.fastq
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT awk "FILENAME" read1.with.long.identifier.fastq
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
This is a non-iterative version. I know I can just take out awk and FILENAME and paste in the literal filename "read1.with.long.identifier" to get what I need, but in the actual data I need to do this iteratively (awk FILENAME i...) for many files with different filenames, so I need something that will evaluate the filename automatically. I'm obviously thinking about this incorrectly. How do you evaluate awk in a sed statement?

Now that I understand read1.with.long.identifier is actually a filename, my sample code is even simpler and requires no sed:
awk '/^#/{$0=$0 " " FILENAME }1' file1 file2 ... > all_output
This appends the current FILENAME to the end of any line that begins with #.
My test using data.txt as the input file produced:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT data.txt
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT data.txt
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
If you need to overwrite each file, that will require a for loop and temporary files. But without more feedback, I don't want to spend further time only to discover I'm heading in the wrong direction.
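The in-place loop mentioned above could be sketched like this (a minimal sketch; the *.fastq glob is an assumption, and the sub() that strips the .fastq suffix is my addition to match the desired output, which omits the extension):

```shell
# Rewrite each .fastq in place, appending its own name (minus .fastq)
# to every header line, via a temporary file.
for f in *.fastq; do
  awk '/^#/{ fn = FILENAME; sub(/\.fastq$/, "", fn); $0 = $0 " " fn } 1' \
    "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```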
IHTH

Related

How to use filenames having special characters with awk '{system("stat " $0)}'

For example, list.txt is like this:
abc.txt
-abc.txt
I couldn't get the correct answer with either
awk '{system("stat " $0)}' list.txt or awk '{system("stat \"" $0 "\"")}' list.txt.
How could I tell the awk-system to add quotes around the filename?
awk '{system("stat " $0)}' list.txt certainly would not work.
But why wouldn't awk '{system("stat \"" $0 "\"")}' list.txt work either? It behaves just like the former.
And with awk '{system("stat \\\"" $0 "\\\"")}' list.txt, I got this:
stat: cannot stat '"abc.txt"': No such file or directory
First of all, if you want to capture the output of the stat command, system() is not the right way to go: it merely returns the exit code, not the execution output.
You may try cmd | getline myOutput in awk; the myOutput variable will hold the output (one line at a time). Or you can write to a pipe with print ... | cmd.
Regarding your file -abc.txt: quoting it isn't enough. Try it in a terminal: stat "-abc.txt" won't work, because the filename starts with -. You need to add --: stat -- "-abc.txt" So you probably want to check whether the filename starts with - and add the -- in your awk code.
Finally, about quoting: you can declare an awk variable, like awk -v q='"' '..., and then use q wherever you need a "; this way your code may be easier to read. E.g., print "stat " q "myName" q
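Putting those pieces together, a minimal sketch (list.txt is from the question; always adding -- rather than testing for a leading - is my simplification, and a name containing a double quote would still break it):

```shell
# Run stat on each name in list.txt, quoting the name and using --
# so that names beginning with - are not treated as options.
awk -v q='"' '{
  cmd = "stat -- " q $0 q             # e.g. stat -- "-abc.txt"
  while ((cmd | getline line) > 0)    # capture output instead of system()
    print line
  close(cmd)                          # close the pipe before the next name
}' list.txt
```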

Remove bad characters from file name while splitting with awk

I have a large file that I split with awk, using the last column as the name for the new files, but one of the rows includes a "/" in that column, which gives a "can't open" error.
I have tried making a function to transform the name for the file, but awk doesn't use it when I run it; maybe that's an error on my part:
tried_func() {
echo $1 | tr "/" "_"
}
awk -F ',' 'NR>1 {fname="a_map/" tried_func $NF".csv"; print >> fname;
close(fname)}' large_file.csv
Large_file.csv
A, row, I don't, need
plenty, with, columns, good_name
alot, off, them, another_good_name
more, more, more, bad/name
expected res:
list of files in a_map:
good_name.csv
another_good_name.csv
bad_name.csv
actual res:
awk: can't open file a_map/bad/name.csv
It doesn't need to be a function; if I can just strip the "/" in awk, that is fab too.
Awk is not part of the shell; it's an independent programming language, so you can't call shell functions that way. Instead, just do the whole thing within awk:
$ awk -F ',' '
NR>1 {
gsub(/\//,"_",$NF) # replace /s with _s
fname="a_map/" $NF ".csv"
print >> fname
close(fname)
}' file
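A quick end-to-end run on the sample data might look like this (the mkdir and the leading-space trim are my additions; without the sub(), the last field keeps the space that follows each comma and it ends up in the filename):

```shell
mkdir -p a_map   # the output directory must already exist
awk -F ',' 'NR>1 {
  sub(/^ */, "", $NF)          # trim the space that follows the comma
  gsub(/\//, "_", $NF)         # replace /s with _s
  fname = "a_map/" $NF ".csv"
  print >> fname
  close(fname)
}' Large_file.csv
ls a_map
```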

Count field separators on each line of input file and if missing/exceeding, output filename to error file

I have to validate the input file, Input.txt, for the proper number of field separators on each row; if even one row, including the header, is missing or exceeding the correct number of field separators, then print the name of the file to errorfiles.txt and exit.
I have another file, valid.txt, to use as a reference for the correct number of field separators; I compare the number of field separators on each row of the input file with the number in valid.txt.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
This is not working.
It is not fully clear what your requirement is: do you want to print the FILENAME of just the single input file provided, or perhaps loop over a list of files in a directory running this command?
Anyway, to use the content of a file in the context of awk, just use its -v switch together with input redirection on the file:
awk -F '|' -v count="$(<fscount)" -v fname="errorfiles.txt" '(NF-1) != (count+0) { print FILENAME > fname; close(fname); exit}' Input.txt
Notice the use of close(fname) here, which is generally required when you are manipulating files inside awk constructs. The close() call explicitly closes the file descriptor associated with the opened file, instead of letting the OS do it.
GNU awk solution:
awk -F '|' 'ARGIND==1{aimNF=NF; nextfile} ARGIND==2{if (NF!=aimNF) {print FILENAME > "errorfiles.txt"; exit}}' valid.txt Input.txt
You can do it with just one command: use awk to read both files, store the NF of the first file, and compare against it in the second file.
For other awks you can replace ARGIND==1 with FILENAME==ARGV[1], and so on. Or, if you are sure the first file won't be empty, use NR==FNR and NR>FNR instead.
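The same idea in portable form might look like this (a sketch assuming valid.txt is non-empty; nextfile is supported by gawk, mawk, and busybox awk):

```shell
awk -F '|' '
  NR==FNR     { aimNF = NF; nextfile }   # take NF from the first line of valid.txt
  NF != aimNF { print FILENAME > "errorfiles.txt"; exit }
' valid.txt Input.txt
```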

I'm trying to compare two fastq file(paired reads),print line number n of another file

I'm trying to compare two fastq files (paired reads) such that the position (line number) of a pattern match in file1.fastq is compared against file2.fastq: I want to print what lies on the same line number in file2.fastq. I'm trying to do this with awk. For example, if my pattern match lies on line number 200 in file1, I want to see what is on line 200 in file2. Any suggestions appreciated.
In general, you want this form:
awk '
{ getline line_2 < "file2" }
/pattern/ { print FNR, line_2 }
' file1
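A tiny self-contained demonstration of that form (file names and contents here are invented):

```shell
printf 'aaa\npattern here\nccc\n' > f1.txt
printf 'one\ntwo\nthree\n'        > f2.txt
awk '
  { getline line_2 < "f2.txt" }       # advance file 2 in lock-step
  /pattern/ { print FNR, line_2 }     # line number plus the parallel line
' f1.txt
# prints: 2 two
```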
Alternately, paste the files together first (assuming your shell is bash)
paste -d $'\1' file1 file2 | awk -F $'\1' '$1 ~ /pattern/ {print FNR, $2}'
I'm using Ctrl-A as the field delimiter, assuming that character does not appear in your files.
My understanding is you have three files. A pattern file and two data files. You want to find the line numbers of the patterns in data file 1 and find the corresponding lines in data file 2. You'll get more help if you can clarify it and perhaps provide input files and expected output.
awk to the rescue!
awk -F: -vOFS=: 'NR==FNR{lines[$1]=$0;next} FNR in lines{print lines[FNR],$0}' <(grep -nf pattern data1) data2
will print the line number, the pattern match from data file 1, and the corresponding line from data file 2. For my made-up files with quasi-random data I got:
1:s1265e:s1265e
2:s28629e:s28629e
3:s6630e:s6630e
4:s24530e:s24530e
5:s23216e:s23216e
6:s25985e:s25985e
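For a concrete (made-up) run of that pipeline, assuming bash for the <() process substitution:

```shell
printf 's2\n'         > pattern   # patterns to look for
printf 's1\ns2\ns3\n' > data1     # data file 1
printf 'a\nb\nc\n'    > data2     # data file 2
awk -F: -v OFS=: '
  NR==FNR { lines[$1] = $0; next }        # index grep -n output by line number
  FNR in lines { print lines[FNR], $0 }   # join with the same line of data2
' <(grep -nf pattern data1) data2
# prints: 2:s2:b
```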
My novice attempt so far:
zcat file1.fastq.gz|awk '~/pattern/{print NR;}'>matches.csv
awk 'FNR==NR{a[$1]=$0;next;}(FNR in a)' matches.csv file2.fastq.gz

Using grep and awk to search and print the output to new file

I have 100 files and want to search for a specific word in the first column of each file and print all columns of the matching lines to a new file.
I tried this code but it doesn't work well; it prints only the content of one file, not all:
ls -t *.txt > Filelist.tmp
cat Filelist.tmp | while read line do; grep "searchword" | awk '{print $0}' > outputfile.txt; done
This is what you want:
$ awk '$1~/searchword/' *.txt >> output
This compares the first field against searchword and appends the line to output if it matches. The default field separator in awk is whitespace.
The main problem with your attempt is that you are overwriting (>) the file every time; you want to append (>>).
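A quick demonstration of the one-liner above (file names and contents invented):

```shell
printf 'searchword x y\nother a b\n' > f1.txt
printf 'nope 1 2\nsearchword 3 4\n'  > f2.txt
awk '$1~/searchword/' f1.txt f2.txt >> output   # >> appends across runs
cat output
# prints: searchword x y
#         searchword 3 4
```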