awk output with spaces in first column - awk

I tried using awk to split the columns and print a sentence, but the first column contains spaces.
A sample of my beginner code:
$ awk '/Linux/ { print "The filename","\""$1"\"","is located in",$2 }' test.txt
The filename "The" is located in test
The filename "Some" is located in file
The filename "File" is located in name
The filename "Something_here" is located in /ABC
The filename "Another_test" is located in /DEFG
The filename "Label" is located in test
From file: test.txt
Filename Folder Type
-------------------------------------- -------------- ------
The test file /test/folder Linux
Some file / Linux
File name /Temp Linux
Something_here /ABC Linux
Another_test /DEFG Linux
Label test /HIJK Linux
What I want to achieve (quotes included):
The filename "Default file" is located in /
The filename "The test file" is located in /test/folder
The issue is that when I use a space or '/' as the delimiter, I cannot get the whole first column when printing.

If you have GNU AWK, this should do the trick:
awk 'match($0, /([^\/]+)([^ ]+) *Linux/, arr) { sub(/ +$/, "", arr[1]); printf("The filename \"%s\" is located in %s\n", arr[1], arr[2]) }' test.txt
Explanation:
# match and store groups in 'arr'
# - arr[1]: everything up until the first slash (including a lot of whitespace)
# - arr[2]: first slash until space
# - rest: also ensure there's 'Linux' after that
match($0, /([^\/]+)([^ ]+) *Linux/, arr) {
# trim whitespace from the right hand side of the filename
sub(/ +$/, "", arr[1]);
# print
printf("The filename \"%s\" is located in %s\n", arr[1], arr[2])
}
Note that there is also a less powerful version of match in other flavors of AWK and the same thing could be achieved with those, but you'd have to write a bit more code.

GNU awk has regex field separators, so just require two or more spaces separating your columns:
awk '/Linux/ { print "The file \""$1"\" is in "$2"." }' FS="  +" test.txt
It also offers fixed-width fields (see info gawk fieldwidths); you could use the lengths of the dash lines to set those on the fly.
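A sketch of that fixed-width idea (GNU awk only; assumes, as in the sample, that the dash ruler is on line 2 and column widths can be read off it):

```shell
# Hypothetical sketch: derive FIELDWIDTHS from the dashed ruler line,
# then treat every later line as fixed-width fields (GNU awk only).
gawk '
NR == 2 {
    n = split($0, ruler, / /)                 # one piece per dash run
    for (i = 1; i <= n; i++)
        fw = fw (i > 1 ? " " : "") (length(ruler[i]) + 1)  # +1 for the gap
    FIELDWIDTHS = fw                          # applies from the next record
    next
}
/Linux/ {
    name = $1; sub(/ +$/, "", name)           # trim right-padding
    dir  = $2; sub(/ +$/, "", dir)
    printf "The filename \"%s\" is located in %s\n", name, dir
}' test.txt
```

The last column's width overshoots by one, which gawk tolerates: a short final field simply gets whatever characters remain.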

I would propose sed, with a substitution based on a regular expression and back references, plus a grep command to eliminate the header lines of the source file:
$ cat test.txt | grep -E 'Linux[ ]*$' | sed -E 's%(.+)([^ ])([ ]+)(/.+)[ ]+Linux[ ]*$%The filename "\1\2" is located in \4%'
The filename "The test file" is located in /test/folder
The filename "Some file" is located in /
The filename "File name" is located in /Temp
The filename "Something_here" is located in /ABC
The filename "Another_test" is located in /DEFG
The filename "Label test" is located in /HIJK
A good reference for regular expressions (regex) is in the Linux manuals (e.g. man 7 regex).
The detailed description as requested in the comment:
grep with the -E option accepts extended regex (reference document above). Here it is used to filter the lines containing the word "Linux", followed by spaces if any, at the end of each line.
The output of grep goes into the input of sed
sed is passed the -E option, like grep, to accept extended regex. The s command substitutes the characters matching the regex (first part between % chars = "(.+)([^ ])([ ]+)(/.+)[ ]+Linux[ ]*$") with others (second part between % chars = "The filename "\1\2" is located in \4").
The second part uses back references: '\' followed by a nonzero decimal digit n is substituted by the nth parenthesized sub-expression of the regex. Here, \1 is substituted by the string which matches the 1st "(.+)" which is the filename here, \2 is substituted by the following "([^ ])" which is the last char of the filename (trick to suppress the following blanks from the name)...
This is not a rigorous explanation, but at least it provides some input to go further.
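A minimal back-reference demo, independent of the data above, may make the \1/\2 mechanics clearer:

```shell
# Each parenthesized group is captured; \2 \1 emits them in reverse order.
echo 'hello world' | sed -E 's/([[:alnum:]]+) ([[:alnum:]]+)/\2 \1/'
```

(On older seds, replace -E with -r, as noted below.)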
Another solution is to pass multiple commands on the sed command line. Hence, you can add a command to delete the first two header lines, which makes the pipe through cat and grep unnecessary. Here '1,2d' means "delete lines 1 and 2":
$ sed -E '1,2d;s%(.+)([^ ])([ ]+)(/.+)[ ]+Linux[ ]*$%The filename "\1\2" is located in \4%' test.txt
The filename "The test file" is located in /test/folder
The filename "Some file" is located in /
The filename "File name" is located in /Temp
The filename "Something_here" is located in /ABC
The filename "Another_test" is located in /DEFG
The filename "Label test" is located in /HIJK
NOTES: According to the manual, the -E option switches to using extended regular expressions. It has been supported for years by GNU sed, and is now included in POSIX.
On older systems, -r may be used if -E is not supported:
$ sed -r '1,2d;s%(.+)([^ ])([ ]+)(/.+)[ ]+Linux[ ]*$%The filename "\1\2" is located in \4%' test.txt
The filename "The test file" is located in /test/folder
The filename "Some file" is located in /
The filename "File name" is located in /Temp
The filename "Something_here" is located in /ABC
The filename "Another_test" is located in /DEFG
The filename "Label test" is located in /HIJK

Related

How to find a file which contains only one string at the top and rest of the lines are empty using sed

I want to find, from a list of files, which ones contain only one specific string at the top with the rest of the lines empty, using sed. Can someone help me with this? The text files I want to find have content like this:
line1:Some text
line2:blank line
line3:blank line
line4:blank line
.
.
.
.
.
.
so on
I have tried this command, but it deletes the empty lines. I do not want to delete the empty lines, but to find the files which consist of a specific string at the top with the rest of the lines empty:
sed -i "/^$/d" "file.txt"
sed is powerful and terse, but fairly unintelligent. GNU awk picks up the slack:
gawk '
FNR==1 && /Some text/ {a[FILENAME]++; next} #1
/./ {delete a[FILENAME]; nextfile} #2
END {for(f in a) print f} #3
' *.txt
If the first line of a file (FNR is the record number within the current file) matches the regex (which you should adjust to match your actual files), record the filename in an array and skip to the next input line.
If a line contains any character, remove the filename from the array and skip to the next file. (nextfile is not critical here, but will improve performance at scale.)
After all processing is completed, print all indices in array.
*.txt should be adjusted to match all the files you wish to test.
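If gawk is unavailable, a rough equivalent can be sketched with standard grep and head (with the same caveat as the /./ test above: lines containing only whitespace count as non-empty):

```shell
# Sketch: a file qualifies when it has exactly one non-empty line
# and that first line contains the wanted string.
for f in *.txt; do
    if [ "$(grep -c . "$f" 2>/dev/null)" = 1 ] &&
       head -n 1 "$f" | grep -q 'Some text'
    then
        echo "$f"
    fi
done
```

grep -c . counts the lines that contain at least one character, so "exactly 1" means everything but the first line is empty.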
If Perl is your option, would you please try the following:
perl -0777ne 'print "$ARGV\n" if /\A[^\n]+\r?\n(\r?\n)*\z/;' *.txt
sed is for doing simple s/old/new on individual strings, that is all. With GNU awk for nextfile and ENDFILE:
awk '
FNR==1 { if (!/Some text/) nextfile }
/./ { f=1; nextfile }
ENDFILE { if (f) print FILENAME; f=0 }
' *.txt
If you want lines of all-blanks to be considered as empty then change /./ to NF.
This might work for you (GNU sed):
sed -znE '/^[^\n]*some string[^\n]*\n(\s*\n)+$/F' file1 file2 file3 ...
This solution slurps in each file and uses pattern matching to identify file names (F) which contain some string in the first line of an otherwise empty file.

how to evaluate awk in a sed statement?

For each .fastq file in a folder, I need to append to each header line the name of the file the read is contained in.
Say the first 8 lines of fastq file read1.with.long.identifier.fastq are:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
I would like them to read:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT read1.with.long.identifier
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT read1.with.long.identifier
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
using:
cat read1.with.long.identifier.fastq | sed "/^#......:/ s/$/
awk "FILENAME" read1.with.long.identifier.fastq/" | tr "\t" "\n" >
read1_new_headers.fastq
However, this yields:
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT awk "FILENAME" read1.with.long.identifier.fastq
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT awk "FILENAME" read1.with.long.identifier.fastq
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
This is a non-iterative version. I know I can just take out awk and FILENAME, paste in the file name "read1.with.long.identifier", and get what I need,
but in the actual data I need to do this iteratively (awk FILENAME i...) for many files with different filenames, and need something that will evaluate the filename automatically. I'm obviously thinking about this incorrectly. How do you evaluate awk in a sed statement?
Now that I understand read1.with.long.identifier is actually a filename, my sample code is even simpler and requires no sed:
awk '/^#/{$0=$0 " " FILENAME }1' file1 file2 ... > all_output
It should append the current FILENAME to the end of any line that begins with #.
My test using data.txt as the file produced
#M04803:91:000000000-D3852:1:1102:14324:1448 1:N:0:GTGTCTCT+TGAGCAGT data.txt
TTTTGTTTCCTCTTCTTATTGTTATTCTTATGTTCATCTGGTATCCCTGCCTGATCCGTGTTCAACCTTGCGAATAGG
+
11111B1133B1111BF3BA33D3B3BDG331DBB33D3A1B1D12BB10BAA0B110//0B2221ABG11//AA/11
#M04803:91:000000000-D3852:1:1102:12470:1826 1:N:0:GTGTCTCT+AGAGCAGT data.txt
CCTGGGAGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAAGTTTCGAGTTTTC
+
1>>1A1B1>>>C1AAEFGGEADFGGHHHHHDGDFHHFHGGCAECGHHGFFHHHHFHHGFFEFHHHHHHHHGGHFGHHH
If you need to overwrite each file, that will require a for loop and temporary files. But without more feedback, I don't want to spend further time only to discover I'm heading in the wrong direction.
IHTH
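One detail from the desired output above: the .fastq extension is stripped. A hedged variant that removes it from a copy of FILENAME before appending (assuming the extension is literally .fastq):

```shell
# Append FILENAME minus its .fastq suffix to every header line.
awk '/^#/ { fn = FILENAME; sub(/\.fastq$/, "", fn); $0 = $0 " " fn } 1' \
    read1.with.long.identifier.fastq
```

Working on a copy (fn) leaves FILENAME itself untouched for later rules.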

How to prevent new line when using awk

I have seen several variations of this question, but none of the answers are helping for my particular scenario.
I am trying to load some files, adding a column for filename. This works fine only if I put the filename as the first column. If I put the filename column at the end (where I want it) it creates a new line between $0 and the rest of the print that I am unable to stop.
for f in "${FILE_LIST[@]}"
do
awk '{ print FILENAME,"\t",$0 } ' ${DEST_DIR_FILES}/$f > tmp ## this one works
awk '{ print $0,"\t",FILENAME } ' ${DEST_DIR_FILES}/$f > tmp ## this one does not work
mv tmp ${DEST_DIR_FILES}/$f
done > output
Example data:
-- I'm starting with this:
A B C
aaaa bbbb cccc
1111 2222 3333
-- I want this (new column with filename):
A B C FILENAME
aaaa bbbb cccc FILENAME
1111 2222 3333 FILENAME
-- I'm getting this (\t and filename on new line):
A B C
FILENAME
aaaa bbbb cccc
FILENAME
1111 2222 3333
FILENAME
Bonus question
I'm using a variable to pass the filename, but it is putting the whole path. What is the best way to only print the filename (without path) ~OR~ strip out the file path using a variable that holds the path?
It's almost certainly a line-endings issue, as your awk script is syntactically correct. I suspect your files in "${FILE_LIST[@]}" came from a Windows box and have \r\n line endings. To confirm the line endings for a given file, run the file command on it, i.e. file filename:
# create a test file
$ echo test > foo
# use unix2dos to convert to Windows style line endings
$ unix2dos foo
unix2dos: converting file foo to DOS format ...
# Use file to confirm line endings
$ file foo
foo: ASCII text, with CRLF line terminators
# Convert back to Unix style line endings
$ dos2unix foo
dos2unix: converting file foo to Unix format ...
$ file foo
foo: ASCII text
To convert your files to Unix style line endings (\n), run the following command:
$ for f in "${FILE_LIST[@]}"; do dos2unix "$f"; done
Explanation:
When FILENAME is the first string on the line, the carriage return \r effectively does nothing, as we are already at the start of the line. When we try to print FILENAME after any other characters, the \r sends the cursor back to the start of the line, and then the TAB and the FILENAME are printed, which makes them appear on a line of their own.
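As an alternative to converting the files, the CR can also be stripped inside awk itself; a small sketch ("myfile.txt" here stands in for FILENAME, since stdin has no filename):

```shell
# Remove a trailing \r from each record before appending the filename,
# so the appended text stays on the same visual line.
printf 'aaaa\tbbbb\r\n' |
    awk '{ sub(/\r$/, ""); print $0 "\t" "myfile.txt" }'
```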
Side note:
Awk has the variable OFS for setting the output field separator so:
$ awk '{print $0,"\t",FILENAME}' file
Can be rewritten as:
$ awk '{print $0,FILENAME}' OFS='\t' file
Bonus Answer
The best way, IMO, to strip the path of a file is to use the basename utility:
$ basename /tmp/foo
foo
Using command substitution:
$ awk '{print FILENAME}' $(basename /tmp/foo)
foo
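Alternatively, the path can be stripped inside awk itself by deleting everything up to the last slash; a sketch (the /tmp/foo file is just a demo name):

```shell
# Create a demo file, then strip the directory part of FILENAME in awk.
printf 'data\n' > /tmp/foo
awk '{ fn = FILENAME; sub(/.*\//, "", fn); print $0 "\t" fn }' /tmp/foo
```

This avoids a command substitution per file when processing many files in one awk run.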

zcat a file, output its contents to another file based on original filename

I'm looking to create a bash/perl script in Linux that will restore .gz files based on filename:
_path_to_file.txt.gz
_path_to_another_file.conf.gz
Where the underscores form the directory structure.. so the two above would be:
/path/to/file.txt
/path/to/another/file.conf
These are all in the /backup/ directory..
I want to write a script that will cat each .gz file into its correct location by changing the _ to / to find the correct path - so that the contents of _path_to_another_file.conf.gz replaces the text in /path/to/another/file.conf
zcat _path_to_another_file.conf.gz > /path/to/another/file.conf
I've started by creating a file with the correct destination filenames in it.. I could create another file to list the original filenames in it and have the script go through line by line?
ls /backup/ |grep .gz > /backup/backup_files && sed -i 's,_,\/,g' /backup/backup_files && cat /backup/backup_files
Whatcha think?
Here's a Bash script that should do what you want:
#!/bin/bash
for f in *.gz; do
    n=$(echo "$f" | tr _ /)
    zcat "$f" > "${n%.*}"
done
It loops over all files that end with .gz, and extracts them into the path represented by their filename with _ replaced with /.
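One caveat: zcat will fail if a destination directory does not yet exist. A hedged variant (assuming, as in the question, that the leading _ maps to the root /) creates it first:

```shell
#!/bin/bash
for f in /backup/*.gz; do
    [ -e "$f" ] || continue            # skip if the glob matched nothing
    n=$(basename "$f" .gz | tr _ /)    # _path_to_file.txt.gz -> /path/to/file.txt
    mkdir -p "$(dirname "$n")"         # create /path/to if missing
    zcat "$f" > "$n"
done
```

basename "$f" .gz strips both the /backup/ prefix and the .gz suffix in one step.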
That's not necessarily an invertible mapping (what if the original file is named high_scores for instance? is that encoded specially, e.g., with double underscore as high__scores.gz?) but if you just want to take a name and translate _ to / and remove .gz at the end, sed will do it:
for name in /backup/*.gz; do
newname=$(echo "$name" |
sed -e 's,^/backup/,,' \
-e 's,_,/,g' \
-e 's/\.gz$//')
echo "zcat $name > $newname"
done
Make sure it works right (the above is completely untested!) then take out the echo, leaving:
zcat "$name" > "$newname"
(the quotes protect against white space in the names).

How does awk distinguish between arguments and input file?

This is my shell script, which receives a string as input from the user via stdin.
#!/bin/sh
printf "Enter your query\n"
read query
cmd=`echo $query | cut -f 1 -d " "`
input_file=`echo $query | cut -f 2 -d " "`
printf $input_file
if [ $input_file = "" ]
then
printf "No input file specified\n"
exit
fi
if [ $cmd = "summary" ]
then
awk -f summary.awk $input_file $query
fi
Let's say he enters
summary a.txt /o foo.txt
Now the cmd variable will take the value summary and input_file will take a.txt.
Isn't that right?
I want summary.awk to work on $input_file, based on what is present in $query.
My understanding is as follows :
The 1st command line argument passed is treated as input file.
e.g. : awk 'code' file arg1 arg2 arg3
only file is treated as input file
If the input file is piped, it doesn't see any of the arguments as input files.
e.g. : cat file | awk 'code' arg1 arg2 arg3
arg1 is NOT treated as input file.
Am I right?
The problem is
I get awk: cannot open summary (No such file or directory)
Why is it trying to open summary?
Because $query expands to several words, awk treats each of them as an input file name; summary is the next word after $input_file.
How do I fix this issue?
If there's no -f option, the first non-option argument is treated as the script, and the remaining arguments are input files. If there's a -f option, the script comes from that file, and the rest are input files.
If and only if there are no input file arguments, it reads the input from stdin. This means if you pipe to awk and also give it filename arguments, it will ignore the pipe.
This is the same as most Unix file-processing commands.
Try this:
awk -f summary.awk -v query="$query" "$input_file"
This will set the awk variable query to the contents of the shell variable $query.
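Inside summary.awk, the passed-in string can then be taken apart with split(); a hypothetical sketch of what that might look like:

```shell
# split() breaks the query variable into whitespace-separated words;
# parts[1] is the command, the rest are its arguments.
awk -v query='summary a.txt /o foo.txt' 'BEGIN {
    n = split(query, parts, " ")
    print "command:", parts[1]
    print "argc:", n - 1
}'
```

With only a BEGIN block, awk never tries to read any input files, so no stray "cannot open" errors can occur.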