Concatenating multiple files into a single line in Linux with awk

I have 3 FASTA files, like the following:
>file_1_head
haszhaskjkjkjkfaiezqbsga
>file_1_body
loizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdja
>file_1_tail
mnnbasnbdnztoaosdhgas
I would like to concatenate them into a single file like the following:
>file_1
haszhaskjkjkjkfaiezqbsgaloizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdjamnnbasnbdnztoaosdhgas
I tried with the cat command, cat file_1_head.fasta file_1_body.fasta file_1_tail.fasta, but it didn't concatenate them into a single line like the above. Is it possible with awk? Kindly guide me.

Do you mean your three files have the content
file_1_head.fasta
>file_1_head
haszhaskjkjkjkfaiezqbsga
file_1_body.fasta
>file_1_body
loizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdja
and file_1_tail.fasta
>file_1_tail
mnnbasnbdnztoaosdhgas
including the name of each file within it as the first line?
Then you could do
(echo ">file_1"; tail -qn -1 file_1_{head,body,tail}.fasta | tr -d "\n\t ") > file_1.fasta
to get file_1.fasta as
>file_1
haszhaskjkjkjkfaiezqbsgaloizztzezzqieovbahsgzqwqoiropoqiwoioioiweoitwwerweuiruwieurhcabccjashdjamnnbasnbdnztoaosdhgas
This also removes some extra whitespace at the end of the lines in your input that I got when I copied them verbatim.

You can do this simply with
cat file1 file2 file3 | tr -d '\n' > new_file
tr deletes the newline character.
EDIT:
For your specific first line just do
echo ">file_1" > new_file
cat file1 file2 file3 | tr -d '\n' >> new_file
The first command creates the file with the single header line >file_1 in it. The cat... command then appends the concatenated contents to this file.
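Note that if the three input files still contain their > header lines, the commands above will embed those headers in the sequence. A sketch that strips them first (assuming every header line starts with >, as is usual for FASTA):
echo ">file_1" > new_file
grep -hv '^>' file1 file2 file3 | tr -d '\n' >> new_file
echo >> new_file
The final echo restores the trailing newline that tr removed.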

What about this?
awk 'BEGIN{RS=""; print ">file_1"} {for (i=2;i<=NF;i++) printf "%s",$i} END{print ""}' f1_head f1_body f1_tail
With RS="" each file is read as one whitespace-separated record, so $1 is the > header and fields 2 through NF are the sequence chunks; the END rule adds the final newline.

Related

Convert multiple lines to a line separated by brackets and "|"

I have the following data in multiple lines:
1
2
3
4
5
6
7
8
9
10
I want to convert them to lines separated by "|" and "()":
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|10
I made a mistake, I'm sorry. I want to convert them to lines separated by "|" and "()":
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
What I have tried is:
seq 10 | sed -r 's/(.*)/(\1)/'|paste -sd"|"
What's the best Unix one-liner to do that?
This might work for you (GNU sed):
sed 's/.*/(&)/;H;1h;$!d;x;s/\n/|/g' file
Surround each line by parens.
Append all lines to the hold space except for the first line which replaces the hold space.
Delete all lines except the last.
On the last line, swap to the hold space and replace all newlines by |'s.
N.B. When a line is deleted no further commands are invoked and the command cycle begins again. That is why the last two commands are only executed on the last line of the file.
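To see it in action on a smaller input:
$ seq 3 | sed 's/.*/(&)/;H;1h;$!d;x;s/\n/|/g'
(1)|(2)|(3)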
Alternative:
sed -z 's/\n$//;s/.*/(&)/mg;y/\n/|/' file
With your shown samples, please try the following awk code. This should work in any version of awk.
awk -v OFS="|" '{val=(val?val OFS:"") "("$0")"} END{print val}' Input_file
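For example:
$ seq 10 | awk -v OFS="|" '{val=(val?val OFS:"") "("$0")"} END{print val}'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)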
Using GNU sed
$ sed -Ez ':a;s/([0-9]+)\n/(\1)|/;ta;s/\|$/\n/' input_file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here is another simple awk command:
awk 'NR>1 {printf "%s|", p} {p="(" $0 ")"} END {print p}' file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here it is:
sed -z 's/^/(/;s/\n/)|(/g;s/|($//' your_input
where -z allows you to treat the whole file as a single string with embedded \ns.
In detail, the sed script above consists of 3 commands separated by ;s:
s/^/(/ inserts a ( at the beginning of the whole file,
s/\n/)|(/g changes every \n to )|(;
s/|($// removes the trailing |( resulting from the \n at EOF, which is likely present in your file since you are on Linux.
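For example:
$ seq 10 | sed -z 's/^/(/;s/\n/)|(/g;s/|($//'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
(Note that the result is printed without a trailing newline, since the final \n was consumed by the second substitution.)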
With perl:
$ seq 10 | perl -pe 's/.*/($&)/; s/\n/|/ if !eof'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
s/.*/($&)/ to surround input lines with ()
s/\n/|/ if !eof will change newline to | except for the last input line.
Here's a solution with paste (just for fun):
$ seq 10 | paste -d'()' /dev/null - /dev/null | paste -sd'|'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Using any awk:
$ seq 10 | awk '{printf "%s(%s)", sep, $0; sep="|"} END{print ""}'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)

Extract lines after a pattern

I have 50 files in a folder and all contain a common pattern "^^". I want to print everything after "^^", prefixed with the filename, and write all the extracted lines to one output file. While my code works fine with a single file, it doesn't work across all the files.
awk '/\^^/{getline; getline; print FILENAME; print}' *.txt > output
Example
1.txt
ghghh hghg
ghfg hghg hjg
jhhkjh
kjhkjh kjh
^^
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
hghjhg hgj
jhgj
jhgjh kjgh
jhg
^^
bbbbbbbbbbbbbbbbbbbbbbb
Desired output.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
My actual output
1.txt
ghghh hghg
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzz
To print the line after ^^, try:
$ awk 'f{print FILENAME ORS $0; f=0} /\^\^/{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
How it works:
f{print FILENAME ORS $0; f=0}
If variable f is true (nonzero), print the filename, the output record separator, and the current line. Then set f back to zero.
/\^\^/{f=1}
If the current line contains ^^, set f to one.
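One subtlety: f is not reset at file boundaries, so if ^^ happened to be the very last line of one file, the first line of the next file would be printed under the wrong filename. Guarding against that is a small change (a sketch):
$ awk 'FNR==1{f=0} f{print FILENAME ORS $0; f=0} /\^\^/{f=1}' *.txt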
Alternatively, printing each filename at the start of its file and resetting a flag that turns on printing after an exact ^^ match:
$ awk 'FNR==1{print FILENAME; f=0} f; $1=="^^"{f=1}' *.txt
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
I like a more "bash(ish)" approach.
grep -Hn '^^' *.txt |
cut -d: -f1,2 --output-delimiter=' ' |
while read f n; do echo "$f"; tail -n +$((n+1)) "$f"; done
grep -Hn prints the filename and line number of each match.
With cut we keep only the two fields we need.
In the loop we read those two values into variables, to use them as we need.
tail can print not only the last N lines, but also every line from line N onward, if you use the plus sign.
We do arithmetic inside $((...)) to skip the pattern line itself.
This solves your issue, and it can print all lines after the pattern, not only the next one.
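With the sample files above, the intermediate output of grep -Hn '^^' *.txt would look something like:
1.txt:5:^^
2.txt:5:^^
which cut then reduces to the filename and line-number pairs the loop reads.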
use awk:
awk 'FNR==1{print FILENAME} FNR==1,/\^\^/{next}1' *.txt
Where:
print FILENAME when FNR == 1
FNR==1,/\^\^/{next}: all lines between FNR==1 and the first line matching ^^ will be skipped
1 at the end to print the rest of lines after the matched ^^ line
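With the sample files this prints:
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb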
The following produces output only for files that match our pattern:
awk 'FNR==1 { f=0 }; f; /\^\^/ { f=1; print FILENAME }' *.txt > output
Reset flag f on every new file.
Print if f is set.
Set f and print FILENAME if we match our pattern.
This one prints out the FILENAME regardless of whether the pattern matches:
awk 'FNR==1 { f=0; print FILENAME }; f; /\^\^/ { f=1 }' *.txt > output
We can adjust the pattern matching in step 3 in accord with whatever is required... exact matching for instance can be done with $0=="^^".
Assuming your files are named 1.txt through 50.txt:
for f in {1..50}.txt; do
  sed -nE "/^\^\^\s*$/{N;s/.+\n(.+)/$f\n\1/p}" "$f" > "$f.result.txt"
done
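Since this writes one result file per input file, the single combined output the question asks for can then be collected with, e.g.:
cat *.result.txt > output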
Stealing from some answers and comments to your previous question on this topic, you can also use grep -A and format the output with sed.
$ grep -A100 '^^' *.txt | sed '/\^^/d;/--/d;s/-/\n/'
1.txt
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
2.txt
bbbbbbbbbbbbbbbbbbbbbbb
Assuming 100 lines is sufficient, and that you don't have hyphens of your own.
If you only need one line, use -A1
This might work for you (GNU sed):
sed -s '1,/^^^/{/^^^/F;d}' file1 file2 file3 ... >fileOut
Here -s treats the files as separate streams; in each file, the lines from the first line through the ^^ marker are deleted, with F printing the current filename when the marker line is reached, and everything after the marker prints as usual.

awk print overwrite strings

I have a problem using awk in the terminal.
I need to move many files as a group from the current directory to another one, and I have the list of the necessary files in a text file, as:
filename.txt
file1
file2
file3
...
I usually type:
paste filename.txt | awk '{print "mv "$1" ../dir/"}' | sh
and it executes:
mv file1 ../dir/
mv file2 ../dir/
mv file3 ../dir/
It usually works, but now the command has changed its behaviour: awk overwrites the start of each line with the final string ../dir/, as if printing restarted from the beginning of the line, producing:
../dire1 ../dir/
../dire2 ../dir/
../dire3 ../dir/
and of course it cannot be executed.
What's happened?
How do I solve it?
Your input file contains carriage returns (\r aka control-M). Run dos2unix on it before running a UNIX tool on it.
idk what you're using paste for though, and you should not be using awk for this at all anyway; it's just a job for a simple shell script. Remove the echo once you've tested this:
$ < file xargs -n 1 -I {} echo mv "{}" "../dir"
mv file1 ../dir
mv file2 ../dir
mv file3 ../dir
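If dos2unix isn't available, stripping the carriage returns with tr first works just as well (a sketch, assuming your list is in filename.txt):
$ tr -d '\r' < filename.txt | xargs -n 1 -I {} echo mv "{}" "../dir"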

How to write awk -F commands

#!/bin/bash
cat $1 | awk ' /Info/, /<\/Body>/ {print $0}' | while read line; do
file=`awk -F '>' "{print $4}"`
echo "$file"
done
Basically, the input file has some information filtered out by the first awk command. What I'm trying to do next is extract a value using awk -F and print what comes after the third >, which is the 4th field. I cannot just search for the > itself, because the file has hundreds of them, since it's HTML.
OK, maybe someone can answer this: when I run the script now, it does not look at the fourth field; it just removes all of the > characters, which is not the goal. I am trying to locate the field that comes after the third >, which would be field 4, but that's not what I'm getting. Any help would be great!
An awk command requires 2 parts:
Options
A file to work on
In your example you have given the options, i.e. the delimiter and what to print, but you have not mentioned which file to work on.
Try this
At the command prompt:
cat file | awk -F ">" '{print $4}'
In a script:
result=`cat file | awk -F ">" '{print $4}'`
echo $result
For a text file "file" containing the data
a>b>c>d>e>f
both of the above will display 'd'.
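As an aside, the immediate bug in the script at the top of this question is that the awk program is in double quotes, so the shell expands $4 to an empty string before awk ever runs; awk is also given no input file, so it silently reads (and consumes) the loop's stdin. A sketch of a fix inside the loop, using single quotes and a here-string to feed awk the current line:
file=$(awk -F '>' '{print $4}' <<< "$line")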

Removing blank lines

I have a csv file in which every other line is blank. I have tried everything, and nothing removes the lines. What should make it easier is that the digits 44 appear in each valid line. Things I have tried:
grep -ir 44 file.csv
sed '/^$/d' <file.csv
cat -A file.csv
sed 's/^ *//; s/ *$//; /^$/d' <file.csv
egrep -v "^$" file.csv
awk 'NF' file.csv
grep '\S' file.csv
sed 's/^ *//; s/ *$//; /^$/d; /^\s*$/d' <file.csv
cat file.csv | tr -s \n
I decided I was imagining the blank lines, but import the file into Google Sheets and there they are still! Starting to question my sanity! Can anyone help?
sed -n -i '/44/p' file
-n means suppress automatic printing
-i means in place (overwrite the same file)
/44/p prints lines where '44' exists
Without relying on '44' being present:
sed -i '/^\s*$/d' file
\s matches whitespace, ^ anchors the start of the line, $ the end of the line, and d deletes the line.
Use the -i option to replace the original file with the edited one.
sed -i '/^[ \t]*$/d' file.csv
Alternatively, output to another file and rename it, which does exactly what -i does.
sed '/^[[:blank:]]*$/d' file.csv > file.csv.out && mv file.csv.out file.csv
Given:
$ cat bl.txt
Line 1 (next line has a tab)
Line 2 (next has several space)
Line 3
You can remove blank lines with Perl:
$ perl -lne 'print unless /^\s*$/' bl.txt
Line 1 (next line has a tab)
Line 2 (next has several space)
Line 3
awk:
$ awk 'NF>0' bl.txt
Line 1 (next line has a tab)
Line 2 (next has several space)
Line 3
sed + tr:
$ cat bl.txt | tr '\t' ' ' | sed '/^ *$/d'
Line 1 (next line has a tab)
Line 2 (next has several space)
Line 3
Just sed:
$ sed '/^[[:space:]]*$/d' bl.txt
Line 1 (next line has a tab)
Line 2 (next has several space)
Line 3
Aside from the fact that your commands do not show that you capture their output in a new file to be used in place of the original, there's nothing wrong with them, EXCEPT that:
cat file.csv | tr -s \n
should be:
cat file.csv | tr -s '\n' # more efficient alternative: tr -s '\n' < file.csv
Otherwise, the shell eats the \ and all that tr sees is n.
Note, however, that the above eliminates only truly empty lines, whereas some of your other commands also eliminate blank lines (empty or all-whitespace).
Also, the -i (for case-insensitive matching) in grep -ir 44 file.csv is pointless, and while using -r (for recursive searches) will not change the fact that only file.csv is searched, it will prepend the filename followed by : to each matching line.
If you have indeed captured the output in a new file and that file truly still has blank lines, the cat -A (cat -et on BSD-like platforms) you already mention in your question should show you if any unusual characters are present in the file, in the form of ^<char> sequences, such as ^M for \r chars.
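For example, if cat -A shows ^M$ at the end of each line, the "blank" lines actually contain carriage returns (Windows line endings), so none of the /^$/d variants will match them. A sketch that strips the \r characters and then the genuinely empty lines (GNU sed):
sed -i 's/\r$//; /^$/d' file.csv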
If you like awk, this should do:
awk '/44/' file
It will print only the lines that contain 44.