zcat a file, output its contents to another file based on original filename - backup

I'm looking to create a bash/perl script in Linux that will restore .gz files based on filename:
_path_to_file.txt.gz
_path_to_another_file.conf.gz
Where the underscores form the directory structure, so the two above would be:
/path/to/file.txt
/path/to/another/file.conf
These are all in the /backup/ directory.
I want to write a script that will cat each .gz file into its correct location by changing the _ to / to find the correct path, so that the contents of _path_to_another_file.conf.gz replace the text in /path/to/another/file.conf:
zcat _path_to_another_file.conf.gz > /path/to/another/file.conf
I've started by creating a file with the correct destination filenames in it. I could create another file listing the original filenames and have the script go through them line by line?
ls /backup/ | grep '\.gz$' > /backup/backup_files && sed -i 's,_,/,g' /backup/backup_files && cat /backup/backup_files
Whatcha think?

Here's a Bash script that should do what you want:
#!/bin/bash
for f in *.gz; do
    n=$(echo "$f" | tr _ /)    # _path_to_file.txt.gz -> /path/to/file.txt.gz
    zcat "$f" > "${n%.gz}"     # strip the trailing .gz and extract
done
It loops over all files that end with .gz, and extracts them into the path represented by their filename with _ replaced with /.
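If some of the target directories might be missing on the machine you restore to (an assumption; on the original host they presumably exist), a slightly more defensive variant creates them first:
#!/bin/bash
# Sketch only: restores every .gz in /backup, creating parent
# directories as needed.
cd /backup || exit 1
for f in *.gz; do
    n=$(echo "$f" | tr _ /)          # _path_to_file.txt.gz -> /path/to/file.txt.gz
    dest=${n%.gz}                    # drop the trailing .gz
    mkdir -p "$(dirname "$dest")"    # create /path/to if it is missing
    zcat "$f" > "$dest"
done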

That's not necessarily an invertible mapping (what if the original file is named high_scores, for instance? Is that encoded specially, e.g., with a double underscore, as high__scores.gz?), but if you just want to take a name, translate _ to /, and remove the .gz at the end, sed will do it:
for name in /backup/*.gz; do
    newname=$(echo "$name" |
        sed -e 's,^/backup/,,' \
            -e 's,_,/,g' \
            -e 's/\.gz$//')
    echo "zcat $name > $newname"
done
Make sure it works right (the above is completely untested!), then take out the echo, leaving:
zcat "$name" > "$newname"
(the quotes protect against white space in the names).
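If your backup job encodes literal underscores as double underscores (pure speculation; use whatever convention it actually follows), you can decode them with a placeholder character before the underscore-to-slash translation, e.g. in bash:
# Hypothetical example: __ stands for a literal _, a single _ for /.
name=_var_spool_high__scores.gz
base=${name%.gz}
base=${base//__/$'\x01'}    # protect literal underscores
base=${base//_//}           # remaining underscores become slashes
newname=${base//$'\x01'/_}  # restore the literal underscores
echo "$newname"             # prints /var/spool/high_scores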

Related

Using awk to find and replace strings in every file in directory

I have a directory full of output files, with files with names:
file1.out, file2.out, ..., fileN.out.
There is a certain key string in each of these files; let's call it keystring. I want to replace every instance of keystring with newstring in all the files.
If there were only one file, I know I could do:
awk '{gsub("keystring","newstring",$0); print $0}' file1.out > file1.out
Is there a way to loop through all N files in awk?
You could use the find command for this. Make sure you run it on a test file first; once that works, run it on your actual path (all your actual files), to be on the safe side. This also needs a newer version of gawk, which has the inplace option for saving the output back into the files themselves.
find your_path -type f -name "*.out" -exec awk -i inplace -f myawkProgram.awk {} +
Where your awk program is as follows, as per your shown samples (the cat myawkProgram.awk is only there to show the contents of the awk program):
cat myawkProgram.awk
{
    gsub("keystring","newstring",$0); print $0
}
A second option would be to pass all the .out files to your gawk program itself with -i inplace, with something like the following (but again, run it on a single test file first and only run the actual command once you are convinced):
awk -i inplace '{gsub("keystring","newstring",$0); print $0}' *.out
sed is arguably the ideal tool for this, so integrating it with find:
find /directory/path -type f -name "*.out" -exec sed -i 's/keystring/newstring/g' {} +
This finds files with the extension .out and then executes the sed command on as many of the found files at a time as possible (using + with -exec).
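GNU sed's -i also accepts a suffix, which keeps an untouched copy of every file it rewrites; that can be a cheaper safety net than a full test run:
# Each modified file keeps a pristine copy named file.out.bak (GNU sed).
find /directory/path -type f -name "*.out" -exec sed -i.bak 's/keystring/newstring/g' {} +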

Rename file using fasta header

I have multiple fasta files downloaded from NCBI and want to rename them with some part of the header:
Example of the header: >KY705281.1 Streptococcus phage P7955, complete genome
Example of filename: KY705281.fasta
The idea is to get rid of 'KY705281.1' and 'complete genome' so that only Streptococcus phage P7955 remains.
For example, one input file will be:
>KY705281.1 Streptococcus phage P7955, complete genome
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT
It will be renamed to KY705281.fasta with the content:
>Streptococcus phage P7955
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT
I'm a newbie with Linux, but from some Google searching I know that this could be done easily with some awk/sed/grep commands.
Any advice would be appreciated.
One way could be:
awk -F, 'FNR==1{match($1, "^>([^.]+)[^ ]+ (.*)", oFv); $1= ">" oFv[2]; sub(/ *complete genome */, "", $2);} {print > (oFv[1] ".fasta")}' somefiles*
This will keep the old files and write the corresponding new file(s).
Also, this assumes that the input files look like the one you gave, with a single header line first.
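For readability, here is the same program written out with comments (equivalent to the one-liner above; run it as awk -F, -f prog.awk somefiles*):
FNR == 1 {
    # Header line: capture the accession ("KY705281") in oFv[1] and the
    # name ("Streptococcus phage P7955") in oFv[2] (GNU awk 3-arg match).
    match($1, "^>([^.]+)[^ ]+ (.*)", oFv)
    $1 = ">" oFv[2]
    sub(/ *complete genome */, "", $2)
}
{
    # Every line of the current file goes to e.g. KY705281.fasta.
    print > (oFv[1] ".fasta")
}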
If you want to rename the old files as well as change their contents, then, given that your system has bash (and, I think, GNU awk & GNU sed), please back up your files and try this:
#!/usr/bin/bash
for file in somefiles*; do
    nn="$(awk -F'[>.]' '{printf $2 ".fasta"; exit}' "$file")"
    sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"
    if [ ! -f "$nn" ]; then
        mv "$file" "$nn"
    else
        echo "'$nn' exists, skipping '$file'; its content has already been changed." | tee -a _err_.log
    fi
done
Or as oneliner:
for file in somefiles*; do nn="$(awk -F'[>.]' '{printf $2 ".fasta"; exit}' "$file")"; sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"; if [ ! -f "$nn" ]; then mv "$file" "$nn"; else echo "'$nn' exists, skipping '$file'; its content has already been changed." | tee -a _err_.log; fi; done
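A cautious way to preview what the loop will do is to stub out the destructive commands first (my suggestion, not part of the answer above):
# Dry run: report the planned renames without touching anything.
for file in somefiles*; do
    nn="$(awk -F'[>.]' '{printf $2 ".fasta"; exit}' "$file")"
    echo "would rewrite the header in '$file' and rename it to '$nn'"
done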

Replace full path with one string if

I have a lot of paths in one file, and I need to change the paths to a specific string IF they contain .jpg or .pdf.
For example:
I have right now lines like those:
images/2015/somename.jpg
images/2012/somefile.pdf
The 2015/2012 could be 1997 or 2010, etc. - the year numbers are random.
I would like to change those lines with sed (if possible) like this:
images/somename.jpg
images/somefile.pdf
I tried like so:
echo "images/2015/image12345.jpg" | sed 's/images[^0-9]*[^0-9]/\/images/;'
but with no result.
Or maybe awk would be a good tool for this job?
$ cat file
images/2015/somename.c
images/123/somename.jpg
1234/2015/somename.jpg
images/2015/somename.jpg
images/2012/somefile.pdf
$ awk '/\.(jpg|pdf)$/ {sub("/[0-9]{4}/","/")} 1' file
images/2015/somename.c
images/123/somename.jpg
1234/somename.jpg
images/somename.jpg
images/somefile.pdf
With GNU sed: sed -r '/\.(jpg|pdf)$/s#/[0-9]+/#/#'
/\.(jpg|pdf)$/: Matches lines ending with .jpg or .pdf
/[0-9]+/: Matches any sequence of just digits between slashes
Example:
$ echo "images/2015/image12345.jpg" | sed -r '/\.(jpg|pdf)$/s#/[0-9]+/#/#'
images/image12345.jpg
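To apply it to the file itself rather than a pipeline, GNU sed's -i with a suffix edits in place while keeping a backup:
# Rewrites 'file' in place; the original survives as file.bak.
sed -ri.bak '/\.(jpg|pdf)$/s#/[0-9]+/#/#' file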

How to get few lines from a .gz compressed file without uncompressing

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where the lines between 5th and 10th lines (both inclusive) of file.gz are extracted into a new subFile. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which uses GNU sed's first~step addressing to extract the 1st line, jump over the next 4 lines, pick the following one (line 6), and so on.
If you want to use zcat, this will show the first 10 rows:
zcat your_filename.gz | head
Let's say you want the first 16 rows:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines but a range you can specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to a line number. The from and to variables are picked up from the command line via the -v options.
NR >= from {
    print NR, $0
    if (NR >= to)
        exit 1
}
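If you save that snippet as, say, range.awk (the name is just an example), the one-liner becomes:
gunzip -c file.gz | awk -v from=10 -v to=20 -f range.awk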

Shell Script Search and Delete Non Text Files

I want to write a shell script to search for and delete all non-text files in a directory.
I basically cd into the directory that I want to iterate through in the script and search through all files.
-- Here is the part I can't do --
I want to check using an if statement if the file is a text file.
If not I want to delete it
else continue
Thanks
PS: By the way, this is on Linux.
EDIT
I assume a file is a "text file" if and only if its name matches the shell pattern *.txt.
The file program outputs the word "text" when passed the name of a file that it determines to contain text. You can test for that output using grep. For example:
find -type f -exec file '{}' \; | grep -v '.*:[[:space:]].*text.*' | cut -d ':' -f 1
I strongly recommend printing out files to delete before deleting them, to the point of redirecting output to a file and then doing:
rm $(<filename)
after reviewing the contents of "filename". And beware of filenames with spaces; if you have those, things can get more involved.
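Putting that together into a review-then-delete workflow might look like this (a sketch; the temporary filename is arbitrary, and it assumes no newlines or colons in the filenames):
# Collect candidate non-text files for review.
find . -type f -exec file '{}' \; | grep -v ':[[:space:]].*text' | cut -d ':' -f 1 > /tmp/non_text_files
# Inspect /tmp/non_text_files by hand, then delete line by line
# (this handles spaces in names, unlike rm $(<filename)):
while IFS= read -r f; do rm -- "$f"; done < /tmp/non_text_files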
Use the opposite of the following, unless an if statement is mandatory:
find <dir-path> -type f -name "*.txt" -exec rm {} \;
What the opposite is exactly is an exercise for you. Hint: it comes before -name.
Your question was ambiguous about how you define a "text file"; I assume here that it's just a file with the extension ".txt":
find . -type f ! -name "*.txt" -exec rm {} +
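And if you do want the explicit if statement from the question, here is a sketch built on file(1) instead (it treats anything whose description mentions "text" as a text file, which is only an approximation):
#!/bin/bash
cd /path/to/dir || exit 1
for f in *; do
    [ -f "$f" ] || continue              # skip directories and the like
    if file -b "$f" | grep -q text; then
        continue                         # looks like text, keep it
    else
        rm -- "$f"
    fi
done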