Batch renaming files with text from file as a variable - variables

I am attempting to rename the files {out1.hmm, out2.hmm, ... , outn.hmm} to unique identifiers taken from the third line of each file {PF12574.hmm, PF09847.hmm, PF0024.hmm}. The script works on a single file; however, the variable does not get overwritten, and only one file remains after running the command below:
for f in *.hmm;
do output="$(sed -n '3p' < $f |
awk -F ' ' '{print $2}' |
cut -f1 -d '.' | cat)" |
mv $f "${output}".hmm; done;
The first line loops over all the outn.hmm files as input. The second line sets a variable to the desired unique identifier; sed, awk, and cut are used to extract it. The variable is supposed to rename the current file to the unique identifier; however, the variable remains locked and each file overwrites the previous one.
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm
How can I overwrite the variable to get the following file structure:
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm PF09847.hmm PF0024.hmm

You're piping the (empty) output of the assignment to the variable named "output" into the mv command. Each part of a pipeline runs in its own subshell, so that variable is not set when mv expands "${output}". What I think will happen is that you will, one after the other, rename all the files that match *.hmm to the file named ".hmm".
Try ls -a to see if that's what actually happened.
The sed, awk, cut, and (unneeded) cat are a bit much. awk can do all you need. Then do the mv as a separate command:
for f in *.hmm
do
output=$(awk 'NR == 3 {print $2}' "$f")
mv "$f" "${output%.*}.hmm"
done
Note that the above does not do any checking to verify that output is assigned to a reasonable value: one that is non-empty, that is a proper "identifier", etc.
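A minimal sketch of what such checking could look like, written for a POSIX shell. The function name rename_hmm, the "letters, digits and underscore" rule for a valid identifier, and the decision to skip rather than overwrite an existing target are all assumptions made for illustration, not part of the answer above.

```shell
# rename_hmm DIR: rename each DIR/*.hmm after field 2 of its third line.
rename_hmm() {
    dir=$1
    for f in "$dir"/*.hmm; do
        [ -e "$f" ] || continue                 # the glob matched nothing
        id=$(awk 'NR == 3 { print $2; exit }' "$f")
        id=${id%.*}                             # e.g. PF12574.8 -> PF12574
        case $id in
            '' | *[!A-Za-z0-9_]*)               # assumed identifier shape
                echo "skip $f: no usable identifier on line 3" >&2
                continue ;;
        esac
        target=$dir/$id.hmm
        if [ -e "$target" ]; then               # never overwrite an existing file
            echo "skip $f: $target already exists" >&2
            continue
        fi
        mv -- "$f" "$target"
    done
}
```

Calling rename_hmm . would process the current directory; files that are skipped keep their original names and are reported on stderr.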

Related

Using awk to find and replace strings in every file in directory

I have a directory full of output files, with files with names:
file1.out,file2.out,..., fileN.out.
There is a certain key string in each of these files, lets call it keystring. I want to replace every instance of keystring with newstring in all files.
If there was only one file, I know I can do:
awk '{gsub("keystring","newstring",$0); print $0}' file1.out > file1.out.tmp && mv file1.out.tmp file1.out
Is there a way to loop through all N files in awk?
You could use the find command for this. Run it on a test file first, and only once it works run it on your actual path (on all your actual files), to be on the safe side. This also needs GNU awk 4.1.0 or later, which provides the inplace extension for writing the output back into the files themselves.
find your_path -type f -name "*.out" -exec awk -i inplace -f myawkProgram.awk {} +
Where your awk program is as follows, per your shown samples (cat myawkProgram.awk is only there to show the contents of the program):
cat myawkProgram.awk
{
gsub("keystring","newstring",$0); print $0
}
A 2nd option would be to pass all the .out files to the gawk program itself with -i inplace, by doing something like the following (but again, run it on a single test file first, and run the actual command only once you are convinced it works):
awk -i inplace '{gsub("keystring","newstring",$0); print $0}' *.out
sed is the ideal tool for this, so integrating it with find:
find /directory/path -type f -name "*.out" -exec sed -i 's/keystring/newstring/g' {} +
This finds files with the extension .out and then runs the sed command on them, passing as many of the found files as possible to each sed invocation (the + terminator of -exec). With GNU sed, -i edits the files in place.
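A small way to try the find/sed combination safely is to run it against a throwaway directory first. The sketch below assumes made-up sample files and uses -i.bak instead of plain -i, so that each original is kept next to the edited file with a .bak suffix.

```shell
# Throwaway directory with two sample files (contents invented for this demo).
demo_dir=$(mktemp -d)
printf 'a keystring b\nkeystring keystring\n' > "$demo_dir/file1.out"
printf 'nothing to change\n' > "$demo_dir/file2.out"

# -i.bak edits in place and keeps the original as <name>.bak
find "$demo_dir" -type f -name '*.out' -exec sed -i.bak 's/keystring/newstring/g' {} +

# Compare the edited file with its backup.
cat "$demo_dir/file1.out"
cat "$demo_dir/file1.out.bak"
```

Once the result looks right, the .bak files can be deleted, or the same command can be pointed at the real directory.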

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something similar?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing the current line and go to the next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
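The steps above can be tried on a tiny data set before running them on the real files. In this sketch the directory, the Liver file and the GENE_A/GENE_B rows are invented for illustration; only the awk program itself comes from the answer.

```shell
demo_dir=$(mktemp -d)
# Two wanted gene IDs.
printf 'ENSG00000128274\nENSG00000235146\n' > "$demo_dir/allGenes.txt"
# Two made-up eGene files; one line in each has a wanted ID in column 1.
printf 'ENSG00000238009 RP11-34P13.7 1\nENSG00000235146 RP5-857K21.2 1\n' \
    > "$demo_dir/Stomach.v7.egenes.txt"
printf 'ENSG00000128274 GENE_A 22\nENSG00000000003 GENE_B X\n' \
    > "$demo_dir/Liver.v7.egenes.txt"

# First file fills the lookup table; every later file is filtered against it.
result=$(cd "$demo_dir" &&
    awk 'NR==FNR { gene[$1]; next } ($1 in gene)' allGenes.txt *.v7.egenes.txt)
printf '%s\n' "$result"
```

Unlike grep with several input files, this prints no file name prefix; printing FILENAME alongside $0 would restore that information if it is needed.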
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try with a while read loop :
#!/bin/bash
while read -r line; do
grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
Here the while loop reads every line from allGenes.txt and, for each line, checks whether there are matching lines in the egenes file. Would that do the trick?
EDIT :
New version :
#!/bin/bash
for name in $(cat allGenes.txt); do
grep -rnw *v7.egenes.txt* -e "$name" >> output.txt
done

extract part of column to make cp command [duplicate]

I want to create a copy command that copies files from one directory to its parent directory, removing the date suffix. There are multiple files.
e.g. for the file LOAN.DAILY.20191204
I want to create the command
cp LOAN.DAILY.20191204 ../LOAN.DAILY
My attempt
ls -lrt | awk ' /DAILY/{ print "cp " , $9 , "../" , sub(/\.20191204$/,""); $9 }'
getting output
cp LOAN.DAILY.20191204 ../ 1
Why does this 1 appear?
This might work for you (GNU sed):
ls *DAILY* | sed -E 's#^(.*)\..*#cp & ../\1#'
and once the output has been checked, use this version to enact the copy:
ls *DAILY* | sed -E 's#^(.*)\..*#cp & ../\1#e'
or an alternative using GNU parallel:
parallel --dry-run cp {} ../{.} ::: *DAILY*
again, check the result and, if all is OK, use:
parallel cp {} ../{.} ::: *DAILY*
One simple way:
shopt -s nullglob
for file in *.DAILY.* ; do cp "$file" ../"${file%.*}"; done
shopt -s nullglob: avoids any unnecessary copies in case the glob doesn't get a match.
"${file%.*}": the shell's parameter expansion that strips the shortest suffix matching .* from the string, i.e. everything from the last . to the end.
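A quick way to see what the expansion does is to compare % (shortest matching suffix) with %% (longest matching suffix) on the file name from the question. The variable names short and long are only illustrative.

```shell
file=LOAN.DAILY.20191204

short=${file%.*}     # removes the shortest suffix matching ".*"
long=${file%%.*}     # removes the longest suffix matching ".*"

printf '%s\n' "$short"   # LOAN.DAILY
printf '%s\n' "$long"    # LOAN
```

For this task % is the right choice, since only the trailing date component should be dropped.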
I can't recall better and shorter ways to do this, although I suppose there are many.
According to https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html:
As mentioned, the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it.
This explains why you get a 1 in the output.
If you want to modify the value of the ninth column you need to specify it in the sub call:
ls -lrt | awk ' /DAILY/{ orig=$9; sub(/\.20191204$/,"", $9); print "cp", orig, "../" $9 }'
In this command, the original value of $9 is stored in a variable orig, then the date suffix is removed using sub, and finally the cp command is constructed using the old and new values.
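The difference between the return value of sub() and the text it modifies can be seen in isolation. This sketch feeds one file name to awk directly and uses a generic "dot followed by digits" pattern, which is an assumption about how the date suffix looks.

```shell
# sub() returns the number of substitutions made (0 or 1);
# the edited text ends up in the target, here $0.
result=$(printf 'LOAN.DAILY.20191204\n' |
    awk '{ n = sub(/\.[0-9]+$/, ""); print n, $0 }')
printf '%s\n' "$result"   # 1 LOAN.DAILY
```

Printing the return value, as the original attempt did, therefore shows the count rather than the shortened name.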

How to use sed/awk to replace the original file and get the following desired output?

I'm writing a bash script that translates one file into another, and I am encountering an issue.
Whenever the program sees something like this (...... stands for text not included):
......Mul(-a1+b2-c3...+f+e)......
change it to:
......M(-a1)*M(b2)*M(-c3)*...*M(f)*M(e)......
The number of variables in Mul is unknown, and there could be multiple occurrences of Mul in the file. There are also other places in the file where + or - appears, and variables can be one or more characters long.
I tried grouping in sed, with a group followed by a "*", but it doesn't seem to work because an unknown number of variables has to be replaced.
Here is a sed script that will do it:
:a
s/\(Mul(.[^)]*\)\([+-].\)/\1)*Mul(\2/
ta
s/Mul(+\{0,1\}/M(/g
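The trick is to use the t (test) command to jump back to the label :a after each successful substitution, so that every pass splits off one more term (e.g. "Mul(a+b+c)" => "Mul(a+b)*Mul(+c)" => "Mul(a)*Mul(+b)*Mul(+c)"). The final substitution then renames Mul( to M( and drops a leading +.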
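One way to run the script without saving it to a file is to pass each line as its own -e expression. The wrapper function name expand_mul is made up for this sketch; the sed commands are the ones from the script above.

```shell
# expand_mul: read text on stdin, rewrite Mul(...) groups, write to stdout.
expand_mul() {
    sed -e ':a' \
        -e 's/\(Mul(.[^)]*\)\([+-].\)/\1)*Mul(\2/' \
        -e 'ta' \
        -e 's/Mul(+\{0,1\}/M(/g'
}

result=$(printf '%s\n' '......Mul(-a1+b2-c3+f+e)......' | expand_mul)
printf '%s\n' "$result"   # ......M(-a1)*M(b2)*M(-c3)*M(f)*M(e)......
```

Saving the four lines to a file and running sed -f with that file name on the input should behave the same way.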
$ cat tst.awk
match($0,/Mul\([^()]+\)/) {
tgt = substr($0,RSTART+4,RLENGTH-5)
gsub(/[-+][[:alnum:]]+/,"*M(&)",tgt)
gsub(/\+/,"",tgt)
sub(/^\*/,"",tgt)
print substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
......M(-a1)*M(b2)*M(-c3)*M(f)*M(e)......
The above was run on this input file:
$ cat file
......Mul(-a1+b2-c3+f+e)......
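As written, tst.awk rewrites only the first Mul(...) on a line and prints only lines that contain one. The following is a hedged sketch of one possible extension, not the answerer's code: it loops over every occurrence, passes other lines through unchanged, and also handles a first term without a sign. The function name expand_mul_awk is invented, and [A-Za-z0-9_] is used in place of [[:alnum:]] on the assumption that variable names are ASCII.

```shell
expand_mul_awk() {
    awk '{
        out = ""
        # Consume the line one Mul(...) group at a time.
        while (match($0, /Mul\([^()]+\)/)) {
            tgt = substr($0, RSTART + 4, RLENGTH - 5)   # text inside the parentheses
            if (tgt !~ /^[-+]/) tgt = "+" tgt           # give the first term a sign
            gsub(/[-+][A-Za-z0-9_]+/, "*M(&)", tgt)     # wrap every signed term
            gsub(/\+/, "", tgt)                         # drop the plus signs
            sub(/^\*/, "", tgt)                         # drop the leading star
            out = out substr($0, 1, RSTART - 1) tgt
            $0 = substr($0, RSTART + RLENGTH)           # keep scanning the remainder
        }
        print out $0                                    # also prints lines without Mul
    }'
}

printf '%s\n' 'x=Mul(-a1+b2)+Mul(c-d);' 'y = a + b' | expand_mul_awk
```

Operators outside the parentheses are left untouched because only the text captured between Mul( and ) is rewritten.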

Bash script process csv file line by line while updating $6 with different value but keeping other values unchanged

I am a beginner at bash scripting and I have been trying to fix this for more than 8 hours.
I have searched on Stack Overflow and tried to adapt the answers to my needs, but without success.
I want to use bash script to change csv file's date value to current date.
I am using a dummy .csv file ( http://eforexcel.com/wp/wp-content/uploads/2017/07/100-Sales-Records.zip ) and I want to change the 6th value (date) to the current date.
What I have been doing so far:
I have created one line csv to test the script
cat oneline.csv:
Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
then I have tested the one line script:
echo `cat oneline.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'` `date` `cat oneline.csv |awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'
then I have this code for the whole 100 line files in source.sh:
#I want to change 6th value for every line of source.csv to current date and keep the rest and export it to output.csv
while read
do
part1=$(`cat source.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'`)
datum=$(`date`)
part2=$(`cat source.csv |awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'`)
echo `$part1 $datum $part2`
done
and I expect to run the command like ./source.sh > output.csv
What I want for the full 100 lines file is to have result like:
Food,Offline,H,Thu Jan 17 06:34:03 EST 2019,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
Could you guide me how to change the code to get the result?
Refactor everything to a single Awk script; that also avoids the echo in backticks.
awk -v datum="$(date)" -F , 'BEGIN { OFS=FS }
{ $6 = datum } 1' source.csv >output.csv
Briefly, we split on comma (-F ,) and replace the value of the sixth field with the value of the variable we passed in with -v. OFS=FS sets the output field separator to the input field separator (comma). Then the 1 means "print unconditionally".
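The same program can be wrapped in a small function to try it on a single line first. The function name set_field6 and the ISO date format are assumptions for this sketch; the awk program is the one described above.

```shell
# set_field6 VALUE: read CSV on stdin, replace field 6 with VALUE.
set_field6() {
    awk -v datum="$1" -F , 'BEGIN { OFS = FS } { $6 = datum } 1'
}

# Try it on a shortened sample line with the current date.
printf '%s\n' 'Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933' |
    set_field6 "$(date +%Y-%m-%d)"
```

Note that splitting on every comma is only safe when no field contains a quoted comma; a CSV file with such fields would need a real CSV parser.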
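Generally speaking, you should probably avoid while read loops for transforming a file line by line; a single Awk (or sed) process is typically faster and less error-prone.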
Tangentially, your quoting looks wacky; you don't want backticks around $part1 unless it is a command you want the shell to run (which in turn is probably a bad idea in itself). Also, backticks have long been deprecated in favor of $(command) syntax which is more legible and offers some syntactic advantages.