awk to rename txt files using partial match in another file, appending the partial match to the txt files - awk

I am trying to use awk to rename all txt files in a directory based on a partial match to $2 of file. Each txt file whose name begins with a $2 string from file should be renamed using the corresponding $1 value; only the matching portion is replaced, and the part after it remains. The awk below (thank you @markp-fuso) executes, but every attempt I made returns the original txt file unchanged. Thank you :).
current directory structure
123_1_000.txt
456_2_101.txt
789_3_200.txt
file
aaa 123_1
bbb 456_2
ccc 789_3
desired directory
aaa_000.txt
bbb_101.txt
ccc_200.txt
awk
t=$(ls *.txt) | echo $t | awk '{sub(/([_^]*:){2}/,"")}1' | # list all txt files in directory
awk '
FNR==NR { map[$2]=$1; next } # store $2 from file in map array, goto next
{ split($0,a,".") # split on . using txt files and read in string into [a]
if (a[1] in map) # if [a] matches map
print "mv \"" $0 "\" \"" map[a[1]]* ".txt\"" # rename txt with map[a] string
}
' file - # read stdin as 2nd input

You were very close. Whether you use bash (or another advanced shell that provides process substitution) or you pipe the filenames and read from -, the following will rename the files in the current directory, removing the prefix [[:digit:]]+_[[:digit:]]+ and replacing it with the $1 value from file that corresponds to that prefix.
The key is to provide a relationship between the array holding the replacement values and the array holding the prefixes as its indexes. Below, the a[] array saves the $1 values from file indexed by line number, while the b[] array holds the prefix as its index with the line number stored as its value (you can use a counter if you like, e.g. ++n).
The match() function is used to fill the c[] array, which holds the prefix taken from the current filename. If that prefix exists in the b[] array, the current filename is saved, the new filename is created with sub() on $1, and the file is moved to the new name using the system() command with the command string built by concatenation in cmd.
awk '
FNR==NR {                                       # reading from file
    a[NR] = $1                                  # save $1 to array indexed by NR
    b[$2] = NR                                  # save $2 as index to array with NR as value
}
NR>FNR {                                        # handling filenames
    match($1,/^[[:digit:]]+_[[:digit:]]+/,c)    # save digits_digits prefix in c array
    if (c[0] in b) {                            # if prefix present in b array
        fname=$1                                # save current filename
        sub(c[0], a[b[c[0]]], $1)               # replace prefix with a[] value to form new filename
        cmd="mv "fname" "$1                     # create move command string
        system(cmd)                             # move current to new filename
    }
}
' file <(ls -1 *.txt)
(note: the process substitution <(ls -1 *.txt) is used to provide the filenames. If your shell does not provide that capability, pipe the filenames to the command and add - as a filename so awk reads stdin, as you have done. Also note that the three-argument form of match() used above requires GNU awk.)
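For example, if the awk program above were saved to a file (a hypothetical rename.awk here), the pipe form would look roughly like this:
# rename.awk is assumed to contain the awk program shown above, without the surrounding quotes
ls -1 *.txt | awk -f rename.awk file -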
Example Results
With your original filenames in the present directory and the contents in file, e.g.
$ cat file
aaa 123_1
bbb 456_2
ccc 789_3
the result of the above is:
$ ll
total 4
-rw-r--r-- 1 david david 0 Jan 6 16:09 aaa_000.txt
-rw-r--r-- 1 david david 0 Jan 6 16:09 bbb_101.txt
-rw-r--r-- 1 david david 0 Jan 6 16:09 ccc_200.txt
-rw-r--r-- 1 david david 30 Jan 6 16:09 file
Note, if you don't care about matching [[:digit:]] you can remove the prefix with /^[^_]+_[^_]+/ to match anything in the prefix up to the second '_'.
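With that change, the match() line above would become:
match($1, /^[^_]+_[^_]+/, c)    # matches any prefix up to the second underscore, digits or not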

Related

Split a file into multiple gzip files in one line

Is it possible to split a file into multiple gzip files in one line?
Let's say I have a very large file data.txt containing
A somedata 1
B somedata 1
A somedata 2
C somedata 1
B somedata 2
I would like to split each group into its own directory of gz files.
For example, if I didn't care about separating, I would do
cat data.txt | gzip -5 -c | split -d -a 3 -b 100000000 - one_dir/one_dir.gz.
And this will generate gz files of 100MB chunks under one_dir directory.
But what I want is separating each based on the first column. So I would like to have say 3 different directory, containing gz files of 100MB chunks for A, B and C respectively.
So the final directory will look like
A/
A.gz.000
A.gz.001
...
B/
B.gz.000
B.gz.001
...
C/
C.gz.000
C.gz.001
...
Can I do this in a one-liner using cat/awk/gzip/split? Can I also have it create the directory (if it doesn't exist yet)?
With awk:
awk '
!d[$1]++ {
    system("mkdir -p "$1)
    c[$1] = "gzip -5 -c|split -d -a 3 -b 100000000 - "$1"/"$1".gz."
}
{ print | c[$1] }
' data.txt
Assumes:
sufficiently few distinct $1 (there is an implementation-specific limit on how many pipes can be active simultaneously - eg. popen() on my machine seems to allow 1020 pipes per process)
no problematic characters in $1
Incorporating improvements suggested by @EdMorton:
If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above as only a single pipe will need to be active.
You can remove the second limit by suitable testing and quoting before you use $1. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)
sort -s -k1,1 data.txt | awk '
$1 ~ "/" {
    print "Warning: unsafe character(s). Ignoring line",FNR >"/dev/stderr"
    next
}
$1 != prev {
    close(cmd)
    prev = $1
    # escape single-quote (\047) for use below
    s = $1
    gsub(/\047/,"\047\\\047\047",s)
    system("mkdir -p -- \047"s"\047")
    cmd = "gzip -5 -c|split -d -a 3 -b 100000000 -- - \047"s"/"s".gz.\047"
}
{ print | cmd }
'
Note that the code above still has gotchas:
for a path d1/d2/f:
the total length can't exceed getconf PATH_MAX d1/d2; and
the name part (f) can't exceed getconf NAME_MAX d1/d2
Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
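For example (assuming the A directory from the question already exists), the limits can be checked like this:
getconf PATH_MAX A
getconf NAME_MAX A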

How to get data from a file then populate the fields into other file using awk script?

Is there a way to get data from one file and then populate the fields into another file using a script such as awk? (A sed script would also be okay.) Note: it doesn't have to use grep and find.
file1.txt
AAA Abcd
Watermelon Apple
8/25/19 11/24/19
4
55
file2.txt
[[letters_aaa]]
[[letters_abcd]]
[[fruit_names]]
[[date_start]]
[[date_end]]
[[four_hours]]
[[num_fiftyfive]]
merge into this one
AAA
Abcd
Watermelon Apple
8/25/19
11/24/19
4
55
I tried writing an awk script but I wasn't sure what to write after locating the line number.
This script is all I can do:
BEGIN {
}
{
if(NR==1) {
print $1 $2
}
It's 2AM here so I'm not sure if your question or my answer make any sense. This reads the files in with RS="" so all fields must be filled; empty fields will break the output (in which case use RS="^$" and replace print with printf):
$ awk '
BEGIN {
    RS=""
}
NR==FNR {                                # store file2 as template
    t=$0
    next
}
{
    sub(/\[\[letters_aaa\]\]/,$1,t)      # replace template items with file fields
    sub(/\[\[letters_abcd\]\]/,$2,t)
    for(i=3;i<=NF-4;i++)
        fruit=fruit (fruit==""?"":OFS) $i    # dealing with space in names
    sub(/\[\[fruit_names\]\]/,fruit,t)
    sub(/\[\[date_start\]\]/,$(NF-3),t)
    sub(/\[\[date_end\]\]/,$(NF-2),t)
    sub(/\[\[four_hours\]\]/,$(NF-1),t)
    sub(/\[\[num_fiftyfive\]\]/,$NF,t)
    print t
}' file2 file1
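For reference, here is a rough sketch (my reading of that RS="^$" hint, not something from the answer) of how the empty-field-safe variant might look; only the first sub() call is shown and the others are unchanged:
awk '
BEGIN { RS="^$" }                     # GNU awk: read each whole file as a single record
NR==FNR { t=$0; next }                # store file2 as template
{
    sub(/\[\[letters_aaa\]\]/,$1,t)   # ...same replacements as above for the remaining fields...
    printf "%s", t                    # t already ends in a newline, so printf instead of print
}' file2 file1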

Setting a threshold for multiple files in directory

I have a list of files in a directory. For instance, below are the file names, each followed by its first line (each file has several other lines that are not of importance).
Group1:
8 325
quick brown fox jumped
Over the lazy dog
Group2:
8 560
There is more content here
Group3:
7 650
I would like to read the first line of each file and check if the first value is equal to 8 and the second value is more than 500. If this condition is satisfied, print the name of the file into a new text file.
Result
Group2
I tried using
for f in *.Group;
do head -n1 *.Group > new-file;
done
This gives me a file with header names and the first line of each file in the directory
=> Group1 <=
8 325
=> Group2 <=
8 560
=> Group3 <=
7 650
Now, I want to filter the files based on the threshold, but I am not sure how to convert the headers into a first column and the corresponding values into a second column. Then it would be easy to apply the threshold and filter the file. Or is there a better way to do this?
You can use awk:
awk 'FNR==1 && $1==8 && $2>500{print FILENAME}' *.Group > Result
Explanation:
# FNR contains the line number within the current(!) input
# file. Check if the conditions are met and print the filename
FNR==1 && $1==8 && $2>500 {
    print FILENAME
}
The above solution should work with any version of awk. If you have GNU awk you can take advantage of the nextfile statement. Using it you skip the remaining lines of an input file once the first line has been processed:
# Check if the conditions are met and print the filename in that case
$1==8 && $2>500 {
    print FILENAME
}
# Skip the remaining lines in the current file and continue
# with the next file
{
    nextfile
}
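Put together as a one-liner (same *.Group glob and Result file as before), the GNU awk version reads:
awk '$1==8 && $2>500{print FILENAME} {nextfile}' *.Group > Result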

cat lines from X to Y of multiple files into one file

I have many huge files in, say, 3 different folders, from which I would like to copy lines X to Y of the files of the same name and append them into a new file of the same name.
I tried doing
ls seed1/* | while read FILE; do
head -n $Y | tail -n $X seed1/$FILE seed2/$FILE seed3/$FILE > combined/$FILE
done
This does the job for the first value of $FILE, but then it does not return to the prompt, and hence I am unable to run the loop to completion.
For example i have the following files in three different folders, seed1, seed2 and seed3:
seed1/foo.dat
seed1/bar.dat
seed1/qax.dat
seed2/foo.dat
seed2/bar.dat
seed2/qax.dat
seed3/foo.dat
seed3/bar.dat
seed3/qax.dat
I would like to combine lines 10 to 20 of all files into a combined folder:
combined/foo.dat
combined/bar.dat
combined/qax.dat
Each of the files in combined has 30 lines, with 10 each from seed1, seed2 and seed3.
No loop required:
awk -v x=10 -v y=20 '
FNR==1 { out = gensub(/.*\//,"combined/",1,FILENAME) }
FNR>=x { print > out }
FNR==y { nextfile }
' seed*/*.dat
The above assumes the "combined" directory already exists (empty or not) before awk is called and uses GNU awk for gensub() and nextfile and internal file management. Solutions with other awks are less efficient, require a bit more coding, and require you to manage closing files when too many are going to be open.
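For what it's worth, a rough sketch of a more portable variant (my own adaptation, not part of the answer above) might look like the following; because it appends with >>, the files under combined/ should be removed or empty before running:
awk -v x=10 -v y=20 '
FNR==1 { out = FILENAME; sub(/.*\//,"combined/",out) }   # build combined/<name> without gensub()
FNR>=x && FNR<=y { print >> out }                        # bound the range explicitly (no nextfile)
FNR==y { close(out) }                                    # keep at most one output file open
' seed*/*.dat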

How to prevent new line when using awk

I have seen several variations of this question, but none of the answers are helping for my particular scenario.
I am trying to load some files, adding a column for filename. This works fine only if I put the filename as the first column. If I put the filename column at the end (where I want it), it creates a new line between $0 and the rest of the print that I am unable to stop.
for f in "${FILE_LIST[#]}"
do
awk '{ print FILENAME,"\t",$0 } ' ${DEST_DIR_FILES}/$f > tmp ## this one works
awk '{ print $0,"\t",FILENAME } ' ${DEST_DIR_FILES}/$f > tmp ## this one does not work
mv tmp ${DEST_DIR_FILES}/$f
done > output
Example data:
-- I'm starting with this:
A B C
aaaa bbbb cccc
1111 2222 3333
-- I want this (new column with filename):
A B C FILENAME
aaaa bbbb cccc FILENAME
1111 2222 3333 FILENAME
-- I'm getting this (\t and filename on new line):
A B C
FILENAME
aaaa bbbb cccc
FILENAME
1111 2222 3333
FILENAME
Bonus question
I'm using a variable to pass the filename, but it is putting the whole path. What is the best way to only print the filename (without path) ~OR~ strip out the file path using a variable that holds the path?
It's almost certainly a line endings issue, as your awk script is syntactically correct. I suspect your files in "${FILE_LIST[@]}" came from a Windows box and have \r\n line endings. To confirm the line endings for a given file you can run the file command on it, i.e. file filename:
# create a test file
$ echo test > foo
# use unix2dos to convert to Windows style line endings
$ unix2dos foo
unix2dos: converting file foo to DOS format ...
# Use file to confirm line endings
$ file foo
foo: ASCII text, with CRLF line terminators
# Convert back to Unix style line endings
$ dos2unix foo
dos2unix: converting file foo to Unix format ...
$ file foo
foo: ASCII text
To convert your files to Unix style line endings \n run the following command:
$ for "f" in "${FILE_LIST[#]}"; do; dos2unix "$f"; done
Explanation:
When FILENAME is the first string on the line, the carriage return \r effectively does nothing, as we are already at the start of the line. When we try to print FILENAME after any other characters, we see its effect: we are brought to the start of the next line, the TAB is printed, and then the FILENAME.
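If you would rather not convert the files themselves, one alternative (my suggestion, not part of the dos2unix approach above) is to strip the trailing carriage return inside awk before printing:
awk '{ sub(/\r$/,""); print $0,"\t",FILENAME }' file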
Side note:
Awk has the variable OFS for setting the output field separator so:
$ awk '{print $0,"\t",FILENAME}' file
Can be rewritten as:
$ awk '{print $0,FILENAME}' OFS='\t' file
Bonus Answer
The best way, IMO, to strip the path from a file is to use the basename utility:
$ basename /tmp/foo
foo
Using command substitution:
$ awk '{print FILENAME}' $(basename /tmp/foo)
foo
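Alternatively, the path can be stripped inside awk itself by trimming everything up to the last slash from FILENAME (a sketch of an alternative approach, not what the answer above uses):
awk '{ fn=FILENAME; sub(/.*\//,"",fn); print $0,"\t",fn }' /tmp/foo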