Yet another attempt to do the "vlookup" thing using AWK - awk

Just a clarification before starting: the server where the script needs to run is an AIX box. The shell is ksh, so I do not have the advanced features Bash provides.
Ok, this is what I need to accomplish:
I have two files. Both of them have two columns, separated by a comma.
File "A" has ~170K lines, and it looks like this:
0000A7AED4F0C9FB1ADC14134700CadsevDDD4A000CEDCF.ext,\\server000005\F$\DICE\0035\
0000C3793C4CD6095947E44618D4Eadsev397460011D036.ext,\\server000005\F$\DICE\0020\
0001168DDDA4DF294E37753FE891BadsevB33900011EEA3.ext,\\server000005\F$\DICE\0088\
00014E6A3AFF0911D95A933778895adsev6C81E00088E97.ext,\\server000005\F$\DICE\0009\
0001A65FA90EC0E6640E1589C4B66adsev6FE1F00088EB9.ext,\\server000005\F$\DICE\0009\
0001C5AA0A9AC8E4EDFC69C483160adsev881CC001157ED.ext,\\server000005\F$\DICE\0034\
0003270ED2D2AB11739029711A233adsev55605000CFC63.ext,\\server000005\F$\DICE\0028\
000327C08A0ECD8F23EE6AE42B3C3adsevE35F00011481D.ext,\\server000005\F$\DICE\0061\
0003423C2592EF9D0AD9A7E2B595Cadsev6ABD9000D3501.ext,\\server000005\F$\DICE\0022\
00035862746EFB2098EC965F31328adsev66800000DA8CF.ext,\\server000005\F$\DICE\0021\
File "B" has ~2Million lines, and it looks like this:
0000294A3F3997slredA9D7ADBEE0C0CDE67C100001245C.nlo,\\server000002\F$\RESTORE_DICE\DICE\0083\
00003FFF21F5DAslred8F20FCF0A5CEE9920A4A00016835.nlo,\\server000002\F$\RESTORE_DICE\DICE\0029\
00005B1FFB996Fslred065F708695ADDD987AF9002139AD.nlo,\\server000002\F$\RESTORE_DICE\DICE\0157\
00005CF3C87456slred41FDB077914EB04FFA2B001F9D74.nlo,\\server000002\F$\RESTORE_DICE\DICE\0103\
00006BD33D737FslredD717F08A20F44F2B878500011050.nlo,\\server000002\F$\RESTORE_DICE\DICE\0094\
00008254F4D661slred6C05CFC91D9BCB82EDD800077FFA.nlo,\\server000002\F$\RESTORE_DICE\DICE\0082\
000092391392E3slredB744E98697FA39CEEDCD0004FB66.nlo,\\server000002\F$\RESTORE_DICE\DICE\0032\
0000945EDBB916slredAB08CD7AA8B825E1F55C0000FDC0.nlo,\\server000002\F$\RESTORE_DICE\DICE\0093\
0000C3793C4CD6slredE44618D4E0A2C50397460011D036.nlo,\\server000002\F$\RESTORE_DICE\DICE\0146\
0000D0DA56260DslredF30BCC9CDFF2A4556A7500039400.nlo,\\server000002\F$\RESTORE_DICE\DICE\0054\
In both cases, column 1 is a filename and column 2 is its path.
What I need:
For each line in file A, grab the filename, look it up in file B, and compose a new line, to be echoed/printed into a new file "C", consisting of the following three fields:
Column 1: Path for the file as shown in file B.
Column 2: Path for the file as shown in file A.
Column 3: File name.
I have tried to keep it short, as I have been "accused" of diarrhea of words in previous questions, but feel free to let me know in case I am missing important details here.
Just so you know, I have working batch and ksh scripts for this that work fine with smaller files (fewer lines), though none of them does the job in this case. (The CMD batch one only processes a couple of thousand lines an hour, while the ksh one refuses to even read such files because of memory limitations.)
Thank you guys for always being there!
Martín.

It's like 3 lines and a couple of temporary files if you don't have access to ksh93 (I think AIX uses ksh88, not ksh93?), bash, zsh, or another shell that understands <(command) redirection...
$ sort -t, -k1 filea > sorted_filea
$ sort -t, -k1 fileb > sorted_fileb
$ join -t, -j1 -o '2.2 1.2 0' sorted_filea sorted_fileb > filec
(fewer if one or both of the files are already sorted on filename.)
If you do have one of those shells:
$ join -t, -j1 -o '2.2 1.2 0' <(sort -t, -k1 filea) <(sort -t, -k1 fileb) > filec
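Since the title asks about awk, here is a single-pass sketch as an alternative, assuming file A (the ~170K-line one) fits comfortably in memory; it uses the same placeholder names filea, fileb, and filec as above. It reads file A into an array keyed by filename, then streams file B and prints the three requested columns, so no sorting and no temporary files are needed:
awk -F, -v OFS=, '
NR == FNR    { path_a[$1] = $2; next }        # file A: remember its path, keyed by filename
$1 in path_a { print $2, path_a[$1], $1 }     # file B: print path-in-B, path-in-A, filename
' filea fileb > filec
The trade-off versus join is memory: all of file A is held in an array, which should be no problem at ~170K short lines.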

Related

Using awk to find and replace strings in every file in a directory

I have a directory full of output files, named:
file1.out, file2.out, ..., fileN.out.
There is a certain key string in each of these files, let's call it keystring. I want to replace every instance of keystring with newstring in all files.
If there were only one file, I know I could do:
awk '{gsub("keystring","newstring",$0); print $0}' file1.out > tmp && mv tmp file1.out
Is there a way to loop through all N files in awk?
You could use the find command for this. To be on the safe side, run it on a test file first; only run it on your actual path (all your actual files) once you are sure it works. It also needs a newer version of gawk, which has the -i inplace option to write the output back into the files themselves.
find your_path -type f -name "*.out" -exec awk -i inplace -f myawkProgram.awk {} +
Where your awk program is as follows, as per your shown samples (cat myawkProgram.awk is used only to show the contents of the awk program here):
cat myawkProgram.awk
{
gsub("keystring","newstring",$0); print $0
}
A second option would be to pass all the .out files to gawk itself with -i inplace, by doing something like the following (but again, make sure you run this on a single test file first, and only run the actual command once you are convinced it does what you want):
awk -i inplace '{gsub("keystring","newstring",$0); print $0}' *.out
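If your awk is not a recent gawk (the -i inplace option needs gawk 4.1 or later), here is a minimal sketch of the same idea using a temporary file per file, reusing the myawkProgram.awk shown above:
find your_path -type f -name "*.out" -exec sh -c '
for f in "$@"; do
  # write the edited copy next to the original, then replace it only if awk succeeded
  awk -f myawkProgram.awk "$f" > "$f.tmp" && mv -- "$f.tmp" "$f"
done
' sh {} +
This is a sketch, not a drop-in command; it overwrites files in place, so test it on copies first, just as above.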
sed is the ideal tool for this, so integrating it with find:
find /directory/path -type f -name "*.out" -exec sed -i 's/keystring/newstring/g' {} +
This finds files with the extension .out and then executes the sed command on as many of the found files at a time as possible (using + with -exec).

How to parse a column from one file in multiple other files and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something else?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing the current line and go to the next line of the input file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
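If you also want to know which *.v7.egenes.txt file each matching line came from (an assumption on my part, not something asked for above), a small variation prints the current filename in front of every match:
awk 'NR==FNR { gene[$1] ; next } ($1 in gene) { print FILENAME, $0 }' allGenes.txt *.v7.egenes.txt > output.txt
FILENAME is a standard awk built-in holding the name of the file currently being read.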
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try a while read loop:
#!/bin/bash
while read -r line; do
grep -nw -e "$line" Stomach.v7.egenes.txt >> output.txt
done < allGenes.txt
So here we tell the while loop to read all the lines from allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT :
New version :
#!/bin/bash
for name in $(cat allGenes.txt); do
grep -nw -e "$name" *.v7.egenes.txt >> output.txt
done
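If the loops above turn out to be too slow (they launch one grep per gene), a single fixed-string pass over all files may be worth timing for comparison; this is essentially the grep from the question with -F added so the gene IDs are matched as fixed strings rather than regular expressions:
grep -Fw -f allGenes.txt *.v7.egenes.txt > output.txt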

Bulk renaming files, rearranging file names based on a delimiter

I have seen questions that are close to this, but I have not seen the exact answer I need, and I can't seem to wrap my head around the regex, awk, sed, grep, or rename invocation that I would need to make it happen.
I have files in one directory that were copied there from multiple subdirectories of a different directory using find piped to xargs; duplicate names were numbered sequentially.
Command I used:
find <dir1> -name "*.png" | xargs cp -t <dir2>
This resulted in the second directory containing duplicate filenames sequentially named as follows:
<name>.png
<name>.png.~1~
<name>.png.~2~
...
<name>.png.~n~
What I would like to do is take all files ending in ~*~ and rename it as follows:
<name>.#.png where the "#" is the number between the "~"s at the end of the file name
Any help would be appreciated.
With Perl's rename (the stand-alone command):
rename -nv 's/^([^.]+)\.(.+)\.~([0-9]+)~/$1.$3.$2/' *
If everything looks fine, remove the -n option.
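As a concrete illustration of what the substitution does, with a hypothetical file name (the -n run prints the planned renames without touching anything):
rename -nv 's/^([^.]+)\.(.+)\.~([0-9]+)~/$1.$3.$2/' *.~[0-9]*~
# photo.png.~3~  ->  photo.3.png
#   $1 = photo   (everything before the first dot)
#   $2 = png     (everything between that dot and the .~3~ suffix)
#   $3 = 3       (the number between the tildes)
Restricting the glob to *.~[0-9]*~ is optional; names without a ~N~ suffix simply do not match the regex and are left alone.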
There might be an easier way to do this, but here is a small shell script using grep and awk to achieve what you want:
for i in *.png.~*~; do
  name=$(echo "$i" | awk -F'png' '{print $1}')
  n=$(echo "$i" | awk -F'~' '{print $2}')
  mv "$i" "$name$n.png"
done
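For comparison, here is a plain-shell sketch with no awk and no ls parsing, assuming every file to be renamed really does follow the <name>.png.~N~ pattern:
for i in *.png.~*~; do
  n=${i##*.png.~}            # text after the last ".png.~", e.g. "3~"
  n=${n%~}                   # strip the trailing "~", leaving just the number
  base=${i%.png.~*~}         # everything before ".png.~N~"
  mv -- "$i" "$base.$n.png"
done
Parameter expansion keeps it to a single mv per file, with no subshells.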

Search file contents recursively when you know where in the file to look

I am interested in efficiently searching files for content using bash and related tools (eg sed, grep), in the specific case where I have additional information about where in the file the intended content is. For example, I want to replace a particular string on line #3 of each file that contains a specific string on line 3 of the file. Therefore, I don't want to do a recursive grep -r on the whole directory, as that would search the entirety of each file, wasting time, since I know that the string of interest is on line #3 if it is there at all. This full-grep approach could be done with grep -rl 'string_to_find_in_files' base_directory_to_search_recursively.
Instead, I am thinking about using sed -i ".bak" '3s/string_to_replace/string_to_replace_with/' files to search only line #3 of all files recursively in a directory; however, sed seems to only be able to take one file as an input argument. How can I apply sed to multiple files recursively? find -exec {} \; and find -print0 | xargs -0 seem to be very slow. Is there a faster method than using find? I can achieve the desired effect very quickly with awk, but only on a single directory; it does not seem to be recursive, such as using awk 'FNR==3{print $0}' directory/*. Any way to make this recursive? Thanks.
You can use find to get the list of files and feed them to sed or awk one at a time via xargs.
For example, this will print the first line of each file listed by find.
$ find . -name "*.csv" | xargs -L 1 sed -n '1p'
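To apply the same idea to the line-3 case from the question (using the question's own placeholder names), one awk process can be handed many files at once, since FNR restarts at 1 for each file; nextfile, a gawk/BSD awk extension, stops reading each file after line 3 so the bulk of every file is skipped:
find base_directory_to_search_recursively -type f -print0 |
  xargs -0 awk 'FNR==3 && /string_to_find_in_files/ { print FILENAME } FNR>=3 { nextfile }'
Because xargs batches many file names per awk invocation, this avoids the one-process-per-file cost of find -exec {} \;.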

How to get a few lines from a .gz compressed file without uncompressing it

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a Mac, you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a continuous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new file, subFile; the 11q quits as soon as line 11 is reached, so the rest of the stream is not read. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p;6q' > subFile
which prints the 1st line, skips 4 lines, prints the 6th line, and then quits because of the 6q; drop the ;6q if you want the every-5th-line extraction to continue through the whole file. (The first~step address is a GNU sed extension.)
If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not just the first few lines but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to a line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
print NR,$0;
if (NR>=to)
exit 1
}