Replace text from select statement from one file to another - sql

I have a bunch of views for my database and need to update the select statements within each view.
I have all the select statements in files called viewname.txt in one dir, and in a sub dir called sql I have all the views as viewname.sql. I want to run a script that takes the text from viewname.txt and replaces the select statement in the corresponding viewname.sql in the sql sub dir.
I have tried this to append the text after the SELECT in each .sql file:
for i in */*; do
if ["../$(basename "${i}")" == "$(basename "${i}")"]
then
sed '/SELECT/a "$(basename "$i" .txt)"' "$(basename "$i" .sql)"
fi
done
Any assistance is greatly appreciated!
Dickie

This is an awk answer that's close - the output is placed in the sql directory under corresponding "viewname.sql.new" files.
#!/usr/bin/awk -f
# absorb the whole viewname.txt file into arr when its first line is read
FILENAME ~ /\.txt$/ && FILENAME != last_filename {
    last_filename = FILENAME
    # get the viewname part of the file name
    split( FILENAME, file_arr, "." )
    while ( (getline file_data <FILENAME) > 0 ) {
        old_data = arr[ file_arr[ 1 ] ]
        arr[ file_arr[ 1 ] ] = \
            old_data (old_data == "" ? "" : "\n") file_data
    }
    next
}
# process each line of the sql/viewname.sql files
FILENAME ~ /\.sql$/ {
    # strip the leading "sql/" from FILENAME for lookup in arr
    split( substr( FILENAME, 5 ), file_arr, "." )
    if ( file_arr[ 1 ] in arr ) {
        if ( $0 ~ /SELECT/ )
            print arr[ file_arr[ 1 ] ] > (FILENAME ".new")
        else
            print $0 > (FILENAME ".new")
    }
}
I put this into a file called awko, ran chmod +x on it, and ran it like the following:
awko *.txt sql/*
You'll have to mv the new files into place, but it's as close as I can get right now.
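The last step could then be something like this (untested) to move each generated .new file over the matching view file:
for f in sql/*.sql.new; do
    mv -- "$f" "${f%.new}"
done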

Related

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change/order the columns of CSV files based on inputs.
example :
The data file (source file) will always have standard columns and their values, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) are standard for the source files, and I am looking for a solution that formats this and generates a new file with only the columns we prefer.
Note: the source/data file will always be comma delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file.
Something like script_name <output_column> <Source_File_name> <target_file_name>
Target file example:
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE" then the output file should look like this:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envios,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried some awk scripts (found on another site) but they were not working as expected, and since I don't have Unix knowledge I'm finding it difficult to modify them.
awk code:
BEGIN {
    FS = ","
}
NR==1 {
    split(c, ca, ",")
    for (i = 1; i <= length(ca); i++) {
        gsub(/ /, "", ca[i])
        cm[ca[i]] = 1
    }
    for (i = 1; i <= NF; i++) {
        if (cm[$i] == 1) {
            cc[i] = 1
        }
    }
    if (length(cc) == 0) {
        exit 1
    }
}
{
    ci = ""
    for (i = 1; i <= NF; i++) {
        if (cc[i] == 1) {
            if (ci == "") {
                ci = $i
            } else {
                ci = ci "," $i
            }
        }
    }
    print ci
}
The above code is saved as Remove.awk, and it is called by another script as below:
var1="BATCHID,LV2P_DESCRIPTION"
## these are the input field values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" '   # Pass the fields to print as the variable flds
BEGIN {
    split(flds,map,",")              # Split flds into an array map using , as the delimiter
}
NR==1 {
    for (i=1;i<=NF;i++) {
        map1[$i]=i                   # Loop through the header and create an array map1 with the column header as the index and the column number as the value
    }
}
{
    printf "%s",$map1[map[1]]        # Print the first field specified (the first entry in map)
    for (i=2;i<=length(map);i++) {
        printf ",%s",$map1[map[i]]   # Loop through the other fields specified, printing their contents
    }
    printf "\n"
}' file
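If you want the script_name <output_column> <Source_File_name> <target_file_name> interface from the question, one untested option is to wrap the same awk in a small shell script (the name reorder.sh is just an example):
#!/usr/bin/env bash
# usage: ./reorder.sh "LV1P_DESCRIPTION,PRODUCTCODE" RESULT.csv test.csv
flds=$1
src=$2
tgt=$3

awk -F, -v flds="$flds" '
BEGIN { split(flds,map,",") }
NR==1 { for (i=1;i<=NF;i++) map1[$i]=i }
{
    printf "%s", $map1[map[1]]
    for (i=2;i<=length(map);i++) printf ",%s", $map1[map[i]]
    printf "\n"
}' "$src" > "$tgt"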

Run awk in parallel

I have the code below, which works successfully; it is used to parse and clean log files (very large in size) and split them into smaller files. The output filename is the first 2 characters of each line. However, if either of those 2 characters is a special character, it needs to be replaced with a '_', which ensures there is no illegal character in the filename.
This takes about 12-14 mins to process 1 GB worth of logs (on my laptop). Can this be made faster?
Is it possible to run this in parallel? I am aware I could do }' "$FILE" &. However, I tested that and it does not help much. Is it possible to ask awk to output in parallel - what would be the equivalent of print $0 >> Fpath &?
Any help will be appreciated.
Sample log file
"email1#foo.com:datahere2
email2#foo.com:datahere2
email3#foo.com datahere2
email5#foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii#row.com;data.is.junk-Œœ
email3#foo.com:datahere2
Expected Output
# cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
# cat errorfile
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
Code:
#!/bin/bash
pushd "_test2" > /dev/null
for FILE in *
do
    awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
        }
        else
            print $0 >> "errorfile"
    }' "$FILE"
done
popd > /dev/null
Look up the man page for the GNU tool named parallel if you want to run things in parallel, but we can vastly improve the execution speed just by improving your script.
Your current script makes 2 mistakes that greatly impact efficiency:
Calling awk once per file instead of once for all files, and
Leaving all output files open while the script is running so awk has to manage them
You currently, essentially, do:
for file in *; do
    awk '
    {
        Fpath = substr($1,1,2)
        Fpath = gensub(/[^[:alnum:]]/,"_","g",Fpath)
        print > Fpath
    }
    ' "$file"
done
If you do this instead it'll run much faster, because sorting groups the lines destined for the same output file together, so awk only needs one output file open at a time:
sort * |
awk '
    { curr = substr($0,1,2) }
    curr != prev {
        close(Fpath)
        Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    { print > Fpath }
'
Having said that, you're manipulating your input lines before figuring out the output file names so - this is untested but I THINK your whole script should look like this:
#!/usr/bin/env bash
pushd "_test2" > /dev/null
awk '
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        sub(/[,|;: \t]+/, ":")
        if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:[\x00-\x7F]+$/) {
            print
        }
        else {
            print > "errorfile"
        }
    }
' * |
sort -t':' -k1,1 |
awk '
    { curr = substr($0,1,2) }
    curr != prev {
        close(Fpath)
        Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    { print > Fpath }
'
popd > /dev/null
Note the use of $0 instead of $1 in the scripts - that's another performance improvement because awk only does field splitting (which takes time of course) if you name specific fields in your script.
Assuming multiple cores are available, the simple way to run in parallel is to use xargs. Depending on your config, try 2, 3, 4, 5, ... until you find the optimal number. This assumes that there are multiple input files, and that no single file is much larger than all the others.
Notice the added fflush calls, so that lines will not be split. This has some negative performance impact, but is required, assuming you want the individual input files to be merged into a single set of output files. A possible workaround is to split each file first and then merge the combined output files.
#! /bin/sh
pushd "_test2" > /dev/null
ls * | xargs --max-procs=4 -L1 awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
            fflush(Fpath)
        }
        else {
            print $0 >> "errorfile"
            fflush("errorfile")
        }
    }'
popd > /dev/null
From a practical point of view, you might want to put the awk program into its own script file, e.g., split.awk:
#!/usr/bin/awk -f
BEGIN {
    FS=":"
}
{
    gsub(/^[ \t"']+|[ \t"']+$/, "")
    $0=gensub("[,|;: \t]+",":",1,$0)
    if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
    {
        Fpath=tolower(substr($1,1,2))
        Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
        print $0 >> Fpath
    }
    else
        print $0 >> "errorfile"
}
And then the 'main' code looks like the following, which is easier to manage:
ls * | xargs --max-procs=4 -L1 awk -f split.awk
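If you prefer GNU parallel (mentioned in the first answer) over xargs, an untested sketch of the equivalent invocation would be the following; the same caveat about fflush and concurrently appended output files applies:
# from inside _test2: run awk -f split.awk on each input file, at most 4 jobs at a time
parallel --jobs 4 awk -f split.awk ::: *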

Split large CSV file based on column value cardinality

I have a large CSV file with the following line format:
c1,c2
I would like to split the original file into two files as follows:
One file will contain the lines where the value of c1 appears exactly once in the file.
Another file will contain the lines where the value of c1 appears twice or more in the file.
Any idea how it can be done?
For example, if the original file is:
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
I would like to produce the following files:
3,foo
4,bar
and
1,foo
2,bar
2,foo
1,bar
This one-liner generates two files, o1.csv and o2.csv:
awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' file file
test:
kent$ cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
kent$ awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' f f
kent$ head o*
==> o1.csv <==
3,foo
4,bar
==> o2.csv <==
1,foo
2,bar
2,foo
1,bar
Note
awk reads your file twice, instead of saving the whole file in memory
the order of the lines is retained
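For readability, here is the same one-liner written out in expanded form with comments (same logic):
awk -F, '
NR == FNR {                 # first pass over file: count how often each c1 value occurs
    a[$1]++
    next
}
{                           # second pass: route each line by that count
    print > ("o" (a[$1] == 1 ? "1" : "2") ".csv")
}
' file file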
Depending on what you mean by big, this might work for you.
It has to hold the lines in an associative array, done, until it sees a 2nd use, or until the end of the file. When a 2nd use is seen, the remembered data is changed to "!" to avoid printing it again on a 3rd or later match.
>file2
awk -F, '
{
    if (done[$1] != "") {
        if (done[$1] != "!") {
            print done[$1]
            done[$1] = "!"
        }
        print
    } else {
        done[$1] = $0
        order[++n] = $1
    }
}
END {
    for (i=1; i<=n; i++) {
        out = done[order[i]]
        if (out != "!") print out >> "file2"
    }
}
' <csvfile >file1
I'd break out Perl for this job
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;
my @lines;

open ( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file
while ( <$input> ) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
    push ( @lines, [ $c1, $c2 ] );
}
close ( $input );

print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_ -> [0]} < 2 } @lines ) {
    print join ( ",", @$pair );
}

print "File 2:\n";
#filter any repeats.
foreach my $pair ( grep { $count_of{$_ -> [0]} > 1 } @lines ) {
    print join ( ",", @$pair );
}
This will hold the whole file in memory, but given your data - you don't save much space by double processing it and maintaining a count.
However you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;

open( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file counting "c1"
while (<$input>) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
}

open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe,   '>', "output_dupes.csv" )   or die $!;

seek( $input, 0, 0 );
while ( my $line = <$input> ) {
    my ($c1) = split( ",", $line );
    if ( $count_of{$c1} > 1 ) {
        print {$output_dupe} $line;
    }
    else {
        print {$output_single} $line;
    }
}

close($input);
close($output_single);
close($output_dupe);
This will minimise memory occupancy, by only retaining the count - it reads the file first to count the c1 values, and then processes it a second time and prints lines to different outputs.

awk | Rearrange fields of CSV file on the basis of column value

I need your help writing awk for the problem below. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not in a fixed order in the source file.
In the output file, fields are organised in a specific order, for example: all a values in the 2nd column, then b, then c.
The value c appears multiple times in the second line, so in the output its values are merged with a PIPE symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
split("",val) # or delete(val) in gawk
for (i=1;i<NF;i+=2) {
val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
}
for (i=1;i in order;i++) {
name = order[i]
printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
}
print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
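For example, to have the e values appear after c, the BEGIN line would become:
BEGIN{ FS="[,:]"; split("session,a,b,c,e",order) }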
Try with:
awk '
BEGIN {
FS = "[,:]"
OFS = ","
}
{
for ( i = 1; i <= NF; i+= 2 ) {
if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
}
asorti( hash, hash_orig )
for ( i = 1; i <= length(hash); i++ ) {
printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
}
printf "\n"
delete hash
delete hash_orig
}
' infile
This splits each line on any comma or colon and traverses the odd-numbered fields, saving each field and its values in a hash that is printed at the end. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9

How to "do something" for each input text files

Say that I read in the following information stored in three different text files (there can be many more):
File 1
1 2 rt 45
2 3 er 44
File 2
rf r 4 5
3 er 4 t
er t yu 4
File 3
er tyu 3er 3r
der 4r 5e
edr rty tyu 4r
edr 5t yt5 45
When I read in this information I want it to print the information from these files into separate arrays, as for now it is all printed out at the same time.
Right now I have this script, which prints out all the information at the same time:
{
    TESTd[NR-1] = $2; g++
}
END {
    for (i = 0; i <= g-1; i++) {
        print " [\"" TESTd[i] "\"]"
    }
    print " _____"
}
But is there a way to read in multiple files and do this for every text file?
Like instead of getting this output when doing awk -f test.awk 1.txt 2.txt 3.txt
["2"]
["3"]
["r"]
["er"]
["t"]
["tyu"]
["4r"]
["rty"]
["5t"]
_____
I get this output
["2"]
["3"]
_____
["r"]
["er"]
["t"]
_____
["tyu"]
["4r"]
["rty"]
["5t"]
_____
And reading in each file one at a time is preferably not an option here, since I will have around 30 text files.
EDIT________________________________________________________________
I want to do this in awk if possible because I'm going to do something like this
{
    PRINTONCE[NR-1] = $2; g++
    PRINTONEATTIME[NR-1] = $3
}
END {
    #Do this for all arguments once
    for (i = 0; i <= g-1; i++) {
        print " [\"" PRINTONCE[i] "\"] \n"
    }
    print " _____"
    #Do this for loop for every .txt file that is read in as an argument
    #for(j=0;j<args.length;j++){
    for (i = 0; i <= g-1; i++) {
        print " [\"" PRINTONEATTIME[i] "\"] \n"
    }
    print " _____"
}
From what I understand, you have an awk script that works, and you want to run it on many files with a separator (or _____) between their outputs so you can distinguish which output came from which file.
Try this bash script:
dir=~/*.txt    #all txt files in ~ (home) directory
for f in $dir
do
    echo "File is $f"
    awk 'BEGIN{print "Hello"}' "$f"    #your awk code will take the $f file as input
    echo "------------------"; echo
done
Also, if you do not want to do this to all files you can write the for loop as for f in 1.txt 2.txt 3.txt.
If you don't want to do it in awk directly, you can call it like this in bash or zsh, for example:
for fic in test*.txt; do awk -f test.awk "$fic"; done
It's quite simple to do it directly in awk:
# define a function to print out the array
function dump(array, n) {
    for (i = 0; i <= n-1; i++) {
        print " [\"" array[i] "\"]"
    }
    print " _____"
}

# dump and reset when starting a new file
FNR==1 && NR!=1 {
    dump(TESTd, g)
    delete TESTd
    g = 0
}

# add data to the array
{
    TESTd[FNR-1] = $2; g++
}

# dump at the end
END {
    dump(TESTd, g)
}
N.B. using delete TESTd is a non-standard gawk feature, but the question is tagged as gawk so I assumed it's OK to use it.
Alternatively you could use one or more of ARGIND, ARGV, ARGC or FILENAME to distinguish the different files.
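For instance, an untested sketch of the same idea keyed on FILENAME changes (reusing the dump() function above) could be:
# dump and reset whenever FILENAME changes instead of testing FNR==1
FILENAME != prev {
    if (prev != "") dump(TESTd, g)
    delete TESTd
    g = 0
    prev = FILENAME
}
{
    TESTd[FNR-1] = $2; g++
}
END {
    dump(TESTd, g)
}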
Or, as suggested in https://stackoverflow.com/a/10691259/981959, with gawk 4 you can use an ENDFILE block instead of END in your original:
{
    TESTd[FNR-1] = $2; g++
}
ENDFILE {
    for (i = 0; i <= g-1; i++) {
        print " [\"" TESTd[i] "\"]"
    }
    print " _____"
    delete TESTd
    g = 0
}
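Either version is run the same way as in the question, e.g.
awk -f test.awk 1.txt 2.txt 3.txt
and produces the per-file grouped output shown there.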
Write a bash shell script or a basic shell script. Try putting the code below into test.sh, then call /bin/sh test.sh or /bin/bash test.sh and see which one works:
for f in *.txt
do
    echo "File is $f"
    awk -F '\t' 'blah blah' $f >> output.txt
done
Or write a bash shell script that calls your awk script:
for f in *.txt
do
    echo "File is $f"
    /bin/sh yourscript.sh "$f"
done