Delete file based on condition - awk

I have a directory that contains text files. The files are used for some calculation, which produces four-column files like:
0.5000 -0.9650 6.6554 3.4228
When column 2 is greater than zero and column 3 is less than zero, I want to delete that file from the folder. I tried the script below:
#!bin/sh
for file in /home/dew/*.txt
do
some calculation for producing four-column `file1`
if awk '{print ($2 > 0 && $3 < 0)}' file1 | rm -rf $file
done
But it gives some errors

You may use this awk + xargs:
cd /home/dew
awk -v ORS='\0' '$2 > 0 && $3 < 0 {print FILENAME; nextfile}' *.txt |
xargs -0 rm
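The NUL output separator (ORS='\0') paired with xargs -0 keeps filenames with spaces or newlines intact, and nextfile stops scanning a file as soon as one matching row is found (nextfile is a GNU awk extension that most current awks also support). If you first want to preview which files would be removed, a dry run along these lines (only echoing, nothing deleted) should work:
cd /home/dew
awk -v ORS='\0' '$2 > 0 && $3 < 0 {print FILENAME; nextfile}' *.txt |
xargs -0 echo rm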

Related

Read and print the first 1000 rows from a CSV using awk, then the next 1000, and so on

I have a CSV that has around 25k rows. I have to pick 1000 rows at a time from column #1 and column #2, then the next 1000 rows, and so on.
I am using the commands below, and they work fine for picking up all the values from column #1 and column #2, i.e. 25k fields from both columns. Instead, I want to pick the values in batches (1-1000, then 1001-2000, 2001-3000, and so on), put each batch into the WHERE ... IN clause of my SQL export query, and append the results to the dbData.csv file.
My code is below:
awk -F ',' 'NR > 2 {print $1}' $INPUT > column1.txt
i=$(cat column1.txt | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}')
awk -F ',' 'NR > 2 {print $2}' $INPUT > column2.txt
j=$(cat column2.txt | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}')
echo "Please wait - connecting to database..."
db2 connect to $sourceDBStr user user123 using pas123
db2 "export to dbData.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in ($i) and cd.language_id in ($j)"
Let's assume the first two fields of your input CSV are "simple" (no spaces, no commas...) and do not need any kind of quoting. You could generate the tricky part of your query string with an awk script:
# foo.awk
NR >= first && NR <= last {
    c1[n+0] = $1
    c2[n++] = $2
}
END {
    for(i = 0; i < n-1; i++) printf("%s,", c1[i])
    printf("%s) %s (%s", c1[n-1], midstr, c2[0])
    for(i = 1; i < n; i++) printf(",%s", c2[i])
}
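To see the shape of the string it builds, here is a tiny sketch with made-up part numbers (P1-P3) and language ids (L1-L3), three data rows instead of 1000:
printf 'hdr1,hdr2\nP1,L1\nP2,L2\nP3,L3\n' |
awk -F ',' -f foo.awk -v midstr="and cd.language_id in" -v first=2 -v last=4
# prints: P1,P2,P3) and cd.language_id in (L1,L2,L3
so that "$prefix ($query)" in the script below expands to a complete ... PARTNUMBER in (P1,P2,P3) and cd.language_id in (L1,L2,L3) clause.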
And then use it in a bash loop to process 1000 records per iteration, store the result of the query in a temporary file (e.g., tmp.csv in the following bash script) that you concatenate to your dbData.csv file. The following example bash script uses the same parameters as you do (INPUT, sourceDBStr) and the same constants (dbData.csv, 1000, user123, pas123). Adapt if you need more flexibility. Error management (input file not found, DB connection error, DB query error...) is left as a bash exercise (but should be done).
prefix="export to tmp.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in"
midstr="and cd.language_id in"
rm -f dbData.csv
len=$(cat "$INPUT" | wc -l)
for (( first = 2; first <= len; first += 1000 )); do
    (( last = len < first + 999 ? len : first + 999 ))
    query=$(awk -F ',' -f foo.awk -v midstr="$midstr" -v first="$first" \
            -v last="$last" "$INPUT")
    echo "Please wait - connecting to database..."
    db2 connect to $sourceDBStr user user123 using pas123
    db2 "$prefix ($query)"
    cat tmp.csv >> dbData.csv
done
rm -f tmp.csv
But there are other ways using split, bash arrays and simpler awk or sed scripts. Example:
declare -a arr=()
prefix="export to tmp.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in"
midstr="and cd.language_id in"
awk -F, 'NR>1 {print $1, $2}' "$INPUT" | split -l 1000 - foobar
rm -f dbData.csv
for f in foobar*; do
    arr=($(awk '{print $1 ","}' "$f"))
    i="${arr[*]}"
    arr=($(awk '{print $2 ","}' "$f"))
    j="${arr[*]}"
    echo "Please wait - connecting to database..."
    db2 connect to $sourceDBStr user user123 using pas123
    db2 "$prefix (${i%,}) $midstr (${j%,})"
    cat tmp.csv >> dbData.csv
    rm -f "$f"
done
rm -f tmp.csv

Filtering using awk returns empty files

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers to that question does not work for me.
I have tab-delimited txt files (all in the folder Observation_by_pracid). For each file, I want to create a new file that only contains rows with a specific value in column $9 (medcodeid). The specific values are to be found in medicalcode_list.txt.
There is no error; however, it returns only empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.
You can join two files on a column using join.
Files must be sorted on the joined column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get rid of the first line (column names) of each file. You can use the tail command as below, where N is the line number from which you want to start output (so 2 for the 2nd line):
tail -n +N
... but you still need to display the column names (the header line):
head -n 1 FILE
To join two files f1 and f2 on field c1 of f1 and field c2 of f2, and output field y of file x:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
head -n 1 input_file
for input_file in *.txt ; do
join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
<(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
<(tail -n +2 "$input_file" | sort -k3 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names aren't aligned with the values. I don't know if that matters for you; you can format the display with printf.
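For example (a sketch, assuming the joined output is whitespace-separated and saved to a hypothetical joined_output.txt), awk's printf can pad each field to a fixed width:
awk '{ printf "%-14s %-12s %10s\n", $1, $2, $NF }' joined_output.txt
or you can simply pipe the join output through column -t to auto-align all fields.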
Personally I think it would be simpler to loop over the files in the shell (understanding that this will reread the code list more than once), with a simpler awk program that you can test and debug on its own. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection, to make sure that for a single file you get the results printed to the terminal that you expected. If you don't, there is probably some incorrect assumption about the files.
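Such a single-file check could look like this (pracid_001.txt is a hypothetical file name; pipe through head to keep the output short):
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
    PATH/medicalcode_list.txt pracid_001.txt | head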
Another option would be to use awk to generate a second awk script that hard-codes the list. That also gives you a chance to inspect the contents of the mlist array.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done

Merge multiple files and split output to multiple files based in each column (post 2)

I have many CSV files with exactly the same format of rows and columns. In the example below I present only 2 files as input, but I have many files with the same characteristics.
The purpose is, for each input file, to:
Take the values in rows 1, 2 and 3. Example from the first file:
6174
15
3
Then print the first column from rows 4 to 6.
Do the same for all input files and output a single file with the information from all the files read.
When this is done for all files for the first column, do the same for the remaining columns.
At the end, 4 output files will be created in total, since there are 4 columns in each file.
Input1
Record Number 6174
Vibrator Identification 15
Start Time Error 3 us
1.6,19.5,,,
1.7,23.2,28.3,27.0
1.8,26.5,27.0,25.4
Input2
Record Number 6176
Vibrator Identification 17
Start Time Error 5 us
1.6,18.6,,,
1.5,23.5,19.7,19.2
1.3,26.8,19.2,18.5
Using the code below, I got the 4 output files as desired, although files 3-4 are not as expected, because the first lines contain empty values and my code does not handle them as it should. I also have an issue getting the right value from row 3 of each file: I get "us" instead of a number.
output file1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
output file2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
output file3
6174,15,3,0,0,28.3,27.0
6176,17,5,0,0,19.7,19.2
output file4
6174,15,3,0,0,27.0,25.4
6176,17,5,0,0,19.2,18.5
Code used
The code works almost fine: it merges the CSV files and outputs the 4 required files, but there is a problem with files 3-4 when there are empty values.
for f in *.csv ; do
awk -F, 'NR==1 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==2 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==3 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
sed -i 's/\r$//' a-"$f"
for i in $(seq 1 4); do
awk -F, 'NR>=4{f=1} f{print '"$""$i"'} f==6{exit}' "$f" > "a""$i"-"$f"
cat a-"$f" a"$i""-""$f" >> t"$i"
sed -i 's/\r$//' t"$i"
done
for i in $(seq 1 4); do
awk -v RS= -v OFS=',' -v ORS='\n' '{$1=$1}1' t"$i" > file"$i".csv
done
done
rm -f ./a* ./t*
Appreciate your help
With GNU awk for ENDFILE and automatic handling of multiple open files and assuming your posted sample output showing file3 and file4 each having more fields than file1 and file2 is a mistake:
$ cat tst.awk
BEGIN { FS=OFS=","; numHdrFlds=3 }
FNR <= numHdrFlds {
gsub(/[^0-9]/,"")
hdr = (FNR==1 ? "" : hdr OFS) $0
next
}
{
for (i=1; i<=NF; i++) {
data[i] = (FNR==(numHdrFlds+1) ? "" : data[i] OFS) ($i)+0
}
}
ENDFILE {
for ( fileNr=1; fileNr<=NF; fileNr++ ) {
print hdr, data[fileNr] > ("outputFile" fileNr)
}
}
$ awk -f tst.awk file1 file2
$ for i in outputFile*; do echo "$i"; cat "$i"; echo "---"; done
outputFile1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
---
outputFile2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
---
outputFile3
6174,15,3,0,28.3,27
6176,17,5,0,19.7,19.2
---
outputFile4
6174,15,3,0,27,25.4
6176,17,5,0,19.2,18.5
---

use awk to calculate percentage of column 1 derived from column 2 and add it to column 3

I have a file consisting of 2 columns, both containing only whole numbers. I want awk to add a third column which shows what percentage column 1 is of column 2.
So, for example, the file contains:
cat file
15 150
I want awk to add column 3 to show 10 (because 15 is 10% of 150, right?) like this:
15 150 10
The columns are separated by tabs.
Thank you for your help!
Another awk
awk '$3=100*$1/$2' file
To overwrite file
awk '$3=100*$1/$2' file > tmp && mv tmp file
If for some reason you have 0s in the second column of your file:
awk '$2>0{$3=100*$1/$2}1' file > tmp && mv tmp file
or
awk '$2>0&&$3=100*$1/$2' file > tmp && mv tmp file
Untested, but an educated guess at what might work:
awk '{ print $1, $2, 100*$1/$2 }' yourfile.txt
To save it somewhere, you'll have to redirect stdout to a file. If you want this to overwrite your original file (don't do this until you've tested that it works!), you could wrap it in a bash script:
#!/bin/bash
awk '{ print $1, $2, 100*$1/$2 }' "$1" > "$1.tmp"
mv "$1.tmp" "$1"
and run it like
./thebashscript.sh yourfile.txt

Using awk to pull specific lines from a file

I have two files, one file is my data, and the other file is a list of line numbers that I want to extract from my data file. Can I use awk to read in my lines file, and then extract the lines that match the line numbers?
Example:
Data file:
This is the first line of my data
This is the second line of my data
This is the third line of my data
This is the fourth line of my data
This is the fifth line of my data
Line numbers file
1
4
5
Output:
This is the first line of my data
This is the fourth line of my data
This is the fifth line of my data
I've only ever used command line awk and sed for really simple stuff. This is way beyond me and I have been googling for an hour without an answer.
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
Simply referring to an array subscript creates the entry. While NR (the overall record number) equals FNR (the per-file record number) we are still reading the first file, so every line number is stored in the array and the next statement skips the rest of the program. After that, for the second file, whenever FNR is present in the array the condition is true and the line is printed (printing is the default action for a true pattern).
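The same one-liner spelled out with comments (same behavior, just expanded for readability):
awk '
NR == FNR {       # true only while reading the first file (numberfile)
    nums[$1]      # referencing the subscript creates the array entry
    next          # skip the rest of the program for these lines
}
FNR in nums       # second file: print lines whose number is in the array
' numberfile datafile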
One way with sed:
sed 's/$/p/' linesfile | sed -n -f - datafile
You can use the same trick with awk:
sed 's/^/NR==/' linesfile | awk -f - datafile
Edit - Huge files alternative
With regard to huge numbers of lines, it is not prudent to keep whole files in memory. The solution in that case can be to sort the numbers file and read one line number at a time. The following has been tested with GNU awk:
extract.awk
BEGIN {
    getline n < linesfile
    if(length(ERRNO)) {
        print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}
NR == n {
    print
    if(!(getline n < linesfile)) {
        if(length(ERRNO))
            print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}
Run it like this:
awk -v linesfile=$linesfile -f extract.awk infile
Testing:
echo "2
4
7
8
10
13" | awk -v linesfile=/dev/stdin -f extract.awk <(paste <(seq 50e3) <(seq 50e3 | tac))
Output:
2 49999
4 49997
7 49994
8 49993
10 49991
13 49988
Here is an awk example. inputfile is loaded up front, then matching records of datafile are output.
awk \
    -v RS="[\r]*[\n]" \
    -v FILE="inputfile" \
    'BEGIN \
     {
         LINES = ","
         while ((getline Line < FILE))
         {
             LINES = LINES Line ","
         }
     }
     LINES ~ "," NR "," \
     {
         print
     }
    ' datafile
I had the same problem. This is the solution already posted by Thor:
cat datafile \
| awk 'BEGIN{getline n<"numbers"} n==NR{print; getline n<"numbers"}'
If like me you don't have a numbers file, but it is instead passed on from stdin and you don't want to generate a temporary numbers file, then this is an alternative solution:
cat numbers \
| awk '{while((getline line<"datafile")>0) {n++; if(n==$0) {print line;next}}}'
This solution...
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
...prints each requested line only once, even if the numberfile contains repeated entries. If repeated entries should produce repeated output lines, sed is a better (but much slower) alternative:
sed -nf <(sed 's/.*/&p/' numberfile) datafile
while read line; do sed -n "${line}p" Datafile.txt; done < numbersfile.txt