How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have three lines with A and two with B.
File2 contains 3000 lines of A and B records. I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which here means three random lines with A and two with B.
Anyone have an idea on how to do this with awk?
Thanks

#!/bin/bash
csplit -s file2 '/^B /'            # split file2 into xx00 (A lines) and xx01 (B lines)
acount=$(wc -l < xx00)             # pool of A lines available in file2
bcount=$(wc -l < xx01)             # pool of B lines available in file2
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
    arr[$1]++                      # count how many lines of each type file1 has
    next
}
! setup {
    setup = 1
    srand()                        # seed, or every run picks the same lines
    # pick arr["A"] distinct random line numbers from the A pool
    while (arandcount < arr["A"]) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    # pick arr["B"] distinct random line numbers from the B pool
    while (brandcount < arr["B"]) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file on the regex (this assumes file2 groups all the A lines before the B lines, as file1 does), and wc -l counts the lines in each piece. Those counts are passed into the AWK program as variables.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks the line numbers to select: for each type it draws random numbers between 1 and that type's pool size until it has as many distinct line numbers as file1 has lines of that type. This block only gets executed once because of the flag.
The last two blocks check each line's record number against the array of picked line numbers and print the line if it is there.
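The two while loops are a pick-k-distinct-random-line-numbers idiom. In isolation it looks like this (a sketch with made-up pool/want values):
awk -v pool=3000 -v want=3 'BEGIN {
    srand()
    while (count < want) {
        line = int(rand() * pool) + 1   # random line number in 1..pool
        if (!(line in picked)) {        # keep only new picks
            picked[line] = 1
            count++
        }
    }
    for (l in picked) print l           # the selected line numbers
}'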

This might work for you:
grep '^A' file2 | sort -R | head -n "$(grep -c '^A' file1)" > file3
grep '^B' file2 | sort -R | head -n "$(grep -c '^B' file1)" >> file3
N.B. This assumes file1 is sorted.
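If GNU shuf is available, the same idea reads more directly (a sketch, assuming GNU coreutils):
grep '^A' file2 | shuf -n "$(grep -c '^A' file1)" > file3
grep '^B' file2 | shuf -n "$(grep -c '^B' file1)" >> file3
Unlike sort -R, which sorts by a hash of each line and so keeps duplicate lines adjacent, shuf -n produces a genuine random sample.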

Related

How do I compare alphanumeric characters in non-sequential order?

Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but it doesn't seem to work well with alphanumeric characters.
In its current state, apart from simply comparing the numbers, it does 2 things additionally:
It accounts for the zeros before a number or character and compares only the absolute values, ignoring leading zeros.
If the same number or character occurs multiple times in both files, it outputs the additional occurrence.
I just want the script to work as well for alphanumeric characters as it currently does for plain numbers. Can someone please edit the script to produce the desired output while also honoring the above 2 conditions?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1: "1", "01", "001" all evaluate to 1, i.e. 1 appears 3 times.
In file2: "1" is present only once.
Hence "1" is present twice in the result (3-1 times).
"2" and "9" are exclusively present in file2, so obviously both simply form part of the output.
In file1: "027B", "0027B" evaluate to 27B, i.e. 27B appears 2 times.
In file2: "27B" is present only once.
Hence "27B" is present once in the result (2-1 times).
Explanation of matched items (the ones not forming part of the expected output):
"8" from file1 (line 4) is matched with "08" from file2 (line 3).
"12" from file1 (line 6) is matched with "12" from file2 (line 4).
"13C" from file1 (line 7) is matched with "13C" from file2 (line 5).
"2B" from file1 (line 5) is matched with "02B" from file2 (line 6).
Lastly, the order of items in the expected output should be ascending, as shown in my example above; say the example had 3 in the expected output, it should read vertically as 1 1 2 3 9 27B.
It should be enough to remove leading zeros when forming the key (with a special case for all-zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
9
27B
To output in the order as found in file2, you can remember the order in another array and then do all the printing in the END block. This version will output the values in the order of file2, with any values only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
for (i=1; i<=FNR; ++i) {
k = b[i]
while(a[k]-->0) print k
}
for (k in a) {
while(a[k]-->0) print k
}
}
$ awk -f a.awk file1 file2
1
1
2
9
27B

Delete every occurrence in a file

Let's say I have a file with this format content:
6 8
6 9
12 20
6
8
9
12
20
35
I want to delete all the lines where a number (from the 1st or 2nd column) is found again in later lines, whether in the first or second column, including the line where the number first appears.
So I should have this as a result:
35
I've tried using
awk '{for(i=1;i<=NF;i++){if($i in a){next};a[$i]}} 1'
Is there a simple way to do this?
This line should help (GNU awk, since it relies on a regex record separator):
awk 'NR==FNR{a[$0]++;next}{for(i=1;i<=NF;i++)if(a[$i]>1)next}7'
RS=" |\n" file RS="\n" file
With this input file:
100 200 300
6 8
6 9
12 20
6
8
9
12
20
35
The one-liner above will output:
100 200 300
35
You want to find numbers that only occur once. Let me add "42" to your file on the same line as "35":
$ cat file
6 8
6 9
12 20
6
8
9
12
20
35 42
To find all the numbers that only appear once:
$ tr -s "[:blank:]" "\n" < file | sort -n | uniq -u
35
42
But that doesn't show the line in the file where they appear, so
$ grep -n -Fw -f <(tr -s "[:blank:]" "\n" < file | sort -n | uniq -u) file
9:35 42
I added the -n option just for effect, to show that we have line 9 from the file.
The code you originally wrote works perfectly under this condition: print the current line if and only if none of the previous lines contain a number listed in the current line. However, you asked about next lines, not previous ones.
The simplest solution for you would be to tac the input, run the awk filter, and tac it back.
$ tac <file> | awk '{for(i=1;i<=NF;i++){if($i in a) next;a[$i]}}1' | tac
If you want to use a single awk then you have to store the lines first and process them afterwards. It is also a bit more cumbersome because:
you need to count how many times a number appeared
and you need a double-pass algorithm (by running through the file twice or storing it in memory)
This gives you the following :
double pass:
$ awk '(NR==FNR){ for(i=1;i<=NF;i++) a[$i]++; next }
       { b=0; for(i=1;i<=NF;i++) b+=--a[$i] }
       !b' <file> <file>
memory:
$ awk '{ m[NR]=$0; for(i=1;i<=NF;i++) a[$i]++; next }
END { for (n=1;n<=NR;++n) {
b=0; $0=m[n];
for(i=1;i<=NF;i++) b+=--a[$i]
if (!b) { print }
}
}' <file>
This outputs the expected:
6
8
9
12
20
35
If you want to remove all lines which contain a number that appears more than once in the file, then you simply do:
$ awk '(NR==FNR) { for(i=1;i<=NF;i++) a[$i]++; next }
{ b=1; for(i=1;i<=NF;i++) b = b && (a[$i]==1) }
b' <file> <file>
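With the question's input, every line except the last contains a number that appears more than once, so this prints only:
35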
Note: judging from your input you might also be suffering from the Windows \r problem, so you might want to do gsub(/\r/,"") before everything; a sketch follows.
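A sketch of where that cleanup would sit, based on the double-pass variant above:
$ awk '{ gsub(/\r/,"") }                               # strip CR before anything else
       (NR==FNR){ for(i=1;i<=NF;i++) a[$i]++; next }   # first pass: count numbers
       { b=0; for(i=1;i<=NF;i++) b+=--a[$i] }          # second pass: sum remaining occurrences
       !b' <file> <file>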
Another one in awk, which processes the data in a single pass:
awk '
{
for(i=1;i<=NF;i++) { # process numbers in record
r[$i]=r[$i] NR OFS # add NR on a record list array
if(split(r[$i],t)>1) { # if more than one record lists this number
for(j in t) # delete records on the list
delete a[t[j]]
f=1 # flag up
}
}
if(f=="") # if flag is down
a[NR]=$0 # store current record
f="" # reset flag
}
END {
for(i=1;i<=NR;i++) # in the end
if(i in a) # print all records stored
print a[i]
}' file
35
A revised version:
awk '
{
for(i=1;i<=NF;i++) {
if($i in r) { # just store the NR of last duplicate
delete a[r[$i]]
f=1
}
r[$i]=NR
}
if(f=="")
a[NR]=$0
f=""
}
END {
for(i=1;i<=NR;i++)
if(i in a)
print a[i]
}' file
35
You could try something like this:
cat yourFile | tr ' ' '\n' | sort | uniq -u

AWK script for two columns

I have two columns like this:
(A) (B)
Adam 30
Jon 55
Robert 35
Jokim 99
Adam 32
Adam 31
Jokim 88
I want an AWK script to check whether a name in column A (Adam, say) ever has 30 in column B, and if so delete all Adam rows, regardless of whether Adam has 31 or 32 later, and then print the result.
In reality I have a log list and I do not want the code to depend on "Adam". So what I want exactly is: wherever 30 exists in $2, delete the respective value in $1, and also search $1 for all values equal to the deleted value.
You can read the columns into variables and check the value of the second column for the value you are looking for, then sed the file to delete all the matching column 1 entries:
cp test.txt out.txt && CHK=30 && while read -r a b; do
[ "${b}" = "${CHK}" ] && sed -i "/^${a}/d" out.txt
done < test.txt
Note: if the columns may contain regex metacharacters you may need to escape them (a sketch follows); also, if you can have blank lines you may want to check for null first before the test on column 2.
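A sketch of that escaping, with a hypothetical esc variable holding the sanitized name (the [[:blank:]] anchor also keeps a name like Adam from deleting an Adamson entry):
esc=$(printf '%s\n' "$a" | sed 's/[][\.*^$/]/\\&/g')   # escape BRE metacharacters
sed -i "/^${esc}[[:blank:]]/d" out.txt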
And since you specified AWK, here is a short awk way to do it, using a flag so that a matching name stops being printed from the line where its value hits 30 onward:
awk -v CHK=30 '{if($2~CHK)block=$1; if($1!=block)print}' test.txt
To remove the Adam entries from the first occurrence of Adam, 30 onward:
$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam")
To remove all Adam entries if any Adam, 30 exists:
$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam") { lines[nlines++] = $0 }
END { for (i = 0; i < nlines; i++) print lines[i] }
To remove all names which have a 30 in the second column:
NR == FNR && $2 == 30 { foundnames[$1] = 1 }
NR != FNR && !($1 in foundnames)
You must call this last version with the input filename twice, i.e. awk -f process.awk file.txt file.txt
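With the sample data (just the data rows, without the (A) (B) header line), that last version drops every Adam entry because one Adam row has 30:
$ awk -f process.awk file.txt file.txt
Jon 55
Robert 35
Jokim 99
Jokim 88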

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields of the data lines (lines 6 and 7). This can be done using NR. The tricky part is the following: every second field should go in the second column, and an E should be added before the exponent sign, so that the output file will look like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you can see that I want to keep only the first 10 characters of $6 (length($6)=10).
How is it possible to do this in awk?
You can do it all in awk, but it is perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed or embed something similar in the awk script (using gsub; a sketch of the embedded version follows the output below).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
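For reference, embedding the conversion in the awk script itself might look like this (a sketch, assuming the mantissas are never negative, since the gsub would also hit a leading minus sign):
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
       for(i=1;i<=6;i+=2) {s=$i FS $(i+1); gsub(/[+-]/,"E&",s); print s}}' file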
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there are various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields; see the sketch below), but I left them in because the above lets your output line up properly if/when your numbers start with a -, as they do on line 4 of your sample input.
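A sketch of that variant (still GNU awk, since it keeps FIELDWIDTHS):
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        gsub(/ /,"",$i)        # strip the alignment blanks
        gsub(/ /,"",$(i+2))
        print $i "E" $(i+1), $(i+2) "E" $(i+3)
    }
}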
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer using substr, so the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

How to print columns containing a value

Let's say I have a data file containing the following:
1 2 3 4 5
67 88 12 32 22
9 99 34 59 86
17 0 78 0 77
11 0 0 0 43
I would like to have code that searches each column for the number 0. If the number 0 is found, the code will print that entire column to a separate file.
With this data, the outputted file would look like so:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
It would be great if the code didn't require knowing the exact number of columns and/or rows.
This will do what you want. It does not require knowing anything about how many rows or columns are present.
$ awk 'FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next} {tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""}' file file
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
How it works
Because the file name is specified twice on the command line, the awk script will read the file twice, the first time to look for zeros, the second time to print.
FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next}
On the first pass through the file, a[i] is set to one for any column i that has a zero in it.
This code only applies to the first run through because of the condition FNR==NR. NR is the total number of records (lines) that we have read so far. FNR is the number of records (lines) that we have read so far from the current file. Thus, when FNR==NR, we are still reading the first file. The next at the end of the commands tells awk to skip the remaining commands and start over on the next line.
tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""
When we are reading through the file for the second time, we print out each column i for which a[i] is non-zero. I chose tab-separated output but, by simply adjusting the printf statement, any format could be used (see the comma-separated sketch below).
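For instance, a comma-separated variant only changes the separator (a sketch of the same two-pass command):
$ awk 'FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next} {sep="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",sep,$i; sep=","};print ""}' file file
2,3,4
88,12,32
99,34,59
0,78,0
0,0,0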
sed '#n
# init and load line in buffer (1st line copied, other added)
s/.*/>& /;1!H;1h
# at end of file, load buffer in working area
$ {x
:cycle
# keep column if zero inside
/>[[:blank:]]*0[[:blank:]]/ s/>\(\([[:blank:]]*[0-9]\{1,\}\)[[:blank:]][[:graph:][:blank:]]*\)/\2>\1/g
# remove treated column
s/>[[:blank:]]*[0-9]\{1,\}\([[:blank:]]\{1,\}[[:graph:][:blank:]]*\)/>\1/g
# is there another colum to treat ?
/>[[:blank:]]*[0-9][[:graph:][:blank:]]/ b cycle
# print result after cleanup
s/>//gp
}' YourFile
Self-commented sed.
POSIX version, so use --posix with GNU sed.
Here is an interesting way of doing it with GNU awk:
parse.awk
# Record number of columns (assuming all rows have the same number of fields)
NR == 1 { n = NF }
# First parse: Remember which columns contain `pat`
FNR == NR {
for(i=1; i<=NF; i++)
if($i == pat) {
h[i] = i
last = i>last ? i : last
}
next
}
# Before second parse: switch to reading one field at a time
ENDFILE {
RS="[ \t\n]+"
}
# Second parse: print field if current-record-number modulo
# number-of-columns is in the `h` hash
{ m = FNR % n }
m in h {
ORS = (m == last) ? "\n" : OFS # print new-line after last column
print $1
}
Run it like this for example:
awk -f parse.awk pat=0 infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
Or with OFS='\t':
awk -f parse.awk pat=0 OFS='\t' infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0