Variation with repetition for rows of file - awk

I have a file
4 5 6 6
1 7 5 5
7 0 2 1
7 8 0 6
and I would like to produce files that contain randomly chosen rows from this file, with repetition. So the outputs could be, for instance:
4 5 6 6
1 7 5 5
1 7 5 5
7 8 0 6
7 8 0 6
1 7 5 5
1 7 5 5
7 8 0 6
I mean that some rows will appear in the output several times and some not at all. Is it possible to produce a list of random numbers with repetition and choose rows from the input according to it? Is it possible in awk, or is some other language more appropriate?

If this isn't all you need:
$ shuf -n $(wc -l < file) -r file
4 5 6 6
7 8 0 6
1 7 5 5
1 7 5 5
then edit your question to clarify your requirements.

Assuming you mean repetition, i.e. sampling rows with replacement, here is a way using just awk's rand():
$ awk -v seed=$RANDOM '{            # set the random seed externally
    a[NR]=$0                        # hash records to a
}
END {
    srand(seed)
    for(i=1;i<=4;i++)               # 4 is the number of records to output
        print a[int(1+rand()*NR)]   # get a random array element and output it
}' file
An example of output:
7 8 0 6
7 8 0 6
7 8 0 6
1 7 5 5

You could also do this with coreutils shuf and sed, e.g.:
n=$(wc -l < infile)
shuf -n $n -i 1-$n -r | sed 's/$/p/' | sed -nf - infile
Output example:
4 5 6 6
4 5 6 6
1 7 5 5
1 7 5 5
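
Since the question also asks whether another language might be more appropriate, here is a minimal Python sketch of the same idea (sampling rows with replacement). The function name sample_rows_with_repetition and the fixed seed are only illustrative assumptions, not part of any answer above:

```python
import random

def sample_rows_with_repetition(lines, k=None, seed=None):
    """Draw k lines uniformly at random, with replacement (hypothetical helper)."""
    rng = random.Random(seed)          # a private generator keeps the sketch reproducible
    if k is None:
        k = len(lines)                 # default: as many rows as the input has
    return rng.choices(lines, k=k)     # choices() samples with replacement

rows = ["4 5 6 6", "1 7 5 5", "7 0 2 1", "7 8 0 6"]
picked = sample_rows_with_repetition(rows, seed=1)
print("\n".join(picked))
```

To process a real file, rows could be read with open("file").read().splitlines(); dropping the seed argument gives a different sample on each run.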

Related

How can I force pytest to render newlines in reprs (e.g. for Pandas DataFrames) in test failure output?

I have this test:
import pandas as pd
def test_foo():
df = pd.DataFrame({k: list(range(10)) for k in 'abcdefghij'})
assert len(df) == 0
When I run this with python3 -m pytest ., I see this in the failure output:
E assert 10 == 0
E + where 10 = len( a b c d e f g h i j\n0 0 0 0 0 0 0 0 0 0 0\n1 1 1 1 1 1 1 1 1 1 1\n2 2 2 2 2 2 2 2... 6 6 6 6 6 6 6 6\n7 7 7 7 7 7 7 7 7 7 7\n8 8 8 8 8 8 8 8 8 8 8\n9 9 9 9 9 9 9 9 9 9 9)
The repr of the dataframe is hard to read. I want to see the whole dataframe instead of an ellipsized version, and I want to see newlines rendered as actual newlines instead of \n. How can I tell pytest to render the dataframe more legibly in the failure output?

Pandas read_csv BZ2 File Always Includes File Name

Every time I load a .bz2 file into a Pandas dataframe I get the name of the file as the first column of the first row of the dataframe. I'm using tar to compress the files. I have written the following something.txt file:
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10
I compress it via tar -cvjf something.txt.bz2 something.txt. Then, I decompress the data and move it to a secure file location:
tar -xvjf something.txt.bz2
mv something.txt something.txt.2
Now I load the data in a python script in three different ways:
>>> data1 = pd.read_csv("something.txt")
>>> data2 = pd.read_csv("something.txt.2")
>>> data3 = pd.read_csv("something.txt.bz2")
and here's what I get when I read these data back:
>>> data1
1 2 3 4 5
0 2 3 4 5 6
1 6 7 8 9 10
>>> data2
1 2 3 4 5
0 2 3 4 5 6
1 6 7 8 9 10
>>> data3
something.txt 2 3 4 5
0 2.0 3.0 4.0 5.0 6.0
1 6.0 7.0 8.0 9.0 10.0
2 NaN NaN NaN NaN NaN
Does anybody know why this is happening???
This is how it works on my end. First, consider your data set something.txt:
"c0" "c1" "c2" "c3" "c4"
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10
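where I named the columns and used a single space as a separator for consistency. Then, compress it with bzip2 directly, not tar. tar -cjf produces a bzip2-compressed tar archive, whose header begins with the file name, and pandas only undoes the bzip2 layer, so the header ends up in the data: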
bzip2 -z9 something.txt
This will replace something.txt with something.txt.bz2 in your base directory. Finally, start a Python session and run the following:
import pandas as pd
df = pd.read_csv("something.txt.bz2", compression="bz2", sep=" ")
df
The last line will print:
c0 c1 c2 c3 c4
0 1 2 3 4 5
1 2 3 4 5 6
2 6 7 8 9 10
which shows the expected values.
I hope this helps!
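
To see where the stray file name comes from, here is a small sketch using only the Python standard library (no pandas). It builds both kinds of .bz2 data in memory and strips just the bzip2 layer from each. The in-memory buffers and variable names are assumptions made for illustration:

```python
import bz2
import io
import tarfile

payload = b"1 2 3 4 5\n2 3 4 5 6\n6 7 8 9 10\n"

# Roughly what `tar -cjf` produces: a tar archive wrapped in bzip2.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:bz2", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="something.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
tar_bz2_bytes = buf.getvalue()

# Roughly what `bzip2` produces: only the file contents, compressed.
plain_bz2_bytes = bz2.compress(payload)

# Undo only the bzip2 layer, as a reader that knows bz2 but not tar would.
raw_tar = bz2.decompress(tar_bz2_bytes)
raw_plain = bz2.decompress(plain_bz2_bytes)

print(raw_tar[:13])            # the tar header starts with the member name
print(raw_plain == payload)    # plain bzip2 round-trips to the original text
```

If the archive has to stay a tar file, the member can be extracted first with tarfile.open(...).extractfile(...) and the resulting file object handed to the CSV reader.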

Merging multiple files with null values using AWK

Sorry, I am posting this again as I messed up my earlier post:
I am interested in joining multiple files (e.g., file1, file2, file3, ...) on matching values in column 1 to get the desired output below. I would appreciate any help:
file1:
A 2 3 4
B 3 7 8
C 4 6 9
file2:
A 7 6 3
C 2 4 7
D 1 6 4
file3:
A 3 2 7
B 4 7 3
E 3 6 8
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
Here is one for awk. Tested with GNU awk, mawk, original-awk (i.e. awk 20121220) and Busybox awk:
$ awk '
function nn(c,    b,i) {
    if(c)
        for(i=1;i<=c;i++)
            b=b "n "
    return b
}
FNR==1{nf+=(NF-1)}
{
    for(i=2;i<=NF;i++)
        b[$1]=b[$1] $i OFS
    a[$1]=a[$1] (n[$1]<(nf-NF+1)?nn(nf-NF+1-n[$1]):"") b[$1]
    n[$1]=nf+0
    delete b[$1]
}
END{
    for(i in a)
        print i,a[i] (n[i]<(nf)?nn(nf-n[i]):"")
}' file1 file2 file3
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
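
For comparison, the same outer join can be sketched in Python. This is only an illustration under assumptions: the helper name merge_on_key is made up, the files are given as in-memory lists of lines, each file's width is taken from its first row, and keys are printed in sorted order:

```python
def merge_on_key(tables, filler="n"):
    """Outer-join tables (lists of split rows) on their first column (hypothetical helper)."""
    widths = [len(t[0]) - 1 if t else 0 for t in tables]       # value columns per file
    keys = sorted({row[0] for t in tables for row in t})       # every key seen anywhere
    lookup = [{row[0]: row[1:] for row in t} for t in tables]  # key -> values, per file
    merged = []
    for key in keys:
        out = [key]
        for width, table in zip(widths, lookup):
            # Use the file's values if it has this key, otherwise pad with fillers.
            out.extend(table.get(key, [filler] * width))
        merged.append(" ".join(out))
    return merged

file1 = ["A 2 3 4", "B 3 7 8", "C 4 6 9"]
file2 = ["A 7 6 3", "C 2 4 7", "D 1 6 4"]
file3 = ["A 3 2 7", "B 4 7 3", "E 3 6 8"]

tables = [[line.split() for line in f] for f in (file1, file2, file3)]
result = merge_on_key(tables)
print("\n".join(result))
```

Unlike the awk version, this keeps each file in memory as a dictionary, which trades some memory for simpler padding logic.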

awk cumulative sum in one dimension

Good afternoon,
I would like to compute a cumulative sum along each column and along each line in awk.
My input file is:
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
And I would like : per column
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
And I would like : per line
1 3 5 9 9
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
I did the sum per column as :
awk '{ for (i=1; i<=NF; ++i) sum[i] += $i}; END { for (i in sum) printf "%s ", sum[i]; printf "\n"; }' test.txt # sum
And per line:
awk '
BEGIN {FS=OFS=" "}
{
sum=0; n=0
for(i=1;i<=NF;i++)
{sum+=$i; ++n}
print $0,"sum:"sum,"count:"n,"avg:"sum/n
}' test.txt
But I would like to print all the lines and columns.
Do you have an idea?
It looks like you have all the correct information available; all you are missing is the print statements.
Is this what you are looking for?
accumulated sum of the columns:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' foo
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
accumulated sum of the rows:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ sum=0; for (i=1; i<=NF; ++i) {sum+=$i; $i=sum }; print $0}' foo
1 3 6 10
2 7 13 20
2 5 11 16
1 3 4 6
Both of these make use of the following:
each variable has the value 0 by default (if used numerically)
I replace the field $i with the running sum
I reprint the full line with print $0
row sums with repeated last element
$ awk '{s=0; for(i=1;i<=NF;i++) $i=s+=$i; $i=s}1' file
1 3 6 10 10
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
After the loop, i equals NF+1, so $i=s appends the sum as an extra field, and the trailing 1 prints the line with that extra field.
columns sums with repeated last row
$ awk '{for(i=1;i<=NF;i++) c[i]=$i+=c[i]}1; END{print}' file
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
END{print} repeats the last row
P.S. Your math seems to be wrong for the row sums: the first row should be 1 3 6 10 10.
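
As a cross-check of the numbers above, here is a short Python sketch of both running sums using itertools.accumulate. The variable names are illustrative, and the input is hard-coded rather than read from a file:

```python
from itertools import accumulate

rows = [
    [1, 2, 3, 4],
    [2, 5, 6, 7],
    [2, 3, 6, 5],
    [1, 2, 1, 2],
]

# Running sum along each row (left to right).
row_cumsum = [list(accumulate(r)) for r in rows]

# Running sum down each column: transpose, accumulate, transpose back.
col_cumsum = [list(r) for r in zip(*(accumulate(c) for c in zip(*rows)))]

for r in row_cumsum:
    print(*r)
print()
for r in col_cumsum:
    print(*r)
```

The first row of row_cumsum comes out as 1 3 6 10, which agrees with the awk output rather than with the figures in the question.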

AWK - removal of the same fields on the basis of the "$1"

I have a file1:
6
3
6
9
2
6
This command prints the result:
awk 'NR==1{a=$1};$0!=a' file1
3
9
2
Now I have file2:
6 1 2 3 4 5
3 3 4 4 4 6
6 5 2 2 5 1
9 1 3 5 4 1
2 5 6 4 8 5
6 1 5 2 3 1
I want to do the same thing, but with file2. I want to print out the result:
3 3 4 4 5 6
9 5 3 2 8 1
2 5 6 5 3 1
    5 4   1
      2
I want to do it in awk. Thank you for your help.
AWK is not really suited for what you are trying to do, since it is made for processing rows one at a time, while you are trying to shift numbers up and down between different rows. That said, this monster should do what you want:
awk 'NR==1{nc=NF;for(i=1;i<=nc;i++)a[i]=$i}{for(i=1;i<=nc;i++){if($i!=a[i]){v[m[i]++,i]=$i;if(m[i]>nl)nl=m[i]}}}END{for(l=0;l<nl;l++){for(i=1;i<=nc;i++){if(l<m[i]){printf("%d ", v[l,i])}else{printf("  ")}}printf("\n")}}' file2
If, on the other hand, your matrix of numbers had been transposed, this task would have been far simpler:
awk '{for(i=2;i<=NF;i++)if($i!=$1)printf(" %d",$i);printf("\n")}'
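
The column-wise version is easier to express in a language that can hold the whole matrix at once. Below is a minimal Python sketch of the same filtering, assuming single-digit values as in file2 so that one space can stand in for a missing entry. The variable names are illustrative:

```python
from itertools import zip_longest

rows = [
    [6, 1, 2, 3, 4, 5],
    [3, 3, 4, 4, 4, 6],
    [6, 5, 2, 2, 5, 1],
    [9, 1, 3, 5, 4, 1],
    [2, 5, 6, 4, 8, 5],
    [6, 1, 5, 2, 3, 1],
]

# Work column by column: drop every value equal to the column's first value.
columns = list(zip(*rows))
kept = [[v for v in col if v != col[0]] for col in columns]

# Transpose back, padding shorter columns with a blank so the layout stays aligned.
lines = [
    " ".join(str(v) for v in out_row).rstrip()
    for out_row in zip_longest(*kept, fillvalue=" ")
]
print("\n".join(lines))
```

With wider numbers the padding would need a fixed field width (for example via str.rjust) instead of a single space.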