Add a result of another script into a table with awk - awk

I have an awk script (let's call it my_script) which gives output as one column, like this:
DD
EE
AA
CC
BB
I also have another table:
C1 123
C3 222
C5 175
C4 318
C8 299
I want to add it to my existing table as the last column:
C1 123 DD
C3 222 EE
C5 175 AA
C4 318 CC
C8 299 BB
I tried the following script but it didn't work:
awk '{print $0, $3=myscript}' file.txt

You can use paste:
my_script |
paste -d ' ' table.txt -
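If you want to stay in awk, a hedged sketch: capture my_script's output to a file and read one line of it per table record with getline (out.txt below is a stand-in for the script's output):

```shell
# Stand-ins for the question's inputs
printf 'DD\nEE\nAA\nCC\nBB\n' > out.txt               # what my_script prints
printf 'C1 123\nC3 222\nC5 175\nC4 318\nC8 299\n' > table.txt

# For each line of the table, read the next line of out.txt and append it
awk '{ getline extra < "out.txt"; print $0, extra }' table.txt
```

With the real script you could read from it directly instead, e.g. `"./my_script" | getline extra`, which pulls successive lines from the command's output.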

Related

how to remove every n block of rows after every mth row?

I have a data file like this :
0000 0f 13 45 54 23 24 ae e1 f6
0001 f8 31 35 23 24 e7 e6 e1 f5
0002 0f 13 45 54 23 24 ae e1 f6
0003 0f 13 45 54 23 24 ae e1 f6
0004 f8 31 35 23 24 e7 e6 e1 f5
0005 0f 13 45 54 23 24 ae e1 f6
0006 0f 13 45 54 23 24 ae e1 f6
So let's say I would like to remove every 2nd and 3rd row starting from the top, leaving the 2 rows after each removed pair, so the output should be:
0000 0f 13 45 54 23 24 ae e1 f6
0003 0f 13 45 54 23 24 ae e1 f6
0004 f8 31 35 23 24 e7 e6 e1 f5
This might work for you (GNU sed):
sed '2~4,+1d' file
Start at line 2 and every 4 lines thereafter (2, 6, 10, …), and delete that line plus the one following it.
Perl to the rescue:
perl -ne 'print if $. % 4 < 2' file
-n reads the input line by line and runs the code for each line;
$. contains the current line number;
% is the modulo operator, the formula here selects lines 1, 4, 5, 8, 9, etc. which is what we need.
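The same selection reads naturally in awk as well (my equivalent, not part of the original answer):

```shell
# Keep lines whose number leaves remainder 0 or 1 when divided by 4
printf '0000 a\n0001 b\n0002 c\n0003 d\n0004 e\n0005 f\n0006 g\n' |
awk 'NR % 4 < 2'
```

This keeps lines 1, 4, 5, 8, 9, … just like the Perl one-liner.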
I would like to remove lines 4,5,6 10,11,12 16,17,18
First we need to express that in terms of the remainder of a division: you want to jettison lines where the remainder from division by 6 is 4, 5 or 0. This might be expressed in GNU AWK the following way. Let file.txt content be
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
then
awk 'NR%6!=4&&NR%6!=5&&NR%6!=0' file.txt
gives output
1
2
3
7
8
9
13
14
15
19
20
Explanation: NR is the number of the row, % is the remainder-of-division (modulo) operator, and && is logical AND, so I am selecting lines where the remainder from dividing the row number by 6 is not 4, not 5 and not 0. If you want to know more about NR, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
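A more compact but equivalent condition (my rewrite, not part of the original answer): shift to 0-based row numbers, and each block of 6 keeps its first 3 rows:

```shell
# Keep rows whose 0-based number modulo 6 is 0, 1 or 2
seq 1 20 | awk '(NR - 1) % 6 < 3'
```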
This sed command should do the trick:
sed -n 'p;n;n;n;p' file
It prints a line, skips the next two, prints the one after, and repeats, so (like the Perl answer) it keeps lines 1, 4, 5, 8, 9, …

How to shift a specific cell left in a text file

I have a very large text file (tab-delimited, first line is header) like this:
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD AC 150 2367 0.02 123 I
FA AFQ ASB 123 2473 0.4 630 I
As you can see, lines 3 and 4 contain an extra string, so a string lands in column 3 (A3). Could you please help me out with how I can delete these strings and shift the cells left, using awk, sed or any Linux tools, to get the corrected file:
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
I tried:
awk '{if($3!~/[0-9]+/) $3=$4}1' file
It removes any strings in column 3 and replaces them with column 4, but without shifting cells left.
Using sed
$ sed '1!s/^\([^ ]* \+[^ ]* \+\)[A-Z][^ ]* \+/\1/' input_file
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
1! - Do not apply the substitution to line 1 (the header).
^\([^ ]* \+[^ ]* \+\) - The parentheses capture what they match (here, everything up to and including the spaces after the second field) so it can be referenced later.
[A-Z][^ ]* \+ - Anything outside the parentheses is excluded from the replacement; if the third column starts with a capital letter, everything up to the next space is matched and therefore dropped.
\1 - Put back only what was captured within the parentheses.
You may use this awk:
awk 'BEGIN{FS=OFS="\t"} NR > 1 && $3+0 != $3 {
$3 = ""; sub(FS FS, FS)} 1' file
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
The test $3+0 != $3 is true only when $3 is not purely numeric; emptying the field leaves two consecutive tabs behind, which sub(FS FS, FS) collapses back into one.
This might work for you (GNU sed):
sed -E '1!s/^((\S+\s+){2})[A-Z]\S+\s+/\1/' file
Remove the third field (and the spaces following it) if it begins with a capital letter A through Z.
$ awk -F'\t+' -v OFS='\t' 'NF>7{$3=""; $0=$0; $1=$1} 1' file
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
Here -F'\t+' treats runs of tabs as a single separator: emptying $3 leaves a doubled tab in the record, reassigning $0=$0 re-splits the record (swallowing the empty field), and $1=$1 rebuilds it with single-tab output separators.
$ awk -v OFS='\t' '{print $1, $2, $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF}' file
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
This one sidesteps the problem entirely: it prints the first two fields and the last five, so on an 8-field line the stray third field is simply never printed.
Using cppawk:
cppawk '
#include <cons.h> // numberp lives here
#include <field.h> // delf here
NR > 1 && !numberp($3) { delf(3); } 1' file
A1 A2 A3 A4 A5 A6 A7
FA1 AB 234 231 0.02 456 I
FA2 ACE 241 2154 0.1 324 O
FA3 AD 150 2367 0.02 123 I
FA AFQ 123 2473 0.4 630 I
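For completeness, a field-shifting sketch of my own (assumes gawk, where assigning to NF rebuilds the record and truncates it):

```shell
# Tab-separated stand-in for the question's file
printf 'A1\tA2\tA3\tA4\tA5\tA6\tA7\n' >  f.txt
printf 'FA1\tAB\t234\t231\t0.02\t456\tI\n' >> f.txt
printf 'FA3\tAD\tAC\t150\t2367\t0.02\t123\tI\n' >> f.txt

# If $3 is not numeric, move every later field one slot left and drop the last
awk 'BEGIN { FS = OFS = "\t" }
     NR > 1 && $3 !~ /^[0-9.]+$/ { for (i = 3; i < NF; i++) $i = $(i + 1); NF-- }
     1' f.txt
```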

Issue in joining two sparse data - SparkSQL

I want to join 2 sparse tables in sparkSQL.
Table A -
C1   C2   C3
J1   F1   34
J2   F2   42
J3   null 34
null F4   42
J5   F5   50
Table B -
C1   C2   C4
J1   null 10
null F2   50
J3   F3   3
null F4   2
To Get Result -
C1   C2   C3   C4
J1   F1   34   10
J2   F2   42   50
J3   F3   34   3
null F4   42   2
J5   F5   50   null
(null marks the cells that are empty in the sparse input.)
How can I achieve this in SparkSQL?

Memory efficient transpose - Awk

I am trying to transpose a table (10K rows x 10K cols) using the following script.
A simple data example
$ cat rm1
t1 t2 t3
n1 1 2 3
n2 2 3 44
n3 1 1 1
$ sh transpose.sh rm1
n1 n2 n3
t1 1 2 1
t2 2 3 1
t3 3 44 1
However, I am getting memory error. Any help would be appreciated.
awk -F "\t" '{
for (f = 1; f <= NF; f++)
a[NR, f] = $f
}
NF > nf { nf = NF }
END {
for (f = 1; f <= nf; f++)
for (r = 1; r <= NR; r++)
printf a[r, f] (r==NR ? RS : FS)
}'
Error
awk: cmd. line:2: (FILENAME=input FNR=12658) fatal: dupnode: r->stptr: can't allocate 10 bytes of memory (Cannot allocate memory)
Here's one way to do it, as I mentioned in my comments: in chunks. Here I show the mechanics on a tiny 12r x 10c file, but I also ran a chunk of 1000 rows on a 10K x 10K file in not much more than a minute (Mac Powerbook).
EDIT The following was updated to consider an M x N matrix with unequal number of rows and columns. The previous version only worked for an 'N x N' matrix.
$ cat et.awk
BEGIN {
start = chunk_start
limit = chunk_start + chunk_size - 1
}
{
n = (limit > NF) ? NF : limit
for (f = start; f <= n; f++) {
a[NR, f] = $f
}
}
END {
n = (limit > NF) ? NF : limit
for (f = start; f <= n; f++)
for (r = 1; r <= NR; r++)
printf a[r, f] (r==NR ? RS : FS)
}
$ cat t.txt
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
$ cat et.sh
inf=$1
outf=$2
rm -f $outf
for i in $(seq 1 2 12); do
echo chunk for rows $i $(expr $i + 1)
awk -v chunk_start=$i -v chunk_size=2 -f et.awk $inf >> $outf
done
$ sh et.sh t.txt t-transpose.txt
chunk for rows 1 2
chunk for rows 3 4
chunk for rows 5 6
chunk for rows 7 8
chunk for rows 9 10
chunk for rows 11 12
$ cat t-transpose.txt
10 20 30 40 50 60 70 80 90 A0 B0 C0
11 21 31 41 51 61 71 81 91 A1 B1 C1
12 22 32 42 52 62 72 82 92 A2 B2 C2
13 23 33 43 53 63 73 83 93 A3 B3 C3
14 24 34 44 54 64 74 84 94 A4 B4 C4
15 25 35 45 55 65 75 85 95 A5 B5 C5
16 26 36 46 56 66 76 86 96 A6 B6 C6
17 27 37 47 57 67 77 87 97 A7 B7 C7
18 28 38 48 58 68 78 88 98 A8 B8 C8
19 29 39 49 59 69 79 89 99 A9 B9 C9
And then running the first chunk on the huge file looks like:
$ time awk -v chunk_start=1 -v chunk_size=1000 -f et.awk tenk.txt > tenk-transpose.txt
real 1m7.899s
user 1m5.173s
sys 0m2.552s
Doing that ten times with the next chunk_start set to 1001, etc. (and appending with >> to the output, of course) should finally give you the full transposed result.
There is a simple and quick algorithm based on sorting:
1) Make a pass through the input, prepending the column number and row number to each field. The output is a three-tuple of column, row, value for each cell in the matrix. Write the output to a temporary file (or, as below, pipe it straight into the sort).
2) Sort the temporary file by column, then row.
3) Make a pass through the sorted temporary file, reconstructing the transposed matrix.
The two outer passes are done by awk. The sort is done by the system sort. Here's the code:
$ echo '1 2 3
2 3 44
1 1 1' |
awk '{ for (i=1; i<=NF; i++) print i, NR, $i}' |
sort -n |
awk ' NR>1 && $2==1 { print "" }; { printf "%s ", $3 }; END { print "" }'
1 2 1
2 3 1
3 44 1
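The plain sort -n above relies on sort's last-resort whole-line comparison to order the second field; making both numeric keys explicit is slightly more robust (a small tweak of mine, same algorithm):

```shell
printf '1 2 3\n2 3 44\n1 1 1\n' |
awk '{ for (i = 1; i <= NF; i++) print i, NR, $i }' |  # (column, row, value)
sort -k1,1n -k2,2n |                                   # by column, then by row
awk 'NR > 1 && $2 == 1 { print "" }
     { printf "%s ", $3 }
     END { print "" }'
```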

SAS SQL query to solve, adding rows to distinct variable

Need help to sort out this scenario,
Table
pr_id lob_id prec1 prec2 prec3
112 1a 3478 56 77
112 1b 3466 65 43
112 1c 5677 57 68
112 1d 5634 49 52
215 2a 1234 43 45
215 2b 9787 32 43
215 2c 4566 39 90
388 3a 8797 88 99
388 3b 6579 58 72
388 3c 9087 76 67
Required output: one row per distinct pr_id, with the lob_id rows belonging to that pr_id appended across as extra columns, as shown below
pr_id lob_id prec1 prec2 prec3 lob_id prec1 prec2 prec3 lob_id prec1 prec2 prec3 lob_id prec1 prec2 prec3
112 1a 3478 56 77 1b 3466 65 43 1c 5677 57 68 1d 5634 49 52
215 2a 1234 43 45 2b 9787 32 43 2c 4566 39 90 . . . .
388 3a 8797 88 99 3b 6579 58 72 3c 9087 76 67 . . . .
I have tried doing it with proc transpose, but the variable names differ from the required output. Could you please help me with this?
Thank you.
This will get you as close as you can get to your desired answer. It's far more convoluted than is probably needed, but it does ensure the lob_ids stay with their prec1-prec3 values. You cannot give multiple variables the same name, but you can give them the same label, so I keep the label the same while adding _1, _2, _3, etc.
You could then PROC PRINT the dataset, if you want this in the output window (and that should show the label, thus getting your desired repeated variable names in the output).
data have;
input pr_id lob_id $ prec1 prec2 prec3;
datalines;
112 1a 3478 56 77
112 1b 3466 65 43
112 1c 5677 57 68
112 1d 5634 49 52
215 2a 1234 43 45
215 2b 9787 32 43
215 2c 4566 39 90
388 3a 8797 88 99
388 3b 6579 58 72
388 3c 9087 76 67
;;;;
run;
data have_pret;
set have;
by pr_id;
array precs prec:;
if first.pr_id then counter=0;
counter+1;
varnamecounter+1;
valuet=lob_id;
idname=cats("lob_id",'_',counter);
idlabel="lob_id";
output;
call missing(valuet);
do __t = 1 to dim(precs);
varnamecounter+1;
valuen=precs[__t];
idname=cats('prec',__t,'_',counter);
idlabel=vlabel(precs[__t]);
output;
end;
call missing(valuen);
keep pr_id valuet valuen idname idlabel varnamecounter;
run;
proc sort data=have_pret out=varcounter(keep=idname varnamecounter);
by idname varnamecounter;
run;
data varcounter_fin;
set varcounter;
by idname varnamecounter;
if first.idname;
run;
proc sql;
select idname into :varlist separated by ' '
from varcounter_fin order by varnamecounter;
quit;
proc transpose data=have_pret(where=(not missing(valuen))) out=want_n;
by pr_id;
var valuen;
id idname;
idlabel idlabel;
run;
proc transpose data=have_pret(where=(missing(valuen))) out=want_t;
by pr_id;
var valuet;
id idname;
idlabel idlabel;
run;
data want;
retain pr_id &varlist.;
merge want_n want_t;
by pr_id;
drop _name_;
run;
To do this in SQL is irritating; SAS doesn't support the advanced SQL table functions that would permit you to transpose it neatly without hardcoding everything. It would be something like
proc sql;
select pr_id,
max(lob_id1) as lob_id1, max(prec1_1) as prec1_1, max(prec2_1) as prec2_1, max(prec3_1) as prec3_1,
max(lob_id2) as lob_id2, max(prec1_2) as prec1_2, max(prec2_2) as prec2_2, max(prec3_2) as prec3_2 from (
select pr_id,
case when substr(lob_id,2,1)='a' then lob_id else ' ' end as lob_id1,
case when substr(lob_id,2,1)='a' then prec1 else . end as prec1_1,
case when substr(lob_id,2,1)='a' then prec2 else . end as prec2_1,
case when substr(lob_id,2,1)='a' then prec3 else . end as prec3_1,
case when substr(lob_id,2,1)='b' then lob_id else ' ' end as lob_id2,
case when substr(lob_id,2,1)='b' then prec1 else . end as prec1_2,
case when substr(lob_id,2,1)='b' then prec2 else . end as prec2_2,
case when substr(lob_id,2,1)='b' then prec3 else . end as prec3_2
from have )
group by pr_id;
quit;
but extended to include 3 and 4. You can see why it's silly to do this in SQL I hope :) The SAS code is probably actually shorter, and is doing far more work to make this easily extendable - you could skip half of it if you just hardcoded that retain statement, for example.