I'm trying to find the number of Not A Time (NaT) values per group:
ID Date eVal
ddd 2014-02-12 2
ddd 2014-02-13 2
ddd NaT 2
aaa 2014-02-12 3
aaa 2014-02-13 3
aaa 2014-02-14 3
I basically need to add a new column which indicates the number of NaT occurrences for that ID.
How do I find that number?
Something like this:
df['NaT'] = df['Date'].isnull() * 1
df.groupby('ID')['NaT'].sum()
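If you want the count back on the original frame as a new column (as the question asks), transform broadcasts each group's sum onto every row of that group. A minimal sketch with the sample data above (the column name NaT_count is just an example):

import pandas as pd

df = pd.DataFrame({
    'ID':   ['ddd', 'ddd', 'ddd', 'aaa', 'aaa', 'aaa'],
    'Date': pd.to_datetime(['2014-02-12', '2014-02-13', None,
                            '2014-02-12', '2014-02-13', '2014-02-14']),
    'eVal': [2, 2, 2, 3, 3, 3],
})

# isnull() flags the NaT rows; transform('sum') broadcasts each ID's
# count back onto every row of that ID.
df['NaT_count'] = df['Date'].isnull().groupby(df['ID']).transform('sum')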
Two files:
f1:
1 aaa 123
2 bbb 555
3 ccc 666
f2:
1 aaa 444
2 ddd 666
3 eee 777
When a row in f2 has the same second-column value as a row in f1 (like aaa), replace column 3 of that f2 row with column 3 from f1, using awk or sed.
Output:
1 aaa 123
2 ddd 666
3 eee 777
Thanks
Try this:
awk 'FNR==NR { a[$2]=$3; next } { print $1, $2, ($2 in a ? a[$2] : $3) }' f1 f2
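The same logic, expanded with comments (a sketch; the filename merge.awk is just an example, run as awk -f merge.awk f1 f2). Testing $2 in a rather than the value a[$2] keeps a stored 0 or empty string from falling through to f2's value:

# First pass: FNR==NR is true only while reading f1.
# Remember f1's third column, keyed by its second column.
FNR == NR { a[$2] = $3; next }

# Second pass (f2): if the key was seen in f1, print f1's value;
# otherwise keep f2's own third column.
{ print $1, $2, ($2 in a ? a[$2] : $3) }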
I want to sum the corresponding lines from every block of N lines, for example with blocks of 4 lines:
cat file
1
11
111
1111
2
22
222
2222
3
33
333
3333
The output should be:
6 #(1+2+3)
66 #(11+22+33)
666 #(111+222+333)
6666 #(1111+2222+3333)
How can I do this with awk?
Basically you can use the following awk command:
awk -vN=4 '{s[NR%N]+=$0}END{for(i=0;i<N;i++){print s[i]}}' input.txt
You can choose N as you wish.
Output:
6666
6
66
666
But notice the output is in the wrong order: with NR%N, lines 4, 8 and 12 all land in bucket 0, so the last group's sum prints first. You can fix this by shifting the line number by one:
awk -v N=4 '{s[(NR-1)%N]+=$0} END {for (i=0; i<N; i++) print s[i]}' input.txt
Output:
6
66
666
6666
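The same program as a standalone file, with comments (a sketch; sums.awk is just an example filename, run as awk -v N=4 -f sums.awk input.txt):

# Add each line into a bucket chosen by its position within its block of N.
# (NR - 1) % N maps lines 1..N to buckets 0..N-1, preserving block order.
{ s[(NR - 1) % N] += $0 }

# After the last line, print the buckets in positional order.
END { for (i = 0; i < N; i++) print s[i] }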
I have a bash and awk script that I use to extract data from a text file.
However, it is too slow on large datasets and doesn't work perfectly. I believe all my bash looping can be replaced with a single awk command, and I'm asking for help putting it together.
cat dummy_list
AAA
AAAA
AAAAA
cat dummy_table
13 19 AAA 69 96 "ID-999" 34
23 42 AAA 12 19 "ID-999" 64
53 79 AAA 43 58 "ID-482" 36
13 43 AAA 12 15 "ID-492" 75
23 90 AAA 45 87 "ID-492" 34
12 41 AAAA 76 79 "ID-923" 23
19 58 AAAA 15 87 "ID-923" 75
10 40 AAAA 18 82 "ID-482" 23
11 18 AAAA 18 82 "ID-482" 52
15 19 AAAA 18 82 "ID-482" 62
59 69 AAAA 10 18 "ID-482" 83
78 89 AAAA 32 41 "ID-983" 24
23 53 AAAAA 78 99 "ID-916" 82
What I want from this table:
For every dummy_list item (AAA, AAAA or AAAAA), count how many times each ID was mentioned with a distinct range (by this I mean unique combinations of columns 4, 5 and 6, like 69 96 "ID-999"). There are duplicate ranges (like 18 82 "ID-482") and I have to discard them.
My script looks like this:
while read a; do
awk -v VAR="$a" '($3==VAR) {print $4"\t"$5"\t"$6}' dummy_table |
sort -u |
cut -f 3 |
sort |
uniq -c |
awk '{print $1}' |
tr '\n' ' ' |
awk -v VAR="$a" '{print VAR"\t"$0}'
done < dummy_list
AAA 1 2 2
AAAA 2 2 1
AAAAA 1
That is, for AAA: "ID-482" is mentioned once, "ID-492" twice and "ID-999" twice.
This is the output I want.
For every dummy_list item, get the average number of times it is mentioned with the same ID. For example, AAA occurs twice with "ID-999", once with "ID-482" and twice with "ID-492", so it's (2+1+2)/3 = 1.66.
My script looks like this:
while read a ; do
ID_TIMES=$(awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort -u |
wc -l) &&
awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort |
uniq -c |
awk -v VAR="$ID_TIMES" '{sum+=$1} END {print sum/VAR}'
done < dummy_list
AAA 1.666
AAAA 2.333
AAAAA 1
For every dummy_list item, extract the ID ranges and calculate the proportion between the columns.
For example:
for AAA's "ID-999":
RANGE1 = sum of $5-$4 = (96-69) + (19-12) = 34
RANGE2 = sum of $7 = 34 + 64 = 98
then RANGE2*100/RANGE1 = 9800/34 ≈ 288
For output like this:
AAA 288 240 242
....
AAAAA 390
I wasn't able to write such a script myself, as I got stuck juggling the two running sums RANGE1 and RANGE2.
If it is possible, it would be great to discard duplicate ranges like 18 82 "ID-482" in this step as well.
I believe that all three of these operations can be done with a single awk command, and I feel desperate about my scripts. I really hope that someone will help me with this.
You can try this.
file a.awk:
BEGIN {
    # read the list of items
    while ( ( getline < "dummy_list" ) > 0 )
    {
        items[$1] = 0
    }
}
{
    # count unique ids per item: bump the counter the first
    # time we see a given (item, id) pair
    key = $3 SUBSEP $6
    if ( ! ( key in ids ) && ( $3 in items ) )
    {
        unique_ids[$3] += 1
    }
    # count the amount of duplication per (item, id)
    ids[$3,$6] += 1
    # accumulate the range parameters
    range1[$3,$6] += $5 - $4
    range2[$3,$6] += $7
}
END {
    for ( item in items )
    {
        print "--- item = " item " ---\n"
        for ( key in ids )
        {
            split ( key, s, SUBSEP )
            if ( s[1] != item ) continue
            range = range2[key] * 100 / range1[key]
            # awk arithmetic is already floating point; no cast needed
            average[item] += ids[key] / unique_ids[item]
            print "id = " s[2] "\tamount of dup = " ids[key] " range = " int ( range )
        }
        print "\naverage = " average[item] "\n"
    }
}
run:
awk -f a.awk dummy_table
output:
--- item = AAAA ---
id = "ID-983" ammount of dup = 1 range = 266
id = "ID-923" ammount of dup = 2 range = 130
id = "ID-482" ammount of dup = 4 range = 110
average = 2.33333
--- item = AAAAA ---
id = "ID-916" ammount of dup = 1 range = 390
average = 1
--- item = AAA ---
id = "ID-999" ammount of dup = 2 range = 288
id = "ID-482" ammount of dup = 1 range = 240
id = "ID-492" ammount of dup = 2 range = 242
average = 1.66667
There is one thing I can't understand: how you got 225
for "ID-482" and item AAA in question #3.
RANGE2 * 100 / RANGE1 = 36 * 100 / ( 58 - 43 ) = 240.
Are you sure that your example in question #3 is correct?
Only a partial answer, but here is a one-liner for your first problem (it needs gawk for the count[$3][$6] arrays-of-arrays syntax, and note that it counts every mention without discarding duplicate ranges):
awk '{ group[$3]++; ind[$6]++; count[$3][$6]++ } END { for (i in group) for (j in ind) if (count[i][j] > 0) print i, j, count[i][j] }' dummy_table
Output:
AAA "ID-482" 1
AAA "ID-999" 2
AAA "ID-492" 2
AAAA "ID-923" 2
AAAA "ID-482" 4
AAAA "ID-983" 1
AAAAA "ID-916" 1
It is then fairly trivial to use this output to calculate the answer to your second question.
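For completeness, here is a commented sketch that also discards duplicate ranges (unique columns 4+5+6) before counting, which is what question #1 asks for. Save it as, say, unique_ranges.awk (the name is just an example) and run awk -f unique_ranges.awk dummy_list dummy_table:

# First file (dummy_list): remember which items we care about.
FNR == NR { items[$1] = 1; next }

# Second file (dummy_table): count each (item, range, id) combination once.
$3 in items {
    key = $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
    if ( ! (key in seen) ) {
        seen[key] = 1
        mentions[$3 SUBSEP $6]++    # unique ranges per (item, id)
    }
}

END {
    for (k in mentions) {
        split(k, p, SUBSEP)
        print p[1], p[2], mentions[k]
    }
}

With the sample data this collapses the three duplicate 18 82 "ID-482" rows for AAAA into one, giving counts of 1 2 2 for AAA, 2 2 1 for AAAA and 1 for AAAAA, matching the desired output.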
I have two files:
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two files on the key (column 3 of file2 against column 5 of file1) using the code below:
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in a variable or array and print it again after a[x].
My desired result is:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you can see, the first 7 columns are the result of my merge code. I need to add the last column (field 2 of a[x]) to my result.
Important:
My next question is: if I have an .awk file, how can I use some shell code like | column -t, or send the result to a file (awk ... > result.txt)? I always use these on the command line. Can I use them inside my code in the .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question: you can also use pipes and writes to files inside your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
    # first file (file2): remember each whole line
    two[$0]++
    next
}
{
    # second file (file1): compare against every stored file2 line
    for (i in two) {
        # split the stored line into fields so its key can be tested
        split(i, one)
        if (one[3] == $NF) {
            # pipe awk's own output through column -t for alignment
            print $1,$2,$3,$4, i, one[2] | "column -t"
        }
    }
}
Run like: awk -f script.awk file2 file1
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' "$1" "$2")
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second block, and you can make life easier for yourself by using the field splitting in the first phase to keep just the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }'
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
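A sketch of that check (assuming unmatched rows should simply be skipped; $5 in c1 tests whether the key was seen in file2):

awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
     ($5 in c1) { print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }'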
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's close to what was requested; note that the duplicate aaa rows are gone, because the later aaa line in file2 overwrites the earlier one under the same key. Column spacing can be fixed by using printf instead of print, or by setting OFS to a tab, or ...
The c1 and c2 notation for columns 1 and 2 is OK for two columns. If you need more, you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }'
This produces the same output as before.
To achieve what you ask, save the second field after the whole line when processing your first file, with a[$3]=$0 OFS $2. For your second question: awk has a variable that separates output fields, OFS; assign a tab to it and play with it. Your script would look like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22