AWK sum and group by: output with headers

I have a huge csv with this structure (sample):
| DATE | WEEKDAY | Shop Code |Shop Manager|Item Presentation Time|Item Sell|
|02-Mar |MONDAY | BOG | Tom |1030 |0 |
|02-Mar |TUESDAY | TEF | Lucas |1300 |1 |
|02-Mar |WEDNESDAY | TDC | Eriberto |1300 |1 |
|02-Mar |THURSDAY | TEF | Lucas |1300 |1 |
|02-Mar |FRIDAY | TEF | Lucas |1300 |1 |
|02-Mar |SATURDAY | GTY | Maya |1600 |1 |
|02-Mar |SUNDAY | TDC | Eriberto |1300 |1 |
I am interested in the sum of successful events ($6) per weekday, the count of presentations per weekday ($2), and the success rate (sum of $6 / count of $2 * 100).
I wrote the following script:
#!/bin/awk -f
BEGIN {FS = OFS = ","}
NR != 1 {a[$2] += $6; count[$2]++}
END {for (i in a) print i, a[i], count[i], a[i]/count[i]*100}
The script runs:
$ awk -f script.awk raw_file.csv > new_file.csv
It works out perfectly and the output is:
|MONDAY | 2 | 10 |0.20|
|TUESDAY | 18 | 30 |0.60|
|WEDNESDAY | 10 | 20 |0.50|
|THURSDAY | 1 | 20 |0.05|
|FRIDAY | 1 | 15 |0.07|
|SATURDAY | 60 | 100 |0.60|
|SUNDAY | 47 | 80 |0.59|
However, I would like to add a header row (WEEKDAY, SUCCESSFUL_EVENTS, TOTAL_EVENTS and SUCCESSFUL_RATE) to the output. I have no idea how to combine the NR logic with the header in the same script.
I can show the output with:
awk 'NR==1 {print "WEEKDAY","SUCCESSFUL_EVENTS","TOTAL_EVENTS","SUCCESSFUL_RATE"} {print $0}' new_file.csv
but I see no way to integrate this into the script.
Any suggestion is really appreciated.

You can do this in the BEGIN section of your script:
#!/bin/awk -f
BEGIN {
FS = OFS = ","
print "WEEKDAY", "SUCCESSFUL_EVENTS", "TOTAL_EVENTS", "SUCCESSFUL_RATE"
}
# ...
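Putting it together, a complete runnable sketch (the file name raw_file.csv and the sample rows are taken from the question):

```shell
# Recreate the question's sample input
cat > raw_file.csv <<'EOF'
DATE,WEEKDAY,Shop Code,Shop Manager,Item Presentation Time,Item Sell
02-Mar,MONDAY,BOG,Tom,1030,0
02-Mar,TUESDAY,TEF,Lucas,1300,1
02-Mar,WEDNESDAY,TDC,Eriberto,1300,1
02-Mar,THURSDAY,TEF,Lucas,1300,1
02-Mar,FRIDAY,TEF,Lucas,1300,1
02-Mar,SATURDAY,GTY,Maya,1600,1
02-Mar,SUNDAY,TDC,Eriberto,1300,1
EOF

# Header printed once in BEGIN; data rows aggregated per weekday
out=$(awk '
BEGIN {
    FS = OFS = ","
    print "WEEKDAY", "SUCCESSFUL_EVENTS", "TOTAL_EVENTS", "SUCCESSFUL_RATE"
}
NR != 1 {
    sum[$2] += $6
    count[$2]++
}
END {
    for (d in sum) print d, sum[d], count[d], sum[d] / count[d] * 100
}' raw_file.csv)

echo "$out"
```

With the one-row-per-weekday sample this prints, e.g., MONDAY,0,1,0 and TUESDAY,1,1,100 after the header (the `for (d in sum)` iteration order is unspecified).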

Related

check if column matches any line in file with awk

Say I have some output from the command openstack security group list:
+--------------------------------------+---------+------------------------+----------------------------------+------+
| ID | Name | Description | Project | Tags |
+--------------------------------------+---------+------------------------+----------------------------------+------+
| 1dda8a57-fff4-4832-9bac-4e806992f19a | default | Default security group | 0ce266c801ae4611bb5744a642a01eda | [] |
| 2379d595-0fdc-479f-a211-68c83caa9d42 | default | Default security group | 602ad29db6304ec39dc253bcbba408a7 | [] |
| 431df666-a9ba-4643-a3a0-9a70c89e1c05 | tempest | tempest test | b320a32508a74829a0563078da3cba2e | [] |
| 5b54e63c-f2e5-4eda-b2b9-a7061d19695f | default | Default security group | 57e745b9612941709f664c58d93e4188 | [] |
| 6381ebaf-79fb-4a31-bc32-49e2fecb7651 | default | Default security group | f5c30c42f3d74b8989c0c806603611da | [] |
| 6cce5c94-c607-4224-9401-c2f920c986ef | default | Default security group | e3190b309f314ebb84dffe249009d9e9 | [] |
| 7402fdd3-0f1e-4eb1-a9cd-6896f1457567 | default | Default security group | d390b68f95c34cefb0fc942d4e0742f9 | [] |
| 76978603-545b-401d-9959-9574e907ec57 | default | Default security group | 3a7b5361e79f4914b09b022bcae7b44a | [] |
| 7705da1e-d01e-483d-ab82-c99fdb9eba9c | default | Default security group | 1da03b5e7ce24be38102bd9c8f99e914 | [] |
| 7fd52305-850c-4d9a-a5e9-0abfb267f773 | default | Default security group | 5b20d6b7dfab4bfbac0a1dd3eb6bf460 | [] |
| 82a38caa-8e7f-468f-a4bc-e60a8d4589a6 | default | Default security group | d544d2243caa4e1fa027cfdc38a4f43e | [] |
| a4a5eaba-5fc9-463a-8e09-6e28e5b42f80 | default | Default security group | 08efe6ec9b404119a76996907abc606b | [] |
| e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1 | default | Default security group | 539c238bf0e84463b8639d0cb0278699 | [] |
| f96bf2e8-35fe-4612-8988-f489fd4c04e3 | default | Default security group | 2de96a1342ee42a7bcece37163b8dfa0 | [] |
+--------------------------------------+---------+------------------------+----------------------------------+------+
And I have a list of Project IDs:
0ce266c801ae4611bb5744a642a01eda
b320a32508a74829a0563078da3cba2e
57e745b9612941709f664c58d93e4188
f5c30c42f3d74b8989c0c806603611da
e3190b309f314ebb84dffe249009d9e9
d390b68f95c34cefb0fc942d4e0742f9
3a7b5361e79f4914b09b022bcae7b44a
5b20d6b7dfab4bfbac0a1dd3eb6bf460
d544d2243caa4e1fa027cfdc38a4f43e
08efe6ec9b404119a76996907abc606b
539c238bf0e84463b8639d0cb0278699
2de96a1342ee42a7bcece37163b8dfa0
which is the intersection of two files I get from running fgrep -x -f projects secgrup.
How can I extract the values in the ID column for the rows whose Project column matches this list that I have?
It would be something like:
openstack security group list | awk '$2 && $2!="ID" && $10 in $(fgrep -x -f projects secgrup) {print $2}'
which should yield:
1dda8a57-fff4-4832-9bac-4e806992f19a
431df666-a9ba-4643-a3a0-9a70c89e1c05
5b54e63c-f2e5-4eda-b2b9-a7061d19695f
6381ebaf-79fb-4a31-bc32-49e2fecb7651
6cce5c94-c607-4224-9401-c2f920c986ef
7402fdd3-0f1e-4eb1-a9cd-6896f1457567
76978603-545b-401d-9959-9574e907ec57
7fd52305-850c-4d9a-a5e9-0abfb267f773
82a38caa-8e7f-468f-a4bc-e60a8d4589a6
a4a5eaba-5fc9-463a-8e09-6e28e5b42f80
e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1
f96bf2e8-35fe-4612-8988-f489fd4c04e3
but obviously this doesn't work.
You can use this awk:
awk -F ' *\\| *' 'FNR == NR {arr[$1]; next}
$5 in arr {print $2}' projects secgrup
1dda8a57-fff4-4832-9bac-4e806992f19a
431df666-a9ba-4643-a3a0-9a70c89e1c05
5b54e63c-f2e5-4eda-b2b9-a7061d19695f
6381ebaf-79fb-4a31-bc32-49e2fecb7651
6cce5c94-c607-4224-9401-c2f920c986ef
7402fdd3-0f1e-4eb1-a9cd-6896f1457567
76978603-545b-401d-9959-9574e907ec57
7fd52305-850c-4d9a-a5e9-0abfb267f773
82a38caa-8e7f-468f-a4bc-e60a8d4589a6
a4a5eaba-5fc9-463a-8e09-6e28e5b42f80
e7c531e3-cdc3-4b7c-bf32-934a2f2de3f1
f96bf2e8-35fe-4612-8988-f489fd4c04e3
Here:
-F ' *\\| *' sets the input field separator to | surrounded by zero or more spaces on each side.
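As a quick self-contained check of the two-file lookup (trimmed copies of the two inputs, with contents taken from the question; the +---+ border lines are omitted):

```shell
cat > projects <<'EOF'
0ce266c801ae4611bb5744a642a01eda
b320a32508a74829a0563078da3cba2e
EOF

cat > secgrup <<'EOF'
| 1dda8a57-fff4-4832-9bac-4e806992f19a | default | Default security group | 0ce266c801ae4611bb5744a642a01eda | [] |
| 2379d595-0fdc-479f-a211-68c83caa9d42 | default | Default security group | 602ad29db6304ec39dc253bcbba408a7 | [] |
| 431df666-a9ba-4643-a3a0-9a70c89e1c05 | tempest | tempest test | b320a32508a74829a0563078da3cba2e | [] |
EOF

# First pass (FNR == NR) loads the project IDs into arr;
# second pass prints the ID column ($2) when the Project column ($5) matches
ids=$(awk -F ' *\\| *' 'FNR == NR {arr[$1]; next}
                        $5 in arr {print $2}' projects secgrup)
echo "$ids"
```

Only the first and third rows match, so two IDs are printed.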
Based only on your shown samples, you could try the following awk code (written and tested with GNU awk).
awk '
FNR==NR{
arr1[$0]
next
}
match($0,/.*default \| Default security group \| (\S+)/,arr2) && (arr2[1] in arr1){
print arr2[1]
}
' ids Input_file
Explanation:
The FNR==NR condition is TRUE only while the first input file, named ids (where your IDs are stored), is being read.
For each line of ids, an array named arr1 gets the current line as an index.
The next statement then skips all further statements for that line.
For the second file, the match function is called with the regex .*default \| Default security group \| (\S+), which creates one capturing group and stores its value in the array arr2.
If the captured value (arr2[1]) is present in arr1, it is printed; otherwise nothing happens.

awk code to filter sequences below 80 bp and 80% coverage

I want to filter my table to keep sequences that are at least 80 base pairs long (end - begin + 1 >= 80) and span at least 80% of their total length (the base pairs left over should be <= 20% of the total length, where (end - begin + 1) + left = total length):
| query sequence | begin | end | (left)|
| -------------- | ------| --- | ----- |
| D1 | 1 | 330 | (1939)|
| D2 | 2180 | 2269| (0) |
| D3 | 4 | 168 | (0) |
| D4 | 1 | 1610| (0) |
| D5 | 1 | 402 | (84) |
| D6 | 1 | 58 | (0) |
| D7 | 1 | 79 | (0) |
| D8 | 4 | 167 | (437) |
| D9 |310 | 478 | (214) |
| D10 |1 | 227 | (234) |
| D11 |2 | 604 | (141) |
This is my awk code:
awk '{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}' | awk '$18 >= 0.8 {print $0}'
However, some sequences are not filtered according to the minimum-80-base-pair rule or the 80%-of-total-length rule. Where am I going wrong?
the expected output:
| query sequence | begin | end | (left)|
| -------------- | ------| --- | ----- |
| D2 | 2180 | 2269| (0) |
| D3 | 4 | 168 | (0) |
| D4 | 1 | 1610| (0) |
| D5 | 1 | 402 | (84) |
Column $8 (left) has parentheses around the numbers, therefore awk fails to interpret $8 as a number and uses 0 instead. Example: awk '{print $1+2}' <<< '(3)' prints 2 instead of 5.
You can extract the number inside the parentheses into a variable using left=$8; gsub(/[()]/,"",left).
By the way: No need for 2 awk scripts. You can do everything in one script:
awk '{left=$8; gsub(/[()]/,"",left); bp=$7-$6+1; tl=bp+left} bp>=80 && bp>0.8*tl'
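To sanity-check the rule against the question's sample, here is a sketch using the 4-column table as shown, so begin, end and left are $2, $3 and $4 (the real file apparently has more columns, hence the $6/$7/$8 above):

```shell
cat > seqs <<'EOF'
D1 1 330 (1939)
D2 2180 2269 (0)
D3 4 168 (0)
D4 1 1610 (0)
D5 1 402 (84)
D6 1 58 (0)
D7 1 79 (0)
D8 4 167 (437)
D9 310 478 (214)
D10 1 227 (234)
D11 2 604 (141)
EOF

# Strip the parentheses, compute span (bp) and total length (tl), then filter
kept=$(awk '{left=$4; gsub(/[()]/,"",left); bp=$3-$2+1; tl=bp+left}
            bp>=80 && bp>0.8*tl {print $1}' seqs)
echo "$kept"
```

This keeps D2, D3, D4 and D5 as expected, but also D11: its span is 603 of 744 bp, about 81%, which satisfies the stated 80% rule even though the question's expected output omits it.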
You might set a custom field separator so that $8 (and the other columns) contains just the number rather than digits wrapped in ( and ), i.e. replace
awk '{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}'
using
awk 'BEGIN{FS="[)[:space:](]+"}{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}'
Explanation: treat any combination of ), whitespace, and ( as the field separator (FS). Not tested due to the lack of sample input as text. If you want to know more about FS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR.
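For example, on a hypothetical whitespace-separated row with the layout of the sample table, this FS yields bare numbers:

```shell
row='D5 1 402 (84)'
# ")", "(" and whitespace all act as the separator, so "(84)" becomes plain 84
left=$(printf '%s\n' "$row" | awk 'BEGIN{FS="[)[:space:](]+"}{print $4}')
echo "$left"
```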

PostgreSQL: show trips within a bounding box

I have a trips table containing user's trip information, like so:
select * from trips limit 10;
trip_id | daily_user_id | session_ids | seconds_start | lat_start | lon_start | seconds_end | lat_end | lon_end | distance
---------+---------------+-------------+---------------+------------+------------+-------------+------------+------------+------------------
594221 | 16772 | {170487} | 1561324555 | 41.1175475 | -8.6298934 | 1561325119 | 41.1554091 | -8.6283493 | 5875.39697884959
563097 | 7682 | {128618} | 1495295471 | 41.1782829 | -8.5950303 | 1495299137 | 41.1783908 | -8.5948965 | 5364.81067787512
596303 | 17264 | {172851} | 1578011699 | 41.5195598 | -8.6393526 | 1578012513 | 41.4614024 | -8.717709 | 11187.7956426909
595648 | 17124 | {172119} | 1575620857 | 41.1553116 | -8.6439528 | 1575621885 | 41.1621821 | -8.6383042 | 1774.83365424607
566061 | 8720 | {133624} | 1509005051 | 41.1241975 | -8.5958988 | 1509006310 | 41.1424158 | -8.6101461 | 3066.40306678979
566753 | 8947 | {134662} | 1511127813 | 41.1887996 | -8.5844238 | 1511129839 | 41.2107519 | -8.5511712 | 5264.64026582458
561179 | 7198 | {125861} | 1493311197 | 41.1776935 | -8.5947254 | 1493311859 | 41.1773815 | -8.5947254 | 771.437257541019
541328 | 2119 | {46950} | 1461103381 | 41.1779 | -8.5949738 | 1461103613 | 41.1779129 | -8.5950202 | 177.610819150637
535519 | 908 | {6016} | 1460140650 | 41.1644658 | -8.6422775 | 1460141201 | 41.1642646 | -8.6423309 | 1484.61552373019
548460 | 3525 | {102026} | 1462289206 | 41.177689 | -8.594679 | 1462289843 | 41.1734476 | -8.5916326 | 1108.05119077308
(10 rows)
The task is to filter trips that start and end within the bounding box defined by upper left: 41.24895, -8.68494 and lower right: 41.11591, -8.47569.
If I understand correctly, you can just compare the starting and ending coordinates:
select t.*
from trips t
where lat_start >= 41.11591 and lat_start <= 41.24895 and
      lat_end >= 41.11591 and lat_end <= 41.24895 and
      lon_start >= -8.68494 and lon_start <= -8.47569 and
      lon_end >= -8.68494 and lon_end <= -8.47569;
Since your coordinates are stored in plain x,y columns, you have to use ST_MakePoint to create a proper geometry. After that, you can create a BBOX using the function ST_MakeEnvelope (arguments: xmin, ymin, xmax, ymax, srid) and check whether the start and end coordinates fall inside the BBOX using ST_Contains, e.g.
WITH bbox(geom) AS (
VALUES (ST_MakeEnvelope(-8.68494,41.11591,-8.47569,41.24895,4326))
)
SELECT * FROM trips,bbox
WHERE
ST_Contains(bbox.geom,ST_SetSRID(ST_MakePoint(lon_start,lat_start),4326)) AND
ST_Contains(bbox.geom,ST_SetSRID(ST_MakePoint(lon_end,lat_end),4326));
Note: the CTE isn't really necessary and is in the query just for illustration purposes. You can repeat the ST_MakeEnvelope function on both conditions in the WHERE clause instead of bbox.geom. This query also assumes the SRS WGS84 (4326).

Compare two columns and count the result rows

I want to count, for each row of my SQLite table, how many rows have a last column equal to that row's first column. The data set has 16+ million rows and efficiency is very important.
I have tried:
SELECT * FROM tab WHERE [0] = [3]
but it doesn't work, probably because it compares the first column of each row with the last column of the same row.
Let's assume this is my data set:
0 |1 |2 |3 |
--------------------------------------
2005:67 |ytg |6utgjgt |786:09 |
2005:903 |467 |009 |2005:67 |
2005:444 |355 |785 |2005:450|
2005:450 |355 |785 |N/A |
2005:934 |467 |009 |N/A |
2005:000 |355 |785 |2005:450|
2005:987 |355 |785 |2005:450|
--------------------------------------
the output should be this:
0 |1 |2 |3 |4 |
-----------------------------------------------
2005:67 |ytg |6utgjgt |786:09 |1 |
2005:450 |355 |785 |N/A |3 |
2005:934 |467 |009 |N/A |0 |
-----------------------------------------------
The rows whose 4th column matched the first column of one of the rows are dropped but counted. (It is not possible for the 4th column of a row to match the first column of more than one row, and the first column's values are unique.)
Can anybody please help me? I am a rookie and would greatly appreciate some explanation along with the code. Thank you.
You can do it with a correlated subquery for the count and NOT EXISTS to drop the matched rows:
select t.*,
(select count(*) from tab where [3] = t.[0]) [4]
from tab t
where not exists (
select 1 from tab
where [0] = t.[3]
)
Results:
| 0 | 1 | 2 | 3 | 4 |
| -------- | --- | ------- | ------ | --- |
| 2005:67 | ytg | 6utgjgt | 786:09 | 1 |
| 2005:450 | 355 | 785 | N/A | 3 |
| 2005:934 | 467 | 009 | N/A | 0 |

how to edit '|' results from sqlite3 using the sed command

Hey guys, so I have this database:
id | item_name | number_of_store| store_location|
+----+---------+-------------------+-------------+
| 3 | margarine | 2 | QLD |
| 4 | margarine | 2 | NSW |
| 5 | wine | 3 | QLD |
| 6 | wine | 3 | NSW |
| 7 | wine | 3 | NSW |
| 8 | laptop | 1 | QLD |
+----+---------+-------------------+-------------+
I got the result I wanted using sqlite3, which is the following:
id | item_name | number_of_store| store_location|
+----+---------+-------------------+-------------+
| 3 | margarine | 2 | QLD |
| 4 | margarine | 2 | NSW |
The syntax is:
sqlite3 store.sqlite 'select item_name,number_of_store,store_location from store where item_name = "margarine"' > store.txt
but when I saved it to a txt file I got
3|margarine|2|QLD
4|margarine|2|NSW
however, my desired output in the txt is
3,margarine,2,QLD
4,margarine,2,NSW
I think I should use sed but am not quite sure how to do it.
I tried with
'|sed 's/|//g' |sed 's/|//g'|sed 's/^//g'|sed 's/$//g'
however, that only erases the '|'; I'm not sure how to change it to ','.
Though you could handle this in the SQL itself, as per your request you could use the following awk (the | must be escaped, since it is a regex metacharacter):
awk '{gsub(/\|/,",")} 1' Input_file
Or in sed:
sed 's#|#,#g' Input_file
In case you want to save the output into Input_file itself, use sed's -i.bak option; it will take a backup of Input_file and save the output into Input_file in place.
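A minimal check of both substitutions on the question's data (the file name store.txt is taken from the question):

```shell
printf '3|margarine|2|QLD\n4|margarine|2|NSW\n' > store.txt
# sed: replace every | with ,
csv_sed=$(sed 's#|#,#g' store.txt)
# awk: same substitution; | is escaped because it is a regex metacharacter
csv_awk=$(awk '{gsub(/\|/,",")} 1' store.txt)
echo "$csv_sed"
```

Both produce the same comma-separated rows.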