Grep multiple inputs from a huge file, but only the first occurrence of each - awk

I am trying to build a "unique" zip code list based on data from geojson.
The goal is to grab one whole line per zip code. There can be multiple entries per zip code; all I care about is grabbing one per zip code.
I've prepared a file of unique zip codes to pass to grep as a filter against the big file.
However, this still returns multiple results per zip code.
When I limit the results with -m 1, I only get the very first match overall.
How can I get one line per zip code from the big file?
The input (example)
9417 TG
9423 TA
9431 HK
9883 TB
9965 TN
The command:
grep -f infile.txt bigfile.txt
the output:
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4592 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9098 6.3685 ;
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4658 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9066 6.3802 ;
9431 HK Westerbork Drenthe NLD Netherlands 52.8613 6.6029 ;
9431 HK Oosterwolde Friesland NLD Netherlands 52.9851 6.2986 ;
9883 TB Zuurdijk Groningen NLD Netherlands 53.3147 6.3558 ;
9965 TN Zuurdijk Groningen NLD Netherlands 53.3506 6.3691 ;
9965 TN Leens Groningen NLD Netherlands 53.3523 6.37 ;
9883 TB Oldehove Groningen NLD Netherlands 53.3108 6.3632 ;
As you can see, each zip code has two entries, for example 9423 TA and 9965 TN.
How can I crunch that down to one entry per zip code?
Thank you kindly for your help!

This job is better suited to awk:
awk '
NR == FNR {
zip[$1] = $2
next
}
$2 == zip[$1] {
print
delete zip[$1]
}' infile.txt bigfile.txt
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4592 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9098 6.3685 ;
9431 HK Westerbork Drenthe NLD Netherlands 52.8613 6.6029 ;
9883 TB Zuurdijk Groningen NLD Netherlands 53.3147 6.3558 ;
9965 TN Zuurdijk Groningen NLD Netherlands 53.3506 6.3691 ;

You might use GNU AWK to deduplicate the records of bigfile.txt on the 1st field as follows. Let the content of bigfile.txt be
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4592 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9098 6.3685 ;
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4658 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9066 6.3802 ;
9431 HK Westerbork Drenthe NLD Netherlands 52.8613 6.6029 ;
9431 HK Oosterwolde Friesland NLD Netherlands 52.9851 6.2986 ;
9883 TB Zuurdijk Groningen NLD Netherlands 53.3147 6.3558 ;
9965 TN Zuurdijk Groningen NLD Netherlands 53.3506 6.3691 ;
9965 TN Leens Groningen NLD Netherlands 53.3523 6.37 ;
9883 TB Oldehove Groningen NLD Netherlands 53.3108 6.3632
then
awk '!arr[$1]++' bigfile.txt
gives output
9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4592 ;
9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9098 6.3685 ;
9431 HK Westerbork Drenthe NLD Netherlands 52.8613 6.6029 ;
9883 TB Zuurdijk Groningen NLD Netherlands 53.3147 6.3558 ;
9965 TN Zuurdijk Groningen NLD Netherlands 53.3506 6.3691 ;
Explanation: arr is an associative array keyed by the 1st field. The post-increment ++ returns the current value and then increases it by 1, and a key that has not been encountered yet is treated as 0. So arr[$1]++ evaluates to the number of earlier lines with the same 1st field, and ! negates it: the pattern is true only when that count is 0, which triggers the default action of printing. As a result, only the first line for each unique first-column value is printed.
Use > to save the result to a file, e.g.
awk '!arr[$1]++' bigfile.txt > bigfileunique.txt
and then use that file in your command, that is
grep -f infile.txt bigfileunique.txt
(tested in gawk 4.2.1)
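For comparison, the same "first line per wanted key" idea can be sketched in Python. This is only an illustration, not a replacement for the awk one-liners: the function name first_match_per_key is made up here, and it assumes the zip code is always the first two whitespace-separated fields of each line. Unlike the one-field awk key, it keys on both fields, so codes that share the digits but differ in the letters stay distinct.

```python
def first_match_per_key(wanted_lines, big_lines):
    """Yield the first line of big_lines for each wanted zip code."""
    # Key on both fields ("9417", "TG"); blank lines in the wanted list are ignored.
    remaining = {tuple(line.split()[:2]) for line in wanted_lines if line.strip()}
    for line in big_lines:
        key = tuple(line.split()[:2])
        if key in remaining:
            remaining.discard(key)  # later lines with this key are skipped
            yield line
        if not remaining:           # every wanted code was found: stop reading
            break


wanted = ["9417 TG", "9423 TA"]
big = [
    "9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4592 ;",
    "9423 TA Hoogersmilde Drenthe NLD Netherlands 52.9098 6.3685 ;",
    "9417 TG Spier Drenthe NLD Netherlands 52.8178 6.4658 ;",
    "9431 HK Westerbork Drenthe NLD Netherlands 52.8613 6.6029 ;",
]
result = list(first_match_per_key(wanted, big))
```

Because the function only iterates, open file objects can be passed in place of the lists, so the big file is never loaded into memory at once.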

Related

I need help with a query: what is the average population of the districts in each country?

I have only one table, City:
ID    Name           Country Code  District      Population
6     Rotterdam      NLD           Zuid-Holland  593321
3878  Scottsdale     USA           Arizona       202705
3965  Corona         USA           California    124966
3973  Concord        USA           California    121780
3977  Cedar Rapids   USA           Iowa          120758
3982  Coral Springs  USA           Florida       117549
1613  Neyagawa       JPN           Osaka         257315
1630  Ageo           JPN           Saitama       209442
The expected result is:
countrycode  avg(population)
JPN          xxxxxx
NLD          xxxxxxx
USA          xxxxxxx
I used the following code but did not get the expected result:
select avg(population)
from city
where countrycode='JPN' and 'USA' and 'NLD'
group by district;
The above code gives me a blank result: the "avg(population)" column comes back empty.
I am using SQL Workbench.
Try:
select countrycode, avg(population) as avg_population
from city
where countrycode in('JPN', 'USA', 'NLD')
group by countrycode;
This should do the job :
select countrycode, avg(population)
from city
where countrycode in('JPN','USA','NLD')
group by countrycode;
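To see the difference between the two WHERE clauses on the sample rows, here is a small sketch using Python's built-in sqlite3 module. SQLite stands in for the asker's database, which is an assumption, and other engines may treat the string operands of AND differently. In SQLite, 'USA' and 'NLD' used as conditions evaluate to false, so the original filter matches nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER, name TEXT, countrycode TEXT,"
    " district TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?, ?, ?)",
    [
        (6, "Rotterdam", "NLD", "Zuid-Holland", 593321),
        (3878, "Scottsdale", "USA", "Arizona", 202705),
        (3965, "Corona", "USA", "California", 124966),
        (3973, "Concord", "USA", "California", 121780),
        (3977, "Cedar Rapids", "USA", "Iowa", 120758),
        (3982, "Coral Springs", "USA", "Florida", 117549),
        (1613, "Neyagawa", "JPN", "Osaka", 257315),
        (1630, "Ageo", "JPN", "Saitama", 209442),
    ],
)

# The original filter: the bare strings are treated as conditions, not as
# alternatives for countrycode, so no row passes.
broken = conn.execute(
    "SELECT avg(population) FROM city"
    " WHERE countrycode = 'JPN' AND 'USA' AND 'NLD'"
    " GROUP BY district"
).fetchall()

# The corrected query: IN lists the alternatives, GROUP BY is per country.
averages = dict(
    conn.execute(
        "SELECT countrycode, avg(population) FROM city"
        " WHERE countrycode IN ('JPN', 'USA', 'NLD')"
        " GROUP BY countrycode"
    ).fetchall()
)
```

Note that this averages the city populations per country, which is what both answers compute.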

SQL - how would I join these two queries

I have the following queries, and am attempting to join them
SELECT COUNTRY_NAME, COUNTRY_ID
FROM OEHR_COUNTRIES;
these results
COUNTRY_NAME CO
---------------------------------------- --
Argentina AR
Australia AU
Belgium BE
Brazil BR
Canada CA
Switzerland CH
China CN
Germany DE
Denmark DK
Egypt EG
France FR
HongKong HK
Israel IL
India IN
Italy IT
Japan JP
Kuwait KW
Mexico MX
Nigeria NG
Netherlands NL
Singapore SG
United Kingdom UK
United States of America US
Zambia ZM
Zimbabwe ZW
my second query
SELECT COUNTRY_ID, COUNT(COUNTRY_ID) AS "LCOUNT"
FROM OEHR_LOCATIONS
GROUP BY COUNTRY_ID;
results
CO LCOUNT
-- -------
US 4
SG 1
CA 2
CH 2
IT 2
MX 1
CN 1
DE 1
JP 2
IN 1
AU 1
UK 3
BR 1
NL 1
When I attempt to join these two results, so that each country has its count after it:
SELECT OEHR_COUNTRIES.COUNTRY_NAME, OEHR_COUNTRIES.COUNTRY_ID, COUNT(OEHR_LOCATIONS.COUNTRY_ID) AS LCOUNT
FROM OEHR_COUNTRIES
OUTER JOIN OEHR_LOCATIONS
ON OEHR_COUNTRIES.COUNTRY_ID = OEHR_LOCATIONS.COUNTRY_ID
ORDER BY LCOUNT;
I get this error:
ON OEHR_COUNTRIES.COUNTRY_ID = OEHR_LOCATIONS.COUNTRY_ID
*
ERROR at line 4:
ORA-00904: "OEHR_COUNTRIES"."COUNTRY_ID": invalid identifier
What is causing this error?
Is there a simpler way to do what I am trying to achieve?
I assume this is what you need. It lists 0 for countries with no locations. If you don't want to list countries with no locations, use INNER JOIN instead.
SELECT C.COUNTRY_NAME,
case
when L.LCOUNT is null
then 0
else L.LCOUNT
END as LCOUNT
FROM OEHR_COUNTRIES C
LEFT JOIN
(SELECT COUNTRY_ID, COUNT(COUNTRY_ID) AS LCOUNT
FROM OEHR_LOCATIONS
GROUP BY COUNTRY_ID) L
on C.COUNTRY_ID=L.COUNTRY_ID
order by LCOUNT DESC
You're missing the mandatory LEFT (or, in other scenarios, RIGHT) before the optional OUTER in the join syntax.
At the moment the word OUTER is being misinterpreted as a table alias, which is what is causing the error you're getting - there is, to the parser, now an OUTER.COUNTRY_ID but not an OEHR_COUNTRIES.COUNTRY_ID.
Add the missing word to stop it being seen as an alias, and to stop it defaulting to an inner join:
SELECT OEHR_COUNTRIES.COUNTRY_NAME, OEHR_COUNTRIES.COUNTRY_ID,
COUNT(OEHR_LOCATIONS.COUNTRY_ID) AS LCOUNT
FROM OEHR_COUNTRIES
LEFT OUTER JOIN OEHR_LOCATIONS
ON OEHR_COUNTRIES.COUNTRY_ID = OEHR_LOCATIONS.COUNTRY_ID
GROUP BY OEHR_COUNTRIES.COUNTRY_NAME, OEHR_COUNTRIES.COUNTRY_ID
ORDER BY LCOUNT;
I've added the missing group-by clause too. With your sample data that gets:
COUNTRY_NAME CO LCOUNT
------------------------ -- ----------
Belgium BE 0
Argentina AR 0
Zimbabwe ZW 0
...
Zambia ZM 0
Mexico MX 1
China CN 1
...
Germany DE 1
Switzerland CH 2
Canada CA 2
Japan JP 2
Italy IT 2
United Kingdom UK 3
United States of America US 4
25 rows selected.
Without adding that missing word, changing the other references to the table to use the (wrong) OUTER alias instead would have meant it would execute, again with the group-by clause added:
SELECT OUTER.COUNTRY_NAME, OUTER.COUNTRY_ID, COUNT(OEHR_LOCATIONS.COUNTRY_ID) AS LCOUNT
FROM OEHR_COUNTRIES
OUTER JOIN OEHR_LOCATIONS
ON OUTER.COUNTRY_ID = OEHR_LOCATIONS.COUNTRY_ID
GROUP BY OUTER.COUNTRY_NAME, OUTER.COUNTRY_ID
ORDER BY LCOUNT;
but it wouldn't have done quite what you wanted - assuming you want to see zero counts for countries with no locations - since it's now an inner join:
COUNTRY_NAME CO LCOUNT
------------------------ -- ----------
Netherlands NL 1
India IN 1
...
Australia AU 1
Switzerland CH 2
Japan JP 2
Canada CA 2
Italy IT 2
United Kingdom UK 3
United States of America US 4
14 rows selected.
The 11 countries with no locations aren't shown at all with an inner join.
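The effect of the join type on the counts can be reproduced on a tiny subset of the data. The sketch below uses Python's sqlite3 module rather than Oracle, with shortened table names and a made-up location_id column, so treat it as an illustration of LEFT versus INNER join behaviour only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (country_id TEXT, country_name TEXT);
    CREATE TABLE locations (location_id INTEGER, country_id TEXT);
    INSERT INTO countries VALUES
        ('BE', 'Belgium'), ('CA', 'Canada'), ('US', 'United States of America');
    INSERT INTO locations VALUES
        (1, 'US'), (2, 'US'), (3, 'US'), (4, 'US'), (5, 'CA'), (6, 'CA');
    """
)

# Same query for both runs; only the join type changes.
query = """
    SELECT c.country_name, COUNT(l.country_id) AS lcount
    FROM countries c
    {join} JOIN locations l ON c.country_id = l.country_id
    GROUP BY c.country_name, c.country_id
    ORDER BY lcount, c.country_name
"""

left_rows = conn.execute(query.format(join="LEFT")).fetchall()
inner_rows = conn.execute(query.format(join="INNER")).fetchall()
```

COUNT(l.country_id) counts only non-null values, which is why the unmatched country gets 0 under the left join instead of 1.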

SAS proc SQL and Inner join - what are alternative methods

What I want to do is to find an alternative to the following code:
PROC SQL;
CREATE TABLE XXXX AS
SELECT DISTINCT t2.WC, t2.CWC
FROM YYYY t1
INNER JOIN ZZZZ t2 ON (t1.MC = t2.WC)
;
QUIT;
Could someone please help in doing the same thing using hash or any other method?
I have the following tables:
data have01;
infile cards truncover expandtabs;
input MC $ LC $ MCC $ MCN $ TLC $ DD $ ODS_TimeStamp ODS_LUpd zTPl $ PuD $;
cards;
1853 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1856 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1869 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
2024 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
2025 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
run;
You might have to format date column in the above table.
data have02;
infile cards truncover expandtabs;
input WPMVId ToSTimeStamp TId ASN WC $ CWC $ TSide $ MNo Y X;
cards;
1 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 1 -82140 2468
2 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 2 -81940 2466
3 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 3 -81739 2463
4 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 4 -81539 2459
5 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 5 -81339 2456
6 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 6 -81139 2453
run;
You might have to format date column in the above table.
Please help me find an alternative to the SQL code above. My Table 2 has almost 0.8 billion rows, and the SQL query above takes a very long time to run.
You can use a hash object. This is especially nice if you have a large dataset and you don't want to sort it prior to merging.
Suppose you have two data sets, Aset and Bset, in your work library and you want to merge them on the ID variables IDVar1 and IDVar2 (together they uniquely identify each entry, and both are defined in both data sets). All other variable names differ between the two data sets. The resulting data set will be called 'merged'. Here is a minimal example:
data Aset;
input idvar1 idvar2 var1inA var2inA;
datalines;
1 48 5 100
1 8 6 165
2 5 7 102
2 965 8 136
3 105 9 145
4 105 10 456
3 85 12 454
;
run;
data Bset;
input idvar1 idvar2 var1inB var2inB;
datalines;
2 48 5 100
2 965 6 165
2 5 7 102
1 965 8 136
5 105 9 145
3 105 10 456
3 85 12 454
;
run;
data merged (drop=retval);
if 0 then set Aset;
if _N_=1 then do;
declare hash hh(dataset:'Aset',ordered:'A');
hh.definekey('IDVar1','IDVar2');
hh.definedata(all:'Y');
hh.definedone();
end;
do while (not done);
set Bset end=done;
retval = hh.find();
if (retval=0) then output;
end;
stop;
run;
ODS LISTING:
Obs. idvar1 idvar2 var1inA var2inA var1inB var2inB
1 2 965 8 136 6 165
2 2 5 7 102 7 102
3 3 105 9 145 10 456
4 3 85 12 454 12 454
UPDATE:
The following code works for the data examples provided. I changed some of the formats to fit the values and added some length statements.
data have01;
infile cards truncover expandtabs;
length ODS_TimeStamp $23. ODS_LUpd $23. DD $10.;
input MC LC $ MCC MCN $ TLC $ DD $ ODS_TimeStamp $ ODS_LUpd $ zTPl PuD $;
cards;
1853 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1856 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1869 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
2024 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
2025 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
run;
data have02;
infile cards truncover expandtabs;
length ToSTimeStamp $23.;
input WPMVId ToSTimeStamp $ TId ASN WC CWC $ TSide $ MNo Y X;
cards;
1 21AUG2012:17:57:39.000 20949 1 2024 HPUS230 R 1 -82140 2468
2 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 2 -81940 2466
3 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 3 -81739 2463
4 21AUG2012:17:57:39.000 20949 1 2024 HPUS230 R 4 -81539 2459
5 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 5 -81339 2456
6 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 6 -81139 2453
run;
data merged (drop=retval);
if 0 then set have01;
if _N_=1 then do;
declare hash hh(dataset:'have01',ordered:'A');
hh.definekey('MC');
hh.definedata(all:'Y');
hh.definedone();
end;
do while (not done);
set have02 (rename=(WC=MC)) end=done;
retval = hh.find();
if (retval=0) then output;
end;
stop;
run;
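The idea behind the hash approach, independent of SAS, is to load the keys of the small table into memory once and then stream the large table past it without sorting. A rough Python sketch of that idea follows; the function name distinct_matches is made up, and the rows are the MC values and the (WC, CWC) pairs from the sample data in the update.

```python
def distinct_matches(small_keys, big_rows):
    """Yield each distinct (wc, cwc) pair whose wc appears in small_keys."""
    lookup = set(small_keys)  # built once from the small table
    seen = set()              # plays the role of SELECT DISTINCT
    for wc, cwc in big_rows:  # single pass over the large table, no sort
        if wc in lookup and (wc, cwc) not in seen:
            seen.add((wc, cwc))
            yield wc, cwc


mc_values = [1853, 1856, 1869, 2024, 2025]
have02 = [
    (2024, "HPUS230"),
    (7604, "HPUS230"),
    (7604, "HPUS230"),
    (2024, "HPUS230"),
    (7604, "HPUS230"),
    (7604, "HPUS230"),
]
pairs = list(distinct_matches(mc_values, have02))
```

Memory use grows with the small table and with the number of distinct output pairs, not with the size of the large table.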
Anything better than this answer as below...
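/* MERGE with BY expects both inputs to be sorted (or indexed) by wc */
data work.xx;
merge
work.yy (in=a keep=mc rename=(mc=wc))
work.zz (in=b keep=wc cwc)
;
by wc;
/* keep only keys present in both inputs, like an inner join */
if a and b;
run;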
proc sort data=work.xx nodupkey;
by wc cwc;
run;

SAS PROC SQL - issue with Left join on two character variables

DataSet: Test1
Name Type Length Format Informat
RowID Numeric 8 6. 6.
COL2 Character 6 $6. $6.
COL3 Numeric 8 NUMERIC12. NUMERIC12.
DATE Numeric 8 11. 11.
TIME Character 8 $CHAR8. $CHAR8.
Amount Numeric 8
DataSet: Test2
Name Type Length Format Informat
RowID Numeric 8 9. 9.
COL2 Character 32 $32. $32.
COL3 Date 8 DATETIME27.6 DATETIME27.6
COL4 Character 17 $17. $17.
TIME Character 8 $CHAR8. $CHAR8.
COL5 Numeric 8 NUMERIC12. NUMERIC12.
AMOUNT Numeric 8
Sample Data
Test1
RowID COL2 COL3 DATE TIME AMOUNT
3330 123456 123 20110523 14.14.50 2.00
3330 334567 123 20110523 19.13.34 2.00
3330 889789 123 20110523 20.01.11 2.00
3330 45678 1643 20110523 06.53.05 6.00
TEST2
RowID COL2 COL3 COL4 TIME COL5 Amount
3330 0010181002233611096xyBC3TLnDkVB7 23MAY2011:19:14:50.000000 20110523 14:14:50 14:14:50 123 2.00
3330 0010181005005029491mnopqrbT2cySA 24MAY2011:00:13:34.000000 20110523 19:13:34 19:13:34 123 2.00
3330 001018100222213220332ghijkl63BR1 23MAY2011:11:53:05.000000 20110523 06:53:05 06:53:05 1643 6.00
3330 00101810021738472682abcdef7vUcte 23MAY2011:13:30:03.000000 20110523 08:30:03 06:53:05 5575 1.00
I want to left join Test1 with Test2 on Columns Test1.COL3=Test2.COL5 and Test1.Time=Test2.TIME with the desired output to look like
Final
RowID COL2 COL3 DATE TIME AMOUNT RowID COL2 COL3 COL4 TIME COL5 Amount
3330 123456 123 20110523 14.14.50 2.00 3330 0010181002233611096xyBC3TLnDkVB7 23MAY2011:19:14:50.000000 20110523 14:14:50 14:14:50 123 2.00
3330 334567 123 20110523 19.13.34 2.00 3330 0010181005005029491mnopqrbT2cySA 24MAY2011:00:13:34.000000 20110523 19:13:34 19:13:34 123 2.00
3330 889789 123 20110523 20.01.11 2.00 . . . . . . .
3330 45678 1643 20110523 06.53.05 6.00 3330 001018100222213220332ghijkl63BR1 23MAY2011:11:53:05.000000 20110523 06:53:05 06:53:05 1643 6.00
I'm running the code below in SAS:
proc sql;
create table final as
select * from
(
(select * from Test1) A
left join
(select * from Test2) B
on Test1.COL3=Test2.COL5 and Test1.Time=Test2.TIME
)
quit;
and I'm not getting the desired output, even though the TIME column has the same length, format and informat in both datasets.
The results I get are:
Final
RowID COL2 COL3 DATE TIME AMOUNT RowID COL2 COL3 COL4 TIME COL5 Amount
3330 123456 123 20110523 14.14.50 2.00 . . . . . . .
3330 334567 123 20110523 19.13.34 2.00 . . . . . . .
3330 889789 123 20110523 20.01.11 2.00 . . . . . . .
3330 45678 1643 20110523 06.53.05 6.00 . . . . . . .
I do not understand what is wrong.
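Try normalizing the time separators in the join condition. In SAS, TRANSLATE takes the source string first, then the replacement characters, then the characters to replace:
PROC sql;
CREATE TABLE final AS
SELECT *
FROM (
(
SELECT *
FROM test1) A
LEFT JOIN
(
SELECT *
FROM test2) B
ON A.col3=B.col5
/* turn 14:14:50 into 14.14.50 before comparing */
AND A.time=translate(B.time,'.',':') );
quit;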
The values are simply not equal: 14.14.50 ne 14:14:50.
Fix your formats, or use INPUT to convert both to numeric values.
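The mismatch is easy to reproduce outside SAS. The sketch below uses Python's sqlite3 module and SQLite's replace() function, which is not a SAS function, on a cut-down version of the sample rows (only the join columns are kept, and the table layout is simplified for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE test1 (col2 TEXT, col3 INTEGER, time TEXT);
    CREATE TABLE test2 (col5 INTEGER, time TEXT);
    INSERT INTO test1 VALUES
        ('123456', 123, '14.14.50'), ('334567', 123, '19.13.34'),
        ('889789', 123, '20.01.11'), ('45678', 1643, '06.53.05');
    INSERT INTO test2 VALUES
        (123, '14:14:50'), (123, '19:13:34'),
        (1643, '06:53:05'), (5575, '06:53:05');
    """
)

# Plain equality: '14.14.50' never equals '14:14:50', so nothing matches.
plain = conn.execute(
    "SELECT count(b.time) FROM test1 a"
    " LEFT JOIN test2 b ON a.col3 = b.col5 AND a.time = b.time"
).fetchone()[0]

# Normalize the separator on one side before comparing.
joined = conn.execute(
    "SELECT a.col2, a.time, b.time FROM test1 a"
    " LEFT JOIN test2 b"
    " ON a.col3 = b.col5 AND a.time = replace(b.time, ':', '.')"
    " ORDER BY a.rowid"
).fetchall()
```

With the separators normalized, three of the four Test1 rows find a partner and the 20.01.11 row keeps missing values on the right-hand side, as in the desired output.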

Aggregating Total - SSRS 2008 / SQL

I am a new user of SSRS 2008 and SQL in general. Currently I am in the process of creating a report in Reporting Services, however I have a problem in achieving what I would like. These are the 4 columns in my current report:
AU De OP $12
AU De FX $13
EU De FX $6
GBP Bo Cor $8
EU De FX $14
AU De FX $9
GBP De FX $2
.. .. .. ..
What I would like to have is be able to aggregate column 3 and 4 by column 2 and 1. Sorry I do not know how to explain it exactly but something like this
AU De OP $12
AU De FX $22 ($13 + $9)
EU De FX $20 ($6 + $14)
GBP Bo Cor $8
GBP De FX $2
.. .. .. ..
I will greatly appreciate any insights anyone can give.
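This is a simple group by query. To reproduce the sample output, where rows with different values in column 3 stay separate, group by the first three columns:
select col1, col2, col3, sum(col4) as amount
from t
group by col1, col2, col3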
You should be able to set this up in SSRS.
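As a quick check against the sample rows from the question, here is a sketch using Python's sqlite3 module. It groups on the first three columns, which is what the desired output in the question shows. The table name t and the column names col1 to col4 are placeholders, and the dollar amounts are stored as plain integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT, col3 TEXT, col4 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("AU", "De", "OP", 12),
        ("AU", "De", "FX", 13),
        ("EU", "De", "FX", 6),
        ("GBP", "Bo", "Cor", 8),
        ("EU", "De", "FX", 14),
        ("AU", "De", "FX", 9),
        ("GBP", "De", "FX", 2),
    ],
)

# One output row per distinct (col1, col2, col3), with col4 summed.
totals = conn.execute(
    "SELECT col1, col2, col3, SUM(col4) AS amount FROM t"
    " GROUP BY col1, col2, col3"
    " ORDER BY col1, col2, col3"
).fetchall()
```

The ORDER BY is only there to make the result order predictable; in SSRS the same grouping can also be defined on the report side.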