BigQuery multiple join using clause - google-bigquery

I need to get the BGP AS details for the IP addresses in a table. The table contains SrcAddr and DstAddr columns, as shown in table1:
table1
SrcAddr DstAddr Bytes
1.1.1.1 8.8.8.8 1005
Table2 contains the BGP AS number details.
Table2
IPaddr Organization network_bin mask
1.1.1.0/24 Cloudflare asdjqowiq 24
8.8.8.0/24 Google asdqwrqsd 24
I want to build a final table like the one below:
Table3
SrcAddr SrcAS DstAddr DstAS Bytes
1.1.1.1 Cloudflare 8.8.8.8 Google 1005
I used the query below, based on https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html, and was able to get the src_as field, but I was not able to resolve dst_as. Can someone help me with this?
WITH source_of_ip_addresses AS (
SELECT SamplerAddress, REGEXP_REPLACE(SrcAddr, 'xxx', '0') srcip, REGEXP_REPLACE(DstAddr, 'xxx', '0') dstip
FROM `fluentd.netflow_message`
WHERE SrcAddr IS NOT null
GROUP BY 1,2,3
)
SELECT *, srcip, src_as,
FROM (
SELECT srcip, network_bin, mask, autonomous_system_organization as src_as
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(source_of_ip_addresses.srcip) & NET.IP_NET_MASK(4, mask) network_bin ,
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(srcip)) = 4
)
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
)

Just repeat the same process for the destination address. It is also more convenient to use WITH clauses instead of nested queries, which makes the code easier to repeat. Something like the query below. I don't have access to your tables, so I cannot check the syntax; wherever duplicate columns appear, use explicit column names rather than *.
WITH source_of_ip_addresses AS (
SELECT SamplerAddress, REGEXP_REPLACE(SrcAddr, 'xxx', '0') srcip, REGEXP_REPLACE(DstAddr, 'xxx', '0') dstip
FROM `fluentd.netflow_message`
WHERE SrcAddr IS NOT null
GROUP BY 1,2,3
), source_with_masks AS (
SELECT *, NET.SAFE_IP_FROM_STRING(source_of_ip_addresses.srcip) & NET.IP_NET_MASK(4, mask) network_bin ,
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(srcip)) = 4
), source_processed AS (
-- explicit columns, so network_bin and mask are not carried into the next step
SELECT SamplerAddress, srcip, dstip, autonomous_system_organization AS src_as
FROM source_with_masks
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
), dest_with_masks AS (
-- same as above, with dstip instead of srcip
SELECT *, NET.SAFE_IP_FROM_STRING(dstip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_processed, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(dstip)) = 4
), dest_processed AS (
SELECT SamplerAddress, srcip, src_as, dstip, autonomous_system_organization AS dst_as
FROM dest_with_masks
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
)
SELECT * from dest_processed
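To make the mask-and-join step more concrete, here is a minimal Python sketch of the same idea, not the BigQuery implementation itself. It assumes a tiny in-memory stand-in for the ASN table built from the two rows shown in the question; the names `ASN_BLOCKS`, `lookup_as` and `resolve_flow` are hypothetical. Unlike the SQL join, which returns every matching block, this sketch stops at the first match.

```python
import ipaddress

# Hypothetical stand-in for the ASN table, keyed by (network_bin, mask).
ASN_BLOCKS = {}
for cidr, org in [("1.1.1.0/24", "Cloudflare"), ("8.8.8.0/24", "Google")]:
    net = ipaddress.ip_network(cidr)
    ASN_BLOCKS[(net.network_address.packed, net.prefixlen)] = org


def lookup_as(ip_string):
    """Return the organization for an IPv4 string, or None if nothing matches."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return None  # plays the role of NET.SAFE_IP_FROM_STRING returning NULL
    if ip.version != 4:
        return None  # plays the role of the BYTE_LENGTH(...) = 4 filter
    ip_int = int(ip)
    for mask in range(9, 33):  # same range as GENERATE_ARRAY(9, 32)
        netmask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        network_bin = (ip_int & netmask).to_bytes(4, "big")
        org = ASN_BLOCKS.get((network_bin, mask))
        if org is not None:
            return org
    return None


def resolve_flow(src, dst, nbytes):
    # The same lookup is applied twice, once per address column.
    return (src, lookup_as(src), dst, lookup_as(dst), nbytes)


print(resolve_flow("1.1.1.1", "8.8.8.8", 1005))
```

The point of the sketch is that source and destination resolution are the same operation applied to two different columns, which is why the query repeats the mask and join CTEs once for srcip and once for dstip.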

Related

BigQuery Code Unexpected results with code formatting sqlfluff

I have a query that compares the data from one table with another (see below). It works and runs fine in BigQuery:
with source1 as (
select
b.id,
b.qty,
a.price
from <table> as a
,unnest <details> as b
where b.status != 'canceled'
),
source2 as (
select id_, qty_, price_ from <table2>
where city != 'delhi'
)
select *
from source1 s1
full outer join source2 s2
on id = id_
where format('%t', s1) != format('%t', s2)
However, the code above fails in sqlfluff, a SQL linter that I can't bypass or turn off. See the error from sqlfluff below:
ERROR FROM SQLFLUFF:
*'s1' found in select with more than one referenced table/view' and 's2' found in select with more than one referenced table/view
Does anybody know how I can fix it?
A window function may perform better.
Instead of joining the tables, they are unioned, and a window function then searches for matching combinations. Since you tagged this question with BigQuery, this was tested in BigQuery:
with s1 as (
select id, qty, city from <table> where x != 'pending'
),
s2 as (
select id_, qty_, city_ from <table2>
),
concat_ as (
select 1 as dummy, * , format('%t',s1) as dummy_all
from s1
union all select 2 as dummy, * , format('%t',s2) as dummy_all
from s2
)
,combine as (
select *,
sum(if(dummy=1,1,0)) over win1 = sum(if(dummy=2,1,0)) over win1 as dummy_flag
from concat_
window win1 as (partition by id,dummy_all)
)
Select * from combine
where dummy_flag is false
I fixed the code by adding extra CTEs and adjusting the WHERE clause, so that it no longer triggers the sqlfluff rule. In the second version (see below) I:
added additional CTEs
adjusted the WHERE clause so that I get all rows where the id exists in one source but not in the other, and vice versa
The code seems to work, but it would be great if someone could suggest how to reduce the number of CTEs while still passing sqlfluff:
with s1 as (
select
b.id,
b.qty,
a.price
from <table> as a
,unnest <details> as b
where b.status != 'canceled'
),
s2 as (
select id_, qty_, price_ from <table2>
where city != 'delhi'
)
,concat_s1 as (
select
*
, format('%t',s1) as l1
from s1
)
,concat_s2 as (
select
*
, format('%t',s2) as l2
from s2
)
, combined as (
select
source1.*
,source2.*
from concat_s1 as source1
full outer join concat_s2 as source2
on source1.id = source2.id_
where source1.l1 != source2.l2
or source1.id is null or source2.id_ is null
)
select * from combined
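The comparison both queries aim for can be sketched outside SQL as follows. This is only an illustration under the assumption that ids are unique within each source; `diff_sources` and the sample data are hypothetical.

```python
def diff_sources(source1, source2):
    """Compare two sources given as dicts mapping id -> tuple of other values.

    Returns (id, left_row, right_row) for every id whose rows differ,
    including ids present on one side only (the other side is None).
    """
    mismatches = []
    for key in sorted(set(source1) | set(source2)):  # full outer join on id
        left = source1.get(key)
        right = source2.get(key)
        if left != right:
            mismatches.append((key, left, right))
    return mismatches


s1 = {1: (5, 10.0), 2: (1, 3.0)}
s2 = {1: (5, 10.0), 2: (2, 3.0), 3: (7, 1.0)}
print(diff_sources(s1, s2))
```

Here id 2 is reported because the quantities differ, and id 3 because it exists only in the second source, which corresponds to the `is null` branches of the WHERE clause.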

Is there a way to convert a non-IPv4 address to a country code in BigQuery?

SELECT
user_id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(context_ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(context_ip))/(256*256)) AS classB
FROM
`product_table` AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
I am trying to convert the IP address to a country, but it does not work, and I get the following error:
NET.IPV4_TO_INT64() encountered a non-IPv4 address. Expected 4 bytes but got 16
The format of my context_ip is like this: '173.170.166.0'
Can anyone help me convert the IP address to a country code?
This is the example from:
https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
),
test as (Select * from unnest(SPLIT("71.203.44.188 24.167.141.7 206.191.39.53 173.170.166.0"," ")) ip )
Select TBLA,country_name, city_name, subdivision_1_name
from (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM
#source_of_ip_addresses
test
, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
) TBLA
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
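The "Expected 4 bytes but got 16" message suggests that at least some rows hold IPv6 addresses, since a parsed IPv4 address is 4 bytes long and a parsed IPv6 address is 16. The Python sketch below illustrates that distinction and the effect of the BYTE_LENGTH(...) = 4 filter used in the query above; the helper names and the sample list are hypothetical.

```python
import ipaddress


def ip_byte_length(ip_string):
    """Length in bytes of the parsed address, or None if it cannot be parsed."""
    try:
        return len(ipaddress.ip_address(ip_string).packed)
    except ValueError:
        return None  # comparable to NET.SAFE_IP_FROM_STRING returning NULL


def ipv4_only(ip_strings):
    # Keep only addresses that parse to 4 bytes, as the WHERE clause does.
    return [s for s in ip_strings if ip_byte_length(s) == 4]


sample = ["173.170.166.0", "2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89", "bogus"]
print([ip_byte_length(s) for s in sample])
print(ipv4_only(sample))
```

Filtering on the byte length before any IPv4-only function is called keeps the IPv6 rows from reaching that function in the first place.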

Modify T-SQL Query to include/exclude columns depending on other columns value

I have the below query, that I need some help modifying. In the below Query I get the number of columns that are not null and the percentage:
SELECT COUNT(v.col) as num_not_null, COUNT(v.col) * 1.0 / COUNT(*) * 100 as percent_not_null, COUNT(*) as toltalColsNeedsFilled
FROM EFP_EmploymentUser t
CROSS APPLY (VALUES (t.ITAdvicedFirst),
(t.ITAdvicedSecond),
(t.ITDepartmentDone),
(t.CFOAdvicedFirst),
(t.CFOInfoProvided),
(t.CFOAdvicedSecond),
(t.CFODone),
(t.EconomyAdviced),
(t.EconomyDone),
(t.AcademyAdviced),
(t.AcademyDone),
(t.PublicatorAdviced),
(t.PublicatorDone),
(t.PortraitAdviced),
(t.PortraitDone),
(t.WhoIsWhoAdviced),
(t.WhoIsWhoDone),
(t.BogportalAdviced),
(t.BogportalDone),
(t.KeyCardAdviced),
(t.KeyCardDone) ) v(col)
WHERE ID = '19';
This returns in the case of ID 19:
num_not_null percent_not_null toltalColsNeedsFilled
5 23.809523809500 21
But I need to check whether the following columns in the same table (Publicator, Bogportal, Academy) contain the value 'yes', and depending on that I need to include or exclude some of the columns from my query above:
i.e.: IF Academy = YES then include t.AcademyAdviced & t.AcademyDone
IF Publicator= YES then include t.PublicatorDone & t.PortraitAdviced
IF Bogportal = YES then include t.BogportalAdviced & t.BogportalDone
Can anyone help me modify the query to achieve this? :-)
Best Regards
Stig
You can use UNION ALL and WHERE predicates to decide which columns to add to the unpivot:
SELECT COUNT(v.col) as num_not_null, COUNT(v.col) * 1.0 / COUNT(*) * 100 as percent_not_null, COUNT(*) as toltalColsNeedsFilled
FROM EFP_EmploymentUser t
CROSS APPLY (
SELECT * FROM
( VALUES (t.ITAdvicedFirst),
(t.ITAdvicedSecond),
(t.ITDepartmentDone),
(t.CFOAdvicedFirst),
(t.CFOInfoProvided),
(t.CFOAdvicedSecond),
(t.CFODone),
(t.EconomyAdviced),
(t.EconomyDone),
(t.PortraitAdviced),
(t.PortraitDone),
(t.WhoIsWhoAdviced),
(t.WhoIsWhoDone),
(t.KeyCardAdviced),
(t.KeyCardDone) ) v(col)
UNION ALL
SELECT *
FROM (VALUES (t.AcademyAdviced), (t.AcademyDone) ) v(col)
WHERE t.Academy = 'YES'
UNION ALL
SELECT *
FROM (VALUES (t.PublicatorDone), (t.PortraitAdviced) ) v(col)
WHERE t.Publicator = 'YES'
UNION ALL
SELECT *
FROM (VALUES (t.BogportalAdviced), (t.BogportalDone ) ) v(col)
WHERE t.Bogportal = 'YES'
) v
WHERE t.ID = '19';
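The effect of the conditional UNION ALL branches can be sketched in Python as below. This is an illustration only: the always-included column list is shortened, the row is made up, and the flag comparison is an exact string match, whereas in SQL Server the comparison depends on the column's collation.

```python
# Shortened for illustration; the query above lists more unconditional columns.
ALWAYS = ["ITAdvicedFirst", "ITAdvicedSecond", "ITDepartmentDone"]

# Flag column -> columns that only count when the flag is 'YES'.
CONDITIONAL = {
    "Academy": ["AcademyAdviced", "AcademyDone"],
    "Publicator": ["PublicatorDone", "PortraitAdviced"],
    "Bogportal": ["BogportalAdviced", "BogportalDone"],
}


def completion(row):
    """Return (num_not_null, total_columns, percent_not_null) for one row."""
    cols = list(ALWAYS)
    for flag, extra in CONDITIONAL.items():
        if row.get(flag) == "YES":
            cols.extend(extra)
    filled = sum(1 for c in cols if row.get(c) is not None)
    return filled, len(cols), filled * 100.0 / len(cols)


row = {"Academy": "YES", "Publicator": "NO", "Bogportal": "NO",
       "ITAdvicedFirst": "2020-01-01", "AcademyDone": "2020-02-01"}
print(completion(row))
```

Because the denominator is the number of included columns, switching a flag to 'YES' changes both the count and the percentage, just as the extra UNION ALL rows change COUNT(*) in the query.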

Extract number or string after string in BigQuery

I have several thousand URLs and want to extract some values from the URL parameters.
Here some examples from the DB:
["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61"]
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1pccst=TEST42rt24&rci=2"]
How can I extract the value of the parameter uci? It is always a number (I don't know the exact length).
I tried it with REGEXP_EXTRACT, but I didn't succeed:
REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract
I also want to extract the value of the parameter pccst. It can contain any character and I don't know the exact length, but it always ends with " or ? or &
I tried this with REGEXP_EXTRACT as well, but didn't succeed:
REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract
I am really not a regex expert.
So it would be great if someone could help me.
Thanks a lot in advance,
Peter
You can adapt this solution:
#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT
id,
query,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT
url,
REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`
The result is:
Row url uci pccst
1 www.xxx.com?pccst=TEST%20sTESTg null TEST%20sTESTg
2 www.xxx.com?pccst=TEST2%20s&uci=1 1 TEST2%20s
3 www.xxx.com?uci=1&pccst=TEST42rt24&rci=2 1 TEST42rt24
4 www.xxx.com?uci=61 61 null
5 www.xxx.com?rci=62&uci=5536 5536 null
6 www.xxx.com?uci=6666&rci=fefw 6666 null
7 www.xxx.com?uci=6666&utm_source=XXX 6666 null
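The same two patterns can be tried out locally with Python's `re` module, which accepts this syntax unchanged. The sketch also shows why the attempt in the question returned the wrong thing: when a pattern contains a capturing group, REGEXP_EXTRACT returns that group, and in `(uci)\=[0-9]+` the group wraps the parameter name rather than the digits. The helper name `extract` is hypothetical.

```python
import re

UCI = re.compile(r"[?&]uci=(.*?)(?:$|&)")
PCCST = re.compile(r"[?&]pccst=(.*?)(?:$|&)")


def extract(pattern, url):
    """Return the first capturing group, or None when there is no match."""
    match = pattern.search(url)
    return match.group(1) if match else None


urls = [
    "www.xxx.com?rci=62&uci=5536",
    "www.xxx.com?pccst=TEST2%20s&uci=1",
    "www.xxx.com?pccst=TEST%20sTESTg",
]
for url in urls:
    print(url, extract(UCI, url), extract(PCCST, url))

# The pattern from the question captures the literal parameter name instead.
print(re.search(r"(uci)=[0-9]+", "www.xxx.com?uci=6666").group(1))
```

Moving the parentheses so that they surround the value, and anchoring on `?` or `&` before the name, is what makes the answer's patterns return the value itself.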
Also, below is an option that parses out all key-value pairs, so that you can then dynamically select the ones you need:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1pccst=TEST42rt24&rci=2"
)
SELECT url,
ARRAY(
SELECT AS STRUCT
SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
SPLIT(kv, '=')[SAFE_OFFSET(1)] value
FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
) key_value_pair
FROM `project.dataset.table`
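A rough Python equivalent of the key-value split is sketched below. It assumes, as the sample URLs do, that everything after the first `?` is the query string; the SQL version instead skips past the host returned by NET.HOST, so the two can differ for URLs that contain a path. The function name `key_value_pairs` is hypothetical.

```python
def key_value_pairs(url):
    """Split the query string into (key, value) tuples; value may be None."""
    _, _, query = url.partition("?")
    pairs = []
    for kv in query.split("&"):
        parts = kv.split("=")
        key = parts[0]
        # Like SAFE_OFFSET(1): take the second piece if there is one.
        value = parts[1] if len(parts) > 1 else None
        pairs.append((key, value))
    return pairs


print(key_value_pairs("www.xxx.com?uci=6666&rci=fefw"))
print(key_value_pairs("www.xxx.com?uci=1pccst=TEST42rt24&rci=2"))
```

The second call shows what happens with the malformed sample URL: without an `&` between the parameters, `pccst` ends up inside the value of `uci`, and the same is true of the SQL version.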

How to transform IP addresses into geolocation in BigQuery standard SQL?

So I have read https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html
But I was wondering if there is a #standardSQL way of doing it. So far, I have had a lot of trouble converting PARSE_IP and NTH(), since the changes suggested in the migration docs have limitations.
Going from PARSE_IP(contributor_ip) to NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) does not work for IPv6 IP addresses.
Going from NTH(1, latitude) lat to latitude[SAFE_ORDINAL(1)] does not work since latitude is considered a string.
And there might be more migration problems that I have yet to encounter. Does anyone know how to transform IP addresses into geolocation in BigQuery standard SQL?
P.S. How would I go from geolocation to determining timezone?
edit: So what is the difference between this
#legacySQL
SELECT
COUNT(*) c,
city,
countryLabel,
NTH(1, latitude) lat,
NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN EACH
[fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
and
SELECT
COUNT(*) c,
city,
countryLabel,
ANY_VALUE(latitude) lat,
ANY_VALUE(longitude) lng
FROM (
SELECT
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
ELSE NULL
END AS clientIpNum,
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64)
ELSE NULL
END AS classB
FROM
`publicdata.samples.wikipedia`
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
edit2: It seems the problem was that I was not casting a float correctly. Right now, the standard SQL query returns 41815 rows instead of the 56347 rows from the legacy SQL query, which may be due to the lack of IPv6-to-int conversion in standard SQL, but it might be due to something else. Also, the legacy SQL query performs much better, running in about 10 seconds instead of the full minute taken by the standard SQL query.
According to https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7:
#standardSQL
SELECT
id,
IFNULL(city, 'Other') AS city,
IFNULL(countryLabel, 'Other') AS countryLabel,
latitude,
longitude
FROM (
SELECT
id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
FROM
`<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
id ASC
The answer to this question is not valid for IPv6 addresses.
Following the approach described at https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2, I came up with this solution:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace test_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
To get the geoip_city_v4 and geoip_city_v6 tables, you need to download the GeoIP database from https://maxmind.com/.
You can follow this tutorial to update and prepare your dataset: hodo.dev/posts/post-37-gcp-bigquery-geoip.