How to transform IP addresses into geolocation in BigQuery standard SQL? - google-bigquery

So I have read https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html
But I was wondering if there was a #standardSQL way of doing it. So far, I have a lot of challenge converting PARSE_IP and NTH() since the suggested changes in the migration docs have limitations.
Going from PARSE_IP(contributor_ip) to NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) does not work for IPv6 IP addresses.
Going from NTH(1, latitude) lat to latitude[SAFE_ORDINAL(1)] does not work since latitude is considered a string.
And there might be more migration problems that I have yet to encounter. Does anyone know how to transform IP addresses into geolocation in BigQuery standard SQL?
P.S. How would I go from geolocation to determining timezone?
edit: So what is the difference between this
#legacySQL
SELECT
COUNT(*) c,
city,
countryLabel,
NTH(1, latitude) lat,
NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN EACH
[fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
and
SELECT
COUNT(*) c,
city,
countryLabel,
ANY_VALUE(latitude) lat,
ANY_VALUE(longitude) lng
FROM (
SELECT
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
ELSE NULL
END AS clientIpNum,
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64)
ELSE NULL
END AS classB
FROM
`publicdata.samples.wikipedia`
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
edit2: Seems like I manage to figure out the problem via not casting a float correctly. Right now, the standard SQL returns 41815 rows instead the 56347 rows from the legacy SQL which may be due to the lack of conversion from IPv6 to int for standard SQL, but it might be due to something else. Also the legacy SQL query performs much better, running at about 10 seconds instead of the full minute from the standard SQL.

According to https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7
#standardSQL
SELECT
id,
IFNULL(city, 'Other') AS city,
IFNULL(countryLabel, 'Other') AS countryLabel,
latitude,
longitude
FROM (
SELECT
id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
FROM
`<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
id ASC

The answer to this question is not valid for ipv6 addresses.
Following the approach described here https://medium.com/#hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 I came up with this solution:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
In order to get the geoip_city_v4 and geoip_city_v6 you need to download the geoip database from https://maxmind.com/
You can follow this tutorial in to update and prepare you dataset hodo.dev/posts/post-37-gcp-bigquery-geoip.

Related

is there a way to convert non-IPv4 address to country code in bigquery?

SELECT
user_id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(context_ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(context_ip))/(256*256)) AS classB
FROM
`product_table` AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
so i am trying to get the transfer the ip address to country, but seems not working , and i have the following error code
NET.IPV4_TO_INT64() encountered a non-IPv4 address. Expected 4 bytes but got 16
and the format of my context_ip is like this '173.170.166.0'
any one can help to convert the ip address to country code?
This is the example from:
https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
),
test as (Select * from unnest(SPLIT("71.203.44.188 24.167.141.7 206.191.39.53 173.170.166.0"," ")) ip )
Select TBLA,country_name, city_name, subdivision_1_name
from (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM
#source_of_ip_addresses
test
, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
) TBLA
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)

Bigquery multiple join using clause

I need to get the BGP AS details for the IP addresses in a table, the table contains SrcAddr and DstAddr as mentioned in table1
table1
SrcAddr
DstAddr
Bytes
1.1.1.1
8.8.8.8
1005
Table2 contains the BGP as number details.
Table2
IPaddr
Organization
network_bin
mask
1.1.1.0/24
Cloudflare
asdjqowiq
24
8.8.8.0/24
Google
asdqwrqsd
24
I want to build a final table like below
Table3
SrcAddr
SrcAS
DstAddr
Dst AS
Bytes
1.1.1.1
Cloudflare
8.8.8.8
Google
1005
I used the below query by referring to the doc https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html and was able to get the src_as field but was not able to resolve the dst_as. can someone help me with this?
WITH source_of_ip_addresses AS (
SELECT SamplerAddress, REGEXP_REPLACE(SrcAddr, 'xxx', '0') srcip, REGEXP_REPLACE(DstAddr, 'xxx', '0') dstip
FROM `fluentd.netflow_message`
WHERE SrcAddr IS NOT null
GROUP BY 1,2,3
)
SELECT *, srcip, src_as,
FROM (
SELECT srcip, network_bin, mask, autonomous_system_organization as src_as
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(source_of_ip_addresses.srcip) & NET.IP_NET_MASK(4, mask) network_bin ,
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(srcip)) = 4
)
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
just repeat the same process. Also it is more convenient to use WITH clause instead of nested queries to make it simpler to repeat this code. Something like below. I obviously don't have access to your tables, so cannot check syntax, there will likely be duplicate columns you'll need to remove by using explicit column names rather than *.
WITH source_of_ip_addresses AS (
SELECT SamplerAddress, REGEXP_REPLACE(SrcAddr, 'xxx', '0') srcip, REGEXP_REPLACE(DstAddr, 'xxx', '0') dstip
FROM `fluentd.netflow_message`
WHERE SrcAddr IS NOT null
GROUP BY 1,2,3
), source_with_masks AS (
SELECT *, NET.SAFE_IP_FROM_STRING(source_of_ip_addresses.srcip) & NET.IP_NET_MASK(4, mask) network_bin ,
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(srcip)) = 4
), source_processed AS (
SELECT *
FROM source_with_masks
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
), dest_with_masks AS (
-- same as above, with dstip instead of srcip
SELECT *, NET.SAFE_IP_FROM_STRING(source_of_ip_addresses.dstip) & NET.IP_NET_MASK(4, mask) network_bin ,
FROM source_processed, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(srcip)) = 4
), dest_processed AS (
SELECT *
FROM dest_with_masks
JOIN `fluentd.asn_block_processed` USING (network_bin, mask)
)
SELECT * from dest_processed

Geography function over a column

I am trying to use the st_makeline() function in order to create lines for every points and the next one in a single column.
Do I need to create another column with the 2 points already ?
with t1 as(
SELECT *, ST_GEOGPOINT(cast(long as float64) , cast(lat as float64)) geometry FROM `my_table.faissal.trajets_flix`
where id = 1
order by index_loc
)
select index_loc geometry
from t1
Here are the results
Thanks for your help
You seems to want to write this code:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_makeline
WITH t1 as (
SELECT *, ST_GEOGPOINT(cast(long as float64), cast(lat as float64)) geometry
FROM `my_table.faissal.trajets_flix`
-- WHERE id = 1
)
SELECT id, ST_MAKELINE(ARRAY_AGG(geometry ORDER BY index_loc)) traj
FROM t1
GROUP BY id;
with output:
When visualized on the map.
Consider also below simple and cheap option
select st_geogfromtext(format('linestring(%s)',
string_agg(long || ' ' || lat order by index_loc))
) as path
from `my_table.faissal.trajets_flix`
where id = 1
if applied to sample data in your question - output is
which is visualized as

ORACLE SQL Pivot Issue

I am trying to pivot a sql result. I need to do this all in the one query. The below is telling me invalid identifier for header_id. I am using an Oracle database.
Code
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr
from item_history ih, item ci, contract ppd,
header ch, group g, cd_std_type ct, cd_hos h,
cd_std_hospital_cat ht
where ih.item_id = ci.item_id
and ih.header_id = ch.header_id
and ci.hos_id = h.hos_id
and ih.item_id = ci.item_id
and ch.user_no = ppd.user_no
and ppd.group_id = g.group_id
and ch.header_type = ct.header_type_id
and ci.hos_id = h.hos_id
and h.cat_id = ht.cat_id
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
Your inner query doesn't have header_id in its projection, so the pivot clause doesn't have that column available to use. You need to add it, either as:
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr,ih.header_id
---------------------------------------------------------------^^^^^^^^^^^^^
from ...
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
or:
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr,ch.header_id
---------------------------------------------------------------^^^^^^^^^^^^^
from ...
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
It doesn't really matter which, since those two values must be equal as they are part of a join condition.
You could achieve the same thing with simpler aggregation instead of a pivot, but presumably you are doing more work in the pivot really.

How to create an aggregate function for median?

I need to create an aggregate function in Advantage-Database to calculate the median value.
SELECT
group_field
, MEDIAN(value_field)
FROM
table_name
GROUP BY
group_field
Seems the solutions I am finding are quite specific to the sql engine used.
There is no built-in median aggregate function in ADS as you can see in the help file:
http://devzone.advantagedatabase.com/dz/webhelp/Advantage10.1/index.html
I'm afraid that you have to write your own stored procedure or sql script to solve this problem.
The accepted answer to the following question might be a solution for you:
Simple way to calculate median with MySQL
I've updated this answer with a solution that avoids the join in favor of storing some data in a json object.
SOLUTION #1 (two selects and a join, one to get counts, one to get rankings)
This is a little lengthy, but it does work, and it's reasonably fast.
SELECT x.group_field,
avg(
if(
x.rank - y.vol/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field,
#r:= IF(#current=group_field, #r+1, 1) as rank,
#current:=group_field
FROM (
SELECT group_field, value_field
FROM table_name
ORDER BY group_field, value_field
) z, (SELECT #r:=0, #current:='') v
) x, (
SELECT group_field, count(*) as vol
FROM table_name
GROUP BY group_field
) y WHERE x.group_field = y.group_field
GROUP BY x.group_field
SOLUTION #2 (uses a json object to store the counts and avoids the join)
SELECT group_field,
avg(
if(
rank - json_extract(#vols, path)/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field, path,
#rnk := if(#curr = group_field, #rnk+1, 1) as rank,
#vols := json_set(
#vols,
path,
coalesce(json_extract(#vols, path), 0) + 1
) as vols,
#curr := group_field
FROM (
SELECT p.group_field, p.value_field, concat('$.', p.group_field) as path
FROM table_name
JOIN (SELECT #curr:='', #rnk:=1, #vols:=json_object()) v
ORDER BY group_field, value_field DESC
) z
) y GROUP BY group_field;