Can I find distances between arrays in PostgreSQL using <->? - sql

As far as I understood from this article, you can find nearest neighbors using the <-> distance operator when working with geometric data types:
SELECT name, location --location is point
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;
You can also get some speed-up using an SP-GiST index:
CREATE INDEX idx_spgist_geonames_location ON geonames USING spgist(location);
But I can't find anything in the documentation about using the <-> operator with arrays. If I were to run the same queries using double precision[] instead of point, for example, would that work?

Apparently, we can't. For example, I've got a simple table:
CREATE TABLE test (
    id SERIAL PRIMARY KEY,
    loc double precision[]
);
And I want to query rows from it, ordered by distance:
SELECT loc FROM test ORDER BY loc <-> ARRAY[0, 0, 0, 0]::double precision[];
It doesn't work:
Query Error: error: operator does not exist: double precision[] <-> double precision[]
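A quick way to confirm which type pairs actually have a <-> operator is to ask the system catalog directly (a small sketch; double precision[] does not show up in the result):
-- list the operand types for which a <-> operator is defined
SELECT oprleft::regtype  AS left_type,
       oprright::regtype AS right_type
FROM pg_operator
WHERE oprname = '<->';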
The documentation doesn't mention <-> for arrays either. I found a workaround in the accepted answer to this question, but it imposes some limitations, especially on array length. There is, however, an article (written in Russian) that suggests a workaround for the array size limitation. Creation of the sample table:
import postgresql

def setup_db():
    db = postgresql.open('pq://user:pass@localhost:5434/db')
    db.execute("create extension if not exists cube;")
    db.execute("drop table if exists vectors")
    db.execute("create table vectors (id serial, file varchar, vec_low cube, vec_high cube);")
    db.execute("create index vectors_vec_idx on vectors (vec_low, vec_high);")
Element insertion:
query = "INSERT INTO vectors (file, vec_low, vec_high) VALUES ('{}', CUBE(array[{}]), CUBE(array[{}]))".format(
    file_name,
    ','.join(str(s) for s in encodings[0][0:64]),
    ','.join(str(s) for s in encodings[0][64:128]),
)
db.execute(query)
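For clarity, the SQL produced by this string formatting looks roughly like the following (a toy 4-dimensional example with a made-up file name; the real code inserts two 64-dimensional halves of a 128-dimensional encoding):
INSERT INTO vectors (file, vec_low, vec_high)
VALUES ('face_001.jpg',
        CUBE(array[0.12, 0.03, 0.88, 0.45]),
        CUBE(array[0.91, 0.27, 0.66, 0.10]));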
Element querying:
import random
import time

import postgresql

db = postgresql.open('pq://user:pass@localhost:5434/db')
for i in range(100):
    t = time.time()
    encodings = [random.random() for _ in range(128)]
    threshold = 0.6
    query = "SELECT file FROM vectors WHERE sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) <= {} ".format(
        ','.join(str(s) for s in encodings[0:64]),
        ','.join(str(s) for s in encodings[64:128]),
        threshold,
    ) + "ORDER BY sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) ASC LIMIT 1".format(
        ','.join(str(s) for s in encodings[0:64]),
        ','.join(str(s) for s in encodings[64:128]),
    )
    print(db.query(query))
    print('query time', time.time() - t, 'ind', i)
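Note that the index created in setup_db() is a plain btree on (vec_low, vec_high) and won't accelerate <-> ordering. The cube extension does ship a GiST operator class that supports nearest-neighbour ORDER BY on a single cube column, so, as a hedged sketch (not part of the original workaround), it may be worth trying:
-- GiST lets ORDER BY vec_low <-> CUBE(...) use the index; the combined
-- sqrt(power(...) + power(...)) expression above is still evaluated per row.
CREATE INDEX vectors_vec_low_gist ON vectors USING gist (vec_low);
CREATE INDEX vectors_vec_high_gist ON vectors USING gist (vec_high);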

Related

Right way to implement pandas.read_sql with ClickHouse

I'm trying to use the pandas.read_sql function.
I created a ClickHouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then the Python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:@localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As a result, I expected to get a dataframe with the columns date and region, but all I get is an empty dataframe:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD: It turned out that specifying clickhouse+native solves the problem.
Can it be solved without +native?
There is an old issue: https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. There is also a hint there that suggests adding FORMAT TabSeparatedWithNamesAndTypes at the end of the query. So the initial query would look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes

ST_GeogFromGeoJSON fails in bigquery while successful in postgres

We have GeoJSON polygons we would like to convert to a geo object in BigQuery using ST_GeogFromGeoJSON. The conversion fails in BigQuery while it succeeds in Postgres using the equivalent command ST_GeomFromGeoJSON.
I am familiar with the SAFE prefix that can be added to the BigQuery call, but we would like to use the object and not just ignore it in case the conversion fails. I tried converting the object using ST_CONVEXHULL but wasn't able to make it work.
Is there some workaround in BigQuery?
Example:
Running the following command in bigquery
select ST_GeogFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
returns
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
While it runs successfully in Postgres:
select ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
October 2020 update for this post
No tricks are needed anymore - the ST_GEOGFROMGEOJSON and ST_GEOGFROMTEXT geography functions now support a new make_valid parameter. If set to TRUE, the function attempts to correct polygon issues when importing geography data.
So the simple statement below now works perfectly ...
select ST_GeogFromGeoJSON(
'{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}'
, make_valid => true
)
and returns the expected output.
Below is for BigQuery Standard SQL
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
... Is there some work around in bigquery? ...
The proposed workaround is an admittedly naive and simple way of fixing this specific issue, though it can be extended to more generic cases. The idea here is to extract the coordinates and reorder them to eliminate the problem ...
WITH test AS (
SELECT '{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}' AS geojson
)
SELECT ST_GEOGFROMGEOJSON('{"type":"Polygon","coordinates":' || fixed_coordinates || '}') AS geo
FROM (
SELECT '[[[' || STRING_AGG(lat_lon, '],[') || '],[' || ANY_VALUE(ordered_coordinates[OFFSET(0)]) || ']]]' fixed_coordinates
FROM (
SELECT
ARRAY( SELECT lon_lat
FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat
ORDER BY CAST( SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64), CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64)
) ordered_coordinates
FROM test
) t, t.ordered_coordinates lat_lon
)
This produces the correct output:
POLYGON((-82.022982 26.69785, -81.999574 26.109253, -81.8073135 26.1074055, -81.615053 26.105558, -81.606813 26.710698, -81.8148975 26.704274, -82.022982 26.69785))
Below is for BigQuery Standard SQL
My previous answer is based on oversimplified logic for re-ordering the coordinates. Obviously it will not work in more complex cases like the one below:
{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}
Is there some more advanced sorting logic that can be applied?
The more complex logic below can be used to address this:
#standardSQL
WITH test AS (
SELECT '{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}' geojson
), coordinates AS (
SELECT CAST(SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64) lon, CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64) lat
FROM test, UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat), stats AS (
SELECT ST_CENTROID(ST_UNION_AGG(ST_GEOGPOINT(lon, lat))) centroid FROM coordinates
)
SELECT ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_AGG(point ORDER BY sequence))) AS polygon
FROM (
SELECT point,
CASE
WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) > ST_Y(centroid) THEN 3.14 - angle
WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 3.14 + angle
WHEN ST_X(point) < ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 6.28 - angle
ELSE angle
END sequence
FROM (
SELECT point, centroid,
ACOS(ST_DISTANCE(centroid, anchor) / ST_DISTANCE(centroid, point)) angle
FROM (
SELECT centroid,
ST_GEOGPOINT(lon, lat) point,
ST_GEOGPOINT(lon, ST_Y(centroid)) anchor
FROM coordinates, stats
)
)
)
This approach produces the correct output:
POLYGON((-0.49075 51.46989, -0.48664 51.46987, -0.48664 51.47341, -0.48923 51.47336, -0.48921 51.4737, -0.49072 51.47462, -0.49114 51.47446, -0.49044 51.4737, -0.4907 51.4737, -0.49075 51.46989))

How to get From & To Ip Address from CIDR BigQuery

BigQuery provides an updated geoip2 public dataset here [bigquery-public-data -> geolite2 -> ipv4_city_blocks], which contains a network column with IPv4 CIDR values.
How do I convert the CIDR values in the network column via BigQuery SQL (and not via a utility outside BigQuery) into start & end IP address values, so that I can find whether an IP address is within a range or not? It would be helpful if you could provide the query to obtain the range IPs for a CIDR value in the table.
Below is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION cidrToRange(CIDR STRING)
RETURNS STRUCT<start_IP STRING, end_IP STRING>
LANGUAGE js AS """
  // Network address: everything before the '/'
  var beg = CIDR.substr(0, CIDR.indexOf('/'));
  // Number of host addresses in the block, minus one
  var off = (1 << (32 - parseInt(CIDR.substr(CIDR.indexOf('/') + 1)))) - 1;
  var sub = beg.split('.').map(function(a){ return parseInt(a); });
  // Pack the four octets into a 32-bit word, add the offset, then unpack
  var buf = new ArrayBuffer(4);
  var i32 = new Uint32Array(buf);
  i32[0] = (sub[0] << 24) + (sub[1] << 16) + (sub[2] << 8) + sub[3] + off;
  var end = Array.apply([], new Uint8Array(buf)).reverse().join('.');
  return {start_IP: beg, end_IP: end};
""";
SELECT network, IP_range.*
FROM `bigquery-public-data.geolite2.ipv4_city_blocks`,
UNNEST([cidrToRange(network)]) IP_range
It took about 60 seconds to process all 3,037,858 rows.
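If you prefer to stay in pure SQL, the same start/end addresses can be derived with BigQuery's NET functions instead of a JS UDF. A hedged sketch, assuming the table contains only IPv4 networks:
SELECT
  network,
  NET.IP_TO_STRING(addr & mask)  AS start_IP,
  NET.IP_TO_STRING(addr | ~mask) AS end_IP
FROM (
  SELECT
    network,
    NET.IP_FROM_STRING(SPLIT(network, '/')[OFFSET(0)]) AS addr,
    NET.IP_NET_MASK(4, CAST(SPLIT(network, '/')[OFFSET(1)] AS INT64)) AS mask
  FROM `bigquery-public-data.geolite2.ipv4_city_blocks`
)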
This query will do the job:
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
)
SELECT city_name, SUM(c) c, ST_GeogPoint(AVG(longitude), AVG(latitude)) point
FROM (
SELECT ip, city_name, c, latitude, longitude, geoname_id
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
)
WHERE city_name IS NOT null
GROUP BY city_name, geoname_id
ORDER BY c DESC
LIMIT 5000
Find more details on:
https://towardsdatascience.com/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
The first thing you need to check is whether that function already exists, so please refer to the BigQuery Functions and Operators documentation.
If not, you need to use Standard SQL User-Defined Functions (UDFs), which let you create a function using another SQL expression or another programming language, such as JavaScript.
Keep in mind that when using a JavaScript UDF, BigQuery initializes a JavaScript environment with the function's contents on every shard of execution. There is no optimization to avoid loading the environment, so it can slow down the query.
Regarding the GeoIP2 City and Country CSV Databases site, there is a utility to convert the 'network' column to start/end IPs or start/end integers. Refer to the GitHub site for details.
January 2023 solution
Just wanted to respond to Felipe's comment here. I'm not sure why he is suggesting an alternate solution using Snowflake, as his existing solution works just fine. The only difference is that you need to create the dataset yourself.
I managed to solve this by going through the exact same steps listed in Felipe's very helpful original blog article:
Sign up to MaxMind and download the GeoLite2 databases (link)
Download the two CSV files GeoLite2-City-Blocks-IPv4.csv and GeoLite2-City-Locations-en.csv, upload them to a GCP bucket, and create tables from them. I lazily used the BQ automated schema feature and it worked just fine :)
Simply create a geolite2_locs table using a query similar to the one below (just keep or drop your columns as required for your use-case)
CREATE OR REPLACE TABLE `dataset.geolite2_locs` OPTIONS() AS (
SELECT
ip_ref.network,
NET.IP_FROM_STRING(REGEXP_EXTRACT(ip_ref.network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(ip_ref.network, r'/(.*)' ) AS INT64) mask,
ip_ref.geoname_id,
city_ref.continent_name as continent_name,
city_ref.country_name as country_name,
city_ref.city_name as city_name,
city_ref.subdivision_1_name as subdivision_1_name,
city_ref.subdivision_2_name as subdivision_2_name,
ip_ref.latitude as latitude,
ip_ref.longitude as longitude
FROM `geolite2`.`geolite2-ipv4` ip_ref LEFT JOIN `geolite2`.`geolite2-city-en` city_ref USING (geoname_id)
);
Adapt the query in Felipe's guide, or just replace fh-bigquery.geocode.201806_geolite2_city_ipv4_locs with your new table in his answer above.
It should take you at most an hour to get this going. Hope it helps.

Factorial in Google BigQuery

I need to compute the factorial of a variable in Google BigQuery - is there a function for this? I cannot find one in the documentation here:
https://cloud.google.com/bigquery/query-reference#arithmeticoperators
My proposed solution at this point is to compute the factorial for numbers 1 through 100, upload that as a table, and join against it. If you have something better, please advise.
As context may reveal a better solution: the factorial is used for computing the Poisson probability of a random variable (the number of events in a window of time). See the first equation here: https://en.wikipedia.org/wiki/Poisson_distribution
Try below. A quick & dirty example:
select number, factorial
FROM js(
// input table
(select number from
(select 4 as number),
(select 6 as number),
(select 12 as number)
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'integer'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
var factorial = fact(r.number)
emit({number: r.number, factorial: factorial});
}"
)
If the direct approach works for the values you need the Poisson distribution calculated on, then cool. If you reach the point where it blows up or gives you inaccurate results, then read on for numerical analysis fun times.
In general you'll get better range and numerical stability if you do the arithmetic on the logarithms, and then exp() as the final operation.
You want: c^k / k! * exp(-c).
Compute its log, ln( c^k / k! * exp(-c) ),
i.e. k*ln(c) - ln(k!) - c.
Take exp() of that.
But how do we get ln(k!) without computing k! itself? There's a function called the gamma function, whose practical point here is that its logarithm gammaln() can be approximated directly, and ln(k!) = gammaln(k+1).
There is a Javascript gammaln() in Phil Mainwaring's answer here, which I have not tested, but assuming it works it should fit into a UDF for you.
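If a ready-made gammaln() is not at hand, ln(k!) can also be accumulated directly as a sum of logarithms. A hedged sketch in BigQuery Standard SQL (c = 3.5 and k = 7 are just example values):
-- P(X = k) for a Poisson with mean c, computed entirely in log space:
-- ln(k!) = SUM(ln(i)) for i = 1..k, so k! itself is never materialized.
WITH params AS (SELECT 3.5 AS c, 7 AS k)
SELECT EXP(k * LN(c)
           - (SELECT IFNULL(SUM(LN(i)), 0) FROM UNNEST(GENERATE_ARRAY(1, k)) AS i)
           - c) AS poisson_probability
FROM params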
Extending Mikhail's answer to be general and correct for computing the factorial of all numbers 1 to n, where n < 500, the following solution holds and can be computed efficiently:
select number, factorial
FROM js(
// input table
(
SELECT
ROW_NUMBER() OVER() AS number,
some_thing_from_the_table
FROM
[any table with at least LIMIT many entries]
LIMIT
100 #Change this to any number to compute factorials from 1 to this number
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'float'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
// Use toExponential and parseFloat to handle large integers in both JavaScript and BigQuery
emit({number: r.number, factorial: parseFloat(fact(r.number).toExponential())});
}"
)
You can get up to 27! using a SQL UDF. Above that value the NUMERIC type gets an overflow error.
CREATE OR REPLACE FUNCTION factorial(integer_expr INT64) AS ( (
SELECT
ARRAY<numeric>[
1,
1,
2,
6,
24,
120,
720,
5040,
40320,
362880,
3628800,
39916800,
479001600,
6227020800,
87178291200,
1307674368000,
20922789888000,
355687428096000,
6402373705728000,
121645100408832000,
2432902008176640000,
51090942171709440000.,
1124000727777607680000.,
25852016738884976640000.,
620448401733239439360000.,
15511210043330985984000000.,
403291461126605635584000000.,
10888869450418352160768000000.][
OFFSET
(integer_expr)] AS val ) );
select factorial(10);
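If approximate FLOAT64 values are acceptable beyond 27!, a hedged alternative (not part of the answer above) is to exponentiate a running sum of logarithms instead of hard-coding the values:
-- approximate n! as FLOAT64 via a cumulative sum of logs;
-- this only overflows to +inf past 170!
SELECT n, EXP(SUM(LN(n)) OVER (ORDER BY n)) AS factorial_approx
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS n
ORDER BY n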

Neo4j automatic (numeric) indexing

If I enable automatic indexing on, for example, the latitude and longitude fields (type: double), I can't run this query:
autoIndex.query(
    QueryContext.numericRange(
        "longitude",
        16.598145,
        46.377254
    )
);
autoIndex is defined as graphDb.index().getNodeAutoIndexer().getAutoIndex().
The result is simply empty (size=0).
Other queries like
autoIndex.get(...)
or
autoIndex.query("id", new QueryContext("*").sort(sort))
work just fine.
If, however, I manually add the index locations_index with
ValueContext latValue = new ValueContext(node.getProperty("latitude")).indexNumeric();
ValueContext lonValue = new ValueContext(node.getProperty("longitude")).indexNumeric();
locationsIndex.add(node, "latitude", latValue);
locationsIndex.add(node, "longitude", lonValue);
for each latitude/longitude in the database, then the query works.
My question is: is there a way to automatically index a numeric field as ValueContext.indexNumeric()? Shouldn't that be automatic?