Computing pairwise distances in Hive

I have a dataset in Hive that looks like this:
Point Latitude Longitude
A 40.3 74.8
B 12.5 -45.1
C -32.7 87.6
D 23.9 -67.2
... ... ...
How can I obtain a matrix with the distance of each point from all the other points? That is, the distances AB, AC, AD, BC, BD, CD and so forth. If it is easier to have the output in a linear format, that is fine as well. I want to be able to do this all using Hive Query Language.
Edit: The data contains hundreds of thousands of rows. In the end I want to be able to identify all points within a certain radius of a given point. So if there is a way to reduce the number of calculations by first filtering out points or using some approximation, I am open to that as well.

One possible solution is to join the table to itself without any join condition (a cross join). The output would look something like this:
query1 query1 query1 query2 query2 query2
Point Latitude Longitude Point Latitude Longitude
A 40.3 74.8 A 40.3 74.8
A 40.3 74.8 B 12.5 -45.1
A 40.3 74.8 C -32.7 87.6
A 40.3 74.8 D 23.9 -67.2
...
Use the above output as a subquery and compute the distances between the points. A concat of query1.Point and query2.Point gives you the pair label, and a distance function applied to the two latitudes and longitudes gives you the distance between them. To keep each pair only once and drop self-pairs such as AA, filter on query1.Point < query2.Point.
Hope this helps.
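The self-join logic above can be sketched outside Hive as well. The following Python snippet is only an illustration of the idea, not HiveQL: it pairs every point with every other point once and applies the haversine formula. The names `haversine_km`, `pairwise_distances` and `points_within_radius`, and the Earth radius of 6371 km, are assumptions of this sketch; the thread does not say which distance function to use. The radius search also shows one possible cheap prefilter on latitude, which is safe because the great-circle distance is never smaller than the north-south separation.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the thread


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def pairwise_distances(points):
    """Map pair labels such as 'AB' to distances; each unordered pair appears once."""
    return {
        p[0] + q[0]: haversine_km(p[1], p[2], q[1], q[2])
        for p, q in combinations(points, 2)
    }


def points_within_radius(points, center, radius_km):
    """Names of all points within radius_km of center, using a latitude prefilter."""
    max_dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    hits = []
    for name, lat, lon in points:
        if name == center[0]:
            continue  # skip the center itself
        if abs(lat - center[1]) > max_dlat:
            continue  # too far north or south; no trigonometry needed
        if haversine_km(center[1], center[2], lat, lon) <= radius_km:
            hits.append(name)
    return hits


# Sample rows from the question: (Point, Latitude, Longitude)
points = [("A", 40.3, 74.8), ("B", 12.5, -45.1), ("C", -32.7, 87.6), ("D", 23.9, -67.2)]
distances = pairwise_distances(points)
for pair, km in sorted(distances.items()):
    print(pair, round(km, 1))
```

In Hive the same arithmetic can be written with the built-in math functions (radians, sin, cos, asin, sqrt) inside the outer query over the self-join.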

Related

How do I calculate the centroid point of multiple longitude and latitude data points in BigQuery?

I have a dataset with various longitude and latitude datapoints. I would like to develop a centroid or "average" of those longitude and latitude coordinates in bigquery at a specific level of granularity.
Example of Current Data:
ID LONG LAT
101 -71.23403 42.01979
101 -91.469621 44.867211
102 78.8952716 38.4022661
102 80.8518668 35.3152386
Desired Output (output centroid is made up)
ID CENTROID_LONG CENTROID_LAT
101 -71.23403 42.01979
102 -91.469621 44.867211
Where the values above are aggregated to a centroid lat and long number.
Consider below
select id, st_union_agg(st_geogpoint(long, lat)) points,
st_centroid(st_union_agg(st_geogpoint(long, lat))) centroid
from your_table
group by id
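To make the idea of an "average" of longitude/latitude pairs concrete, here is a small Python sketch of one common definition: convert each point to a 3D unit vector, average the vectors per ID, and convert the mean back to degrees. This is an illustration under that assumption; it is not claimed to reproduce the exact result of BigQuery's st_centroid, and the names `spherical_centroid` and `centroids_by_id` are hypothetical.

```python
import math
from collections import defaultdict


def spherical_centroid(points):
    """Centroid of (lon, lat) points in degrees via the mean of 3D unit vectors."""
    x = y = z = 0.0
    n = 0
    for lon, lat in points:
        lam, phi = math.radians(lon), math.radians(lat)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
        n += 1
    if n == 0:
        raise ValueError("at least one point is required")
    x, y, z = x / n, y / n, z / n
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lon, lat


def centroids_by_id(rows):
    """Group (id, lon, lat) rows by id and compute one centroid per group."""
    groups = defaultdict(list)
    for row_id, lon, lat in rows:
        groups[row_id].append((lon, lat))
    return {row_id: spherical_centroid(pts) for row_id, pts in groups.items()}


# Sample rows from the question: (ID, LONG, LAT)
rows = [
    (101, -71.23403, 42.01979),
    (101, -91.469621, 44.867211),
    (102, 78.8952716, 38.4022661),
    (102, 80.8518668, 35.3152386),
]
centroids = centroids_by_id(rows)
for row_id, (lon, lat) in sorted(centroids.items()):
    print(row_id, lon, lat)
```

A plain arithmetic mean of the raw longitudes and latitudes is simpler, but it breaks down for points that straddle the antimeridian, which is why this sketch averages in 3D instead.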

moving from tabular to graph representation of a given data

Suppose that I have the following data t:
activity teacher group students duration subject
One A a 3 45 Math
One B b 2 45 Math
two A c 7 60 P.E
One D a 3 45 Math
two C c 7 60 P.E
I want to construct graph data from this tabular data. I am actually interested in predicting the teacher by applying some kind of graph ML. Is there a way to transform the tabular data into graph data, perhaps using NetworkX?
I tried the following code
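import matplotlib.pyplot as plt
import networkx as nx
# df is the pandas DataFrame holding the table above
G = nx.from_pandas_edgelist(df, "subject", "teacher", edge_attr=True, create_using=nx.Graph())
nx.draw_networkx(G)
plt.show()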
The output of this looks like a graph, but I don't understand how it works, how I can get the new data, or what the best way is to identify the nodes and the edges.
Thank you in advance for any help.
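As a rough illustration of what an edge-list conversion like the one above does, the plain-Python sketch below treats every row as one undirected edge between the value in the source column and the value in the target column, and keeps the remaining columns as edge attributes. The function name `build_graph` is hypothetical, and the sketch only mimics the behaviour of `nx.from_pandas_edgelist` with `edge_attr=True` on an undirected `nx.Graph`; it does not call NetworkX.

```python
def build_graph(rows, source, target):
    """Return (nodes, edges); edges maps frozenset({u, v}) to an attribute dict."""
    nodes = set()
    edges = {}
    for row in rows:
        u, v = row[source], row[target]
        nodes.update((u, v))
        # Every column other than the two endpoints becomes an edge attribute.
        attrs = {key: value for key, value in row.items() if key not in (source, target)}
        # In an undirected simple graph a repeated pair overwrites the earlier edge.
        edges[frozenset((u, v))] = attrs
    return nodes, edges


columns = ["activity", "teacher", "group", "students", "duration", "subject"]
table = [
    ["One", "A", "a", 3, 45, "Math"],
    ["One", "B", "b", 2, 45, "Math"],
    ["two", "A", "c", 7, 60, "P.E"],
    ["One", "D", "a", 3, 45, "Math"],
    ["two", "C", "c", 7, 60, "P.E"],
]
rows = [dict(zip(columns, values)) for values in table]

nodes, edges = build_graph(rows, "subject", "teacher")
print(sorted(nodes))
for pair, attrs in edges.items():
    print(sorted(pair), attrs)
```

With "subject" and "teacher" as the endpoints, the subjects and the teachers both become nodes, and each row becomes a subject-teacher edge. In NetworkX itself, `G.nodes` and `G.edges(data=True)` expose the same information.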

Creating similar samples based on three different categorical variables

I am trying to create two similar samples based on three categorical attributes: sales_group, age_group, and country. I want to create the samples first and then run an analysis to see which of the two is better. Both samples should therefore have similar proportions of countries, age groups, and sales groups.
For example, Samples A and B both have the following variables:
Id Country Age Sales
The proportion of Country in Sample A is:
USA - 58%
UK - 22%
India - 8%
France - 6%
Germany - 6%
The proportion of Country in Sample B is:
India - 42%
UK - 36%
USA - 12%
France - 3%
Germany - 5%
The same goes for other categorical variables: age_group, and sales_group
Thanks in advance for help
You do not need a special sampling procedure, because a sample proportion is an unbiased estimate of the population proportion. If you have, say, more than 1000 observations and each sample contains more than about 30 of them, the estimate will be quite close (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000 # Amount of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100 # Amount of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
# A B C D
# 23 22 32 23
table(sampleB$sales_group)
# A B C D
# 25 22 28 25
DISCLAIMER: However, if some proportions are very small or very large and the samples are too small, you will need more advanced procedures such as Laplace smoothing.
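The same experiment can be sketched in Python to compare the category shares of two simple random samples side by side. The population below is synthetic and uniformly distributed, purely for illustration; the helper name `proportions`, the seed, and the sizes are assumptions of this sketch.

```python
import random
from collections import Counter

rng = random.Random(123)  # fixed seed so the run is repeatable

countries = ["USA", "UK", "India", "France", "Germany"]
# Synthetic population of 10000 rows with the three categorical columns.
population = [
    {
        "id": i,
        "country": rng.choice(countries),
        "age_group": rng.choice(["old", "young"]),
        "sales_group": rng.choice(["A", "B", "C", "D"]),
    }
    for i in range(10000)
]


def proportions(sample, column):
    """Share of each category of `column` within `sample`."""
    counts = Counter(row[column] for row in sample)
    return {category: count / len(sample) for category, count in counts.items()}


sample_size = 100
# Two independent simple random samples (each drawn without replacement).
sample_a = rng.sample(population, sample_size)
sample_b = rng.sample(population, sample_size)

for column in ("country", "age_group", "sales_group"):
    print(column)
    print("  A:", proportions(sample_a, column))
    print("  B:", proportions(sample_b, column))
```

Increasing `sample_size` should bring the two sets of shares closer to each other and to the population shares.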

SQL Server Geospatial, find location of point at a distance along a linestring

We are investigating migrating a prototype into SQL Server (Azure).
We have LineStrings that also have M values. Given another M value, we would like to find out what its geographical location is.
To aid your visualisation, here is a real-world example:
I have a linestring that represents a flight path. Because the flight goes up and down the distance the plane has actually moved is not the same as the total length of the linestring. We have calibrated M values as a part of the linestring but need to be able to plot on it where a given event occurred. All we know about this event is its M value.
SET @g = geometry::STGeomFromText('LINESTRING(1 0 NULL 0, 2 2 NULL 5, 1 4 NULL 9, 3 6 NULL 15)', 0);
Given something like the above, what are the lat and long of a point with an M value of 8?
This should be equivalent to PostGIS's ST_LocateAlong.
The M value is not a time, but a distance. It should be understood that this distance is arbitrary and does not directly relate to the length of the line and is calibrated against known points. This is due to the set being based on historic data that is in no way accurate by today's standards.
*Note: I am not sure whether I have nulled the Z or the M value. The extra parameter we are considering here is M only.
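The lookup being asked for can be illustrated with a short Python sketch that interpolates linearly between the two vertices whose M values bracket the target. This is not a SQL Server API; `locate_along` is a hypothetical helper, and the sketch assumes the fourth ordinate in the WKT above is M and that M varies linearly along each segment.

```python
def locate_along(vertices, m):
    """Return the (x, y) points on a polyline whose interpolated M equals m.

    vertices is a list of (x, y, m) tuples in line order.
    """
    hits = []
    for (x0, y0, m0), (x1, y1, m1) in zip(vertices, vertices[1:]):
        if m0 == m1:
            continue  # M is constant on this segment, so it cannot be interpolated
        t = (m - m0) / (m1 - m0)  # fraction of the way along the segment
        if 0.0 <= t <= 1.0:
            point = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            # A target that falls exactly on a shared vertex matches two segments.
            if not hits or hits[-1] != point:
                hits.append(point)
    return hits


# The linestring from the question as (x, y, m), with the NULL Z values dropped.
line = [(1, 0, 0), (2, 2, 5), (1, 4, 9), (3, 6, 15)]
print(locate_along(line, 8))
```

For M = 8 the bracketing vertices are (2, 2) with M = 5 and (1, 4) with M = 9, so the point lies three quarters of the way along that segment, at (1.25, 3.5).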

What does A represent in a GPS coordinate point?

I am getting GPS information from a device like this
052340.000,A
32.46275,N
75.310415,E
I know N is for north and E for east, but what does A represent?
Looking at the value, and some of the other comments, it is unlikely to be an altitude in meters. If this has been extracted from a GPGLL NMEA sentence, the numeric value is the time of the fix, e.g. 05:23:40, and A means the data is valid, as per the following:
$GPGLL
Geographic Position, Latitude / Longitude and time.
eg2. $GPGLL,4916.45,N,12311.12,W,225444,A
4916.45,N Latitude 49 deg. 16.45 min. North
12311.12,W Longitude 123 deg. 11.12 min. West
225444 Fix taken at 22:54:44 UTC
A Data valid
eg3. $GPGLL,5133.81,N,00042.25,W*75
1 2 3 4 5
1 5133.81 Current latitude
2 N North/South
3 00042.25 Current longitude
4 W East/West
5 *75 checksum
$--GLL,llll.ll,a,yyyyy.yy,a,hhmmss.ss,A
llll.ll = Latitude of position
a = N or S
yyyyy.yy = Longitude of position
a = E or W
hhmmss.ss = UTC of position
A = status: A = valid data
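The field layout above can be turned into a small parser. The Python sketch below is a minimal illustration that assumes the comma-separated GLL layout just listed (latitude, N/S, longitude, E/W, optional UTC time, optional status); `parse_gll` is a hypothetical helper, and it does not verify the checksum.

```python
def _to_decimal_degrees(value, hemisphere):
    """Convert NMEA (d)ddmm.mm plus a hemisphere letter to signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)
    minutes = raw - degrees * 100
    decimal = degrees + minutes / 60
    return -decimal if hemisphere in ("S", "W") else decimal


def parse_gll(sentence):
    """Parse a $--GLL sentence into a dict; the checksum is ignored, not verified."""
    body = sentence.strip().lstrip("$").split("*")[0]
    fields = body.split(",")
    if not fields[0].endswith("GLL") or len(fields) < 5:
        raise ValueError("not a GLL sentence")
    utc = fields[5] if len(fields) > 5 and fields[5] else None
    status = fields[6] if len(fields) > 6 and fields[6] else None
    return {
        "latitude": _to_decimal_degrees(fields[1], fields[2]),
        "longitude": _to_decimal_degrees(fields[3], fields[4]),
        # hhmmss(.ss) -> hh:mm:ss; fractional seconds are dropped in this sketch
        "time": f"{utc[0:2]}:{utc[2:4]}:{utc[4:6]}" if utc else None,
        # A = valid data; None when the sentence carries no status field
        "valid": (status == "A") if status else None,
    }


fix = parse_gll("$GPGLL,4916.45,N,12311.12,W,225444,A")
print(fix)
```

Under this layout the trailing A is a status flag, so it maps to a boolean rather than to a number.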
Altitude. It would depend on the device as to what units it is in. From the number shown in your example, I would doubt it is meters, unless you are in an aeroplane.