I have 2 tables in my Oracle database:
DF (term, doccount)
TF (abstractid, term, freq)
One table holds document frequency (DF), with each term and its document count; the other holds term frequency (TF), with a document ID, term, and frequency.
I want to calculate TF*IDF, where TF is the number of times a term appears in an article (the freq column from table TF) and IDF = log(132225) - log(docCount) + 1.
I want to store the result in a table (TFIDF) with documentID, term, and the calculated TF*IDF.
Any ideas?
You need to join your TF and DF tables and then insert into the destination TFIDF table.
Try this:
insert into TFIDF (documentID, terms, tf_idf)
select tf.abstractid,
       df.term,
       (log(10, 132225) - log(10, df.doccount) + 1) * tf.freq  -- IDF * TF, base-10 logs
from tf
join df on tf.term = df.term;
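If the destination TFIDF table does not exist yet, create it first. A minimal sketch, assuming NUMBER and VARCHAR2 column types (adjust the lengths and precision to your data):
create table TFIDF (
    documentID number,         -- matches TF.abstractid
    terms      varchar2(255),  -- matches DF.term / TF.term
    tf_idf     number
);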
In BigQuery, given a country's ISO-2 code, I need to get its centroid coordinates (latitude and longitude).
Is there a way to do this?
Looking into the geography functions of BigQuery, I did not find one.
You can use the bigquery-public-data.geo_openstreetmap.planet_features table from BigQuery public datasets to calculate the centroids of countries. The inner query is based on this answer from Stack Overflow. Consider the below query and output.
SELECT -- extract country code, longitude and latitude
  CountryISO2Code,
  ST_X(centroid) AS longitude,
  ST_Y(centroid) AS latitude
FROM (
  SELECT
    (SELECT value FROM UNNEST(all_tags)
     WHERE key = 'ISO3166-1:alpha2') AS CountryISO2Code,
    ST_CENTROID(geometry) AS centroid -- calculate the centroid from the geometry values
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
  WHERE EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'boundary' AND value = 'administrative')
    AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'admin_level' AND value = '2')
    AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2')
)
WHERE CountryISO2Code = 'IN' -- country code to filter the output
The output of the above query is one row with the centroid longitude and latitude for the filtered country code.
Please note that some country codes are not available in the table. Also, the OSM dataset in BigQuery is produced as a public good by volunteers, and there are no guarantees about data quality.
I also found some community-maintained data on GitHub, like this one, which can be imported into BigQuery and is ready to use.
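Once such a file is loaded into BigQuery, the lookup becomes a simple filter. A minimal sketch, assuming (hypothetically) that the CSV was imported as mydataset.country_centroids with columns country, latitude, and longitude:
SELECT latitude, longitude
FROM `mydataset.country_centroids`  -- hypothetical imported table
WHERE country = 'IN';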
I am performing a query on PostgreSQL using Python (psycopg2).
The data is point geometries, stored in patches of 600 points each.
I am trying to streamline and speed up the process, previously I would do the following:
explode the geometry
order by x, y, z
save the result to a new table
Use TABLESAMPLE BERNOULLI(1) to sample the data to 1%
save back to the database
To speed things up I'm trying to reduce the amount of writing to the database and keep the data in python as much as possible.
The old code:
Exploding the patches
query = sql.SQL("""INSERT INTO {}.{} (x,y,z)
SELECT
st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
from "public".{} order by x,y,z;""").format(
*map(sql.Identifier, (schema_name, table_name2, table_name1)))
Sampling the data:
query2 = ("CREATE TABLE {}.{} AS (SELECT * FROM {}.{} TABLESAMPLE BERNOULLI ({}))".format(
schema, table_name_base, schema, imported_table_name_base, sample_base))
This works, but I would like to either:
A) Perform this as a single query, so explode --> order by --> sample.
B) Perform the explode in SQL, then sample in python.
For A) I have attempted nesting subqueries, but PostgreSQL will not allow TABLESAMPLE to operate on anything that isn't a table or a view.
For B) I use data = gpd.read_postgis(query, con=conn) to get the data directly into a geopandas dataframe, so sorting is then easy, but how do I perform the equivalent of TABLESAMPLE BERNOULLI to a geopandas dataframe?
Option A is my preferred option, but it might be useful to test option B in case I end up allowing different sampling methods.
Edit:
This is the query whose visual result I inspected:
query = """
SELECT
PC_EXPLODE(pa)::geometry as geom,
st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
FROM {}.{}
TABLESAMPLE BERNOULLI ({})
ORDER BY x,y,z, geom
;
""".format(schema, pointcloud, sample)
I am a little lost. A random sample is a random sample and doesn't depend on the ordering. If you want a sample that depends on the ordering, then use an nth-row sample. That would be:
select t.*
from (select t.*,
             row_number() over (order by x, y, z) as seqnum
      from (select st_x(PC_EXPLODE(pa)::geometry) as x,
                   st_y(PC_EXPLODE(pa)::geometry) as y,
                   st_z(PC_EXPLODE(pa)::geometry) as z
            from "public".{}
           ) t
     ) t
where seqnum % 100 = 1;  -- keep every 100th point, i.e. a 1% systematic sample
Or perhaps you just want to take the sample and then order afterwards, which you can also do with a subquery.
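For example, here is a minimal sketch of that sample-then-order approach, using the same {} placeholder style as the question. Note that TABLESAMPLE runs against the patch table before PC_EXPLODE, so it samples whole 600-point patches rather than individual points:
select st_x(geom) as x,
       st_y(geom) as y,
       st_z(geom) as z
from (select PC_EXPLODE(pa)::geometry as geom  -- explode sampled patches into points
      from "public".{} tablesample bernoulli (1)
     ) t
order by x, y, z;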
The output from AI Platform for a tabular dataset looks something like this:
{
"classes": ["a","b","c"],
"scores": [0.9,0.1,0.0]
}
There are two arrays within a record field. predicted_label.classes is the label, and predicted_label.scores is the score produced by AI Platform.
I would like to select the class with the highest score, i.e., in the above example I would like an output like row=0, class="a", score=0.9.
UNNEST does not immediately solve my issue, from my understanding, as it requires the input to be an array. I believe if the output were a repeated RECORD it would be easier.
What SQL query will enable me to extract the right label from the AI Platform batch results?
Try this:
with testdata as (
select struct(["a", "b", "c"] as classes, [0.9, 0.1, 0.0] as scores) as predicted_label
)
select (
select struct(offset, class, score)
from unnest(predicted_label.classes) as class with offset
join unnest(predicted_label.scores) as score with offset
using (offset)
order by score desc
limit 1
) as highest
from testdata
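With the sample data above, this returns offset = 0, class = "a", score = 0.9, which matches the desired output of row=0, class="a", score=0.9.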
You should design your prediction list so that each label and score is represented as a key-value pair.
In BigQuery, that table looks like this:
prediction RECORD REPEATED
prediction.label STRING REQUIRED
prediction.score FLOAT REQUIRED
Why?
This is a correct representation of your real-world situation.
You need no further verification that both lists keep the elements in the correct pairing order (on write and on read).
With two loose lists you create a pitfall that will haunt you.
SQL example
with this_model as (
select [
STRUCT ('a' as label, 0.9 as score)
, STRUCT ('b' as label, 0.1 as score)
, STRUCT ('c' as label, 0.0 as score)
] as prediction
)
select pair.label, pair.score
from this_model, UNNEST(prediction) pair
order by pair.score desc
limit 1;
I have had a good look around on the web, but nothing seems to answer the question too clearly. R isn't my usual platform, but I am trying to use it a bit more, starting by replicating some code I have elsewhere in VBA. Below is an extract of the output of a query from a SQL database in R.
ID Return_Date ISIN Return
25786 41815 XS1022203076 1.397800e-03
25787 41808 XS1022203076 -4.000600e-03
25977 41815 GB1070308082 9.685500e-03
25978 41808 GB1070308082 2.993700e-03
Is there a quick way in R to take the results of the above and get them into the shape shown below? I.e., where each of the distinct values in the ISIN field becomes a column, all sorted by the values in another field (Return_Date).
Return_date GB1070308082 XS1022203076
41815 9.685500e-03 1.397800e-03
41808 2.993700e-03 -4.000600e-03
In base R, the function you're looking for is reshape:
reshape(mydf, idvar="Return_Date", timevar="ISIN",
direction = "wide", drop="ID")
# Return_Date Return.XS1022203076 Return.GB1070308082
# 1 41815 0.0013978 0.0096855
# 2 41808 -0.0040006 0.0029937
You can also look at dcast from "reshape2", for example:
library(reshape2)
dcast(mydf, Return_Date ~ ISIN, value.var="Return")
I've got a table with users and their scores, a decimal number ranging from 1 to 10:
Table(user_id, score). I also have a one-row query with the average of the scores (about 6) and the standard deviation (about 0.5). I use MS Access 2007.
I want to label the users A, B, C, D, where:
A has a score higher than (avg+stdev);
B has a score lower than A, but higher than the average;
C has a score lower than the average, but higher than (avg-stdev);
D has a score lower than (avg-stdev).
If I export all the data to Excel, I can calculate the values easily and import them back into the database. Obviously, this isn't the most elegant way. I would like to do this with SQL as a query. The result should be a table (user_id, label).
But how?
You can use a cross join to join up your users to the one-row stats query. Then you can use a nested IIf to calculate the grade.
Something like this...
SELECT users.*, grade.*,
       IIf(users.score > grade.high, "A",
           IIf(users.score > grade.average, "B",
               IIf(users.score > grade.low, "C", "D"))) AS label
FROM (SELECT Round(Avg(users.score) - StDev(users.score), 1) AS low,
             Round(Avg(users.score), 1) AS average,
             Round(Avg(users.score) + StDev(users.score), 1) AS high
      FROM users) AS grade, users;
The IIf did the trick.
I adapted the query with the average scores to add the minimum A, B, and C scores:
Table(avg, stdev, Ascore, Bscore, Cscore) as averages
The final query looked like this:
SELECT users.Id, users.avgScore,
       IIf(avgScore > averages.Ascore, "A",
           IIf(avgScore > averages.Bscore, "B",
               IIf(avgScore > averages.Cscore, "C", "D"))) AS label
FROM averages, users;
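Since the goal was to store the result as a table (user_id, label), Access can materialize such a query with a make-table query (SELECT ... INTO). A minimal sketch, where user_labels is a hypothetical destination table name:
SELECT users.Id AS user_id,
       IIf(avgScore > averages.Ascore, "A",
           IIf(avgScore > averages.Bscore, "B",
               IIf(avgScore > averages.Cscore, "C", "D"))) AS label
INTO user_labels
FROM averages, users;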