How can I identify English-language text using BigQuery?

I have some data on YouTube channel descriptions which are quite messy, as you'd imagine. I'd like to filter for channels whose description is in English, but I'm not sure how to go about it. Here's a sample of what the data looks like:
WITH
foo AS (
SELECT ".olá sejam muito bem vindos. este canal foi criado" AS x
UNION ALL SELECT "Hello, I am Abhy and welcome to my channel." AS x
UNION ALL SELECT "Channels I love: Labrant Fam, Norris Nuts, La Familia Diamond, Piper Rockelle" AS x
UNION ALL SELECT "हेलो दोस्तो रमेश और सागर और सुखदेव आपका स्वागत करते हैं इस चैनल के ऊपर" AS x
UNION ALL SELECT "Hi, I'm K-POP RANDOM👩🇲🇨 === 🌈KPOP RANDOM DANCE🌈 === 🌻I hope you can enjoy" AS x
UNION ALL SELECT 'Public TV Kannada news channel. The slogan is "Yaara Aasthiyoo Alla, Idu Nimma TV"' AS x
UNION ALL SELECT "Instagram: www.instagram.com/whatsfordinner5291/" AS x
UNION ALL SELECT "Welcome to RunningBoy12, a gaming channel brought to you by RO!" as x
)
SELECT * FROM foo
My idea is to hand-label some records, measure the frequency of foreign characters and words, and then fit a logistic regression model to the data using BigQuery ML. Is there a better way?
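For example, the kind of character-based feature I have in mind could be computed directly in SQL; a rough sketch (assuming the fraction of non-ASCII characters is a useful signal), e.g. replacing the final SELECT in the sample above with:
SELECT
  x,
  -- fraction of characters outside the ASCII range, a crude signal for non-English text
  SAFE_DIVIDE(
    LENGTH(REGEXP_REPLACE(x, r'[\x00-\x7F]', '')),
    LENGTH(x)
  ) AS non_ascii_ratio
FROM foo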

You can detect language with the Cloud Translation API. Before inserting records, run the text through this API; you may want to use Cloud Functions to call it, or, if you need more complicated ETL, Cloud Dataflow.
When a text is categorized as English, insert the record into whatever database you want.
This way you don't have to store non-English text in your database at all, which saves money on storage and querying. Instead of BigQuery, Cloud Firestore could also be an option; it depends on the service you want to build.
Here is the Cloud Translation API documentation:
https://cloud.google.com/translate/docs/advanced/detecting-language-v3#before_you_begin
Comparison of databases:
https://db-engines.com/en/system/Amazon+DocumentDB%3BGoogle+BigQuery%3BGoogle+Cloud+Firestore

Related

Using BigQuery Geo Viz to visualize a polygon and its centroid

Using BigQuery Geo Viz, I am trying to visualize a polygon and its centroid point simultaneously on the same map.
I tried the ST_UNION function but could not really combine the two GEOGRAPHYs.
Any idea how to visualize both GEOGRAPHYs?
Polygon:
POLYGON((-95.7082555 29.9212101, -95.665885 29.907145, -95.7742806214083 29.82947355, -95.7303605 29.8538605, -95.659484 29.901497, -95.662932 29.894958, -95.8441482 29.7265376, -95.646749 29.905534, -95.810012 29.719363, -95.664174 29.883618, -95.639718 29.910045, -95.652796 29.89204, -95.649915 29.886317, -95.650089 29.881912, -95.641443 29.897741, -95.632912 29.911674, -95.653458 29.864561, -95.635056 29.864431, -95.636533 29.757219, -95.623339 29.903466, -95.597235 29.75367, -95.3636989932886 29.8063167449664, -95.575123 29.920295, -95.3944858832763 29.94248964622, -95.147033 30.013214, -95.586588 29.947706, -95.456723 31.3287239, -95.69717 29.96911, -95.674433 29.943844, -95.678203 29.935184, -95.7082555 29.9212101))
Centroid point:
POINT(-95.5606651932764 30.2307053050834)
Try selecting the two structures separately and using UNION ALL to gather them in the same visualization:
SELECT ST_GeogFromText('POLYGON((-95.7082555 29.9212101, -95.665885 29.907145, -95.7742806214083 29.82947355, -95.7303605 29.8538605, -95.659484 29.901497, -95.662932 29.894958, -95.8441482 29.7265376, -95.646749 29.905534, -95.810012 29.719363, -95.664174 29.883618, -95.639718 29.910045, -95.652796 29.89204, -95.649915 29.886317, -95.650089 29.881912, -95.641443 29.897741, -95.632912 29.911674, -95.653458 29.864561, -95.635056 29.864431, -95.636533 29.757219, -95.623339 29.903466, -95.597235 29.75367, -95.3636989932886 29.8063167449664, -95.575123 29.920295, -95.3944858832763 29.94248964622, -95.147033 30.013214, -95.586588 29.947706, -95.456723 31.3287239, -95.69717 29.96911, -95.674433 29.943844, -95.678203 29.935184, -95.7082555 29.9212101))') t UNION ALL SELECT ST_GeogFromText('POINT(-95.5606651932764 30.2307053050834)') t
If your intention is to show the geometry and the point in the same visualization, this will work, as you can see in the image below.
Please let me know if this is what you are looking for.
For the simple scenario you presented in the question, with just one polygon and its centroid, the simple solution below works:
#standardSQL
WITH objects AS (
SELECT 'POLYGON((-95.7082555 29.9212101, -95.665885 29.907145, -95.7742806214083 29.82947355, -95.7303605 29.8538605, -95.659484 29.901497, -95.662932 29.894958, -95.8441482 29.7265376, -95.646749 29.905534, -95.810012 29.719363, -95.664174 29.883618, -95.639718 29.910045, -95.652796 29.89204, -95.649915 29.886317, -95.650089 29.881912, -95.641443 29.897741, -95.632912 29.911674, -95.653458 29.864561, -95.635056 29.864431, -95.636533 29.757219, -95.623339 29.903466, -95.597235 29.75367, -95.3636989932886 29.8063167449664, -95.575123 29.920295, -95.3944858832763 29.94248964622, -95.147033 30.013214, -95.586588 29.947706, -95.456723 31.3287239, -95.69717 29.96911, -95.674433 29.943844, -95.678203 29.935184, -95.7082555 29.9212101))' wkt_string UNION ALL
SELECT 'POINT(-95.5606651932764 30.2307053050834)'
)
SELECT ST_GEOGFROMTEXT(wkt_string) geo
FROM objects
and this can be visualized with different tools - like in the example below
For a more realistic scenario, where you have many polygons and need to visualize them along with their centroids, you can use the approach below (based on an example with US states):
#standardSQL
SELECT state_geom state, ST_CENTROID(state_geom) centroid
FROM `bigquery-public-data.utility_us.us_states_area`
with the result as below
which can be visualized as in the examples below (just showing a few states to give an idea)
And finally, you can combine all such polygons (states in this example) with their centroids in one nice visualization, as below.
Another thing you can do (of many endless options) is to add some metrics and more attributes to the query - for example state_name and area_land_meters - to make your visualization data-driven, with dynamic tooltips, as in the example below.
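A sketch of such a query (assuming the state_name and area_land_meters columns mentioned above) might look like this:
#standardSQL
SELECT state_name, state_geom state, ST_CENTROID(state_geom) centroid, area_land_meters
FROM `bigquery-public-data.utility_us.us_states_area`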

BigQuery: Store semi-structured JSON data

I have data which can have varying JSON keys. I want to store all of this data in BigQuery and then explore the available fields later.
My structure will be like so:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query and see how often each key appears, and basically run queries over all records treating, for example, a given key as though it were its own column.
Side note: this data is coming from URL query strings; maybe someone thinks it is best to push the full URL and then use functions to run the analysis?
There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
UNION ALL
SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
UNION ALL
SELECT 3333 AS id, "{\"a\":27}" as data
UNION ALL
SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)
# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
SELECT id,
CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue # Extract & cast as an INT
FROM records
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error prone
Cons
Storage costs will be slightly higher since you have to store all of the characters needed to serialize the JSON.
Queries will run slower than using pure native SQL.
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
SELECT id, key, value
FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
FROM records R
LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
As you can see, performing a similar query is still rather complex. You also still have to store items like strings and CAST them to other values when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be less than JSON
Queries will typically execute faster.
Cons
The syntax is more complex and not as straightforward
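If you also want to see how often each key appears (the other part of your question), the repeated-field layout makes that a simple aggregation. A sketch, reusing the same sample table as above:
## Using a with to create the same sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
]))
## Count how many times each key appears across all records
SELECT key, COUNT(*) AS appearances
FROM records, UNNEST(records.data) AS keyVals
GROUP BY key
ORDER BY appearances DESC
# results
# Row | key | appearances
# 1   | a   | 4
# 2   | c   | 3
# 3   | b   | 2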
Hope that helps, good luck.

Google's Big Query using SQL: Associate the assignee name and harmonized assignee name when there are multiple assignees

My goal is to create a table from Google's Big Query patents-public-data.patents.publications_201710 table using standard SQL that has one row for the publication_number, assignee and assignee_harmonized.name where the publication_number is repeated for records that have multiple assignees. Here's an example of my desired output:
publication_number|assignee|assignee_harm
US-6044964-A|Sony Corporation|SONY CORP
US-6044964-A|Digital Audio Disc Corporation|DIGITAL AUDIO DISC CORP
US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
US-8746747-B2|null|MCPHERSON TERRY R
I've tried the following query, based on the UNNEST suggestion found in this post:
#standardSQL
SELECT
p.publication_number,
p.assignee,
a.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p,
UNNEST(assignee_harmonized) AS a
WHERE
p.publication_number IN ('US-6044964-A',
'US-8746747-B2')
However, the output appears as follows:
row|publication_number|assignee|assignee_harm
1|US-6044964-A|Sony Corporation|SONY CORP
||Digital Audio Disc Corporation|
2|US-6044964-A|Sony Corporation|DIGITAL AUDIO DISC CORP
||Digital Audio Disc Corporation|
3|US-8746747-B2|IPS Corporation—Weld-On Division|MCPHERSON TERRY R
4|US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
You can see that the "Sony Corporation" assignee is inappropriately associated with the "DIGITAL AUDIO DISC CORP" harmonized name in row 2 with a similar issue appearing in row 3. Also, rows 1 and 2 contain two lines each but don't repeat the publication_number identifier. I don't see a straightforward way to do this because the number of "assignee" doesn't always equal the number of "assignee_harmonized.name" and they don't always appear in the same order (otherwise I could try creating two tables and merging them somehow). On the other hand, there has to be a way to associate the "assignee" variable with its harmonized value "assignee_harmonized.name", otherwise the purpose of having a harmonized value is lost. Could you please suggest a query (or set of queries) that will produce the desired output when there are either multiple "assignee" or multiple "assignee_harmonized.name" or both?
You're querying for a string and two arrays - the whole thing basically looks like this:
{
"publication_number": "US-8746747-B2",
"assignee": [
"IPS Corporation—Weld-On Division"
],
"assignee_harm": [
"MCPHERSON TERRY R",
"IPS CORPORATION—WELD ON DIVISION"
]
}
So that's the data and you somehow need to decide how to treat the combination of them ... either you cross join everything:
#standardSQL
SELECT
p.publication_number,
assignee,
assignee_harmonized.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
,p.assignee assignee
,p.assignee_harmonized AS assignee_harmonized
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
... which gives you relational data ... or you leave it as two separate arrays:
#standardSQL
SELECT
p.publication_number,
assignee,
ARRAY( (SELECT name FROM p.assignee_harmonized)) AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
You can save this nested result as a table in BigQuery as well.
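A minimal sketch of doing that with CREATE TABLE ... AS SELECT (the project and dataset names below are placeholders for your own) could be:
#standardSQL
# your_project.your_dataset is a placeholder - point it at a dataset you own
CREATE TABLE `your_project.your_dataset.assignee_harmonized` AS
SELECT
  p.publication_number,
  assignee,
  ARRAY( (SELECT name FROM p.assignee_harmonized)) AS assignee_harm
FROM
  `patents-public-data.patents.publications_201710` AS p
WHERE
  p.publication_number IN ('US-6044964-A','US-8746747-B2')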

SQL Modeling Pyramid or Binary Tree

I'm building a SQL Server project that has a pyramid or binary tree concept...
I'll try to explain using some tables!
The first table is
TB_USER(ID, ID_FATHER, LEFT/RIGHT TREE POSITION)
Users can sell products! When they sell, they earn points. The second table is
TB_SELL (ID_USER, ID_PRODUCT, POINT)
As a result I'd like to see, in report format, the points of each client below me in the binary tree model. How can I design these tables to make my life easier for this kind of search? I will always fetch my descendants up to 9 levels down.
I know that I can solve this problem with a stored procedure, however I would like to know an elegant and simple solution.
Thank you
I solved this using a recursive query:
with with_user_earns as (
    -- anchor member: get the starting user (the "father")
    select father.id, father.str_name, father.id_father, father.ind_father_side_type, 1 as int_user_level
    from tb_user father
    where id = 9
    union all
    -- recursive member: get all sons; recursion stops when no more children are found
    select son.id, son.str_name, son.id_father, son.ind_father_side_type, WUE.int_user_level + 1
    from tb_user as son
    inner join with_user_earns as WUE on son.id_father = WUE.id
    where son.id_father is not null /*and WUE.int_user_level < 9*/
)
-- show result
select id, str_name, id_father, ind_father_side_type, int_user_level
from with_user_earns
order by int_user_level, id
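If you also need the points from TB_SELL for the report, a sketch that sums each descendant's points on top of the same recursive query (assuming the column names from your question, and enabling the 9-level limit) would be:
with with_user_earns as (
    select father.id, father.str_name, father.id_father, father.ind_father_side_type, 1 as int_user_level
    from tb_user father
    where id = 9
    union all
    select son.id, son.str_name, son.id_father, son.ind_father_side_type, WUE.int_user_level + 1
    from tb_user as son
    inner join with_user_earns as WUE on son.id_father = WUE.id
    where son.id_father is not null and WUE.int_user_level < 9 -- stay within 9 levels
)
-- total points per user in the subtree
select u.id, u.str_name, u.int_user_level, sum(s.point) as total_points
from with_user_earns u
left join tb_sell s on s.id_user = u.id
group by u.id, u.str_name, u.int_user_level
order by u.int_user_level, u.id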

SQLite Select first row per category

I'm trying to write multi-language blog software using Python and SQLite, and I'm struggling to make an SQL query elegant.
I've got all articles of the blog in two tables:
articleindex (contains most of the metadata of the articles like the URL, etc)
articlecontent (contains, well, the content of the article, a flag for the language, and when this specific translation was written, i.e. the date)
I now want to select all articles ordered by date and by language. This is for the main view of the blog. It should list all articles in chronological order, regardless of the language they are in, but only once (I don't want to have the English version of an article below or above the German version). If there are multiple translations, the main view should show the default language (English) if it exists. If there is no English version it should show the German version (if it exists); if there is no German version it shall show the Esperanto version, etc.
Of course I can do this in Python: select all articles and skip a record if another version of the article has already been listed. However, this seemed inelegant. I'd rather have SQLite return the data as needed.
So far I have managed to get the data in the order I want; I just don't seem to be able to eliminate the unneeded records.
Here is the table structure:
CREATE TABLE articleindex (id INTEGER PRIMARY KEY,
category text,
translationid text,
webid text)
CREATE TABLE articlecontent (id INTEGER PRIMARY KEY,
articleindexid INTEGER,
lang text,
content text,
date text)
I came up with this query, which gives me the right order, but has duplicates in it:
SELECT * FROM articlecontent AS ac LEFT JOIN articleindex AS ai
ON ac.articleindexid = ai.id ORDER BY ac.date DESC, CASE ac.lang
WHEN "en" THEN 0
WHEN "de" THEN 1
WHEN "eo" THEN 2
END
This results in the (shortened) output:
articleindexid, lang
21, en
21, de
12, en
12, de
8, en
8, de
2, en
2, de
2, eo
How do I skip, for example, the second record with articleindexid 21 or 12?
Using search engines I came across suggestions about using partitions, but it seems SQLite doesn't support those. I also have difficulty knowing what to search for, so any suggestion is appreciated.
Thanks
Ben
I think you should create a table of language priorities and use it instead of the CASE expression in your SQL statements. For example:
LANG_PRIORITY(lang TEXT, ord INTEGER) = (('en', 0), ('de', 1), ('eo', 2))
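A minimal SQLite sketch of such a table (the table and column names are only a suggestion) could be:
CREATE TABLE lang_priority (lang TEXT PRIMARY KEY, ord INTEGER);
INSERT INTO lang_priority (lang, ord) VALUES ('en', 0), ('de', 1), ('eo', 2);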
Anyway, in your current environment, try the following query. The correlated subquery with LIMIT 1 selects, for each article, the single row whose language has the highest priority:
SELECT * FROM articlecontent AS ac
LEFT JOIN articleindex AS ai
ON ac.articleindexid = ai.id
WHERE ac.id =
(
SELECT ac2.id FROM articlecontent AS ac2
WHERE ac.articleindexid = ac2.articleindexid
ORDER BY CASE ac2.lang
WHEN 'en' THEN 0
WHEN 'de' THEN 1
WHEN 'eo' THEN 2
END
LIMIT 1
)
ORDER BY ac.date DESC
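And if you do create the lang_priority table suggested above, the CASE expression can be replaced by a join against it, so adding a language only needs a new row instead of a query change (a sketch; note the inner join skips languages missing from lang_priority):
SELECT * FROM articlecontent AS ac
LEFT JOIN articleindex AS ai
ON ac.articleindexid = ai.id
WHERE ac.id =
(
SELECT ac2.id FROM articlecontent AS ac2
JOIN lang_priority AS lp ON lp.lang = ac2.lang
WHERE ac.articleindexid = ac2.articleindexid
ORDER BY lp.ord
LIMIT 1
)
ORDER BY ac.date DESC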