BigQuery materialized view; create a materialized view using an (inner) join - google-bigquery

Hopefully someone can help me move forward, because it seems I'm stuck.
I've read the documentation carefully on how to create a materialized view, but I'm getting this unclear error:
Materialized view query contains unsupported feature.
I know the error is in the join; if I remove it, the error disappears as well. However, inner joins should be supported according to the documentation, so I'm a little lost right now.
Background: we use Firestore and its BigQuery extension to get our collection into BigQuery. For performance reasons I'm looking into the materialized view option.
Question: basically what I want is to get the latest document from the (Firestore BigQuery extension) changelog table.
I first gather the latest timestamp per document_id, and I then want to retrieve the actual document (the data column) from the data table.
More specifically, can someone help me:
A: figure out what I'm doing wrong, or
B: tell me how to create a materialized view where I:
always have access to the latest document version
exclude deleted documents (operation = 'DELETE') from the result set
All help is very much appreciated!
CREATE OR REPLACE MATERIALIZED VIEW `XXX.XXX.view_name` AS
WITH `ids` AS (
SELECT
`document_name`,
MAX(`timestamp`) AS `timestamp`
FROM `XXX.XXX.id_table`
GROUP BY `document_name`
)
SELECT
`d`.`data`
FROM `XXX.XXX.data_table` AS `d`
INNER JOIN `ids` AS `i` ON (
`i`.`timestamp` = `d`.`timestamp`
AND `i`.`document_name` = `d`.`document_name`
)
WHERE `operation` != 'DELETE'
;

This is my guess based on the current documentation.
The documentation says:
"Aggregates in the materialized view query must be outputs. Computing or filtering based on an aggregated value is not supported."
In your WITH clause you used the MAX(timestamp) aggregate function, and you then use that column in the join's ON computation. That is probably why it's not supported.

Related

Trouble with SDO_OVERLAPBDYDISJOINT and spatial index

I am using Oracle 11g and I need to know if a specific point is inside the buffer of another point from a table with a spatial index. I am using the following statement:
SELECT A.fieldX
FROM TABLE A
WHERE SDO_OVERLAPBDYDISJOINT(
sdo_geom.sdo_buffer(A.geometry, 2, 0.1),
SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(497644.6, 2432725.8, NULL), NULL, NULL)
) = 'TRUE';
And I obtain the following error:
13226. 00000 - "interface not supported without a spatial index"
Cause: The geometry table does not have a spatial index.
Action: Verify that the geometry table referenced in the spatial operator
has a spatial index on it.
The operator SDO_OVERLAPBDYDISJOINT uses only geometries from tables with a spatial index, and I understand that this error is caused by the buffer operator, but if I invert the order and put the SDO_POINT_TYPE first, I get the same error. Is there any way to use this operator, or another similar one, without a spatial index?
I don't want to use PL/SQL because I need to use the statement in VBA code.
Thanks a lot!
What you essentially want is to find all the geometries that are within some distance of another. This is easily and better done this way, and it is also much more efficient:
SELECT A.fieldX
FROM TABLE A
WHERE sdo_within_distance(
A.geometry,
SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(497644.6, 2432725.8, NULL), NULL, NULL),
'distance=2'
) = 'TRUE';
I think your problem is that the A.geometry is indexed, but its buffer is not.
The first thing you should try is to use
SDO_OVERLAPBDYDISJOINT(A.geometry, buffer(sdo_point(...),2,0.1)) - and, while we're at it, it would be more correct to use SDO_INSIDE here.
If this does not work, you should check whether your index is indeed OK. You can easily test it using a specific id from your table - let's say 10 - and run:
select a.id from your_table a, your_table b
where a.id = b.id and b.id = 10
and sdo_equals(a.geometry, b.geometry) = 'TRUE';
If it returns your id (10 in my example), your index is OK.
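If that test fails, the index is missing or broken. A standard (hedged) recipe for creating one - the table name, column name, dimension bounds, and tolerances below are all placeholders you must adapt to your data:
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES ('YOUR_TABLE', 'GEOMETRY',
SDO_DIM_ARRAY(
SDO_DIM_ELEMENT('X', 0, 1000000, 0.005),
SDO_DIM_ELEMENT('Y', 0, 3000000, 0.005)),
NULL);

CREATE INDEX your_table_sidx ON your_table (geometry)
INDEXTYPE IS MDSYS.SPATIAL_INDEX;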

Google BigQuery, WHERE clause based on JSON item

I've got a BigQuery import from a Firestore database where I want to query on a particular field from a document. This was populated via the firestore-bigquery extension, and the document data is stored as a JSON string.
I'm trying to use a WHERE clause in my query that uses one of the fields from the JSON data. However, this doesn't seem to work.
My query is as follows:
SELECT json_extract(data,'$.title') as title,p
FROM `table`
left join unnest(json_extract_array(data, '$.tags')) as p
where json_extract(data,'$.title') = 'technology'
data is the JSON object and title is an attribute of all of the items. The above query will run but yields 'no results' (there are definitely results for the title in question, as they appear in the table preview).
I've tried using WHERE title = 'technology' as well, but this returns an error that title is an unrecognized field (hence the json_extract).
From my research this should work as a standard SQL JSON query, but it doesn't seem to work on BigQuery. Does anyone know of a way around this?
All I can think of is putting the results in another table, but I don't know if that's a workable solution, as the data is updated via the extension on every update, so I would need to constantly refresh my second table as well.
Edit
I'm wondering if configuring a view would help with this? Though ultimately I would like to query this based on different parameters, and the docs at https://cloud.google.com/bigquery/docs/views suggest you can't reference query parameters in a view.
I've since managed to work this out, and will share the solution for anyone else with the same problem.
The solution was to use JSON_VALUE in the WHERE clause instead, e.g.:
where JSON_VALUE(data,'$.title') = 'technology';
I'm still not sure if this is the best way to do this in terms of performance and cost so I will wait to see if anyone else leaves a better answer.
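For anyone wondering why: JSON_EXTRACT returns a JSON-encoded string, double quotes included (the title comes back as "technology", quotes and all), which never equals the plain string 'technology'; JSON_VALUE returns the unquoted scalar. A sketch of the full corrected query, using the same table and fields as above:
SELECT JSON_VALUE(data, '$.title') AS title, p
FROM `table`
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(data, '$.tags')) AS p
WHERE JSON_VALUE(data, '$.title') = 'technology';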

Track data load history in Snowflake

Snowflake stores a few metadata sets in its INFORMATION_SCHEMA object. I tried to investigate how a specific table got loaded by a procedure/query.
The history allows me to investigate at a high level, but I wanted custom SQL code to drill deeper.
After executing the code below I got a Statement not found error, even though the Query_ID is valid.
Is there any way to navigate the load history so I can track which procedure loaded data into which table?
SELECT * FROM table(RESULT_SCAN('xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'));
Details of using RESULT_SCAN() can be found at the link below; please note these two conditions might be affecting your ability to run the query:
the query cannot have been executed more than 24 hours prior to the use of RESULT_SCAN()
only the user who ran the original query can use RESULT_SCAN()
https://docs.snowflake.com/en/sql-reference/functions/result_scan.html#usage-notes
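If the hard-coded id is the problem, one pattern worth trying is to re-run the original statement in the same session and scan its result via LAST_QUERY_ID(), which sidesteps typos in the id (though the 24-hour and same-user restrictions above still apply):
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));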
As for "navigate history load so I can track what procedure loaded data to which table?": I'd strongly recommend you doing your analysis on the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view.
A good starting point might be something like this:
SELECT *
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('days', -30, CURRENT_TIMESTAMP())
AND start_time <= date_trunc(HOUR, CURRENT_TIMESTAMP())
AND query_text iLike '%TABLE_NAME_HERE%'
AND query_type <> 'SELECT';
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
If you suspect the table in question has been loaded by a COPY INTO table command, it would make sense to begin by looking at the results of those in one of the following two views:
SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html
SNOWFLAKE.ACCOUNT_USAGE.LOAD_HISTORY https://docs.snowflake.com/en/sql-reference/account-usage/load_history.html
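For example, a starting point against the first of those views might look like this - the table name is a placeholder, and the column list should be checked against the linked docs:
SELECT file_name, last_load_time, row_count, status
FROM snowflake.account_usage.copy_history
WHERE table_name = 'TABLE_NAME_HERE'
AND last_load_time >= DATEADD('days', -30, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;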
While the views in the account_usage "share" may have some latency (typically 10-20 minutes, but it could be as much as 90 minutes), I've found that using them for analysis like you are doing is easier than querying INFORMATION_SCHEMA objects (opinion).
I hope this helps...Rich
If you wish to view the most recent query history you can use the following syntax:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_ID = 'xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx'
To filter for data load queries:
SELECT *
FROM TABLE(information_schema.QUERY_HISTORY())
WHERE QUERY_TEXT LIKE '%COPY INTO%'
Tip: the above table functions return the last 7 days' worth of data. If you require more history, use the Account Usage views.
Tip: to use the Account Usage views, switch to the ACCOUNTADMIN role.
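For example (assuming your user has been granted ACCOUNTADMIN):
USE ROLE ACCOUNTADMIN;
SELECT * FROM snowflake.account_usage.query_history LIMIT 10;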
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
Rgds,
Dan.

How would I join two tables, A and B, on A.slug and B.path in PostgreSQL?

Say I have an articles table that has a column called slugs, storing the slugs of the article - for example example-article-2016.
I also have a log table that logs each visit to each article, and has a column called paths that stores the same data in a different format: /articles/example-article-2016.
I have thought about just processing the path column in a way that would remove the /articles/ part, and then joining, but I am curious if there is a way to join on these columns without actually modifying the data.
You don't have to actually modify the data permanently, but you do have to adjust it for the join. One way would be to replace /articles/ with '' for example:
SELECT ...
FROM articles a
JOIN log l ON REPLACE(l.paths, '/articles/', '') = a.slugs
This won't use indexes and is not ideal, but works perfectly fine in ad-hoc scenarios. If you need to do this join a lot, you should consider a schema change.
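If that schema change is an option, one sketch (assuming PostgreSQL 12+ for generated columns, and that the prefix really is always /articles/) is a stored generated column you can index and join on directly:
-- one-time schema change: derive the slug from the logged path
ALTER TABLE log
ADD COLUMN slug text GENERATED ALWAYS AS (replace(paths, '/articles/', '')) STORED;

CREATE INDEX idx_log_slug ON log (slug);

-- the join then becomes a plain, index-friendly equality
SELECT ...
FROM articles a
JOIN log l ON l.slug = a.slugs;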
You can just do:
SELECT
a.slugs /*, l.visited_at */
FROM
articles a
JOIN logs l ON substr(l.path, length('/articles/')+1) = a.slugs ;
The substr function should be quite fast to execute. You can obviously replace length('/articles/')+1 with the constant 11, but I think that leaving it there is much more informative of what you're actually doing. If the last bit of performance is needed, put the 11.
You will probably benefit from having the following computed index:
CREATE INDEX idx_logs_slug_from_path
ON logs ((substr(path, length('/articles/')+1))) ;
Check the whole setup at dbfiddle here

ERROR in CREATE VIEW

I tried to create a new view in my MS Access database so I can select from it more easily, but I wonder what's happening here.
CREATE VIEW new
AS
SELECT msthread.id,
msthread.threadname,
Count(msthread.threadname) AS TotalPost,
threadcategory
FROM msthread
LEFT OUTER JOIN msposts
ON msthread.threadname = msposts.threadname
GROUP BY msthread.id,
msthread.threadname,
msthread.threadcategory
Access gives me this error message when I try to execute that statement.
Syntax error in create table statement
Are there specific problems with creating views that use JOINs? I'm trying to access 2 tables.
CREATE VIEW was introduced with Jet 4 in Access 2000. But you must execute the statement from ADO/OleDb. If executed from DAO, it triggers error 3290, "Syntax error in CREATE TABLE statement", which is more confusing than helpful.
Also CREATE VIEW can only create simple SELECT queries. Use CREATE PROCEDURE for any which CREATE VIEW can't handle.
But CREATE VIEW should handle yours. I used a string variable to hold the DDL statement below, and then executed it from CurrentProject.Connection in an Access session:
CurrentProject.Connection.Execute strSql
That worked because CurrentProject.Connection is an ADO object. If you will be doing this from outside Access, use an OleDb connection.
Notice I made a few changes to your query. Most were minor, but I think the query name change may be important. New is a reserved word, so I chose qryNew instead. Reserved words as object names seem especially troublesome in queries run from ADO/OleDb.
CREATE VIEW qryNew
AS
SELECT
mst.id,
mst.threadname,
mst.threadcategory,
Count(mst.threadname) AS TotalPost
FROM
msthread AS mst
LEFT JOIN msposts AS msp
ON mst.threadname = msp.threadname
GROUP BY
mst.id,
mst.threadname,
mst.threadcategory;
Going out on a limb here without the error message, but my assumption is that you need an alias in front of your non-aliased column.
You may also have a problem titling the view new. This is a problem with using a generic name for a view or table. Try giving it a distinct name that matters; I'll use msThreadPosts as an example.
CREATE VIEW msThreadPosts
AS
SELECT msthread.id,
msthread.threadname,
Count(msthread.threadname) AS TotalPost,
msthread.threadcategory -- not sure if you want msposts or msthread; just pick one and match the GROUP BY
FROM msthread
LEFT OUTER JOIN msposts
ON msthread.threadname = msposts.threadname
GROUP BY msthread.id,
msthread.threadname,
msthread.threadcategory
As long as we are looking at this query, let's fix some other things that are being done in a silly way.
Let's start off with aliasing. If you alias your tables, you can very easily make your query easy to understand and read for anyone who is inclined to read it.
CREATE VIEW msThreadPosts
AS
SELECT mt.id,
mt.threadname,
Count(mt.threadname) AS TotalPost,
mt.threadcategory
FROM msthread mt
LEFT OUTER JOIN msposts mp
ON mt.threadname = mp.threadname
GROUP BY mt.id,
mt.threadname,
mt.threadcategory
There, now doesn't that look better?
The next thing to look at is your column names. msthread has an id column. That column name is incredibly generic. This can cause problems when a column isn't aliased and an id exists in multiple places, or when there are multiple id columns. If we change that column name to msthreadID, it makes things much clearer. The goal is to design your tables in a way that anyone working on your database can immediately tell what a column is doing.
The next thing to look at is your join. Why are you joining on thread name? threadname is likely a character string and therefore not terribly efficient for joins. If msthread has an id column and needs to be joined to msposts, then shouldn't msposts also have that id column to match up on, making joins more efficient? A sketch of that version follows below.
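To illustrate that last point, here is a sketch of the same view if both tables carried the suggested (hypothetical) msthreadID column:
CREATE VIEW msThreadPosts
AS
SELECT mt.msthreadID,
mt.threadname,
mt.threadcategory,
Count(mp.msthreadID) AS TotalPost
FROM msthread mt
LEFT JOIN msposts mp
ON mt.msthreadID = mp.msthreadID
GROUP BY mt.msthreadID,
mt.threadname,
mt.threadcategory;
Counting mp.msthreadID rather than a column from msthread also means threads with no posts show a TotalPost of 0 under the LEFT JOIN, instead of being counted as 1.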