Can I query a JSON field from Crashlytics as a "where" directive in BigQuery?

I've searched SO for this and found a number of similar questions, but I can't quite figure out how to apply them to my scenario.
We have Google Crashlytics linked to BigQuery.
Given the following table data (many columns deleted for clarity):
TABLE firebase_crashlytics.com_foo_ios
is_fatal | application
======================
true | {"v":{"f":[{"v":"53"},{"v":"0.9.1"}]}}
false | {"v":{"f":[{"v":"71"},{"v":"1.0.0"}]}}
true | {"v":{"f":[{"v":"72"},{"v":"1.0.1"}]}}
I've tried a lot of the suggestions, but can't seem to make this work.
I want to query the com_foo_ios table for all records for which is_fatal equals true, and the application version (the second array element of application) is higher than 1.0.0. Alternatively, I could use the build number as that is unique to versions.
So my question is:
Can this be done via a SQL query without having to write custom functions, as suggested in Berlyant's reply here? That approach didn't work for me.
Interestingly, the errors I see indicate that the two elements of the application array are identified as build_version and display_name. I've tried using those in queries as well, to no avail.
Using the above sample data, can anyone suggest a straightforward way of querying this info?

select t.*,
  json_extract_scalar(_[offset(0)], '$.v') as col1,
  json_extract_scalar(_[offset(1)], '$.v') as col2
from your_table t,
unnest([struct(json_extract_array(application, '$.v.f') as _)])
If applied to the sample data in your question:
with your_table as (
  select true is_fatal, '{"v":{"f":[{"v":"53"},{"v":"0.9.1"}]}}' application union all
  select false, '{"v":{"f":[{"v":"71"},{"v":"1.0.0"}]}}' union all
  select true, '{"v":{"f":[{"v":"72"},{"v":"1.0.1"}]}}'
)
the output is (application column omitted for brevity):
is_fatal | col1 | col2
======================
true     | 53   | 0.9.1
false    | 71   | 1.0.0
true     | 72   | 1.0.1
Hmm, dumping the data model shows 'application: STRUCT<build_version STRING, display_version STRING>',
so it looks like your table schema is actually as in the sample below:
with your_table as (
  select true is_fatal, struct('53' as build_version, '0.9.1' as display_version) application union all
  select false, struct('71' as build_version, '1.0.0' as display_version) union all
  select true, struct('72' as build_version, '1.0.1' as display_version)
)
If so, use the below to get to the particular field(s):
select is_fatal, application.*
from your_table t
with output:
is_fatal | build_version | display_version
==========================================
true     | 53            | 0.9.1
false    | 71            | 1.0.0
true     | 72            | 1.0.1
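To answer the original question under that STRUCT schema (a sketch, not tested against a live Crashlytics export; it filters on build_version as a number, since the question notes build numbers are unique per version, whereas comparing display_version as a plain string would break once any component reaches two digits, e.g. '1.10.0' < '1.2.0' as strings):

```sql
-- Sketch: fatal crashes above the 1.0.0 release (build 71 in the sample data)
select *
from firebase_crashlytics.com_foo_ios
where is_fatal = true
  and safe_cast(application.build_version as int64) > 71
```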

Related

SQL to select with multiple different WHEREs / transpose / pivot table?

This is for BigQuery SQL
I have some data like
version | color
===============
v1      | red
v2      | blue
I want to get it into an output format like this:
v1  | v2
========
red | blue
I guess this is a classic transpose, but I'm not sure of the best way to do it. I tried a nested query:
select * from
  (select color from table where version = 'v1'),
  (select color from table where version = 'v2')
(query simplified from my real column names etc.!)
but that gives me multiple rows per item, with different versions of v1.
most of the examples I found seemed a lot more complex.
https://towardsdatascience.com/pivot-in-bigquery-4eefde28b3be
Appreciate some help in basic pivots or groupby or the best way to transpose?
UPDATE: it almost works, but fails because of field names.
select *
from (
  select agent, text, expect
  from `my.data.runs`
)
pivot (
  min(expect) as expect,
  min(agent) as agent
  for agent in ("august-mr")
)
Invalid field name "expect_august-mr".
If my agent were named 'august_mr' it would work fine!
Is there any way to escape or enable a dash in the values for agent?
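One option worth trying (a sketch, not from the answer below): BigQuery's PIVOT clause accepts an alias for each value in the IN list, which controls the generated column name and so can keep the dash out of it:

```sql
select *
from (
  select agent, text, expect
  from `my.data.runs`
)
pivot (
  min(expect) as expect
  for agent in ('august-mr' as august_mr)  -- alias avoids the invalid "expect_august-mr" column name
)
```

This should produce a column named expect_august_mr instead.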
Just a simple pivot, as in the below example:
select *
from table
pivot (min(color) for version in ('v1', 'v2'))
If applied to the sample data in your question, the output is:
v1  | v2
========
red | blue

How to Query JSON Within A Database

I would like to query information from databases that were created in this format:
index | label    | key    | data
==================================================================================
1     | sneaker  | UPC    | {"size": "value", "color": "value", "location": "shelf2"}
2     | location | shelf2 | {"height": "value", "row": "value", "column": "value"}
Where a large portion of the data is in one cell, stored as JSON. To make matters a bit tricky, the attributes in the JSON aren't in any particular order, and sometimes they reference other cells. I.e., in the above example there is a "location" attribute which has more data in another row. Additionally, sometimes the data cell is a multidimensional array where values are nested inside another JSON array.
I’m seeking to do certain query tasks like
Find all locations that have a sneaker
Or find all sneakers with a particular color etc
What’s the industry accepted solution on how to do this?
These are sqlite databases that I’m currently using DB Browser for SQLite to query. Definitely open to better solutions if they exist.
The design that you have needs SQLite's JSON1 extension.
The tasks that you mention in your question can be accomplished with the use of functions like json_extract().
Find all locations that have a sneaker
SELECT t1.*
FROM tablename t1
WHERE t1.label = 'location'
  AND EXISTS (
    SELECT 1
    FROM tablename t2
    WHERE t2.label = 'sneaker'
      AND json_extract(t2.data, '$.location') = t1.key
  )
Find all sneakers with a particular color
SELECT *
FROM tablename
WHERE label = 'sneaker'
AND json_extract(data, '$.color') = 'blue'
For more complicated tasks, such as getting values out of json arrays there are other functions like json_each().
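As a sketch of that (table and column names are from the question, but the '$.tags' array below is hypothetical, just to illustrate the shape), json_each() expands a JSON array into one row per element, which you can then filter or join on:

```sql
-- Hypothetical: suppose data = '{"tags": ["red", "limited"]}' for some rows
SELECT t.*
FROM tablename t, json_each(t.data, '$.tags') j
WHERE t.label = 'sneaker'
  AND j.value = 'limited'
```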

Query Snowflake Jobs [duplicate]

Is there any way within Snowflake/SQL to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users, and I'm not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution would be to combine the queries from query_history() with SYSTEM$EXPLAIN_PLAN_JSON outside (to make the strings constant), and then you will be able to find out the most queried tables.

Google BigQuery Trying to run a TABLE_RANGE_DATE

I am building a partition-based table in a dataset and am trying to query those partitions using a date range.
Here is an example of the data:
Dataset:
logs
Tables:
logs_20170501
logs_20170502
logs_20170503
First I am trying TABLE_DATE_RANGE:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
I keep getting: "ERROR: Error evaluating subsidiary query"
I tried these options as well:
Single quotes:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
Adding the project ID:
SELECT count(*) FROM TABLE_DATE_RANGE([main_sys_logs:logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
And it didn't work.
So I tried to use _TABLE_SUFFIX:
SELECT
count(*)
FROM [main_sys_logs:logs.logs_*]
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'
And I got this error:
Invalid table name: 'main_sys_logs:logs.logs_*'
I have been switching the SQL dialect between legacy SQL on/off, and I just got different errors on the table name part.
Are there any tips or help for this matter?
Maybe my table name is built wrong, with the "_" at the end, and this is causing the problem? Thanks for any help.
So I tried this query and it worked:
SELECT count(*) FROM TABLE_DATE_RANGE(logs.logs_,
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
It started to work after I ran this query; I don't know if this is the reason, but I just queried the TABLES metadata for the dataset:
SELECT *
FROM logs.__TABLES__
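For reference, the _TABLE_SUFFIX attempt above likely failed because wildcard tables are a standard SQL feature and need backticks, not the legacy [ ] brackets (a sketch, assuming project main_sys_logs as in the question):

```sql
-- Standard SQL: backticks, and the ':' between project and dataset becomes '.'
SELECT COUNT(*) AS logs_count
FROM `main_sys_logs.logs.logs_*`
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'
```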

GitHub Archive Google Big Query repositories language information for 2015

I have a problem with retrieving language information from GitHub Archive Google BigQuery since the structure of the tables changed which was at the beginning of 2015.
When querying github_timeline table I have a field named repository_language. It allows me to get my language statistics.
Unfortunately for 2015 the structure has changed and the table doesn't contain any events after 2014.
For example the following query doesn't return any data:
select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')
Events for 2015 are in the githubarchive:month & githubarchive:day tables. None of them has language information, though (or at least a repository_language column).
Can anyone help me?
Look at the payload field.
It is a string that, I think, actually holds JSON with all the "missing" attributes.
You can process it using JSON functions.
Added Query
Try as below:
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
What Mikhail said + you can use a query like this:
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC
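For reference (a sketch; the legacy [bracket] table references above map to standard SQL backticks), the same query in standard SQL would be:

```sql
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS language,
  COUNT(*) AS c
FROM `githubarchive.month.201501`
GROUP BY language
ORDER BY c DESC
```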