SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused about what could be going wrong here, as the error says that timestamp is not grouped, even though I have grouped it at the bottom. I tried including just timestamp rather than DATE(timestamp), but that ruins the table data I am trying to create, where I find the number of users on a single day. Does anyone have any other ideas? My goal is, for a single row, to get the previous row's data, but because I am grouping by specific metrics, I need to make sure the rows are ordered by them as well. Thank you so much!
I think you simply need to modify the OVER clause as follows:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function inside OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
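For illustration, here is a minimal, untested sketch of that subquery approach; joined_source is a hypothetical stand-in for the joined tables in your original query, so the names are illustrative only:
SELECT community_name,
       community_id,
       event_date,
       COUNT(DISTINCT user_id) AS num_opened_DAU,
       LAG(COUNT(DISTINCT user_id)) OVER
         (ORDER BY community_name, community_id, event_date) AS pre_Value
FROM (
  SELECT community_name,
         community_id,
         user_id,
         DATE(timestamp) AS event_date  -- DATE() computed here; only the alias is used in OVER
  FROM joined_source                    -- hypothetical stand-in for the joins in your query
)
GROUP BY community_name, community_id, event_date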
I am working on a project to build some queries from Google Analytics data in BigQuery to replicate some reports for one particular KPI. I have a table with a list of sites that I need to exclude from the Google Analytics data in order to get the correct metric.
My list might have something such as:
sitename.com
However, I need to match this against the eventLabel column in BigQuery, where the URL could come back as:
http://sitename.com/subpage/extra-subpage
I can't do a NOT IN, as that requires an exact match. I have tried using a LIKE statement, but I get the following error:
Scalar subquery produced more than one element
I'm not really sure how else to proceed. Do I need a separate query that does the string match first? (I can get it to work if I use an inner join, keep the eventLabel in the resulting table, and then do my NOT IN based on that.)
SELECT Distinct
h.eventinfo.eventAction eventAction,
h.eventinfo.eventlabel eventlabel
FROM `projectName.ga_sessions_*`, unnest(Hits) h
WHERE
_TABLE_SUFFIX BETWEEN "20190101" AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
and type = 'EVENT'
and h.eventInfo.eventCategory = 'EventName'
and Replace(Replace(Replace(h.eventInfo.eventLabel,'http://',''),'https://',''),'www.','')
Not like (select concat(ThirdPartyURL,'%') from `projectName.datasetName.ExclusionList`)
I hope the above makes sense.
TIA.
After reproducing your problem, the solution is to use NOT IN instead of NOT LIKE, as follows:
WITH `projectName.datasetName.ExclusionList` AS
(SELECT 'label1' AS ThirdPartyURL UNION ALL
SELECT 'label2')
SELECT DISTINCT h.eventinfo.eventAction eventAction,
h.eventinfo.eventlabel eventlabel
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
unnest(Hits) h
WHERE _TABLE_SUFFIX BETWEEN "20170801" AND "20170802"
AND TYPE = 'EVENT'
AND h.eventInfo.eventCategory = 'EventName'
AND Replace(Replace(Replace(h.eventInfo.eventLabel, 'http://', ''), 'https://', ''), 'www.', '')
NOT IN
(SELECT ThirdPartyURL FROM `projectName.datasetName.ExclusionList`)
This is the link to the related BigQuery SQL documentation.
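One caveat: NOT IN only excludes exact matches. If you still need the prefix behaviour you originally described (excluding any URL that starts with a listed site), a possible workaround is to aggregate the exclusion list into an array and test each cleaned label with NOT EXISTS and LIKE. This is an untested sketch using the same placeholder table names, so adjust it to your schema:
SELECT DISTINCT t.eventAction, t.eventlabel
FROM (
  SELECT h.eventInfo.eventAction AS eventAction,
         -- strip protocol and www. before comparing, as in your original query
         REPLACE(REPLACE(REPLACE(h.eventInfo.eventLabel, 'http://', ''), 'https://', ''), 'www.', '') AS eventlabel
  FROM `projectName.ga_sessions_*`, UNNEST(hits) h
  WHERE _TABLE_SUFFIX BETWEEN "20190101"
        AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND type = 'EVENT'
    AND h.eventInfo.eventCategory = 'EventName'
) t
CROSS JOIN (
  SELECT ARRAY_AGG(ThirdPartyURL) AS urls
  FROM `projectName.datasetName.ExclusionList`
) x
WHERE NOT EXISTS (
  SELECT 1 FROM UNNEST(x.urls) AS url
  WHERE t.eventlabel LIKE CONCAT(url, '%')
)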
The script below, written against the old BigQuery Export schema, runs fine. But I want to replicate this code and write it according to the new export schema, since the BigQuery schema has changed. Please help, because in the new BigQuery Export schema I can't find any corresponding record for
event_dim (event_dim belongs to the old BigQuery Export schema).
Here is the link to the BigQuery Export schema: click here
SELECT user_dim.app_info.app_instance_id
, (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
, (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time,
event.name,
params.value.int_value engagement_time
FROM `xxx.app_events_*`,
UNNEST(event_dim) as event,
UNNEST(event.params) as params,
UNNEST(user_dim.user_properties) as user_params
where (event.name = "user_engagement" and params.key = "engagement_time_msec")
and
(user_params.key = "access" and user_params.value.value.string_value = "true") and
PARSE_DATE('%Y%m%d', event.date) >= date_sub("{{upto_date (yyyy-mm-dd)}}", interval {{last n days}} day) and
PARSE_DATE('%Y%m%d', event.date) <= "{{upto_date (yyyy-mm-dd)}}"
I tried the query below, but I want app_instance, min_time, max_time, event_name and engagement_time in one SELECT statement. And as I am using GROUP BY, I am not able to get all of those (app_instance, min_time, max_time, event_name, engagement_time) at the same time. Please help.
SELECT user_pseudo_id
, MIN(event_timestamp) AS min_time
,MAX(event_timestamp) AS max_time
FROM `xxx.app_events_*` as T,
T.event_params,
T.user_properties,
T.event_timestamp
where (event_name = "user_engagement" and event_params.key = "engagement_time_msec")
and
(user_properties.key = "access" and user_properties.value.string_value = "true") and
PARSE_DATE('%Y%m%d', event_date) >= date_sub("{{upto_date (yyyy-mm-dd)}}", interval {{last n days}} day) and
PARSE_DATE('%Y%m%d', event_date) <= "{{upto_date (yyyy-mm-dd)}}"
group by 1
It is true that there was a schema change in the Google Analytics for Firebase BigQuery Export. Although there is no clear mapping of the old fields to the new ones, the SQL query provided in the documentation to migrate existing BQ datasets from the old schema to the new one gives some hints about how these fields have changed.
I share the migration_script.sql SQL query below, just for reference, but let me pinpoint the most relevant changes for your use case:
event_dim is mapped as event in the SQL query, but does not have any final representation in the schema, because event_dim is no longer a nested field: UNNEST(event_dim) AS event
event_dim.timestamp_micros is mapped as event_timestamp: event.timestamp_micros AS event_timestamp
event_dim.name is mapped as event_name: event.name AS event_name
event_param.value.int_value is mapped as event_params.value.int_value: event_param.value.int_value AS int_value
user_dim.user_properties is mapped as user_properties, and all its nested values follow the same structure: UNNEST(user_dim.user_properties) AS user_property) AS user_properties
So, in summary, the schema change has focused on un-nesting several of the fields for simplicity, so that, for example, instead of having to access event_dim.name (which would require unnesting and complicate the query), you can query the event_name field directly.
With this in mind, I am sure you will be able to adapt your query to the new schema, and it will probably look much simpler, given that you will not have to unnest so many fields.
Just for clarification, let me share with you a couple of sample BQ queries comparing the old and the new schema (they are using public Firebase tables, so you should be able to run them out-of-the-box):
# Old Schema - UNNEST() required because there are nested fields
SELECT
user_dim.app_info.app_instance_id,
MIN(event.timestamp_micros) AS min_time,
MAX(event.timestamp_micros) AS max_time,
event.name
FROM
`firebase-public-project.com_firebase_demo_ANDROID.app_events_20180503`,
UNNEST(event_dim) AS event
WHERE
event.name = "user_engagement"
GROUP BY
user_dim.app_info.app_instance_id,
event.name
As compared to:
# New Schema - UNNEST() not required because there are no nested fields
SELECT
user_pseudo_id,
MIN(event_timestamp) AS min_time,
MAX(event_timestamp) AS max_time,
event_name
FROM
`firebase-public-project.analytics_153293282.events_20180815`
WHERE
event_name = "user_engagement"
GROUP BY
user_pseudo_id,
event_name
These queries are equivalent, but referencing tables with the old and new schema. Please note that, as your query is more complex, you may need to add some UNNEST() in order to access the remaining nested fields in the table.
Additionally, you may want to have a look at these samples that can help you with some ideas on how to write queries with the new schema.
EDIT 2
My understanding is that a query like the one below should allow you to query all the fields in a single statement. I am grouping by all the non-aggregated/non-filtered fields, but depending on your use case (this is definitely something you will need to work out on your own) you may want to apply a different strategy in order to query the non-grouped fields (e.g. use a MIN/MAX filter, etc.).
SELECT
user_pseudo_id,
MIN(event_timestamp) AS min_time,
MAX(event_timestamp) AS max_time,
event_name,
par.value.int_value AS engagement_time
FROM
`firebase-public-project.analytics_153293282.events_20180815`,
UNNEST(event_params) as par
WHERE
event_name = "user_engagement" AND par.key = "engagement_time_msec"
GROUP BY
user_pseudo_id,
event_name,
par.value.int_value
ANNEX
migration_script.sql:
SELECT
#date AS event_date,
event.timestamp_micros AS event_timestamp,
event.previous_timestamp_micros AS event_previous_timestamp,
event.name AS event_name,
event.value_in_usd AS event_value_in_usd,
user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
user_dim.bundle_info.server_timestamp_offset_micros as event_server_timestamp_offset,
(
SELECT
ARRAY_AGG(STRUCT(event_param.key AS key,
STRUCT(event_param.value.string_value AS string_value,
event_param.value.int_value AS int_value,
event_param.value.double_value AS double_value,
event_param.value.float_value AS float_value) AS value))
FROM
UNNEST(event.params) AS event_param) AS event_params,
user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
user_dim.user_id AS user_id,
user_dim.app_info.app_instance_id AS user_pseudo_id,
"" AS stream_id,
user_dim.app_info.app_platform AS platform,
STRUCT( user_dim.ltv_info.revenue AS revenue,
user_dim.ltv_info.currency AS currency ) AS user_ltv,
STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
user_dim.traffic_source.user_acquired_medium AS medium,
user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
STRUCT( user_dim.geo_info.continent AS continent,
user_dim.geo_info.country AS country,
user_dim.geo_info.region AS region,
user_dim.geo_info.city AS city ) AS geo,
STRUCT( user_dim.device_info.device_category AS category,
user_dim.device_info.mobile_brand_name,
user_dim.device_info.mobile_model_name,
user_dim.device_info.mobile_marketing_name,
user_dim.device_info.device_model AS mobile_os_hardware_model,
#platform AS operating_system,
user_dim.device_info.platform_version AS operating_system_version,
user_dim.device_info.device_id AS vendor_id,
user_dim.device_info.resettable_device_id AS advertising_id,
user_dim.device_info.user_default_language AS language,
user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
STRUCT( user_dim.app_info.app_id AS id,
#firebase_app_id AS firebase_app_id,
user_dim.app_info.app_version AS version,
user_dim.app_info.app_store AS install_source ) AS app_info,
(
SELECT
ARRAY_AGG(STRUCT(user_property.key AS key,
STRUCT(user_property.value.value.string_value AS string_value,
user_property.value.value.int_value AS int_value,
user_property.value.value.double_value AS double_value,
user_property.value.value.float_value AS float_value,
user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
FROM
UNNEST(user_dim.user_properties) AS user_property) AS user_properties
FROM
`SCRIPT_GENERATED_TABLE_NAME`,
UNNEST(event_dim) AS event
As I believe my previous answer provides some general ideas for the Community, I will keep it and write a new one in order to be more specific for your use case.
First of all, I would like to clarify that in order to adapt a query (just like you are asking us to do), one needs a clear understanding of the statement, the objective of the query, the expected results and the data to play with. As this is not the case here, it is difficult to work with it, even more so considering that some functionalities are not clear from the query. For example, in order to obtain the "min_time" and "max_time" for each event, you are taking the min and max value across multiple events, which does not make clear sense to me (it may, depending on your use case, which is why I suggested that it would be better if you could provide more details or work more on the query yourself). Moreover, the new schema "flattens" events, so that each event is written on a different row (you can easily check this by running SELECT COUNT(*) FROM 'table_with_old_schema' and comparing it to SELECT COUNT(*) FROM 'table_with_new_schema'; you will see that the second one has many more rows). So your query does not make sense anymore, because events are no longer grouped together, and therefore you cannot pick a minimum and maximum between nested fields.
This being clarified, and having removed some fields that cannot be directly adapted to the new schema (you may be able to adapt them yourself, but that would require some additional effort and an understanding of what those fields meant to you in your previous query), here are two queries that provide exactly the same results when run against the same table with the different schemas:
Query against a table with the old schema:
SELECT
user_dim.app_info.app_instance_id,
event.name,
params.value.int_value engagement_time
FROM
`DATASET.app_events_YYYYMMDD`,
UNNEST(event_dim) AS event,
UNNEST(event.params) AS params,
UNNEST(user_dim.user_properties) AS user_params
WHERE
(event.name = "user_engagement"
AND params.key = "engagement_time_msec")
AND (user_params.key = "plays_quickplay"
AND user_params.value.value.string_value = "true")
ORDER BY 1, 2, 3
Query against the same table, with the new schema:
SELECT
user_pseudo_id,
event_name,
params.value.int_value engagement_time
FROM
`DATASET.events_YYYYMMDD`,
UNNEST(event_params) AS params,
UNNEST(user_properties) AS user_params
WHERE
(event_name = "user_engagement"
AND params.key = "engagement_time_msec")
AND (user_params.key = "plays_quickplay"
AND user_params.value.string_value = "true")
ORDER BY 1, 2, 3
Again, for this I am using the following table from the public dataset: firebase-public-project.com_firebase_demo_ANDROID.app_events_YYYYMMDD, so I had to change some filters and remove some others in order for it to retrieve sensible results against that table. Therefore, feel free to modify or add the ones you need in order for it to be useful for your use case.
I have a large table of events. Per user, I want to count the occurrences of type A events before the earliest type B event.
I am searching for an elegant query. Hive is used, so I can't do subqueries.
Timestamp Type User
... A X
... A X
... B X
... A X
... A X
... A Y
... A Y
... A Y
... B Y
... A Y
Wanted Result:
User Count_Type_A
X 2
Y 3
I can get the "cut-off" timestamp by doing:
Select User, min(Timestamp)
Where Type=B
Group BY User;
But then how can I use that information inside the next query where I want to do something like:
SELECT User, count(Timestamp)
WHERE Type=A AND Timestamp<min(User.Timestamp_Type_B)
GROUP BY User;
My only idea so far is to determine the cut-off timestamps first and then do a join with all type A events and select from the resulting table, but that feels wrong and would look ugly.
I'm also considering the possibility that this is the wrong type of problem/analysis for Hive and that I should consider hand-written MapReduce or Pig instead.
Please help me by pointing in the right direction.
First Update:
In response to Cilvic's first comment to this answer, I've adjusted my query to the following based on workarounds suggested in the comments found at https://issues.apache.org/jira/browse/HIVE-556:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
CROSS JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
WHERE main.[Type] = 'A'
AND (sub.[User] = main.[User])
AND (main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]
Original:
Give this a shot:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
I did my best to follow Hive syntax. Let me know if you have any questions. I would like to know why you wish/need to avoid a subquery.
In general, I +1 coge.soft's solution. Here it is again for your reference:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
However, a couple things to note:
What happens when there are no B events? Assuming you would want to count all the A events per user in that case, an inner join as specified in the solution wouldn't work, since there would be no entry for that user in the sub table. You would need to change it to a left outer join for that (see the sketch after this list).
The solution also does two passes over the data: one to populate the sub table, and another to join the sub table with the main table. Depending on your notion of performance and efficiency, there is an alternative where you could do this in a single pass over the data. You can distribute the data by user using Hive's DISTRIBUTE BY functionality and write a custom reducer that does your count calculation in your favorite language using Hive's TRANSFORM functionality.
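To illustrate the first point, here is an untested sketch of the left-outer-join variant, in the same bracketed placeholder style as above; the NULL check keeps users who never had a B event and counts all of their A events:
SELECT main.[User], COUNT(main.[Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
LEFT OUTER JOIN (SELECT [User], min([Timestamp]) AS [First_B_TS]
                 FROM [Dataset]
                 WHERE [Type] = 'B'
                 GROUP BY [User]) sub
ON (sub.[User] = main.[User])
WHERE main.[Type] = 'A'
AND (sub.[First_B_TS] IS NULL OR main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]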
So I decided to try out PostgreSQL instead of MySQL, but I am having some slight conversion problems. This was a query of mine that samples data from four tables and spits them all out in one result.
I am at a loss as to how to convey this in PostgreSQL, and specifically in Django, but I am leaving that for another question, so bonus points if you can Django-fy it, but no worries if you just pure-SQL it.
SELECT links.id, links.created, links.url, links.title, user.username, category.title, SUM(votes.karma_delta) AS karma, SUM(IF(votes.user_id = 1, votes.karma_delta, 0)) AS user_vote
FROM links
LEFT OUTER JOIN `users` `user` ON (`links`.`user_id`=`user`.`id`)
LEFT OUTER JOIN `categories` `category` ON (`links`.`category_id`=`category`.`id`)
LEFT OUTER JOIN `votes` `votes` ON (`votes`.`link_id`=`links`.`id`)
WHERE (links.id = votes.link_id)
GROUP BY votes.link_id
ORDER BY (SUM(votes.karma_delta) - 1) / POW((TIMESTAMPDIFF(HOUR, links.created, NOW()) + 2), 1.5) DESC
LIMIT 20
The IF in the SELECT was where my first troubles began. It seems it's an IF condition THEN stuff ELSE other stuff END IF, yet I can't get the syntax right. I tried to use Navicat's SQL builder, but it constantly wanted me to place everything I had selected into the GROUP BY, and I think that is all kinds of wrong.
What I am looking for, in summary, is to make this MySQL query work in PostgreSQL. Thank you.
Current Progress
Just want to thank everybody for their help. This is what I have so far:
SELECT links_link.id, links_link.created, links_link.url, links_link.title, links_category.title, SUM(links_vote.karma_delta) AS karma, SUM(CASE WHEN links_vote.user_id = 1 THEN links_vote.karma_delta ELSE 0 END) AS user_vote
FROM links_link
LEFT OUTER JOIN auth_user ON (links_link.user_id = auth_user.id)
LEFT OUTER JOIN links_category ON (links_link.category_id = links_category.id)
LEFT OUTER JOIN links_vote ON (links_vote.link_id = links_link.id)
WHERE (links_link.id = links_vote.link_id)
GROUP BY links_link.id, links_link.created, links_link.url, links_link.title, links_category.title
ORDER BY links_link.created DESC
LIMIT 20
I had to make some table name changes and I am still working on my ORDER BY so till then we're just gonna cop out. Thanks again!
Have a look at this link GROUP BY
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
You need to include in the GROUP BY all the selected columns that are not part of aggregate functions.
A few things:
Drop the backticks
Use a CASE expression instead of IF(): CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END
Change your TIMESTAMPDIFF to DATE_TRUNC('hour', now()) - DATE_TRUNC('hour', links.created) (you will then need to count the number of hours in the resulting interval; it would be much easier to compare timestamps. A sketch of one way to do this follows after this list.)
Fix your GROUP BY and ORDER BY
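As a rough, untested sketch of how the original MySQL ranking expression might translate to PostgreSQL (using the table names from your "current progress" query, and EXTRACT(EPOCH ...) to get the age in hours):
ORDER BY (SUM(links_vote.karma_delta) - 1)
         / POWER(EXTRACT(EPOCH FROM (now() - links_link.created)) / 3600 + 2, 1.5) DESC
This should be valid here because links_link.created is already in your GROUP BY, so it can be referenced in the ORDER BY.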
Try replacing the IF with a CASE:
SUM(CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END)
You also have to explicitly list every column or calculated column you select in the GROUP BY clause.