Extracting DAU, MAU using BigQuery - google-bigquery

I'm trying to extract Firebase Analytics DAU and MAU using BigQuery. The query I'm using for daily users is below:
SELECT
event_date AS day,
COUNT(DISTINCT user_id) AS daily_visitors
FROM `XXXXXXX.analytics_153729556.events_20190825`
WHERE
app_info.id = 'XXXXXXX'
AND
event_name = 'user_engagement'
GROUP BY day;
I have a few questions I would love some help with.
There is a significant (2000+) difference between the query result and the value the Firebase dashboard shows for the same date(s). Is there a specific reason for this, or is my query just plain wrong?
In some cases I see dates other than the one in the selected table's name. For example, I see 20190502 in the results when 20190501 should be the only value (based on the table name). Is this possibly because the events being dumped into the table come from an app in a different timezone? If not, what else could be the reason?
I also want to extract historical MAU and DAU data and store it in MongoDB for any future requirements that may arise. Is there a specific way to extract them (after overcoming the problem above, of course)?
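As a rough illustration of the two metrics being asked about, here is a minimal Python sketch that counts distinct users per day and over a rolling window. The function names, the sample rows, and the choice of a 30-day rolling window for MAU are assumptions made for illustration; the Firebase dashboard may define its figures differently, which could itself explain part of a mismatch.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_active_users(events):
    """Map each date to its number of distinct user IDs."""
    users_by_day = defaultdict(set)
    for day, user_id in events:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


def rolling_active_users(events, end_day, window_days=30):
    """Count distinct users seen in the window_days ending on end_day (inclusive)."""
    start_day = end_day - timedelta(days=window_days - 1)
    return len({user_id for day, user_id in events if start_day <= day <= end_day})


# Hypothetical (event date, user ID) pairs, e.g. as exported from a query result.
events = [
    (date(2019, 8, 24), "u1"),
    (date(2019, 8, 25), "u1"),
    (date(2019, 8, 25), "u2"),
    (date(2019, 8, 25), "u2"),  # duplicate event, counted once
    (date(2019, 7, 1), "u3"),   # outside a 30-day window ending 2019-08-25
]

dau = daily_active_users(events)
mau = rolling_active_users(events, date(2019, 8, 25))
```

Each (day, count) pair produced this way is a small record that could be stored in whatever database is used for the historical data.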

Related

Grafana User Growth Time Series with SQL Server

I'm trying to display the user growth per day using Grafana Time Series with SQL Server. However, I found the documentation unhelpful and my queries are incorrect.
The following returns a constant value of 1 for every day. What do I need to change to display the number of new users created per day?
Thank you very much in advance.
SELECT
$__timeGroup([created_at],'1d') as time,
COUNT(id) as value,
'users' as metric
FROM [db].[user]
WHERE $__timeFilter([created_at])
GROUP BY [created_at]
ORDER BY 1
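This works for me. The key change is grouping by the same $__timeGroup expression that is selected, rather than by the raw [created_at] column, which puts each distinct timestamp in its own group: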
SELECT
$__timeGroup(created_at, '1d') AS time,
COUNT(id) as 'New Users'
FROM [db].[user]
GROUP BY $__timeGroup(created_at, '1d')
ORDER BY 1
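To see why the original query returned 1 for every row, here is a small sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table name `users`, the sample timestamps, and SQLite's `date()` function (playing the role of the daily `$__timeGroup` bucket) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO users (created_at) VALUES (?)",
    [
        ("2021-03-01 08:15:00",),
        ("2021-03-01 17:40:12",),
        ("2021-03-02 09:05:30",),
    ],
)

# Grouping by the raw timestamp: every distinct timestamp is its own group.
per_timestamp = conn.execute(
    "SELECT created_at, COUNT(id) FROM users GROUP BY created_at ORDER BY 1"
).fetchall()

# Grouping by the day bucket: rows from the same day are counted together.
per_day = conn.execute(
    "SELECT date(created_at) AS day, COUNT(id) FROM users "
    "GROUP BY date(created_at) ORDER BY 1"
).fetchall()
```

Here `per_timestamp` holds one row per timestamp with a count of 1 each, while `per_day` holds one row per calendar day with the number of users created on that day.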

Use SQL to ensure I have data for each day of a certain time period

I want to select one data point for each date in my report, to confirm that every day is accounted for and has at least one row. We loaded a large data file into our data warehouse in several ways (one large Google Sheet import for some of the data, daily Python pulls for the rest), and I want to make sure no date was left out. The data runs from last summer through now. I could use COUNT(DISTINCT ...) to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify each day individually. I should mention that I am in BigQuery. An example of the created_at format is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
Note: I know FIRST is probably not the best function for this case, and it currently just grabs the very first value of created_at, but it serves as a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
The BigQuery equivalent of the query in your question is:
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
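The queries above return one row per day that is present, but they do not directly list the days that are absent. A minimal Python sketch of that check is below: build the full calendar range and subtract the dates that appear in the data. The function name and the sample timestamps are hypothetical.

```python
from datetime import date, datetime, timedelta


def missing_days(timestamps, first_day, last_day):
    """Return every calendar day in [first_day, last_day] that has no timestamp."""
    seen = {ts.date() for ts in timestamps}
    total = (last_day - first_day).days + 1
    all_days = (first_day + timedelta(days=i) for i in range(total))
    return [day for day in all_days if day not in seen]


# Hypothetical created_at values pulled from the table.
created_at = [
    datetime(2021, 2, 7, 9, 30),
    datetime(2021, 2, 9, 17, 5, 44),
    datetime(2021, 2, 9, 23, 59),
    datetime(2021, 2, 10, 0, 1),
]

gaps = missing_days(created_at, date(2021, 2, 7), date(2021, 2, 10))
```

An empty `gaps` list would mean every day in the range has at least one row. The same idea could probably be expressed inside BigQuery by generating the date range with GENERATE_DATE_ARRAY and left-joining the table onto it.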

Data collection from GDELT using bigquery

I am trying to construct an economic indicator based on all events with specific CAMEO codes from the GDELT database.
The idea is to collect data from 1990 to date and see how economic cooperation varied based on news appearances of certain words, specifically CAMEO codes 0211, 0311, 061, 1011 and 1211.
My question is how to extract the data for these specific CAMEO codes. If you can direct me to any source, it would be of great help.
Someone suggested that I try BigQuery. I honestly don't know how to navigate the Google BigQuery page yet (I tried my best, but coming from a non-tech background it was a bit overwhelming). If anyone can help with a data extraction example for one CAMEO code, I can then play around with the other events.
Edit: I am editing to show the progress I have made and the issues I am facing while running the query.
SELECT
*
FROM
[gdelt-bq:full.events]
WHERE
Year >= 1979
AND EventCode IN ('0211', '0311','061', '1011', '1211')
AND Actor1CountryCode != Actor2CountryCode
This query will process 228 GB when run, and it also excludes the cases where both country codes are null. It returns over 2 million rows, and I can't download that as a CSV file from the BigQuery platform.
The part where I need help is the following:
Is there any way to get the total number of events for each event code that satisfy the following conditions?
Actor1CountryCode and Actor2CountryCode should be different, except when they are null.
A count for each event code every month that satisfies the above condition.
PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.
Edit2: There is another query I am trying to write, in which the AvgTone of an event with a specified code is greater than the average AvgTone of all events in that particular month. Any leads on how to write this would be really helpful. Suppose I add a WHERE clause requiring AvgTone to be greater than the average AvgTone of all events for that period (MonthYear in this case). My question is how to express this as a query.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211',
'0311',
'061')
AND Actor1CountryCode != Actor2CountryCode
AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
GROUP BY
MonthYear
ORDER BY
MonthYear
Error: ELEMENT can only be applied to result with 0 or 1 row.
Can someone help me with the above query? Thanks.
The GDELT database is available in BigQuery.
Here is a link to their available datasets. Your first step would be to identify which one contains the information you are interested in:
https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
Then this section of the site contains sample queries, which you can use as a starting point and tweak to your needs (note that these examples appear to be mostly in Legacy SQL; I would suggest you use them as a guide and rewrite them in Standard SQL):
https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/
If you have any specific SQL/BigQuery questions after you have done this, I would recommend you come back with fresh questions and share examples of your working code, details of what you have already tried, and the results you expect to see.
Having had a quick look (and I must say I am not familiar with the dataset), this may be a simple query that can start you on your way:
-- first we select all columns from the event dataset, which seems
-- to be the one you want, containing cameo codes
SELECT * FROM `gdelt-bq.full.events`
-- then we add a filter to only look at events in or after 1990
WHERE Year >= 1990
-- and another filter to look at only the specific CAMEO
-- codes you provided (I think EventCode is the correct column here)
AND EventCode IN ('0211','0311','061','1011','1211')
-- finally, we add a limit to our query, so we don't bring back ALL
-- the results while testing, once we are happy with our query, we'd remove this!
LIMIT 100
Finally, the GDELT tag right here on StackOverflow contains some really great content.
Hope that helps, GDELT looks like a fascinating project!
I finally figured out a way to extract data from GDELT using bigquery. Although the query is very simple, my lack of SQL knowledge made it difficult. Thanks to Ben who provided the initial help. Following are the queries which satisfy the conditions given in the question.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode IS NULL
AND Actor2CountryCode IS NULL
GROUP BY
MonthYear
ORDER BY
MonthYear
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode != Actor2CountryCode
GROUP BY
MonthYear
ORDER BY
MonthYear
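The "ELEMENT can only be applied to result with 0 or 1 row" error in the Edit2 query comes from comparing AvgTone against a subquery that returns one average per month rather than a single value. One way around it, sketched here with Python's built-in sqlite3 module as a local stand-in for BigQuery, is to join each event to the average for its own MonthYear. The tiny sample table is made up, and treating "different except when they are null" as "different, or both NULL" is an assumption about the intended condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (MonthYear INTEGER, EventCode TEXT, "
    "Actor1CountryCode TEXT, Actor2CountryCode TEXT, AvgTone REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (199001, "0211", "USA", "CHN", 5.0),
        (199001, "0211", "USA", "FRA", 1.0),
        (199001, "190", "USA", "USA", 0.0),  # other code, still part of the monthly average
        (199002, "061", None, None, 6.0),    # both actors NULL, kept
        (199002, "061", "DEU", "DEU", 9.0),  # same country, excluded
        (199002, "190", "RUS", "GBR", 2.0),
    ],
)

rows = conn.execute(
    """
    SELECT e.MonthYear, COUNT(*)
    FROM events AS e
    JOIN (
        -- one average per month, computed over all events
        SELECT MonthYear, AVG(AvgTone) AS avg_tone
        FROM events
        GROUP BY MonthYear
    ) AS m ON m.MonthYear = e.MonthYear
    WHERE e.EventCode IN ('0211', '0311', '061')
      AND (e.Actor1CountryCode != e.Actor2CountryCode
           OR (e.Actor1CountryCode IS NULL AND e.Actor2CountryCode IS NULL))
      AND e.AvgTone > m.avg_tone
    GROUP BY e.MonthYear
    ORDER BY e.MonthYear
    """
).fetchall()
```

The explicit IS NULL branch is needed because a comparison such as NULL != NULL evaluates to NULL rather than true, so those rows are otherwise dropped. The join-on-aggregate pattern should carry over to BigQuery, though the table reference and syntax details would need adapting.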

What is the meaning of event_params value for firebase conversion?

I am researching Firebase conversions in BigQuery and still don't understand what a conversion means here.
I ran this query to check the value of the 'firebase_conversion' key and saw that every value is 1.
Does this value mean that the event is marked as a conversion in Firebase?
SELECT event_name, event_params.value.int_value FROM [firebase-public-project:analytics_153293282.events_20181003] where event_params.key = "firebase_conversion"
Is anyone familiar with conversions?
Could you help explain how Firebase calculates the conversion rate, and how we could calculate it in BigQuery?
On top of the documentation rtenha mentioned, you can also find a dedicated section on Firebase in BigQuery in [1]. It even has some SQL examples for exploring Firebase data with BigQuery.
As you say, a value of 1 in event_params.value.int_value indicates that the event is marked as a conversion, which is useful when counting events of that type.
To calculate the conversion rate, divide the number of users who performed some type of conversion by the total number of users.
Here is an SQL example [2] that would:
1-create a table with a single cell: the total number of users in the desired period
2-create a table with the number of users that performed each of the events marked as conversions
3-select, for each type of event, the ratio of users that performed that conversion to the total number of users
I hope this finds you well!
[1] https://support.google.com/firebase/answer/9037342?hl=en&ref_topic=7029512
[2]
-- table_of_events stands for a wildcard table (one whose name ends in *),
-- which is what _TABLE_SUFFIX requires
WITH t_e AS (
  -- total number of distinct users in the period
  SELECT COUNT(DISTINCT user_id) AS total_users
  FROM table_of_events
  WHERE event_timestamp >
      UNIX_MICROS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY))
    AND _TABLE_SUFFIX BETWEEN '20180501' AND '20180511'),
t_c AS (
  -- distinct users per event marked as a conversion
  SELECT event_name, COUNT(DISTINCT user_id) AS total_conversions
  FROM table_of_events, UNNEST(event_params) AS ep
  WHERE ep.key = 'firebase_conversion'
    AND event_timestamp >
      UNIX_MICROS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY))
    AND _TABLE_SUFFIX BETWEEN '20180501' AND '20180511'
  GROUP BY event_name)
SELECT event_name, t_c.total_conversions / t_e.total_users AS conversion_rate
FROM t_c, t_e
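The three steps can also be checked on a small sample outside BigQuery. Here is a minimal Python sketch of the same ratio, counting distinct users rather than events; the tuple layout, function name, and sample data are assumptions made for illustration.

```python
from collections import defaultdict


def conversion_rates(events):
    """events: iterable of (user_id, event_name, is_conversion) tuples.

    Returns {event_name: converting users / total users}.
    """
    all_users = set()
    converters = defaultdict(set)
    for user_id, event_name, is_conversion in events:
        all_users.add(user_id)
        if is_conversion:
            converters[event_name].add(user_id)
    if not all_users:
        return {}
    return {name: len(users) / len(all_users) for name, users in converters.items()}


# Hypothetical events; the flag mirrors firebase_conversion = 1.
events = [
    ("u1", "session_start", False),
    ("u1", "purchase", True),
    ("u1", "purchase", True),  # repeated conversion by the same user counts once
    ("u2", "session_start", False),
    ("u3", "purchase", True),
    ("u4", "sign_up", True),
]

rates = conversion_rates(events)
```

With four distinct users in total, two of whom purchased and one of whom signed up, the rates come out as 0.5 for purchase and 0.25 for sign_up.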

BigQuery: SELECT in WHERE-clause with filter based on a value in the current row

I know the title is probably pretty stupid but I have a hard time phrasing it differently.
I have to use BigQuery at work at the moment for a report. BigQuery is connected to one of our Google Analytics views, which gives us a dataset with one table for each day. The rows of the tables are user sessions on our site, and the columns hold information about those sessions.
The problem I have is the following:
I want to select sessions with transactions, but only if the user was referred to our site by a certain referrer in the last x days before the transaction happened. I'm only familiar with basic SQL and not with any advanced concepts. It's really frustrating to me because this would be a no-brainer with any proper programming language given a .csv of the data, but I'm lacking knowledge of the relevant concepts in SQL.
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
fullVisitorId IN (SELECT
fullVisitorId
FROM
`dataset.ga_sessions_2017*`
WHERE
trafficSource.source = "xyz.com"
) AND
< date difference thing>
I could filter for the date difference like I did with the trafficSource (referrer). The problem for me is that while "xyz.com" is a static thing, I'd need to reference the date value of the current row I'm in. So the date by which I'd filter the 2nd SELECT would be dynamically changing from row to row. Can anyone guide me on how this is usually done? This seems like a thing that would come up often.
I'm not familiar with the GA tables specifically, but having written some wildcard queries in BigQuery before, I think what you're looking for can be done using the _TABLE_SUFFIX pseudo column:
CAST(_TABLE_SUFFIX AS INT64) >= 1217
Here, 1217 is the date three days before today in MMDD format, assuming the table names are _20171217, _20171218, etc. Otherwise you can just use REPLACE to remove underscores before casting to an int. There are also functions that will generate today's date for you if you need this query to run automatically.
Also, I think the fullVisitorId business could be replaced with a simple WHERE trafficSource.source = "xyz.com" but it's hard to say for sure without being able to run the query myself.
So the full query would look something like this:
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
trafficSource.source = "xyz.com" AND
CAST(_TABLE_SUFFIX AS INT64) >= 1217
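The query above filters on a fixed date. If the lookback window has to be measured from each transaction's own date, as the question describes, one common pattern is a correlated subquery (or self-join) that references the outer row. The sketch below uses Python's built-in sqlite3 module as a local stand-in; the flattened `sessions` table, its column names, and the sample rows are assumptions, since the real GA export nests fields such as totals.transactions and trafficSource.source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (fullVisitorId TEXT, session_date TEXT, "
    "source TEXT, transactions INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?, ?)",
    [
        ("A", "2017-12-10", "xyz.com", 0),  # referral 2 days before A's purchase
        ("A", "2017-12-12", "direct", 1),
        ("B", "2017-11-01", "xyz.com", 0),  # referral too long before B's purchase
        ("B", "2017-12-12", "direct", 1),
        ("C", "2017-12-12", "direct", 1),   # never referred by xyz.com
    ],
)

LOOKBACK_DAYS = 3  # the "x days" from the question

count = conn.execute(
    """
    SELECT COUNT(*)
    FROM sessions AS t
    WHERE t.transactions > 0
      AND EXISTS (
        SELECT 1
        FROM sessions AS r
        WHERE r.fullVisitorId = t.fullVisitorId
          AND r.source = 'xyz.com'
          -- the inner query refers to the outer row's date here
          AND julianday(t.session_date) - julianday(r.session_date)
              BETWEEN 0 AND ?
      )
    """,
    (LOOKBACK_DAYS,),
).fetchone()[0]
```

Only visitor A's transaction is counted, because it is the only one preceded by an xyz.com session within the window. In BigQuery the date arithmetic would use its own date functions instead of julianday, so treat this as an outline of the pattern rather than a drop-in query.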