Why are totals.hits and the last hits.hitNumber different? - google-bigquery

I am looking at GA360 data in BigQuery.
According to the BigQuery export schema, I think totals.hits and the largest hits.hitNumber in a session should be the same.
But I find that in some rows hits.hitNumber is greater than totals.hits.
Could anybody explain why? Thanks.
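To make the mismatch concrete, a query along these lines lists the affected sessions. This is only a sketch in Standard SQL: the project, dataset, and table date (`my-project.my_dataset.ga_sessions_20190825`) are placeholders, not the real export table.

```sql
-- List sessions whose largest hits.hitNumber exceeds totals.hits.
-- Table name is a placeholder; point it at your own GA360 export.
SELECT
  fullVisitorId,
  visitId,
  totals.hits AS total_hits,
  (SELECT MAX(h.hitNumber) FROM UNNEST(hits) AS h) AS max_hit_number
FROM `my-project.my_dataset.ga_sessions_20190825`
WHERE totals.hits < (SELECT MAX(h.hitNumber) FROM UNNEST(hits) AS h)
ORDER BY fullVisitorId, visitId
```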

Related

How does partitioning in BigQuery work?

Hi All: I am trying to understand how the partitioned tables work. I have a sales table of size 12.9MB. I have a date column that is partitioned by day. My assumption is that when I filter the data table using this date column, the amount of data processed by BigQuery will be optimized. However, it doesn’t seem to work that way, and I would like to understand the reason.
In the below query, I am filtering sales.date using a subquery. When I try to execute the query as such, it is processing the entire table of 12.9 MB.
However, if I replace the below subquery with the actual date (the same result that we have from the subquery), then the amount of data processed is 4.9 MB.
The subquery alone processes 630 KB of data. If my understanding is right, shouldn’t the below given query process 4.9 MB + 630 KB = ~ 5.6 MB? But, it still processes 12.9 MB. Can someone explain what’s happening here?
SELECT
  sales.*
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(
  DATE((SELECT MAX(temp.date) FROM `my-project.transaction_data.sales_table` temp)),
  INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date
Can someone explain what’s happening here?
This is expected behavior.
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
See more at Querying partitioned tables.
A possible workaround is to use scripting: first calculate the actual date and assign it to a variable, then use that variable in the query, thus eliminating the subquery.
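A minimal sketch of that scripting workaround, reusing the table and column names from the question. The variable name `cutoff_date` is made up here. The expectation is that the second statement is pruned the same way as the hard-coded date; check the bytes-processed estimate to confirm it on your table.

```sql
-- Step 1: resolve the cutoff date once and keep it in a script variable.
DECLARE cutoff_date DATE;

SET cutoff_date = (
  SELECT DATE_SUB(DATE(MAX(temp.date)), INTERVAL 2 YEAR)
  FROM `my-project.transaction_data.sales_table` AS temp
);

-- Step 2: filter on the variable instead of a subquery, so the
-- predicate is known before the main table scan starts.
SELECT
  sales.*
FROM `my-project.transaction_data.sales_table` AS sales
WHERE DATE(sales.date) >= cutoff_date
ORDER BY sales.customer, sales.date;
```

Both statements are billed, so the total should be roughly the cost of the small MAX query plus the pruned scan.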

Data collection from GDELT using bigquery

I am trying to construct an economic indicator based on all events with specific CAMEO codes from the GDELT database.
The idea is to collect data from 1990 to date and see how economic cooperation varied based on news appearances of certain words, specifically CAMEO codes 0211, 0311, 061, 1011 and 1211.
My question is how to extract the data for these specific CAMEO codes. If you can direct me to any source, it would be of great help.
Someone suggested that I try BigQuery. I honestly don't know how to navigate the Google BigQuery page yet (I tried my best, but coming from a non-tech background it was a bit overwhelming). If any of you can help with a data extraction example for one CAMEO code, then I can play around with the other events.
Edit: I am editing to show the progress I have made and the issues I am facing while running the query.
SELECT
*
FROM
[gdelt-bq:full.events]
WHERE
Year >= 1979
AND EventCode IN ('0211', '0311','061', '1011', '1211')
AND Actor1CountryCode != Actor2CountryCode
This query will process 228 GB when run, and it also excludes the cases where both country codes are null. The result has over 2 million rows and I can't download it as a CSV file from the BigQuery platform.
The part where I need help is the following:
Is there any way to get the total number of events for each event code that satisfies the following conditions?
Actor1CountryCode and Actor2CountryCode should be different, except when they are null.
The count should be per event code per month, for the events which satisfy the above condition.
PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.
Edit2: There is another query that I am trying to write, in which the AvgTone of an event with a specified code is greater than the average AvgTone of all events in that particular month. Any leads on how to write this would be really helpful. Suppose I add a WHERE clause requiring AvgTone to be greater than the average AvgTone of all events for that particular period (MonthYear in this case). My question is how to express this as a query.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211',
'0311',
'061')
AND Actor1CountryCode != Actor2CountryCode
AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
GROUP BY
MonthYear
ORDER BY
MonthYear
Error: ELEMENT can only be applied to result with 0 or 1 row.
Can someone help me with the above query? Thanks.
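The error most likely comes from the subquery returning one average per month while the comparison expects a single value. One way around it, sketched below in Standard SQL rather than Legacy SQL, is to compute the monthly averages once and join them back on MonthYear. The names `monthly_avg`, `avg_tone`, and `event_count` are introduced here for illustration only.

```sql
-- Per-month average tone over all events, computed once.
WITH monthly_avg AS (
  SELECT
    MonthYear,
    AVG(AvgTone) AS avg_tone
  FROM `gdelt-bq.full.events`
  GROUP BY MonthYear
)
-- Count the selected events whose tone is above their month's average.
SELECT
  e.MonthYear,
  COUNT(*) AS event_count
FROM `gdelt-bq.full.events` AS e
JOIN monthly_avg AS m
  ON e.MonthYear = m.MonthYear
WHERE e.EventCode IN ('0211', '0311', '061')
  AND e.Actor1CountryCode != e.Actor2CountryCode
  AND e.AvgTone > m.avg_tone
GROUP BY e.MonthYear
ORDER BY e.MonthYear
```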
The GDELT database is available in BigQuery.
Here is a link to their available datasets. Your first step would be to identify which one contains the information you are interested in:
https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
Then this section of the site contains sample queries, which you can use as a starting point and tweak to your needs (note that these examples appear to be mostly in Legacy SQL; I would suggest you use them as a guide and rewrite them in Standard SQL):
https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/
If you have any specific SQL/BigQuery questions after you have done this, I would recommend you come back with fresh questions and share examples of your working code, details of what you have already tried, and the results you expect to see.
I have only had a quick look, and I must say I am not familiar with the dataset, but this may be a simple query that can start you on your way:
-- first we select all columns from the event dataset, which seems
-- to be the one you want, containing cameo codes
SELECT * FROM `gdelt-bq.full.events`
-- then we add a filter to only look at events in or after 1990
WHERE Year >= 1990
-- and another filter to look at only the specific CAMEO
-- codes you provided (I think EventCode is the correct column here)
AND EventCode IN ('0211','0311','061','1011','1211')
-- finally, we add a limit to our query, so we don't bring back ALL
-- the results while testing, once we are happy with our query, we'd remove this!
LIMIT 100
Finally, the GDELT tag right here on StackOverflow contains some really great content.
Hope that helps, GDELT looks like a fascinating project!
I finally figured out a way to extract data from GDELT using bigquery. Although the query is very simple, my lack of SQL knowledge made it difficult. Thanks to Ben who provided the initial help. Following are the queries which satisfy the conditions given in the question.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode IS NULL
AND Actor2CountryCode IS NULL
GROUP BY
MonthYear
ORDER BY
MonthYear
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode != Actor2CountryCode
GROUP BY
MonthYear
ORDER BY
MonthYear
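As a possible follow-up, the two queries above could be folded into a single Standard SQL query that also splits the counts per event code, which is what the question asked for. This is a sketch only: it assumes the Standard SQL table name `gdelt-bq.full.events` from the earlier answer, uses the five codes and the 1990 start year from the question, and `event_count` is an alias introduced here.

```sql
-- Monthly event counts per CAMEO code, keeping rows where the two
-- country codes differ or where both are NULL.
SELECT
  MonthYear,
  EventCode,
  COUNT(*) AS event_count
FROM `gdelt-bq.full.events`
WHERE Year >= 1990
  AND EventCode IN ('0211', '0311', '061', '1011', '1211')
  AND (
    Actor1CountryCode != Actor2CountryCode
    OR (Actor1CountryCode IS NULL AND Actor2CountryCode IS NULL)
  )
GROUP BY MonthYear, EventCode
ORDER BY MonthYear, EventCode
```

Rows where only one of the two country codes is NULL are dropped, because the `!=` comparison evaluates to NULL for them.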

Extracting DAU, MAU using BigQuery

I'm trying to extract Firebase Analytics DAU and MAU using BigQuery. The query I'm using for daily users is below -
SELECT
event_date AS day,
COUNT(DISTINCT user_id) AS daily_visitors
FROM `XXXXXXX.analytics_153729556.events_20190825`
WHERE
app_info.id = 'XXXXXXX'
AND
event_name = 'user_engagement'
GROUP BY day;
I have a few questions I would love some help with.
There is a significant (2000+) difference between the value from the query result and the value the Firebase dashboard shows for the same date(s). Is there a specific reason for this, or is my query just plain wrong?
There are instances where I see dates other than that of the table selected. For example, I see 20190502 in the results when 20190501 should be the only row (based on the table name). Is this possibly because the events being dumped into the table are for an app in a different timezone? If not, what else could be the reason behind this?
I also want to extract historical MAU and DAU data, and store it on MongoDB for any future requirements that may arise. Is there a specific way in which I can extract them - after overcoming the problem I'm facing, of course?
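For the historical part, one option is to query all daily tables through a wildcard. The sketch below makes two assumptions that are not in the question: it counts `user_pseudo_id` (the app-instance ID in the Firebase export) instead of `user_id`, which is only populated when the app sets it explicitly, and it uses August 2019 as an example range. The project and dataset names are copied from the query above.

```sql
-- Daily active users for each day in the range.
SELECT
  event_date AS day,
  COUNT(DISTINCT user_pseudo_id) AS daily_users
FROM `XXXXXXX.analytics_153729556.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190801' AND '20190831'
  AND event_name = 'user_engagement'
GROUP BY day
ORDER BY day;

-- Monthly active users: distinct users over the whole month.
SELECT
  SUBSTR(event_date, 1, 6) AS month,
  COUNT(DISTINCT user_pseudo_id) AS monthly_users
FROM `XXXXXXX.analytics_153729556.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190801' AND '20190831'
  AND event_name = 'user_engagement'
GROUP BY month;
```

The results could then be exported and loaded into MongoDB. The numbers may still not match the Firebase dashboard exactly.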

Google Analytics 'User Count' not Matching Big Query 'User Count'

Our Google Analytics 'User Count' is not matching our Big Query 'User Count.'
Am I calculating it correctly?
Typically, GA and BQ align very closely, albeit not exactly.
Recently, User Counts in GA vs. BQ have been incongruous.
Our number of 'Sessions per User' typically has a very normal distribution.
In the last 4 weeks, 'Sessions per User' (in GA) has been several deviations from the norm.
I cannot replicate this deviation when cross-checking data from the same time period in BQ.
The difference lies in the User Counts.
What I'm hoping someone can answer is:
Am I at least using the correct SQL syntax to get to the answer in BQ?
This is the query I’m running in BQ:
SELECT
WEEK(Week) AS Week,
Week AS Date_Week,
Total_Sessions,
Total_Users,
Total_Pageviews,
( Total_Time_on_Site / Total_Sessions ) AS Avg_Session_Duration,
( Total_Sessions / Total_Users ) AS Sessions_Per_User,
( Total_Pageviews / Total_Sessions ) AS Pageviews_Per_Session
FROM
(
SELECT
FORMAT_UTC_USEC(UTC_USEC_TO_WEEK (date,1)) AS Week,
COUNT(DISTINCT CONCAT(STRING(fullVisitorId), STRING(VisitID)), 1000000) AS Total_Sessions,
COUNT (DISTINCT(fullVisitorId), 1000000) AS Total_Users,
SUM(totals.pageviews) As Total_Pageviews,
SUM(totals.timeOnSite) AS Total_Time_on_Site,
FROM
(
TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
TIMESTAMP('2015-02-09'),
TIMESTAMP('2015-04-12'))
)
GROUP BY Week
)
GROUP BY Week, Date_Week, Total_Sessions, Total_Users, Total_Pageviews, Avg_Session_Duration, Sessions_Per_User, Pageviews_Per_Session
ORDER BY Week ASC
We have well under 1,000,000 users/sessions/etc a week.
Throwing that 1,000,000 into the Count Distinct clause should be preventing any sampling on BQ’s part.
Am I doing this correctly?
If so, any suggestion on how/why GA would be reporting differently is welcome.
Cheers.
*(Statistically) significant discrepancies begin in Week 11
Update:
We have Premium Analytics, as @Pentium10 suggested. So, I reached out to their paid support.
Now when I pull the exact same data from GA, I get this:
Looks to me like GA has now fixed the issue.
Without actually admitting there ever was one.
::shrug::
I had this problem before. The way I fixed it was by using COUNT(DISTINCT FULLVISITORID) for total_users.
In standard SQL use COUNT(DISTINCT fullVisitorId)
Google Analytics shows an approximation for users, Big Query is exact. You can test this with unsampled reports in Google Analytics - numbers will match.
Also: GA uses all available data to count users, even where totals.visits is NULL!
In contrast GA counts sessions only where totals.visits = 1!
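Putting those two points together, a Standard SQL version of the user and session counts might look like the sketch below. The table path `my-project.my_dataset.ga_sessions_*` is a placeholder; the date range is the one from the question.

```sql
SELECT
  -- Users: every fullVisitorId counts, even when totals.visits is NULL.
  COUNT(DISTINCT fullVisitorId) AS total_users,
  -- Sessions: only rows with totals.visits = 1 count.
  COUNT(DISTINCT IF(
    totals.visits = 1,
    CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)),
    NULL)) AS total_sessions
FROM `my-project.my_dataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20150209' AND '20150412'
```

In Standard SQL, COUNT(DISTINCT ...) is exact, so the 1000000 threshold argument from the Legacy SQL query is not needed.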

Sql Queries for finding the sales trend

Suppose I have a table which has all the billing records. Now I want to see the sales trend for a user over a given time period, grouped into 3-day intervals. What should the SQL query for this be?
Please help, otherwise I am gone...
I can only give a vague suggestion given the question as asked. However, you may want a derived column holding a standardised date (as in the MS date format, just a number per day) that you then integer-divide by 3, so that all days in the same 3-day period get the same value. You can then group and aggregate over this column to get the values for each 3-day period. Obviously, to display the date nicely you would have to multiply back by 3 and convert the column to a date as well.
Again, I'm not sure of the specifics, but I think this general idea would get you a result (it may well not be the best way, so it would help to add more detail to the question...)
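A rough sketch of that idea in BigQuery Standard SQL. Since the question gives no schema, everything here is hypothetical: the table `my-project.my_dataset.billing`, its columns `user_id`, `bill_date` (a DATE), and `amount`, and the example date range.

```sql
-- Bucket each bill into a 3-day period counted from the range start,
-- then sum sales per bucket.
SELECT
  DATE_ADD(
    DATE '2024-01-01',
    INTERVAL (DIV(DATE_DIFF(bill_date, DATE '2024-01-01', DAY), 3) * 3) DAY
  ) AS period_start,
  SUM(amount) AS total_sales
FROM `my-project.my_dataset.billing`
WHERE user_id = 'some-user'
  AND bill_date BETWEEN DATE '2024-01-01' AND DATE '2024-03-31'
GROUP BY period_start
ORDER BY period_start
```

DATE_DIFF gives the day number relative to the start of the range, DIV by 3 gives the bucket index, and multiplying back by 3 turns the index into the first date of each period.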