BigQuery: SELECT in WHERE-clause with filter based on a value in the current row - sql

I know the title is probably pretty stupid but I have a hard time phrasing it differently.
I have to use BigQuery at work atm for some report. BigQuery is connected to a Google Analytics view of ours. This gives us a dataset with 1 table for each day. The rows of the tables are user-sessions on our site, while columns have some information about the sessions.
The problem I have is the following:
I want to select sessions with transactions, but only if the user was referred to our site by a certain referrer in the last x days before the transaction happened. I'm only familiar with basic SQL and not with any advanced concepts. It's really frustrating to me because this would be a no-brainer with any proper programming language given a .csv of the data, but I'm lacking knowledge of the relevant concepts in SQL.
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
fullVisitorId IN (SELECT
fullVisitorId
FROM
`dataset.ga_sessions_2017*`
WHERE
trafficSource.source = "xyz.com"
) AND
< date difference thing>
I could filter for the date difference like I did with the trafficSource (referrer). The problem for me is that while "xyz.com" is a static thing, I'd need to reference the date value of the current row I'm in. So the date by which I'd filter the 2nd SELECT would be dynamically changing from row to row. Can anyone guide me on how this is usually done? This seems like a thing that would come up often.

I'm not familiar with the GA tables specifically, but having written some wildcard queries in BigQuery before, I think what you're looking for can be done using the _TABLE_SUFFIX pseudo column:
CAST(_TABLE_SUFFIX AS INT64) >= 1217
Here 1217 is the date three days before today in MMDD format, assuming the tables are named ga_sessions_20171217, ga_sessions_20171218, etc., so that the wildcard leaves only the MMDD part as the suffix. If the suffix contains other characters such as underscores, you can use REPLACE to remove them before casting to an integer. There are also functions that generate today's date for you if this query needs to run automatically.
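As a rough sketch of the automatic variant: assuming the wildcard is `dataset.ga_sessions_2017*`, so that _TABLE_SUFFIX is a zero-padded MMDD string such as '1217', the cutoff could be derived from the current date instead of being hard-coded. The 3-day window is only the example value used in this answer.

```sql
#standardSQL
-- Sketch: count transaction sessions in the daily tables of the last 3 days.
-- Assumes _TABLE_SUFFIX is a zero-padded MMDD string (e.g. '1217'), so a
-- plain string comparison orders the suffixes the same way as the dates.
SELECT
  COUNT(*) AS sessions
FROM
  `dataset.ga_sessions_2017*`
WHERE
  totals.transactions > 0
  AND _TABLE_SUFFIX >= FORMAT_DATE('%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
```

Because the year is fixed in the wildcard, this only behaves as intended while the current date falls in 2017; a wildcard spanning several years would need the year in the format string as well (for example '%Y%m%d').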
Also, I think the fullVisitorId business could be replaced with a simple WHERE trafficSource.source = "xyz.com" but it's hard to say for sure without being able to run the query myself.
So the full query would look something like this:
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
trafficSource.source = "xyz.com" AND
CAST(_TABLE_SUFFIX AS INT64) >= 1217
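Note that the query above applies one fixed date cutoff and checks the referrer of the transaction session itself. If the goal is the one stated in the question (a referral from "xyz.com" within the x days before each transaction), one possible approach is a self-join on fullVisitorId. This is a minimal sketch, not tested against real data: it assumes the standard GA export fields visitStartTime (session start in POSIX seconds) and visitId, and uses 7 days as a stand-in for x.

```sql
#standardSQL
-- Sketch: sessions with a transaction whose visitor had a session referred
-- by xyz.com at most 7 days (an assumed value for "x") before the transaction.
WITH referrals AS (
  SELECT
    fullVisitorId,
    visitStartTime AS referred_at   -- POSIX seconds
  FROM
    `dataset.ga_sessions_2017*`
  WHERE
    trafficSource.source = 'xyz.com'
)
SELECT
  -- A transaction session can match several referral sessions,
  -- so count each session only once.
  COUNT(DISTINCT CONCAT(t.fullVisitorId, '-', CAST(t.visitId AS STRING))) AS sessions
FROM
  `dataset.ga_sessions_2017*` AS t
JOIN
  referrals AS r
ON
  r.fullVisitorId = t.fullVisitorId
WHERE
  t.totals.transactions > 0
  AND r.referred_at <= t.visitStartTime
  AND r.referred_at >= t.visitStartTime - 7 * 24 * 60 * 60
```

The date comparison refers to columns of both the outer row (t) and the referral row (r), which is how a filter that changes from row to row is usually expressed in SQL.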

Related

Data collection from GDELT using bigquery

I am trying to construct an economic indicator based on all events with specific cameo codes from gdelt database.
So the idea is to collect data from 1990 to date and see how economic cooperation varied based on news appearances of certain words, specifically CAMEO codes 0211, 0311, 061, 1011 and 1211.
My question is how to extract the data for these specific CAMEO codes. If you can direct me to any source, it would be of great help.
One person suggested that I try BigQuery. So far I honestly don't know how to navigate the Google BigQuery page (I tried my best, but coming from a non-tech background it was a bit overwhelming). If any of you can help with a data extraction example for one CAMEO code, I can then play around with the other events.
Edit: I am editing to show the progress I have made and the issues I am facing while running the query.
SELECT
*
FROM
[gdelt-bq:full.events]
WHERE
Year >= 1979
AND EventCode IN ('0211', '0311','061', '1011', '1211')
AND Actor1CountryCode != Actor2CountryCode
This query will process 228 GB when run, and it also excludes the cases where both country codes are null. The result has over 2 million rows and I can't download it as a CSV file from the BigQuery platform.
The part where I need help is the following:
Is there any way to get the total number of events for each event code, subject to the following conditions?
Actor1CountryCode and Actor2CountryCode should be different, except when they are null.
The count should be given per event code for every month, for events that satisfy the condition above.
PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.
Edit2: There is another query that I am trying to write, in which the AvgTone of an event with a specified code must be greater than the average AvgTone of all events in that particular month (MonthYear in this case). My question is how to express this in a WHERE clause. Any leads would be really helpful. This is my attempt:
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211',
'0311',
'061')
AND Actor1CountryCode != Actor2CountryCode
AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
GROUP BY
MonthYear
ORDER BY
MonthYear
Error: ELEMENT can only be applied to result with 0 or 1 row.
Can someone help me with the above query? Thanks.
The GDELT database is available in BigQuery.
Here is a link to their available datasets; your first step would be to identify which one contains the information you are interested in:
https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
Then this section of the site contains sample queries, which you can use as a starting point and tweak to your needs (note that these examples appear to be mostly in Legacy SQL; I would suggest you use them as a guide and rewrite them in Standard SQL):
https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/
If you have any specific SQL/BigQuery questions after you have done this, I would recommend you come back with fresh questions and share examples of your working code, details of what you have already tried, and the results you expect to see.
I have had a quick look, and although I must say I am not familiar with the dataset, this may be a simple query that can start you on your way:
-- first we select all columns from the event dataset, which seems
-- to be the one you want, containing cameo codes
SELECT * FROM `gdelt-bq.full.events`
-- then we add a filter to only look at events in or after 1990
WHERE Year >= 1990
-- and another filter to look at only the specific CAMEO
-- codes you provided (I think EventCode is the correct column here)
AND EventCode IN ('0211','0311','061','1011','1211')
-- finally, we add a limit to our query, so we don't bring back ALL
-- the results while testing, once we are happy with our query, we'd remove this!
LIMIT 100
Finally, the GDELT tag right here on StackOverflow contains some really great content.
Hope that helps, GDELT looks like a fascinating project!
I finally figured out a way to extract data from GDELT using bigquery. Although the query is very simple, my lack of SQL knowledge made it difficult. Thanks to Ben who provided the initial help. Following are the queries which satisfy the conditions given in the question.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode IS NULL
AND Actor2CountryCode IS NULL
GROUP BY
MonthYear
ORDER BY
MonthYear
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode != Actor2CountryCode
GROUP BY
MonthYear
ORDER BY
MonthYear
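For the Edit2 part of the question, the error most likely arises because the subquery with GROUP BY returns one row per month, while a comparison such as AvgTone > (...) expects a single value. One possible way around this, sketched here in Standard SQL rather than Legacy SQL, is to attach the monthly average to every row with a window function and filter afterwards. The column names are the ones used in the question; grouping by EventCode as well is an assumption about what "count for each event code every month" should look like.

```sql
#standardSQL
-- Sketch: per month and event code, count the events whose AvgTone is above
-- the average AvgTone of ALL events in the same month.
SELECT
  MonthYear,
  EventCode,
  COUNT(*) AS events
FROM (
  SELECT
    MonthYear,
    EventCode,
    Actor1CountryCode,
    Actor2CountryCode,
    AvgTone,
    -- Monthly average, computed before any filtering so it covers all events.
    AVG(AvgTone) OVER (PARTITION BY MonthYear) AS month_avg_tone
  FROM
    `gdelt-bq.full.events`
)
WHERE
  EventCode IN ('0211', '0311', '061')
  AND Actor1CountryCode != Actor2CountryCode
  AND AvgTone > month_avg_tone
GROUP BY
  MonthYear,
  EventCode
ORDER BY
  MonthYear,
  EventCode
```

As in the queries above, the != comparison drops rows where either country code is NULL, so those cases would still need a separate query or an additional OR ... IS NULL condition.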

Extracting DAU, MAU using BigQuery

I'm trying to extract Firebase Analytics DAU and MAU using BigQuery. The query I'm using for daily users is below -
SELECT
event_date AS day,
COUNT(DISTINCT user_id) AS daily_visitors
FROM `XXXXXXX.analytics_153729556.events_20190825`
WHERE
app_info.id = 'XXXXXXX'
AND
event_name = 'user_engagement'
GROUP BY day;
I have a few questions I would love some help with.
There is a significant (2000+) difference between the value from the query result and the value the Firebase dashboard shows for the same date(s). Is there a specific reason for this or is my query just plain wrong?
There are instances where I see dates other than that of the table selected. For example, I see 20190502 in the results when 20190501 should be the only row (based on the table name). Is this possibly because the events being dumped into the table are for an app in a different timezone? If not, what else could be the reason behind this?
I also want to extract historical MAU and DAU data, and store it on MongoDB for any future requirements that may arise. Is there a specific way in which I can extract them - after overcoming the problem I'm facing, of course?

applying knowledge of SQL for everyday workplace activities

My question is how to properly write a SQL query for the below highlighted/bold question.
There is a table in an HMO database which stores doctors' working hours. The table has the following fields: "FirstName", "LastName", "Date", "HoursWorked". Write a SQL statement which retrieves the average working hours for the period January-March for a doctor with the name Joe Doe.
So far I have:
SELECT HoursWorked
FROM Table
WHERE DATE = (January - March) AND
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"*
A few pointers as this sounds like a homework question (which we don't answer for you here, but we can try to give you some guidance).
You want to put all the things you want to return from your select first and you want to have all your search conditions at the end.
So the general format would be :
SELECT Column1,
Column2,
Column3
FROM YourTable
WHERE Column4 = Restriction1
AND Column5 = Restriction2
The next thing you need to think about is how the dates are formatted in your database table. Hopefully they're kept in a column of type datetime or date (the options depend on the database engine you're using, e.g. Microsoft SQL Server, Oracle or MySQL). In reality, some older databases store dates in all sorts of formats, which makes this much harder, but since I'm assuming it's a homework-type question, let's assume it's a datetime column.
You specify restrictions by comparing columns to a value, so if you wanted all rows where the date was after midnight on the 2nd of March 2012, you would have the WHERE clause :
WHERE MyDateColumn >= '2012-03-02 00:00:00'
Note that to avoid confusion, we usually try to format dates as "Year-Month-Day Hour:Minute:Second". This is because in different countries, dates are often written in different formats and this is considered a Universal format which is understood (by computers at least) everywhere.
So you would want to combine a couple of these comparisons in your WHERE, one for dates AFTER a certain date in time AND one for dates before another point in time.
If you give this a go and see where you get to, update your question with your progress and someone will be able to help get it finished if you have problems.
If you don't have access to an actual database and need to experiment with syntax, try this site : http://sqlfiddle.com/
You already have most of the answer written:
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"
You only need to fix the query: combine the conditions with AND instead of a comma, and use single quotes for string values.
SELECT AVG(HoursWorked) AS AVGWORKED FROM Table WHERE FirstName='Joe' AND LastName='Doe'
That query gives you the average hours worked for Joe Doe over all dates. To restrict it to a time period, add another AND condition on the date. If you are using SQL Server, you can use the built-in function DateFromParts(year,month,day) to create a date, or you can convert a string to a date with Convert(Date,'MM/dd/yyyy'); the exact conversion function differs between database engines.
Example:
SELECT AVG(HoursWorked) AS AVGWORKED FROM Table WHERE FirstName='Joe' AND LastName='Doe' AND DateColumn BETWEEN DateFromParts(year,month,day) AND Convert(Date,'MM/dd/yyyy')
In the example I showed both approaches (DateFromParts for the start date and Convert for the end date); year, month, day and 'MM/dd/yyyy' are placeholders for the actual values.
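Putting the pieces from both answers together, a complete query could look roughly like the following. This is only a sketch: DoctorHours is a hypothetical table name (Table itself is a reserved word in most engines), DateColumn stands in for the "Date" field, and the year 2012 is an assumption because the exercise does not name one.

```sql
-- Sketch: average hours worked by Joe Doe from January through March.
-- A half-open range (>= start, < first day of April) also covers rows
-- on 31 March that carry a time of day.
SELECT AVG(HoursWorked) AS AvgHoursWorked
FROM DoctorHours
WHERE FirstName = 'Joe'
  AND LastName = 'Doe'
  AND DateColumn >= '2012-01-01'
  AND DateColumn < '2012-04-01';
```

Using the year-month-day string format keeps the date literals unambiguous, as recommended in the first answer.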

Using Table Decorators on Big Query Web Interface

I saw the news about Table Decorators being available to limit the amount of data that is queried by specifying a time interval or limit. I did not see any examples of how to use Table Decorators in the BigQuery UI. Below is an example query that I'd like to run while only looking at data that came in over the last 4 hours. Any tips on how I can modify this query to use Table Decorators?
SELECT
foo,
count(*)
FROM [bigtable.201309010000]
GROUP BY 1
EDIT after trying example below
The first query above scans 180 GB of data for the month of September (up through Sept 19th). I'd expect the query below to scan only the data that came in during the time period specified. In this case that is 4 hours, so I'd expect to be billed for about 1.6 GB, not 180 GB. Is there a way to set up the ETL/query so we do not get billed for scanning the whole table?
SELECT
foo,
count(*)
FROM [bigtable.201309010000@-14400000]
GROUP BY 1
To use table decorators, you can either specify @timestamp or @timestamp-end_time. The timestamp can be negative, in which case it is relative to the current time; end_time can be empty, in which case it is the current time. You can use both of these special cases together to get a time range relative to now, e.g. [table@-time_in_ms-]. So for your case, since 4 hours is 14400000 milliseconds, you can use:
SELECT foo, count(*) FROM [dataset.table@-14400000-] GROUP BY 1
This is a little bit confusing; we're intending to publish better documentation and examples soon.
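To make the difference between the two decorator forms concrete, here is a small Legacy SQL sketch; dataset.table and foo are the placeholders from the question, and the offsets are 4 hours (4 * 60 * 60 * 1000 = 14400000 ms) and 8 hours (28800000 ms). Decorators are written with an @ sign. Run the statements one at a time.

```sql
-- Snapshot decorator (no trailing dash): the table as it existed 4 hours ago.
-- The whole snapshot is scanned, which matches the 180 GB seen in the question.
SELECT foo, COUNT(*) FROM [dataset.table@-14400000] GROUP BY 1;

-- Range decorator with an open end: only data added in the last 4 hours.
SELECT foo, COUNT(*) FROM [dataset.table@-14400000-] GROUP BY 1;

-- Range decorator with both ends: data added between 8 and 4 hours ago.
SELECT foo, COUNT(*) FROM [dataset.table@-28800000--14400000] GROUP BY 1;
```

The trailing dash is what turns a snapshot into a range, so it is easy to miss when adapting the query.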

How to get all rows from a table inserted in a particular date.

I am trying to write a query that gets all the rows of a table for a particular date.
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE='2013-05-07'
However, that does not work, because in the table COLUMN_CONTAINING_DATE contains data like '2013-05-07 00:00:01' etc. So this would work:
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE>='2013-05-07' AND COLUMN_CONTAINING_DATE<'2013-05-08'
However, I don't want to go for option 2 because it feels hacky. I would rather write a query that says "get me all the rows for a given date" and somehow not bother about the hours and minutes in COLUMN_CONTAINING_DATE.
I am trying to have this query run on both H2 and DB2.
Any suggestions?
You can do:
select *
from MY_Table
where trunc(COLUMN_CONTAINING_DATE) = '2013-05-07';
However, the version that you describe as a "hack" is actually better. When a function is wrapped around the column, many SQL optimizers will not use an index on it. With direct comparisons against the bare column, an index on that column can be used.
Use something like this, converting the column to a date before comparing:
SELECT * FROM MY_TABLE WHERE DATE(COLUMN_CONTAINING_DATE)=DATE('2013-05-07')
You can ease this if you use the Temporal data management capability from DB2 10.1.
For more information:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/
If your concerns are related to the different data types (timestamp in the column, and a string containing a date), you can do this:
SELECT * FROM MY_TABLE
WHERE
COLUMN_CONTAINING_DATE >= '2013-05-07 00:00:00'
and COLUMN_CONTAINING_DATE < '2013-05-08 00:00:00'
I'd also pay attention to the formatting of the WHERE clause, because this improves readability a lot if you have to look at your queries two months later. Just pick a style you prefer for ranges like "a <= x < b". Unfortunately, SQL's BETWEEN does not support this, because it includes both endpoints.
One could argue that the milliseconds are still missing, so perfectionists may append another ".0" in the timestamp ...