Using the TABLE_DATE_RANGE function in BigQuery - google-bigquery

I'm using BigQuery for the first time in quite a while, so I'm a bit rusty.
I'm using a public Reddit dataset that can be found here.
Here is a snapshot:
What I'm trying to do is create a query that extracts all data from 2017.
Basically, I want the BigQuery Legacy SQL equivalent of this, which is written using Standard SQL:
fh-bigquery.reddit_posts.2017*
I know that would involve using the TABLE_DATE_RANGE function, but I'm stumped on the exact syntax.
If I was using just one of the tables, it would look like this:
SELECT
FORMAT_UTC_USEC(SEC_TO_TIMESTAMP(created_utc)) AS created_date
FROM
[fh-bigquery:reddit_posts.2017_06]
LIMIT
10
But I'm obviously trying to span this over multiple months.

Below is for BigQuery Standard SQL
#standardSQL
SELECT
TIMESTAMP_SECONDS(created_utc) AS created_date
FROM `fh-bigquery.reddit_posts.2017_*`
LIMIT 10
It does what your query for one table does, but for all of the 2017 tables. (I'm not sure what logic you're actually looking for in your query, but I assume you left it out of the question just for simplicity's sake.)
Note: you can use _TABLE_SUFFIX in your query to identify exactly which table a specific row comes from - for example:
#standardSQL
SELECT
_TABLE_SUFFIX AS month,
COUNT(1) AS records
FROM `fh-bigquery.reddit_posts.2017_*`
GROUP BY month
ORDER BY month
with output as below
month records
----- ---------
01 9,218,513
02 8,588,120
03 9,616,340
04 9,211,051
05 9,498,553
06 9,597,725
07 9,989,122
08 10,424,133
09 9,787,604
10 10,281,718
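If you only need a subset of the months, a minimal sketch (using the same Standard SQL wildcard table as above) is to filter on _TABLE_SUFFIX in the WHERE clause:
#standardSQL
SELECT
TIMESTAMP_SECONDS(created_utc) AS created_date
FROM `fh-bigquery.reddit_posts.2017_*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '06' -- January through June 2017 only
LIMIT 10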
In case, for whatever reason, you are still bound to BigQuery Legacy SQL, you can use the below:
#legacySQL
SELECT
FORMAT_UTC_USEC(SEC_TO_TIMESTAMP(created_utc)) AS created_date
FROM TABLE_QUERY([fh-bigquery:reddit_posts], "LEFT(table_id, 5) = '2017_'")
LIMIT 10
But it is highly recommended to migrate to Standard SQL

Related

IBM Cognos Analytics Selecting top 2 from a dataset

I'm working on a report where I need data for the current week and the week before, and to compare the two. I have a week column in my data, which consists of transactions, so my data looks something like:
Amount - Week
13 - 01
19 - 01
11 - 02
10 - 02
13 - 02
12 - 03
18 - 03
15 - 04
And I want this as a result, the sum of Amount for the two most recent weeks:
Week 03: 30
Week 04: 15
Now it's easy to get the most recent week, just a maximum (Week for report), but when I want to select the 2nd largest I'm getting stuck.
I've tried to do a filter that is basically "Maximum( case when week = maximum(week) then null else week)", but either I have not figured out the syntax or this approach does not work.
Another alternative I tried was the rank() feature and then a query which selects rank in (1, 2), but for whatever reason I couldn't get this approach to work and only got the error:
The function "to_char" is being used for local processing but is not available as a built-in function, or at least one of its parameters is not supported.
I believe this has something to do with the aggregation (multiple records per occurrence of week). Anyway, I'm kind of stuck and the error messages aren't giving me any clues. I would very much appreciate some help!
RANK should work fine, but it may not work well if you try to get Cognos to do all of the work in one place. I thought I could filter on the ranked data item and set the Application property to After auto aggregation. But I got strange results.
Rather than trying to create one complicated solution, try breaking the problem into smaller, simpler components.
Define Query1
Data Items:
Week = [namespace].[query subject].[Week]
Amount = [namespace].[query subject].[Amount] with the detail aggregation set to Total
Rank = rank([namespace].[query subject].[Week])
Create Query2 and set Query1 as its source.
Data Items:
[Query1].[Week]
[Query1].[Amount]
Detail Filters:
[Query1].[Rank] <= 2
Use Query2 as the source for your list.
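If it helps to see the idea outside of Cognos, a rough SQL sketch of the same rank-then-filter approach (with hypothetical table and column names, not Cognos syntax) would be:
SELECT Week, Amount
FROM (
  SELECT Week,
         SUM(Amount) AS Amount,
         RANK() OVER (ORDER BY Week DESC) AS WeekRank -- most recent week gets rank 1
  FROM transactions -- hypothetical table name
  GROUP BY Week
) ranked
WHERE WeekRank <= 2
ORDER BY Week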

filter by first 4 digits of a timestamp value in BigQuery standard SQL

I have a column where the entries are dates (timestamps). The exact format is like this: 2009-03-01 00:00:00 UTC.
I want to filter all the rows where the year is 2009 (the first 4 digits of the timestamp). I am using Google BigQuery Standard SQL. I tried the following:
WHERE LEFT(CAST(incurred_month_timestamp as STING), '4') LIKE 2013
If anyone can share the query, it would be helpful.
Thanks.
Don't use string functions on dates! BigQuery has lots of useful date/time/timestamp functions.
The one you want is extract():
where extract(year from incurred_month_timestamp) = 2013
Or use a range:
where incurred_month_timestamp >= timestamp('2013-01-01') and
incurred_month_timestamp < timestamp('2014-01-01')
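Put together as a full query, a minimal sketch (the table name here is just a placeholder) would be:
#standardSQL
SELECT *
FROM `project.dataset.your_table` -- placeholder table name
WHERE EXTRACT(YEAR FROM incurred_month_timestamp) = 2013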
A simpler way is to use the built-in YEAR function in BigQuery Legacy SQL, which extracts the year from a timestamp.
Here is an example with your code:
where year(incurred_month_timestamp) = 2009

SQL YTD for previous years and this year

Wondering if anyone can help with the code for this.
I want to query the data and get 2 entries, one for the previous year's YTD and one for this year's YTD.
The only way I know how to do this is as 2 separate queries with WHERE clauses. I would prefer not to have to run the query twice.
One column called DatePeriod populated with 2011 YTD and 2012 YTD; it would be even better if I could get it to do 2011 YTD, 2012 YTD, 2011 Total, 2012 Total... though I'm guessing this is 4 queries.
Thanks
EDIT:
In response to help clear a few things up:
This is being coded in MS SQL.
The data looks like so: (very basic example)
Date | Call_Volume
1/1/2012 | 4
What I would like is to have the Call_Volume summed up. I have queries that group it by week, and others that do it by month. I could pull all the dailies in and do this in Excel, but the table has millions of rows, so it's always best to reduce the size of my output.
I currently group by Week/Month and Year and UNION ALL so it's 1 output. But that means I have 3 queries accessing the same table, which is a pain and not efficient. That is fine, but now I also need a YTD, so it's either 1 more query or, if I could find a way to add it to the yearly query, that would be ideal:
So
DatePeriod | Sum_Calls
2011 Total | 40
2011 YTD | 12
2012 Total | 45
2012 YTD | 15
Hope this makes sense.
SQL is built to do operations on rows, not columns (you select columns, of course, but aggregate operations are all on rows).
The most standard approach to this is something like:
SELECT SUM(your_table.sales), YEAR(your_table.sale_date)
FROM your_table
GROUP BY YEAR(your_table.sale_date)
Now you'll get one row for each year on record, with no limit to how many years you can process. If you're already grouping by another field, that's fine; you'll then get one row for each year in each of those groups.
Your program can then iterate over the rows and organize/render them however you like.
If you absolutely, positively must have columns instead, you'll be stuck with something like this:
SELECT SUM(IF(YEAR(date) = 2011, sales, 0)) AS total_2011,
SUM(IF(YEAR(date) = 2012, sales, 0)) AS total_2012
FROM your_table
If you're building the query programmatically you can add as many of those column criteria as you need, but I wouldn't count on this running very efficiently.
(These examples are written with some MySQL-specific functions. Corresponding functions exist for other engines but the syntax would be a little different.)
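Since the question mentions MS SQL, here is a sketch in SQL Server syntax that produces the DatePeriod/Sum_Calls rows from the question in one statement; the table name and the YTD cut-off (day-of-year up to today) are assumptions:
SELECT CAST(YEAR([Date]) AS varchar(4)) + ' Total' AS DatePeriod,
       SUM(Call_Volume) AS Sum_Calls
FROM CallVolumes -- hypothetical table name
GROUP BY YEAR([Date])
UNION ALL
SELECT CAST(YEAR([Date]) AS varchar(4)) + ' YTD' AS DatePeriod,
       SUM(Call_Volume) AS Sum_Calls
FROM CallVolumes
WHERE DATEPART(dayofyear, [Date]) <= DATEPART(dayofyear, GETDATE()) -- only days up to "today" within each year
GROUP BY YEAR([Date])
ORDER BY DatePeriod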

Is this SQL the most efficient way

We have a table that converts SAT scores into ACT scores using a year. If the data changes in the future, we would add the new scores along with the year the scores change. We need to pass in a year and SAT score and return the correct ACT score.
Sample data with three rows would be:
act sat year
28 1010 1998
29 1010 2012
30 1010 2015
If I pass in an SAT score of 1010 and a year of 2014, I should get an ACT score of 29 back.
I wrote the following SQL statement that works:
select act,
RANK() OVER(ORDER BY year DESC)
from keessattbl
where sat = 1010 and INT(year) <= 2014
FETCH FIRST ROW ONLY
Is this the most efficient way to handle this?
Thanks in advance, Doug
Another option would be to use the following:
select k1.*
from keessattbl k1
where k1.sat = 1010
and k1.year = (select max(k2.year)
from keessattbl k2
where k2.sat = k1.sat
and k2.year <= 2014)
You will need to check which one is more efficient. If year (and possibly sat) is indexed, then both are probably quite fast.
But you will need to look at the execution plan (or simply time the statements) to find out.
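Yet another sketch, assuming DB2-style syntax (the question already uses FETCH FIRST), that skips the window function entirely and just sorts:
select act
from keessattbl
where sat = 1010
and INT(year) <= 2014 -- keeping the INT() cast from the question in case year is stored as character
order by year desc
fetch first 1 row only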
I would say "Sure." Is it not performing well?
Also, most DBMS's have some way to get the first row of a result set, so you don't need to use DB2 unless you want to.
If you are not sure whether it's the most efficient way to write it, you can check by doing an EXPLAIN on the query. Write the query another way, do an EXPLAIN on it, and compare the costs. IBM provides the IBM Data Studio product for free; you can just right-click on your SQL and select Visual Explain to get the results in the GUI.

Getting Hourly statistics using SQL

We have a table named 'employeeReg' with the fields
employeeNo | employeeName | Registered_on
Here Registered_on is a timestamp.
We require an hourly pattern of registrations over a period of days, e.g.:
01 Jan 08 : 12 - 01 PM : 1592 registrations
01 Jan 08 : 01 - 02 PM : 1020 registrations
Can someone please suggest a query for this?
We are using Oracle 10gR2 as our DB server.
This is closely related to, but slightly different from, this question about How to get the latest record for each day when there are multiple entries per day. (One point in common with many, many SQL questions - the table name was not given originally!)
The basic technique will be to find a function that will format the varied Registered_on values such that all the entries in a particular hour are grouped together. This presumably can be done with TO_CHAR() since we're dealing with Oracle (MySQL does not support this).
SELECT TO_CHAR(Registered_on, 'YYYY-MM-DD HH24') AS TimeSlot,
COUNT(*) AS Registrations
FROM EmployeeReg
GROUP BY 1
ORDER BY 1;
You might be able to replace the '1' entries by TimeSlot, or by the TO_CHAR() expression; however, for reasons of backwards compatibility, it is likely that this will work as written (but I cannot verify that for you on Oracle - an equivalent works OK on IBM Informix Dynamic Server using EXTEND(Registered_on, YEAR TO HOUR) in place of TO_CHAR()).
If you then decide you want zeroes to appear for hours when there are no entries, then you will need to create a list of all the hours you do want reported, and you will need to do a LEFT OUTER JOIN of that list with the result from this query. The hard part is generating the correct list - different DBMS have different ways of doing it.
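For example, on Oracle one possible sketch (the start date and number of hours are placeholders for your reporting window) is to generate the hour list with a CONNECT BY row generator and outer join it to the grouped counts:
WITH hours AS (
  SELECT TO_CHAR(DATE '2008-01-01' + (LEVEL - 1)/24, 'YYYY-MM-DD HH24') AS TimeSlot
  FROM dual
  CONNECT BY LEVEL <= 24 * 7 -- one week of hourly slots
)
SELECT h.TimeSlot,
       NVL(r.Registrations, 0) AS Registrations
FROM hours h
LEFT OUTER JOIN (
  SELECT TO_CHAR(Registered_on, 'YYYY-MM-DD HH24') AS TimeSlot,
         COUNT(*) AS Registrations
  FROM EmployeeReg
  GROUP BY TO_CHAR(Registered_on, 'YYYY-MM-DD HH24')
) r ON r.TimeSlot = h.TimeSlot
ORDER BY h.TimeSlot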
Achieved what I wanted with this :)
SELECT TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24') AS TimeSlot,
COUNT(*) AS Registrations
FROM EmployeeReg a
Group By TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24');