Use SQL to ensure I have data for each day of a certain time period - sql

I'm looking to only select one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, as we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data - want to make sure no date was left out), and this data goes from now through last summer. I could do a COUNT DISTINCT clause to just make sure the number of days between the first data point and yesterday (the latest data point), but I want to verify each day is accounted for. Should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
**I know FIRST is probably not the best clause for this case, and it's currently acting to grab the very first data point in created_at, but just as a jumping-off point.

You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);

BigQuery equivalent of the query in your question
SELECT created_at
FROM 'large_table'
ORDER BY created_at
LIMIT 1

Related

date during your collection period saw the greatest number of matching tweets

hello i want to write a query which tells me the great number of tweets collect on a particular day like my question is
"date during your collection period saw the greatest number of matching tweets"
i have collected data from tweetcatcher but didnt know which query i use
Not much to go on, but based on your comment
SELECT date, Tweets=count(*) FROM Course GROUP BY date Order by 2 Desc
You could also add Select Top 1 ... if you want only one record.

Using Table Decorators on Big Query Web Interface

I saw the news about Table Decorators being available to limit the amount of data that is queried by specifying a time interval or limit. I did not see any examples on how to use the Table Decorators in the Big Query UI. Below is an example query that I'd like to run and only look at data that came in over the last 4hours. Any tips on how I can modify this query to utilize Table Decorators?
SELECT
foo,
count(*)
FROM [bigtable.201309010000]
GROUP BY 1
EDIT after trying example below
The first query above scans 180GB of data for the month of September (up through Sept 19th). I'd expect the query below to only scan data that came in during the time period specified. In this case 4hrs, so I'd expect the billing to be about 1.6GB not 180GB. Is there a way to set up ETL/query so we do not get billed for scanning the whole table?
SELECT
foo,
count(*)
FROM [bigtable.201309010000#-14400000]
GROUP BY 1
To use table decorators, you can either specify #timestamp or #timestamp-end_time. Timestamp can be negative, in which case it is relative; end_time can be empty, in which case it is the current time. You can use both of these special cases together, to get a time range relative to now. e.g. [table#-time_in_ms-]. So for your case, since 4 hours is 14400000 milliseconds, you can use:
SELECT foo, count(*) FROM [dataset.table#-14400000-] GROUP BY 1
This is a little bit confusing, we're intending to publish better documentation and examples soon.

Is there a way to handle immutability that's robust and scalable?

Since bigquery is append-only, I was thinking about stamping each record I upload to it with an 'effective date' similar to how peoplesoft works, if anybody is familiar with that pattern.
Then, I could issue a select statement and join on the max effective date
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the big query 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what the exact query you are trying to run is.. it looks like you are trying to return the most recent sales times of every individual item - and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used FROM
[publicdata:samples.wikipedia] WHERE (contributor_username != '' or
contributor_username IS NOT NULL) AND timestamp > 1133395200
AND timestamp < 1157068800 GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL AND timestamp > 1265073722 GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to use first build a large aggregate (for example, you need to run essentially an accurate COUNT DISTINCT) another option is to break this query up into two queries. The first query could provide the max effective date by month along with a count and save this result as a new table. Then, could run a sum query on the resulting table.
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.

PostgreSQL - GROUP BY timestamp values?

I've got a table with purchase orders stored in it. Each row has a timestamp indicating when the order was placed. I'd like to be able to create a report indicating the number of purchases each day, month, or year. I figured I would do a simple SELECT COUNT(xxx) FROM tbl_orders GROUP BY tbl_orders.purchase_time and get the value, but it turns out I can't GROUP BY a timestamp column.
Is there another way to accomplish this? I'd ideally like a flexible solution so I could use whatever timeframe I needed (hourly, monthly, weekly, etc.) Thanks for any suggestions you can give!
This does the trick without the date_trunc function (easier to read).
// 2014
select created_on::DATE from users group by created_on::DATE
// updated September 2018 (thanks to #wegry)
select created_on::DATE as co from users group by co
What we're doing here is casting the original value into a DATE rendering the time data in this value inconsequential.
Grouping by a timestamp column works fine for me here, keeping in mind that even a 1-microsecond difference will prevent two rows from being grouped together.
To group by larger time periods, group by an expression on the timestamp column that returns an appropriately truncated value. date_trunc can be useful here, as can to_char.

SQL: Calculating system load statistics

I have a table like this that stores messages coming through a system:
Message
-------
ID (bigint)
CreateDate (datetime)
Data (varchar(255))
I've been asked to calculate the messages saved per second at peak load. The only data I really have to work with is the CreateDate. The load on the system is not constant, there are times when we get a ton of traffic, and times when we get little traffic. I'm thinking there are two parts to this problem: 1. Determine ranges of time that are considered peak load, 2. Calculate the average messages per second during these times.
Is this the right approach? Are there things in SQL that can help with this? Any tips would be greatly appreciated.
I agree, you have to figure out what Peak Load is first before you can start to create reports on it.
The first thing I would do is figure out how I am going to define peak load. Ex. Am I going to look at an hour by hour breakdown.
Next I would do a group by on the CreateDate formated in seconds (no milleseconds). As part of the group by I would do an avg based on number of records.
I don't think you'd need to know the peak hours; you can generate them with SQL, wrapping a the full query and selecting the top 20 entries, for example:
select top 20 *
from (
[...load query here...]
) qry
order by LoadPerSecond desc
This answer had a good lesson about averages. You can calculate the load per second by looking at the load per hour, and dividing by 3600.
To get a first glimpse of the load for the last week, you could try (Sql Server syntax):
select datepart(dy,createdate) as DayOfYear,
hour(createdate) as Hour,
count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(week,-7,getdate())
group by datepart(dy,createdate), hour(createdate)
To find the peak load per minute:
select max(MessagesPerMinute)
from (
select count(*) as MessagesPerMinute
from message
where CreateDate > dateadd(days,-7,getdate())
group by datepart(dy,createdate),hour(createdate),minute(createdate)
)
Grouping by datepart(dy,...) is an easy way to distinguish between days without worrying about month borders. It works until you select more that a year back, but that would be unusual for performance queries.
warning, these will run slow!
this will group your data into "second" buckets and list them from the most activity to least:
SELECT
CONVERT(char(19),CreateDate,120) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY CONVERT(Char(19),CreateDate,120)
ORDER BY 2 Desc
this will group your data into "minute" buckets and list them from the most activity to least:
SELECT
LEFT(CONVERT(char(19),CreateDate,120),16) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY LEFT(CONVERT(char(19),CreateDate,120),16)
ORDER BY 2 Desc
I'd take those values and calculate what they want