Is there a way to handle immutability that's robust and scalable? - google-bigquery

Since BigQuery is append-only, I was thinking about stamping each record I upload with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small JOIN' restriction, so I wanted to see if anyone else had thought through this use case.

Yes, adding a timestamp to each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this limit applies to compressed data on our end, so the uncompressed size is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what the exact query you are trying to run is. It looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username,
       UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
       SUM(num_characters) as total_characters_used
FROM [publicdata:samples.wikipedia]
WHERE (contributor_username != '' or contributor_username IS NOT NULL)
  AND timestamp > 1133395200
  AND timestamp < 1157068800
GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
  (SELECT contributor_username, num_characters, timestamp as time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL) AS current
JOIN
  (SELECT contributor_username, MAX(timestamp) as time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL
     AND timestamp > 1265073722
   GROUP BY contributor_username) AS latest
ON current.contributor_username = latest.contributor_username
AND current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you essentially need to run an accurate COUNT DISTINCT), another option is to break the query up into two queries. The first query could provide the max effective date by month along with a count, and save this result as a new table. Then you could run a sum query on the resulting table.
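As a rough sketch of that two-step approach, reusing the table and column names from the question (the intermediate table name foo.orders_latest is just an assumption; you would write the first result to it via the query's destination table setting rather than in the SQL itself):
Query 1, the lookup table of latest effective dates per id:
SELECT id, MAX(effdt) AS max_effdt
FROM foo.orders
GROUP BY id;
Query 2, joining the saved lookup table back and aggregating by month:
SELECT UTC_USEC_TO_MONTH(o.timestamp) AS month, SUM(o.amt)/100 AS sales
FROM foo.orders AS o
JOIN foo.orders_latest AS latest
  ON o.id = latest.id AND o.effdt = latest.max_effdt
GROUP BY month
ORDER BY month;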
You could also store monthly sales records in separate tables, and query only the particular tables for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need aggregates across all months, you could run your query with multiple tables listed after the FROM clause.
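For example, with per-month tables (the table names below are placeholders), legacy BigQuery SQL treats a comma-separated list of tables in the FROM clause as a union of those tables rather than a join:
SELECT UTC_USEC_TO_MONTH(timestamp) AS month, SUM(amt)/100 AS sales
FROM foo.orders_201401, foo.orders_201402, foo.orders_201403
GROUP BY month
ORDER BY month;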

Related

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, since we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data) and I want to make sure no date was left out; the data runs from last summer through now. I could do a COUNT DISTINCT to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each specific day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM `large_table`
ORDER BY created_at
I know FIRST is probably not the best function for this case; right now it just grabs the very first data point in created_at, but it's a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
BigQuery equivalent of the query in your question
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
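If the real goal is just to verify that no day is missing, here is a minimal sketch in standard BigQuery SQL, assuming created_at is a TIMESTAMP; the start date is a placeholder standing in for "last summer":
SELECT d AS missing_day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-07-01', CURRENT_DATE())) AS d
LEFT JOIN (
  -- one row per day that actually has data
  SELECT DISTINCT DATE(created_at) AS day
  FROM `large_table`
) lt ON lt.day = d
WHERE lt.day IS NULL   -- days with no matching data
ORDER BY d;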

SQL query to count number of checkins per month

To put a long story short, I am working on a database using PostgreSQL that is managing Yelp checkins. The checkin table has the attributes business_id (string), date (a string in the form yyyy-mm-dd), and time (a string in the form 00:00:00).
What I simply need to do is, given a business_id, I need to return a list of the total number of checkins based on just the mm (month) value.
So for instance, I need to retrieve the total checkins that were in Jan, Feb, March, April, etc, not based upon the year.
Any help is greatly appreciated. I've already considered group by clauses but I didn't know how to factor in '%mm%'.
Reiterating Gordon, class or not, storing dates and times as strings makes things harder, slower, and more likely to break. It's harder to take advantage of Postgres's powerful date math functions. Storing dates and times separately makes things even harder; you have to concatenate them together to get the full timestamp, which means it will not be indexed. Determining the time between two events becomes unnecessarily difficult.
It should be a single timestamp column. Hopefully your class will introduce that shortly.
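A minimal sketch of that schema; the combined column name checked_in_at is just a placeholder:
CREATE TABLE yelp_checkins (
    business_id   text NOT NULL,
    checked_in_at timestamptz NOT NULL   -- one timestamp column instead of separate date/time strings
);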
What I simply need to do is, given a business_id, I need to return a list of the total number of checkins based on just the mm (month) value.
This is deceptively straightforward. Cast your strings to dates; fortunately they're in ISO 8601 format, so no reformatting is required. Then use extract to pull out just the month part.
select
    extract('month' from checkin_date::date) as month,
    count(*)
from yelp_checkins
where business_id = ?
group by month
order by month
But there's a catch. What if there are no checkins for a business on a given month? We'll get no entry for that month. This is a pretty common problem.
If we want a row for every month, we need to generate a table with our desired months with generate_series, then left join with our checkin table. A left join ensures all the months (the "left" table) will be there even if there is no corresponding month in the join table (the "right" table).
select
    months.month,
    count(business_id)
from generate_series(1,12) as months(month)
left join yelp_checkins
    on months.month = extract('month' from checkin_date::date)
    and business_id = ?
group by months.month
order by months.month
Now that we have a table of months, we can group by that. We can't use a where business_id = ? clause or that will filter out empty months after the left join has happened. Instead we must put that as part of the left join.
Try it.
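For contrast, here is a sketch of the version warned about above; with the filter moved into a WHERE clause, the NULL rows that the left join produces for empty months are filtered back out and those months disappear from the result:
select
    months.month,
    count(business_id)
from generate_series(1,12) as months(month)
left join yelp_checkins
    on months.month = extract('month' from checkin_date::date)
where business_id = ?    -- applied after the join, so empty months are lost
group by months.month
order by months.month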
Why would you store the date as a string? That is a broken data model. You should fix the data.
That said, I recommend converting to a date and truncating to the first day of the month:
select date_trunc('month', datestr::date) as yyyymm, count(*)
from t
group by yyyymm
order by yyyymm;
If you don't want these based on the year, then use extract():
select extract(month from datestr::date) as mm, count(*)
from t
group by mm
order by mm;

SQL DateDiff Syntax

I have a homework problem that I'm having a lot of trouble with... I don't expect the answer and I truly want to learn it. Could somebody help me out with the syntax?
Problem:
For each Sales Order, show how many days it took to ship the order in order by the longest order, then by Sales Order Number. Display Sales Order Number and the number of days to ship. Include the orders that have not yet shipped.
So far I have:
SELECT SalesOrder.SalesOrderNumber,
DATEDIFF (d, MIN(SalesOrder.OrderDate), MAX(Shipment.ShipmentDate)) AS "DaysToShip"
FROM SalesOrder, Shipment
GROUP BY SalesOrder.SalesOrderNumber;
Sometimes it's helpful to see an intermediate form of your query to evaluate if it's providing the correct data at some stage.
Consider the following query, pulled from your example minus some elements:
SELECT SalesOrder.SalesOrderNumber, SalesOrder.OrderDate, Shipment.ShipmentDate
FROM SalesOrder, Shipment
You should observe the results of this query and see how they differ from what you expect. In this case, you haven't indicated how SalesOrder and Shipment are related. The result will be many more rows than there are orders, with each SalesOrder row paired with every Shipment record (a cross join).
Once you provide the correct join condition and achieve the desired results at that stage, try adding in aggregation (GROUP BY, MIN, MAX) and test that form of your query. Finally, when you're convinced that you have the correct inputs, add in DATEDIFF and you'll have your final query.
SELECT SalesOrder.SalesOrderNumber,
DATEDIFF (d, MAX(SalesOrder.OrderDate), MAX(Shipment.ShipmentDate)) AS "DaysToShip"
FROM SalesOrder, Shipment
GROUP BY SalesOrder.SalesOrderNumber;
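Once a join condition is in place, the final query could look something like the sketch below. The Shipment.SalesOrderID column is an assumption about your schema (use whatever key actually relates the two tables), and the LEFT JOIN keeps orders that have not yet shipped; their DaysToShip comes back NULL:
SELECT SalesOrder.SalesOrderNumber,
       DATEDIFF(d, SalesOrder.OrderDate, MAX(Shipment.ShipmentDate)) AS DaysToShip
FROM SalesOrder
LEFT JOIN Shipment
    ON Shipment.SalesOrderID = SalesOrder.SalesOrderID   -- assumed join key
GROUP BY SalesOrder.SalesOrderNumber, SalesOrder.OrderDate
ORDER BY DaysToShip DESC, SalesOrder.SalesOrderNumber;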

Access SQL Query: Comparing Date In Select Statement

I have a problem that I simply cannot seem to figure out. I have a list of employees with different travel dates and I want to display all of them in a cascading list format. The problem is that I only want to see employees once, and only the date closest to today.
For example I could have 'Smith' in there multiple times with dates before and after today, as we also keep historical records. This means I can't just do min, as it will try and display a date before today, and max is too far forward.
The code example below ALMOST works. The problem is in the select statement. I want to show the minimum date after today, but instead it gives me 0's and -1's where the dates should be. There might just be another way to do this all together, but this is the only configuration that seems to allow the other information such as Site, Position, and Comments to be displayed correctly alongside it.
SELECT A.`Last Name` AS [Last Name], Min(A.`Date In`) > Now() AS [Date In], Max(B.Site) AS Site, Max(B.Position), Max(B.Comments) AS Comments
FROM Deployments AS A
INNER JOIN Deployments AS B ON A.ID = B.ID
GROUP BY A.`FSR Name`
HAVING (((Max(A.`Actual TEP IN`))>Now()));
I did a group by Name because I only want to see each individual once. If I don't add the table to itself with a join it gives a self reference error. This is my first time posting so I hope this makes sense! All help will be greatly appreciated!
Not sure what DB you're on, but in general, you need to return MIN(date) instead of the result of the comparison "Min(Date) > Now()" - I'm guessing this is where you're seeing 0's and -1's, since that would be the result of the comparison, when you want the minimum date value itself.
Also, if you just want people who have a trip date in the future, restrict your query with a WHERE clause, do a GROUP BY, and you get rid of the self-join. Also note that the example below aligns some discrepancies in your OP, like selecting based on "Last Name" but grouping on "FSR Name" - these things must be consistent, whichever field you're concerned about.
Example:
SELECT A.[FSR Name] AS [FSR Name],
Min(A.[Date In]) AS [Date In],
Max(A.Site) AS Site,
Max(A.Position) AS Position,
Max(A.Comments) AS Comments
FROM Deployments AS A
WHERE A.[Date In] > Now()
GROUP BY A.[FSR Name];
EDIT: If you need to make sure that Site,Position,Comments all came from the same row, you have to do something like one of these options:
If you have a Primary Key:
select *
from Deployments A3
where A3.pk_value =
    (select max(A2.pk_value)
     from Deployments A2
     where A2.[Date In] =
           (select Max([Date In])
            from Deployments A
            where A.[FSR Name] = A2.[FSR Name])
       and A2.[FSR Name] = A3.[FSR Name])
This guarantees you to get 1 row per FSR Name, even if there are multiple rows for that FSR with the same "latest" date.
Otherwise, you can leave out the secondary query dealing with the pk_value, but you run a risk of getting multiple rows for an FSR that has multiple records with the same "latest" date.
Note: when you get to queries this complex, running on a full-featured database (SQL Server, Oracle, anything but Access) allows you to use much more sophistication. For this example, "Windowing Functions" would give you the answer without as much wrangling. Not sure if you're stuck with Access for now, but consider this for the future, anyway.
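For the record, a sketch of what that could look like with a window function outside Access (SQL Server syntax, reusing the column names from the example above):
SELECT [FSR Name], [Date In], Site, Position, Comments
FROM (
    SELECT D.*,
           ROW_NUMBER() OVER (PARTITION BY [FSR Name]
                              ORDER BY [Date In], ID) AS rn   -- earliest upcoming date wins; ID breaks ties
    FROM Deployments AS D
    WHERE [Date In] > GETDATE()
) AS ranked
WHERE rn = 1;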
Try something like this
Select A.LastName, A.DateIn, A.Site, A.Position, A.Comments
From deployments A
Where not exists
    (Select *
     From deployments B
     Where B.id <> A.id
       and (abs(datediff(d, getdate(), A.datein)) > abs(datediff(d, getdate(), B.datein))
            or (abs(datediff(d, getdate(), A.datein)) = abs(datediff(d, getdate(), B.datein))
                and A.id > B.id)))
Instead of the funny mins and maxes that you are using to try to get the row with the datein that is closest to today, try using datediff. With this function, you can specify what type of date or time value you are looking to compare (day, month, year, minute) and then find the difference between two different datetimes. In this case, I used getdate() to find the current date and time. Then, we want the datein with the least value for datediff, the datein that is closest to today. Datediff will return positive or negative values, so I used abs to get the absolute value of the result. I did this because it doesn't matter if the date is before today or after today.
Then we are looking in the deployments table. The subquery says that we should look at all the rows which are not the current row. Then, find all the rows that have a smaller datediff than the current record. Also, find all the records that have the same datediff as the current record and a smaller id. We only include the current record if nothing fits those criteria. It is a little weird to think about, but this type of query should help you find what you are looking for a lot more easily. The only thing is that you will need to add criteria in the where clause of the subquery to determine which entries to compare. As it stands, this query will look at all of the entries in your deployments table and pull back the one row that has a datein closest to today. Since you want one row for each person, this will need a few more specifications.

How to adapt this query to use window functions

When I started tackling this problem, I thought, "This will be a great query to learn about window functions." In the end I wasn't able to get it to work with window functions, but I was able to get what I wanted using a join.
How would you adapt this query to use window functions:
SELECT
    day,
    COUNT(i.project) as num_open
FROM generate_series(0, 364) as t(day)
LEFT JOIN issues i ON (day BETWEEN i.closed_days_ago AND i.created_days_ago)
GROUP BY day
ORDER BY day;
The query above takes a list of issues that have a range represented by created_days_ago and closed_days ago and for the last 365 days, it'll count the number of issues that were created but not yet closed for that specific day.
http://sqlfiddle.com/#!15/663f6/2
The issues table looks like:
CREATE TABLE issues (
    id SERIAL,
    project VARCHAR(255),
    created_days_ago INTEGER,
    closed_days_ago INTEGER
);
What I was thinking was that the partition for a given day should include all the rows in issues where day is between the created and closed days ago. Something like SELECT day, COUNT(i.project) OVER (PARTITION day BETWEEN created_days_ago AND closed_days_ago) ...
I've never used window functions before, so I might be missing something basic, but it seemed like this was just the type of query that makes window functions so awesome.
The fact that you use generate_series() to create a full range of days, including those days with no changes, and thus no rows in table issues, does not rule out the use of window functions.
In fact, this query runs 50 times faster than the query in the Q in my local test:
SELECT t.day
, COALESCE(sum(a.created) OVER (ORDER BY t.day DESC), 0)
- COALESCE(sum(b.closed) OVER (ORDER BY t.day DESC), 0) AS open_tickets
FROM generate_series(0, 364) t(day)
LEFT JOIN (SELECT created_days_ago AS day, count(*) AS created
FROM issues GROUP BY 1) a USING (day)
LEFT JOIN (SELECT closed_days_ago AS day, count(*) AS closed
FROM issues GROUP BY 1) b USING (day)
ORDER BY 1;
It is also correct, as opposed to the query in the question, which results in 17 open tickets on day 0, although all of them have been closed.
The error is due to BETWEEN in your join condition, which includes both the upper and lower bounds. This way tickets are still counted as "open" on the day they are closed.
Each row in the result reflects the number of open tickets at the end of the day.
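If you want to keep the original join-and-count approach, a minimal fix is to make the lower border exclusive; this sketch assumes closed_days_ago is populated for every row:
SELECT
    day,
    COUNT(i.project) AS num_open
FROM generate_series(0, 364) AS t(day)
LEFT JOIN issues i
    ON day <= i.created_days_ago   -- the issue already existed on this day
   AND day >  i.closed_days_ago    -- but was not yet closed by the end of this day
GROUP BY day
ORDER BY day;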
Explain
The query combines window functions with aggregate functions.
Subquery a counts the number of created tickets per day. This results in a single row per day, making the rest easier.
Subquery b does the same for closed tickets.
Use LEFT JOINs to join to the generated list of days in subquery t.
Be wary of joining multiple unaggregated tables: with multiple matches per row, that effectively produces a CROSS JOIN among the joined tables and generates incorrect results. Compare:
Two SQL LEFT JOINS produce incorrect result
Finally use two window functions to compute the running total of created versus closed tickets.
An alternative would be to use this in the outer SELECT
sum(COALESCE(a.created, 0)
- COALESCE(b.closed, 0)) OVER (ORDER BY t.day DESC) AS open_tickets
Performs the same in my tests.
-> SQLfiddle demo.
Aside: I would never store "days_ago" in a table, but the absolute date / timestamp. Looks like a simplification for the purpose of this question.