SQL code generation question, group by and aggregates - sql

I'm working on a web application that lets a user design ad-hoc queries against an employee database. The queries are designed in an AJAX web-based interface where the user specifies groups of criteria that get intersected together. I'm trying to add functionality that also lets the user introduce date relationships between criteria. For example, here's a sample (problematic) generated query that says "Give me all employees that had at least 3 audits 150+ days after they started on the job":
SELECT *
FROM
(
    SELECT employee_id
         , max(employee_start_date) employee_start_date
    FROM employees
    WHERE employee_salary_type IN (55, 66, 77)
    GROUP BY employee_id HAVING count(*) >= 1
) employee_criteria_1,
(
    SELECT employee_id
         , max(audit_date) audit_date
    FROM employees
    WHERE job_audit_id IN (5, 6, 7)
    -- They had at least 3 audits
    GROUP BY employee_id HAVING count(*) >= 3
) employee_criteria_2
WHERE
    employee_criteria_1.employee_id = employee_criteria_2.employee_id
    -- The audits must have happened at least 150 days after employee's start date
    AND employee_criteria_2.audit_date > employee_criteria_1.employee_start_date + 150
As you can see, each criterion from the UI is generated into a SQL SELECT block, and the blocks are then all intersected together. Here's my problem:
The query above checks whether the employee had at least 3 audits and whether the last audit (the MAX) occurred more than 150 days after the start date, INSTEAD of checking that 3 audits occurred 150+ days after the start date.
You might ask, "well, why do you have a max(audit_date) expression then?" The reason is that I need an aggregate function in order for the group by to work (the group here is generated out of the "occurs at least 3 times" high-level query criterion).
So, what can I add to this code (without many changes, because I'd like to keep this code generation mechanism) so that I'm checking that all 3 of those occurrences/audits happen 150+ days after the start date (instead of only the max one)?
Thanks!

It sounds like you need to look into window functions, and possibly the having clause rather than the where clause.
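One way to read that suggestion, sketched here with a plain join and filter rather than window functions: apply the date condition to each audit row before grouping, so that the HAVING count only sees audits that already satisfy the 150-day rule. The sketch below runs against an in-memory SQLite database as a stand-in for the real one. The single denormalized `employees` table, the sample rows, and the use of `julianday` for date arithmetic are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical denormalized layout: one row per audit, employee data repeated.
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER,
        employee_start_date TEXT,
        employee_salary_type INTEGER,
        job_audit_id INTEGER,
        audit_date TEXT
    )
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", [
    # Employee 1: all three audits fall more than 150 days after the start date.
    (1, "2020-01-01", 55, 5, "2020-07-01"),
    (1, "2020-01-01", 55, 6, "2020-08-01"),
    (1, "2020-01-01", 55, 7, "2020-09-01"),
    # Employee 2: three audits, but only the last one is late enough.
    (2, "2020-01-01", 66, 5, "2020-02-01"),
    (2, "2020-01-01", 66, 6, "2020-03-01"),
    (2, "2020-01-01", 66, 7, "2020-09-01"),
])

query = """
SELECT c1.employee_id
FROM (
    SELECT employee_id, MAX(employee_start_date) AS employee_start_date
    FROM employees
    WHERE employee_salary_type IN (55, 66, 77)
    GROUP BY employee_id HAVING COUNT(*) >= 1
) AS c1
JOIN employees AS a ON a.employee_id = c1.employee_id
WHERE a.job_audit_id IN (5, 6, 7)
  -- Filter individual audits by date BEFORE they are counted.
  AND julianday(a.audit_date) > julianday(c1.employee_start_date) + 150
GROUP BY c1.employee_id
HAVING COUNT(*) >= 3
ORDER BY c1.employee_id
"""
qualifying = [row[0] for row in conn.execute(query)]
print(qualifying)
```

With this sample data the original MAX-based query would accept both employees, while the filtered count keeps only employee 1. The trade-off is that the second criteria block is no longer independent of the first, which may or may not fit the existing code generator.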

Related

How can I create a query to display the development of new created ERC-20 contracts on Ethereum with Dune without counting duplicates?

I am trying to display the development of newly created ERC-20 smart contracts on Ethereum. For this purpose I am using the analytics platform Dune, which provides several databases for SQL querying.
My problem is the following one:
I have a table that shows all transactions of a specific contract. Every transaction is one row of the table and contains the following columns: "date", "smart_contract_address" (the identifier for a unique ERC-20 smart contract), and other details of the transaction such as "amount".
Simplified example:
smart_contract_address | date | Amount
A | Q1/2022 | 50
B | Q1/2022 | 20
C | Q2/2022 | 10
A | Q1/2022 | 5
A | Q2/2022 | 7
I would like to count the different smart_contract_addresses per quarter. I want to make sure that every address is only counted once. After an address was counted it should be "ignored", not only if it appeared in the same quarter, but also in following ones.
For my example my expected query result would look like:
Quarter | Count
Q1/2022 | 2
Q2/2022 | 1
However, my query does not show the expected result. With the distinct keyword I make sure that an address is only counted once in every quarter, but it will be counted again in the following quarters...
Can you tell me how I need to adjust my query so that each address is counted only once, in the quarter where it appeared for the very first time?
with everything as (
select contract_address as ca, date_trunc('quarter', evt_block_time) as time
from erc20."ERC20_evt_Transfer"
)
select time, count(distinct ca) as "Count"
from everything
group by time
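Try this, which first finds the earliest quarter for each contract address and then counts addresses per quarter: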
with everything as (
select
contract_address as ca,
min(date_trunc('quarter', evt_block_time)) as time
from erc20."ERC20_evt_Transfer"
group by contract_address
)
select time, count(ca) as "Count"
from everything
group by time
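To check the idea against the simplified example from the question, here is a minimal sketch using an in-memory SQLite database. The `transfers` table and the text quarter labels such as `2022-Q1` are assumptions standing in for Dune's `erc20."ERC20_evt_Transfer"` table and `date_trunc('quarter', ...)`; the labels are chosen so that MIN on the text sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the Dune transfer table.
conn.execute(
    "CREATE TABLE transfers (contract_address TEXT, quarter TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO transfers VALUES (?, ?, ?)", [
    ("A", "2022-Q1", 50),
    ("B", "2022-Q1", 20),
    ("C", "2022-Q2", 10),
    ("A", "2022-Q1", 5),
    ("A", "2022-Q2", 7),
])

query = """
WITH first_seen AS (
    -- One row per address, carrying the first quarter it appeared in.
    SELECT contract_address AS ca, MIN(quarter) AS time
    FROM transfers
    GROUP BY contract_address
)
SELECT time, COUNT(ca) AS "Count"
FROM first_seen
GROUP BY time
ORDER BY time
"""
new_contracts = conn.execute(query).fetchall()
print(new_contracts)  # [('2022-Q1', 2), ('2022-Q2', 1)]
```

Because the inner query already yields exactly one row per address, the outer COUNT no longer needs DISTINCT.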

SQL query for percentage change compared to previous date

I have a table in Access containing the performance of departments on different reference dates. All data is in one table, "tblmain". The table contains the following fields:
reference date (called "ref_date", formatted dd.mm.yyyy)
department identifier (called "dep_id")
performance value (called "val")
Every reference date covers roughly 100 departments, and every week I import a new reference date.
My goal now is to build a query that calculates the percentage change from one reference date to the previous reference date. Furthermore, it should only show the departments with a change bigger than 5%.
I am currently stuck. I have created a query that gives me the val from the previous reference date, but only for one specific department, and I do not know how to continue. This query looks as follows:
SELECT TOP 1 tblmain.val
FROM (SELECT TOP 2 tblmain.val, tblmain.ref_date FROM tblmain WHERE dep_id=1 ORDER BY tblmain.ref_date DESC)
ORDER BY tblmain.ref_date;
I would appreciate any feedback. After finishing this query, I plan to use it in a form where I can choose a reference date and threshold.
Many thanks in advance!
Query to pull the prior val for each record (the subquery sorts descending so that TOP 1 picks the most recent earlier date rather than the oldest one):
SELECT tblMain.ID, tblMain.ref_date, tblMain.dep_id, tblMain.val,
(SELECT TOP 1 val FROM tblMain AS Dupe
WHERE Dupe.dep_id=tblMain.dep_id AND Dupe.ref_date < tblMain.ref_date
ORDER BY Dupe.ref_date DESC) AS PriorVal
FROM tblMain;
Now use that query (saved here as Query1) to calculate the percentage change and keep only changes above 5%:
SELECT Query1.*, Abs(([PriorVal]-[val])/[PriorVal]*100) AS P
FROM Query1
WHERE (((Abs(([PriorVal]-[val])/[PriorVal]*100))>5));
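The two steps can be tried out together in a small sketch. It uses SQLite instead of Access, so `TOP 1` becomes `LIMIT 1`, and it assumes the dates are stored as ISO text (`yyyy-mm-dd`) so that string comparison matches date order. The sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblmain (ref_date TEXT, dep_id INTEGER, val REAL)")
conn.executemany("INSERT INTO tblmain VALUES (?, ?, ?)", [
    ("2023-01-01", 1, 100.0), ("2023-01-08", 1, 110.0), ("2023-01-15", 1, 112.0),
    ("2023-01-01", 2, 100.0), ("2023-01-08", 2, 102.0), ("2023-01-15", 2, 120.0),
])

query = """
SELECT dep_id, ref_date, val, prior_val,
       ABS((prior_val - val) / prior_val * 100) AS pct_change
FROM (
    SELECT m.dep_id, m.ref_date, m.val,
           -- Most recent earlier value for the same department.
           (SELECT d.val
            FROM tblmain AS d
            WHERE d.dep_id = m.dep_id AND d.ref_date < m.ref_date
            ORDER BY d.ref_date DESC
            LIMIT 1) AS prior_val
    FROM tblmain AS m
)
-- Rows without a prior value yield NULL here and are dropped.
WHERE ABS((prior_val - val) / prior_val * 100) > 5
ORDER BY dep_id, ref_date
"""
flagged = conn.execute(query).fetchall()
for row in flagged:
    print(row)
```

With these rows, department 1 is flagged on 2023-01-08 (100 to 110) and department 2 on 2023-01-15 (102 to 120); the smaller moves stay below the threshold. A prior value of zero would need separate handling, which this sketch does not attempt.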

How to find count of items in a table by 2 groups

To keep it simple, I have a table with 3 columns:
Request Id (eg. REQ1, REQ2, and so on)
Status (possible values - In Planning, Work in Progress, In Review, Completed)
Due Type (possible values - Due Today, Due Tomorrow, This Week, Next Week, In 15 Days)
Now, all I want to find out is how to arrive at a result that tells me how many In Planning are Due Today, how many In Review are Due Tomorrow, and so on.
I tried using count with over and partition by, but that gives me the count of statuses and the count of due types separately, not a count for each combination of the two.
Are you just looking for aggregation?
select due_type, status, count(*)
from t
group by due_type, status;
If so, this is a basic SQL query. You should brush up on group by and other SQL fundamentals.
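For reference, a small runnable sketch of that aggregation, using an in-memory SQLite database. The table name `t` comes from the answer; the column names `request_id`, `status`, and `due_type` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (request_id TEXT, status TEXT, due_type TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("REQ1", "In Planning", "Due Today"),
    ("REQ2", "In Planning", "Due Today"),
    ("REQ3", "In Review", "Due Tomorrow"),
    ("REQ4", "Completed", "Due Today"),
])

# Grouping by both columns yields one count per (due_type, status) pair.
counts = conn.execute("""
    SELECT due_type, status, COUNT(*)
    FROM t
    GROUP BY due_type, status
    ORDER BY due_type, status
""").fetchall()
for row in counts:
    print(row)
```

Combinations with no matching requests simply do not appear in the result; showing them as zero would need a join against the full list of statuses and due types.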

Getting a snapshot of records where an "event" can mean several entries on the same date

This is really frustrating me.
So, I'm making a database recording people joining and leaving our office, as well as changing roles, in order to keep track of headcount. This is succinctly recorded in the following table:
EmployeeID | RoleID | FTE | Date
FTE is the proportion of full-time hours the role is worth (i.e. 1 is full-time, 0.5 is part-time, etc). Leaving events are recorded as changing the role to 0 (Absent) and FTE to 0. The trouble is, people can have more than one role, which means that the number of hours they actually worked is a composite of all the events for that employee that occur on the same day. So if someone goes from full time on one project to splitting their time between two projects, a ChangeRole event is logged for each.
So I want to know the total headcount on a monthly basis. Essentially the query I would want is "Select all records from this table where, for each EmployeeID, the date is the maximum date below a specified date." From there I can sum the FTE to get the headcount.
Now I can get some of those things in isolation: I can do max(date), and I can use a criterion such as <#dd/mm/yyyy#. But for some reason I can't seem to combine it all to get what I want, and I'm at a point where I've been staring at the problem so long that it doesn't make sense to me. Can anyone help me out? Thanks!
Something like this?
SELECT Events.*
FROM Events INNER JOIN (
SELECT EmployeeID, Max(Date) AS LatestDate
FROM Events
WHERE Events.Date < [Date entered]
GROUP BY EmployeeID) AS S
ON (Events.EmployeeID = S.EmployeeID) AND (Events.Date = S.LatestDate)
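As a rough check of how this join handles several entries on the same date, here is a sketch on an in-memory SQLite database that also sums the FTE. The `headcount` helper, the ISO text dates, and the sample events are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Events (EmployeeID INTEGER, RoleID INTEGER, FTE REAL, Date TEXT)"
)
conn.executemany("INSERT INTO Events VALUES (?, ?, ?, ?)", [
    (1, 1, 1.0, "2023-01-01"),  # employee 1 starts full time
    (1, 1, 0.5, "2023-03-01"),  # ...then splits time across two roles
    (1, 2, 0.5, "2023-03-01"),
    (2, 3, 0.5, "2023-02-01"),  # employee 2 starts part time
    (2, 0, 0.0, "2023-04-01"),  # ...and leaves (role 0, FTE 0)
])

def headcount(cutoff):
    """Sum FTE over each employee's latest event date before the cutoff."""
    row = conn.execute("""
        SELECT COALESCE(SUM(e.FTE), 0)
        FROM Events AS e
        INNER JOIN (
            SELECT EmployeeID, MAX(Date) AS LatestDate
            FROM Events
            WHERE Date < ?
            GROUP BY EmployeeID
        ) AS s
        ON e.EmployeeID = s.EmployeeID AND e.Date = s.LatestDate
    """, (cutoff,)).fetchone()
    return row[0]

print(headcount("2023-03-15"))  # 1.5
print(headcount("2023-05-01"))  # 1.0
```

Joining on the date rather than picking a single row is what keeps both of employee 1's rows from 2023-03-01 in the total.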

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload to it with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought about this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me exactly what query you are trying to run. It looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our Wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used
FROM [publicdata:samples.wikipedia]
WHERE (contributor_username != '' AND contributor_username IS NOT NULL)
AND timestamp > 1133395200
AND timestamp < 1157068800
GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide Wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below): a similar query that finds "num_characters" for the latest Wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time
FROM [publicdata:samples.wikipedia]
WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time
FROM [publicdata:samples.wikipedia]
WHERE contributor_username != '' AND contributor_username IS NOT NULL
AND timestamp > 1265073722
GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you need to run essentially an accurate COUNT DISTINCT), another option is to break this query up into two queries. The first query could provide the max effective date by month along with a count and save this result as a new table. Then you could run a sum query on the resulting table.
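A loose sketch of that two-step shape, using an in-memory SQLite database rather than BigQuery (so none of this is BigQuery syntax): the first statement saves the latest version of each order to a new table, and the second runs the monthly sum on that table. The `orders` and `latest_orders` tables, their columns, and the sample rows are assumptions modeled on the question's `foo.orders`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Append-only table: a correction is a new row with a higher effdt.
conn.execute("CREATE TABLE orders (id INTEGER, effdt INTEGER, ts TEXT, amt INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 1, "2012-01-15", 1000),
    (1, 2, "2012-01-15", 1200),  # corrected version of order 1
    (2, 1, "2012-02-10", 500),
])

# Step 1: keep only the row with the max effective date per id, saved as a table.
conn.execute("""
    CREATE TABLE latest_orders AS
    SELECT o.id, o.ts, o.amt
    FROM orders AS o
    JOIN (SELECT id, MAX(effdt) AS max_effdt FROM orders GROUP BY id) AS latest
      ON o.id = latest.id AND o.effdt = latest.max_effdt
""")

# Step 2: run the monthly sum on the much smaller intermediate table.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', ts) AS month, SUM(amt) / 100.0 AS sales
    FROM latest_orders
    GROUP BY month
    ORDER BY month
""").fetchall()
print(monthly)  # [('2012-01', 12.0), ('2012-02', 5.0)]
```

The superseded 1000 row for order 1 never reaches the sum, which is the point of resolving the effective date before aggregating.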
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.