How can I write this summing query? - sql

I didn't design this table, and I would redesign it if I could, but that's not an option for me.
I have this table:
Transactions
Index --PK, auto increment
Tenant --this is a fk to another table
AmountCharged
AmountPaid
Balance
Other Data
The software that is used calculates the balance each time from the previous balance like this:
previousBalance - (AmountPaid - AmountCharged)
Balance is how much the tenant really owes.
However, the program runs on Access with concurrent users, and it gets the balances wrong. Big time.
For example: I have a tenant that looks like this:
Amount Charged | Amount Paid | Balance
           350 |           0 |     350
           440 |           0 |     790
             0 |         350 |    -350 !
             0 |         440 |    -790
I want to go through and reset all the balances to what they should be, so I'd have some sort of running total. I don't know if Access can use variables the way stored procedures can.
I don't even know how to start on this; I'd assume it'd be a query with a subquery to sum all the charges/payments before its Index, but I don't know how to write it.
How can I do this?
Edit:
I am using Access 97

Assuming Index is incremental, and higher values --> later transaction dates, you can use a self-join with a >= condition in the join clause, something like this:
select
    a.[Index],
    max(a.[Tenant]) as [Tenant],
    max(a.[AmountCharged]) as [AmountCharged],
    max(a.[AmountPaid]) as [AmountPaid],
    sum(
        iif(isnull(b.[AmountCharged]), 0, b.[AmountCharged]) -
        iif(isnull(b.[AmountPaid]), 0, b.[AmountPaid])
    ) as [Balance]
from
    [Transactions] as a
    left outer join [Transactions] as b
        on a.[Tenant] = b.[Tenant]
        and a.[Index] >= b.[Index]
group by
    a.[Index];
Access SQL is fiddly; there may be some syntax errors above, but that's the general idea. To create this query in the query designer, add the Transactions table twice, join them on Tenant and Index, and then edit the join (if possible).
You could do the same with a subquery, something like:
select
    a.[Index],
    a.[Tenant],
    a.[AmountCharged],
    a.[AmountPaid],
    (
        select
            sum(
                iif(isnull(b.[AmountCharged]), 0, b.[AmountCharged]) -
                iif(isnull(b.[AmountPaid]), 0, b.[AmountPaid])
            )
        from
            [Transactions] as b
        where
            a.[Tenant] = b.[Tenant]
            and a.[Index] >= b.[Index]
    ) as [Balance]
from
    [Transactions] as a;
Once you have calculated the proper balances, use an update query to update the table, by joining the Transactions table to the select query defined above on Index. You could probably combine it into one update query, but that would make it more difficult to test.
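One wrinkle: Access usually refuses to run an UPDATE that joins to an aggregate query ("Operation must use an updateable query"). A sketch of a workaround using DSum instead - slow on big tables, but fine for a one-off repair, and assuming Tenant and Index are numeric:
UPDATE [Transactions]
SET [Balance] = DSum(
    "IIf(IsNull([AmountCharged]),0,[AmountCharged])-IIf(IsNull([AmountPaid]),0,[AmountPaid])",
    "Transactions",
    "[Tenant]=" & [Tenant] & " AND [Index]<=" & [Index]);
(If Tenant is text, wrap the value in quotes when building the criteria string.)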

If all the records have a sequencing number (with no gaps in between), you can try the following: create a query where you link the table to itself. In the join, you specify that you want to link the tables with Id = Id - 1. That way, you link each record to its previous record.
If you do not have a column that can be used for this, try adding an autonumber column.
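For example, a rough sketch against the Transactions table above, assuming the Index values are gapless within each tenant:
SELECT cur.[Index], cur.[Tenant], cur.[AmountCharged], cur.[AmountPaid],
       prev.[Balance] AS PreviousBalance
FROM [Transactions] AS cur
LEFT JOIN [Transactions] AS prev
    ON cur.[Tenant] = prev.[Tenant]
    AND cur.[Index] = prev.[Index] + 1;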
Another option is to write a few simple lines of VBA to loop over the records and update the values. If it is a one-off operation, I think that will be the easiest approach if you are not very experienced with SQL.

Related

Query build to find records where all of a series of records have a value

Let me explain a little bit about what I am trying to do, because I don't even know the vocabulary to use to ask. I have an Access 2016 database that records staff QA data. When a staff member misses a QA, we assign a job aid that explains the process, and they can optionally send back a worksheet showing they learned about what was missed. If they do all of these in a 3 month period they get a credit on their QA score. So I have a series of records, all of which have a date we assigned the work (RA1) and MAY have a work returned date (RC1).
In the image below, "lavalleer" has earned the credit because both of her sheets got returned. "maduncn" did not earn the credit because he didn't do one.
I want to create a query that returns to me only the people that are like "lavalleer". I tried hitting Google and searched here and access.programmers.co.uk, but I'm only coming up with instructions to use IS NOT NULL statements. That wouldn't work for me, because if I did an IS NOT NULL on "maduncn" I would get the 4 records but it would exclude the null one.
What I need to do is build a query where I can see staff that have dates in ALL of their RC1 fields. If any of their RC1 fields are blank, I don't want them to return.
Consider:
SELECT * FROM tablename WHERE NOT UserLogin IN (SELECT UserLogin FROM tablename WHERE RC1 IS NULL);
You could use a not exists clause with a correlated subquery, e.g.
select t.* from YourTable t where not exists
(select 1 from YourTable u where t.userlogin = u.userlogin and u.rc1 is null)
Here, select 1 is used purely for optimisation - we don't care what the query returns, just that it has records (or doesn't have records).
Or, you could use a left join to exclude those users for which there is a null rc1 record, e.g.:
select t.* from YourTable t left join
(select u.userlogin from YourTable u where u.rc1 is null) v on t.userlogin = v.userlogin
where v.userlogin is null
In all of the above, change all occurrences of YourTable to the name of your table.

SQL Server Debit, Credit Query for imbalance

I have a Transaction table that has id, postDate, account, debit, credit columns.
In SQL Server I'm trying to write a union or join query that will highlight imbalances in this table. An imbalance is defined as any credit for an account that does not have a corresponding debit for another account. This table has millions of rows, and some transactions in it are duplicates that need to be removed on a case-by-case basis, but I need a query that will highlight them.
I started with a union as a first step. The finer points of how to highlight imbalances are what I need. I basically want an indicator on the row if the record is a duplicate. I'd also like to add a running balance column to the query as well... If the approach I am taking is naive, it is because I am a novice at advanced joins. I have stubbed out the column but hardcoded string data in it.
SELECT b.id,
       b.account,
       b.credit,
       b.debit,
       b.duplicate
FROM (SELECT t.id,
             t.account,
             t.credit,
             t.debit,
             'false' AS duplicate
      FROM TransactionRegister t
      UNION ALL
      SELECT t.id,
             t.account,
             t.debit,
             t.credit,
             'false' AS duplicate
      FROM TransactionRegister t) AS b
ORDER BY id DESC
Here is a sample of the table, or two transactions. Typically a debit transaction for one account will always have a matching credit to another account. I just want to show credits without a matching debit and vice versa, and let the user decide what to do or whether it is legit. In this system it also appears that the matching transaction is always the next number in the sequence, but I cannot be totally sure that is 100% the case. Out of a million records, if the user sees 100 that are not balancing, they can easily digest that.
ID | Date       | Account | Credit | Debit
1  | 01/22/2018 | 22222   | 13500  | 0
2  | 01/22/2018 | 11111   | 0      | 13500
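As a starting point, here is a hedged sketch of the "credit without a matching debit" check, assuming a match means a row on the opposite side with the same date and amount in a different account (the real matching rule may need tightening, e.g. to use the next sequential id):
-- credits with no corresponding debit; mirror the query to check the other direction
SELECT c.id, c.postDate, c.account, c.credit
FROM TransactionRegister c
WHERE c.credit > 0
  AND NOT EXISTS (SELECT 1
                  FROM TransactionRegister d
                  WHERE d.debit = c.credit
                    AND d.postDate = c.postDate
                    AND d.account <> c.account);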

Sum column total based on criteria sql

I am not even sure if I am asking this question correctly. Algorithmically I know what I want to do, but don't know the appropriate syntax in SQL.
I have created a table that contains total online session times by customer number, IP, session start time, and total session length. Here is an example of what this table looks like (IP and CustNo are masked; also not sure how to make tables, so excuse the weirdness):
CustNo | minDate                 | maxDate                 | ClientIp | timeDiff
123456 | 2017-11-14-02:39:27.093 | 2017-11-14-02:39:59.213 | 1.1.1.1  | 0.000372
I then create another table looking for a specific type of activity and want to know how long this specific user has used that IP for before this specific activity. The second table contains each activity as a separate row, customerID, IP and a timestamp.
Up to here no issue and the tables look fine.
I now need to write the part that will look into the first table based on customer ID and IP, then sum all usage of that IP for that customer as long as the session min start time is less than the activity time, but I have no idea how to do this. Here is the current attempt (not working, obviously). I am doing a left join because it is possible this will be a new IP and it may not be in the first table.
SELECT
*,
SUM(##finalSessionSums.timeDiff)
FROM
##allTransfersToDiffReceip
LEFT JOIN
##finalSessionSums ON ##allTransfersToDiffReceip.CustNo = ##finalSessionSums.CustNo
AND ##allTransfersToDiffReceip.ClientIp = ##finalSessionSums.ClientIp
AND ##allTransfersToDiffReceip.[DateTime] < ##finalSessionSums.minDate
I get an aggregate function error here but I don't know how to approach this at all.
You have a SELECT * (return all columns) alongside an aggregate function (in this case SUM). Whenever you combine specific columns with aggregated values, you need to list each non-aggregated column from the SELECT clause in the GROUP BY clause. For example:
SELECT
A, B, SUM(C) as CSum
FROM
Table
GROUP BY
A, B
Given the limited information, I can't provide a perfect solution, but I'll give it a try:
First, as Alan mentioned, you have to select only the columns that you need for your aggregate function, which here are CustNo and ClientIp. To get the sums, you have to group the query like this:
SELECT s.CustNo, s.ClientIp, SUM(s.timeDiff) AS totalTime
FROM ##finalSessionSums s
INNER JOIN ##allTransfersToDiffReceip a
    ON a.CustNo = s.CustNo
    AND a.ClientIp = s.ClientIp
    AND a.[DateTime] < s.minDate
GROUP BY s.CustNo, s.ClientIp;
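Since the question calls for a left join so that activities on a brand-new IP still appear, a variant keeping the activity table on the left could look like the sketch below; note it also flips the date test to match the stated requirement that the session starts before the activity, which is the reverse of the inequality in the code above:
-- unmatched activities come back with a NULL totalTime
SELECT a.CustNo, a.ClientIp, SUM(s.timeDiff) AS totalTime
FROM ##allTransfersToDiffReceip a
LEFT JOIN ##finalSessionSums s
    ON a.CustNo = s.CustNo
    AND a.ClientIp = s.ClientIp
    AND s.minDate < a.[DateTime]
GROUP BY a.CustNo, a.ClientIp;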

Need help wrapping head around joins

I have a database of a service that helps people sell things. If they fail a delivery of a sale, they get penalised. I am trying to extract the number of active listings each user had when a particular penalty was applied.
I have the equivalent to the following tables(and relevant fields):
user (id)
listing (id, user_id, status)
transaction (listing_id, seller_id)
listing_history (id, listing_status, date_created)
penalty (id, transaction_id, user_id, date_created)
The listing_history table saves an entry every time a listing is modified, saving a record of what the new state of the listing is.
My goal is to end up with a result table with two fields: penalty_id, and the number of active listings the penalised user had when the penalty was applied.
So far I have the following:
SELECT s1.penalty_id,
COUNT(s1.record_id) 'active_listings'
FROM (
SELECT penalty.id AS 'penalty_id',
listing_history.id AS 'record_id',
FROM user
JOIN penalty ON penalty.user_id = user.id
JOIN transaction ON transaction.id = penalty.transaction_id
JOIN listing_history ON listing_history.listing_id = listing.id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
) s1
GROUP BY s1.penalty_id
Status = 0 means that the listing is active (or that the listing was active at the time the record was created). I got results similar to what I expected, but I fear I may be missing something or may be doing the JOINs wrong. Would this have your approval? (Apart from the obvious non-use of aliases, which causes clarity problems.)
UPDATE - As the comments on this answer indicate that changing the table structure isn't an option, here are more details on some queries you could use with the existing structure.
Note that I made a couple changes to the query before even modifying the logic.
As viki888 pointed out, there was a problematic reference to listing.id; I've replaced it.
There was no real need for a subquery in the original query; I've simplified it out.
So the original query is rewritten as
SELECT penalty.id AS 'penalty_id'
, COUNT(listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
GROUP BY penalty.id
Now the most natural way, in my opinion, to write the corrected timeline constraint is with a NOT EXISTS condition that filters out all but the most recent listing_history record for a given id. This does require thinking about some edge cases:
Could two listing history records have the same create date? If so, how do you decide which happened first?
If a listing history record is created on the same day as the penalty, which is treated as happening first?
If the created_date is really a timestamp, then this may not matter much (if at all); if it's really a date, it might be a bigger issue. Since your original query required that the listing history be created before the penalty, I'll continue in that style; but it's still ambiguous how to handle the case where two history records with matching status have the same date. You may need to adjust the date comparisons to get the desired behavior.
SELECT penalty.id AS 'penalty_id'
, COUNT(DISTINCT listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
AND NOT EXISTS (SELECT 1
FROM listing_history h2
WHERE listing_history.date_created < h2.date_created
AND h2.date_created < penalty.date_created
AND h2.listing_id = listing_history.listing_id)
GROUP BY penalty.id
Note that I switched from COUNT(...) to COUNT(DISTINCT ...); this helps with some edge cases where two active records for the same listing might be counted.
If you change the date comparisons to use <= instead of < - or, equivalently, if you use BETWEEN to combine the date comparisons - then you'd want to add AND h2.status != 0 (or AND h2.status <> 0, depending on your database) to the subquery so that two concurrent ACTIVE records don't cancel each other out.
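For concreteness, a sketch of that adjusted subquery (using the listing_id correlation from the query above):
AND NOT EXISTS (SELECT 1
                FROM listing_history h2
                WHERE listing_history.date_created <= h2.date_created
                  AND h2.date_created <= penalty.date_created
                  AND h2.listing_id = listing_history.listing_id
                  AND h2.status <> 0)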
There are several equivalent ways to write this, and unfortunately it's the kind of query that doesn't always cooperate with a database query optimizer, so some trial and error may be necessary to make it run well with large data volumes. Hopefully that gives enough insight into the intended logic that you could work out some equivalents if need be. You could consider using NOT IN instead of NOT EXISTS; or you could use an outer join to a second instance of LISTING_HISTORY... There are probably others I'm not thinking of offhand.
I don't know that we're in a position to sign off on a general statement that the query is, or is not, "correct". If there's a specific question about whether a query will include/exclude a record in a specific situation (or why it does/doesn't, or how to modify it so it won't/will), those might get more complete answers.
I can say that there are a couple likely issues:
The only glaring logic issue has to do with timeline management, which is something that causes a lot of trouble with SQL. The issue is, while your query demonstrates that the listing was active at some point before the penalty creation date, it doesn't demonstrate that the listing was still active on the penalty creation date. Consider
PENALTY
id transaction date
1 10 2016-02-01
TRANSACTION
id listing_id
10 100
LISTING_HISTORY
listing_id status date
100 0 2016-01-01
100 1 2016-01-15
The joins would create a single record, and the count for penalty 1 would include listing 100 even though its status had changed to something other than 0 before the penalty was created.
This is hard - but not impossible - to fix with your existing table structure. You could add a NOT EXISTS condition looking for another LISTING_HISTORY record matching the ID with a date between the first LISTING_HISTORY date and the PENALTY date, for one.
It would be more efficient to add an end date to the LISTING_HISTORY table, but that may not be so easy depending on how the data is maintained.
The second potential issue is the COUNT(RECORD_ID). This may not do what you mean; what COUNT(x) may intuitively seem like it should do is what COUNT(DISTINCT RECORD_ID) actually does. As written, if the join produces two matches with the same LISTING_HISTORY.ID value - i.e. the listing became active at two different times before the penalty - the listing would be counted twice.

Creating a denormalized table from a normalized key-value table using 100s of joins

I have an ETL process which takes values from an input table - a key-value table where each row has a field ID - and turns it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields, and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easier to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
    (SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv
     WHERE sfv.FieldId = 1 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) AS NumberOfClasses,
    (SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv
     WHERE sfv.FieldId = 2 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) AS MinutesLateToSchool,
    ...
INTO StudentDays
FROM Students s CROSS JOIN
    Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).
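In SQL Server that could be written as a covering index, something like this (the index name is just illustrative):
CREATE INDEX IX_StudentFieldValues_Student_Day_Field
    ON StudentFieldValues (StudentId, Day, FieldId)
    INCLUDE (Value);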
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, you need to query it as a set of rows, as it is stored in the database, and fetch the result set back into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works but here are some things you can try:
Disable indexes and re-enable after data load is complete
Disable any triggers that don't need to be run in data-load scenarios.
The above was taken from an MSDN post where someone was doing something similar to what you are.
Think about trying to update the denormalized table based only on changed records, if that is possible; limiting the result set would be much more efficient (see the sketch after this list).
You could try a more threaded, iterative approach in code (C#, VB, etc.) to build this table by student, where you aren't doing the X number of joins all at one time.
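On the changed-records idea above: a rough sketch, assuming the changed (StudentId, Day) pairs can be captured in a hypothetical staging table ChangedStudentDays, would be to delete and rebuild just those keys:
-- ChangedStudentDays(StudentId, Day) is a hypothetical staging table of changed keys
DELETE sd
FROM StudentDays sd
INNER JOIN ChangedStudentDays c
    ON sd.StudentId = c.StudentId AND sd.Day = c.Day;

INSERT INTO StudentDays (StudentId, Day, NumberOfClasses, MinutesLateToSchool) -- plus the remaining fields
SELECT c.StudentId, c.Day,
       MAX(CASE WHEN sfv.FieldId = 1 THEN sfv.Value END),
       MAX(CASE WHEN sfv.FieldId = 2 THEN sfv.Value END)
FROM ChangedStudentDays c
LEFT JOIN StudentFieldValues sfv
    ON c.StudentId = sfv.StudentId AND c.Day = sfv.Day
GROUP BY c.StudentId, c.Day;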