BigQuery: how to omit data based on multiple results - google-bigquery

I have a query that will bring back results similar to below
Account Number | Contract | Status
----------------------------------
000001 | 0123 | Live
000001 | 0124 | Live
000001 | 0125 | Dead
000002 | 0125 | Dead
000002 | 0125 | Dead
What I want to do is omit all results for an account if at least one row in the group with the same account number is "Live".
So my result should look something like this, omitting account 000001 altogether, as there is at least one Live contract within it:
Account Number | Contract | Status
----------------------------------
000002 | 0125 | Dead
000002 | 0125 | Dead
Is this possible and how would I achieve it?

Use the approach below:
select *
from your_table
qualify countif(Status = 'Live') over(partition by AccountNumber) = 0
Applied to the sample data in your question, this returns only the two rows for account 000002, because account 000001 has at least one Live contract.
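For readers who want to try the logic outside BigQuery, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no QUALIFY or COUNTIF, so this swaps in a grouped subquery with HAVING that expresses the same "zero Live rows per account" condition; it is not the answer's exact query. The table name contracts is an assumption.

```python
import sqlite3

# In-memory database holding the question's sample rows.
# The table name "contracts" is a hypothetical stand-in for your_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (AccountNumber TEXT, Contract TEXT, Status TEXT)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [
        ("000001", "0123", "Live"),
        ("000001", "0124", "Live"),
        ("000001", "0125", "Dead"),
        ("000002", "0125", "Dead"),
        ("000002", "0125", "Dead"),
    ],
)

# Keep only accounts whose number of 'Live' rows is zero.
# SUM(Status = 'Live') works because SQLite evaluates a comparison to 0 or 1.
rows = conn.execute(
    """
    SELECT AccountNumber, Contract, Status
    FROM contracts
    WHERE AccountNumber IN (
        SELECT AccountNumber
        FROM contracts
        GROUP BY AccountNumber
        HAVING SUM(Status = 'Live') = 0
    )
    ORDER BY AccountNumber, Contract
    """
).fetchall()

for row in rows:
    print(row)
```

In BigQuery itself the QUALIFY form above is shorter, since it filters on the window function directly without a subquery.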

Related

SQL Choose rows based on infant ID and whether the infant's mother ID is in the row or not

I'm very new to SQL and this is only my second post on stackoverflow. I'm trying to follow the rules but please excuse my n00bness. Thanks in advance for your time and help. I'm using MS Access.
I'm studying infant social development and mother-infant interactions. To make this easier to understand, I simplified all of the following:
I have 2 tables: biography and interactions. Biography consists of the infant identity code, date of birth of the infant, and the infant's mother. Interactions consists of data collected while observing the infants and their mothers (as well as peers). I observe an infant for a set amount of time and record their behavior at each timestamp. If the behavior involves a partner (I'm specifically interested in play behavior) I include the identity of the partner.
What I would like to do is take out all the "play" rows in which the mother is the only play partner (because I'm interested in when the infant plays with peers rather than the mother). I want to include rows in which the infant is playing with the mother AND a peer (because this counts as playing with a peer). I think this entails relating the two tables using the mom column of each infant's id. I think, in English, this could be described as: Exclude play rows where mom is the only play partner. It's important to note that who the mother is obviously depends on who the infant being observed is.
As you can see below, sometimes there are multiple play partners. Again, I do want to include rows such as the last few, where cc is playing with its mother AND aa. The partner IDs are usually separated by a space, but sometimes there are typos and there is no space or more than one space. There may even be some commas. But the ID codes are consistent and will always be typed correctly. The dataset includes tens of thousands of lines, so I'm wondering if there is an efficient way to complete this task. The tables are visualized below:
biography
id | dob        | mom
-------------------------
aa | 2015-01-01 | mom_a
bb | 2016-01-01 | mom_b
cc | 2017-01-01 | mom_c
interactions
id | behavior | partner     | time
---------------------------------------------
aa | play     | mom_a       | 12:00
aa | rest     |             | 12:05
aa | play     | bb          | 12:10
aa | play     | bb          | 12:15
aa | rest     |             | 12:20
bb | rest     |             | 13:00
bb | rest     |             | 13:05
bb | play     | mom_b       | 13:10
bb | play     | cc          | 13:15
bb | rest     |             | 13:20
cc | rest     |             | 14:00
cc | play     | aa bb       | 14:05
cc | play     | mom_c aa bb | 14:10
cc | play     | mom_c aa    | 14:15
cc | play     | mom_c aa    | 14:20
cc | play     | mom_c aa    | 14:25
I think the query below may work, though I haven't tried it yet.
select * from biography bio
inner join interactions itr
on bio.id = itr.id
where itr.partner not like bio.mom
Please note that if your interactions table is going to hold a large amount of data, a "not like" clause will degrade performance. Also, you may wish to normalize the interactions table so that you do not need to keep all the partners in the same row.
As mentioned in the comments, you should normalize your data structure. But the following should work with what you have:
SELECT *
FROM interactions i
WHERE behavior = 'play'
AND NOT EXISTS
(SELECT 1
FROM biography b
WHERE i.partner = b.mom
AND i.id = b.id)
So take the rows in interactions where:
Behavior = play
Partner does not have an exact match in biography.mom for the same id
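To check the NOT EXISTS logic against the sample data, here is a minimal sketch using Python's built-in sqlite3 module rather than MS Access, so dialect details may differ. Only a subset of the interactions rows is loaded, and the empty partner on "rest" rows is stored as NULL, which is an assumption about how the real table holds blanks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE biography (id TEXT, dob TEXT, mom TEXT);
    CREATE TABLE interactions (id TEXT, behavior TEXT, partner TEXT, time TEXT);

    INSERT INTO biography VALUES
        ('aa', '2015-01-01', 'mom_a'),
        ('bb', '2016-01-01', 'mom_b'),
        ('cc', '2017-01-01', 'mom_c');

    -- A subset of the question's sample rows.
    INSERT INTO interactions VALUES
        ('aa', 'play', 'mom_a',    '12:00'),
        ('aa', 'rest', NULL,       '12:05'),
        ('aa', 'play', 'bb',       '12:10'),
        ('bb', 'play', 'mom_b',    '13:10'),
        ('bb', 'play', 'cc',       '13:15'),
        ('cc', 'play', 'aa bb',    '14:05'),
        ('cc', 'play', 'mom_c aa', '14:15');
    """
)

# Play rows whose partner field is NOT exactly the observed infant's mother.
rows = conn.execute(
    """
    SELECT i.id, i.behavior, i.partner, i.time
    FROM interactions i
    WHERE i.behavior = 'play'
      AND NOT EXISTS (
          SELECT 1
          FROM biography b
          WHERE i.partner = b.mom
            AND i.id = b.id
      )
    ORDER BY i.time
    """
).fetchall()

partners = [partner for _id, _behavior, partner, _time in rows]
print(partners)
```

Rows such as "mom_c aa" survive because the comparison is an exact match on the whole partner field. A mother-only entry with stray spaces or commas would not match exactly, so trimming the field first may be needed on the real data.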

Extract different rates associated with one ID

I have a single loan database with a user_id, loan_id, interest_rate, loan_date and other stuff that isn't relevant here.
How would I extract all the user_ids of those who took out at least two loans and got the later ones at better interest rates?
select member_id, Annual_interest_rate, count(*)
from (select member_id, Annual_interest_rate, count(*)
from loan_book
group by member_id
having count(*)>1)
group by member_id, Annual_interest_rate
It shows the stuff from the subquery, but with count 1 instead of count 2.
Does the subquery destroy the necessary info? Is there a way to write it as one query?
Sample table:
user | loan | air | date
------------------------
0001 | 2345 | 2.6 | 09/03
0002 | 1346 | 2.6 | 03/05
0003 | 1118 | 3.7 | 05/03
0002 | 6756 | 1.2 | 05/08
0003 | 1286 | 3.2 | 01/10
0001 | 2222 | 3.0 | 09/11
The result would be:
user | loan | air | date
------------------------
0002 | 6756 | 1.2 | 05/08
0003 | 1286 | 3.2 | 01/10
as those were the two loans that had better interest rates than their predecessors. If there are more than two loans, then the ones that were better than at least one of their predecessors should show.
Here is a query that might work or at least the approach might help with some ideas.
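SELECT DISTINCT LB2.* -- DISTINCT: a loan that beats several earlier loans is listed once
FROM loan_book LB1 INNER JOIN loan_book LB2
ON LB1.user_id = LB2.user_id
AND LB1.loan_id != LB2.loan_id
AND LB1.loan_date < LB2.loan_date -- LB2 is the later loan
AND LB1.interest_rate > LB2.interest_rate -- and it has the lower (better) rate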
You join the table with itself so each user will have two loans in each row and then you can do the necessary comparisons and groupings from the result. Hope this helps.
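The self-join can be tried on the question's sample rows with Python's built-in sqlite3 module. This is a sketch under assumptions: the column names follow the answer's query, and the sample dates (which look like MM/YY) are converted to ISO strings so that plain text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loan_book (user_id TEXT, loan_id TEXT, interest_rate REAL, loan_date TEXT)"
)
conn.executemany(
    "INSERT INTO loan_book VALUES (?, ?, ?, ?)",
    [
        # Dates assumed to mean MM/YY, rewritten as ISO strings.
        ("0001", "2345", 2.6, "2003-09-01"),
        ("0002", "1346", 2.6, "2005-03-01"),
        ("0003", "1118", 3.7, "2003-05-01"),
        ("0002", "6756", 1.2, "2008-05-01"),
        ("0003", "1286", 3.2, "2010-01-01"),
        ("0001", "2222", 3.0, "2011-09-01"),
    ],
)

# Pair every loan (LB1) with each later loan of the same user (LB2)
# and keep the later loan only when its rate is lower.
rows = conn.execute(
    """
    SELECT DISTINCT LB2.user_id, LB2.loan_id, LB2.interest_rate, LB2.loan_date
    FROM loan_book LB1
    INNER JOIN loan_book LB2
        ON LB1.user_id = LB2.user_id
       AND LB1.loan_date < LB2.loan_date
       AND LB1.interest_rate > LB2.interest_rate
    ORDER BY LB2.user_id, LB2.loan_date
    """
).fetchall()

for row in rows:
    print(row)
```

User 0001 drops out because the later loan (3.0) is worse than the earlier one (2.6). DISTINCT keeps a loan from appearing more than once when it improves on several predecessors.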

Oracle SQL: Grouping and allocating data based on 3 criteria

Quite a long-winded question:
As a hypothetical situation, I am trying to split a table of data between two companies: OPM and MON.
|NAME |ACCOUNT |BALANCE |COMPANY
_______________________________________________
|SMITH |11111 |100 |
|SMITH |11111 |150 |
|HUNTER |11121 |200 |
|HUNTER |11131 |250 |
|LITTLE |11141 |300 |
|RIDLEY |11151 |300 |
|RIDLEY |11151 |100 |
|ARMSTRONG |11161 |150 |
|ARMSTRONG |11171 |150 |
|HENRY |11181 |100 |
There are several scenarios in the customer data:
1. Customer has two accounts, both with the same account number but with different balances.
2. Customer has two accounts, with different account numbers and different balances.
3. Customer has one account, one account number, one balance.
I need to write out logic in SQL / PL-SQL that enables the data to fulfill an allocation to either of the two different companies and that also follows rules.
A customer, regardless of how many accounts, must be allocated to the same company.
The value of accounts must be roughly equal.
The volume of accounts must be roughly equal.
I accept the limitations of the data I have provided, but the logic must extrapolate to larger datasets. What is the best method to achieve this?
What you are trying to do is a bin-packing problem, and this is generally hard. However, you simply state that the two groups need to be approximately equal. So, I would suggest adding up the balances for each customer and taking a stratified sample:
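select name, balance,
(case when mod(seqnum, 2) = 0 then 'Company1' else 'Company2' end) as company
from (select name, sum(balance) as balance,
-- rank customers by total balance; the aggregate is spelled out here
-- because the raw balance column is not a GROUP BY expression
row_number() over (order by sum(balance)) as seqnum
from your_table t -- placeholder name; "table" itself is a reserved word
group by name
) n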
Note: this is an approximate approach. It puts half the customers (names) in each group by alternating down the list ordered by total balance, so the two groups should end up with similar total balances. There are many cases where this will not produce an optimal solution (such as when one "name" has very large balances compared to everyone else), but it might be good enough.
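The alternating allocation can be sketched in plain Python on the question's sample data to see how balanced the split comes out. The mapping of odd and even sequence numbers to OPM and MON is an assumption, as is breaking ties in total balance by name.

```python
from collections import defaultdict

# (name, account, balance) rows from the question's sample table.
accounts = [
    ("SMITH", "11111", 100), ("SMITH", "11111", 150),
    ("HUNTER", "11121", 200), ("HUNTER", "11131", 250),
    ("LITTLE", "11141", 300),
    ("RIDLEY", "11151", 300), ("RIDLEY", "11151", 100),
    ("ARMSTRONG", "11161", 150), ("ARMSTRONG", "11171", 150),
    ("HENRY", "11181", 100),
]

# Step 1: total balance per customer, so a customer is never split.
totals = defaultdict(int)
for name, _account, balance in accounts:
    totals[name] += balance

# Step 2: order customers by total balance (ties broken by name).
ranked = sorted(totals.items(), key=lambda item: (item[1], item[0]))

# Step 3: alternate down the ordered list, like mod(seqnum, 2) in the query.
allocation = {}
for seqnum, (name, _total) in enumerate(ranked, start=1):
    allocation[name] = "MON" if seqnum % 2 == 0 else "OPM"

# Summarise value and volume per company.
company_totals = defaultdict(int)
company_rows = defaultdict(int)
for name, _account, balance in accounts:
    company_totals[allocation[name]] += balance
    company_rows[allocation[name]] += 1

print(dict(company_totals))
print(dict(company_rows))
```

On this sample each company receives three customers and five account rows, with total balances of 800 and 1000. That is roughly equal but not optimal, which matches the caveat in the note above.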

Items getting double-counted in SQL Server, dependent counting logic not working right

I am counting the number of RFIs (requests for info) from various agencies. Some of these agencies are also part of a task force (committee). Currently this SQL combines the agencies and task forces into one list and counts the RFIs for each. The problem is, if the RFI belongs to a task force (which is also assigned to an agency), I only want it to count for the task force and not for the agency. However, if the agency does not have a task force assigned to the RFI, I want it to still count for the agency. The RFIs are linked to various agencies through a _LinkEnd table, but that logic works just fine. Here is the logic thus far:
SELECT t.Submitting_Agency, COUNT(DISTINCT t.Count) AS RFICount
FROM (
SELECT RFI_.Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
UNION ALL
SELECT RFI_.Task_Force__Initiative AS Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630' AND RFI_.Task_Force__Initiative IS NOT NULL) t
GROUP BY t.Submitting_Agency
How can I get it to only count an RFI one time, even though the two fields are combined? For instance, here are sample records from the RFI_ table:
---------------------------------------------------------------------------
| Unique_ID | Submitting_Agency | Task_Force__Initiative | Date_Submitted |
---------------------------------------------------------------------------
| 1 | Social Service | Flood Relief TF | 2011-05-08 |
---------------------------------------------------------------------------
| 2 | Faith-Based Init. | Homeless Shelter Min. | 2011-06-08 |
---------------------------------------------------------------------------
| 3 | Psychology Group | | 2011-05-04 |
---------------------------------------------------------------------------
| 4 | Attorneys at Law | | 2011-05-05 |
---------------------------------------------------------------------------
| 5 | Social Service | | 2011-05-10 |
---------------------------------------------------------------------------
So assuming only one link existed to one RFI for each of these, the count should be as follows:
Social Service 1
Faith-Based Init. 0
Psychology Group 1
Attorneys at Law 1
Flood Relief TF 1
Homeless Shelter Min. 1
Note that if both an agency and a task force are in one record, then the task force gets the count, not the agency. But it is possible for the agency to have a record without a task force, in which case the agency gets the count. How could I get this to work in this fashion so that RFIs are not double-counted? As it stands both the agency and the task force get counted, which I do not want to happen. The task force always gets the count, unless that field is blank, then the agency gets it.
I guess a simple COALESCE() would do the trick?
SELECT COALESCE(Task_Force__Initiative, Submitting_Agency) AS Agency, COUNT(DISTINCT _LinkEnd.Link_ID) AS RFICount
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
GROUP BY COALESCE(Task_Force__Initiative, Submitting_Agency);
Rather than:
SELECT t.Submitting_Agency ...
Try
SELECT
CASE
WHEN t.[Task_Force__Initiative] IS NULL THEN -- Or whatever test identifies an "empty" value
t.[Submitting_Agency]
ELSE
t.[Task_Force__Initiative]
END ...
and then GROUP BY the same.
http://msdn.microsoft.com/en-us/library/ms181765.aspx
The result will be that your count will aggregate from the proper specified grouping point, rather than from the single agency column.
EDIT: From your example it looks like you don't use NULL for the empty field but maybe a blank string? In that case, change the NULL check in the CASE above to a comparison with the proper "blank" value. If it is NULL, then you can use COALESCE as suggested in the other answer.
EDIT: Based on what I think your schema is... and your WHERE criteria
SELECT
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency]),
COUNT(*)
FROM
RFI_
JOIN _LinkEnd
ON RFI_.[Unique_ID]=_LinkEnd.[Entity_ID1]
WHERE
_LinkEnd.[Link_ID] LIKE 'CAS%'
AND RFI_.[Date_Submitted] BETWEEN '20110430' AND '20110630'
GROUP BY
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency])
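To see the COALESCE grouping on the question's five sample records, here is a minimal sketch using Python's built-in sqlite3 module instead of SQL Server. Several details are assumptions: one hypothetical link per RFI (CAS1 to CAS5), ISO date strings in place of a date column, and a NULLIF wrapper (not in the answers above) so that a blank string is treated like NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RFI_ (Unique_ID INTEGER, Submitting_Agency TEXT, "
    "Task_Force__Initiative TEXT, Date_Submitted TEXT)"
)
conn.execute("CREATE TABLE _LinkEnd (Link_ID TEXT, Entity_ID1 INTEGER)")

conn.executemany(
    "INSERT INTO RFI_ VALUES (?, ?, ?, ?)",
    [
        (1, "Social Service", "Flood Relief TF", "2011-05-08"),
        (2, "Faith-Based Init.", "Homeless Shelter Min.", "2011-06-08"),
        (3, "Psychology Group", None, "2011-05-04"),
        (4, "Attorneys at Law", None, "2011-05-05"),
        (5, "Social Service", "", "2011-05-10"),  # blank string instead of NULL
    ],
)
# Hypothetical links: exactly one per RFI.
conn.executemany(
    "INSERT INTO _LinkEnd VALUES (?, ?)",
    [("CAS1", 1), ("CAS2", 2), ("CAS3", 3), ("CAS4", 4), ("CAS5", 5)],
)

# The task force gets the count when present; otherwise the agency does.
rows = conn.execute(
    """
    SELECT COALESCE(NULLIF(r.Task_Force__Initiative, ''), r.Submitting_Agency) AS Agency,
           COUNT(DISTINCT l.Link_ID) AS RFICount
    FROM RFI_ r
    JOIN _LinkEnd l ON r.Unique_ID = l.Entity_ID1
    WHERE l.Link_ID LIKE 'CAS%'
      AND r.Date_Submitted BETWEEN '2011-04-30' AND '2011-06-30'
    GROUP BY COALESCE(NULLIF(r.Task_Force__Initiative, ''), r.Submitting_Agency)
    ORDER BY Agency
    """
).fetchall()

counts = dict(rows)
print(counts)
```

Each RFI is counted once. An agency whose only RFI belongs to a task force (Faith-Based Init. here) does not appear at all, rather than showing a count of 0; listing it with 0 would need an outer join against a list of agencies.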

Best Approach to Processing SQL Data problem

I have a data-intensive problem which requires a lot of massaging and data manipulation, and I'm putting this out there to see if anyone has an idea as to how to approach it.
In its simplest form, I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.
So we have multiple tables that look like this:
Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John | 500 | 342 | 434 | .........| 843
Dave | 343 | 434 | 322 | NULLs....|
Mary | 500 | 342 | 434 | .........| 843
Linda | 500 | 342 | Null | .........| 843
Dentists can have different numbers of procedures and different pricing for each procedure. But a lot of dentists have the same set of procedures with the same rates. Internally, we create a unique ID for each of these so-called fee listings.
For example, John would be 001, Dave would be 002, Mary would also be 001 (same fees as John), and Linda would be 003.
It wouldn't be so bad if I had to deal with this data once, but these fee listings come in as flat files (CSVs) which I basically have to DTS up to a SQL Server to work with, and they come on a monthly basis. The pricing could change from month to month for each dentist, which would then put them under a different unique ID internally.
Can someone shed some light on how best to approach this problem so that it's efficient to process on a monthly basis without having to do tons of data manipulation?
What's the best approach to finding the duplicates among the fee listings?
How do I keep track of updates to a dentist's fee listing in case they change their rates the next month? If Mary decides to charge a different fee for Procedure2, then she would have a different unique ID internally. How do I keep track of that on a monthly basis without having to delete everything and re-insert?
There are a few million fee listings that I'm working with. Some have standard rules that are based on zip codes and some are just unique fee listings. What's the approach here?
I can write some kind of ad hoc .NET program to work with it, but it's a lot of data and working straight in SQL Server would be easier for me.
Any help would be great, thanks guys.
You probably need to unpivot the data to normalize it - so that you end up with:
Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...
When the data comes in for a doctor, it arrives pivoted (one column per procedure); a new schedule is created and the detail rows are created from the unpivoted data.
SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.
This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.
If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.
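As a rough illustration of the unpivot-then-compare idea, here is a plain Python sketch on the question's sample rows. It is only one possible approach: hashing the unpivoted detail rows to spot identical schedules is an assumption rather than something the answer prescribes, the function names are hypothetical, and only the three fully listed procedure columns are used because the remaining columns are elided in the question.

```python
import hashlib

# Wide rows as they might arrive from the CSV (None stands for NULL).
wide_rows = [
    {"Dentist": "John", "Procedure1": 500, "Procedure2": 342, "Procedure3": 434},
    {"Dentist": "Dave", "Procedure1": 343, "Procedure2": 434, "Procedure3": 322},
    {"Dentist": "Mary", "Procedure1": 500, "Procedure2": 342, "Procedure3": 434},
    {"Dentist": "Linda", "Procedure1": 500, "Procedure2": 342, "Procedure3": None},
]


def unpivot(row):
    """Turn one wide row into sorted (procedure_code, fee) detail rows, skipping NULL fees."""
    return sorted(
        (column, fee)
        for column, fee in row.items()
        if column != "Dentist" and fee is not None
    )


def fingerprint(details):
    """Hash the detail rows so identical fee schedules produce identical keys."""
    text = ";".join(f"{code}={fee}" for code, fee in details)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


schedule_ids = {}      # fingerprint -> internal schedule ID
dentist_schedule = {}  # dentist -> internal schedule ID

for row in wide_rows:
    key = fingerprint(unpivot(row))
    if key not in schedule_ids:
        # First time this exact schedule is seen: assign the next ID.
        schedule_ids[key] = f"{len(schedule_ids) + 1:03d}"
    dentist_schedule[row["Dentist"]] = schedule_ids[key]

print(dentist_schedule)
```

On a monthly load, comparing a dentist's new fingerprint with the stored one would show whether the schedule changed, so unchanged schedules need not be re-inserted. The same comparison could be done set-based in T-SQL on staging tables.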
Regarding the zip codes: if the doctor doesn't have a fee of their own, is the zip code fee something like a usual and customary fee? This could simply be determined from the zip code on the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:
ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee
Or you could save this in the regular feeschedule (potentially with some kind of flag that it was defaulted to the UCR).