Keep duplicate rows in Google BigQuery - google-bigquery

I have a dataset [lipid] that was extracted from an electronic medical record system (EMRS). In that EMRS, a physician orders a laboratory blood profile for a patient under a unique order number, but with different service types. So if one order has 4 service types, the EMRS records the event in 4 rows (the same order number repeated in the Order_no column, but a different service type in the Service_type column), like this:
Order_no  Service_type
1         TC
1         HDL
1         TG
1         LDL
Sometimes an order has fewer than 4 service types, so the data look like this:
Order_no  Service_type
1         TC
1         HDL
1         TG
1         LDL
2         TC
2         HDL
4         TC
4         HDL
4         LDL
5         TC
5         TG
5         LDL
6         TC
8         TC
8         HDL
8         TG
8         LDL
What I'm trying to do is write a query that keeps only the orders with four rows, that is, the same Order_no with four different Service_type values, like this:
Order_no  Service_type
1         TC
1         HDL
1         TG
1         LDL
8         TC
8         HDL
8         TG
8         LDL
How can I write this query in Google BigQuery?

Use the simple approach below:
select * from your_table
qualify count(*) over(partition by Order_no) > 3
Applied to the sample data in your question, this keeps only the rows for orders 1 and 8.
If you need to count only distinct services, use the query below:
select * from your_table
qualify count(distinct Service_type) over(partition by Order_no) > 3
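For engines without QUALIFY, the same filter can be expressed by computing the window count in a subquery and filtering in the outer query. The following is only a minimal sketch, assuming Python's built-in sqlite3 module (with an SQLite build recent enough to support window functions) as a stand-in for BigQuery; the function name complete_orders is made up for illustration, and the rows are the sample data from the question.

```python
import sqlite3

# Sample rows copied from the question: (Order_no, Service_type).
ROWS = [
    (1, "TC"), (1, "HDL"), (1, "TG"), (1, "LDL"),
    (2, "TC"), (2, "HDL"),
    (4, "TC"), (4, "HDL"), (4, "LDL"),
    (5, "TC"), (5, "TG"), (5, "LDL"),
    (6, "TC"),
    (8, "TC"), (8, "HDL"), (8, "TG"), (8, "LDL"),
]


def complete_orders(rows):
    """Return the rows whose Order_no appears more than three times."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (Order_no INTEGER, Service_type TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    # SQLite has no QUALIFY, so the window count is filtered in an outer query.
    query = """
        SELECT Order_no, Service_type
        FROM (
            SELECT Order_no, Service_type,
                   COUNT(*) OVER (PARTITION BY Order_no) AS n
            FROM your_table
        ) AS counted
        WHERE n > 3
        ORDER BY Order_no
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


kept = complete_orders(ROWS)
print(sorted({order_no for order_no, _ in kept}))
```

Only orders 1 and 8 have four rows in the sample data, so those are the only order numbers that survive the filter.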


SQL : Create columns from the distinct values of a column [duplicate]

I have a dataset [lipid] that was extracted from an electronic medical record system (EMRS). In that EMRS, a physician orders a laboratory blood profile for a patient under a unique order number, but with different service types. So if one order has 4 service types, the EMRS records the event in 4 rows (the same order number repeated in the Order_no column, but a different service type in the Service_type column), like this:
Order_no  Service_type  Result
1         TC            230
1         HDL           40
1         TG            150
1         LDL           90
Sometimes an order has fewer than 4 service types, so the data look like this:
Order_no  Service_type  Result
1         TC            230
1         HDL           40
1         TG            150
1         LDL           90
2         TC            230
2         HDL           40
4         TC            230
4         HDL           40
4         LDL           90
5         TC            230
5         TG            150
5         LDL           90
6         TC            230
8         TC            230
8         HDL           40
8         TG            150
8         LDL           90
What I'm trying to do is write a query that keeps the Order_no column, pivots the table, and merges rows with the same order number into one row, like this:
Order_no  TC   HDL  TG   LDL
1         230  40   150  90
2         250  66   -    -
4         199  39   -    99
5         299  -    45   190
6         400  -    -    -
8         400  40   250  290
How can I write this query in Google BigQuery?
Use the approach below:
select * from your_table
pivot (any_value(Result) for Service_type in ('TC', 'HDL', 'TG', 'LDL'))
If the service types are not known in advance, you can build the query dynamically:
execute immediate (select '''
select * from your_table
pivot (any_value(Result) for Service_type in (''' || string_agg(distinct "'" || Service_type || "'") ||
"))"
from your_table
)
You can use PIVOT.
Example:
WITH your_table AS
(
    SELECT 1 AS Order_no, 'TC' AS Service_type, 230 AS Result
    UNION ALL
    SELECT 1, 'HDL', 40
    UNION ALL
    SELECT 1, 'TG', 150
    UNION ALL
    SELECT 1, 'LDL', 90
)
SELECT *
FROM your_table PIVOT(SUM(Result) FOR Service_type IN ('TC', 'HDL', 'TG', 'LDL'))
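To see what the pivot produces for an order with missing service types, the same reshaping can be written as conditional aggregation, which is roughly what PIVOT does with an aggregate per listed value. This is a rough sketch only, assuming Python's built-in sqlite3 module as a stand-in for BigQuery (SQLite has no PIVOT operator); the function name pivot_results is illustrative, and the rows are a subset of the question's sample data.

```python
import sqlite3

# (Order_no, Service_type, Result) rows taken from the question's sample data.
ROWS = [
    (1, "TC", 230), (1, "HDL", 40), (1, "TG", 150), (1, "LDL", 90),
    (2, "TC", 230), (2, "HDL", 40),
]


def pivot_results(rows):
    """Return one row per order with a column for each service type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table (Order_no INTEGER, Service_type TEXT, Result INTEGER)"
    )
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)
    # One aggregate per target column; a CASE without ELSE yields NULL,
    # so service types missing from an order come back as NULL.
    query = """
        SELECT Order_no,
               MAX(CASE WHEN Service_type = 'TC'  THEN Result END) AS TC,
               MAX(CASE WHEN Service_type = 'HDL' THEN Result END) AS HDL,
               MAX(CASE WHEN Service_type = 'TG'  THEN Result END) AS TG,
               MAX(CASE WHEN Service_type = 'LDL' THEN Result END) AS LDL
        FROM your_table
        GROUP BY Order_no
        ORDER BY Order_no
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


pivoted = pivot_results(ROWS)
for row in pivoted:
    print(row)
```

Order 2 has no TG or LDL rows, so those two columns come back as NULL (None in Python) for that order.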

How do I add rows in one table in specific conditions?

I work on an Oracle database.
I have a table (it is a join table) but this is how it looks:
CustomerID days_attached Startdate enddate team
1 7 01-01-2016 08-01-2016 A
1 2 09-01-2016 10-01-2016 B
1 8 01-02-2016 09-02-2016 A
2 1 01-02-2017 02-02-2016 C
2 8 08-05-2017 16-05-2017 C
I need to know how long a person is attached to a specific team. A person can be attached for X days, and during that time the person is in a team. For instance, in this case customer 1 is attached to team A for 7 + 8 = 15 days.
How do I get this in a SQL statement?
Our app only supports SQL, not PL/SQL.
I expect an output like:
CustomerID days_attached team
1 15 A
1 2 B
2 9 C
select CustomerID, sum(days_attached) as days_attached, team
from table_name
group by CustomerID, team
Hopefully this helps.
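As a quick check of the grouping against the sample rows, here is a small sketch, assuming Python's built-in sqlite3 module in place of Oracle and the placeholder table name table_name; the function name days_per_team is made up for illustration. The date columns are left out because the aggregation does not use them.

```python
import sqlite3

# (CustomerID, days_attached, team) taken from the question's table.
ROWS = [
    (1, 7, "A"),
    (1, 2, "B"),
    (1, 8, "A"),
    (2, 1, "C"),
    (2, 8, "C"),
]


def days_per_team(rows):
    """Sum days_attached for every (CustomerID, team) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_name (CustomerID INTEGER, days_attached INTEGER, team TEXT)"
    )
    conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", rows)
    query = """
        SELECT CustomerID, SUM(days_attached) AS days_attached, team
        FROM table_name
        GROUP BY CustomerID, team
        ORDER BY CustomerID, team
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


totals = days_per_team(ROWS)
for row in totals:
    print(row)
```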

How to create cartesian products between records for each group separately?

Suppose I sell services that span a time interval (days, months or even years). I have a Products table, where each product is listed, together with the Customer_ID and Service_start and Service_end date.
Now I want to list all combinations of pairs (Service_start, Service_end) inside each customer; e.g. (table sorted by Customer_ID)
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 5-May-2014 20-Dec-2014 1
3 7-Jul-2014 9-Sep-2014 1
4 13-Jan-2014 13-Jan-2015 2
.. ... ... ...
which I want to turn into
Lp Service_start Service_end Customer_ID
--------------------------------------------
1 2-Feb-2014 8-Aug-2014 1
2 2-Feb-2014 20-Dec-2014 1
3 2-Feb-2014 9-Sep-2014 1
4 5-May-2014 8-Aug-2014 1
5 5-May-2014 20-Dec-2014 1
6 5-May-2014 9-Sep-2014 1
7 7-Jul-2014 8-Aug-2014 1
8 7-Jul-2014 20-Dec-2014 1
9 7-Jul-2014 9-Sep-2014 1
10 13-Jan-2014 13-Jan-2015 2
... ... ... ...
The table is too big to fit into memory.
How can this be achieved in SQL? Or SAS?
You can do this in SAS and SQL. Here is the SQL idea: join the distinct start dates to the distinct end dates on the customer.
select ss.service_start, se.service_end, ss.customer_id
from (select distinct customer_id, service_start from table) ss join
     (select distinct customer_id, service_end from table) se
     on ss.customer_id = se.customer_id;
This is compatible with SAS proc sql.
In most dialects of SQL, you can add the lp column using row_number() over (order by customer_id, service_start, service_end). In SAS, you can use monotonic() or a data step after proc sql.
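The per-customer self-join can be tried on the four sample rows from the question. The sketch below assumes Python's built-in sqlite3 module as the SQL engine and a table named products (the question calls it the Products table); the function name start_end_pairs is illustrative only, and dates are kept as plain text because they are only paired, never compared.

```python
import sqlite3

# (Service_start, Service_end, Customer_ID) from the question's first table.
ROWS = [
    ("2-Feb-2014", "8-Aug-2014", 1),
    ("5-May-2014", "20-Dec-2014", 1),
    ("7-Jul-2014", "9-Sep-2014", 1),
    ("13-Jan-2014", "13-Jan-2015", 2),
]


def start_end_pairs(rows):
    """Pair every distinct start date with every distinct end date per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (service_start TEXT, service_end TEXT, customer_id INTEGER)"
    )
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    # Joining only on customer_id yields the cartesian product inside each customer.
    query = """
        SELECT ss.service_start, se.service_end, ss.customer_id
        FROM (SELECT DISTINCT customer_id, service_start FROM products) AS ss
        JOIN (SELECT DISTINCT customer_id, service_end FROM products) AS se
          ON ss.customer_id = se.customer_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


pairs = start_end_pairs(ROWS)
print(len(pairs))
```

Customer 1 has three start dates and three end dates, giving nine pairs, and customer 2 contributes one more.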

Select n amount of random rows where n is proportionate to each value's % of total population

I have a table of 58 million customer records. Each customer has a market value (EN, US, FR etc.)
I'm trying to select a 100k sample set which contains customers from every market. The ratio of customers per market in the sample must match the ratios in the actual table.
So if UK customers account for 15% of the records in the customer table then there must be 15k UK customers in the 100k sample set and the same then for each market.
Is there a way to do this?
First, a simple random sample should do pretty well on representing the market sizes. What you are asking for is a stratified sample.
One way to get such a sample is to order the data randomly and assign a sequential number in each group. Then normalize the sequential number to be between 0 and 1, and finally order by the normalized value and choose the top "n" rows:
select top 100000 c.*
from (select c.*,
row_number() over (partition by market order by rand(checksum(newid()))
) as seqnum,
count(*) over (partition by market) as cnt
from customers c
) c
order by cast(seqnum as float) / cnt
It may be clearer what is happening if you look at the data. Consider taking a sample of 5 from:
1 A
2 B
3 C
4 D
5 D
6 D
7 B
8 A
9 D
10 C
The first step assigns a sequential number randomly within each market:
1 A 1
2 B 1
3 C 1
4 D 1
5 D 2
6 D 3
7 B 2
8 A 2
9 D 4
10 C 2
Next, normalize these values:
1 A 1 0.50
2 B 1 0.50
3 C 1 0.50
4 D 1 0.25
5 D 2 0.50
6 D 3 0.75
7 B 2 1.00
8 A 2 1.00
9 D 4 1.00
10 C 2 1.00
Now, if you take the top 5, you get the five rows with the smallest normalized values, which form a stratified sample.
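The three steps above (random sequence number per market, normalization by the market size, top n by normalized value) can be sketched outside the database as well. This is an illustrative in-memory version in plain Python, not the SQL Server query itself; the function name stratified_sample and its seed parameter are assumptions made for the example, and the rows are the ten-row data set used above.

```python
import random
from collections import defaultdict

# (id, market) rows from the worked example above.
ROWS = [
    (1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "D"),
    (6, "D"), (7, "B"), (8, "A"), (9, "D"), (10, "C"),
]


def stratified_sample(rows, key, n, seed=None):
    """Pick n rows so that each group is represented in proportion to its size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    ranked = []
    for members in groups.values():
        rng.shuffle(members)  # random order within the group
        cnt = len(members)
        for seqnum, row in enumerate(members, start=1):
            # Normalized position in (0, 1], like seqnum / cnt in the query.
            ranked.append((seqnum / cnt, row))

    ranked.sort(key=lambda pair: pair[0])
    return [row for _, row in ranked[:n]]


sample = stratified_sample(ROWS, key=lambda row: row[1], n=5, seed=0)
print(sorted(market for _, market in sample))
```

Which individual rows are chosen depends on the shuffle, but for this data the five smallest normalized values always belong to one row each from A, B and C and two rows from D.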
With a sample that big, a simple random extraction will give you a sample that is a good statistical approximation of the original population, as pointed out by Gordon Linoff.
To force the same percentages in the sample as in the population, you can calculate and use all the needed parameters: the size of the population and the size of each partition, together with a random ID.
Declare @sampleSize INT;
Set @sampleSize = 100000;
With D AS (
  SELECT customerID
       , Country
       , Count(customerID) OVER (PARTITION BY Null) TotalData
       , Count(customerID) OVER (PARTITION BY Country) CountryData
       , Row_Number() OVER (PARTITION BY Country
                            ORDER BY rand(checksum(newid()))) ID
  FROM customer
)
SELECT customerID
     , Country
FROM D
WHERE ID <= Round((Cast(CountryData as Float) / TotalData) * @sampleSize, 0)
ORDER BY Country
SQLFiddle demo with less data.
Be aware that the rounding in the WHERE condition can make the number of returned rows slightly lower or higher than the requested size; for example, the demo returns 9 rows instead of 10.

Grouping a row based on field in a different table in oracle

I have been working with these two tables for the past two days.
parts_list table:
PART_ID VENDOR_ID LABEL
1 5 A
1 2 B
1 3 C
2 2 D
2 3 E
3 3 F
vendor_prsdnc table:
VENDOR_ID PRSCDNC
5 3
2 2
3 1
Can anybody please tell me how to retrieve the label of each part from the vendor with the highest precedence? For example, the part with ID 1 is supplied by 3 vendors, but we need the one from the vendor with the highest precedence, i.e. vendor 5. The expected result is:
PART_ID VENDOR_ID LABEL
1 5 A
2 2 D
3 3 F
[Vendor ID is not proportional to the precedence.]
I have this query
SELECT
SDS.PART_ID,
SDSIS.VENDOR_ID,
MAX(SDSIS.PRSCDNC)
FROM PARTS_LIST SDS,VENDOR_PRSDNC SDSIS
WHERE SDS.VENDOR_ID=SDSIS.VENDOR_ID
GROUP BY SDS.PART_ID,SDSIS.VENDOR_ID;
but it does not return the expected result.
Not tested, but I think it should work:
select part_id, vendor_id, label
from
(
  select pl.part_id
        ,pl.vendor_id
        ,pl.label
        ,vp.prscdnc
        ,max(vp.prscdnc) over (partition by pl.part_id) mx
  from parts_list pl, vendor_prsdnc vp
  where pl.vendor_id = vp.vendor_id
)
where prscdnc = mx;
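The query can be checked against the sample tables from the question. The following is a small sketch, assuming Python's built-in sqlite3 module (with window-function support) instead of Oracle; the function name best_vendor_labels is made up for illustration, and the implicit comma join is written as an explicit JOIN, which should be equivalent here.

```python
import sqlite3

# Sample data copied from the question.
PARTS = [(1, 5, "A"), (1, 2, "B"), (1, 3, "C"), (2, 2, "D"), (2, 3, "E"), (3, 3, "F")]
VENDORS = [(5, 3), (2, 2), (3, 1)]  # (VENDOR_ID, PRSCDNC)


def best_vendor_labels(parts, vendors):
    """For each part, keep the row of the vendor with the highest precedence."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parts_list (part_id INTEGER, vendor_id INTEGER, label TEXT)")
    conn.execute("CREATE TABLE vendor_prsdnc (vendor_id INTEGER, prscdnc INTEGER)")
    conn.executemany("INSERT INTO parts_list VALUES (?, ?, ?)", parts)
    conn.executemany("INSERT INTO vendor_prsdnc VALUES (?, ?)", vendors)
    # The window MAX attaches each part's top precedence to every one of its rows;
    # the outer filter keeps the row whose own precedence equals that maximum.
    query = """
        SELECT part_id, vendor_id, label
        FROM (
            SELECT pl.part_id   AS part_id,
                   pl.vendor_id AS vendor_id,
                   pl.label     AS label,
                   vp.prscdnc   AS prscdnc,
                   MAX(vp.prscdnc) OVER (PARTITION BY pl.part_id) AS mx
            FROM parts_list pl
            JOIN vendor_prsdnc vp ON pl.vendor_id = vp.vendor_id
        ) AS ranked
        WHERE prscdnc = mx
        ORDER BY part_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


best = best_vendor_labels(PARTS, VENDORS)
for row in best:
    print(row)
```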