Add temporary column with number in sequence in BigQuery - sql

I have two tables: customers and orders. orders has a customer_id column, so a customer can have many orders. I need to find each order's number in sequence (by date). The result should look something like this:
customer_id  order_date  number_in_sequence
-----------  ----------  ------------------
1            2020-01-01  1
1            2020-01-02  2
1            2020-01-03  3
2            2019-01-01  1
2            2019-01-02  2
I am going to use it in a WITH clause, so I don't need to add it to the table.

You need row_number():
select t.*,
       row_number() over (partition by customer_id order by order_date) as number_in_sequence
from table t;
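As a quick sanity check, here is the same row_number() pattern run against an in-memory SQLite database (window functions need SQLite 3.25+; BigQuery syntax is essentially identical here). The sample rows are reproduced from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, '2020-01-01'), (1, '2020-01-02'), (1, '2020-01-03'),
        (2, '2019-01-01'), (2, '2019-01-02');
""")

# Number each customer's orders by date; the numbering restarts per customer.
rows = conn.execute("""
    SELECT customer_id, order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date) AS number_in_sequence
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()
```

Since the numbering is computed on the fly, it works unchanged inside a WITH clause.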


Select earliest date and count rows in table with duplicate IDs

I have a table called table1:
id created_date
1001 2020-06-01
1001 2020-01-01
1001 2020-07-01
1002 2020-02-01
1002 2020-04-01
1003 2020-09-01
I'm trying to write a query that provides me a list of distinct IDs with the earliest created_date they have, along with the count of rows each id has:
id created_date count
1001 2020-01-01 3
1002 2020-02-01 2
1003 2020-09-01 1
I managed to write a window function to grab the earliest date, but I'm having trouble figuring out where to fit the count into the same query:
SELECT
id,
created_date
FROM (SELECT
          id,
          created_date,
          row_number() OVER (PARTITION BY id ORDER BY created_date) as row_num
      FROM table1) AS a
WHERE row_num = 1
You would use aggregation:
select id, min(created_date), count(*)
from table1
group by id;
I find it amusing that you want to use window functions -- which are considered more advanced -- when lowly aggregation suffices.
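The aggregation answer can be verified end to end with a small SQLite script; the sample rows are copied from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, created_date TEXT);
    INSERT INTO table1 VALUES
        (1001, '2020-06-01'), (1001, '2020-01-01'), (1001, '2020-07-01'),
        (1002, '2020-02-01'), (1002, '2020-04-01'),
        (1003, '2020-09-01');
""")

# One pass: MIN picks the earliest date and COUNT sizes each id group.
rows = conn.execute("""
    SELECT id, MIN(created_date) AS created_date, COUNT(*) AS cnt
    FROM table1
    GROUP BY id
    ORDER BY id
""").fetchall()
```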

Count lead duplicate rows

I have the below table
Table A:
row_number  id   start_dt  end_dt   cust_dt  cust_id
1           101  4/8/19    4/20/19  4/10/19  725
2           101  4/21/19   5/20/19  4/10/19  456
3           101  5/1/19    6/30/19  4/10/19  725
4           101  7/1/19    8/20/19  4/10/19  725
I need to count "duplicates" in a table for testing purposes.
Criteria:
Need to exclude the start_dt and end_dt from my calculation.
A row only counts as a duplicate if the immediately following (lead) row is identical. So, for example, rows 1, 3, and 4 are the same, but only rows 3 and 4 would be considered duplicates in this example.
What I have tried:
Rank with a lead and a self join, but that doesn't seem to be working on my end.
How can I count the id to determine if there are duplicates?
Output: (something like below)
count id
2 101
The end result for me is to have a count of 1 for the table:
count id
1 101
Use the row_number analytical function as follows (this is a gaps-and-islands problem):
select count(1), id
from (select t.*,
             row_number() over (order by row_number) as rn,
             row_number() over (partition by id, cust_dt, cust_id order by row_number) as part_rn
      from your_table t)
group by id, cust_dt, cust_id, (rn - part_rn)
having count(1) > 1
db<>fiddle demo
Cheers!!
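The gaps-and-islands trick above can be reproduced in SQLite. One assumption in this sketch: the question's column is literally named row_number, which collides with the window function name in most engines, so it is renamed row_num here. Sample rows are from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (row_num INTEGER, id INTEGER, cust_dt TEXT, cust_id INTEGER);
    INSERT INTO table_a VALUES
        (1, 101, '2019-04-10', 725),
        (2, 101, '2019-04-10', 456),
        (3, 101, '2019-04-10', 725),
        (4, 101, '2019-04-10', 725);
""")

# rn - part_rn is constant within each run of consecutive identical
# (id, cust_dt, cust_id) rows, which isolates the "islands";
# HAVING keeps only runs longer than one row.
rows = conn.execute("""
    SELECT COUNT(1) AS cnt, id
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (ORDER BY row_num) AS rn,
                 ROW_NUMBER() OVER (PARTITION BY id, cust_dt, cust_id
                                    ORDER BY row_num) AS part_rn
          FROM table_a t)
    GROUP BY id, cust_dt, cust_id, rn - part_rn
    HAVING COUNT(1) > 1
""").fetchall()
```

Rows 3 and 4 form the only island longer than one row, so the query reports a single group of size 2 for id 101.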
If your definition of a duplicated row is: the CUST_ID in the lead row (with the same id, ordered by row_number) equals the current CUST_ID,
you can write it down simply using the LEAD analytic function.
select ID, ROW_NUMBER, CUST_ID,
case when CUST_ID = lead(CUST_ID) over (partition by id order by ROW_NUMBER) then 1 end is_dup
from tab
ID ROW_NUMBER CUST_ID IS_DUP
---------- ---------- ---------- ----------
101 1 725
101 2 456
101 3 725 1
101 4 725
The aggregated query to get the number of duplicated rows would then be
with dup as (
select ID, ROW_NUMBER, CUST_ID,
case when CUST_ID = lead(CUST_ID) over (partition by id order by ROW_NUMBER) then 1 end is_dup
from tab)
select ID, sum(is_dup) dup_cnt
from dup
group by ID
ID DUP_CNT
---------- ----------
101 1
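The LEAD-based count runs as-is on SQLite (again with the row_number column renamed row_num to avoid the keyword clash):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab (row_num INTEGER, id INTEGER, cust_dt TEXT, cust_id INTEGER);
    INSERT INTO tab VALUES
        (1, 101, '2019-04-10', 725),
        (2, 101, '2019-04-10', 456),
        (3, 101, '2019-04-10', 725),
        (4, 101, '2019-04-10', 725);
""")

# is_dup flags a row only when the NEXT row (by row_num) repeats its cust_id,
# so of rows 1, 3, 4 (all cust_id 725) only row 3 is flagged.
rows = conn.execute("""
    WITH dup AS (
        SELECT id,
               CASE WHEN cust_id = LEAD(cust_id) OVER (PARTITION BY id
                                                       ORDER BY row_num)
                    THEN 1 END AS is_dup
        FROM tab)
    SELECT id, SUM(is_dup) AS dup_cnt
    FROM dup
    GROUP BY id
""").fetchall()
```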

rank function only returns 1 with date in redshift

I'm running the code below in Redshift. I want to get a ranking of the order in which a customer purchased a product, based on the date. Each purchase has a unique ticketid, each customer has a unique customer_uuid, and each product has a unique product_id. The code below returns 1 for all rankings and I'm not sure why. Is there an error in my code, or is there a problem with ranking by a date field in Redshift? Does anyone see how to modify this code to correct the issue?
code:
select customer_uuid,
       product_id,
       date,
       ticketid,
       rank() over (partition by customer_uuid,
                                 product_id,
                                 ticketid
                    order by date asc) as rank
from table
order by customer_uuid, product_id
data:
customer_uuid  product_id  ticketid  date
1              2           1         1/1/18
1              2           2         1/2/18
1              2           3         1/3/18
output:
customer_uuid  product_id  ticketid  date    rank
1              2           1         1/1/18  1
1              2           2         1/2/18  1
1              2           3         1/3/18  1
desired output:
customer_uuid  product_id  ticketid  date    rank
1              2           1         1/1/18  1
1              2           2         1/2/18  2
1              2           3         1/3/18  3
First, you have ticketid in the partition by, which makes each row unique.
Second, you are using rank(). If you want an enumeration, do you want row_number()?
row_number() over(partition by customer_uuid, product_id order by date asc) as rank
I want to get a ranking of the order when a customer purchased a product based on the date. Each purchase has a unique ticketid, each customer has a unique customer_uuid, and each product has a unique product_id.
Basically you have unique (customer_uuid, product_id, ticketid) tuples. If you use those as a partition, the rank will always be 1, since there is only one record per partition.
You just need to remove ticketid from the partition:
rank() over(
partition by customer_uuid, product_id
order by date
) as rank
Note: rank() will give an equal position to records that share the same (customer_uuid, product_id, date).
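A SQLite sketch makes the difference concrete (the table name purchases is made up, since the question only says "table"; dates are rewritten as ISO strings so they sort correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (customer_uuid INTEGER, product_id INTEGER,
                            ticketid INTEGER, date TEXT);
    INSERT INTO purchases VALUES
        (1, 2, 1, '2018-01-01'),
        (1, 2, 2, '2018-01-02'),
        (1, 2, 3, '2018-01-03');
""")

# With ticketid in PARTITION BY, every partition holds one row, so rank() is 1.
broken = conn.execute("""
    SELECT ticketid,
           RANK() OVER (PARTITION BY customer_uuid, product_id, ticketid
                        ORDER BY date) AS rnk
    FROM purchases
    ORDER BY ticketid
""").fetchall()

# Dropping ticketid from the partition lets rank() advance with the date.
fixed = conn.execute("""
    SELECT ticketid,
           RANK() OVER (PARTITION BY customer_uuid, product_id
                        ORDER BY date) AS rnk
    FROM purchases
    ORDER BY ticketid
""").fetchall()
```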

Combining COUNT and RANK - PostgreSQL

What I need to select is the total number of trips made by every id_customer from the user table, plus the id, dispatch_seconds, and distance of each customer's first order. id_customer, customer_id, and order_id are strings.
It should look like this:
+------+--------+------------+--------------------------+------------------+
| id | count | #1order id | #1order dispatch seconds | #1order distance |
+------+--------+------------+--------------------------+------------------+
| 1ar5 | 3 | 4r56 | 1 | 500 |
| 2et7 | 2 | dc1f | 5 | 100 |
+------+--------+------------+--------------------------+------------------+
Cheers!
The original post was edited during the discussion, as S-man helped me find the exact solution. Solution by S-man: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=e16aa6008990107e55a26d05b10b02b5
db<>fiddle
SELECT
customer_id,
order_id,
order_timestamp,
dispatch_seconds,
distance
FROM (
SELECT
*,
count(*) over (partition by customer_id), -- A
first_value(order_id) over (partition by customer_id order by order_timestamp) -- B
FROM orders
)s
WHERE order_id = first_value -- C
https://www.postgresql.org/docs/current/static/tutorial-window.html
A: a window function which gets the total record count per user.
B: a window function which orders all records per user by timestamp and returns the first order_id for the corresponding user. Using first_value instead of min has one benefit: your order IDs may not actually increase with the timestamp (two orders might come in simultaneously, or the IDs might be some sort of hash rather than sequentially increasing).
--> both are new columns
C: now take all rows where the "first_value" (i.e., the first order_id by timestamp) equals the order_id of the current row. This yields each user's first order.
Result:
customer_id count order_id order_timestamp dispatch_seconds distance
----------- ----- -------- ------------------- ---------------- --------
1ar5 3 4r56 2018-08-16 17:24:00 1 500
2et7 2 dc1f 2018-08-15 01:24:00 5 100
Note that in these test data the order "dc1f" of user "2et7" has a smaller timestamp but comes later in the rows. It is not the first occurrence of the user in the table but nevertheless the one with the earliest order. This should demonstrate the case first_value vs. min as described above.
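S-man's query runs unchanged on SQLite. The filler orders ('ord2', 'ord3', 'ord4') are invented here so the counts come out to 3 and 2 as in the expected output, and 2et7's earliest order dc1f is deliberately inserted last to show the first_value point:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id TEXT, order_id TEXT, order_timestamp TEXT,
                         dispatch_seconds INTEGER, distance INTEGER);
    INSERT INTO orders VALUES
        ('1ar5', '4r56', '2018-08-16 17:24:00', 1, 500),
        ('1ar5', 'ord2', '2018-08-17 10:00:00', 9, 200),
        ('1ar5', 'ord3', '2018-08-18 11:00:00', 2, 300),
        ('2et7', 'ord4', '2018-08-16 09:00:00', 7, 400),
        ('2et7', 'dc1f', '2018-08-15 01:24:00', 5, 100);
""")

rows = conn.execute("""
    SELECT customer_id, cnt, order_id, dispatch_seconds, distance
    FROM (SELECT *,
                 COUNT(*) OVER (PARTITION BY customer_id) AS cnt,             -- A
                 FIRST_VALUE(order_id) OVER (PARTITION BY customer_id
                                             ORDER BY order_timestamp) AS fv  -- B
          FROM orders)
    WHERE order_id = fv                                                       -- C
    ORDER BY customer_id
""").fetchall()
```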
You are on the right track. Just use conditional aggregation:
SELECT o.customer_id, COUNT(*),
       MAX(CASE WHEN seqnum = 1 THEN o.order_id END) as first_order_id
FROM (SELECT o.*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_timestamp ASC) as seqnum
FROM orders o
) o
GROUP BY o.customer_id;
Your JOIN is not necessary for this query.
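The conditional-aggregation variant also runs on SQLite; the sample rows below are the same invented ones used above (only the columns this query touches are kept):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id TEXT, order_id TEXT, order_timestamp TEXT);
    INSERT INTO orders VALUES
        ('1ar5', '4r56', '2018-08-16 17:24:00'),
        ('1ar5', 'ord2', '2018-08-17 10:00:00'),
        ('1ar5', 'ord3', '2018-08-18 11:00:00'),
        ('2et7', 'ord4', '2018-08-16 09:00:00'),
        ('2et7', 'dc1f', '2018-08-15 01:24:00');
""")

# seqnum = 1 marks each customer's earliest order; MAX(CASE ...) then pulls
# that order_id out of the group while COUNT(*) counts all of its rows.
rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS cnt,
           MAX(CASE WHEN seqnum = 1 THEN order_id END) AS first_order_id
    FROM (SELECT o.*,
                 ROW_NUMBER() OVER (PARTITION BY customer_id
                                    ORDER BY order_timestamp) AS seqnum
          FROM orders o)
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
```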
You can use a window function:
select distinct customer_id,
       count(*) over (partition by customer_id) as no_of_order,
       first_value(order_id) over (partition by customer_id order by order_timestamp) as first_order_id
from orders o;
I think there are many mistakes in your original query: your rank isn't partitioned, the order by clause seems incorrect, you filter out all but one "random" order and only then apply the count, and the list goes on.
Something like this seems closer to what you want?
SELECT
customer_id,
order_count,
order_id
FROM (
SELECT
a.customer_id,
a.order_count,
a.order_id,
RANK() OVER (PARTITION BY a.order_id, a.customer_id ORDER BY a.order_count DESC) AS rank_id
FROM (
SELECT
customer_id,
order_id,
COUNT(*) AS order_count
FROM
orders
GROUP BY
customer_id,
order_id) a) b
WHERE
b.rank_id = 1;

Oracle SQL Help Data Totals

I am on Oracle 12c and need help with a simple query.
Here is the sample data of what I currently have:
Table Name: customer
Table DDL
create table customer(
customer_id varchar2(50),
name varchar2(50),
activation_dt date,
space_occupied number(50)
);
Sample Table Data:
customer_id  name     activation_dt  space_occupied
abc          abc-001  2016-09-12     20
xyz          xyz-001  2016-09-12     10
Sample Data Output
The query I am looking for will provide the following:
customer_id  name     activation_dt  space_occupied
abc          abc-001  2016-09-12     20
xyz          xyz-001  2016-09-12     10
Total_Space  null     null           30
Here is a slightly hack-y approach to this, using the grouping function ROLLUP(). Find out more.
SQL> select coalesce(customer_id, 'Total Space') as customer_id
2 , name
3 , activation_dt
4 , sum(space_occupied) as space_occupied
5 from customer
6 group by ROLLUP(customer_id, name, activation_dt)
7 having grouping(customer_id) = 1
8 or (grouping(name) + grouping(customer_id)+ grouping(activation_dt)) = 0;
CUSTOMER_ID NAME ACTIVATIO SPACE_OCCUPIED
------------ ------------ --------- --------------
abc abc-001 12-SEP-16 20
xyz xyz-001 12-SEP-16 10
Total Space 30
SQL>
ROLLUP() generates intermediate totals for each combination of column; the verbose HAVING clause filters them out and retains only the grand total.
What you want is a bit unusual; if customer_id were an integer you would have to cast it to a string, etc. But if this is your requirement, it can be achieved this way.
SELECT customer_id,
name,
activation_dt,
space_occupied
FROM
(SELECT 1 AS seq,
customer_id,
name,
activation_dt,
space_occupied
FROM customer
UNION ALL
SELECT 2 AS seq,
'Total_Space' AS customer_id,
NULL AS name,
NULL AS activation_dt,
sum(space_occupied) AS space_occupied
FROM customer
)
ORDER BY seq
Explanation:
Inner query: the first part of the union all adds a hardcoded 1 as seq alongside your result set from customer. The second part of the union all just calculates sum(space_occupied) and hardcodes the other columns, including 2 as seq.
Outer query: selects the data columns and orders by seq, so Total_Space is returned last.
Output
+-------------+---------+---------------+----------------+
| CUSTOMER_ID | NAME | ACTIVATION_DT | SPACE_OCCUPIED |
+-------------+---------+---------------+----------------+
| abc | abc-001 | 12-SEP-16 | 20 |
| xyz | xyz-001 | 12-SEP-16 | 10 |
| Total_Space | null | null | 30 |
+-------------+---------+---------------+----------------+
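Of the answers here, the UNION ALL variant is the portable one, so it is the one sketched below in SQLite (which supports neither ROLLUP nor GROUPING SETS); the customer rows are from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id TEXT, name TEXT,
                           activation_dt TEXT, space_occupied INTEGER);
    INSERT INTO customer VALUES
        ('abc', 'abc-001', '2016-09-12', 20),
        ('xyz', 'xyz-001', '2016-09-12', 10);
""")

# seq forces the hardcoded total row to sort after the detail rows.
rows = conn.execute("""
    SELECT customer_id, name, activation_dt, space_occupied
    FROM (SELECT 1 AS seq, customer_id, name, activation_dt, space_occupied
          FROM customer
          UNION ALL
          SELECT 2 AS seq, 'Total_Space', NULL, NULL, SUM(space_occupied)
          FROM customer)
    ORDER BY seq
""").fetchall()
```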
Seems like a great place to use group by grouping sets; this is what they were designed for. Doc link
SELECT coalesce(Customer_Id, 'Total_Space') as Customer_ID
     , Name
     , Activation_DT
     , sum(Space_Occupied) as Space_Occupied
FROM customer
GROUP BY GROUPING SETS ((Customer_ID, Name, Activation_DT, Space_Occupied)
                       ,())
The key thing here is that we are summing space_occupied. The two grouping sets tell the engine to keep each row in its original form, plus one record with space_occupied summed; since one of the sets is the empty set (), that record carries only aggregated values, along with constants (the coalesce hardcodes the label for the total!).
The power of this is that if you needed other groupings as well, you could add multiple grouping sets. Imagine a material with a product division, group, and line, and you want a report with sales totals by division, group, and line. You could simply group by () to get the grand total, (product_division, product_group, product_line) to get a product-line total, (product_division, product_group) to get a product-group total, and (product_division) to get a product-division total. Pretty powerful stuff for partial cube generation.