Generate a sequence of records from a base record - missing-data

Is there a way to expand Table 1 into Table 2? That is, for each row, output every integer between start_no and end_no as a seq_no field and carry the remaining fields of the original row into a new table (Table 2).
Table 1:
date source market channel_no start_no end_no err_type
---------- ------ ------ ---------- -------- ------ --------
2022.06.01 src55 SZ 2011 565663 565665 1
2022.06.01 src55 SZ 2011 565918 565920 1
2022.06.01 src55 SZ 2011 566010 566012 1
2022.06.01 src55 SZ 2011 566363 566365 1
2022.06.01 src55 SZ 2011 566512 566513 1
Table 2:
date source market channel_no err_type seq_no
---------- ------ ------ ---------- -------- ------
2022.06.01 src55 SZ 2011 1 565663
2022.06.01 src55 SZ 2011 1 565664
2022.06.01 src55 SZ 2011 1 565665
2022.06.01 src55 SZ 2011 1 565918
2022.06.01 src55 SZ 2011 1 565919
2022.06.01 src55 SZ 2011 1 565920
2022.06.01 src55 SZ 2011 1 566010
2022.06.01 src55 SZ 2011 1 566011
2022.06.01 src55 SZ 2011 1 566012
2022.06.01 src55 SZ 2011 1 566363
2022.06.01 src55 SZ 2011 1 566364
2022.06.01 src55 SZ 2011 1 566365
2022.06.01 src55 SZ 2011 1 566512
2022.06.01 src55 SZ 2011 1 566513

You can use the higher-order function each together with cj (cross join) to loop over each row of the original table and cross-join it with a sequence table generated from that row's start_no and end_no; the cross join returns the Cartesian product of the rows of the two tables. Finally, use unionAll to combine all the intermediate tables into the output table.
The simulated data, the algorithm, and the results are listed below.
Simulated data:
start_no = [565663, 565918, 566010, 566363, 566512]
end_no = [565665, 565920, 566012, 566365, 566513]
tb = table(take(2022.06.01, 5) as date, take(`src55, 5) as source, take(`SZ, 5) as market, take(2011, 5) as channel_no, start_no, end_no, take(1, 5) as err_type);
tb;
date source market channel_no start_no end_no err_type
---------- ------ ------ ---------- -------- ------ --------
2022.06.01 src55 SZ 2011 565663 565665 1
2022.06.01 src55 SZ 2011 565918 565920 1
2022.06.01 src55 SZ 2011 566010 566012 1
2022.06.01 src55 SZ 2011 566363 566365 1
2022.06.01 src55 SZ 2011 566512 566513 1
The algorithm:
def f(tb, i){
    tt = select date, source, market, channel_no, err_type from tb where rowNo(date) = i
    start_no = tb[i]['start_no']
    end_no = tb[i]['end_no']
    return cj(tt, table(start_no..end_no as seq_no))
}
tb = unionAll(each(f{tb}, 0..(size(tb) - 1)), false)
Results:
date source market channel_no err_type seq_no
---------- ------ ------ ---------- -------- ------
2022.06.01 src55 SZ 2011 1 565663
2022.06.01 src55 SZ 2011 1 565664
2022.06.01 src55 SZ 2011 1 565665
2022.06.01 src55 SZ 2011 1 565918
2022.06.01 src55 SZ 2011 1 565919
2022.06.01 src55 SZ 2011 1 565920
2022.06.01 src55 SZ 2011 1 566010
2022.06.01 src55 SZ 2011 1 566011
2022.06.01 src55 SZ 2011 1 566012
2022.06.01 src55 SZ 2011 1 566363
2022.06.01 src55 SZ 2011 1 566364
2022.06.01 src55 SZ 2011 1 566365
2022.06.01 src55 SZ 2011 1 566512
2022.06.01 src55 SZ 2011 1 566513
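For readers who prefer pandas, the same row expansion can be sketched as follows. This is only an illustrative equivalent of the DolphinDB approach above, built on the example data; it is not part of the original answer.

import pandas as pd

# Rebuild Table 1 from the example above
tb = pd.DataFrame({
    "date": ["2022.06.01"] * 5,
    "source": ["src55"] * 5,
    "market": ["SZ"] * 5,
    "channel_no": [2011] * 5,
    "start_no": [565663, 565918, 566010, 566363, 566512],
    "end_no": [565665, 565920, 566012, 566365, 566513],
    "err_type": [1] * 5,
})

# One list of sequence numbers per row, then explode it into separate rows
tb["seq_no"] = [list(range(s, e + 1)) for s, e in zip(tb["start_no"], tb["end_no"])]
table2 = (tb.explode("seq_no")
            .drop(columns=["start_no", "end_no"])
            .reset_index(drop=True))
print(table2)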

Related

Pandas multiindex dataframe: calculation applied to all columns in an index level

I am working with a large Pandas dataframe that has multiple index levels in both rows and columns, something like:
df
Metric | Population Homes
Year | 2018 2019 2020 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ---- ---- ---- ----
A 1 | 100 110 120 50 52 52
2 | 200 205 210 80 80 80
3 | 300 300 300 100 100 100
B 1 | 50 60 70 20 22 24
2 | 100 100 100 40 40 40
3 | 150 140 130 50 47 44
I need to perform calculations on groups of columns, e.g. find the ratio between population and homes.
Step by step that would be:
# 1. Calculation
R = df["Population"] / df["Homes"]
R
Year | 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ----
A 1 | 2.0 2.1 2.3
2 | 2.5 2.6 2.6
3 | 3.0 3.0 3.0
B 1 | 2.5 2.7 2.9
2 | 2.5 2.5 2.5
3 | 3.0 3.0 3.0
# 2. Re-build multiindex for columns (there are various levels, showing only one here)
R = pd.concat([R],keys=["Ratio"],axis=1)
R
| Ratio
Year | 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ----
A 1 | 2.0 2.1 2.3
2 | 2.5 2.6 2.6
3 | 3.0 3.0 3.0
B 1 | 2.5 2.7 2.9
2 | 2.5 2.5 2.5
3 | 3.0 3.0 3.0
# 3. Concat previous calculation to the main dataframe
df = pd.concat([df, R], axis=1, sort=True) # sort=True avoids a PerformanceWarning
df
Metric | Population Homes Ratio
Year | 2018 2019 2020 2018 2019 2020 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ---- ---- ---- ---- ---- ---- ----
A 1 | 100 110 120 50 52 52 2.0 2.1 2.3
2 | 200 205 210 80 80 80 2.5 2.6 2.6
3 | 300 300 300 100 100 100 3.0 3.0 3.0
B 1 | 50 60 70 20 22 24 2.5 2.7 2.9
2 | 100 100 100 40 40 40 2.5 2.5 2.5
3 | 150 140 130 50 47 44 3.0 3.0 3.0
I can write the above expressions in a single line, but as I mentioned I have various index levels and it becomes complicated... Is there a way to do something simpler?
I would have guessed:
df["Ratio"] = df["Population"]/df["Homes"]
But it throws a "ValueError: Item must have length equal to number of levels."
Thanks!
Let's do some dataframe reshaping like this:
import pandas as pd
from io import StringIO
# Create your input dataframe
csvdata = StringIO("""100 110 120 50 52 52
200 205 210 80 80 80
300 300 300 100 100 100
50 60 70 20 22 24
100 100 100 40 40 40
150 140 130 50 47 44""")
df = pd.read_csv(csvdata, sep=r'\s\s+', header=None, engine='python')
df.index = pd.MultiIndex.from_product([[*'AB'], [1, 2, 3]])
df.columns = pd.MultiIndex.from_product(['Population Homes'.split(' '),
                                         [2018, 2019, 2020]])
df_out = df.stack().eval('Ratio = Population / Homes').unstack().round(1)
print(df_out)
Output:
Homes Population Ratio
2018 2019 2020 2018 2019 2020 2018 2019 2020
A 1 50 52 52 100 110 120 2.0 2.1 2.3
2 80 80 80 200 205 210 2.5 2.6 2.6
3 100 100 100 300 300 300 3.0 3.0 3.0
B 1 20 22 24 50 60 70 2.5 2.7 2.9
2 40 40 40 100 100 100 2.5 2.5 2.5
3 50 47 44 150 140 130 3.0 3.0 3.0
Using stack, eval and unstack.
I later found this thread, and it led me in the right direction (or at least toward what I had in mind):
pandas dataframe select columns in multiindex
In order to operate with all columns under the same index level, I had in mind something like this (doesn't work):
df["Ratio"] = df["Population"]/df["Homes"]
From the above thread and the pandas documentation on slicers (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#using-slicers) I got to the following expression, which does work:
df.loc[:, ("Ratio", slice(None))] = df.loc[:, ("Population", slice(None))] / df.loc[:, ("Homes", slice(None))]
Changes needed were:
use .loc
within the .loc[...], first select all rows with the colon ":"
then use a tuple in parentheses to address the MultiIndex column levels
and use slice(None) to select all values of the last index level (for the example above...)
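A variant of the same idea can be sketched with pd.IndexSlice. This is my own illustration, not part of the original answers; it rebuilds the question's frame and reuses the question's own concat trick to re-attach the top level.

import pandas as pd

# Hypothetical rebuild of the question's frame
index = pd.MultiIndex.from_product([list("AB"), [1, 2, 3]], names=["Town", "Sector"])
columns = pd.MultiIndex.from_product([["Population", "Homes"], [2018, 2019, 2020]],
                                     names=["Metric", "Year"])
data = [[100, 110, 120, 50, 52, 52],
        [200, 205, 210, 80, 80, 80],
        [300, 300, 300, 100, 100, 100],
        [50, 60, 70, 20, 22, 24],
        [100, 100, 100, 40, 40, 40],
        [150, 140, 130, 50, 47, 44]]
df = pd.DataFrame(data, index=index, columns=columns)

idx = pd.IndexSlice
# Drop the top level on each slice so the two pieces align on Year, then divide
ratio = (df.loc[:, idx["Population", :]].droplevel("Metric", axis=1)
         / df.loc[:, idx["Homes", :]].droplevel("Metric", axis=1))
# Re-attach a 'Ratio' top level (same trick as step 2 in the question) and concatenate
df = pd.concat([df, pd.concat([ratio], keys=["Ratio"], axis=1)], axis=1)
print(df.round(1))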

How to deduplicate table rows with the same date and keep the row with the most current date stamp?

A client (an e-commerce store) doesn't have a very well-built database. For instance, many users have lots of shopping orders (i.e. different order IDs) for exactly the same products on the same day. It is obvious that these seemingly separate orders are, in many cases, really just one unique order. At least that's what we have decided to assume in order to simplify the issue. (I am doing some basic data analytics.)
My table might look like this:
| Email | OrderID | Order_date | TotalAmount |
| ----------------- | --------- | ---------------- | ---------------- |
|customerA#gmail.com| 1 |Jan 01 2021 1:00PM| 2000 |
|customerA#gmail.com| 2 |Jan 01 2021 1:03PM| 2000 |
|customerA#gmail.com| 3 |Jan 01 2021 1:05PM| 2000 |
|customerA#gmail.com| 4 |Jan 01 2021 1:10PM| 2000 |
|customerA#gmail.com| 5 |Jan 01 2021 1:14PM| 2000 |
|customerA#gmail.com| 6 |Jan 03 2021 3:55PM| 3000 |
|customerA#gmail.com| 7 |Jan 03 2021 4:00PM| 3000 |
|customerA#gmail.com| 8 |Jan 03 2021 4:05PM| 3000 |
|customerB#gmail.com| 9 |Jan 04 2021 2:10PM| 1000 |
|customerB#gmail.com| 10 |Jan 04 2021 2:20PM| 1000 |
|customerB#gmail.com| 11 |Jan 04 2021 2:30PM| 1000 |
|customerB#gmail.com| 12 |Jan 06 2021 5:00PM| 5000 |
|customerC#gmail.com| 13 |Jan 09 2021 3:00PM| 4000 |
|customerC#gmail.com| 14 |Jan 09 2021 3:06PM| 4000 |
And my desired result would look like this:
| Email | OrderID | Order_date | TotalAmount |
| ----------------- | --------- | ---------------- | ---------------- |
|customerA#gmail.com| 5 |Jan 01 2021 1:14PM| 2000 |
|customerA#gmail.com| 8 |Jan 03 2021 4:05PM| 3000 |
|customerB#gmail.com| 11        |Jan 04 2021 2:30PM| 1000             |
|customerB#gmail.com| 12        |Jan 06 2021 5:00PM| 5000             |
|customerC#gmail.com| 14        |Jan 09 2021 3:06PM| 4000             |
I would guess this might be a common problem, but is there a simple solution to this?
Maybe there is, but I haven't been able to come up with one. I'd be happy to see even a complex solution, btw :-)
Thank you for any kind of help you can provide!
Do you mean this?
WITH
indata(Email,OrderID,Order_ts,TotalAmount) AS (
SELECT 'customerA#gmail.com', 1,TO_TIMESTAMP( 'Jan 01 2021 01:00PM','Mon DD YYYY HH12:MIAM'),2000
UNION ALL SELECT 'customerA#gmail.com', 2,TO_TIMESTAMP( 'Jan 01 2021 01:03PM','Mon DD YYYY HH12:MIAM'),2000
UNION ALL SELECT 'customerA#gmail.com', 3,TO_TIMESTAMP( 'Jan 01 2021 01:05PM','Mon DD YYYY HH12:MIAM'),2000
UNION ALL SELECT 'customerA#gmail.com', 4,TO_TIMESTAMP( 'Jan 01 2021 01:10PM','Mon DD YYYY HH12:MIAM'),2000
UNION ALL SELECT 'customerA#gmail.com', 5,TO_TIMESTAMP( 'Jan 01 2021 01:14PM','Mon DD YYYY HH12:MIAM'),2000
UNION ALL SELECT 'customerA#gmail.com', 6,TO_TIMESTAMP( 'Jan 03 2021 03:55PM','Mon DD YYYY HH12:MIAM'),3000
UNION ALL SELECT 'customerA#gmail.com', 7,TO_TIMESTAMP( 'Jan 03 2021 04:00PM','Mon DD YYYY HH12:MIAM'),3000
UNION ALL SELECT 'customerA#gmail.com', 8,TO_TIMESTAMP( 'Jan 03 2021 04:05PM','Mon DD YYYY HH12:MIAM'),3000
UNION ALL SELECT 'customerB#gmail.com', 9,TO_TIMESTAMP( 'Jan 04 2021 02:10PM','Mon DD YYYY HH12:MIAM'),1000
UNION ALL SELECT 'customerB#gmail.com',10,TO_TIMESTAMP( 'Jan 04 2021 02:20PM','Mon DD YYYY HH12:MIAM'),1000
UNION ALL SELECT 'customerB#gmail.com',11,TO_TIMESTAMP( 'Jan 04 2021 02:30PM','Mon DD YYYY HH12:MIAM'),1000
UNION ALL SELECT 'customerB#gmail.com',12,TO_TIMESTAMP( 'Jan 06 2021 05:00PM','Mon DD YYYY HH12:MIAM'),5000
UNION ALL SELECT 'customerC#gmail.com',13,TO_TIMESTAMP( 'Jan 09 2021 03:00PM','Mon DD YYYY HH12:MIAM'),4000
UNION ALL SELECT 'customerC#gmail.com',14,TO_TIMESTAMP( 'Jan 09 2021 03:06PM','Mon DD YYYY HH12:MIAM'),4000
)
,
-- Use ROW_NUMBER() to find the last row within each (email, calendar day):
-- order descending so the most recent row gets 1.
-- You can't filter on an OLAP function directly, so compute it in a CTE
-- and apply the WHERE condition in the final SELECT.
with_rank AS (
  SELECT
    *
  , ROW_NUMBER() OVER(PARTITION BY email, CAST(order_ts AS DATE) ORDER BY order_ts DESC) AS rank
  FROM indata
)
SELECT
  *
FROM with_rank
WHERE rank = 1;
-- out Email | OrderID | Order_ts | TotalAmount | rank
-- out ---------------------+---------+---------------------+-------------+------
-- out customerA#gmail.com | 5 | 2021-01-01 13:14:00 | 2000 | 1
-- out customerA#gmail.com | 8 | 2021-01-03 16:05:00 | 3000 | 1
-- out customerB#gmail.com | 11 | 2021-01-04 14:30:00 | 1000 | 1
-- out customerB#gmail.com | 12 | 2021-01-06 17:00:00 | 5000 | 1
-- out customerC#gmail.com | 14 | 2021-01-09 15:06:00 | 4000 | 1
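If you happen to be doing the analysis in pandas rather than SQL, a minimal sketch of the same "keep the latest order per customer per day" rule could look like this. It is my own illustration, using a few rows of the sample data and the question's column names.

import pandas as pd

orders = pd.DataFrame({
    "Email": ["customerA#gmail.com"] * 3 + ["customerB#gmail.com"] * 2,
    "OrderID": [1, 2, 5, 9, 11],
    "Order_date": pd.to_datetime(
        ["Jan 01 2021 01:00PM", "Jan 01 2021 01:03PM", "Jan 01 2021 01:14PM",
         "Jan 04 2021 02:10PM", "Jan 04 2021 02:30PM"],
        format="%b %d %Y %I:%M%p"),
    "TotalAmount": [2000, 2000, 2000, 1000, 1000],
})

# Sort by timestamp, then keep the last row of each (customer, calendar day) group
deduped = (orders.assign(day=orders["Order_date"].dt.date)
                 .sort_values("Order_date")
                 .groupby(["Email", "day"])
                 .tail(1)
                 .drop(columns="day"))
print(deduped)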

How to create churn table from transactional data?

Currently my Transaction Table holds each customer's transaction data for each month. Account_ID identifies the customer. Order_ID identifies the order the customer made. Reporting_week_start_date is the week (starting on Monday) in which each transaction row falls, based on Date_Purchased.
How do I create a new table that identifies the customer_status after each transaction has been made? Note that the new table should contain Reporting_week_start_date rows up to the current date, even for weeks in which no transaction was made.
Customer_Status
- New: customers who made their first paid subscription
- Recurring: customers with continuous payments
- Churned: customers whose subscription expired with no renewal within the same or the following month
- Reactivated: customers who had churned and then returned to re-subscribe
Transaction Table
Account_ID | Order_ID | Reporting_week_start_date| Date_Purchased | Data_Expired
001 | 1001 | 31 Dec 2018 | 01 Jan 2019 | 08 Jan 2019
001 | 1001 | 07 Jan 2019 | 08 Jan 2019 | 15 Jan 2019
001 | 1001 | 14 Jan 2019 | 15 Jan 2019 | 22 Jan 2019 #Transaction 1
001 | 1001 | 21 Jan 2019 | 22 Jan 2019 | 29 Jan 2019
001 | 1001 | 28 Jan 2019 | 29 Jan 2019 | 31 Jan 2019
001 | 1002 | 28 Jan 2019 | 01 Feb 2019 | 08 Feb 2019
001 | 1002 | 04 Feb 2019 | 08 Feb 2019 | 15 Feb 2019 #Transaction 2
001 | 1002 | 11 Feb 2019 | 15 Feb 2019 | 22 Feb 2019
001 | 1002 | 18 Feb 2019 | 22 Feb 2019 | 28 Feb 2019
001 | 1003 | 25 Feb 2019 | 01 Mar 2019 | 08 Mar 2019
001 | 1003 | 04 Mar 2019 | 08 Mar 2019 | 15 Mar 2019
001 | 1003 | 11 Mar 2019 | 15 Mar 2019 | 22 Mar 2019 #Transaction 3
001 | 1003 | 18 Mar 2019 | 22 Mar 2019 | 29 Mar 2019
001 | 1003 | 25 Mar 2019 | 29 Mar 2019 | 31 Mar 2019
001 | 1004 | 27 May 2019 | 01 Jun 2019 | 08 Jun 2019
001 | 1004 | 03 Jun 2019 | 08 Jun 2019 | 15 Jun 2019 #Transaction 4
001 | 1004 | 10 Jun 2019 | 15 Jun 2019 | 22 Jun 2019
001 | 1004 | 17 Jun 2019 | 22 Jun 2019 | 29 Jun 2019
001 | 1004 | 24 Jun 2019 | 29 Jun 2019 | 30 Jun 2019
Expected Output
Account_ID | Order_ID | Reporting_week_start_date| Customer_status
001 | 1001 | 31 Dec 2018 | New
001 | 1001 | 07 Jan 2019 | New #Transaction 1
001 | 1001 | 14 Jan 2019 | New
001 | 1001 | 21 Jan 2019 | New
001 | 1001 | 28 Jan 2019 | New
001 | 1002 | 28 Jan 2019 | Recurring
001 | 1002 | 04 Feb 2019 | Recurring #Transaction 2
001 | 1002 | 11 Feb 2019 | Recurring
001 | 1002 | 18 Feb 2019 | Recurring
001 | 1003 | 25 Feb 2019 | Churned
001 | 1003 | 04 Mar 2019 | Churned #Transaction 3
001 | 1003 | 11 Mar 2019 | Churned
001 | 1003 | 18 Mar 2019 | Churned
001 | 1003 | 25 Mar 2019 | Churned
001 | - | 1 Apr 2019 | Churned
001 | - | 08 Apr 2019 | Churned
001 | - | 15 Apr 2019 | Churned
001 | - | 22 Apr 2019 | Churned
001 | - | 29 Apr 2019 | Churned
001 | - | 06 May 2019 | Churned
001 | - | 13 May 2019 | Churned
001 | - | 20 May 2019 | Churned
001 | - | 27 May 2019 | Churned
001 | 1004 | 27 May 2019 | Reactivated
001 | 1004 | 03 Jun 2019 | Reactivated #Transaction 4
001 | 1004 | 10 Jun 2019 | Reactivated
001 | 1004 | 17 Jun 2019 | Reactivated
001 | 1004 | 24 Jun 2019 | Reactivated
...
...
...
current date
I think you just want window functions and case logic. Assuming the date you are referring to is Reporting_week_start_date, then the logic looks something like this:
select t.*,
       (case when Reporting_week_start_date = min(Reporting_week_start_date) over (partition by account_id)
                  then 'New'
             when Reporting_week_start_date < lag(Reporting_week_start_date) over (partition by account_id order by Reporting_week_start_date) + interval '1' month
                  then 'Recurring'
             when Reporting_week_start_date < lead(Reporting_week_start_date) over (partition by account_id order by Reporting_week_start_date) - interval '1' month
                  then 'Churned'
             else 'Reactivated'
        end) as status
from transactions t;  -- interval/date arithmetic syntax varies by database
These are not exactly the rules you specified, but they seem like a very reasonable interpretation of what you want to do.
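For what it's worth, here is a rough pandas sketch of the same classification at the order level (one row per order, not per reporting week). It mirrors the window-function logic above; the 31-day threshold standing in for "one month" and the condensed sample data are my own assumptions.

import pandas as pd

# One row per order, condensed from the question's data (assumption)
orders = pd.DataFrame({
    "Account_ID": ["001"] * 4,
    "Order_ID": [1001, 1002, 1003, 1004],
    "Date_Purchased": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01", "2019-06-01"]),
}).sort_values(["Account_ID", "Date_Purchased"])

grp = orders.groupby("Account_ID")["Date_Purchased"]
prev_gap = orders["Date_Purchased"] - grp.shift(1)   # time since the previous order
next_gap = grp.shift(-1) - orders["Date_Purchased"]  # time until the next order

def classify(prev_gap, next_gap, month=pd.Timedelta(days=31)):
    if pd.isna(prev_gap):
        return "New"
    if next_gap > month:          # no renewal within a month of this order
        return "Churned"
    if prev_gap > month:          # came back after a gap longer than a month
        return "Reactivated"
    return "Recurring"

orders["Customer_status"] = [classify(p, n) for p, n in zip(prev_gap, next_gap)]
print(orders)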

Select from two dates, automatically displaying consecutive rows data

I want to select between two dates and automatically display a row for each consecutive month in that range.
e.g:
Select *
from somefunction('2013/5','2019/3');
Expected result:
Year | Month
-----+------
2013 | 5
2013 | 6
.. | ..
2013 | 12
.. | ..
.. | ..
2019 | 1
2019 | 2
2019 | 3
I have since solved the problem; the solution is provided below.
declare @dStart datetime = '2013/05/01'
      , @dEnd   datetime = '2019/03/31';
SELECT year(dateadd(month, number, @dStart))  as year
     , month(dateadd(month, number, @dStart)) as month
FROM master..spt_values
WHERE type = 'P'
  AND number <= DATEDIFF(month, @dStart, @dEnd)
GO
year | month
-----+------
2013 | 5
2013 | 6
2013 | 7
2013 | 8
2013 | 9
2013 | 10
2013 | 11
2013 | 12
2014 | 1
2014 | 2
2014 | 3
2014 | 4
2014 | 5
2014 | 6
2014 | 7
2014 | 8
2014 | 9
2014 | 10
2014 | 11
2014 | 12
2015 | 1
2015 | 2
2015 | 3
2015 | 4
2015 | 5
2015 | 6
2015 | 7
2015 | 8
2015 | 9
2015 | 10
2015 | 11
2015 | 12
2016 | 1
2016 | 2
2016 | 3
2016 | 4
2016 | 5
2016 | 6
2016 | 7
2016 | 8
2016 | 9
2016 | 10
2016 | 11
2016 | 12
2017 | 1
2017 | 2
2017 | 3
2017 | 4
2017 | 5
2017 | 6
2017 | 7
2017 | 8
2017 | 9
2017 | 10
2017 | 11
2017 | 12
2018 | 1
2018 | 2
2018 | 3
2018 | 4
2018 | 5
2018 | 6
2018 | 7
2018 | 8
2018 | 9
2018 | 10
2018 | 11
2018 | 12
2019 | 1
2019 | 2
2019 | 3
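As a quick cross-check outside SQL, the same month sequence can be produced with a short pandas sketch (my own illustration):

import pandas as pd

months = pd.period_range("2013-05", "2019-03", freq="M")
result = pd.DataFrame({"Year": months.year, "Month": months.month})
print(result)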

count of nulls over a window

I need to get the count of NULL sales grouped by ID and year, but without including the month (or sales) in the grouping.
sample data
id custname reportdate sales
1 xx 31-JAN-17 1256
1 xx 31-MAR-17 <null>
1 xx 30-JUN-17 5678
1 xx 31-DEC-17 <null>
1 xx 31-JAN-18 1222
1 xx 31-MAR-18 <null>
1 xx 30-JUN-18 5667
1 xx 31-DEC-18 7890
2 yy 31-JAN-17 1223
2 yy 30-APR-17 3435
2 yy 30-JUN-17 <null>
2 yy 31-DEC-17 4567
2 yy 31-JAN-18 5678
2 yy 30-APR-18 <null>
2 yy 30-JUN-18 <null>
2 yy 31-DEC-18 2345
What I need as output:
id custname reportdate sales count(Sales nulls)
1 xx 31-JAN-17 1256 2
1 xx 31-MAR-17 <null> 2
1 xx 30-JUN-17 5678 2
1 xx 31-DEC-17 <null> 2
1 xx 31-JAN-18 1222 1
1 xx 31-MAR-18 <null> 1
1 xx 30-JUN-18 5667 1
1 xx 31-DEC-18 7890 1
2 yy 31-JAN-17 1223 1
2 yy 30-APR-17 3435 1
2 yy 30-JUN-17 <null> 1
2 yy 31-DEC-17 4567 1
2 yy 31-JAN-18 5678 2
2 yy 30-APR-18 <null> 2
2 yy 30-JUN-18 <null> 2
2 yy 31-DEC-18 2345 2
As you can see, I have multiple years, and I need the partition on id and year, NOT month.
Use a CASE expression inside a COUNT window function, partitioned by id and the year of reportdate:
select t.*,
       count(case when sales is null then 1 end)
         over (partition by id, extract(year from reportdate)) as null_cnt_per_id_year
from tbl t;
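As a cross-check, the same per-id, per-year null count can be sketched in pandas (my own illustration, using a few rows of the sample data):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id":         [1, 1, 1, 1, 2, 2, 2, 2],
    "custname":   ["xx"] * 4 + ["yy"] * 4,
    "reportdate": pd.to_datetime(["2017-01-31", "2017-03-31", "2018-01-31", "2018-03-31",
                                  "2017-01-31", "2017-06-30", "2018-04-30", "2018-06-30"]),
    "sales":      [1256, np.nan, 1222, np.nan, 1223, np.nan, np.nan, 2345],
})

# Count NULL (NaN) sales per (id, year) and broadcast the count back onto every row
df["null_cnt_per_id_year"] = (df["sales"].isna()
                                .groupby([df["id"], df["reportdate"].dt.year])
                                .transform("sum"))
print(df)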