Calculating attrition in SQL - sql

I am more or less new to SQL, so I would really appreciate it if any of you could give me some hints on how to start on the task below.
We have a database of donations made by regular donors to an NGO.
Fields: donor_id, date
We need to create an attrition table listing the donation periods and the proportion of donors who are still donating to the organization after n months.
As we count donors from their first donation, the first period is 100%; then the query should check whether the donor gave a donation in the 2nd, 3rd, ..., Nth month after the donor's first donation.
Donation   1   2   3   4   5   6   7   8   9   10  11
donor 1    1   1   1   0   1   0   1   0   0   0   0
donor 2    1   1   0   0   0   1   0   0   1   0   0
donor 3    1   0   1   0   0   0   0   0   0   0   0
Any idea? :) Thank you!
PS: until now we used Excel or Google Sheets for this, but now we have a database with 50 million rows, so I have been told to find a solution quickly.
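One possible starting point, sketched with Python's built-in sqlite3 so it can be run as-is. The table name `donations`, the sample dates, and the choice to measure periods in calendar months (period 1 = the month of the first donation) are assumptions, not something stated in the question; the date functions will differ on other database engines.

```python
import sqlite3

# Period n = calendar-month distance from the donor's first donation, plus 1.
ATTRITION_SQL = """
WITH first_gift AS (
    SELECT donor_id, MIN(date) AS first_date
    FROM donations
    GROUP BY donor_id
),
periods AS (
    -- DISTINCT so that several donations in one month count once per donor
    SELECT DISTINCT
        d.donor_id,
        (CAST(strftime('%Y', d.date) AS INTEGER) * 12
            + CAST(strftime('%m', d.date) AS INTEGER))
      - (CAST(strftime('%Y', f.first_date) AS INTEGER) * 12
            + CAST(strftime('%m', f.first_date) AS INTEGER))
      + 1 AS period
    FROM donations AS d
    JOIN first_gift AS f ON f.donor_id = d.donor_id
)
SELECT period,
       COUNT(*) AS active_donors,
       ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM first_gift), 1) AS pct_of_donors
FROM periods
GROUP BY period
ORDER BY period
"""


def make_demo_db():
    """Build an in-memory table with made-up dates matching the grid above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE donations (donor_id INTEGER, date TEXT)")
    sample = [
        # donor 1: periods 1, 2, 3, 5, 7
        (1, "2020-01-15"), (1, "2020-02-10"), (1, "2020-03-05"),
        (1, "2020-05-20"), (1, "2020-07-01"),
        # donor 2: periods 1, 2, 6, 9
        (2, "2020-03-03"), (2, "2020-04-28"), (2, "2020-08-14"), (2, "2020-11-30"),
        # donor 3: periods 1, 3 (crosses a year boundary)
        (3, "2020-11-11"), (3, "2021-01-09"),
    ]
    conn.executemany("INSERT INTO donations VALUES (?, ?)", sample)
    return conn


def attrition_table(conn):
    """Return (period, active_donors, pct_of_donors) rows."""
    return conn.execute(ATTRITION_SQL).fetchall()


for row in attrition_table(make_demo_db()):
    print(row)
```

Periods in which nobody donated (4 and 8 in this sample) simply do not appear in the result; if they are needed as 0% rows, the result would have to be joined against a list of period numbers.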

Related

how to access repeat purchase records for the next three months without self join?

I have a table with customer transaction information, for example records for one customer (identified by customer_id) look like this:
order_id  bk_date  booking_has_insurance_indicator
1         7/20     0
2         8/2      0
3         8/3      1
4         8/9      1
5         11/6     0
6         12/2     0
7         12/6     0
8         12/7     0
For each customer and each order_id, I'd like to find out whether there are repeat purchases within 90 days, how many there are, and whether any of them has insurance attached. For example, for order_id = 1, there are three repeat purchases (order_id = 2, 3, 4) within 90 days, and some of them have insurance (order_id = 3, 4). The ideal output would look like this:
order_id  bk_date  repeat_count  repeat_has_insurance_indicator
1         7/20     3             1
2         8/2      2             1
3         8/3      2             1
4         8/9      1             0
5         11/6     3             0
6         12/2     2             0
7         12/6     1             0
8         12/7     0             0
I'm aware that if I only want to access the next order record I can use LEAD window function without joining, but with question above, I could only think of self join to join each order_id to the ones with bk_date within 90 days. However, given the volume of the data with millions of customers, self join is also not an option due to memory limit. Could someone help me if there's a more efficient solution?
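One way to avoid the self join, where the engine allows it, is a window frame defined over a range of days instead of a number of rows. The sketch below runs the idea through Python's sqlite3 (SQLite 3.28 or later is assumed); the table name `orders`, the `customer_id` value, and the year 2021 are made up for illustration. Not every database supports `RANGE` frames with a numeric offset, so this would need checking against the actual engine.

```python
import sqlite3

# For each order, look at the same customer's orders 1 to 90 days later.
# Same-day orders are NOT counted as repeats here (an assumption).
REPEAT_SQL = """
SELECT order_id,
       bk_date,
       COUNT(*) OVER w AS repeat_count,
       COALESCE(MAX(booking_has_insurance_indicator) OVER w, 0)
           AS repeat_has_insurance_indicator
FROM (
    -- turn the date into an integer day number so RANGE can use offsets
    SELECT *, CAST(julianday(bk_date) AS INTEGER) AS day_no
    FROM orders
)
WINDOW w AS (
    PARTITION BY customer_id
    ORDER BY day_no
    RANGE BETWEEN 1 FOLLOWING AND 90 FOLLOWING
)
ORDER BY customer_id, order_id
"""


def make_demo_db():
    """Load the eight sample orders for one hypothetical customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, order_id INTEGER, "
        "bk_date TEXT, booking_has_insurance_indicator INTEGER)"
    )
    sample = [
        (1, 1, "2021-07-20", 0), (1, 2, "2021-08-02", 0),
        (1, 3, "2021-08-03", 1), (1, 4, "2021-08-09", 1),
        (1, 5, "2021-11-06", 0), (1, 6, "2021-12-02", 0),
        (1, 7, "2021-12-06", 0), (1, 8, "2021-12-07", 0),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", sample)
    return conn


def repeat_purchases(conn):
    """Return (order_id, bk_date, repeat_count, repeat_has_insurance_indicator)."""
    return conn.execute(REPEAT_SQL).fetchall()


for row in repeat_purchases(make_demo_db()):
    print(row)
```

With a strict 90-day window this returns a repeat_count of 1 for order_id = 3 rather than the 2 shown in the table above, because 11/6 falls 95 days after 8/3; the other rows match.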

rolling sum of a column in pandas dataframe at variable intervals

I have a list of index numbers that represent index locations for a DF. list_index = [2,7,12]
I want to sum from a single column in the DF by rolling through each number in list_index and totaling the counts between the index points (and restart count at 0 at each index point). Here is a mini example.
The desired output is in OUTPUT column, which increments every time there is another 1 from COL 1 and RESTARTS the count at 0 on the location after the number in the list_index.
I was able to get it to work with a loop but there are millions of rows in the DF and it takes a while for the loop to run. It seems like I need a lambda function with a sum but I need to input start and end point in index.
Something like lambda x:x.rolling(start_index, end_index).sum()? Can anyone help me out on this.
You can try a cumulative sum and keep only the information related to the 1 values; a rolling sum with different intervals is not possible.
a = df['col'].eq(1).cumsum()
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
Out:
    col  output
0     0       0
1     1       1
2     1       2
3     0       0
4     1       1
5     1       2
6     1       3
7     0       0
8     0       0
9     0       0
10    0       0
11    1       1
12    1       2
13    0       0
14    0       0
15    1       1
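Note that the snippet above restarts the count at every 0 in the column. If the restart really has to happen right after each position in list_index, as the question describes, a grouped cumulative sum is one alternative. This is a minimal sketch that assumes the DataFrame has the default integer index; the helper name `cumsum_with_resets` is made up here.

```python
import numpy as np
import pandas as pd


def cumsum_with_resets(df, col, reset_after):
    """Cumulative sum of df[col] that restarts on the row after each label in reset_after."""
    boundaries = np.sort(np.asarray(reset_after))
    # Rows up to and including boundaries[0] get segment 0, the next run segment 1, ...
    segment = np.searchsorted(boundaries, df.index.to_numpy(), side="left")
    return df[col].groupby(segment).cumsum()


df = pd.DataFrame({"col": [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]})
list_index = [2, 7, 12]
df["output"] = cumsum_with_resets(df, "col", list_index)
print(df)
```

Both `searchsorted` and the grouped `cumsum` are vectorized, so this avoids the Python-level loop over rows.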

how to calculate the specific accumulated amount in t-sql

For each row, I need the integer part of val divided by 4. For each subsequent row, we add the remainder carried over from the previous rows to the current row's remainder, and again take the integer part and the remainder of that sum divided by 4. Consider the example below:
id val
1 22
2 1
3 1
4 2
5 1
6 6
7 1
After dividing by 4, we look at the integer part (wh1) and the remainder (rem1). For each id we add up the accumulated remainders until they can be divided by 4 (wh2, rem2):
id  val  wh1  rem1  wh2        rem2     RESULT(wh1+wh2)
1   22   5    2     0          2        5
2   1    0    1     (3/4=0)    3%4=3    0
3   1    0    1     (4/4=1)    4%4=0    1
4   2    0    2     (2/4=0)    2%4=2    0
5   1    0    1     (3/4=0)    3%4=3    0
6   6    1    2     (5/4=1)    5%4=1    2
7   1    0    1     (2/4=0)    2%4=2    0
How can I get the RESULT column with SQL?
Data of project:
http://sqlfiddle.com/#!18/9e18f/2
The integer part of the division by 4 is easy; the problem is calculating the accumulated remainders for each id and working out which of them also become divisible by 4.
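One observation that may simplify this: carrying the remainder forward is the same as dividing the running total by 4, so RESULT for a row is the integer part of the running total divided by 4, minus the same quantity for the previous row. Below is a sketch of that idea, run through Python's sqlite3 as a stand-in for T-SQL (the table name `t` is an assumption, and non-negative integer values are assumed so that integer division truncates as expected).

```python
import sqlite3

RESULT_SQL = """
WITH running AS (
    SELECT id, val, SUM(val) OVER (ORDER BY id) AS total
    FROM t
)
SELECT id,
       val,
       total / 4 - (total - val) / 4 AS result  -- (total - val) is the previous running total
FROM running
ORDER BY id
"""


def make_demo_db():
    """Load the seven sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(1, 22), (2, 1), (3, 1), (4, 2), (5, 1), (6, 6), (7, 1)],
    )
    return conn


def accumulated_result(conn):
    """Return (id, val, result) rows."""
    return conn.execute(RESULT_SQL).fetchall()


for row in accumulated_result(make_demo_db()):
    print(row)
```

For the sample data the running totals are 22, 23, 24, 26, 27, 33, 34, whose integer parts after dividing by 4 are 5, 5, 6, 6, 6, 8, 8; the row-to-row differences give 5, 0, 1, 0, 0, 2, 0, which matches the RESULT column above.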

How to return a group of rows when one row meets "where" criteria in SQL Anywhere

I am somewhat overwhelmed by what I am trying to do, since I have only been using SQL for 3 days now, but I already love the increased functionality over MS query. The need for the IN function is what drove me to learn about this, and I thank the community for the info here to get me through learning that.
I tried looking thru other questions, but I couldn't find one in which the intent was to group more than two rows, or to group a varying number of rows. This means that count and duplicate are both out as options.
What I am doing is analyzing a table of part number information that spans multiple store locations. The table gives a row to each instance of a part number, so if all 15 stores have some sort of history for a given part number, that part number will have 15 rows in the table.
I want to look at other stores' history for parts that have zero sales history at my location. The purpose is to see whether they can be transferred to another store instead of being returned to the vendor and incurring a restock fee.
Here is a simplified version of the table organized in the way I would want the output to be structured. I got here by having suspected part numbers and using the list of them as a text string in IN() but I want to go about this the other way and build a list of part numbers from sales data in this table.
Branch | Part_No  | Description | Bin Qty | current 12 mo sales | previous 12 mo sales
-------|----------|-------------|---------|---------------------|---------------------
20     | CA38385  | SUPPORT     | 2       | 1                   | 1
23     | CA38385  | SUPPORT     | 1       | 0                   | 0
25     | CA38385  | SUPPORT     | 0       | 0                   | 1
20     | DFC10513 | Hdw Kit     | 0       | 1                   | 0
23     | DFC10513 | Hdw Kit     | 1       | 0                   | 0
07     | DFC10513 | Hdw Kit     | 0       | 1                   | 0
3      | D59096   | VALVE       | 0       | 0                   | 12
5      | D59096   | VALVE       | 0       | 0                   | 4
6      | D59096   | VALVE       | 4       | 6                   | 12
8      | D59096   | VALVE       | 0       | 0                   | 0
33     | D59096   | VALVE       | 11      | 14                  | 18
21     | D59096   | VALVE       | 4       | 4                   | 4
22     | D59096   | VALVE       | 0       | 0                   | 0
23     | D59096   | VALVE       | 10      | 0                   | 0
24     | D59096   | VALVE       | 0       | 0                   | 0
25     | D59096   | VALVE       | 0       | 0                   | 0
26     | D59096   | VALVE       | 2       | 2                   | 0
1      | TE67401  | Repair Kit  | 1       | 1                   | 2
21     | TE67401  | REPAIR KIT  | 1       | 3                   | 0
22     | TE67401  | REPAIR KIT  | 0       | 1                   | 0
I am branch 23, so the start of the query as I understand it would be
Select * from part_information
Group By part_number
Having IN(Branch) 23 and bin qty > 0 and current_12_mo_sales=0 and previous_12_mo_sales = 0
Can you point me down the right track? This table has approx. 200000 rows in it, so I really need to learn how to do this. I really don't see a better way.
Thank you in advance for your help and or criticism -Cody
Select * from part_information
where part_number in (
    select part_number from part_information
    where branch = 23 and bin_qty > 0 -- etc...
)
(Apologies for lack of formatting).
This ended up working the way I wanted
SELECT pi_Branch, pi_Franchise, pi_Part_No, pi_Description, pi_Bin_Qty,
       pi_Bin, pi_current_12_mo_sales, pi_previous_12_mo_sales, pi_Inventory_Cost,
       pi_Return_Indicator
From Part_Information
Where pi_Part_No IN (Select pi_Part_No
                     From Part_Information
                     Where pi_Branch = 23
                       And pi_Bin_Qty > 0
                       And pi_current_12_mo_sales <= 0
                       And pi_previous_12_mo_sales <= 0)
I was thinking that this had to be some complex process, but in reality, two simple queries were all that was needed.
I would still be interested in anyone's opinion on a better or more efficient way of handling this.
Thanks Mischa for getting me there!
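On the question of other ways to write this: if the engine supports window functions, one alternative worth measuring is to flag each part number in a single pass instead of using a subquery. The sketch below uses Python's sqlite3 with simplified, hypothetical column names (`branch`, `part_no`, `bin_qty`, `cur_sales`, `prev_sales`) and a few rows from the table in the question. Whether it is actually faster than the IN version depends on the engine and indexes, so treat it as something to test, not a recommendation.

```python
import sqlite3

# Mark every row of a part number if branch 23 holds stock of it with no sales.
FLAGGED_SQL = """
SELECT branch, part_no, bin_qty, cur_sales, prev_sales
FROM (
    SELECT *,
           MAX(CASE WHEN branch = 23 AND bin_qty > 0
                         AND cur_sales <= 0 AND prev_sales <= 0
                    THEN 1 ELSE 0 END) OVER (PARTITION BY part_no) AS flagged
    FROM part_information
)
WHERE flagged = 1
ORDER BY part_no, branch
"""


def make_demo_db():
    """Load a small subset of the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE part_information (branch INTEGER, part_no TEXT, "
        "bin_qty INTEGER, cur_sales INTEGER, prev_sales INTEGER)"
    )
    sample = [
        (20, "CA38385", 2, 1, 1), (23, "CA38385", 1, 0, 0), (25, "CA38385", 0, 0, 1),
        (20, "DFC10513", 0, 1, 0), (23, "DFC10513", 1, 0, 0), (7, "DFC10513", 0, 1, 0),
        (1, "TE67401", 1, 1, 2), (21, "TE67401", 1, 3, 0), (22, "TE67401", 0, 1, 0),
    ]
    conn.executemany("INSERT INTO part_information VALUES (?, ?, ?, ?, ?)", sample)
    return conn


def flagged_parts(conn):
    """Return all branches' rows for parts that branch 23 stocks but has not sold."""
    return conn.execute(FLAGGED_SQL).fetchall()


for row in flagged_parts(make_demo_db()):
    print(row)
```

In this subset, CA38385 and DFC10513 are returned for all their branches, while TE67401 is left out because branch 23 has no row for it.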

How to perform a Distinct Sum using MDX?

So I have data like this:
Date EMPLOYEE_ID HEADCOUNT TERMINATIONS
1/31/2011 1 1 0
2/28/2011 1 1 0
3/31/2011 1 1 0
4/30/2011 1 1 0
...
1/31/2012 1 1 0
2/28/2012 1 1 0
3/31/2012 1 1 0
1/31/2012 2 1 0
2/28/2011 2 1 0
3/31/2011 2 1 0
4/30/2011 2 0 1
1/31/2012 3 1 0
2/28/2011 3 1 0
3/31/2011 3 1 0
4/30/2011 3 1 0
...
1/31/2012 3 1 0
2/28/2012 3 1 0
3/31/2012 3 1 0
And I want to sum up the headcount, but I need to remove the duplicate entries from the sum by the employee_id. From the data you can see employee_id 1 occurs many times in the table, but I only want to add its headcount column once. For example if I rolled up on year I might get a report using this query:
with member [Measures].[Distinct HeadCount] as
??? how do I define this???
select { [Date].[YEAR].children } on ROWS,
{ [Measures].[Distinct HeadCount] } on COLUMNS
from [someCube]
It would produce this output:
YEAR Distinct HeadCount
2011 3
2012 2
Any ideas how to do this with MDX? Is there a way to control which row is used in the sum for each employee?
You can use an expression like this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
Sum(NonEmpty('the set of the employee ids', 'all the dates of the current year (ie [Date].[YEAR].CurrentMember)'), [Measures].[HeadCount])
If you want a more generic expression you can use this:
WITH MEMBER [Measures].[Distinct HeadCount] AS
Sum(NonEmpty('the set of the employee ids',
Descendants(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember, Axis(0).Item(0).Item(0).Hierarchy.CurrentMember.Level, LEAVES)),
IIf(IsLeaf(Axis(0).Item(0).Item(0).Hierarchy.CurrentMember),
[Measures].[HeadCount],
NULL))