How to aggregate number of customers in SQL Server?

I have transaction data like this:
| Time_Stamp | Customer_ID | Amount | Department | Pay_Method | Channel |
|---------------------|-------------|--------|------------|-------------|------------|
| 2018-03-07 14:23:33 | 374856829 | 14.63 | Fruit | Credit Card | Mobile App |
I have written an aggregation procedure like this:
INSERT INTO Days
(
Year,
Month,
Day,
Department,
Pay_Method,
Total_Dollars,
Total_Transactions,
Total_Customers
)
SELECT
YEAR(Time_Stamp),
MONTH(Time_Stamp),
DAY(Time_Stamp),
Department,
Pay_Method,
SUM(Amount),
COUNT(*),
COUNT(DISTINCT(Customer_ID))
FROM
Transactions
GROUP BY
YEAR(Time_Stamp),
MONTH(Time_Stamp),
DAY(Time_Stamp),
Department,
Pay_Method
Which populates a data mart table like this:
| Year | Month | Day | Department | Pay_Method | Total_Dollars | Total_Transactions | Total_Customers |
|------|-------|-----|------------|------------|---------------|--------------------|-----------------|
| 2018 | 3 | 7 | Home | Cash | 2398540.57 | 543084 | 325783 |
| 2018 | 3 | 7 | Home | Credit | 7458392.47 | 1587695 | 758643 |
So far, so good.
I then have procedures which feed the charts UI like this:
SELECT
Year,
Month,
Day,
SUM(Total_Dollars),
SUM(Total_Transactions),
SUM(Total_Customers)
FROM
Days
WHERE
Department = IIF(@Department IS NULL, Department, @Department) AND
Pay_Method = IIF(@Pay_Method IS NULL, Pay_Method, @Pay_Method)
GROUP BY
Year,
Month,
Day
This all works great for Total_Transactions and Total_Dollars, but not for Total_Customers.
The Total_Customers numbers in the Days table are correct in each row, for that specific combination of Year, Month, Day, Department and Pay_Method, but when two of those rows are summed together, the total becomes inaccurate, because the same customer may have made multiple transactions using different Department(s) and Pay_Method(s) on the same date. The numbers become even more inaccurate when adding days together to get monthly customer counts, etc...
I thought the solution would be to try and trick SQL Server into considering "all" as possible values for the various "group by" fields, and played around with group by and case quite a bit but couldn't figure it out. Essentially, in addition to my Days table containing every specific combination of Year, Month, Day, Department and Pay_Method, I also need to generate rows where Year, Month, Day, Department and Pay_Method are considered as "any" or "all". Lastly, I don't need to generate rows where Year is "any" and Month and Day are specified (although it wouldn't hurt really), as no one cares for totals of March 7th in any year, etc...
Can someone help me write the query to properly populate my Days table?

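The problem is that the "grain" of your model is wrong. Grain is the term for the level of detail stored in a fact table.
You always want to store your facts at the finest level of detail, because only then can you aggregate them correctly. A distinct count such as Total_Customers is not additive: the per-row counts cannot be summed across departments, payment methods or days without counting the same customer more than once. You were already at the right grain with your first table.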
Rather than aggregating the data (incorrectly) into your second table, simply rewrite or amend that table to break your date/time into the fields you require for reporting.
By the way, if this is truly representative of your data, I suspect that you might actually be hiding an error in your transaction count. You may need a finer level of detail than "department", and I suspect it might be a concept like "product". What would happen to your model if a customer bought both apples and oranges?
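To make the point about grain concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The question is about SQL Server, so SQLite, the trimmed-down Transactions table and the three sample rows are all assumptions made for illustration. It compares the sum of per-group distinct counts (what the Days table stores) with a distinct count taken straight from the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Transactions ("
    "Time_Stamp TEXT, Customer_ID INTEGER, Amount REAL, "
    "Department TEXT, Pay_Method TEXT)"
)
# Hypothetical sample data: customer 1 shops in two departments on the same day.
rows = [
    ("2018-03-07 09:00:00", 1, 10.0, "Fruit", "Cash"),
    ("2018-03-07 10:00:00", 1, 5.0, "Home", "Credit Card"),
    ("2018-03-07 11:00:00", 2, 7.5, "Fruit", "Cash"),
]
conn.executemany("INSERT INTO Transactions VALUES (?, ?, ?, ?, ?)", rows)

# Distinct customers per (day, department, pay method), as stored in Days.
per_group = conn.execute(
    "SELECT COUNT(DISTINCT Customer_ID) FROM Transactions "
    "GROUP BY date(Time_Stamp), Department, Pay_Method"
).fetchall()
summed_customers = sum(count for (count,) in per_group)

# Distinct customers for the whole day, computed from the detail rows.
(daily_customers,) = conn.execute(
    "SELECT COUNT(DISTINCT Customer_ID) FROM Transactions "
    "WHERE date(Time_Stamp) = '2018-03-07'"
).fetchone()

print(summed_customers, daily_customers)
```

With this sample data the summed figure counts customer 1 twice, while the count taken from the detail rows does not. That is the same effect described in the question, only at a smaller scale.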

Related

SQL - Identify if a user is present every month

I am performing some data analysis on users who have made transactions over the course of three months.
What I would like to do is identify customers who made a specific transaction type (Credit) in every single month present in the data table. As you can see in the data table below, User A has performed a Credit transaction in months 1, 2 and 3, and I would like a flag saying "Frequent" applied to that customer.
User B, however, has not performed a credit transaction every month (month 2 was Debit), and so I would like them to have a different flag name (e.g. "Infrequent").
How can I use SQL to identify if a user has made a specific transaction type each month?
| Date | User | Amount | Transaction Type | **Flag ** |
| 2022-01-15 | A | $15.00 | Credit | **Flag ** |
...
| 2022-02-15 | A | $15.00 | Credit | **Flag ** |
...
| 2022-03-15 | A | $15.00 | Credit | **Flag ** |
...
...
| 2022-01-15 | B | $15.00 | Credit | **Flag ** |
...
| 2022-02-15 | B | $15.00 | Debit | **Flag ** |
...
| 2022-03-15 | B | $15.00 | Credit | **Flag ** |
I have tried the following - hoping there is a better or more simple way.
SELECT
Date, User, Amount, Transaction_Type,
CASE WHEN Count(present) = 3 THEN 'Frequent' ELSE 'Infrequent' END AS Flag
FROM Transactions
LEFT JOIN (
SELECT
User,Month(Date),Count(Transaction_Type) as present
FROM
Transactions
WHERE
Transaction_Type = 'Credit'
GROUP BY
User,Month(Date)
Having
Count(Transaction_Type) > 0
) subquery
ON subquery.User = Transactions.User
GROUP BY
Date,User,Amount,Transaction_Type
That is the way I would approach it. Assuming you are using T-SQL, I would make the following changes. Instead of having the LEFT JOIN go to a sub-query, I would make the sub-query a CTE and then join to that. I find it easier to grok when the main query is not full of sub-queries, and you can test the CTE on its own more easily. If performance becomes an issue, it is also relatively trivial to convert the CTE to a temp table without affecting the main query too much.
You have a couple of problems, I think. The first is that your subquery is going to return the count of the credits in each month. If I make 3 credits in January, this is going to flag me as frequent because the total reaches 3. You probably want to do a
COUNT(DISTINCT Transaction_type) AS hasCredit
to identify if there is AT LEAST ONE credit transaction, then have another aggregation that
SUM(hasCredit)
to get the number of months in which a credit appears.
Using nested sub-queries means your LEFT JOIN would now be two sub-queries deep and disappearing off the right-hand side of your screen. Writing them as CTEs keeps the main logic clean and the script narrow.
I think this does what you need, but can't test it because I don't have any sample data.
-- USER is a reserved word in T-SQL, so the [User] and [Date] columns are bracketed.
WITH CTE_HasCredit AS
(
    -- One row per user and month that contains at least one Credit
    SELECT
        [User]
        ,MONTH([Date]) AS [TransactionMonth]
        ,COUNT(DISTINCT Transaction_Type) AS [hasCredit] -- always 1 after the WHERE filter
    FROM
        Transactions
    WHERE
        Transaction_Type = 'Credit'
    GROUP BY
        [User]
        ,MONTH([Date])
    HAVING
        COUNT(Transaction_Type) > 0
)
,
CTE_isFrequent AS
(
    -- Number of months in which each user has a Credit
    SELECT
        [User]
        ,SUM(hasCredit) AS [TotalCredits]
    FROM
        CTE_HasCredit
    GROUP BY
        [User]
)
SELECT
    TXN.[Date]
    ,TXN.[User]
    ,TXN.Amount
    ,TXN.Transaction_Type
    ,CASE
        WHEN FRQ.TotalCredits >= 3 THEN 'Frequent'
        ELSE 'Infrequent'
    END AS [customerType]
FROM
    Transactions AS TXN
LEFT JOIN
    CTE_isFrequent AS FRQ ON FRQ.[User] = TXN.[User]
GROUP BY
    TXN.[Date]
    ,TXN.[User]
    ,TXN.Amount
    ,TXN.Transaction_Type
    ,FRQ.TotalCredits
I don't think you need the GROUP BY on the main query either; it would de-dupe transactions for the same day for the same amount.
You might also want to look at the syntax for COUNT() OVER(). It would allow you to do the calculation in the main query and would look something like this:
,CASE
WHEN COUNT(DISTINCT TXN.Transaction_Type) OVER(PARTITION BY User, MONTH(TXN.Date),TXN.Transaction_Type) >=3 THEN 'Frequent'
ELSE 'Infrequent'
END AS [customerType2]
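This second way would give you a customer type for both the Debits and Credits. I am not aware of a way to filter the COUNT() OVER() to just Credits; for that you would need to use the CTE method. Also be aware that SQL Server does not allow DISTINCT inside a windowed aggregate, so the expression would have to be rewritten before it runs there.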
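For anyone who wants to experiment with the "distinct months with a credit" idea, here is a minimal runnable sketch in Python with the standard-library sqlite3 module. SQLite is used instead of T-SQL purely so the example is self-contained, the sample rows are made up, and the CTE names months and credit_months are hypothetical. Rather than hard-coding 3, it compares each user's number of credit months with the number of months present in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Transactions ("Date" TEXT, "User" TEXT, '
    "Amount REAL, Transaction_Type TEXT)"
)
conn.executemany(
    "INSERT INTO Transactions VALUES (?, ?, ?, ?)",
    [
        ("2022-01-15", "A", 15.0, "Credit"),
        ("2022-02-15", "A", 15.0, "Credit"),
        ("2022-02-20", "A", 15.0, "Credit"),  # second credit in the same month
        ("2022-03-15", "A", 15.0, "Credit"),
        ("2022-01-15", "B", 15.0, "Credit"),
        ("2022-02-15", "B", 15.0, "Debit"),
        ("2022-03-15", "B", 15.0, "Credit"),
    ],
)

query = """
WITH months AS (
    -- How many distinct year-months appear in the table at all
    SELECT COUNT(DISTINCT strftime('%Y-%m', "Date")) AS total_months
    FROM Transactions
),
credit_months AS (
    -- How many distinct year-months contain a Credit, per user
    SELECT "User" AS user_name,
           COUNT(DISTINCT strftime('%Y-%m', "Date")) AS months_with_credit
    FROM Transactions
    WHERE Transaction_Type = 'Credit'
    GROUP BY "User"
)
SELECT u.user_name,
       CASE WHEN COALESCE(c.months_with_credit, 0) = m.total_months
            THEN 'Frequent' ELSE 'Infrequent' END AS flag
FROM (SELECT DISTINCT "User" AS user_name FROM Transactions) AS u
CROSS JOIN months AS m
LEFT JOIN credit_months AS c ON c.user_name = u.user_name
ORDER BY u.user_name
"""
flags = dict(conn.execute(query).fetchall())
print(flags)
```

Counting distinct year-months rather than transactions means several credits in one month only count once, and including the year keeps January of one year separate from January of the next.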

Structuring Month-Based Data in SQL

I'm curious about what the best way to structure data in a SQL database where I need to keep track of certain fields and how they differ month to month.
For example, say I had a users table in which I was trying to store three values: name, email, and how many times the user has logged in each month. Would it be best practice to create a new column for each month and store the number of times they logged in that month under that column? Or would it be better to create a new row/table for each month?
My instinct says creating new columns is the best way to reduce redundancy, however I can see it getting a little unwieldy when the number of columns in the table changes over time. (I was also thinking that if I were to do it by column, it would warrant having a total_column that keeps track of all months at a time).
Thanks!
In my opinion, the best approach is to store each login for each user.
Use a query to summarize the data the way you need it when you query it.
You should only be thinking about other structures if summarizing the detail doesn't meet performance requirements -- which for a monthly report don't seem so onerous.
Whatever you do, storing counts in separate columns is not the right thing to do. Every month, you would need to add another column to the table.
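As a rough illustration of "store each login, summarize when you query", here is a small sketch in Python with the standard-library sqlite3 module. The logins table, its column names and the sample rows are assumptions made for the example, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per login event; nothing is pre-aggregated.
conn.execute("CREATE TABLE logins (user_id INTEGER, logged_in_at TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2018-01-03 08:00:00"),
        (1, "2018-01-19 17:30:00"),
        (1, "2018-02-02 09:15:00"),
        (2, "2018-01-05 12:00:00"),
    ],
)

# Monthly counts are derived at query time, so no schema change is
# needed when a new month starts.
monthly_logins = conn.execute(
    "SELECT user_id, strftime('%Y-%m', logged_in_at) AS login_month, COUNT(*) "
    "FROM logins GROUP BY user_id, login_month ORDER BY user_id, login_month"
).fetchall()
print(monthly_logins)
```

If the summary query ever becomes too slow, the same SELECT can populate a separate summary table without changing how logins are recorded.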
I'm not an expert but in my opinion, it is best to store data in a separate table (in your case). That way you can manipulate the data easily and you don't have to modify the table design in the future.
PK: UserID & Date or New Column (Ex: RowNo with auto increment)
+--------+------------+-----------+
| UserID | Date | NoOfTimes |
+--------+------------+-----------+
| 01 | 2018.01.01 | 1 |
| 01 | 2018.01.02 | 3 |
| 01 | 2018.01.03 | 5 |
| .. | | |
| 02 | 2018.01.01 | 2 |
| 02 | 2018.01.02 | 6 |
+--------+------------+-----------+
Or
PK: UserID, Year & Month or New Column (Ex: RowNo with auto increment)
+--------+------+-------+-----------+
| UserID | Year | Month | NoOfTimes |
+--------+------+-------+-----------+
| 01 | 2018 | Jan | 10 |
| 01 | 2018 | feb | 13 |
+--------+------+-------+-----------+
Before you create the table, please take a look at database normalization, especially the first (1NF), second (2NF) and third (3NF) normal forms.
https://www.tutorialspoint.com/dbms/database_normalization.htm
https://www.lifewire.com/database-normalization-basics-1019735
https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
https://www.studytonight.com/dbms/database-normalization.php
https://medium.com/omarelgabrys-blog/database-normalization-part-7-ef7225150c7f
Either approach is valid, depending on query patterns and join requirements.
One row for each month
For a user, the row containing login count for the month will be inserted when data is available for the month. There will be 1 row per month per user. This design will make it easier to do joins by month column. However, multiple rows will need to be accessed to get data for a user for the year.
-- column list
name
email
month
login_count
-- example entries
'user1', 'user1@email.com','jan',100
'user2', 'user2@email.com','jan',65
'user1', 'user1@email.com','feb',90
'user2', 'user2@email.com','feb',75
One row for all months
You do not need to dynamically add columns, since the number of months in a year is known in advance. The table can be initially created to accommodate all months. By default, all month_login_count columns would be initialized to 0. Then, the row would be updated as the login count for each month is populated. There will be 1 row per user. This design is not the best for doing joins by month. However, only one row will need to be accessed to get data for a user for the year.
-- column list
name
email
jan_login_count
feb_login_count
mar_login_count
apr_login_count
may_login_count
jun_login_count
jul_login_count
aug_login_count
sep_login_count
oct_login_count
nov_login_count
dec_login_count
-- example entries
'user1','user1@email.com',100,90,0,0,0,0,0,0,0,0,0,0
'user2','user2@email.com',65,75,0,0,0,0,0,0,0,0,0,0

SQL payments matrix

I want to combine two tables into one:
The first table: Payments
id | 2010_01 | 2010_02 | 2010_03
1 | 3.000 | 500 | 0
2 | 1.000 | 800 | 0
3 | 200 | 2.000 | 300
4 | 700 | 1.000 | 100
The second table is ID and some date (different for every ID)
id | date |
1 | 2010-02-28 |
2 | 2010-03-01 |
3 | 2010-01-31 |
4 | 2010-02-11 |
What I'm trying to achieve is to create table which contains all payments before the date in ID table to create something like this:
id | date | T_00 | T_01 | T_02
1 | 2010-02-28 | 500 | 3.000 |
2 | 2010-03-01 | 0 | 800 | 1.000
3 | 2010-01-31 | 200 | |
4 | 2010-02-11 | 1.000 | 700 |
Where T_00 means payment in the same month as 'date' value, T_01 payment in previous month and so on.
Is there a way to do this?
EDIT:
I'm trying to achieve this in MS Access.
The problem is that I cannot connect name of the first table's column with the date in the second (the easiest way would be to treat it as variable)
I added T_00 to T_24 columns in the second (ID) table and was trying to UPDATE those fields
set T_00 =
iif(year(date)&"_"&month(date)=2010_10,
but I realized that that would be too much code for Access to handle if I wanted to do this for every payment period and every T_xx column.
Even if I wrote the code for T_00, I would have to repeat it for the next 23 periods.
Your Payments table is de-normalized. Those date columns are repeating groups, meaning you've violated First Normal Form (1NF). It's especially difficult because your field names are actually data. As you've found, repeating groups are a complete pain in the ass when you want to relate the table to something else. This is why 1NF is so important, but knowing that doesn't solve your problem.
You can normalize your data by creating a view that UNIONs your Payments table.
Like so:
-- Column names that start with a digit must be bracketed, e.g. [2010_01].
CREATE VIEW NormalizedPayments (id, [Year], [Month], Amount) AS
SELECT id,
    2010 AS [Year],
    1 AS [Month],
    [2010_01] AS Amount
FROM Payments
UNION ALL
SELECT id,
    2010 AS [Year],
    2 AS [Month],
    [2010_02] AS Amount
FROM Payments
UNION ALL
SELECT id,
    2010 AS [Year],
    3 AS [Month],
    [2010_03] AS Amount
FROM Payments
And so on if you have more. This is how the Payments table should have been designed in the first place.
It may be easier to use a single date field with the value '2010-01-01' instead of separate Year and Month fields. It depends on your data. You may also want to filter out empty payments in each query in the UNION (for example WHERE [2010_01] IS NOT NULL), or you might want to use Nz([2010_01], 0) AS Amount. Again, it depends on your data and other queries.
It's hard for me to understand how you're joining from here, particularly how the id fields relate because I don't see how they do with the small amount of data provided, so I'll provide some general ideas for what to do next.
Next you can join your second table with this normalized Payments table on id. To actually produce the result you want, include a calculated field in that query with the difference in months between the date in the second table and the payment month. Then, create an actual Pivot Table on that difference to format your results, which is the proper way to display data like your tables do.
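The join, month difference and pivot steps can be sketched roughly as follows. This is Python with the standard-library sqlite3 module rather than MS Access, so treat it as an illustration of the idea only. The Dates table name, the months_back column and the ref_date alias are hypothetical, NormalizedPayments is built as a plain table instead of a view, and "3.000" in the question is assumed to mean 3000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE NormalizedPayments "
    "(id INTEGER, year INTEGER, month INTEGER, amount REAL)"
)
conn.execute("CREATE TABLE Dates (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO NormalizedPayments VALUES (?, ?, ?, ?)",
    [
        (1, 2010, 1, 3000), (1, 2010, 2, 500), (1, 2010, 3, 0),
        (2, 2010, 1, 1000), (2, 2010, 2, 800), (2, 2010, 3, 0),
    ],
)
conn.executemany(
    "INSERT INTO Dates VALUES (?, ?)",
    [(1, "2010-02-28"), (2, "2010-03-01")],
)

query = """
SELECT id, ref_date,
       SUM(CASE WHEN months_back = 0 THEN amount END) AS T_00,
       SUM(CASE WHEN months_back = 1 THEN amount END) AS T_01,
       SUM(CASE WHEN months_back = 2 THEN amount END) AS T_02
FROM (
    -- months_back = whole months between the payment month and the date
    SELECT d.id AS id, d.date AS ref_date, p.amount AS amount,
           (CAST(strftime('%Y', d.date) AS INTEGER) - p.year) * 12
           + CAST(strftime('%m', d.date) AS INTEGER) - p.month AS months_back
    FROM Dates AS d
    JOIN NormalizedPayments AS p ON p.id = d.id
) AS diffs
WHERE months_back >= 0          -- ignore payments after the date's month
GROUP BY id, ref_date
ORDER BY id
"""
matrix = conn.execute(query).fetchall()
for row in matrix:
    print(row)
```

The conditional SUM columns play the role of the pivot here; more T_xx columns would each need their own CASE expression, which is why a real pivot or crosstab feature is more convenient for 24 periods.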

Attributes of my Time dimension table in star schema

I'm building a DW using star schema modeling. I'll use it for a BI project with Pentaho.
I'll of course have a time dimension table. I'll analyze my fact table at different granularities (day, week, month, year, perhaps others).
Should I put one attribute for each of those granularity in my dimension table (so I have one day attribute, one month attribute, one year attribute ...) or should I just write the date and then calculate everything with this date (get the month of the date, the year of the date ...)?
Thanks a lot for your help.
In addition to day, week, month, and year, you should think of other attributes like "company holiday", or "fiscal quarter". This can be an enormous resource for driving the same query off of different time windows.
I would add the attributes of the dates as their own columns. This does not take up significantly more space, and generally gives the query optimiser a better shot at working out how many of the dimension table records match a given criterion (for example, that the day_of_month = 31).
Typically, the more, the merrier.
Here is an example I'm using...
ledger@localhost-> select * from date_dimension where date = '2015-12-25';
-[ RECORD 1 ]----+--------------------
date | 2015-12-25
year | 2015
month | 12
monthname | December
day | 25
dayofyear | 359
weekdayname | Friday
calendarweek | 52
formatteddate | 25. 12. 2015
quartal | Q4
yearquartal | 2015/Q4
yearmonth | 2015/12
yearcalendarweek | 2015/52
weekend | Weekday
americanholiday | Holiday
austrianholiday | Holiday
canadianholiday | Holiday
period | Christmas season
cwstart | 2015-12-21
cwend | 2015-12-27
monthstart | 2015-12-01
monthend | 2015-12-31 00:00:00
It's based on queries from the PostgreSQL wiki here... https://wiki.postgresql.org/wiki/Date_and_Time_dimensions
It would be interesting to augment this with further things:
Religious days (Easter, some of the numerous Saints' days, Ramadan, Jewish festivals, etc)
Statutory holidays for relevant jurisdictions. The firm I work for winds up publicizing Irish banking holidays because a number of the customers pay via bank transfers.
If you operate in France, you might want Lundi, Mardi, Mercredi, ... rather than English day names.
Daylight Savings Time (as true/false) would be a nice addition.
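If you would rather generate the rows in application code than in SQL, a few of the attributes above can be derived like this. It is a minimal sketch in Python using only the standard library. The function name date_dimension_row is hypothetical, the keys simply mirror some of the column names in the example record, and holiday or season columns are left out because they need a lookup table.

```python
from datetime import date, timedelta


def date_dimension_row(d):
    """Derive a subset of date-dimension attributes for one calendar day."""
    # First day of the following month: jump past day 28, then snap to day 1.
    next_month = (d.replace(day=28) + timedelta(days=4)).replace(day=1)
    return {
        "date": d.isoformat(),
        "year": d.year,
        "month": d.month,
        "monthname": d.strftime("%B"),      # depends on the process locale
        "day": d.day,
        "dayofyear": d.timetuple().tm_yday,
        "weekdayname": d.strftime("%A"),    # depends on the process locale
        "calendarweek": d.isocalendar()[1],  # ISO 8601 week number
        "quartal": f"Q{(d.month - 1) // 3 + 1}",
        "weekend": "Weekend" if d.isoweekday() >= 6 else "Weekday",
        "monthstart": d.replace(day=1).isoformat(),
        "monthend": (next_month - timedelta(days=1)).isoformat(),
    }


row = date_dimension_row(date(2015, 12, 25))
for key, value in row.items():
    print(f"{key:13} | {value}")
```

Looping this over a date range and bulk-inserting the dictionaries gives you a pre-computed dimension table, so queries never have to call date functions on the fact table.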

Comparing in SQL and SUM

I really couldn't figure out a good title for this question, but I have a problem that I'm sure you can help me with!
I have a query which outputs something like this:
Month | Year | Subcategory | PrivateLabel | Price
-------------------------------------------------
1 | 2010 | 666 | No | -520
1 | 2010 | 666 | No | -499,75
1 | 2010 | 666 | No | -59,95
1 | 2010 | 666 | No | -49,73
1 | 2010 | 666 | No | -32,95
I want to SUM on the price because all the other data is the same. I thought I could do this with SUM and GROUP BY, but I can't figure out how to do it or at least it doesn't output the right result.
The query is an inner join between two tables, if that helps.
select
month
,year
,subcategory
,privatelabel
,sum(price) as [total sales]
from
a inner join b ...
where
any where clauses
group by
month
,year
,subcategory
,privatelabel
This should work if I am understanding you correctly. Every column in the SELECT either needs to be part of the GROUP BY or be an aggregate function over all rows in the group.
I added a fiddle, mainly as I didn't know about the text-to-DDL functionality and wanted to test it ;-) (thanks Michael Buen)
http://sqlfiddle.com/#!3/35c1c/1
Note that the WHERE clause is a placeholder.
select month, year, subcategory, privatelabel, sum(price)
from (put your query in here) dummyName
group by month, year, subcategory, privatelabel
Basic idea is it will run your current query to get above output then do the sum and group by on the result.
Your query has to be in parentheses and you have to give it some name, e.g. dummyName. As long as it's unique in the SQL and preferably not a keyword, it doesn't matter what it is.
There might be a way of doing all this in one go, but without the sql for your query we can't help.
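Here is a small runnable sketch of the "wrap your query and group the result" approach, in Python with the standard-library sqlite3 module. The sales table is an assumption that stands in for the output of the original join, which was not posted, and the comma decimals from the question are written with dots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the rows produced by the original (unknown) join.
conn.execute(
    "CREATE TABLE sales (month INTEGER, year INTEGER, subcategory INTEGER, "
    "privatelabel TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2010, 666, "No", -520.0),
        (1, 2010, 666, "No", -499.75),
        (1, 2010, 666, "No", -59.95),
        (1, 2010, 666, "No", -49.73),
        (1, 2010, 666, "No", -32.95),
    ],
)

# The inner SELECT is where the existing query would go; it needs an alias.
totals = conn.execute(
    "SELECT month, year, subcategory, privatelabel, SUM(price) AS total_sales "
    "FROM (SELECT * FROM sales) AS dummyName "
    "GROUP BY month, year, subcategory, privatelabel"
).fetchall()
print(totals)
```

The five identical-looking rows collapse into a single row whose last column is their total, which is the result the question asks for.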