Identifying the parent transaction in a set of records in SQL

I have a list of transactions and their payments. I want to find the parent payment for each transaction in order to identify repeat customers. For example, I have the list below:
Transaction | Payment1 | Payment2 | Payment3 | Bucket
100 | A | B | C | P1
110 | B | | | P1
120 | D | E | | P2
130 | D | E | F | P2
140 | C | B | | P1
160 | F | K | | P2
170 | C | A | | P1
The parent transaction (the Bucket column) is the final result. The parent need not always be A or D; whichever value makes it easiest to identify the group uniquely will do. For example, instead of A it can be B or C if that is easier to derive.
I tried an iterative approach: first comparing column 3 with the values in columns 1 and 2, and taking column 1 of the first match as the parent. But it does not work reliably. I have more than a million transactions for which I need to derive the parent payment in order to identify unique customers.
For example, in transaction 100 I used 3 different payment cards (such as Visa, MasterCard, AMEX, a debit card or a gift card). I might use any of these cards in other transactions. For example, I used payment B in transaction 110, so 100 and 110 should be in the same bucket. For transactions 140 and 170 I used payments C, B and C, A. All these cards belong to the same person, so all these transactions should fall into the same bucket. I want to identify that bucket. Let us call this set of transactions P1; if I query on P1, I should get all these transactions. The same applies to the other sets of transactions.

OK, thanks. I figured out a solution: I loaded the data into R, wrote a function that loops over the records to identify the parent transaction, and wrote the result back to the database. Since there are millions of records, it takes 2 days to run, but the result is accurate.
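The grouping described in the question can also be read as finding connected components: two payments belong to the same customer if they ever appear in the same transaction, directly or through a chain of other transactions. The following is a minimal Python sketch of that idea using a union-find structure; it is not the R function mentioned above. The name `assign_buckets` and the choice of the smallest payment label as the bucket name are assumptions made for illustration, and every transaction is assumed to have at least one payment.

```python
def assign_buckets(transactions):
    """Group transactions that share a payment, directly or transitively.

    `transactions` maps a transaction id to the list of payments used in it.
    Returns a dict: transaction id -> bucket label. The label is the smallest
    payment in the group (an arbitrary but deterministic choice).
    """
    parent = {}

    def find(x):
        # Find the root of x, shortening the path as we go (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the root so the result is stable.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    for payments in transactions.values():
        # All payments of one transaction belong to the same customer.
        for p in payments:
            union(payments[0], p)

    return {txn: find(payments[0]) for txn, payments in transactions.items()}


# Sample data from the question.
sample = {
    100: ["A", "B", "C"],
    110: ["B"],
    120: ["D", "E"],
    130: ["D", "E", "F"],
    140: ["C", "B"],
    160: ["F", "K"],
    170: ["C", "A"],
}
buckets = assign_buckets(sample)
print(buckets)
```

On the sample data, transactions 100, 110, 140 and 170 end up under label A and transactions 120, 130 and 160 under label D, which corresponds to buckets P1 and P2. The work is close to linear in the total number of payments, so a single pass of this kind should be far cheaper than a row-by-row comparison loop.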

Related

Getting Unique Data Based on Two Columns and Date

I am new to SAS (and Proc SQL) and I am working this out as an exercise to improve my familiarity with SAS, but can't seem to get the correct solution.
I have resort data for two neighboring resorts that contains a guest identifier, resort identifier, when the person was admitted into the resort, and when they left. I have already sorted the data by guest identifier, admission date, and leave date. The data looks something like this:
ID Resort Admission_Date Leave_Date
1 B 15SEP2020 24SEP2020
1 A 24SEP2020 01OCT2020
1 B 25SEP2020 27SEP2020
1 B 28DEC2020 29DEC2020
2 B 07FEB2020 09FEB2020
2 A 09FEB2020 22FEB2020
3 B 26DEC2019 29DEC2019
3 B 30JAN2021 23FEB2021
3 A 23FEB2021 12MAR2021
3 B 13APR2021 16APR2021
3 B 05MAY2021 07MAY2021
My goal here is to identify those guests who went from resort A to resort B (or vice versa). I realize that some guests visited both resorts multiple times. To avoid this issue of multiple resort visits, I would like to summarize the data so that we only have the first "switch" between resorts. In other words, once a guest switches from resort A to B (or from B to A), we do not care if they go back to the first resort.
Thus, the end dataset should look something like this:
ID Resort Admission_Date Leave_Date
1 B 15SEP2020 24SEP2020
1 A 24SEP2020 01OCT2020
2 B 07FEB2020 09FEB2020
2 A 09FEB2020 22FEB2020
3 B 26DEC2019 29DEC2019
3 A 23FEB2021 12MAR2021
I realize that this may have a simple solution, but I am not able to come up with it on my own at this time so any help on this is greatly appreciated!
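The selection rule can be stated as: per guest, keep the first stay, then keep the first later stay whose resort differs from that first one, and ignore everything else. The sketch below expresses that rule in Python rather than SAS, purely to make the logic concrete; the function name `first_switch` is hypothetical, and the input is assumed to be sorted by guest and admission date as described. Note that a guest who never switches would still contribute their first stay, so such guests would need to be filtered out separately if they are not wanted.

```python
def first_switch(stays):
    """Keep, per guest, the first stay and the first stay at a different resort.

    `stays` is a list of (guest_id, resort, admission, leave) tuples that is
    already sorted by guest and admission date.
    """
    first_resort = {}  # guest_id -> resort of that guest's first stay
    switched = set()   # guests whose first switch has already been kept
    result = []
    for guest, resort, admission, leave in stays:
        if guest not in first_resort:
            first_resort[guest] = resort
            result.append((guest, resort, admission, leave))
        elif resort != first_resort[guest] and guest not in switched:
            switched.add(guest)
            result.append((guest, resort, admission, leave))
    return result


# Sample data from the question (dates kept as plain strings).
stays = [
    (1, "B", "15SEP2020", "24SEP2020"),
    (1, "A", "24SEP2020", "01OCT2020"),
    (1, "B", "25SEP2020", "27SEP2020"),
    (1, "B", "28DEC2020", "29DEC2020"),
    (2, "B", "07FEB2020", "09FEB2020"),
    (2, "A", "09FEB2020", "22FEB2020"),
    (3, "B", "26DEC2019", "29DEC2019"),
    (3, "B", "30JAN2021", "23FEB2021"),
    (3, "A", "23FEB2021", "12MAR2021"),
    (3, "B", "13APR2021", "16APR2021"),
    (3, "B", "05MAY2021", "07MAY2021"),
]
for row in first_switch(stays):
    print(*row)
```

In a SAS data step the same idea would typically rely on BY-group processing with retained variables, but that translation is left as an exercise and is not shown here.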

SQL-sum over dynamic period

I have 2 tables: Customers and Actions, where each customer has a unique ID (which can be found in each table).
Some of the customers became club members on a specific date (which differs between customers). I'm trying to sum their purchases up to that date, and to get those who purchased more than (for example) 200 before they became club members.
For example, I can have the following customer:
custID purchDate purchAmount
1 2015-05-12 100
1 2015-07-12 150
1 2015-12-29 320
Now, assume that custID=1 became a club member on 2015-12-25; in that case, I'd like to get SUM(purchAmount)=250 (note that I want this customer in the result because 250>200).
I tried the following:
SELECT cust.custID, SUM(purchAmount)totAmount
FROM customers cust
JOIN actions act
ON cust.custID=act.custID
WHERE act.clubMember=1
AND cust.purchDate<act.clubMemberDate
GROUP BY cust.custID
HAVING totAmount>200;
Is this the right way to "attack" this question, or should I use something like a WHILE loop over the clubMemberDate (which, to tell the truth, I don't know how to do)?
I'm working with Teradata.
Your help will be appreciated.
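The logic being asked for does not need a loop over membership dates: each purchase row can be compared against its own customer's membership date, and the surviving rows summed per customer. The Python sketch below shows that logic on the sample data. It is not Teradata SQL, and the names `club_dates` and `big_spenders_before_membership` are hypothetical; which table actually holds the membership date is an assumption here.

```python
from datetime import date


def big_spenders_before_membership(club_dates, purchases, threshold=200):
    """Sum each member's purchases made strictly before their join date.

    `club_dates` maps custID -> date the customer became a club member
    (customers who never joined are simply absent).
    `purchases` is a list of (custID, purchDate, purchAmount) tuples.
    Returns {custID: total} for totals greater than `threshold`.
    """
    totals = {}
    for cust_id, purch_date, amount in purchases:
        joined = club_dates.get(cust_id)
        # Per-row filter: only purchases before this customer's own join date.
        if joined is not None and purch_date < joined:
            totals[cust_id] = totals.get(cust_id, 0) + amount
    # Equivalent of the HAVING clause: filter on the aggregated total.
    return {c: t for c, t in totals.items() if t > threshold}


club_dates = {1: date(2015, 12, 25)}
purchases = [
    (1, date(2015, 5, 12), 100),
    (1, date(2015, 7, 12), 150),
    (1, date(2015, 12, 29), 320),
]
result = big_spenders_before_membership(club_dates, purchases)
print(result)
```

This mirrors the shape of the attempted query: a join on custID, a row-level date filter, a GROUP BY and a HAVING on the sum.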

Designing a SSAS Cube with many to many reference table.

I have a table with the following schema.
Dim LocationProductsMapping Table (the key in this table is not used anywhere. It is just a primary key column. )
Key Location Products
XX A1 P1
XX A2 P2
XX A3 P3
XX A1 P2
XX A3 P2
* Dim Products (Say P1 , P2 , P3 as keys)
* Dim SellingMode (Say S1 , S2 , S3 as keys)
* Dim Shop (Shop1,Shop10,Shop100)
Fact Sales table
Product SellingMode Shop Revenue
P1 S1 Shop1 $100
P1 S2 Shop10 $400
P1 S1 Shop100 $100
P1 S3 Shop1 $100
P2 S2 Shop10 $400
P1 S1 Shop100 $100
P3 S3 Shop1 $100
Now I need to build a CUBE.
How should I create a dimension that includes my location-products mapping? (That is, when I filter by Location, I should only get the data for that location's products.)
Output
SellingMode1 $2000 Revenue 20 number of Products
SellingMode2 $3000 Revenue 25 number of products
I tried to create a hierarchy in the dimension like Location, ProductKey, but that doesn't help: the values are wrong and the filter condition is not applied.
I cannot change the table schema.
The LocationProductsMapping table is not included in the data source view automatically (I added it).
I don't have a dimension created for "LocationProductsMapping" (understandable, as this is the reference table).
I am not sure if you have a separate location table. If you have one, build a dimension from it, otherwise use the Location column from the LocationProductsMapping table to build the dimension.
Then create a new measure group from your LocationProductsMapping table. As Analysis Services cannot have measure groups without measures, use the count which the wizard normally suggests. Make this measure invisible, as it is not useful for users. Then on the "Dimension Usage" tab of cube designer, make sure your mapping measure group is related to the Product and Location dimensions, and set the relationship from the main measure group to location to "Many-to-many", selecting the mapping measure group.
And you are done. Analysis Services handles the rest for you.
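To make clear what the many-to-many relationship is expected to return, here is a small Python sketch of the same resolution done by hand on the sample tables: the location selects products through the mapping table, and only sales of those products are aggregated per selling mode. This only illustrates the intended result and says nothing about how Analysis Services computes it internally; the function name `revenue_by_selling_mode` is hypothetical.

```python
from collections import defaultdict

# LocationProductsMapping rows from the question: (Location, Product).
location_products = [
    ("A1", "P1"), ("A2", "P2"), ("A3", "P3"), ("A1", "P2"), ("A3", "P2"),
]

# Fact Sales rows from the question: (Product, SellingMode, Shop, Revenue).
sales = [
    ("P1", "S1", "Shop1", 100),
    ("P1", "S2", "Shop10", 400),
    ("P1", "S1", "Shop100", 100),
    ("P1", "S3", "Shop1", 100),
    ("P2", "S2", "Shop10", 400),
    ("P1", "S1", "Shop100", 100),
    ("P3", "S3", "Shop1", 100),
]


def revenue_by_selling_mode(location):
    """Return {SellingMode: (total revenue, number of distinct products)}."""
    # Step 1: resolve the location to its products via the mapping table.
    products = {p for loc, p in location_products if loc == location}
    # Step 2: aggregate only the fact rows for those products.
    revenue = defaultdict(int)
    distinct_products = defaultdict(set)
    for product, mode, _shop, amount in sales:
        if product in products:
            revenue[mode] += amount
            distinct_products[mode].add(product)
    return {m: (revenue[m], len(distinct_products[m])) for m in revenue}


print(revenue_by_selling_mode("A1"))
```

For location A1, which maps to P1 and P2, the P3 sale is excluded and the remaining rows are summed per selling mode.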

Optimal selection for ordering multiple items (parts) from multiple suppliers (vendors)

The task here is to define the optimal (as detailed below) way of ordering items (parts) from suppliers.
The relevant parts of the table schema (with some sample data) are
Items
ID NUMBER
1 Item0001
2 Item0002
3 Item0003
Suppliers
ID NAME DELIVERY DISCOUNT
1 Supplier0001 0 0
2 Supplier0002 0 0.025
3 Supplier0003 20 0
DELIVERY is the delivery charge (in dollars) levied by that supplier on each delivery. DISCOUNT is the settlement discount allowed by that supplier for on-time payment, stored as a fraction (i.e. 0.025 means 2.5% for ID=2 above).
SupplierItems
SUPPLIER_ID ITEM_ID PRICE
1 2 21.67
1 5 45.54
1 7 32.97
This is the many-to-many join between suppliers and items with the price that supplier charges for that item (in dollars). Every item has at least 1 supplier but some have more than one. A supplier may have no items.
PartsRequests
ID ITEM_ID QUANTITY LOCATION_ID ORDER_ID
1 59 4 2 (null)
2 89 5 2 (null)
3 42 4 2 (null)
This table is a request from a field site for parts to be ordered and delivered by the supplier to that site. A delivery of any number of items to a site attracts a delivery charge. When the parts are ordered, the ORDER_ID is inserted into the table so we are only concerned with those where ORDER_ID IS NULL
The question is: what is the optimal way to order these parts for each LOCATION, where there are 3 optimal solutions that need to be presented to the user for selection:
1. The combination of orders with the least number of suppliers.
2. The combination of orders with the lowest total cost, i.e. the sum of QUANTITY*PRICE for each item plus the DELIVERY for each order, summed over all orders, ignoring DISCOUNT.
3. As item 2 but accounting for DISCOUNT.
Clearly I need to determine the combinations of orders that are available and then determining the optimal ones becomes trivial but I am a bit stuck on an efficient way to deal with building the combinations.
I have built some SQL fiddles in SQL Server 2008 with random data. This one has 100 items, 10 suppliers and 100 requests. This one has 1000 items, 50 suppliers and 250 requests. The table schema is the same.
Update
I reasoned that the solution had to be recursive and I built a nice table-valued function to do it, but I ran into the hard limit of 32 nesting levels in SQL Server. I was uncomfortable with it anyway because it hinted more at a procedural-language solution than an RDBMS one.
So I am now playing with CTE recursion.
The root query is:
SELECT DISTINCT
'' SOLUTION_ID
,LOCATION_ID
,SUPPLIER_ID
,(subquery I haven't quite worked out) SOLE_SUPPLIER
FROM PartsRequests pr
INNER JOIN
SupplierItems si ON pr.ITEM_ID=si.ITEM_ID
WHERE pr.ORDER_ID IS NULL
This gets all the suppliers that can supply the required items and is certainly a solution, probably not optimal. The subquery sets a flag if the supplier is the sole supplier of any product required for that location; if so they must be part of any solution.
The recursive part is to remove suppliers one by one by means of CTE.SUPPLIER_ID<>CTE.SUPPLIER_ID and add them if they still cover all the items. The SOLUTION_ID will be a CSV list of the suppliers removed, partly to uniquely identify each solution and partly to check against so I get combinations instead of permutations.
Still working on the details, the purpose of this update was to allow the Community to say "Yay, looks like that will work" or, alternatively "You moron, that won't work because ..."
Thanks
This is a more general answer (as in, not SQL), as I think solving this problem will require something more powerful. Your first scenario is to select a minimum number of suppliers. This can be seen as a set cover problem, as you are trying to cover all demands per site with the suppliers. Set cover is already NP-hard (its decision version is NP-complete).
Your third scenario seems to be basically the same as the second. You just have to take the discount into account in the prices, assuming you pay on time for every order.
The second scenario is at least NP-hard as I see a lot of resemblance with the facility location problem. You are trying to decide which suppliers (facilities) to use (open) to cover your orders (demands) based on their prices and delivery costs (opening costs).
Enumerating your possible solutions seems infeasible as with 10 suppliers, you have 2^10 possibilities of using them, further complicated by the distribution of demands internally.
I would suggest some dynamic programming to first select the suppliers that you have to use (=they are the only ones that deliver a specific thing), eliminating some possibilities (if the cost for supplier A +delivery cost A< cost for supplier B) and then trying to expand your set of possible solutions. Linear programming is also a valid train of thought.
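For the first scenario (fewest suppliers), the two ideas above can be combined in a small exact search: first fix the suppliers that are the sole source of some required item, then try the remaining suppliers in combinations of increasing size until everything is covered. The Python sketch below does this for one location. The function name `min_supplier_cover` and the sample catalog are hypothetical, prices and delivery charges are ignored, and the search is still exponential in the worst case, so it is only practical when few optional suppliers remain.

```python
from itertools import combinations


def min_supplier_cover(required_items, catalog):
    """Return a smallest set of suppliers that covers all required items.

    `catalog` maps supplier -> set of items that supplier can deliver.
    Returns None if some required item has no supplier at all.
    """
    required = set(required_items)

    # Suppliers that are the only source of some item must be in any solution.
    forced = set()
    for item in required:
        sources = [s for s, items in catalog.items() if item in items]
        if not sources:
            return None
        if len(sources) == 1:
            forced.add(sources[0])

    covered = set()
    for s in forced:
        covered |= catalog[s] & required
    remaining = required - covered

    # Only suppliers that can still contribute something are worth trying.
    optional = sorted(
        s for s in catalog if s not in forced and catalog[s] & remaining
    )

    # Try 0, 1, 2, ... extra suppliers; the first hit is a minimum-size cover.
    for size in range(len(optional) + 1):
        for combo in combinations(optional, size):
            got = set()
            for s in combo:
                got |= catalog[s]
            if remaining <= got:
                return forced | set(combo)
    return None


# Hypothetical data: S1 is the sole supplier of item 1.
catalog = {"S1": {1, 2}, "S2": {2, 3}, "S3": {3, 4}, "S4": {4}}
solution = min_supplier_cover({1, 2, 3, 4}, catalog)
print(sorted(solution))
```

The cost-based scenarios need more than this, because the cheapest source per item and the per-order delivery charge interact; that is where an integer or linear programming formulation becomes more natural.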

Sql query for calculating room prices

Hi, I have a problem I have been working on for a while now. Let's say I have a view, call it room_price, that looks like this:
room | people | price | hotel
1 | 1 | 200 | A
2 | 2 | 99 | A
3 | 3 | 95 | A
4 | 1 | 90 | B
5 | 6 | 300 | B
I am looking for the lowest price in each hotel for x people.
For 1 person I would expect:
hotel | price
A | 200
B | 90
For 2 people I would have:
hotel | price
A | 99
This is because hotel B has no rooms that fit exactly 2 people; the 6-person room cannot be used for fewer (or more) than 6 people.
For hotel A the price is 99 because I use room 2.
For 6 people the result should be:
hotel | price
A | 394
B | 300
So for hotel A I take rooms 1, 2 and 3, and for hotel B the lowest price is the single room 5 for 300.
I implemented this with the restriction that the people must fit into at most 3 rooms, which is acceptable, but my query is too slow. It looks something like this:
select a.hotel,a.price+a1.price+a2.price
from room_price a, room_price a1, room_price a2
where
a.room<> a1.room
and a1.room<> a2.room
and a.room<> a2.room
and a.hotel = a1.hotel
and a.hotel = a2.hotel
After that I did a GROUP BY hotel and took MIN(price), and it worked... but running the query that produces room_price 3 times and then taking the Cartesian product took too much time. There are around 5000 rows in room_price, and it is generated by a rather complicated SQL query (it takes start and end dates, multiple prices, currency exchange...).
I can use SQL, custom functions... or anything that will make this fast, but I would prefer to stay at the database level without processing this data in the application (I am using Java), as I will be extending this further to add some additional data to the query.
I would be grateful for any help.
Query itself:
WITH RECURSIVE
setup AS (
    SELECT 3::INT4 AS people
),
room_sets AS (
    SELECT
        n.hotel,
        array[ n.room ] AS rooms,
        n.price,
        n.people
    FROM
        setup s,
        room_price n
    WHERE
        n.people <= s.people
    UNION ALL
    SELECT
        rs.hotel,
        rs.rooms || n.room,
        rs.price + n.price AS price,
        rs.people + n.people AS people
    FROM
        setup s,
        room_sets rs
        JOIN room_price n USING (hotel)
    WHERE
        n.room > rs.rooms[ array_upper( rs.rooms, 1 )]
        AND rs.people + n.people <= s.people
),
results AS (
    SELECT
        DISTINCT ON (rs.hotel)
        rs.*
    FROM
        room_sets rs,
        setup s
    WHERE
        rs.people = s.people
    ORDER BY
        rs.hotel, rs.price
)
SELECT * FROM results;
Tested it on this dataset:
CREATE TABLE room_price (
    room INT4 NOT NULL,
    people INT4 NOT NULL,
    price INT4 NOT NULL,
    hotel TEXT NOT NULL,
    PRIMARY KEY (hotel, room)
);
copy room_price FROM stdin WITH DELIMITER ',';
1,1,200,A
2,2,99,A
3,3,95,A
4,1,90,B
5,6,300,B
\.
Please note that it will become much slower as you add more rooms to your table.
To change the number of people you want results for, edit the setup part.
Wrote detailed explanation on how it works.
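For readers who find the recursive CTE hard to follow, the same search can be sketched in Python: grow room sets within one hotel by only ever adding a room with a higher id than the last one (so each set is built once), stop growing a set as soon as it would exceed the requested head count, and keep the cheapest set per hotel that matches the head count exactly. This is an illustrative re-statement, not the query itself. The function name `cheapest_exact_fit` is hypothetical, every room is assumed to hold at least one person, and unlike the original three-way join it does not cap the number of rooms.

```python
def cheapest_exact_fit(rooms, people):
    """Cheapest set of rooms per hotel that holds exactly `people` guests.

    `rooms` is a list of (room, people, price, hotel) tuples.
    Returns {hotel: (total_price, [room ids])}; hotels with no exact fit
    are omitted.
    """
    by_hotel = {}
    for room, cap, price, hotel in sorted(rooms):  # sorted by room id
        by_hotel.setdefault(hotel, []).append((room, cap, price))

    best = {}

    def extend(hotel, candidates, start, chosen, cap_sum, price_sum):
        if cap_sum == people:
            if hotel not in best or price_sum < best[hotel][0]:
                best[hotel] = (price_sum, list(chosen))
            return
        # Only look at rooms after the last one chosen: combinations, not
        # permutations (the n.room > last-room condition in the CTE).
        for i in range(start, len(candidates)):
            room, cap, price = candidates[i]
            if cap_sum + cap <= people:
                chosen.append(room)
                extend(hotel, candidates, i + 1, chosen,
                       cap_sum + cap, price_sum + price)
                chosen.pop()

    for hotel, candidates in by_hotel.items():
        extend(hotel, candidates, 0, [], 0, 0)
    return best


# Sample view contents from the question: (room, people, price, hotel).
room_price = [
    (1, 1, 200, "A"),
    (2, 2, 99, "A"),
    (3, 3, 95, "A"),
    (4, 1, 90, "B"),
    (5, 6, 300, "B"),
]
for n in (1, 2, 6):
    print(n, cheapest_exact_fit(room_price, n))
```

As with the CTE, the number of candidate sets grows quickly with the number of rooms per hotel, so the early cut-off on head count is what keeps the search manageable.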
It looks like your query as typed is incorrect with the FROM clause... it looks like aliases are out of whack
from room_price a, room_price,a1 room_price,room_price a2
and should be
from room_price a, room_price a1, room_price a2
That MIGHT be giving the query a false alias / extra table giving some sort of Cartesian product making it hang....
--- ok on the FROM clause...
Additionally, and just a thought... Since the "Room" appears to be an internal auto-increment ID column, it will never be duplicated, such as Room 100 in hotel A and Room 100 in hotel B. Your query to do <> on the room make sense so you are never comparing across the board on all 3 tables...
Why not force the a1 and a2 joins to only qualify for room GREATER than "a" room. Otherwise you'll be re-testing the same conditions over and over. From your example data, just on hotel A, you have room IDs of 1, 2 and 3. You are thus comparing
a a1 a2
1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1
Would it help to only compare where "a1" is always greater than "a" and "a2" is always greater than "a1" thus doing tests of
a a1 a2
1 2 3
would give the same result as all the rest, and thus cut your result set down to one record in this case. But then, how can you really compare against a hotel with only TWO rooms, such as hotel B? You would NEVER get an answer, since your qualification for rooms is
a <> a1 AND
a <> a2 AND
a1 <> a2
You may want to try cutting down to only a single self-join for a1, a2 and keep the compare only to the two, such as
select a1.hotel, a1.price + a2.price
from room_price a1, room_price a2
where a1.hotel = a2.hotel
and a2.room > a1.room
For hotel "A", you would thus have final result comparisons of
a1 a2
1 2
1 3
2 3
and for hotel "B"
a1 a2
4 5
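The difference between the two join conditions can be checked quickly outside the database. In the Python sketch below, `permutations` plays the role of the `<>` conditions (every ordering of distinct rooms) and `combinations` plays the role of the `>` conditions (each set of rooms once). The room ids are the ones from the question; the variable names are only illustrative.

```python
from itertools import combinations, permutations

rooms_a = [1, 2, 3]  # room ids in hotel A
rooms_b = [4, 5]     # room ids in hotel B

# a <> a1, a1 <> a2, a <> a2: every ordering of three distinct rooms.
ordered_triples = list(permutations(rooms_a, 3))

# a1 > a and a2 > a1: each set of three rooms exactly once.
unordered_triples = list(combinations(rooms_a, 3))

# Single self-join with a2.room > a1.room: each pair exactly once.
pairs_a = list(combinations(rooms_a, 2))
pairs_b = list(combinations(rooms_b, 2))

print(len(ordered_triples), unordered_triples)
print(pairs_a, pairs_b)
```

With n rooms in a hotel, the `<>` form produces n(n-1)(n-2) rows per hotel while the ordered form produces one sixth of that, which is where the saving comes from.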
The use of <> is going to have a rather large impact once you start to look at larger data sets, especially if the prior filtering doesn't drastically reduce their size. Using it may prevent the direct query from being optimised and from using indexes, and the view may not use indexes either, because SQL will attempt to run the filters for the query and the view against the tables in as few statements as possible (pending optimisations done by the engine).
I would ideally start with the view and confirm it's properly optimised. Just looking at the query itself this has a better chance of being optimised;
SELECT
    a.hotel, a.price + a1.price + a2.price
FROM
    room_price a,
    room_price a1,
    room_price a2
WHERE
    (a.room > a1.room OR a.room < a1.room) AND
    (a1.room > a2.room OR a1.room < a2.room) AND
    (a.room > a2.room OR a.room < a2.room) AND
    a.hotel = a1.hotel AND
    a.hotel = a2.hotel
It appears to return the same results, but I'm not sure how you implement this query in your overall solution. So consider just the nature of the changes to the existing query and what you have done already.
Hopefully that helps. If not, you might need to consider what the view is doing and how it works: a view that returns results from a temp table or table variable can't use indexes either. In that case, generating an indexed temp table might be better for you.