knowing which item is owned by each customer - spark SQL - sql

Let's say I have a table of customers that contains 2 columns:
id
preferences array - an array of strings of length 3, which might contain nulls. Preferences are different for each customer, so one customer might care about color and another might not.
as an example:
id | preferences array
-------------------------------------------
1  | {'color:red','shape:triangle','speed:high'}
2  | {'age:14','color:blue',null}
I also have a table of items, again with 2 columns: id and preferences array. This time, the array can be of any length:
id | preferences array
----------------------------------
1  | {'color:red','shape:triangle','speed:high','hand:third'}
2  | {'shape:circle'}
An item is matched to a customer if all of the strings in the customer's preferences array appear in the item's preferences array. Not all of the strings in the item's preferences array have to appear in the customer's preferences array, though.
I need to create a new table in which one column is the customer id and the other is an array of the ids of all the items that matched that customer.
customer_id | items
----------------------------------
1           | {3,4,7,300,4190..., 6000}
2           | {3,5617}
.
.
.
19,456      | {1551, 1456,3000}
Please note that I need a solution that will work even for a lot of items and customers (around 10,000).
How can I do this using SQL (Spark SQL, specifically)?

Hmmm . . . One method is to explode the arrays and join. The following gets the customer/item pairs:
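select c.id as customer_id, i.id as item_id
from (select id, c_preference,
             -- number of non-null preferences this customer has
             count(*) over (partition by id) as n_prefs
      from customers
           lateral view explode(preferences_array) cp as c_preference
      where c_preference is not null  -- a null entry can never match an item
     ) c join
     (select distinct id, i_preference
      from items
           lateral view explode(preferences_array) ip as i_preference
     ) i
     on c.c_preference = i.i_preference
group by c.id, i.id, c.n_prefs
-- keep the pair only if every customer preference found a match
having count(*) = c.n_prefs;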
You can reaggregate to get the list of items for each customer.
Note: This does not return customers with no preferences. Although they technically meet the requirements of your question, I suspect they don't meet the spirit of what you want to do.
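To make the matching rule and the re-aggregation step concrete, here is a minimal sketch in plain Python rather than Spark SQL. It assumes that null entries in a customer's array are simply ignored, and item 3 is a hypothetical row added so that customer 2 has something to match; the function name match_items is made up for illustration.

```python
from collections import defaultdict

def match_items(customers, items):
    """Map each customer id to the sorted ids of items that contain
    every non-null preference of that customer."""
    # Inverted index: preference string -> ids of items that carry it.
    index = defaultdict(set)
    for item_id, prefs in items.items():
        for pref in prefs:
            if pref is not None:
                index[pref].add(item_id)

    result = {}
    for cust_id, prefs in customers.items():
        wanted = [p for p in prefs if p is not None]
        if not wanted:
            continue  # like the query, skip customers with no preferences
        # An item matches only if it appears under every wanted preference.
        matched = set.intersection(*(index.get(p, set()) for p in wanted))
        if matched:
            result[cust_id] = sorted(matched)
    return result

customers = {
    1: ["color:red", "shape:triangle", "speed:high"],
    2: ["age:14", "color:blue", None],
}
items = {
    1: ["color:red", "shape:triangle", "speed:high", "hand:third"],
    2: ["shape:circle"],
    3: ["age:14", "color:blue", "shape:circle"],  # hypothetical extra item
}
print(match_items(customers, items))
```

The final dictionary plays the role of the customer_id | items table; in Spark SQL the same re-aggregation would typically be a second group by on customer_id that collects the item ids.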

Related

Aggregating or Bundle a Many to Many Relationship in SQL Developer

So I have 1 single table with 2 columns : Sales_Order called ccso, Arrangement called arrmap
The table has distinct values for this combination and both these fields have a Many to Many relationship
1 ccso can have Multiple arrmap
1 arrmap can have Multiple ccso
All such combinations should be considered as one single bundle
Objective :
Assign a final map to each of the Sales Order as the Largest Arrangement in that Bundle
Example:
ccso : 100-10015 has 3 arrangements --> Now each of those arrangements have a set of Sales Orders --> Now those sales orders will also have a list of other arrangements and so on
(Image : 1)
Therefore the answer definitely points to something that checks recursively. I've managed to write the code below, and it works as long as I hard-code a ccso in the WHERE clause, but I don't know how to proceed from here. (I'm an accountant by profession but have been finding more passion in coding recently.) I've searched the forums and the web for things like
Recursive CTEs,
many to many aggregation
cartesian product etc
and I'm sure there must be a term for this which I don't know yet. I've also tried
I have to use sqldeveloper or googlesheet query and filter formulas
sqldeveloper has restrictions on some CTEs. If recursion is the way, I'd like to know how, and whether I can limit the depth to, say, 4 or 5 iterations
Ideally I'd want to update a third column with the final map if possible but if not, then a select query result is just fine
Codes I've tried
Code 1: As per Screenshot
WITH a1(ccso, amap) AS
(SELECT distinct a.ccso, a.arrmap
FROM rg_consol_map2 A
WHERE a.ccso = '100-10115' -- this condition defines the ultimate ancestors in your chain, change it as appropriate
UNION ALL
SELECT m.ccso, m.arrmap
FROM rg_consol_map2 m
JOIN a1
ON M.arrmap = a1.amap -- or m.ccso=a1.ccso
) /*if*/ CYCLE amap SET nemap TO 1 /*else*/ DEFAULT 0
SELECT DISTINCT amap FROM (SELECT ccso, amap FROM a1 ORDER BY 1 DESC) WHERE ROWNUM = 1
In this the main challenge is how to remove the hardcoded ccso and do a join for each of the ccso
Code 2 : Manual CTEs for depth
Here again the join outside the CTE gives me an error and sqldeveloper does not allow WITH clause with UPDATE statement - only works for select and cannot be enclosed within brackets as subtable
SELECT distinct ccso FROM
(
WITH ar1 AS
(SELECT distinct arrmap
FROM rg_consol_map
WHERE ccso = a.ccso
)
,so1 AS
(SELECT DISTINCT ccso
FROM rg_consol_map
WHERE arrmap IN (SELECT arrmap FROM ar1)
)
,ar2 AS
(SELECT DISTINCT ccso FROM rg_consol_map
where arrmap IN (select distinct arrmap FROM rg_consol_map
WHERE ccso IN (SELECT ccso FROM so1)
))
SELECT ar1.arrmap, NULL ccso FROM ar1
union all
SELECT null, ar2.ccso FROM ar2
UNION ALL
SELECT NULL arrmap, so1.ccso FROM so1
)
Am I missing something here, or is there an easier way to do this? I read something about MERGE and PROC SQL JOIN but was unable to get them to work; if that's the way to go, I will try further if someone can point me in the right direction.
(Image : 2)
(CSV File : [3])
Edit : Fixing CSV file link
https://github.com/karan360note/karanstackoverflow.git
I suppose it can be downloaded from here: IC mapping many to many.csv
Oracle 11g is being used.
Apologies in advance for the wall of text.
Your problem is a complex, multi-layered many-to-many query; there is no "easy" solution to this, because the design is not ideal. The safest bet is literally to use multiple layers of CTEs or subqueries to reach all the depths you want, as the only ways I know to do this recursively rely on an anchor column (like "parentID") to direct the recursion in a linear fashion. We don't have that option here; we'd go in circles without a way to track our path.
Therefore, I went basic, and with several subqueries. Every level checks for a) All orders containing a particular ARRMAP item, and then b) All additional items on those orders. It's clear enough for you to see the logic and modify to your needs. It will generate a new table that contains the original CCSO, the linking ARRMAP, and the related CCSO. Link: https://pastebin.com/un70JnpA
This should enable you to go back and perform the desired updates you want, based on order # or order date, etc... in a much more straightforward fashion. Once you have an anchor column, a CTE in the future is much more trivial (just search for "CTE recursion tree hierarchy").
SELECT DISTINCT
CCSO, RELATEDORDER
FROM myTempTable
WHERE CCSO = '100-10115'; /* to find all orders by CCSO, query SELECT DISTINCT RELATEDORDER */
--WHERE ARRMAP = 'ARR10524'; /* to find all orders by ARRMAP, query SELECT DISTINCT CCSO */
EDIT:
To better explain what this table generates, let me simplify the problem.
If you have order
A with arrangements 1 and 2;
B with arrangement 2, 3; and
C with arrangement 3;
then, by your initial inquiry and image, order A should be related to orders B and C, right? The query generates the following table when you SELECT DISTINCT ccso, relatedOrder:
+-------+--------------+
| CCSO  | RelatedOrder |
+-------+--------------+
| A     | B            |
| A     | C            |
+-------+--------------+
| B     | C            |
| B     | A            |
+-------+--------------+
| C     | A            |
| C     | B            |
+-------+--------------+
You can see here if you query WHERE CCSO = 'A' OR RelatedOrder = 'A', you'll get the same relationships, just flipped between the two columns.
+-------+--------------+
| CCSO  | RelatedOrder |
+-------+--------------+
| A     | B            |
| A     | C            |
+-------+--------------+
| B     | A            |
+-------+--------------+
| C     | A            |
+-------+--------------+
So filter on only one of the two columns, either CCSO or RelatedOrder, not both.
As for the results of WHERE CCSO = '100-10115', see image here, which includes all the links you showed in your Image #1, as well as additional depths of relations.
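Outside SQL, a "bundle" in the question's sense can be treated as a connected component of the graph that links sales orders to arrangements. The following is a minimal Python sketch of that idea using a union-find structure; it is a different technique from the subquery table above, not a translation of it. The function name is made up, order D is a hypothetical extra row, and "largest" is assumed to mean the greatest value under ordinary comparison, since the question does not define it.

```python
def largest_arrangement_per_order(pairs):
    """pairs: iterable of (ccso, arrmap). Returns {ccso: largest arrmap in its bundle}."""
    pairs = list(pairs)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag the two kinds of node so an order and an arrangement never collide.
    for ccso, arrmap in pairs:
        union(("so", ccso), ("arr", arrmap))

    # Largest arrangement seen in each component (assumption: plain ">" ordering).
    best = {}
    for ccso, arrmap in pairs:
        root = find(("so", ccso))
        if root not in best or arrmap > best[root]:
            best[root] = arrmap

    return {ccso: best[find(("so", ccso))] for ccso, _ in pairs}

# Orders A, B, C from the simplified example, plus a hypothetical unrelated order D.
pairs = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3), ("D", 9)]
final_map = largest_arrangement_per_order(pairs)
print(final_map)
```

Because the components are found by merging rather than by walking a fixed number of levels, no depth limit is needed; the result could then be loaded back into a table to drive an update.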

Issue displaying empty value of repeated columns in Google Data Studio

I've got an issue when trying to visualize in Google Data Studio some information from a denormalized table.
Context: I want to gather all the contacts of a company and their related orders in a table in BigQuery. Contacts can have no orders or multiple orders. Following BigQuery best practice, this table is denormalized and all the orders for a client are in an array of structs. It looks like this:
Fields Examples:
+-------+------------+-------------+-----------+
| Row # | Contact_Id | Orders.date | Orders.id |
+-------+------------+-------------+-----------+
|- 1 | 23 | 2019-02-05 | CB1 |
| | | 2020-03-02 | CB293 |
|- 2 | 2321 | - | - |
|- 3 | 77 | 2010-09-03 | AX3 |
+-------+------------+-------------+-----------+
The issue is when I want to use this table as a data source in Data Studio.
For instance, if I build a table with Contact_Id as a dimension, everything is fine and I can see all my contacts. However, if I add any dimension from the Orders struct, contacts with no orders are no longer displayed. For instance, all info for Contact_Id 2321 is removed from the table.
Have you found any workaround to visualize these empty arrays (for instance, as null values)?
The only solution I've found is to build an intermediary table with the orders unnested.
The way I've just discovered to work around this is to add an extra field in my DS-> BQ connector:
ARRAY_LENGTH(fields.orders) AS numberoforders
This will return zero if the array is empty - you can then create calculated fields within DataStudio - using the "numberoforders" field to force values to NULL or zero.
You can fix this behaviour by slightly changing your query in the BigQuery connector.
Instead of doing this:
SELECT
Contact_id,
Orders
FROM myproject.mydataset.mytable
try this:
SELECT
Contact_id,
IF(ARRAY_LENGTH(Orders) > 0, Orders, [STRUCT(CAST(NULL AS DATE) AS date, CAST(NULL AS STRING) AS id)]) AS Orders
FROM myproject.mydataset.mytable
This way you are forcing your repeated field to have, at least, an array containing NULL values and hence Data Studio will represent those missing values.
Also, if you want to create new calculated fields using one of the nested fields, you should first check whether the value is NULL, to avoid filling in all the NULL values. For example, if you have a repeated and nested field which can be 1 or 0, and you want to create a calculated field that swaps the value, you should do:
IF(myfield.key IS NOT NULL, IF(myfield.key = 1, 0, 1), NULL)
Here you can see what happens if you check before swapping and if you don't:
Original value | No check | Check
1              | 0        | 0
0              | 1        | 1
NULL           | 1        | NULL
1              | 0        | 0
NULL           | 1        | NULL
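The same effect can be reproduced in a few lines of Python, as a rough analogy only: None stands in for NULL, and the two function names are made up for illustration. Without the check, a missing value fails the "equals 1" comparison and silently falls into the else branch.

```python
def swap_no_check(value):
    # Mirrors IF(myfield.key = 1, 0, 1): a missing value is not equal to 1,
    # so it ends up in the else branch and becomes 1.
    return 0 if value == 1 else 1

def swap_with_check(value):
    # Mirrors IF(myfield.key IS NOT NULL, IF(myfield.key = 1, 0, 1), NULL).
    return None if value is None else (0 if value == 1 else 1)

values = [1, 0, None, 1, None]
no_check = [swap_no_check(v) for v in values]
checked = [swap_with_check(v) for v in values]
print(no_check)
print(checked)
```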

Get total count and first 3 columns

I have the following SQL query:
SELECT TOP 3 accounts.username
,COUNT(accounts.username) AS count
FROM relationships
JOIN accounts ON relationships.account = accounts.id
WHERE relationships.following = 4
AND relationships.account IN (
SELECT relationships.following
FROM relationships
WHERE relationships.account = 8
);
I want to return the total count of accounts.username and the first 3 accounts.username (in no particular order). Unfortunately accounts.username and COUNT(accounts.username) cannot coexist. The query works fine if I remove either one of them. I don't want to send the request twice with different select bodies. The count could reach 1000+, so I would prefer to calculate it in SQL rather than in code.
The current query returns the error Column 'accounts.username' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. which has not led me anywhere and this is different to other questions as I do not want to use the 'group by' clause. Is there a way to do this with FOR JSON AUTO?
The desired output could be:
+-------+----------+
| count | username |
+-------+----------+
| 1551 | simon1 |
| 1551 | simon2 |
| 1551 | simon3 |
+-------+----------+
or
+----------------------------------------------------------------+
| JSON_F52E2B61-18A1-11d1-B105-00805F49916B |
+----------------------------------------------------------------+
| [{"count": 1551, "usernames": ["simon1", "simon2", "simon3"]}] |
+----------------------------------------------------------------+
If you want to display the total count of rows that satisfy the filter conditions (and where username is not null) in an additional column in your resultset, then you could use window functions:
SELECT TOP 3
a.username,
COUNT(a.username) OVER() AS cnt
FROM relationships r
JOIN accounts a ON r.account = a.id
WHERE
r.following = 4
AND EXISTS (
SELECT 1 FROM relationships r1 WHERE r1.account = 8 AND r1.following = r.account
)
;
Side notes:
if username is not nullable, use COUNT(*) rather than COUNT(a.username): this is more efficient since it does not require the database to check every value for nullity
table aliases make the query easier to write, read and maintain
I usually prefer EXISTS over IN (but here this is mostly a matter of taste, as both techniques should work fine for your use case)
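As a quick way to see the window function at work, here is a small self-contained sketch using Python's built-in sqlite3 module. SQLite is an assumption made only so the example runs anywhere: it uses LIMIT 3 where SQL Server uses TOP 3, window functions need SQLite 3.25 or later, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE relationships (account INTEGER, following INTEGER);
    INSERT INTO accounts VALUES
        (1, 'simon1'), (2, 'simon2'), (3, 'simon3'), (5, 'simon5'), (6, 'simon6');
    -- account 8 follows accounts 1, 2, 3 and 5
    INSERT INTO relationships VALUES (8, 1), (8, 2), (8, 3), (8, 5);
    -- accounts 1, 2, 3, 5 and 6 all follow account 4
    INSERT INTO relationships VALUES (1, 4), (2, 4), (3, 4), (5, 4), (6, 4);
""")

rows = conn.execute("""
    SELECT a.username,
           COUNT(*) OVER () AS cnt   -- counted over all filtered rows, before LIMIT
    FROM relationships r
    JOIN accounts a ON r.account = a.id
    WHERE r.following = 4
      AND EXISTS (SELECT 1 FROM relationships r1
                  WHERE r1.account = 8 AND r1.following = r.account)
    LIMIT 3
""").fetchall()
print(rows)
```

Only three usernames come back, but every row carries the full count of four matching followers, because the window is evaluated over the filtered result before the row limit is applied.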

SQL group related rows in a list

I'm a bit stuck with this...
I have items table:
id | name
1 | item 1
2 | item 2
3 | item 3
4 | item 4
and related items table:
id | item_id | related_item_id
2 | 1 | 2
3 | 1 | 4
so this means that item 1 is related to items 2 and 4.
Now I'm trying to display these in a list where related items follow always the main item they are related to:
item 1
item 2
item 4
item 3
Then I can visually show that these items 2 and 4 are related to item one and draw something like:
item 1
-- item 2
-- item 4
item 3
To be honest, I haven't got any ideas myself. I guess I could query for items which are not related to any other item to get a list of "parent items", and then query the relations separately in a script loop. This is definitely not the sexiest solution...
I am assuming that this question is about ordering the items list, without duplicates. That is, a given item does not have more than one parent (which I ask in a comment).
If so, you can do this with a left outer join and cleverness in the order by.
select coalesce(r.related_item_id, i.id) as item_id
from items i left join
related r
on i.id = r.related_item_id
order by coalesce(r.item_id, i.id),
(r.related_item_id is null) desc;
The left outer join identifies parents because they will not have any rows that match. If so, the coalesce() finds them and uses the item id.
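The ordering trick can be checked against the question's sample data with Python's built-in sqlite3 module. This is only a sketch: SQLite is assumed for convenience, the table name related follows the query above, and an extra i.id sort key is added so that siblings come out in a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE related (id INTEGER PRIMARY KEY, item_id INTEGER, related_item_id INTEGER);
    INSERT INTO items VALUES (1, 'item 1'), (2, 'item 2'), (3, 'item 3'), (4, 'item 4');
    INSERT INTO related VALUES (2, 1, 2), (3, 1, 4);
""")

ordered = conn.execute("""
    SELECT i.name,
           r.item_id IS NOT NULL AS is_child
    FROM items i
    LEFT JOIN related r ON i.id = r.related_item_id
    ORDER BY COALESCE(r.item_id, i.id),          -- group each child with its parent
             (r.related_item_id IS NULL) DESC,   -- parent first within the group
             i.id                                -- stable order among siblings
""").fetchall()

for name, is_child in ordered:
    print(("-- " if is_child else "") + name)
```

The is_child flag is what the display layer can use to draw the indentation shown in the question.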
In my opinion, rather than implementing this logic in a query, you should move it to your application code.
Assuming that the item_ids are sequential, you can find the largest item_id and then, in a loop,
look up the related_item_id values for each item_id and build a convenient data structure out of them.
This functionality falls under the category of hierarchical queries. In Oracle it's handled by the CONNECT BY clause; I'm not sure about MySQL, but you can search for "hierarchical queries mysql" to find the answer.

Creating new table from data of other tables

I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields,
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Given a person's "Name", I would like to return a table of "Weight" measurements for each of their children. I.e., if John has 3 children named Alice, Bob and Carol, then I want a table like this
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
d.value
from data d,
persons child,
persons parent,
attributes a
where parent.name = 'John'
and child.parent_id = parent.person_id
and d.person_id = child.person_id -- tie each measurement to a child
and d.attribute_id = a.attribute_id
and a.attribute_type = 'Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to select the attributes to be between a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?
1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
If you can manage to do it in SQL it still won't be enough info to create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can probably just use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least.
2) You can join on attributes once for each attribute type you need, putting the type filter in the ON clause:
SELECT [...] FROM data AS d
JOIN persons AS p ON d.person_id = p.person_id
LEFT JOIN attributes AS weight ON d.attribute_id = weight.attribute_id
AND weight.attribute_type = 'Weight'
LEFT JOIN attributes AS height ON d.attribute_id = height.attribute_id
AND height.attribute_type = 'Height'
[...]
(The comma-separated FROM list in your original query is just an older way of writing [INNER] JOIN .. ON; it's the same thing, except that the attribute_type filter moves from WHERE into the ON clause.)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.
When you do a select, you need to specify the exact columns you want. In other words, you can't return the Nth child's name. I.e., this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same number of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then in a report remap it to how you want it displayed
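The remapping step can be sketched in a few lines of Python, assuming the query returns long-format rows of (date, child name, value). The function name pivot and the sample measurements are invented for illustration; missing measurements simply become None, which also shows what happens when the children's dates do not line up.

```python
def pivot(rows):
    """Turn (date, name, value) rows into a header plus one row per date,
    with one column per child and None where a child has no measurement."""
    names = sorted({name for _, name, _ in rows})
    by_date = {}
    for date, name, value in rows:
        by_date.setdefault(date, {})[name] = value
    header = ["date"] + names
    table = [[date] + [by_date[date].get(n) for n in names]
             for date in sorted(by_date)]
    return header, table

# Hypothetical long-format result of the weight query.
rows = [
    ("2010-01-02", "Alice", 30.0),
    ("2010-01-02", "Bob", 12.5),
    ("2010-01-03", "Alice", 30.2),
    ("2010-01-04", "Alice", 30.4),
    ("2010-01-04", "Carol", 8.1),
]
header, table = pivot(rows)
print(header)
for line in table:
    print(line)
```

The column list is discovered from the data at run time, which is exactly the part a plain SELECT cannot do because its columns must be fixed when the query is written.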