Postgres: Find matching array or array that contains given values - sql

I'm using postgres 9.5 and have a table that looks something like this:
+----+--------------+-------------------------------+
| id | string_field | array_field                   |
+----+--------------+-------------------------------+
|  1 | string a     | [apple, orange, banana, pear] |
|  2 | string b     | [apple, orange, banana]       |
|  3 | string c     | [apple, orange]               |
|  4 | string d     | [apple, pear]                 |
|  5 | string e     | [orange, apple]               |
+----+--------------+-------------------------------+
Is it possible to query the DB for rows where array_field is, or contains [apple, orange, banana]? The results should return rows with id 1 and 2.

Try something like this:
where array_field @> ARRAY['apple', 'orange', 'banana']::varchar[]
See the Postgres documentation on array functions and operators.
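Put together as a complete query, a minimal sketch might look like this (the table name my_table is only a placeholder, since the question doesn't name the table):
SELECT id, string_field, array_field
FROM my_table  -- placeholder table name
WHERE array_field @> ARRAY['apple', 'orange', 'banana']::varchar[];
-- @> is the array "contains" operator, so an exact match also qualifies;
-- for the sample data this returns rows 1 and 2.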

Related

PostgreSQL: Return value with case if text is found in varchar array

Imagine the data:
| id | item     | category                  | basket                             |
|----|----------|---------------------------|------------------------------------|
| 1  | Banana   | {"Fruit"}                 | {"Veggie","Health"}                |
| 2  | Carrot   | {"Veggie"}                | {"Health","Beauty","Art"}          |
| 3  | Banana   | {"Fruit","Health"}        | {"Beauty","Art","Veggie","Health"} |
| 4  | Potato   | {"Beauty","Veggie","Art"} | {"Beauty","Veggie"}                |
| 5  | Lipstick | {"Fruit"}                 | {"Veggie", "Health", "Beauty"}     |
I would like to obtain:
| id | item     | category                  | basket                             | include_item |
|----|----------|---------------------------|------------------------------------|--------------|
| 1  | Banana   | {"Fruit"}                 | {"Veggie","Health"}                | add_fruit    |
| 2  | Carrot   | {"Veggie"}                | {"Health","Beauty","Art"}          | add_veggie   |
| 3  | Banana   | {"Fruit","Health"}        | {"Beauty","Art","Veggie","Health"} | add_fruit    |
| 4  | Potato   | {"Beauty","Veggie","Art"} | {"Beauty","Veggie"}                | add_veggie   |
| 5  | Lipstick | {"Fruit"}                 | {"Veggie", "Health", "Beauty"}     | do_not_add   |
I attempted the code:
select *,
case
-- when category::char like '%Fruit%' and item = 'Banana' then 'add_fruit'
-- when 'Vegetable'=any(category) and item in ('Potato', 'Carrot') then 'add_veggie'
else 'do_not_add'
end as include_item
from my_table
Neither of the commented options worked. How should I adjust the code to meet both criteria in the CASE expression?
Both category and basket are of type character varying[]
You could normalize your table so that you don't need string comparisons at all. With the current array columns, casting them to text makes the LIKE checks work:
select *,
       case
         when category::text like '%Fruit%' and item = 'Banana' then 'add_fruit'
         when category::text like '%Veggie%' and item in ('Potato', 'Carrot') then 'add_veggie'
         else 'do_not_add'
       end as include_item
from ordertab
id | item | category | basket | include_item
-: | :------- | :------------------------ | :--------------------------------- | :-----------
1 | Banana | {"Fruit"} | {"Veggie","Health"} | add_fruit
2 | Carrot | {"Veggie"} | {"Health","Beauty","Art"} | add_veggie
3 | Banana | {"Fruit","Health"} | {"Beauty","Art","Veggie","Health"} | add_fruit
4 | Potato | {"Beauty","Veggie","Art"} | {"Beauty","Veggie"} | add_veggie
5 | Lipstick | {"Fruit"} | {"Veggie", "Health", "Beauty"} | do_not_add
db<>fiddle here
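Since category is already an array, another option is the = ANY operator from the second commented attempt in the question; it just needs to match the value actually stored in the data ('Veggie', not 'Vegetable'). A sketch:
select *,
       case
         when 'Fruit' = any(category) and item = 'Banana' then 'add_fruit'
         when 'Veggie' = any(category) and item in ('Potato', 'Carrot') then 'add_veggie'
         else 'do_not_add'
       end as include_item
from ordertab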

SELECT 1 ID and all belonging elements

I'm trying to create a JSON select query that gives me back the result in the following way:
one row contains one main_message_id and all of its belonging messages (like the table below). The JSON format is not a requirement; if it works with other methods, that is fine.
I store the data as like this:
+-----------------+---------+----------------+
| main_message_id | message | sub_message_id |
+-----------------+---------+----------------+
| 1               | test 1  | 1              |
| 1               | test 2  | 2              |
| 1               | test 3  | 3              |
| 2               | test 4  | 4              |
| 2               | test 5  | 5              |
| 3               | test 6  | 6              |
+-----------------+---------+----------------+
I would like to create a query which gives me back the data like this:
+-----------------+------------------------+
| main_message_id | message                |
+-----------------+------------------------+
| 1               | {test1}{test2}{test3}  |
| 2               | {test4}{test5}         |
| 3               | {test6}                |
+-----------------+------------------------+
You can use json_agg() for that:
select main_message_id, json_agg(message) as messages
from the_table
group by main_message_id;
Note that {test1}{test2}{test3} is invalid JSON; the above will return a valid JSON array instead, e.g. ["test1", "test2", "test3"].
If you just want a comma separated list, use string_agg():
select main_message_id, string_agg(message, ', ') as messages
from the_table
group by main_message_id;
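If the order of the aggregated messages matters, both aggregates accept an ORDER BY inside the call; a sketch using the sub_message_id column from the question's table:
select main_message_id, json_agg(message order by sub_message_id) as messages
from the_table
group by main_message_id;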

Group column of pyspark dataframe by taking only unique values from two columns

I want to group a column based on unique values from two columns of a pyspark dataframe. The output should be such that once a value has been used for the groupby, it should not be repeated if it also appears in the other column.
|--------|-----------|
| fruit  | fruits    |
|--------|-----------|
| apple  | banana    |
| banana | apple     |
| apple  | mango     |
| orange | guava     |
| apple  | pineapple |
| mango  | apple     |
| banana | mango     |
| banana | pineapple |
|--------|-----------|
I have tried grouping by a single column, but this needs to be modified or some other logic is required.
df9=final_main.groupBy('fruit').agg(collect_list('fruits').alias('values'))
I am getting the following output from the above query:
|--------|--------------------------------|
| fruit  | values                         |
|--------|--------------------------------|
| apple  | ['banana','mango','pineapple'] |
| banana | ['apple']                      |
| orange | ['guava']                      |
| mango  | ['apple']                      |
|--------|--------------------------------|
But I want the following output:
|--------|--------------------------------|
| fruit  | values                         |
|--------|--------------------------------|
| apple  | ['banana','mango','pineapple'] |
| orange | ['guava']                      |
|--------|--------------------------------|
This looks like a connected components problem. There are a couple ways you can go about doing this.
1. GraphFrames
You can use the GraphFrames package. Each row of your dataframe defines an edge, and you can just create a graph using df as edges and a dataframe of all the distinct fruits as vertices. Then call the connectedComponents method. You can then manipulate the output to get what you want.
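A rough sketch of that route (assuming the graphframes package is installed and df is the dataframe with the fruit and fruits columns; all other names are illustrative):
from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# connectedComponents requires a checkpoint directory
spark.sparkContext.setCheckpointDir('/tmp/graphframes-checkpoints')

# every distinct fruit becomes a vertex, every row of df becomes an edge
vertices = (df.select(F.col('fruit').alias('id'))
            .union(df.select(F.col('fruits').alias('id')))
            .distinct())
edges = df.select(F.col('fruit').alias('src'), F.col('fruits').alias('dst'))

components = GraphFrame(vertices, edges).connectedComponents()  # columns: id, component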
2. Just Pyspark
The second method is a bit of a hack. Create a "hash" for each row like
hashed_df = df.withColumn('hash', F.sort_array(F.array(F.col('fruit'), F.col('fruits'))))
Drop all non-distinct rows for that column
distinct_df = hashed_df.dropDuplicates(['hash'])
Split up the items again
revert_df = distinct_df.withColumn('fruit', F.col('hash')[0]) \
.withColumn('fruits', F.col('hash')[1])
Group by the first column
grouped_df = revert_df.groupBy('fruit').agg(F.collect_list('fruits').alias('group'))
You might need to "stringify" your hash using F.concat_ws if Pyspark complains, but the idea is the same.
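Put together end to end, here is a runnable sketch of the approach above (the sample rows are taken from the question):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('apple', 'banana'), ('banana', 'apple'), ('apple', 'mango'),
     ('orange', 'guava'), ('apple', 'pineapple'), ('mango', 'apple'),
     ('banana', 'mango'), ('banana', 'pineapple')],
    ['fruit', 'fruits'])

hashed_df = df.withColumn('hash', F.sort_array(F.array(F.col('fruit'), F.col('fruits'))))
distinct_df = hashed_df.dropDuplicates(['hash'])
revert_df = (distinct_df
             .withColumn('fruit', F.col('hash')[0])
             .withColumn('fruits', F.col('hash')[1]))
grouped_df = revert_df.groupBy('fruit').agg(F.collect_list('fruits').alias('group'))
grouped_df.show(truncate=False)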

How to query multiple rows into a single-line string result?

Let's say we have this table named phrases and it has contents like so:
phrases
+----+--------+
| id | phrase |
+----+--------+
|  1 | the    |
|  2 | quick  |
| .. | ...    |
|  8 | lazy   |
|  9 | dog    |
+----+--------+
Desired result
+---------------------------------------------+
| sentence |
+---------------------------------------------+
| the quick brown fox jumps over the lazy dog |
+---------------------------------------------+
What should be the query statement such that it would result into a single result string like the above?
You can use STRING_AGG with a space as the delimiter, and probably with a sort order:
SELECT STRING_AGG(phrase, ' ' ORDER BY id) as sentence
FROM phrases;
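A small self-contained sketch; the rows that the question elides with ".." are filled in from the desired sentence:
CREATE TABLE phrases (id int, phrase text);
INSERT INTO phrases (id, phrase) VALUES
  (1, 'the'), (2, 'quick'), (3, 'brown'), (4, 'fox'), (5, 'jumps'),
  (6, 'over'), (7, 'the'), (8, 'lazy'), (9, 'dog');

SELECT STRING_AGG(phrase, ' ' ORDER BY id) AS sentence
FROM phrases;
-- the quick brown fox jumps over the lazy dog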

When Querying Many-To-Many Relationship in SQL, Return Multiple Connections As an Array In Single Row?

Basically, I have 3 tables, titles, providers, and provider_titles.
Let's say they look like this:
| title_id | title_name      |
|----------|-----------------|
| 1        | San Andreas     |
| 2        | Human Centipede |
| 3        | Zoolander 2     |
| 4        | Hot Pursuit     |

| provider_id | provider_name |
|-------------|---------------|
| 1           | Hulu          |
| 2           | Netflix       |
| 3           | Amazon_Prime  |
| 4           | HBO_GO        |

| provider_id | title_id |
|-------------|----------|
| 1           | 1        |
| 1           | 2        |
| 2           | 1        |
| 3           | 1        |
| 3           | 3        |
| 4           | 4        |
So, clearly there are titles with multiple providers, yeah? Typical many-to-many so far.
So what I'm doing to query it is with a JOIN like the following:
SELECT *
FROM provider_title
JOIN provider ON provider_title.provider_id = provider.provider_id
JOIN title ON title.title_id = provider_title.title_id
WHERE provider.provider_name IN ('Netflix', 'HBO_GO', 'Hulu', 'Amazon_Prime')
Ok, now to the actual issue. I don't want repeated title names back, but I do want all of the providers associated with the title. Let me explain with another table. Here is what I am getting back with the current query, as is:
| provider_id | provider_name | title_id | title_name      |
|-------------|---------------|----------|-----------------|
| 1           | Hulu          | 1        | San Andreas     |
| 1           | Hulu          | 2        | Human Centipede |
| 2           | Netflix       | 1        | San Andreas     |
| 3           | Amazon_Prime  | 1        | San Andreas     |
| 3           | Amazon_Prime  | 3        | Zoolander 2     |
| 4           | HBO_GO        | 4        | Hot Pursuit     |
But what I really want would be something more like
| provider_id | provider_name                 | title_id | title_name  |
|-------------|-------------------------------|----------|-------------|
| [1, 2, 3]   | [Hulu, Netflix, Amazon_Prime] | 1        | San Andreas |
Meaning I only want distinct titles back, but I still want each title's associated providers. Is this only possible to do post-sql query with logic iterating through the returned rows?
Depending on your database engine, there may be an aggregation function to help achieve this.
For example, this SQLfiddle demonstrates the postgres array_agg function:
SELECT t.title_id,
       t.title_name,
       array_agg(p.provider_id),
       array_agg(p.provider_name)
FROM provider_title AS pt
JOIN provider AS p ON pt.provider_id = p.provider_id
JOIN title AS t ON t.title_id = pt.title_id
GROUP BY t.title_id,
         t.title_name
Other database engines have equivalents. For example:
mySQL has group_concat
Oracle has listagg
sqlite has group_concat (as well!)
If your database isn't covered by the above, you can google '[Your database engine] aggregate comma delimited string'
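For instance, a sketch of what the MySQL variant of the query above might look like with GROUP_CONCAT (which returns a comma separated string rather than an array):
SELECT t.title_id,
       t.title_name,
       GROUP_CONCAT(p.provider_id)   AS provider_ids,
       GROUP_CONCAT(p.provider_name) AS provider_names
FROM provider_title AS pt
JOIN provider AS p ON pt.provider_id = p.provider_id
JOIN title AS t ON t.title_id = pt.title_id
GROUP BY t.title_id, t.title_name;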