I would like to select all rows from my database where a row contains at least two terms from a set of words (an array).
As an example:
I have the following array:
'{"test", "god", "safe", "name", "hello", "pray", "stay", "word", "peopl", "rain", "lord", "make", "life", "hope", "whatever", "makes", "strong", "stop", "give", "television"}'
and I have a tweet dataset stored in the database. So I would like to know which tweets (column name: tweet.content) contain at least two of the words.
My current code looks like this (but of course it only selects one word...):
CREATE OR REPLACE VIEW tweet_selection AS
SELECT tweet.id, tweet.content, tweet.username, tweet.geometry
FROM tweet
WHERE tweet.topic_indicator > 0.15::double precision
AND string_to_array(lower(tweet.content), ' ') = ANY(SELECT '{"test", "god", "safe", "name", "hello", "pray", "stay", "word", "peopl", "rain", "lord", "make", "life", "hope", "whatever", "makes", "strong", "stop", "give", "television"}'::text[])
So the last line needs to be adjusted somehow, but I have no idea how - maybe with an inner join?
I have the words also stored with a unique id in a different table.
A friend of mine recommended getting a count for each row, but I have no write access for adding an additional column to the original tables.
Background:
I am storing my tweets in a Postgres database and I applied an LDA (Latent Dirichlet Allocation) to the dataset. Now I have the generated topics and the words associated with each topic (20 topics and 25 words).
One way that works is a correlated subquery that counts how many of the words appear in each tweet:
select DISTINCT ON (tweet.id) tweet.id, tweet.content, tweet.username, tweet.geometry
from tweet
where
tweet.topic_indicator > 0.15::double precision
and (
select count(distinct word)
from
unnest(
array['test', 'god', 'safe', 'name', 'hello', 'pray', 'stay', 'word', 'peopl', 'rain', 'lord', 'make', 'life', 'hope', 'whatever', 'makes', 'strong', 'stop', 'give', 'television']::text[]
) s(word)
inner join
regexp_split_to_table(lower(tweet.content), ' ') v (word) using (word)
) >= 2
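Since the words are also stored with a unique id in a separate table, the literal array could be swapped for a join against that table. A minimal sketch, assuming a hypothetical table name topic_words(id, word) - adjust it to your actual words table:
-- Hypothetical variant: read the word list from a table instead of a literal array.
select distinct on (tweet.id) tweet.id, tweet.content, tweet.username, tweet.geometry
from tweet
where tweet.topic_indicator > 0.15::double precision
  and (
    select count(distinct w.word)
    from topic_words w
    inner join regexp_split_to_table(lower(tweet.content), ' ') v(word) on v.word = w.word
  ) >= 2;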
Related
I have an S3 bucket with 500,000+ JSON records, e.g.:
{
"userId": "00000000001",
"profile": {
"created": 1539469486,
"userId": "00000000001",
"primaryApplicant": {
"totalSavings": 65000,
"incomes": [
{ "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
]
}
}
}
I created a new table in Athena
CREATE EXTERNAL TABLE profiles (
userId string,
profile struct<
created:int,
userId:string,
primaryApplicant:struct<
totalSavings:int,
incomes:array<struct<amount:int,incomeType:string,frequency:string>>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'
I am interested in the incomeTypes, e.g. "SALARY", "PENSIONS", "OTHER", etc., and ran this query, changing jsonData.incometype each time:
SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype='SALARY'
This worked fine with CROSS JOIN UNNEST, which flattened the incomes array so that the data example above would span two rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase, e.g. a row looked like this:
{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more "SALARY" entries, e.g.
"incomes": [
{ "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?
Do I have to iterate over the array?
What should the result look like?
UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straightforward:
SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1
This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.
Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.
You can combine filter with cardinality to keep only the array elements having incomeType = 'SALARY' and then check whether there is more than one.
This can be further improved so that the intermediate array is not materialized, by using reduce (see examples in the docs; I'm not quoting them here, since they do not directly answer your question).
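For reference, a minimal sketch of that reduce variant (same table and column names as above, untested):
-- Sketch: count SALARY incomes per user without building an intermediate array.
SELECT COUNT(*)
FROM sampledb.profiles
WHERE REDUCE(
        profile.primaryApplicant.incomes,
        0,
        (salary_count, income) -> IF(income.incomeType = 'SALARY', salary_count + 1, salary_count),
        salary_count -> salary_count
      ) > 1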
I'm looking to query a table for a distinct list of values in a given JSON column.
In the code snippet below, the Survey_Results table has 3 columns:
Name, Email, and Payload. Payload is the JSON object I want to query.
Table Name: Survey_Results
Name Email Payload
Ying SmartStuff@gmail.com [
{"fieldName":"Product Name", "Value":"Calculator"},
{"fieldName":"Product Price", "Value":"$54.99"}
]
Kendrick MrTexas@gmail.com [
{"fieldName":"Food Name", "Value":"Texas Toast"},
{"fieldName":"Food Taste", "Value":"Delicious"}
]
Andy WhereTheBass@gmail.com [
{"fieldName":"Band Name", "Value":"MetalHeads"},
{"fieldName":"Valid Member", "Value":"TRUE"}
]
I am looking for a unique list of all fieldNames mentioned.
The ideal answer would be query giving me a list containing "Product Name", "Product Price", "Food Name", "Food Taste", "Band Name", and "Valid Member".
Is something like this possible in Postgres?
Use json_array_elements() in a lateral join:
select distinct value->>'fieldName' as field_name
from survey_results
cross join json_array_elements(payload)
field_name
---------------
Product Name
Valid Member
Food Taste
Product Price
Food Name
Band Name
(6 rows)
How to find distinct Food Name values?
select distinct value->>'Value' as food_name
from survey_results
cross join json_array_elements(payload)
where value->>'fieldName' = 'Food Name'
food_name
-------------
Texas Toast
(1 row)
Db<>fiddle.
Important. Note that the json structure is illogical and thus unnecessarily large and complex. Instead of
[
{"fieldName":"Product Name", "Value":"Calculator"},
{"fieldName":"Product Price", "Value":"$54.99"}
]
use
{"Product Name": "Calculator", "Product Price": "$54.99"}
Open this db<>fiddle to see that proper json structure implies simpler and faster queries.
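For example, with the flat structure the first query above collapses to a single set-returning call (a sketch; assumes payload is still a json column as above):
-- With the flat structure, listing the distinct field names needs no lateral join at all.
select distinct json_object_keys(payload) as field_name
from survey_results;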
I have a table visitors(id, email, first_seen, sessions, etc.)
and another table trackings(id, visitor_id, field, value) that stores custom, user supplied data.
I want to query these and merge the visitor data columns and the trackings into a single column called data.
For example, say I have two trackings
(id: 3, visitor_id: 1, field: "orders_made", value: 2)
(id: 4, visitor_id: 1, field: "city", value: 'new york')
and a visitor
(id: 1, email: 'hello@gmail.com', sessions: 5)
I want the result to be in the form of
(id: 1, data: {email: 'hello@gmail.com', sessions: 5, orders_made: 2, city: 'new york'})
What's the best way to accomplish this using Postgres 9.4?
I'll start by saying trackings is a bad idea. If you don't have many things to track, just store json instead; that's what it's made for. If you have a lot of things to track, you'll become very unhappy with the performance of trackings over time.
First you need a json object from trackings:
-- WARNING: Behavior of this with duplicate field names is undefined!
SELECT json_object(array_agg(field), array_agg(value)) FROM trackings WHERE ...
Getting json for visitors is relatively easy:
SELECT row_to_json(v) FROM (SELECT email, sessions FROM visitors WHERE ...) v;
I recommend you do not just squash all those together. What happens if you have a field called email? Instead:
SELECT row_to_json(t)
FROM (
    SELECT
        (
            SELECT row_to_json(v) FROM (SELECT email, sessions FROM visitors WHERE ...) v
        ) AS visitor
        , (
            SELECT json_object(array_agg(field), array_agg(value)) FROM trackings WHERE ...
        ) AS trackings
) t;
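If you do still want the single flat data column from the question despite the name-collision caveat, a rough sketch for 9.4 (assuming the table and column names from the question) could look like this:
-- Sketch only: merges visitor columns and trackings rows into one json object per visitor.
-- Assumes visitors(id, email, sessions) and trackings(visitor_id, field, value); colliding keys are not deduplicated.
SELECT v.id, json_object_agg(kv.key, kv.value) AS data
FROM visitors v
CROSS JOIN LATERAL (
    SELECT key, value
    FROM json_each_text(json_build_object('email', v.email, 'sessions', v.sessions))
    UNION ALL
    SELECT t.field, t.value::text
    FROM trackings t
    WHERE t.visitor_id = v.id
) kv
GROUP BY v.id;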
I'm trying to find rows with duplicate fields in an array of structs within a Google BigQuery table, using the new Standard SQL. In the (simplified) table, each row looks a bit like this:
{
"Session": "abc123",
"Information" [
{
"Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
},
{
"Identifier": "1c62813f-7ec4-4968-b18b-d1eb8f4d9d26"
},
{
"Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
}
]
}
My end goal is to display the rows that have Information entities with duplicate Identifier values present. However, most of the queries I attempt get an error message of the following form:
Cannot access field Identifier on a value with type ARRAY<STRUCT<Identifier STRING>>
Is there a way to work with the data inside of a STRUCT within an ARRAY?
Here's my first attempt at a query:
SELECT
Session,
Information
FROM
`events.myevents`
WHERE
COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
LIMIT
1000
And another using a subquery:
SELECT
Session,
Information
FROM (
SELECT
Session,
Information,
COUNT(DISTINCT Information.Identifier) AS info_count_distinct,
ARRAY_LENGTH(Information) AS info_count
FROM
`events.myevents`
WHERE
COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
LIMIT
1000)
WHERE
info_count != info_count_distinct
Try below
SELECT Session, Identifier, COUNT(1) AS dups
FROM `events.myevents`, UNNEST(Information)
GROUP BY Session, Identifier
HAVING dups > 1
ORDER BY Session
Should give you what you expect, plus the number of dups.
Like below (example)
Session Identifier dups
abc123 e8d971a4-ef33-4ea1-8627-f1213e4c67dc 2
abc345 1c62813f-7ec4-4968-b18b-d1eb8f4d9d26 3
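And if the end goal is the original rows rather than the per-identifier counts, one possible sketch (same table, untested) compares the distinct count against the array length per row:
-- Sketch: keep rows where at least one Identifier repeats within Information.
SELECT Session, Information
FROM `events.myevents`
WHERE (SELECT COUNT(DISTINCT Identifier) FROM UNNEST(Information)) < ARRAY_LENGTH(Information)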
I have two tables, items and itemcontent.
items: |ID|menu|img|
itemcontent: |ID|parent|title|content|
itemcontent is paired to items via parent and holds the title and content.
I want to search all the items and also print out those records which do not have a title present in the itemcontent table,
whereby the missing titles will be printed as "Empty".
So the output would look something like:
title: test1 and ID: items.ID=1
title: Empty and ID: items.ID=2
title: Empty and ID: items.ID=3
title: test2 and ID: items.ID=4
title: Empty and ID: items.ID=5
etc...
I tried the following and then some but to no avail:
SELECT items.*, itemcontent.title, itemcontent.content
FROM items, itemcontent
WHERE itemcontent.title LIKE '%$search%'
AND itemcontent.parent = items.ID
order by title ASC
A little help would be much appreciated
Since you want all the rows from items whether or not they have a match in itemcontent, plus a field from itemcontent when there is a match, you need to use an OUTER JOIN:
SELECT items.*, COALESCE(itemcontent.title, 'empty'), itemcontent.content
FROM items LEFT OUTER JOIN itemcontent ON itemcontent.parent = items.ID
WHERE (itemcontent.title LIKE '%$search%' OR itemcontent.title IS NULL)
ORDER BY items.ID, itemcontent.title ASC
There are small differences among SQL dialects (for instance, not all versions have COALESCE) so if you want a more precise answer indicate which product you're using.
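For instance, on MySQL the same query could be written with IFNULL (a sketch, assuming MySQL; recent MySQL versions also support COALESCE):
-- MySQL-flavoured variant of the same LEFT JOIN.
SELECT items.*, IFNULL(itemcontent.title, 'Empty') AS title, itemcontent.content
FROM items
LEFT OUTER JOIN itemcontent ON itemcontent.parent = items.ID
WHERE (itemcontent.title LIKE '%$search%' OR itemcontent.title IS NULL)
ORDER BY items.ID, itemcontent.title ASC;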
Just to be sure: you might want to ORDER BY itemcontent.title and not just title, or select itemcontent.title AS title. Do you have a field called title in the items table?