REGEX in Snowflake/SQL for Serialized Ruby Hash values

This is a tough one (for me at least). I need to obtain the words after "host_goal_to_have" and "host_goal_to_be_able" and there's not really a strong pattern to signify the end.
For example, the result for "host_goal_to_have" should be "established a thriving community of actors supporting each other and turning passion into income!", and for "host_goal_to_be_able" it should be "reach people and unlock the artist inside them".
Sample:
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
signup_recaptcha:
hide_welcome_checklist: true
seen_spaces_migration_welcome_intro: true
host_goal_to_have: established a thriving community of actors supporting each other
and turning passion into income!
host_goals_updated_at: '2023-02-11T23:00:52.441Z'
host_goal_to_be_able: reach people and unlock the artist inside them
seen_event_form: true

So we can start with the same data:
with data as (
select * from values
($$--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
signup_recaptcha:
hide_welcome_checklist: true
seen_spaces_migration_welcome_intro: true
host_goal_to_have: established a thriving community of actors supporting each other
and turning passion into income!
host_goals_updated_at: '2023-02-11T23:00:52.441Z'
host_goal_to_be_able: reach people and unlock the artist inside them
seen_event_form: true$$)
)
Then we can split it into lines, find the lines that start with a token matching ^[a-z_]+: (assuming such tokens never appear at the start of a line inside a value), and separate the "token: value" lines from the plain continuation lines:
select d.column1
,s.*
,RLIKE(s.value, '^[a-z_]+:.*$') as r1
,regexp_substr(s.value, '^[a-z_]+:(.*)$', 1,1,'c',1) as r2
,iff(r1, r2, s.value) as ss
from data as d,
table(split_to_table(d.column1, '\n')) as s
After this we can chain the values back together:
with data as (
select * from values
($$--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
signup_recaptcha:
hide_welcome_checklist: true
seen_spaces_migration_welcome_intro: true
host_goal_to_have: established a thriving community of actors supporting each other
and turning passion into income!
host_goals_updated_at: '2023-02-11T23:00:52.441Z'
host_goal_to_be_able: reach people and unlock the artist inside them
seen_event_form: true$$)
), pass_1 as (
select d.column1
,s.seq
,s.index
,regexp_substr(s.value, '^[a-z_]+:') as t
,nvl(t, lag(t)ignore nulls over(partition by s.seq order by s.index)) as token
,iff(RLIKE(s.value, '^[a-z_]+:.*$'), regexp_substr(s.value, '^[a-z_]+:(.*)$', 1,1,'c',1), s.value) as ss
from data as d,
table(split_to_table(d.column1, '\n')) as s
)
select
seq,
token,
listagg(ss) within group (order by index) as val
from pass_1
where token is not null
group by 1,2
;
which gives:
SEQ | TOKEN | VAL
1 | seen_spaces_migration_welcome_intro: | true
1 | host_goals_updated_at: | '2023-02-11T23:00:52.441Z'
1 | signup_recaptcha: |
1 | host_goal_to_have: | established a thriving community of actors supporting each other and turning passion into income!
1 | host_goal_to_be_able: | reach people and unlock the artist inside them
1 | hide_welcome_checklist: | true
1 | seen_event_form: | true
which can be filtered via HAVING:
select
seq,
token,
trim(listagg(ss) within group (order by index)) as val
from pass_1
where token is not null
group by 1,2
having token in ('host_goal_to_have:', 'host_goal_to_be_able:')
SEQ | TOKEN | VAL
1 | host_goal_to_have: | established a thriving community of actors supporting each other and turning passion into income!
1 | host_goal_to_be_able: | reach people and unlock the artist inside them
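The same carry-the-last-token idea can be sketched outside the database. The following is a minimal Python illustration, not the Snowflake solution itself: the function name parse_flat_hash is hypothetical, the continuation line is assumed to be indented as it would be in YAML, and the parts of a value are joined with a single space rather than concatenated as LISTAGG does.

```python
import re

# Sample from the question. The indentation of the continuation line is an
# assumption; the logic below works with or without it.
SAMPLE = """--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
signup_recaptcha:
hide_welcome_checklist: true
seen_spaces_migration_welcome_intro: true
host_goal_to_have: established a thriving community of actors supporting each other
  and turning passion into income!
host_goals_updated_at: '2023-02-11T23:00:52.441Z'
host_goal_to_be_able: reach people and unlock the artist inside them
seen_event_form: true"""

# Same token pattern as the SQL: lowercase letters/underscores, then a colon.
KEY_LINE = re.compile(r"^([a-z_]+):(.*)$")


def parse_flat_hash(text):
    """Map each top-level key to its value, folding continuation lines in."""
    parts_by_key = {}
    current = None
    for line in text.split("\n"):
        match = KEY_LINE.match(line)
        if match:
            # A new "token: value" line starts a new key.
            current = match.group(1)
            parts_by_key[current] = [match.group(2).strip()]
        elif current is not None:
            # No token on this line: it continues the most recent key.
            parts_by_key[current].append(line.strip())
    return {key: " ".join(p for p in parts if p)
            for key, parts in parts_by_key.items()}


goals = parse_flat_hash(SAMPLE)
print(goals["host_goal_to_have"])
print(goals["host_goal_to_be_able"])
```

Like the SQL version, this relies on the assumption that no continuation line inside a value begins with something that looks like a token followed by a colon.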

How to query many-to-many relations based on logical expression with multiple parameters?

Considering the table with the following schema, I'm trying to create a dynamic query that accepts a logical expression with multiple variables, so I could filter employees that have skills only with specific technology or category names.
For example, I'm inputting the pseudo-expression (technology = 'aws cognito' OR technology = 'azure active directory') AND category = 'framework', where the first two variables are technologies and the last one is a category of technologies.
What could such a query look like, considering that the number of variables can differ and the expression may be nested several levels deep? How the query will be generated is another question; I'm having trouble with its actual structure.
I've tried querying a very simple case like that (schema name is skills):
SELECT *
FROM "skills"."employees" "e"
LEFT JOIN "skills"."skills" "s" ON "s"."employee_id"="e"."id"
LEFT JOIN "skills"."technologies" "t" ON "t"."id"="s"."technology_id"
WHERE ("t"."name" = 'aws cognito'
AND "t"."name" = 'azure active directory')
However, the query does not return any rows, even though my database has employees with skills in both of these technologies, so I expected it to return them. It works with the OR operator, but that was expected. I have a very bad feeling that I'm doing something fundamentally wrong, but, to be honest, I just cannot wrap my head around it.
Update: Sample data
Employees
id first_name
1 Andrew
2 James
Technologies
id name
1 aws cognito
2 azure active directory
Skills
id employee_id technology_id
1 1 1
2 1 2
3 2 1
Based on the data and on the logical expression technology = 'aws cognito' AND technology = 'azure active directory' I'm expecting the result to be only the employee with id 1 (Andrew).
For an arbitrary set of required skills you can use a grouping-and-counting method:
with skillset(tname) as(
values
('aws cognito'),
('azure active directory')
)
SELECT e.id, e.first_name
FROM "skills"."employees" e
JOIN "skills"."skills" s ON s.employee_id=e.id
JOIN "skills"."technologies" t ON t.id=s.technology_id
JOIN skillset st ON t.name = st.tname
GROUP BY e.id, e.first_name
HAVING count(*) = (select count(*) from skillset)
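The original WHERE clause returns nothing because each joined row carries exactly one technology name, so a single row can never equal both values at once; the condition has to be checked across the group of rows belonging to one employee. Below is a small runnable sketch of the grouping-and-counting idea using Python's sqlite3 module and the sample data from the question. The helper name employees_with_all is hypothetical, the schema prefix is dropped, and COUNT(DISTINCT ...) is used instead of COUNT(*) as a guard against duplicate skill rows, which is an assumption about the data rather than something the question states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE technologies (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE skills (id INTEGER PRIMARY KEY, employee_id INTEGER, technology_id INTEGER);
INSERT INTO employees VALUES (1, 'Andrew'), (2, 'James');
INSERT INTO technologies VALUES (1, 'aws cognito'), (2, 'azure active directory');
INSERT INTO skills VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")


def employees_with_all(conn, required):
    """Return employees that have every technology named in `required`."""
    required = sorted(set(required))          # count each requirement once
    placeholders = ", ".join("?" for _ in required)
    sql = f"""
        SELECT e.id, e.first_name
        FROM employees e
        JOIN skills s ON s.employee_id = e.id
        JOIN technologies t ON t.id = s.technology_id
        WHERE t.name IN ({placeholders})
        GROUP BY e.id, e.first_name
        HAVING COUNT(DISTINCT t.name) = ?     -- all required names matched
        ORDER BY e.id
    """
    return conn.execute(sql, [*required, len(required)]).fetchall()


matches = employees_with_all(conn, ["aws cognito", "azure active directory"])
print(matches)
```

With the sample data only Andrew has both skills, so he is the only row returned; asking for 'aws cognito' alone returns both employees.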

How many bitcoins were transferred from one wallet to another?

The problem is simple: I want to query how many BTC were transferred from Wallet A to Wallet B with as many hops as blocks in the blockchain.
Ex.
A transferred 1 BTC to C and 1 BTC to D.
C transferred 0.1 to B
D transferred 0.5 to E and 0.5 to F
E transferred 0.1 to B
Total 0.2 BTC transferred from A to B
I figure I could do this by using bigquery on the blockchain. The problem is that I do not know how to create a recursive query like that. My SQL skills tend to zero.
The cause is noble. I have a few addresses that were used in what proved to be a Ponzi scheme (1). I have another set of addresses that are being used in ANOTHER scheme, which I believe is another scam (2) laundering money from scheme 1.
I know who is the person behind scam 2.
If I prove that a great amount of BTCs from the first scam went to the wallets related to the second scam, it could be a strong indication that they are the same.
Note that I've said a great amount of BTCs. I know that some of BTCs may wind up at the wallets of scheme 2 by chance, but for the majority to end up there is not at all a coincidence.
Disclosure: I am NOT obtaining any financial benefits from this, I only intend to reveal this scammer.
Since you did not post a data structure your mileage may vary. Here is a hypothetical (I know zero about bitcoin data structures) bitcoin chain structure. Use a recursive CTE to create an anchor and self call. I am using Source and target below, however, they could be exchanged with bitcoin semantics.
DECLARE @T TABLE(ChainID INT, SourceID INT, TargetID INT, Amount INT)
INSERT @T VALUES
(1,100,300,1),
(2,900,800,1),
(1,100,400,1),
(2,800,700,1),
(1,300,200,1),
(1,400,500,1),
(2,700,600,1),
(1,500,600,1),
(1,500,200,1),
(2,600,500,1),
(2,500,400,1)
DECLARE @ChainID INT = 2
--Get the first source of a chain !If natural order, if there is a more suitable order field then use it!
DECLARE @StartID INT = (SELECT SourceID FROM (SELECT SourceID,RN=ROW_NUMBER() OVER (ORDER BY ChainID) FROM @T WHERE ChainID = @ChainID ) AS X WHERE RN=1)
;WITH RecursiveWalk AS
(
--Anchor
SELECT
SourceID,
TargetID = T.TargetID,
LevelID = 1
FROM
@T T
WHERE
T.SourceID = @StartID AND ChainID = @ChainID
UNION ALL
--Recursive bit
SELECT
T.SourceID,
TargetID = T.TargetID,
LevelID = LevelID + 1
FROM
@T T
INNER JOIN RecursiveWalk RW ON T.SourceID = RW.TargetID
WHERE
ChainID=@ChainID
)
)
SELECT
SourceID,
TargetID,
LevelID
FROM
RecursiveWalk
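To tie the recursive walk back to the A-to-B example in the question, here is a minimal sketch using Python's sqlite3 module. The transfers table, its columns and the helper name received_downstream are all hypothetical (real blockchain datasets are structured differently), and the measure is deliberately naive: it sums the full amount of every transfer into the end wallet that is reachable from the start wallet, without apportioning how much of it actually originated there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transfers (id INTEGER PRIMARY KEY, source TEXT, target TEXT, amount REAL);
INSERT INTO transfers (source, target, amount) VALUES
  ('A', 'C', 1.0), ('A', 'D', 1.0),
  ('C', 'B', 0.1),
  ('D', 'E', 0.5), ('D', 'F', 0.5),
  ('E', 'B', 0.1);
""")


def received_downstream(conn, start, end):
    """Sum the transfers into `end` that are reachable from `start`."""
    sql = """
        WITH RECURSIVE walk(id, source, target, amount) AS (
            -- Anchor: every transfer sent directly by the start wallet.
            SELECT id, source, target, amount FROM transfers WHERE source = ?
            UNION  -- UNION (not UNION ALL) visits each transfer once,
                   -- which also stops the walk from looping on cycles
            -- Recursive step: transfers sent by any wallet already reached.
            SELECT t.id, t.source, t.target, t.amount
            FROM transfers t
            JOIN walk w ON t.source = w.target
        )
        SELECT COALESCE(SUM(amount), 0.0) FROM walk WHERE target = ?
    """
    return conn.execute(sql, (start, end)).fetchone()[0]


total = received_downstream(conn, "A", "B")
print(round(total, 8))
```

For the example in the question this adds the 0.1 sent by C and the 0.1 sent by E. A more faithful analysis would need a rule for splitting mixed funds at each hop, which is outside what this sketch attempts.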

Sql for distinct record comparison

I am comparing a table to itself trying to determine whether an email in one record is being used in any one of four other columns in another record.
To make this easier, let's look at an example (simplified):
Name: Bob
Office Email: bob@aaa.com
Home Email: bob@home.com
Mobile Email: bobster@gmail.com
.
Name: Rob
Office Email: rob@bbb.com
Home Email: bob@home.com
Mobile Email: robert@gmail.com
Now I have a sql statement like this:
select c1.ContactId id1, c1.FullName Name1, 'Office Email 1' EmailType1, c1.EMailAddress1 Email,
c2.ContactId id2, c2.FullName Name2,
CASE c1.EmailAddress1
WHEN c2.EMailAddress1 THEN 'Office Email 1'
WHEN c2.Si_OfficeEmail2nd THEN 'Office Email 2'
WHEN c2.EMailAddress2 THEN 'Mobile Email'
WHEN c2.pc_hmemail THEN 'Home Email'
ELSE '?'
END EmailType2,
CASE c1.EmailAddress1
WHEN c2.EMailAddress1 THEN c2.EMailAddress1
WHEN c2.Si_OfficeEmail2nd THEN c2.Si_OfficeEmail2nd
WHEN c2.EMailAddress2 THEN c2.EMailAddress2
WHEN c2.pc_hmemail THEN c2.pc_hmemail
ELSE '?'
END DuplicateEmail
from Contact c1, Contact c2
where (
LTRIM(RTRIM(c1.EMailAddress1 )) = LTRIM(RTRIM(c2.EMailAddress1))
Or LTRIM(RTRIM(c1.EMailAddress1 )) = LTRIM(RTRIM(c2.EMailAddress2))
Or LTRIM(RTRIM(c1.EMailAddress1 )) = LTRIM(RTRIM(c2.pc_hmemail))
Or LTRIM(RTRIM(c1.EMailAddress1 )) = LTRIM(RTRIM(c2.Si_OfficeEmail2nd))
)
And c1.ContactId <> c2.ContactId
And c1.StateCode = 0
and c2.StateCode = 0
order by c1.FullName, c2.FullName
Unfortunately, because Bob and Rob have the same email 'type' (Home Email) that is duplicated due to a typo, my query returns two records: one showing that Bob's email is duplicated in Rob's email, and a second showing that Rob's email is duplicated in Bob's email.
I only need one record. I'm sure this is a common problem but I don't quite know how to describe this problem well enough to have a search engine return something useful.
Perhaps there is a better way of going about this? If not, other than jumping through a bunch of intermediate temporary tables to eliminate these equivalent records, is there a way to write a single query for this?
The solution to your problem is to add the condition: c1.ContactId < c2.ContactId. This keeps only one ordering of each pair of contacts, so each pair is reported once.
If you are looking for duplicated emails, it may be faster to look at the emails directly. Something like the following returns all emails (on separate rows) that are duplicated:
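A small sketch of the effect of the ordering condition, using Python's sqlite3 module with a simplified, hypothetical contact table that has a single home_email column (the real table has four email columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (contact_id INTEGER PRIMARY KEY, full_name TEXT, home_email TEXT);
INSERT INTO contact VALUES
  (1, 'Bob', 'bob@home.com'),
  (2, 'Rob', 'bob@home.com'),
  (3, 'Sue', 'sue@home.com');
""")

# Same self-join twice; only the comparison between the two ids changes.
PAIR_SQL = """
    SELECT c1.full_name, c2.full_name, c1.home_email
    FROM contact c1
    JOIN contact c2 ON c1.home_email = c2.home_email
    WHERE c1.contact_id {op} c2.contact_id
    ORDER BY c1.contact_id, c2.contact_id
"""

both_directions = conn.execute(PAIR_SQL.format(op="<>")).fetchall()
one_per_pair = conn.execute(PAIR_SQL.format(op="<")).fetchall()

print(both_directions)  # (Bob, Rob) and (Rob, Bob)
print(one_per_pair)     # (Bob, Rob) only
```

One caveat, stated as an observation about the query in the question rather than about this sketch: that query compares only c1.EMailAddress1 against the four columns of c2, so the comparison is not symmetric, and restricting to c1.ContactId < c2.ContactId could drop a match that only shows up when the higher id is on the c1 side.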
select e.*
from (select e.*, COUNT(*) over (partition by email) as NumTimes
from ((select contactId, 'Office' as which, EmailAddress1 as email
from Contact
) union all
(select contactId, 'Office2', Si_OfficeEmail2nd
from Contact
) union all
(select contactId, 'Home', pc_hmemail
from Contact
) union all
(select contactId, 'Mobile', EmailAddress2
from Contact
)
) e
where email is not null and email <> ''
) e
where NumTimes > 1
order by email
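The unpivot-then-count idea can be tried on the Bob and Rob example with Python's sqlite3 module. This is a sketch under assumptions: the table and column names are simplified stand-ins, and a correlated COUNT(*) subquery replaces the window function so that it also runs on SQLite builds without window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (
    contact_id INTEGER PRIMARY KEY,
    office_email TEXT, home_email TEXT, mobile_email TEXT
);
INSERT INTO contact VALUES
  (1, 'bob@aaa.com', 'bob@home.com', 'bobster@gmail.com'),
  (2, 'rob@bbb.com', 'bob@home.com', 'robert@gmail.com');
""")

DUPLICATES_SQL = """
    WITH emails(contact_id, which, email) AS (
        -- One row per (contact, email column): the "unpivot" step.
        SELECT contact_id, 'Office', office_email FROM contact
        UNION ALL
        SELECT contact_id, 'Home', home_email FROM contact
        UNION ALL
        SELECT contact_id, 'Mobile', mobile_email FROM contact
    )
    SELECT e.contact_id, e.which, e.email
    FROM emails e
    WHERE e.email IS NOT NULL AND e.email <> ''
      -- Keep only addresses that occur more than once overall.
      AND (SELECT COUNT(*) FROM emails x WHERE x.email = e.email) > 1
    ORDER BY e.email, e.contact_id
"""

duplicates = conn.execute(DUPLICATES_SQL).fetchall()
print(duplicates)
```

Each duplicated address appears once per contact and column that uses it, so the shared home address yields one row for Bob and one for Rob.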
I'd first suggest continuing to normalise your data structure. A person may have several types of contact information, so the personID, typeID and value can be placed in another table. From this table you can create a relation to a type table, where you keep track of the different contact types (e.g. home e-mail, work e-mail, Twitter, LinkedIn, Facebook, etc.). This not only improves the extensibility of your system but also lets you run these types of queries much more efficiently.
SELECT ci.value, count(ci.value) FROM user u LEFT JOIN contactinfo ci ON u.user_id=ci.user_id LEFT JOIN contacttype ct ON ci.type_id=ct.type_id GROUP BY ci.value HAVING count(ci.value)>1 would be the query to find any duplicated value.

Alternative to using GROUP BY without aggregates to retrieve distinct "best" result

I'm trying to retrieve the "Best" possible entry from an SQL table.
Consider a table containing tv shows:
id, title, episode, is_hidef, is_verified
eg:
id title ep hidef verified
1 The Simpsons 1 True False
2 The Simpsons 1 True True
3 The Simpsons 1 True True
4 The Simpsons 2 False False
5 The Simpsons 2 True False
There may be duplicate rows for a single title and episode which may or may not have different values for the boolean fields. There may be more columns containing additional info, but that's unimportant.
I want a result set that gives me the best row (so is_hidef and is_verified are both "true" where possible) for each episode. For rows considered "equal" I want the most recent row (natural ordering, or order by an arbitrary datetime column).
3 The Simpsons 1 True True
5 The Simpsons 2 True False
In the past I would have used the following query:
SELECT * FROM shows WHERE title='The Simpsons' GROUP BY episode ORDER BY is_hidef, is_verified
This works under MySQL and SQLite, but goes against the SQL spec (GROUP BY requiring aggregates, etc.). I'm not really interested in hearing again why MySQL is so bad for allowing this; but I'm very interested in finding an alternative solution that will work on other engines too (bonus points if you can give me the django ORM code for it).
Thanks =)
In some way similar to Andomar's but this one really works.
select C.*
FROM
(
select min(ID) minid
from (
select distinct title, ep, max(hidef*1 + verified*1) ord
from tbl
group by title, ep) a
inner join tbl b on b.title=a.title and b.ep=a.ep and b.hidef*1 + b.verified*1 = a.ord
group by a.title, a.ep, a.ord
) D inner join tbl C on D.minid = C.id
The first level tiebreak converts bits (SQL Server) or MySQL boolean to an integer value using *1, and the columns are added to produce the "best" value. You can give them weights, e.g. if hidef > verified, then use hidef*2 + verified*1 which can produce 3,2,1 or 0.
The 2nd level looks among those of the "best" scenario and extracts the minimum ID (or some other tie-break column). This is essential to reduce a multi-match result set to just one record.
In this particular case (table schema), the outer select uses the direct key to retrieve the matched records.
This is basically a form of the groupwise-maximum-with-ties problem. I don't think there is a SQL standard compliant solution. A solution like this would perform nicely:
SELECT s2.id
, s2.title
, s2.episode
, s2.is_hidef
, s2.is_verified
FROM (
select distinct title
, episode
from shows
where title = 'The Simpsons'
) s1
JOIN shows s2
ON s2.id =
(
select id
from shows s3
where s3.title = s1.title
and s3.episode = s1.episode
order by
s3.is_hidef DESC
, s3.is_verified DESC
limit 1
)
But given the cost of readability, I would stick with your original query.
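For anyone who wants to try the correlated-subquery form, here is a runnable sketch with Python's sqlite3 module and the sample rows from the question. Booleans are stored as 0/1, and one thing is added that the query above does not have: a final id DESC sort key, on the assumption that a higher id means a more recent row, so that ties between equally good rows resolve to the newest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shows (
    id INTEGER PRIMARY KEY, title TEXT, episode INTEGER,
    is_hidef INTEGER, is_verified INTEGER
);
INSERT INTO shows VALUES
  (1, 'The Simpsons', 1, 1, 0),
  (2, 'The Simpsons', 1, 1, 1),
  (3, 'The Simpsons', 1, 1, 1),
  (4, 'The Simpsons', 2, 0, 0),
  (5, 'The Simpsons', 2, 1, 0);
""")

BEST_SQL = """
    SELECT s2.id, s2.title, s2.episode, s2.is_hidef, s2.is_verified
    FROM (SELECT DISTINCT title, episode FROM shows WHERE title = ?) s1
    JOIN shows s2 ON s2.id = (
        -- Pick the single best row for this title/episode.
        SELECT s3.id
        FROM shows s3
        WHERE s3.title = s1.title AND s3.episode = s1.episode
        ORDER BY s3.is_hidef DESC, s3.is_verified DESC,
                 s3.id DESC          -- assumed tie-break: newest row wins
        LIMIT 1
    )
    ORDER BY s2.episode
"""

best = conn.execute(BEST_SQL, ("The Simpsons",)).fetchall()
for row in best:
    print(row)
```

With the tie-break in place the result matches the rows asked for in the question (ids 3 and 5). LIMIT is not part of standard SQL, so other engines would need their own equivalent (for example TOP 1 or FETCH FIRST 1 ROW ONLY).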

How to write a query returning non-chosen records

I have written a psychological testing application, in which the user is presented with a list of words, and s/he has to choose ten words which very much describe himself, then choose words which partially describe himself, and words which do not describe himself. The application itself works fine, but I was interested in exploring the meta-data possibilities: which words have been most frequently chosen in the first category, and which words have never been chosen in the first category. The first query was not a problem, but the second (which words have never been chosen) leaves me stumped.
The table structure is as follows:
table words: id, name
table choices: pid (person id), wid (word id), class (value between 1-6)
Presumably the answer involves a left join between words and choices, but there has to be a modifying statement - where choices.class = 1 - and this is causing me problems. Writing something like
select words.name
from words left join choices
on words.id = choices.wid
where choices.class = 1
and choices.pid = null
causes the database manager to go on a long trip to nowhere. I am using Delphi 7 and Firebird 1.5.
TIA,
No'am
Maybe this is a bit faster:
SELECT w.name
FROM words w
WHERE NOT EXISTS
(SELECT 1
FROM choices c
WHERE c.class = 1 and c.wid = w.id)
Something like that should do the trick:
SELECT name
FROM words
WHERE id NOT IN
(SELECT DISTINCT wid -- DISTINCT is actually redundant
FROM choices
WHERE class = 1)
SELECT words.name
FROM
words
LEFT JOIN choices ON words.id = choices.wid AND choices.class = 1
WHERE choices.pid IS NULL
Make sure you have an index on choices (class, wid).
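The query in the question fails for two reasons: the WHERE choices.class = 1 test discards exactly the NULL-extended rows that the left join produces for unchosen words, and choices.pid = null is never true because NULL has to be tested with IS NULL. The sketch below uses Python's sqlite3 module with made-up sample words (not Firebird 1.5, so performance will differ) to show that the NOT EXISTS form and the LEFT JOIN form with the class test moved into the ON clause return the same words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE choices (pid INTEGER, wid INTEGER, class INTEGER);
CREATE INDEX choices_class_wid ON choices (class, wid);
INSERT INTO words VALUES (1, 'calm'), (2, 'bold'), (3, 'shy');
-- 'calm' was chosen in class 1; 'bold' only in class 3; 'shy' never.
INSERT INTO choices VALUES (10, 1, 1), (11, 1, 2), (10, 2, 3);
""")

NOT_EXISTS_SQL = """
    SELECT w.name
    FROM words w
    WHERE NOT EXISTS
          (SELECT 1 FROM choices c WHERE c.class = 1 AND c.wid = w.id)
    ORDER BY w.id
"""

LEFT_JOIN_SQL = """
    SELECT w.name
    FROM words w
    LEFT JOIN choices c ON w.id = c.wid AND c.class = 1  -- filter in ON
    WHERE c.pid IS NULL                                  -- no class-1 match
    ORDER BY w.id
"""

never_first = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]
via_left_join = [row[0] for row in conn.execute(LEFT_JOIN_SQL)]
print(never_first)
print(via_left_join)
```

Both queries list the words that nobody placed in the first category, including words that were chosen only in other categories.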