Databricks spark sql to show associated strings from hashed strings

I'm using a query in Databricks like this:
select * from thisdata where hashed_string in (sha2("mystring1", 512),sha2("mystring2", 512),sha2("mystring3", 512))
This works well and gives me the data I need, but is there a way to show the string associated with each hashed string?
Example:
mystring1 - 1494219340aa5fcb224f6b775782f297ba5487
mystring2 - 5430af17738573156426276f1e01fc3ff3c9e1
Probably not, as there's a reason for it to be hashed, but I'm just checking whether there is a way.

If you have a table that holds each string and its corresponding hash in separate columns, you can perform an inner join instead of using an IN clause. After joining, the concat_ws function gives you the required output.
Let's say you create a table named hashtable with the columns mystring and hashed_mystring (populated with sha2(mystring, 512)), and the other table is named maintable.
You can use the query below to join the two and extract the result in the required format.
select concat_ws('-',h.mystring, m.hashed_string) from maintable m
inner join hashtable h on m.hashed_string = h.hashed_mystring
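
To make the lookup-table idea concrete, here is a minimal sketch that runs locally with Python's sqlite3 and hashlib instead of Databricks. It assumes that the hex digest from hashlib.sha512 matches what sha2(col, 512) produces for a UTF-8 string; the payload column and the sample rows are made up for illustration, and || stands in for concat_ws.

```python
import hashlib
import sqlite3


def sha512_hex(text):
    # Hex-encoded SHA-512 digest of the UTF-8 bytes of the string.
    return hashlib.sha512(text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashtable (mystring TEXT, hashed_mystring TEXT)")
# "payload" is a hypothetical extra column standing in for the real data.
conn.execute("CREATE TABLE maintable (hashed_string TEXT, payload TEXT)")

# The lookup table holds every candidate string next to its hash.
candidates = ["mystring1", "mystring2", "mystring3"]
conn.executemany(
    "INSERT INTO hashtable VALUES (?, ?)",
    [(s, sha512_hex(s)) for s in candidates],
)
# One row whose hash is in the lookup table, one whose hash is not.
conn.executemany(
    "INSERT INTO maintable VALUES (?, ?)",
    [(sha512_hex("mystring1"), "row A"), (sha512_hex("unknown"), "row B")],
)

rows = conn.execute(
    """
    SELECT h.mystring || '-' || m.hashed_string
    FROM maintable m
    INNER JOIN hashtable h ON m.hashed_string = h.hashed_mystring
    """
).fetchall()
labelled = [r[0] for r in rows]
print(labelled)
```

Only hashes that appear in the lookup table come back labelled. A hash cannot be reversed, so any string you did not hash yourself stays unknown.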

Related

Why is Big Query creating a new column instead of joining two columns when using a Join?

When I use a JOIN in BigQuery, the query completes, but the result contains extra columns named Id_1 and Date_1 that hold the same information as the primary key columns. What could cause this? Here is the code.
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
ON
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Id = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Id
AND `bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Date = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Date
I ran the query expecting the tables to be joined on the primary keys Id and Date, but instead it created two new columns with the same information.

When you use * in the select list, the ON variant of a JOIN clause produces all columns from both tables in the result set. If there are columns with the same name on both sides, then both will show up in the result [with slightly different names], as you can see.
You can use the USING variant of the JOIN clause instead, which merges the columns and produces only one resulting column for each column mentioned in the USING clause. This is probably what you want. See BigQuery - INNER JOIN.
Your query could take the form:
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
USING (Id, Date)
Note: USING can only be used when the columns you want to join with have the exact same name. It won't be possible to use it if a column is, for example, called id in one table and employee_id in the other one.
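
The difference between ON and USING can be checked locally. This is a rough sketch with Python's sqlite3, not BigQuery: SQLite does not rename the duplicated columns to Id_1 and Date_1, but the column counts show the same effect. The TotalSteps and MinutesAsleep columns and the sample row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names other than Id and Date are hypothetical.
conn.execute("CREATE TABLE daily_activity (Id INTEGER, Date TEXT, TotalSteps INTEGER)")
conn.execute("CREATE TABLE sleep_day (Id INTEGER, Date TEXT, MinutesAsleep INTEGER)")
conn.execute("INSERT INTO daily_activity VALUES (1, '2016-04-12', 13162)")
conn.execute("INSERT INTO sleep_day VALUES (1, '2016-04-12', 327)")


def column_names(sql):
    # cursor.description lists one entry per result column.
    cur = conn.execute(sql)
    return [d[0] for d in cur.description]


# ON keeps the join columns from both sides.
on_columns = column_names(
    "SELECT * FROM daily_activity a JOIN sleep_day s "
    "ON a.Id = s.Id AND a.Date = s.Date"
)
# USING merges each listed column into a single result column.
using_columns = column_names(
    "SELECT * FROM daily_activity JOIN sleep_day USING (Id, Date)"
)
print(len(on_columns), len(using_columns))
```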

Postgres Query based on Delimited String

We have a column in one of our tables that consists of a delimited string of IDs, like this: 72,73,313,502.
What I need to do in the query is parse that string and, for each ID, join it to another table. So in the case of this string, I'd get 4 rows, each with the ID and the name from the other table.
Can this be done in a query?

One option is regexp_split_to_table() and a lateral join. Assuming that the CSV string is stored in table1(col) and that you want to join each element against table2(id), you would do:
select ... -- whatever columns you want
from table1 t1
cross join lateral regexp_split_to_table(t1.col, ',') x(id)
inner join table2 t2 on t2.id = x.id::int
It should be noted that storing CSV strings in a relational database is bad design, and should almost always be avoided. Each value in the list should be stored as a separate row, using the proper datatype. Recommended reading: Is storing a delimited list in a database column really that bad?
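
To show what the lateral split produces, here is a plain-Python sketch of the same logic: every element of the delimited string becomes its own row, which is then matched against the other table. The names in table2 are invented sample data, and the dictionary stands in for table2(id, name).

```python
# One source row holding the delimited ID string, as in the question.
table1 = [{"col": "72,73,313,502"}]
# Hypothetical contents of table2, keyed by id.
table2 = {72: "alpha", 73: "beta", 313: "gamma", 502: "delta"}


def expand(rows, lookup):
    result = []
    for row in rows:
        # Equivalent of regexp_split_to_table(col, ','): one item per element.
        for part in row["col"].split(","):
            key = int(part)  # same role as the x.id::int cast
            if key in lookup:  # inner join: IDs without a match are dropped
                result.append((key, lookup[key]))
    return result


joined = expand(table1, table2)
print(joined)
```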

Unnesting 3rd level dependency in Google BigQuery

I'm trying to replace the schema of an existing table using BQ. Certain fields in BQ are nested 3-5 levels deep in the schema.
For example, the field comsalesorders.comSalesOrdersInfo.storetransactionid is nested under two fields.
Since I'm using this to replace an existing table, I cannot change the field names in the query.
The query looks similar to this:
SELECT * REPLACE(comsalesorders.comSalesOrdersInfo.storetransactionid AS STRING) FROM CentralizedOrders_streaming.orderStatusUpdated, UNNEST(comsalesorders) AS comsalesorders, UNNEST(comsalesorders.comSalesOrdersInfo) AS comsalesorders.comSalesOrdersInfo
BQ lets me unnest the first-level field but reports a problem for the second level of nesting.
What changes do I need to make to this query to use UNNEST() for such dependent schemas?

Given that you don't have a schema, I will try to provide a generalized answer. Please try to understand the difference between the 2 queries.
-- Provide an alias for each unnest (as if each is a separate table)
select c.stuff
from table
left join unnest(table.first_level_nested) a
left join unnest(a.second_level_nested) b
left join unnest(b.third_level_nested) c
-- b and c won't work here because you are 'double unnesting'
select c.stuff
from table
left join unnest(table.first_level_nested) a
left join unnest(first_level_nested.second_level_nested) b
left join unnest(first_level_nested.second_level_nested.third_level_nested) c
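
One way to picture the first query is as nested loops, where each level iterates over the alias produced by the level above it. The sketch below mimics that in Python under the assumption that LEFT JOIN UNNEST keeps a row with NULL when an array is empty; the helper name unnest_left and the sample data are made up.

```python
def unnest_left(items):
    # Mimics LEFT JOIN UNNEST: an empty or missing array still yields one
    # (null) element so the parent row is kept.
    return items if items else [None]


def flatten(rows):
    out = []
    for row in rows:
        # Each loop reads from the alias of the previous level (a, then b).
        for a in unnest_left(row.get("first_level_nested")):
            for b in unnest_left(a.get("second_level_nested") if a else None):
                for c in unnest_left(b.get("third_level_nested") if b else None):
                    out.append(c.get("stuff") if c else None)
    return out


rows = [
    {"first_level_nested": [
        {"second_level_nested": [
            {"third_level_nested": [{"stuff": "x"}, {"stuff": "y"}]}
        ]}
    ]},
    {"first_level_nested": []},  # empty array: kept as a single null row
]
flat = flatten(rows)
print(flat)
```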

I'm not sure I understand your question, but my guess is that you want to change one column's type to another type, such as STRING.
The UNNEST function is only used with columns that are array types, for example:
"comsalesorders":[{"comSalesOrdersInfo":{}}, {"comSalesOrdersInfo":{}}, {"comSalesOrdersInfo":{}}]
But not with this kind of column:
"comSalesOrdersInfo":{"storeTransactionID":"X1056-943462","ItemsWarrenty":0,"currencyCountry":"USD"}
Therefore, if I didn't misunderstand your question, I would write a query like this:
SELECT *, CAST(A.comSalesOrdersInfo.storeTransactionID as STRING)
FROM `TABLE`, UNNEST(comsalesorders) as A

Postgresql, sql command, join table with similar string, only string "OM:" is at the begin

I want to join a table.
left join
c_store on o_order.customer_note = c_store.store_code
The string in the two fields is almost the same; one just contains "OM:" at the start. For example, a value from o_order.customer_note is
OM:4008
and the matching value from c_store.store_code is
4008
Is it possible to join on c_store.store_code by removing (or replacing) the prefix in every o_order.customer_note value?
I tried
c_store on replace(o_order.customer_note, '', 'OM:') = c_store.store_code
but with no success. I think this is only for renaming a column, right? Sorry for the question, I am new to this.
Thanks.

Use a string concatenation in your join condition:
SELECT ...
FROM o_order o
LEFT JOIN c_store c
ON o.customer_note = 'OM:' || c.store_code::text;
But note that while the above logic might fix your query in the short term, in the long term the better fix would be to have proper join columns set up in your database. That is, it is desirable to be able to do joins on equality alone. This would let Postgres use an index, if it exists.
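
As a quick local check of the prefix join, here is a sketch using Python's sqlite3 rather than Postgres. It assumes store_code is stored as text, so no cast is needed, and the store_name column and sample rows are invented. It also shows the replace() variant from the question with the arguments in the order replace(string, from, to): the attempt in the question passes '' as the text to search for and 'OM:' as the replacement, so the prefix is never removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE o_order (id INTEGER, customer_note TEXT)")
# store_name is a hypothetical column added for illustration.
conn.execute("CREATE TABLE c_store (store_code TEXT, store_name TEXT)")
conn.executemany(
    "INSERT INTO o_order VALUES (?, ?)",
    [(1, "OM:4008"), (2, "OM:9999")],
)
conn.execute("INSERT INTO c_store VALUES ('4008', 'Main Street')")

# Variant 1: add the prefix to the store code, as in the answer.
by_concat = conn.execute(
    """
    SELECT o.id, c.store_name
    FROM o_order o
    LEFT JOIN c_store c ON o.customer_note = 'OM:' || c.store_code
    ORDER BY o.id
    """
).fetchall()

# Variant 2: strip the prefix from the note; arguments are (string, from, to).
by_replace = conn.execute(
    """
    SELECT o.id, c.store_name
    FROM o_order o
    LEFT JOIN c_store c ON replace(o.customer_note, 'OM:', '') = c.store_code
    ORDER BY o.id
    """
).fetchall()
print(by_concat, by_replace)
```

Both variants return the same rows here. The first one leaves o.customer_note untouched in the comparison, which is what makes an index on that column usable.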

How to join two columns using 'on' statement if values in each column are not exactly the same?

I want to join two columns from two different tables where the values in column A are not exactly the same as those in column B. That is, each value in column A (of type text) is part of a value in column B (also of type text).
I can't find any SQL operation that fits what I need.
For example: this is a value from column A:
'bad-things-gone'
And this is the corresponding value from column B:
'/article/bad-things-gone'
I am using an inner join.
select
articles.title, counted_views.top_counts
from
articles
inner join
counted_views on articles.column_A (operation) counted_views.column_B;

If the prefix is always /article/ you could just concat() that.
SELECT articles.title,
counted_views.top_counts
FROM articles
INNER JOIN counted_views
ON counted_views.column_b = concat('/article/', articles.column_a);
If the prefix is variable you could use LIKE. It compares strings by simple patterns.
SELECT articles.title,
counted_views.top_counts
FROM articles
INNER JOIN counted_views
ON counted_views.column_b LIKE concat('%', articles.column_a);
% is a wildcard that matches any sequence of zero or more characters.
If there's also a suffix you can append another % at the end.
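
The LIKE variant can be tried out with a small sketch in Python's sqlite3, where || plays the role of concat(). The table contents are invented sample data. Keep in mind that any % or _ inside column_a would itself be treated as a wildcard, and that a pure suffix match also accepts shorter slugs that happen to end the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, column_a TEXT)")
conn.execute("CREATE TABLE counted_views (column_b TEXT, top_counts INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Bad things gone", "bad-things-gone"), ("Other story", "other-slug")],
)
conn.execute("INSERT INTO counted_views VALUES ('/article/bad-things-gone', 170098)")

# Match rows where column_b ends with the article slug.
rows = conn.execute(
    """
    SELECT articles.title, counted_views.top_counts
    FROM articles
    INNER JOIN counted_views
      ON counted_views.column_b LIKE '%' || articles.column_a
    """
).fetchall()
print(rows)
```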

There are many ways to make such a weak join. They differ mainly in performance and in which database vendors support them.
Some common approaches, and the resulting join condition:
Normalize the strings, e.g. by removing all non-alphabetic characters, and compare only what remains.
ON regexp_replace(upper(column_a),'[^A-Z]','') = regexp_replace(upper(column_b),'[^A-Z]','')
Use database functions that return the distance between strings (see [https://en.wikipedia.org/wiki/Levenshtein_distance]).
ON EDIT_DISTANCE(column_b, column_a) < 6
Use database functions that only check whether string a is included in b.
ON contains(column_b, column_a)
The above functions, like regexp_replace, are Oracle specific, but similar ones exist for all major databases.
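
For readers without a built-in distance function, the edit-distance approach can be sketched in plain Python. This is a textbook Levenshtein implementation and a naive nested-loop join, not what any particular database does internally; the function names and the sample strings are assumptions for illustration.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_join(left, right, max_distance):
    # Pair every left value with every right value that is close enough.
    return [
        (l, r)
        for l in left
        for r in right
        if edit_distance(l, r) < max_distance
    ]


pairs = fuzzy_join(["bad-things-gone"], ["bad-thing-gone", "zzzzzzzzzzzz"], 6)
print(pairs)
```

Comparing every pair is quadratic in the number of rows, which is one reason such joins tend to be slow on large tables.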
