How to parse a string from one column into delimited values in SQL

This is my column in Redshift
SHIPMENT_ID
-----------------------------------------
FBA15KS66741, FBA15KS6673D
FBA15NHV7PXX (Oct 20th)
FBA15XNW0SWY 27 balance 2 of 2
FBA15M575MDL & FBA15M59W1Y5
FBA15NHV7PXX (Oct 20th)
FBA15D7WPZVR /FBA15D7WWTPK/FBA15D7WW1GL
I would like to make it
SHIPMENT_ID
-----------------------------------------
FBA15KS66741, FBA15KS6673D
FBA15NHV7PXX
FBA15XNW0SWY
FBA15M575MDL, FBA15M59W1Y5
FBA15NHV7PXX
FBA15D7WPZVR, FBA15D7WWTPK, FBA15D7WW1GL
In SQL only, what is the best way to handle this?

This works in PostgreSQL, so it may work in Redshift depending on feature availability (Redshift is based on PostgreSQL 8).
WITH items AS
(
    SELECT shipment_id,
           ARRAY_TO_STRING(REGEXP_MATCHES(shipment_id, 'FBA15[0-9a-zA-Z]{7}', 'g'), '') AS unique_shipment_ids
    FROM dat
)
SELECT shipment_id,
       STRING_AGG(unique_shipment_ids, ',') AS shipment_id_csv
FROM items
GROUP BY shipment_id;
I've assumed:
Each item begins with the characters 'FBA15'
There are exactly 7 characters after the first 5
You can edit the regexp pattern if my assumptions are incorrect.
The approach is:
Use REGEXP_MATCHES to capture each item within each row. This creates multiple rows per unique value in shipment_id
Use ARRAY_TO_STRING to convert those values to text, rather than text[]
Use STRING_AGG to join them back together with a comma separator
I found that I could not wrap STRING_AGG directly around REGEXP_MATCHES because I get the error "aggregate function calls cannot contain set-returning function calls", so I opted for a CTE. I assume a subquery would work as well.
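For reference, here is a minimal test setup for the query above (the table and column names come from the query itself; the sample values are taken from the question):
-- hypothetical test table matching the query above
CREATE TABLE dat (shipment_id varchar(100));

INSERT INTO dat (shipment_id) VALUES
  ('FBA15KS66741, FBA15KS6673D'),
  ('FBA15NHV7PXX (Oct 20th)'),
  ('FBA15D7WPZVR /FBA15D7WWTPK/FBA15D7WW1GL');
Running the CTE query against this table should return one row per original shipment_id value, with all extracted IDs joined by commas.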

Related

Count the most popular occurrences of a hashtag in a string column postgreSQL

I have a column in my dataset with the following format:
hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]
I have managed to count the hashtags like so:
SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST
But I'm not sure if it's possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:
hashtags count
christmas 3
newyears 2
The datatype is just varchar, but I'm a bit confused on how I should approach this. Any help would be appreciated!
Storing the data like this is a bad idea. It's risky because we don't know whether the text will always be stored in exactly this form. It would be better to save the different strings in separate columns.
Anyway, if you can't improve that and must deal with this structure, you can use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.
So the general idea is something like this:
WITH unnested AS
(
    SELECT UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
    FROM full_data
)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
Due to the brackets and spaces within your column, this alone will not produce the correct result.
So we can additionally use TRIM and TRANSLATE to get rid of everything except the hashtags themselves.
With your sample data, the following construct produces the intended outcome:
WITH unnested AS
(
    SELECT TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')), '#,[,]', '')) AS hashtag
    FROM full_data
)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
See here
But as already said, this is unpleasant and risky.
So if possible, find out which hashtags can occur (it seems they are all special days) and then create columns or a mapping table for them.
For example, store 0 or 1 in each column to indicate whether the hashtag appears or not, and then sum the values per column, as sketched below.
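A minimal sketch of that idea, assuming a restructured table (here called full_data_normalized, purely illustrative) with one 0/1 flag column per known hashtag:
-- hypothetical table: one 0/1 flag column per known hashtag
-- full_data_normalized(id int, has_christmas int, has_newyears int, has_easter int, ...)
SELECT SUM(has_christmas) AS christmas,
       SUM(has_newyears)  AS newyears,
       SUM(has_easter)    AS easter
FROM full_data_normalized;
Each sum then tells you how often the corresponding hashtag appears across all rows.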
I think you should split the array data into records and then count them with GROUP BY. Something like this query:
SELECT hashtag, count(*) as hashtag_count
FROM full_data, unnest(hashtags) s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC
Hopefully, it will match your request!
You can do it as follows:
select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^\w,]+','','g'), ',')) as tags, count(1)
from full_data
group by tags
order by count(1) desc
Result :
tags count
christmas 3
newyears 2
easter 2
fourthofjuly 1
valentines 1
REGEXP_REPLACE to remove any special characters.
string_to_array to generate an array
unnest to expand an array to a set of rows
Demo here

How to rotate a two-column table?

This might be a novice question – I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales) FROM (
    SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1]) AS locales
    FROM users
) AS _
GROUP BY locales
My query returns the following dynamic rows:
locales | count
--------+------
en      | 10
fr      | 7
de      | 3
...     | ...    (plus ~300 additional locales with their counts)
I'm trying to rotate it so that locale values end up as columns with a single row, like this:
en | fr | de | ... (~300 additional locale columns)
---+----+----+------------------------------------
10 | 7  | 3  | ...
I'm having to do this to play nice with a time-series db/app
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns.
I've looked at examples using join, but I can't figure out how to do it dynamically.
Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}' -- eliminate NULL, allow index support
GROUP BY 1
ORDER BY 2 DESC, 1
Simpler and faster than your original base query.
About those ordinal numbers in GROUP BY and ORDER BY:
Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns null. I added a WHERE clause to eliminate non-matches a priori and allow index support if applicable, but I don't expect indexes to help here.
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
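For example, a pattern that accepts either separator could look like this (just an illustration, assuming values such as 'en_US' may occur):
-- matches 'en-US' as well as 'en_US' (hyphen or underscore separator)
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:[-_][a-z]{2,3})?')) AS locale
FROM   users;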
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See:
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics:
PostgreSQL Crosstab Query
Dynamic query:
PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM  (
   SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
        , count(*)::int AS ct
   FROM   users
   WHERE  locale ~* '[a-z]{2,3}'
   GROUP  BY 1
   ORDER  BY 2 DESC, 1
   ) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See:
My function returned a string. How to execute it?
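A quick illustration (a sketch, assuming an interactive psql session and the users table from above; note there is no trailing semicolon before \gexec):
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM  (
   SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
        , count(*)::int AS ct
   FROM   users
   WHERE  locale ~* '[a-z]{2,3}'
   GROUP  BY 1
   ORDER  BY 2 DESC, 1
   ) t \gexec
psql runs the generating query, then immediately executes the generated SELECT and prints the single pivoted row.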

How do I fix the "tuple descriptions are incompatible" error while pivoting in PostgreSQL?

I'm pivoting in PostgreSQL but when I run the query the output says:
ERROR: return and sql tuple descriptions are incompatible
SQL state: 42601
To summarize, I want the distribution channel as rows, the years as columns, and the operative margin as the values.
dist_chann_id --> integer
year --> year
operative_margin --> integer
Without the pivot the output is:
dist_chann_name | year | operative_margin
----------------+------+-----------------
              1 | 2020 | 20783
              1 | 2021 | 5791
              2 | 2020 | 30362
              3 | 2021 | 14501
              3 | 2020 | 2765
              3 | 2021 | 4535
This is my query:
SELECT *
FROM crosstab(
'SELECT dist_chann_id, year, operative_margin
FROM marginality_by_channel
ORDER BY dist_chann_id, year'
) AS ct ("DC" int, "2020" int, "2021" int);
Source of the error message
One of the columns does not have the data type you think it has. It must be operative_margin, probably text?
The 1-parameter form of crosstab() uses the "category" column (year in your example) only for sorting. And the "row_name" column (dist_chann_name - or dist_chann_id?) would produce a different error message.
Solution
Either way, unless you can guarantee that every "row_name" has exactly two values to it, it's safer to use the 2-parameter form of crosstab():
SELECT *
FROM crosstab(
$$
SELECT dist_chann_name, year, operative_margin
FROM marginality_by_channel
ORDER BY 1, 2
$$
, 'VALUES (2020), (2021)'
) AS ct ("DC" int, "2020" int, "2021" int);
db<>fiddle here
This variant also happens to be more tolerant with type mismatches (as everything is passed as text anyway). See:
PostgreSQL Crosstab Query
crosstab() shines for many resulting value columns (faster, shorter). For just two "value" columns, aggregate FILTER might be the better (simpler) choice. Not much performance to gain (if any, after adding some overhead). See:
a_horse's answer under this question
Conditional SQL count
Broken setup
That aside, your setup is ambiguous to begin with. It includes two rows for the same (dist_chann_name, year) = (3, 2021).
a_horse uses sum() in his aggregate FILTER solution. You might also use min() or max(), or whatever ...
My solution with the 2-parameter form outputs the last value according to sort order. (Think of it as each next value overwriting its dedicated spot.)
The 1-parameter form outputs the first value according to sort order. (Think of it as "first come, first serve". Superfluous rows are discarded.)
A clean solution would use an explicit sort order and document the effect, or work with a query producing distinct values, or use the appropriate aggregate function with the FILTER solution.
Using filtered aggregation is typically much easier than the somewhat convoluted crosstab() function (at least in my opinion).
select dist_chann_name as dc,
sum(operative_margin) filter (where year = 2020) as "2020",
sum(operative_margin) filter (where year = 2021) as "2021"
from marginality_by_channel
group by dist_chann_name
order by dist_chann_name;

Redshift LISTAGG frame clause

I am trying to aggregate strings, but limited to only the preceding rows, not the whole partition. Does anyone know how to do this in Redshift?
What I am trying to achieve is the appended_event_namespace column below.
This is what I've tried so far.
LISTAGG(event_namespace, '/')
WITHIN GROUP (ORDER BY tstamp_true)
OVER (PARTITION BY acct_id) AS appended_event_namespace
This results in the full ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile aggregation on every single row instead of what is in the desired screenshot.
The difficulty is in getting it to only append up to the current row since there doesn't seem to be a frame-clause for Redshift's LISTAGG(). Thanks for any ideas that may help.
You can hack this together with another query. Start with your appended_event_namespace as the result of your original LISTAGG:
SELECT event_namespace,
       SUBSTRING(appended_event_namespace,
                 1,
                 POSITION(event_namespace IN appended_event_namespace) + LEN(event_namespace) - 1
       ) AS appended_event_namespace_cum
FROM your_table;
Basically, you take your aggregated, ordered string, and then take the first N characters where N is ([where it appears in the aggregated string ]+[its length]), which will cut out everything after that item. This gives you a cumulative namespace.
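Put together, a rough sketch of the combined query could look like this (table and column names are assumed from the question: your_table with acct_id, tstamp_true and event_namespace; untested):
SELECT event_namespace,
       tstamp_true,
       SUBSTRING(appended_event_namespace,
                 1,
                 POSITION(event_namespace IN appended_event_namespace) + LEN(event_namespace) - 1
       ) AS appended_event_namespace_cum
FROM (
    -- full per-account aggregation, as in the original LISTAGG attempt
    SELECT acct_id,
           tstamp_true,
           event_namespace,
           LISTAGG(event_namespace, '/')
               WITHIN GROUP (ORDER BY tstamp_true)
               OVER (PARTITION BY acct_id) AS appended_event_namespace
    FROM your_table
) t;
Note that if the same event_namespace occurs more than once for an account, POSITION finds its first occurrence, so the truncation is only reliable for distinct values.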
LISTAGG with frame clause is not supported in RS yet. If you have some columns that you can use for partitioning and ordering you can make a self join (not so performant but would accomplish what you want):
SELECT t1.id,
       t1.tstamp_true,
       t1.event_namespace,
       LISTAGG(t2.event_namespace, '/') WITHIN GROUP (ORDER BY t2.tstamp_true)
FROM your_table t1
JOIN your_table t2
  ON  t1.id = t2.id
  AND t1.tstamp_true >= t2.tstamp_true
GROUP BY 1, 2, 3
Alternatively, if you want to avoid self join you can build a JSON with the following structure using LISTAGG:
[{tstamp_true_1,event_namespace_1},{tstamp_true_N,event_namespace_N},...]
and write a Python UDF that takes such a JSON string for the given group of rows plus the tstamp_true of the given row and returns the path (the function would keep only the entries whose tstamp_true_N is not later than the second parameter and concatenate their event_namespace_N values for the output).

Splitting text in SQL Server stored procedure

I'm working with a database, where one of the fields I extract is something like:
1-117 3-134 3-133
Each of these number sets represents a different set of data in another table. Taking 1-117 as an example, 1 = equipment ID, and 117 = equipment settings.
I have another table from which I need to extract data based on the previous field. It has two columns that split equipment ID and settings. Essentially, I need a way to go from the queried column 1-117 and run a query to extract data from another table where 1 and 117 are two separate corresponding columns.
So, is there any way to split this number to run this query?
Also, how would I split those three numbers (1-117 3-134 3-133) into three different query sets?
The tricky part here is that this column can have any number of sets here (such as 1-117 3-133 or 1-117 3-134 3-133 2-131).
I'm creating these queries in a stored procedure as part of a larger document to display the extracted data.
Thanks for any help.
Since you didn't provide the DB vendor, here are two posts that answer this question for SQL Server and Oracle respectively...
T-SQL: Opposite to string concatenation - how to split string into multiple records
Splitting comma separated string in a PL/SQL stored proc
And if you're using some other DBMS, go search for "splitting text" together with your DBMS name. I can almost guarantee you're not the first one to ask, and there are answers for every DBMS flavor out there.
Since you said the format is constant, though, you could also do something simpler using a SUBSTRING function.
EDIT in response to OP comment...
Since you're using SQL Server, and you said that these values are always in a consistent format, you can do something as simple as using SUBSTRING to get each part of the value and assign them to T-SQL variables, where you can then use them to do whatever you want, like using them in the predicate of a query.
Assuming that what you said is true about the format always being #-### (exactly 1 digit, a dash, and 3 digits) this is fairly easy.
WITH EquipmentSettings AS (
   SELECT
      S.*,
      Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 5, 1)) AS EquipmentID,
      Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 3, 3)) AS Settings
   FROM
      SourceTable S
      INNER JOIN master.dbo.spt_values V
         ON V.Value BETWEEN 1 AND (Len(S.AwfulMultivalue) + 1) / 6
   WHERE
      V.type = 'P'
)
SELECT
   E.Whatever,
   D.Whatever
FROM
   EquipmentSettings E
   INNER JOIN DestinationTable D
      ON E.EquipmentID = D.EquipmentID
      AND E.Settings = D.Settings
In SQL Server 2005+ this query will support 1365 values in the string.
If the length of the digits can vary, then it's a little harder. Let me know.
If the number of sets does not exceed 4, you can use PARSENAME to retrieve the result:
Declare @Num varchar(20)
Set @Num = '1-117 3-134 3-133'
Select parsename(replace(@Num, ' ', '.'), 3)
Result: 1-117
Now apply parsename again to that result:
Select parsename(replace(parsename(replace(@Num, ' ', '.'), 3), '-', '.'), 1)
Result: 117
If there are more than 4 values, use a split function instead.
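For example, on SQL Server 2016 or later you could use the built-in STRING_SPLIT function (a sketch, assuming the same space-delimited '#-###' format):
Declare @Num varchar(100)
Set @Num = '1-117 3-134 3-133 2-131'

Select LEFT(s.value, CHARINDEX('-', s.value) - 1)          as EquipmentID,
       SUBSTRING(s.value, CHARINDEX('-', s.value) + 1, 10) as Settings
from STRING_SPLIT(@Num, ' ') as s
where s.value <> ''   -- skip empty entries caused by doubled spaces
The two resulting columns can then be joined to the settings table, as in the earlier answer.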