I want to printf() just the first 3 patients in collect_set() of patient numbers.
A. I have created "patient_list" using collect_set
collect_set(distinct patient_seq) AS patient_list
which yields arrays of patient numbers of varying length (4, 5 or 6 digits)
Example:
["16189","26599","406622","419117","5551"]
["223587","224663","232072","326504","433430","436673","54540","58188","74118"]
B. I then stripped out the commas and quotes and separated by '*' (in order to grab just the first 3 patients, in the next step):
concat_ws('*', patient_list) AS pat_list
This produces:
16189*26599*406622*419117*5551
223587*224663*232072*326504*433430*436673*54540*58188*74118
C. I tried to use SUBSTRING_INDEX() to create a new variable (pat_list_short) containing just the first 3 patients, but this function is not available in Hive 1.1.0 (it is not supported until 1.3.0).
substring_index(pat_list, '*', 3) AS pat_list_short
What other option do I have?
I want to feed pat_list_short into PRINTF using %s in order to print out just the first three patient numbers for the review team. Since the patient numbers vary in length, I can't just limit the print to a fixed number of characters.
Thanks
Using the data you provided
--------------
key | pat_id
--------------
1 16189
1 26599
1 406622
1 419117
1 5551
2 223587
2 224663
2 232072
2 326504
2 433430
2 436673
2 54540
2 58188
2 74118
you can use this UDF here to truncate an array to a desired length. There are instructions on the main page on how to build and use the jar.
Query:
add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function trunc_array as 'brickhouse.udf.collect.TruncateArrayUDF';
select key
, concat_ws(' ', trunc_array(collect_set( pat_id ), 3)) pat_list_short
from db.tbl
group by key
Output:
----------------------
key | pat_list_short
----------------------
1 5551 26599 16189
2 232072 58188 223587
I must admit I'm a bit unclear as to how printf() plays a part in this problem, as the query returns a result and prints it. It is also worth noting that in your query in A, the distinct in collect_set(distinct) is redundant, as collect_set's purpose is to collect distinct elements.
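To make the intended result concrete, here is a minimal Python sketch of the same logic: de-duplicate, keep the first three, and join with a separator. It is not Hive code, and the function names are made up for illustration. One difference to keep in mind: this sketch preserves first-seen order, whereas collect_set gives no ordering guarantee (which is why the output above is not in input order).

```python
def first_n_distinct(values, n=3):
    """Return up to n distinct values, keeping first-seen order."""
    # dict.fromkeys de-duplicates like collect_set, but preserves order
    distinct = list(dict.fromkeys(values))
    return distinct[:n]


def format_short_list(values, n=3, sep=" "):
    """Join the first n distinct values into one printable string."""
    return sep.join(first_n_distinct(values, n))


if __name__ == "__main__":
    rows = {
        1: ["16189", "26599", "406622", "419117", "5551"],
        2: ["223587", "224663", "232072", "326504", "433430",
            "436673", "54540", "58188", "74118"],
    }
    for key, patients in rows.items():
        # %s works regardless of how many digits each patient number has
        print("%s %s" % (key, format_short_list(patients)))
```

Because the list is cut by element count rather than by character count, the varying length of the patient numbers does not matter.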
I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.
I have a column in my table of type string. Inside this string there is a repeating sequence of characters, and I need to extract and process each of them. To illustrate, let's say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.
How can I reference INSTS in order to do subsequent processing? The documentation suggests I should be building subqueries using WITH, but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?
BigQuery has a function SPLIT(), but unlike SPLIT_PART() in other databases it takes two arguments, SPLIT(value, delimiter), and returns an ARRAY rather than a single part. The three-argument calls SPLIT(x, ';', i) in the query below follow SPLIT_PART() semantics; in BigQuery the equivalent is SPLIT(x, ';')[SAFE_OFFSET(i - 1)].
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon,
first you split your instruments string into its parts and keep those that start with inst=. To do that, you use an in-line table of consecutive integers to CROSS JOIN with, so that you can take the i-th part for an increasing integer value of i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another split, this time with the equal sign as the delimiter, and taking the second part:
WITH indata(bandid,instruments) AS (
-- some input, don't use in real query ...
-- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
SELECT
1,'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
UNION ALL
SELECT
2,'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
UNION ALL
SELECT
3,'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
)
SELECT
bandid
, i
, SPLIT(SPLIT(instruments,';',i),'=',2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments,';',i) like 'inst=%'
ORDER BY 1
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba
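The two-level split above can be sketched in Python as follows. This only illustrates the logic and is not BigQuery SQL; the function name is hypothetical, and it makes the same assumption as the answer, namely that the semicolon is the only pair delimiter.

```python
def extract_instruments(instruments):
    """Return the values of every inst=... pair in a ';'-delimited string."""
    result = []
    for pair in instruments.split(";"):      # first split: key=value pairs
        key, sep, value = pair.partition("=")  # second split: key vs. value
        if sep and key == "inst":
            result.append(value)
    return result


if __name__ == "__main__":
    sample = "band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false"
    print(extract_instruments(sample))
```

If the real data mixes commas and semicolons as in the question's sample, the delimiters would need to be normalized first.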
Consider the few options below (just to demonstrate different techniques here)
Option 1
select bandId,
( select string_agg(split(kv, '=')[offset(1)])
from unnest(split(instruments, ';')) kv
where split(kv, '=')[offset(0)] = 'inst'
) as insts
from `inventory.band.mytable`
Option 2 (for obvious reasons this one would be my choice)
select bandId,
array_to_string(regexp_extract_all(instruments, r'inst=([^;$]+)'), ',') instrs
from `inventory.band.mytable`
If applied to sample data in your question - output in both cases is
I was working with the "<@" operator and two arrays of strings.
anyarray <@ anyarray → boolean
Every string is formed in this way: ${name}_${number}, and I would like to check if the name part is included and the number is equal or lower than the one in the other array.
['elementOne_10'] & [['elementOne_7' , 'elementTwo20']] → true
['elementOne_10'] & [['elementOne_17', 'elementTwo20']] → false
what would be an efficient way to do this?
Assuming your sample data elementTwo20 in fact follows your described schema and should be elementTwo_20:
step-by-step demo:db<>fiddle
SELECT
id
FROM (
SELECT
*,
split_part(u, '_', 1) as name, -- 3
split_part(u, '_', 2)::int as num,
split_part(compare, '_', 1) as comp_name,
split_part(compare, '_', 2)::int as comp_num
FROM
t,
unnest(data) u, -- 1
(SELECT unnest('{elementOne_10}'::text[]) as compare) s -- 2
)s
GROUP BY id -- 4
HAVING
ARRAY_AGG(name) @> ARRAY_AGG(comp_name) -- 5
AND MAX(comp_num) BETWEEN MIN(num) AND MAX(num)
1. unnest() your array elements into one element per record
2. JOIN and unnest() your comparison data
3. Split the element strings into their name and num parts
4. unnest() creates several records per original array; they can be grouped by an identifier (best is an id column)
5. Filter with your criteria in the HAVING clause: compare the name parts, for example with array operators; for the BETWEEN comparison you can use MIN and MAX on the num part.
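As a cross-check of the rule stated in the question (the name must be present, and the stored number must be equal to or lower than the one being compared), here is a small Python sketch. It is not a translation of the HAVING clause above, which uses MIN/MAX and BETWEEN; the function names are hypothetical, and it assumes every element follows the name_number format.

```python
def parse(element):
    """Split 'name_number' into (name, int(number)) at the last underscore."""
    name, _, num = element.rpartition("_")
    return name, int(num)


def matches(compare, data):
    """True if every compare element's name occurs in data with a number <= its own."""
    lowest = {}
    for element in data:
        name, num = parse(element)
        # remember the smallest number seen for each name
        lowest[name] = min(num, lowest.get(name, num))
    for element in compare:
        name, num = parse(element)
        if name not in lowest or lowest[name] > num:
            return False
    return True


if __name__ == "__main__":
    print(matches(["elementOne_10"], ["elementOne_7", "elementTwo_20"]))
    print(matches(["elementOne_10"], ["elementOne_17", "elementTwo_20"]))
```

Working on parsed (name, number) pairs is also what the normalized table design suggested in the note amounts to.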
Note:
As @a_horse_with_no_name correctly mentioned: If possible think about your database design and normalize it:
Don't store arrays -> You don't need to unnest them on every operation
Relevant data should be kept separated, not concatenated as a string -> You don't need to split them on every operation
id | name | num
---------------------
1 | elementOne | 7
1 | elementTwo | 20
2 | elementOne | 17
2 | elementTwo | 20
This is exactly the result of the inner subquery. You have to create this every time you need these data. It's better to store the data like this.
I'd like to create a query that returns a column with a repeating number sequence in it.
For example:
row_num | repeat
----------------
1 | 1
2 | 2
3 | 3
4 | 1
5 | 2
6 | 3
I'm struggling to understand how I could achieve this with BigQuery Standard SQL.
So far I've generated the row number (ROW_NUMBER() OVER()) as row_num in my select, and then I was thinking I could use a modulus function to determine the repeat number, but I thought this would split it into several separate columns, so I'd need additional steps to merge them into one column. I wondered if there was a more elegant way of achieving this.
Many Thanks!
In fact, the modulus should work here. Assuming your table already has a row_num column, and you want to generate the repeat column, you may try:
SELECT
row_num,
MOD(row_num - 1, 3) + 1 AS repeat
FROM yourTable
ORDER BY
row_num;
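The arithmetic behind MOD(row_num - 1, 3) + 1 can be checked with a short Python sketch (the function name and the cycle parameter are illustrative only). Subtracting 1 first shifts the row numbers to start at 0, so the remainder cycles through 0, 1, 2; adding 1 back shifts that to 1, 2, 3.

```python
def repeat_value(row_num, cycle=3):
    """Map 1-based row numbers onto the repeating sequence 1..cycle."""
    return (row_num - 1) % cycle + 1


if __name__ == "__main__":
    for row_num in range(1, 7):
        print(row_num, repeat_value(row_num))
```

A plain row_num % 3 would instead produce 1, 2, 0, which is why the shift by one is needed.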
I am trying to extract a set of numbers from comments like
"on april-17 transactions numbers are 12345 / 56789"
"on april-18 transactions numbers are 56789"
"on may-19 no transactions"
Which are stored in a column called "com" in table comments
My requirement is to get the numbers of a specific length. In this case the length is 5, so 12345 and 56789 should be extracted separately from the above string. It is possible to have zero five-digit numbers, or more than two.
I tried using regexp_replace with the following result. I am trying to find an efficient regex or other method to achieve this.
select regexp_replace(com, '[^0-9]',' ', 'g') from comments;
regexp_replace
----------------------------------------------------
17 12345 56789
I expect the result to get only
column1 | column2
12345 56789
There is no easy way to create a query that returns an arbitrary number of columns: it cannot produce one column when there is one number and two columns on the next run.
For fixed two columns:
demo:db<>fiddle
SELECT
matches[1] AS col1,
matches[2] AS col2
FROM (
SELECT
array_agg(regexp_matches[1]) AS matches
FROM
regexp_matches(
'on april-17 transactions numbers are 12345 / 56789',
'\d{5}',
'g'
)
) s
regexp_matches() returns all matches, one row per match
array_agg() puts all elements into one array
The array elements can then be returned as separate columns.
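The same "find all matches, then pick fixed positions" idea looks like this in Python. This is a sketch, not PostgreSQL code; the names are hypothetical. As an added assumption it uses lookarounds so that only runs of exactly five digits match, whereas a bare \d{5} would also match the first five digits of a longer number.

```python
import re

# Exactly five digits, not preceded or followed by another digit (assumption)
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")


def five_digit_columns(comment, width=2):
    """Return a fixed-width tuple of five-digit numbers, padded with None."""
    found = FIVE_DIGITS.findall(comment)[:width]
    return tuple(found + [None] * (width - len(found)))


if __name__ == "__main__":
    comments = [
        "on april-17 transactions numbers are 12345 / 56789",
        "on april-18 transactions numbers are 56789",
        "on may-19 no transactions",
    ]
    for com in comments:
        print(five_digit_columns(com))
```

Padding with None mirrors how matches[2] yields NULL in the query when there is only one match.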
I have one table that stores a range of integers in a field, sort of like a print range, (e.g. "1-2,4-7,9-11"). This field could also contain a single number.
My goal is to join this table to a second one that has discrete values instead of ranges.
So if table one contains
1-2,5
9-15
7
And table two contains
1
2
3
4
5
6
7
8
9
10
The result of the join would be
1-2,5 1
1-2,5 2
1-2,5 5
7 7
9-15 9
9-15 10
Working in SQL Server 2008 R2.
Use a string split function of your choice to split on comma. Figure out the min/max values and join using between.
SQL Fiddle
MS SQL Server 2012 Schema Setup:
create table T1(Col1 varchar(10))
create table T2(Col2 int)
insert into T1 values
('1-2,5'),
('9-15'),
('7')
insert into T2 values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
Query 1:
select T1.Col1,
T2.Col2
from T2
inner join (
select T1.Col1,
cast(left(S.Item, charindex('-', S.Item+'-')-1) as int) MinValue,
cast(stuff(S.Item, 1, charindex('-', S.Item), '') as int) MaxValue
from T1
cross apply dbo.Split(T1.Col1, ',') as S
) as T1
on T2.Col2 between T1.MinValue and T1.MaxValue
Results:
| COL1 | COL2 |
----------------
| 1-2,5 | 1 |
| 1-2,5 | 2 |
| 1-2,5 | 5 |
| 9-15 | 9 |
| 9-15 | 10 |
| 7 | 7 |
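The logic of the query (split on commas, derive a min and max per item, then join with BETWEEN) can be sketched in Python to make the bounds handling explicit. This is only an illustration of the approach, not T-SQL, and the function names are made up.

```python
def parse_bounds(item):
    """Turn '9-15' into (9, 15) and a single value like '7' into (7, 7)."""
    low, sep, high = item.partition("-")
    if sep:
        return int(low), int(high)
    return int(low), int(low)


def join_ranges(range_rows, values):
    """Pair each range string with every value that falls inside one of its items."""
    joined = []
    for spec in range_rows:
        for item in spec.split(","):       # split on comma, like dbo.Split
            low, high = parse_bounds(item)
            # equivalent of: Col2 BETWEEN MinValue AND MaxValue
            joined.extend((spec, v) for v in values if low <= v <= high)
    return joined


if __name__ == "__main__":
    for row in join_ranges(["1-2,5", "9-15", "7"], range(1, 11)):
        print(row)
```

A single number is treated as a range whose min and max are equal, which is what the charindex/stuff expressions in the query achieve.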
Like everybody has said, this is a pain to do natively in SQL Server. If you must then I think this is the proper approach.
First determine your rules for parsing the string, then break down the process into well-defined and understood problems.
Based on your example, I think this is the process:
1. Separate the comma-separated values in the string into rows
2. If the data does not contain a dash, then it's finished (it's a standalone value)
3. If it does contain a dash, parse the left and right sides of the dash
4. Given the left and right sides (the range), expand all the values between them into rows
I would create a temp table to populate the parsing results into which needs two columns:
SourceRowID INT, ContainedValue INT
and another to use for intermediate processing:
SourceRowID INT, ContainedValues VARCHAR
Parse your comma-separated values into their own rows using a CTE like this. Step 1 is now a well-defined and understood problem to solve:
Turning a Comma Separated string into individual rows
So your result from the source
'1-2,5'
will be:
'1-2'
'5'
From there, SELECT from that processing table where the field does not contain a dash. Step 2 is now a well-defined and understood problem to solve: these are standalone numbers and can go straight into the results temp table. The results table should also get the ID reference to the original row.
Next would be to parse the values to the left and right of the dash using CHARINDEX to locate it, then the appropriate LEFT and RIGHT functions as needed. This will give you the starting and ending value.
Here is a relevant question for accomplishing this. Step 3 is now a well-defined and understood problem to solve:
T-SQL substring - separating first and last name
Now you have separated the starting and ending values. Use another function which can explode this range. Step 4 is now a well-defined and understood problem to solve:
SQL: create sequential list of numbers from various starting points
SELECT all N between @min and @max
What is the best way to create and populate a numbers table?
and, also, insert it into the temp table.
Now what you should have is a temp table with every value in the exploded range.
Simply JOIN that to the other table on the values now, then to your source table on the ID reference and you're there.
My suggestion is to add one more field and many more records to your ranges table. Specifically, the primary key would be the integer and the other field would be the range. Records would look like this:
number range
1 1-2,5
2 1-2,5
3 na
4 na
5 1-2,5
etc
Having said that, this is still rather limiting because a number can only have one range. If you want to be thorough, set up a many to many relationship between numbers and ranges.
As far as I can tell, your best option is something like the following:
Create a table-valued function that accepts your ranges and converts them to a collection of ints. So 1-3,5 would return:
1
2
3
5
Then use these results to join to other tables. I don't have an exact function to do this at hand, but this one seems like an excellent start.
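What such a function would need to do can be sketched in Python. This is a hypothetical helper to show the expected behaviour, not the T-SQL function the answer refers to, and it assumes well-formed input such as "1-3,5".

```python
def expand_ranges(spec):
    """Expand a print-range style string like '1-3,5' into a list of ints."""
    numbers = []
    for item in spec.split(","):
        low, sep, high = item.strip().partition("-")
        start = int(low)
        end = int(high) if sep else start   # a single number is its own range
        numbers.extend(range(start, end + 1))
    return numbers


if __name__ == "__main__":
    print(expand_ranges("1-3,5"))
    print(expand_ranges("1-2,4-7,9-11"))
```

Once each range is expanded to discrete integers, the join against the second table becomes an ordinary equality join.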