BigQuery regex: extracting only the numbers that are followed by certain text (unit) - sql

I have a table that contains product names. I want to extract only the numbers from them, but only the numbers that are followed by a unit (certain text), e.g. gr, kg, ml, pcs.
product name | Extracted
milk 30ml | 30
Cigarette 20pcs | 20
Sugar 50gr | 50
1990 chocolate 10gr | 10
Is there any way to get only the number that is followed by the text we want? I only know how to extract all the numbers, which gives the wrong result for the last product.
Thank you

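We can use REGEXP_EXTRACT here with a capture group. The capture group returns only the digits, and the non-capturing group that follows requires a unit immediately after them, so a bare number such as 1990 is skipped:
SELECT product, REGEXP_EXTRACT(product, r'([0-9]+)(?:l|ml|kg|gr|g|mg|[a-z]+s)\b') AS Extracted
FROM yourTable;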
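To check the pattern outside BigQuery, here is a minimal Python sketch. It assumes the pattern behaves the same in Python's re module as in BigQuery's RE2 engine, which should hold for the simple syntax used here. The function name extract_quantity is made up for illustration, and kg is listed explicitly in the alternation because the question mentions it.

```python
import re

# Same idea as the REGEXP_EXTRACT call: capture the digits, then require a unit.
# "kg" is listed explicitly because the question mentions it as a unit.
UNIT_PATTERN = re.compile(r"([0-9]+)(?:l|ml|kg|gr|g|mg|[a-z]+s)\b")

def extract_quantity(product):
    """Return the digits that directly precede a unit, or None if there are none."""
    match = UNIT_PATTERN.search(product)
    return match.group(1) if match else None

# Sample product names from the question.
products = ["milk 30ml", "Cigarette 20pcs", "Sugar 50gr", "1990 chocolate 10gr"]
extracted = [extract_quantity(p) for p in products]
print(extracted)
```

Like REGEXP_EXTRACT, this returns only the first match in each string, so a name containing two quantities yields just the first one.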

Big Query -- Reorder elements within a delimited string by another delimiter

Summary
I'd like to reorder elements in a string, the elements are delimited by new lines.
The elements should be ordered by a string that can contain numbers or letters. This sort key is not at the beginning of each element; rather, each element is itself a delimited string (messy data set, I know). To make this even messier, there is an extra new line at the end, though that doesn't seem to be the crux of the issue.
Example
Below is a simplified version of what I'd like to do. I have a table, and I'd like to sort students' favorite shows and characters by the show's name, which is the second element of a pipe-delimited string.
student: alice
favorite characters and shows:
    10th doctor | dr who
    troy | community

student: bob
favorite characters and shows:
    11 | stranger things
    Liz | 30 Rock
    mr peanut butter | bojack horseman

would become this:

student: alice
favorite characters and shows:
    troy | community
    10th doctor | dr who

student: bob
favorite characters and shows:
    Liz | 30 Rock
    mr peanut butter | bojack horseman
    11 | stranger things
What I've tried
Big Query doesn't allow arrays of arrays. If it did, I would have an easier time here. I've tried working with COLLATE but today is my first time seeing that function; I'm not sure that is the right way to go, anyways.
Currently, I'm working to split by new line, and rejoin later. I have never done this with tables, so I'm a bit out of my element. Here is the query I'm working from:
WITH
  -- example data from above
  example_data AS (
    SELECT
      'alice' AS student,
      -- note: the new line is at the end of every pipe-delimited line, so there is always some floating empty row when using functions like split()
      '10th doctor | dr who\ntroy | community\n' AS favorite_characters_and_shows
    UNION ALL
    SELECT
      'bob' AS student,
      "11 | stranger things\nLiz | 30 Rock\nmr peanut butter | bojack horseman\n" AS favorite_characters_and_shows
  ),
  -- I have no need for this to be another table, but it is where I am. Tell me if this is misguided, please.
  soln_table AS (
    SELECT
      example_data.student,
      example_data.favorite_characters_and_shows,
      SPLIT(example_data.favorite_characters_and_shows, '\n'),
      array( select x from unnest(SPLIT(example_data.favorite_characters_and_shows, '\n') ) as x order by x) as foo,
    FROM
      example_data
  )
-- where I am trying to display a sorted solution
SELECT
  *
FROM
  soln_table;
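Consider the approach below:
select student, (
    select string_agg(line, '\n' order by split(line, '|')[safe_offset(1)])
    from unnest(split(favorite_characters_and_shows, '\n')) line
    where trim(line) != ''
  ) as favorite_characters_and_shows
from example_data
Applied to the sample data in your question, this returns each student's lines sorted by show name, matching the desired output shown there. The where clause drops the empty element left by the trailing new line.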
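The same split, filter, sort and rejoin steps can be sketched in Python to make the logic easier to follow. This is only an illustration of the idea, not BigQuery code: the function name sort_lines_by_show is hypothetical, and treating a line without a pipe as having an empty sort key is a simplification of what safe_offset(1) does (it returns NULL).

```python
def sort_lines_by_show(text):
    """Sort newline-delimited 'character | show' lines by the show name."""
    # Drop the empty element produced by the trailing newline.
    lines = [line for line in text.split("\n") if line.strip() != ""]

    def show_key(line):
        parts = line.split("|")
        # Simplified stand-in for safe_offset(1): empty key when there is no pipe.
        return parts[1] if len(parts) > 1 else ""

    # Rejoin the lines in show order, without the trailing newline.
    return "\n".join(sorted(lines, key=show_key))

# Sample data from the question.
example_data = {
    "alice": "10th doctor | dr who\ntroy | community\n",
    "bob": "11 | stranger things\nLiz | 30 Rock\nmr peanut butter | bojack horseman\n",
}
sorted_data = {student: sort_lines_by_show(text) for student, text in example_data.items()}
print(sorted_data)
```

Note that the sort key keeps the space after the pipe, as in the query above. That is harmless here because every line has it.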

How to generate a dummy variable in Stata based on a sub-string of an existing string variable?

I am looking for a way to create a dummy variable that checks a string variable called text against several given substrings, such as "book", "buy", and "journey".
I want to check whether an observation contains book, buy, or journey. If one of these keywords is found in the string, the dummy variable should be 1, otherwise 0.
An example:
TEXT
Book your tickets now
Swiss is making your journey easy
Buy your holiday tickets now!
A touch of Austria in your lungs.
The desired outcome should be
dummy variable
1
1
1
0
I tried it with strpos and also regexm with very limited results.
Regards,
Johi
Using strpos may be tedious because you have to take capitalization into account, so I would use regular expressions.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 text
"Book your tickets now"
"Swiss is making your journey easy"
"Buy your holiday tickets now!"
"A touch of Austria in your lungs."
end
generate wanted = regexm(text, "[Bb]ook|[Bb]uy|[Jj]ourney")
list
Result:
. list
     +--------------------------------------------+
     |                              text   wanted |
     |--------------------------------------------|
  1. |             Book your tickets now        1 |
  2. | Swiss is making your journey easy        1 |
  3. |     Buy your holiday tickets now!        1 |
  4. | A touch of Austria in your lungs.        0 |
     +--------------------------------------------+
See also this link for info on regular expressions.
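For readers who want to see the matching logic outside Stata, here is a rough Python equivalent. It is a sketch under one stated difference: it uses a fully case-insensitive match rather than the [Bb]ook style classes above, so it would also match spellings such as "BOOK". The name keyword_dummy is made up for this example.

```python
import re

# Case-insensitive alternation over the three keywords.
KEYWORDS = re.compile(r"book|buy|journey", re.IGNORECASE)

def keyword_dummy(text):
    """Return 1 if any keyword occurs anywhere in the text, otherwise 0."""
    return 1 if KEYWORDS.search(text) else 0

# Sample observations from the question.
texts = [
    "Book your tickets now",
    "Swiss is making your journey easy",
    "Buy your holiday tickets now!",
    "A touch of Austria in your lungs.",
]
wanted = [keyword_dummy(t) for t in texts]
print(wanted)
```

As with the regexm call, this is a substring match, so a word that merely contains a keyword (for example "Facebook") would also be flagged.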

Search and match indexes in two different columns, return the sum of a third column - Postgresql

I have a table called "tax_info" that stores the land tax information for my city. It looks like this:
taxpayer_code | condominium_num | lot_area | built_area
-------------------------------------------------------------
0010030078-2 | 00-0 | 143 | 130
0010030079-1 | 02-7 | 283 | 57
0010030080-1 | 02-7 | 283 | 48
0010030081-1 | 02-7 | 283 | 50
The first 3 digits of the taxpayer code refer to the city district and the next 3 to the block within the district. The next 4 refer to the lot within the block if the condo number is 00-0, or to an apartment, store, etc. if the condo number is anything other than 00-0, in which case all rows with the same condo number refer to the same lot within the block.
What I want to do is pass a list of "taxpayer_code" values and get the "lot_area" and "built_area" for the lots. The problem is that if a person lives in a condo, their apartment is only a fraction of the total built area for the lot. So, if I search for code 0010030078% (the -X number doesn't matter), the result is:
Lot Area = 143 and Built Area = 130
But if I search for 0010030080%, the result I expect is:
Lot Area = 283 and Built Area = 155
And if I search for 0010030078% and 0010030079%, the result is:
Lot Area = 426 and Built Area = 285
So the database should take the taxpayer codes and check, for each code passed, whether the condominium number is different from 00-0. If it is, it should add to the sum all the other taxpayer codes that share the same condo number within the same district and block. (Ideally, a warning should be returned if tax codes belonging to different districts or blocks are passed, and if more tax codes are added to the sum, a listing of all the codes added would be nice, but it's okay if that's too much of a hassle!)
I am new to SQL and can't wrap my head around this. I appreciate any help you can give me, thanks!
Hmmm . . . use a subquery and window functions to add up the values that you want:
select ti.*
from (select ti.*,
             -- condos are grouped by district and block (first 6 digits) plus condo number
             (case when condominium_num <> '00-0'
                   then sum(built_area) over (partition by left(taxpayer_code, 6), condominium_num)
                   else built_area
              end) as real_built_area
      from tax_info ti
     ) ti
where . . .
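The grouping rule described in the question can also be written out in plain Python, which may make it easier to check the expected totals by hand. This is a sketch of the logic only, not a replacement for the query: the helper names lot_key and totals_for are hypothetical, and it assumes that every row on the same lot carries the same lot_area, as in the sample data.

```python
from collections import defaultdict

# Sample rows from the question: (taxpayer_code, condominium_num, lot_area, built_area)
TAX_INFO = [
    ("0010030078-2", "00-0", 143, 130),
    ("0010030079-1", "02-7", 283, 57),
    ("0010030080-1", "02-7", 283, 48),
    ("0010030081-1", "02-7", 283, 50),
]

def lot_key(code, condo):
    """Identify the physical lot that a row belongs to."""
    if condo == "00-0":
        return ("lot", code[:10])        # district + block + lot digits
    return ("condo", code[:6], condo)    # district + block, shared condo number

def totals_for(prefixes, rows=TAX_INFO):
    """Count lot_area once per lot and sum built_area over every unit on that lot."""
    lot_area = {}
    built_area = defaultdict(int)
    for code, condo, lot, built in rows:
        key = lot_key(code, condo)
        lot_area[key] = lot
        built_area[key] += built
    # Lots touched by at least one of the requested code prefixes.
    wanted = {lot_key(code, condo)
              for code, condo, _, _ in rows
              if any(code.startswith(p) for p in prefixes)}
    return (sum(lot_area[k] for k in wanted), sum(built_area[k] for k in wanted))

print(totals_for(["0010030078"]))
print(totals_for(["0010030080"]))
print(totals_for(["0010030078", "0010030079"]))
```

Collecting the lots in a set means that passing two apartments from the same condo does not count that lot twice.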

OpenRefine split column with repetitive values

I have a single column in OpenRefine like this:
Title
A Star is born
Author
George Cukor
Date
1954
Other tags...
The data for each item begins with the name of a tag (Title, Author, Date, etc.) followed by its value. Every tag and every value is on its own row, one after another, for around ten thousand rows.
I would like to have as many columns as there are tags and as many rows as there are items, each containing title, date, author, etc., something like this:
Title | Author | Date | etc.
A Star is born | George Cukor | 1954 | etc.
Any ideas?
Thanks
Starting from your original single-column dataset, use "Transpose --> Transpose cells in rows into columns" (leaving option 2 as default). You will get two columns, with the tag names in the first and their values in the second.
Then, on the first column, apply "Transpose --> Columnize by key/value columns" and don't change the default options there either. The final result has one column per tag and one row per item.
This will obviously work with more tags/columns, but only if each of them is followed by a single value.
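To illustrate what the two transpose steps accomplish, here is a small Python sketch of the same reshaping. It is not how OpenRefine implements it: the function name columnize is made up, the second item in the sample data is invented for the example, and starting a new record whenever a tag repeats is an assumption about how items are delimited.

```python
def columnize(cells):
    """Pair alternating tag/value cells, then build one record per item."""
    # Step 1: every two consecutive cells form a (tag, value) pair.
    pairs = list(zip(cells[0::2], cells[1::2]))
    # Step 2: start a new record whenever a tag repeats (assumed item boundary).
    records = []
    current = {}
    for tag, value in pairs:
        if tag in current:
            records.append(current)
            current = {}
        current[tag] = value
    if current:
        records.append(current)
    return records

# First item is from the question; the second item is made up for illustration.
cells = [
    "Title", "A Star is born", "Author", "George Cukor", "Date", "1954",
    "Title", "Second example title", "Author", "Example Author", "Date", "2000",
]
records = columnize(cells)
print(records)
```

As the answer notes, this only works when each tag is followed by exactly one value; a missing value would shift every later pair.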

SQL Query to find a record with a certain number of specific characters (4 vertical bars)

In the database I'm working with, we have a certain naming convention that should be followed and I'm trying to fix errors.
For example, one of the correctly named records would be:
Blue Insurance | Blue Agency | Blue Agency | BL26 | Blue Insurance
Is there any way to search for all records that do not have 4 vertical bars in them?
Thanks!
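select * from tablename
where length(colname) - length(replace(colname,'|','')) <> 4
The difference between the two lengths is the number of '|' characters in the column. Change length according to the database being used (for example, SQL Server uses len).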
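The query can be tried quickly against an in-memory SQLite database from Python, since SQLite also provides length and replace. This is only a sketch for experimenting: tablename and colname are the placeholder names from the answer, and the second and third records are made up so that there is something to find.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (colname TEXT)")
conn.executemany(
    "INSERT INTO tablename (colname) VALUES (?)",
    [
        # Correctly named record from the question (4 vertical bars).
        ("Blue Insurance | Blue Agency | Blue Agency | BL26 | Blue Insurance",),
        # Made-up records that break the convention.
        ("Red Insurance | Red Agency | RD11",),
        ("Green Insurance",),
    ],
)

# Rows whose bar count (length difference after removing '|') is not 4.
bad_records = [row[0] for row in conn.execute(
    "SELECT colname FROM tablename "
    "WHERE length(colname) - length(replace(colname, '|', '')) <> 4 "
    "ORDER BY colname"
)]
print(bad_records)
```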