Using Google BigQuery SQL, split the string in a column into multiple columns without breaking words - google-bigquery

Is there any solution in BigQuery to split a string column of up to 1500 characters into multiple columns of 264 characters each, without breaking/splitting words?

Regular expressions are a good way to accomplish this task. However, BigQuery is still quite limited in its support for regular expressions. Therefore, I would suggest solving this with a UDF and JavaScript. A JavaScript solution can be found here:
https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript
Adapting this solution to BigQuery:
The function string_split expects the chunk size in characters and the text to be split. It returns an array with the chunks. A chunk can be up to two characters longer than the given size because the pattern anchors a non-space character on each end.
CREATE TEMP FUNCTION string_split(size INT64, str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  const regraw = '\\S.{3,' + size + '}\\S(?= |$)';
  const regex = new RegExp(regraw, 'g');
  return str.match(regex);
""";
SELECT text, split_text,
  #length(split_text)
FROM (
  SELECT
    text, string_split(20, text) AS split_text
  FROM (
    SELECT "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
    UNION ALL SELECT "This is a short text. And can be splitted as well."
  )
)
#, unnest(split_text) AS split_text #
Please uncomment the two commented lines to split the text from the array into individual rows.
This also works for larger datasets; the following query took less than two minutes:
CREATE TEMP FUNCTION string_split(size INT64, str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  const regraw = '\\S.{3,' + size + '}\\S(?= |$)';
  const regex = new RegExp(regraw, 'g');
  return str.match(regex);
""";
SELECT text, split_text,
  length(split_text)
FROM (
  SELECT
    text, string_split(40, text) AS split_text
  FROM (
    SELECT abstract AS text FROM `bigquery-public-data.breathe.jama`
  )
), unnest(split_text) AS split_text
ORDER BY 3 DESC

Consider the approach below:
create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
  var arr = [];
  var part = '';
  for (let i = 0; i < parts.length; i++) {
    if (part.length + parts[i].length < max_len) { part += parts[i]; }
    else { arr.push(part); part = parts[i]; }
  }
  arr.push(part);
  return arr;
""";
select * from (
  select id, offset, part
  from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))
If applied to dummy data with split size = 50, the output has each row's text split into chunks of at most 50 characters, pivoted into columns part_0 through part_3.
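A minimal self-contained sketch of the same query, assuming made-up dummy rows in a your_table CTE (the two sample strings below are invented for illustration, not from the original post):

create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
  var arr = [];
  var part = '';
  for (let i = 0; i < parts.length; i++) {
    if (part.length + parts[i].length < max_len) { part += parts[i]; }
    else { arr.push(part); part = parts[i]; }
  }
  arr.push(part);
  return arr;
""";
-- dummy rows invented for illustration only
with your_table as (
  select 1 as id, 'Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns' as col
  union all
  select 2, 'This is a short text. And can be split as well.'
)
select * from (
  select id, offset, part
  from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))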

Non-regexp Approach
DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  let bin = 0;
  a.reduce((c, v) => {
    if (c + Number(v) > 264) { bin += 1; return Number(v); }
    else return c += Number(v);
  }, 0);
  return bin;
""";
WITH splits AS (
SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin
FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
SELECT bin, STRING_AGG(w, ' ') AS segment
FROM splits
GROUP BY 1
) PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes.
Length of each segment:
segment_0: 261
segment_1: 262
segment_2: 261
segment_3: 57
Regexp Approach
[note] The expression below, (.{1,264}\b), is simple, but a word boundary doesn't include a period (.), so the result can contain errors; you can see that the last period (.) in segment_3 is missing. But under certain circumstances this might be useful, I think.
SELECT * FROM (
SELECT *
FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
Query results:
segment_0: It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
segment_1: clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
segment_2: tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
segment_3: its spuds when your potato comes with a side of potatoes
Length of each segment:
segment_0: 261
segment_1: 262
segment_2: 261
segment_3: 56

Related

find frequency (times of occurrence) of a (list of) substring (which are elements in list of dictionaries) in another string

I would like to find the frequency (number of occurrences) of substrings (which are elements in a list of dictionaries that determine their categories) in another string.
See the sample input and output below.
Find the number of repetitions of each element of st in the strings named stgs.
The code:
def freqcounter(st, stgs):
    """
    st: A mapping of st name to st keyword list.
    :type st: dict of str -> list
    stgs: A list of strings (type: str)
    :return: A mapping of st name to st occurrences in the list of stgs.
    :rtype: dict of str -> int
    """
    stgs = str(stgs).split(" ") or str(stgs).split(' ')
    dic = {}
    count = 0
    for k, v in st.items():
        for i in range(len(v)):
            for j in range(len(stgs)):
                if v[i] == stgs[j]:
                    count += 1
        dic[k] = count
    return dic
if __name__ == '__main__':
    stgs = ['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
    st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
    outtts = freqcounter(st, stgs)
    outtts = sorted(list(outtts.items()))
    for outtt in outtts:
        print(outtt[0])
        print(outtt[1])
sample input:
# For instance, if `stgs` is the input, I would like to count the frequency of each `st` in this text.
stgs=['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
I would like to calculate 2 cases:
1. Do not consider word+suffix the same as the word. For instance: do not count lower, higher as low, high.
sample output:
'Height': 3 , 'Topic Work': 7, 'John Smith': 1
because 'height', 'high', and 'low' (the elements of 'Height') were found 3 times; 'tree' and 'trees' (the elements of 'Topic Work') were found 7 times; and 'John Smith' (the element of 'John Smith') was found 1 time.
2. Consider word+suffix the same as the word. For instance: count lower, higher as low, high.
sample output:
'Height': 5 , 'Topic Work': 7, 'John Smith': 1
My expectation is to show how many of each of them are found.

get prefix out of a size range with different size formats

I have a column in a df with a size range in different size formats.
artikelkleurnummer size
6725 0161810ZWA B080
6726 0161810ZWA B085
6727 0161810ZWA B090
6728 0161810ZWA B095
6729 0161810ZWA B100
The size range also contains other size formats like XS - XXL, 36-50, 36/38 - 52/54, ONE, XS/S - XL/XXL, 363-545.
I have tried to get the prefix '0' out of all sizes that start with a letter in the range (A:K). For example: change B080 into B80; B100 stays B100.
steps:
1. look for items in column ['size'] whose first letter is in range (A:K)
2. if True, change the second position in the string into ''
for range I use:
from string import ascii_letters
def range_alpha(start_letter, end_letter):
    return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) + 1]
then I've tried a for loop
for items in df['size']:
    if df.loc[df['size'].str[0] in range_alpha('A','K'):
        df.loc[df['size'].str[1] == ''
message
SyntaxError: unexpected EOF while parsing
what's wrong?
You can do it with regex and the pd.Series.str.replace -
import pandas as pd

df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer size".split()
replacement = lambda mpat: ''.join(g for g in mpat.groups() if mpat.groups().index(g) != 1)
# regex=True is required for a callable replacement in recent pandas versions
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d+)', replacement, regex=True)
Output
artikelkleurnummer size size_cleaned
0 0161810ZWA B080 B80
1 0161810ZWA B085 B85
2 0161810ZWA B090 B90
3 0161810ZWA B095 B95
4 0161810ZWA B100 B100
TL;DR
Find a pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.
Slightly longer explanation
Regexes are very handy but also hard. In the solution above, we are trying to find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts -
A letter from A-K
Zero or more 0's
Some more digits
In regex terms, this can be written as r'([a-kA-K])(0*)(\d+)'. Note that the 3 sets of brackets make up the 3 parts; they are called groups. This might make little or no sense depending on how much exposure you have had to regexes in the past, but you can get it from any introduction to regexes online.
Once we have the parts, what we want to do is retain everything else except part-2, which is the 0s.
The pd.Series.str.replace documentation has the details on the replacement portion. In essence, replacement is a function that takes the regex match as input and produces an output string.
In the first part we identified three groups or parts. These groups are accessed with the mpat.groups() method, which returns a tuple containing the match for each group. We want to reconstruct a string with the middle part excluded, which is what the replacement function does.
sizes = [{"size": "B080"}, {"size": "B085"}, {"size": "B090"}, {"size": "B095"}, {"size": "B100"}]

def range_char(start, stop):
    return (chr(n) for n in range(ord(start), ord(stop) + 1))

for s in sizes:
    if s['size'][0].upper() in range_char("A", "K"):
        s['size'] = s['size'][0] + s['size'][1:].lstrip('0')

print(sizes)
Using a list of dicts here as an example.

Extract words from the text in Pyspark Dataframe

I have a dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
     {'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20, 31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31] |Mom called dad, and when he came home, he took moms car and drove to the store |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I needed to extract the words from the text column using the begin_end column array, like text[111:120+1]. In pandas, this could be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite the zip logic in PySpark to get new_col? Do I need to write a UDF for this?
You can do so by using substring in an expression. It expects the string you want to substring, a starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't take a column as the starting position or length.
from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+

SQL Server 2008 Split 1 column into 3 (Shipping dimensions)

I have a table with one column for product dimensions.
example rows:
16" L x 22" W x 6" H
22.5" L x 12" W x 9" H
I am trying to get the length, width, and height into separate columns. I have to use SQL because this is being used in a software integration that only accepts SQL statements. I am thinking I have to go the route of regex.
SQL Statement to get the data so far
SELECT TOP 10
[ID]
,substring([SHIP_DIMENSIONS],PATINDEX('%[0-9]\"%',[SHIP_DIMENSIONS]),2) as Length
,substring([SHIP_DIMENSIONS],PATINDEX('%[0-9]\"%',[SHIP_DIMENSIONS]),2) as Width
,substring([SHIP_DIMENSIONS],PATINDEX('%[0-9]*\"%',[SHIP_DIMENSIONS]),2) as Height
FROM [PART]
I need the output to be
Length | Width | Height
16 | 22 | 6
22.5 | 12 | 9
Any suggestions would be greatly appreciated.
One way to do it is as follows:
select
left(dim, charindex('" L', dim)-1) as [Length]
, substring(dim, charindex('" L', dim)+6, charindex('" W', dim)-charindex('" L x ', dim) - 6) as [Width]
, substring(dim, charindex('" W', dim)+6, charindex('" H', dim)-charindex('" W x ', dim) - 6) as [Height]
from test
The idea is to look for the markers that you have in your text, and use them to parcel out the string into substrings. This approach is very rigid, in that it assumes the pattern shown in your example is followed precisely in all your records, i.e. all the markers are present, along with the spaces. There is also an implicit assumption that all dimensions are in inches. What can vary is the number of digits in each dimension.
Note: I am assuming that you are dealing with a legacy database, so there is no way to do the right thing (which is to separate out the dimensions into separate columns).

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink"
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for a power-law approximation when you can draw from the actual distribution?
I suggest you alter the LastNames table to include a numeric column containing a value that represents the number of individuals with a more common name. You'll probably want a number on a smaller but proportional scale, say 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(1000000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That's picking the first name whose number does not exceed the [uniformly distributed] random number. Note how the query uses less-than and ordering in descending order; this will guarantee that the very first entry (Smith) can get picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, along with the two other responses so far, is the same as the one suggested in dmckee's answer to the question referenced in this question. Essentially the idea is to use the CDF (cumulative distribution function).
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power-law function which conveys the "long tail" shape of the real distribution. You may want to tweak the @PwrCoef value (which BTW needn't be an integer); essentially, the bigger the coefficient, the more skewed towards the beginning of the list the function is.
DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)
Notes:
- the extra ".0" in the expression above is important to force SQL to perform floating-point operations rather than integer operations.
- the reason we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is to the end of our scale, the more likely it is to be drawn. The list of family names being sorted in reverse order (most likely names first), we need this subtraction.
Assuming a power of, say, 3 the query would then look something like
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
Which is the query from the question except for the last line.
Re-Edit:
In looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers, i.e. to perform an equal number of draws in each of, say, three thirds (or four quarters or...) of the cumulative distribution; within each of these parts of the list, we would draw using a power-law function, possibly with the same coefficient, but with different ranges.
For example
Assuming thirds, the list divides as follow:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, from to Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coefficient for the first third (where we are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the pdf as rank, store the CDF (the sum of all frequencies up to and including that name, starting from Alderink).
Then modify your select to retrieve the first LN with a cumulative value greater than your formula result, as sketched below.
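A minimal sketch of that idea, assuming the cumulative frequencies have been precomputed into a column named [CumFreq] (a name invented here, not from the original post), scaled to 0.0 .. 1.0 and running from the rarest name (Alderink) up to Smith:

-- [CumFreq] is an assumed, precomputed running total of frequencies (0.0 .. 1.0)
DECLARE @draw FLOAT
SET @draw = RAND()

SELECT TOP 1 [LastName]
FROM [LastNames] AS LN
WHERE LN.[CumFreq] >= @draw      -- first name whose cumulative value covers the draw
ORDER BY LN.[CumFreq] ASC

This samples each name in proportion to its actual frequency increment, which is exactly the inverse-CDF sampling the answers above describe.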
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions, and although an answer has been accepted, and a very thorough answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage frequency into a table. I added a column and filled it with an integer which was the product of "total names required * frequency". The frequency data from the census did not add up to exactly 100%, so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number; the randomly added count never amounted to more than 0.05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88,799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards, except my deck had many more distinct cards and a varying number of each card.
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to T-SQL is no bother, if you know what values to use for n. y would be what you currently have, ROUND(88799 * RAND(), 0), and x0, x1 = 1, 88799 I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^, which is just POWER(x,y) == x^y. A hedged translation is sketched below.
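A sketch only, assuming the referenced answer's inverse-CDF formula x = ((x1^(n+1) - x0^(n+1)) * y + x0^(n+1))^(1/(n+1)) with y uniform on (0, 1) and x0 = 1, x1 = 88799; the exponent n = -0.5 is an arbitrary illustrative value (any n <> -1 works), not a value fitted to the census data:

-- Hypothetical translation of the inverse-CDF power-law draw to T-SQL
DECLARE @n FLOAT, @x0 FLOAT, @x1 FLOAT
SET @n  = -0.5      -- power-law exponent, chosen here purely for illustration
SET @x0 = 1.0
SET @x1 = 88799.0

SELECT [LastName]
FROM [LastNames] AS LN
WHERE LN.[Rank] = ROUND(
    POWER((POWER(@x1, @n + 1) - POWER(@x0, @n + 1)) * RAND() + POWER(@x0, @n + 1),
          1.0 / (@n + 1)), 0)

With a negative exponent, small ranks (the common names) are drawn more often, which matches the "about 1% Smith" goal in spirit, though the exponent would still need tuning against the real frequencies.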