Get position of a string in a field with delimiters in BigQuery - sql

I want to get the position of a word in a field that holds data like the following, with "->" as the delimiter:
Example:
Row 1| "ACT -> BAT -> CAT -> DATE -> EAT"
Row 2| "CAT -> ACT -> EAT -> BAT -> DATE"
I would like to, say, extract the position of CAT in each row.
The output would be:
Row 1| 3
Row 2| 1
I've tried regexp_instr and instr, but I think they both return the position of a character, not the position of the word.

Consider the approach below:
select *,
  array_length(split(regexp_extract(col, r'(.*?)CAT'), '->')) as position
from your_table
Here regexp_extract captures everything before CAT, and splitting that prefix on '->' yields exactly as many pieces as CAT's 1-based position. Applied to the sample data in your question, the output is:
col                               | position
ACT -> BAT -> CAT -> DATE -> EAT  | 3
CAT -> ACT -> EAT -> BAT -> DATE  | 1
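As an aside, if you ever need the same result in Spark rather than BigQuery (purely an assumption, since the related questions below are Spark ones), a minimal PySpark sketch can split on the delimiter and use array_position, which already returns a 1-based index:
# Hedged PySpark sketch (assumption: the sample rows live in a DataFrame column named `col`).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ACT -> BAT -> CAT -> DATE -> EAT",), ("CAT -> ACT -> EAT -> BAT -> DATE",)],
    ["col"])

# split on the delimiter (trimming surrounding spaces), then look up CAT's 1-based index
df.withColumn("position", F.array_position(F.split("col", r"\s*->\s*"), "CAT")).show(truncate=False)
# row 1 -> 3, row 2 -> 1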

Related

Spark- scan data frame base on value

I'm trying to find a column (I do not know the name of the column in advance) based on a value. For example, in the dataframe below I'd like to know which column contains 'yellow' for Category = 'A'. Since I don't know the column name (colour) ahead of time, I can't simply do select * where Category = 'A' and colour = 'yellow'. How can I scan the columns and achieve this? Many thanks for your help.
+--------+------+----+
|Category|colour|name|
+--------+------+----+
|A       |blue  |Elmo|
|A       |yellow|Alex|
|B       |desc  |Erin|
+--------+------+----+
You can loop that check through the list of column names, and you can wrap the loop in a function for readability. Note that the per-column checks run sequentially.
from pyspark.sql import functions as F

cols = df.columns
for c in cols:
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
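If the one-count-job-per-column approach becomes slow on a wide dataframe, a hedged alternative (a sketch, assuming the same column names as above) is to collapse everything into a single aggregation that flags, per column, whether 'yellow' appears for Category = 'A':
from pyspark.sql import functions as F

candidate_cols = [c for c in df.columns if c != 'Category']

# one flag per column: 1 if any Category='A' row has 'yellow' in that column, else 0
flags = (df.where(F.col('Category') == 'A')
           .agg(*[F.max(F.when(F.col(c) == 'yellow', 1).otherwise(0)).alias(c)
                  for c in candidate_cols])
           .first())

print([c for c in candidate_cols if flags[c] == 1])   # e.g. ['colour']
This runs one Spark job instead of one per column, at the cost of a slightly less obvious expression.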

Return value X where all rows that value fit criteria in other column

I need to return a list of numbers where all rows having the same number meet a criterion in another column.
For example, all rows with the same "num" must have non-null values in the Check-in column.
num | Check-in
1   | null
1   | X
1   | X
2   | X
2   | X
3   | X
3   | X
Desired return: 2, 3
I know there has to be a simple way to do this without looping! Thank you!
You can try grouping with a HAVING filter: it compares the total count of rows in each group with the count of rows per group whose Check-in is not null, and you only keep the groups where the two counts match.
select num
from t
group by num
having count(*) = count(checkin);
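If the data happens to live in Spark rather than a plain SQL database (an assumption; the surrounding questions are Spark ones), the same counts-match idea translates directly, again relying on count ignoring nulls:
# Hedged PySpark sketch (assumed column names `num` and `checkin`).
from pyspark.sql import functions as F

(df.groupBy('num')
   .agg(F.count(F.lit(1)).alias('total_rows'),
        F.count('checkin').alias('non_null_rows'))   # count(col) skips nulls
   .where(F.col('total_rows') == F.col('non_null_rows'))
   .select('num')
   .show())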

What is the best way to find all occurrences of values from one dataframe in another dataframe?

I am working on a spark cluster and I have two dataframes. One contains text. The other is a look-up table. Both tables are huge (M and N both could easily exceed 100,000 entries). What is the best way to match them?
Doing a cross-join then filtering results based on matches seems like a crazy idea since I would most certainly run out of memory.
My dataframes look something like this:
df1:
text
0 i like apples
1 oranges are good
2 eating bananas is healthy
. ...
. ...
M tomatoes are red, bananas are yellow
df2:
fruit_lookup
0 apples
1 oranges
2 bananas
. ...
. ...
N tomatoes
I am expecting an output dataframe to look something like this:
output_df:
text extracted_fruits
0 i like apples ['apples']
1 oranges are good ['oranges']
2 eating bananas is healthy ['bananas']
. ...
. ...
M tomatoes are red, bananas are yellow ['tomatoes','bananas']
One way is to use CountVectorizerModel, since 100K lookup words should be manageable for this model (the default vocabSize is 262144).
The basic idea is to create the CountVectorizerModel from a customized vocabulary list built from df2 (the lookup table), split df1.text into an array column, and then transform that column into a SparseVector, which can then be mapped back into words.
Edit: in the split step, the regex was adjusted from \s+ to [\s\p{Punct}]+ so that all punctuation marks are removed as well. Change 'text' to lower(col('text')) if the lookup should be case-insensitive.
from pyspark.ml.feature import CountVectorizerModel
from pyspark.sql.functions import split, udf, regexp_replace, lower
df2.show()
+---+------------+
| id|fruit_lookup|
+---+------------+
| 0| apples|
| 1| oranges|
| 2| bananas|
| 3| tomatoes|
| 4|dragon fruit|
+---+------------+
Edit-2: added the following df1 pre-processing step to create an array column containing all N-gram combinations. For each string with L words, N=2 adds (L-1) more items to the array, and N=3 adds (L-1)+(L-2) more.
# max number of words in a single entry of the lookup table df2
N = 2

# Pre-process the `text` field up to N-grams,
# example: ngram_str('oranges are good', 3)
#   --> ['oranges', 'are', 'good', 'oranges are', 'are good', 'oranges are good']
def ngram_str(s_t_r, N):
    arr = s_t_r.split()
    L = len(arr)
    for i in range(2, N+1):
        if L - i < 0: break
        arr += [' '.join(arr[j:j+i]) for j in range(L-i+1)]
    return arr

udf_ngram_str = udf(lambda x: ngram_str(x, N), 'array<string>')

df1_processed = df1.withColumn('words_arr', udf_ngram_str(lower(regexp_replace('text', r'[\s\p{Punct}]+', ' '))))
Implement the model on the processed df1:
lst = [ r.fruit_lookup for r in df2.collect() ]
model = CountVectorizerModel.from_vocabulary(lst, inputCol='words_arr', outputCol='fruits_vec')
df3 = model.transform(df1_processed)
df3.show(20,40)
#+----------------------------------------+----------------------------------------+-------------------+
#| text| words_arr| fruits_vec|
#+----------------------------------------+----------------------------------------+-------------------+
#| I like apples| [i, like, apples, i like, like apples]| (5,[0],[1.0])|
#| oranges are good|[oranges, are, good, oranges are, are...| (5,[1],[1.0])|
#| eating bananas is healthy|[eating, bananas, is, healthy, eating...| (5,[2],[1.0])|
#| tomatoes are red, bananas are yellow|[tomatoes, are, red, bananas, are, ye...|(5,[2,3],[1.0,1.0])|
#| test| [test]| (5,[],[])|
#|I have dragon fruit and apples in my bag|[i, have, dragon, fruit, and, apples,...|(5,[0,4],[1.0,1.0])|
#+----------------------------------------+----------------------------------------+-------------------+
Then you can map the fruits_vec back to the fruits using model.vocabulary
vocabulary = model.vocabulary
#['apples', 'oranges', 'bananas', 'tomatoes', 'dragon fruit']
to_match = udf(lambda v: [ vocabulary[i] for i in v.indices ], 'array<string>')
df_new = df3.withColumn('extracted_fruits', to_match('fruits_vec')).drop('words_arr', 'fruits_vec')
df_new.show(truncate=False)
#+----------------------------------------+----------------------+
#|text |extracted_fruits |
#+----------------------------------------+----------------------+
#|I like apples |[apples] |
#|oranges are good |[oranges] |
#|eating bananas is healthy |[bananas] |
#|tomatoes are red, bananas are yellow |[bananas, tomatoes] |
#|test |[] |
#|I have dragon fruit and apples in my bag|[apples, dragon fruit]|
#+----------------------------------------+----------------------+
Method-2: since your dataset is not huge by Spark standards, the following might work; it also handles lookup values containing multiple words, as per your comment:
from pyspark.sql.functions import expr, collect_set

df1.alias('d1').join(
    df2.alias('d2'),
    expr('d1.text rlike concat("\\\\b", d2.fruit_lookup, "\\\\b")'),
    'left'
).groupby('text') \
 .agg(collect_set('fruit_lookup').alias('extracted_fruits')) \
 .show()
+--------------------+-------------------+
| text| extracted_fruits|
+--------------------+-------------------+
| oranges are good| [oranges]|
| I like apples| [apples]|
|tomatoes are red,...|[tomatoes, bananas]|
|eating bananas is...| [bananas]|
| test| []|
+--------------------+-------------------+
Where: "\\\\b is word boundary so that the lookup values do not mess up with their contexts.
Note: you may need to clean up all punctuation marks and redundant spaces on both columns before the dataframe join.
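A minimal sketch of that cleanup, reusing the same [\s\p{Punct}]+ pattern (the column names, as above, are assumptions):
from pyspark.sql import functions as F

# lowercase, strip punctuation, and collapse whitespace on both sides of the join
df1_clean = df1.withColumn('text', F.trim(F.regexp_replace(F.lower('text'), r'[\s\p{Punct}]+', ' ')))
df2_clean = df2.withColumn('fruit_lookup', F.trim(F.regexp_replace(F.lower('fruit_lookup'), r'[\s\p{Punct}]+', ' ')))
Then run the same rlike join on df1_clean and df2_clean instead of df1 and df2.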

Spark Scala : How to read fixed record length File

I have a simple question.
“How to read files with fixed record length?” I have 2 fields in the record: name and state.
File Data-
John OHIO
VictorNEWYORK
Ron CALIFORNIA
File Layout-
Name String(6);
State String(10);
I just want to read it and create a DataFrame from this file. To elaborate on “fixed record length”: since “OHIO” is only 4 characters and the State field is 10, it is padded with 6 trailing spaces in the file, i.e. “OHIO      ”.
The record length here is 16.
Thanks,
Sid
Read your input file:
val rdd = sc.textFile("your_file_path")
Then use substring to carve out each field and convert the RDD to a DataFrame using toDF():
// assumes a SparkSession named `spark` (as in spark-shell); needed for toDF()
import spark.implicits._

val df = rdd.map(l => (l.substring(0, 6).trim(), l.substring(6, 16).trim()))
  .toDF("Name", "State")

df.show(false)
Result:
+------+----------+
|Name |State |
+------+----------+
|John |OHIO |
|Victor|NEWYORK |
|Ron |CALIFORNIA|
+------+----------+
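In case a PySpark version of the same fixed-width parse is useful (an assumption; the file path and the 6/10 field widths come from the question), a minimal sketch:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# spark.read.text gives one record per line in a column named `value`;
# substring is 1-based: chars 1-6 are Name, chars 7-16 are State
df = (spark.read.text("your_file_path")
        .select(F.trim(F.substring("value", 1, 6)).alias("Name"),
                F.trim(F.substring("value", 7, 10)).alias("State")))

df.show(truncate=False)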

Split Words in the comment text

I am trying to write a macro which will split the comment. My supervisor wants to prioritize the comments, e.g.:
Low : Comment 1
Medium : Comment 2
High: Comment 3
The output should be displayed in Excel with the headings shown below. I was able to write a macro to export comments from Word to an Excel file; however, I am struggling to add a code snippet that splits the comment text.
Comment ID | Page | Section/Paragraph Name | Comment Scope   | Comment text | Priority | Reviewer    | Comment Date
1          | 1    | 1.1 heading1           | example heading | Comment 1    | Low      | BlueDolphin | 1/1/1
2          | 2    | 1.2 heading            | example2        | Comment 2    | Medium   | BlueDolphin | 1/1/1
3          | 3    | 1.3 heading            | 3example3       | Comment 3    | High     | BlueDolphin | 1/1/1
Any help is much appreciated.
This looks pretty straightforward. Just turn on the Macro Recorder, select the cells of interest, click Data > Text to Columns, and follow the prompts. Turn off the Macro Recorder when done. That's it.