I have a dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
{'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31] |Mom called dad, and when he came home, he took moms car and drove to the store |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I need to extract the words from the text column using the begin_end array column, like text[111:120+1]. In pandas, this can be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite this zip logic in PySpark and get new_col? Do I need to write a UDF for this?
You can do this by using substring in an expression. It expects the string you want to substring, a starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't accept a column as the starting position or length.
from pyspark.sql import functions as F
s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
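If you prefer to stay entirely in the DataFrame API, Column.substr also accepts column expressions for the start position and the length, so the same result can be obtained without a SQL string (a minimal sketch against the s dataframe above):

from pyspark.sql import functions as F

# substr is 1-indexed, so shift the 0-based start position by one
start = F.col('begin_end')[0] + 1
length = F.col('begin_end')[1] - F.col('begin_end')[0] + 1
s.withColumn('new_col', F.col('text').substr(start, length)).show()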
I am working on a Spark cluster and I have two dataframes. One contains text; the other is a look-up table. Both tables are huge (M and N could each easily exceed 100,000 entries). What is the best way to match them?
Doing a cross join and then filtering the results based on matches seems like a bad idea, since I would almost certainly run out of memory.
My dataframes look something like this:
df1:
text
0 i like apples
1 oranges are good
2 eating bananas is healthy
. ...
. ...
M tomatoes are red, bananas are yellow
df2:
fruit_lookup
0 apples
1 oranges
2 bananas
. ...
. ...
N tomatoes
I am expecting an output dataframe to look something like this:
output_df:
text extracted_fruits
0 i like apples ['apples']
1 oranges are good ['oranges']
2 eating bananas is healthy ['bananas']
. ...
. ...
M tomatoes are red, bananas are yellow   ['tomatoes','bananas']
One way is to use CountVectorizerModel, as 100K lookup words should be manageable for this model (default vocabSize=262144):
The basic idea is to create the CountVectorizerModel from a customized vocabulary list taken from df2 (the lookup table), split df1.text into an array column, and then transform this column into a SparseVector, which can then be mapped back into words.
Edit: in the split function, adjusted the regex from \s+ to [\s\p{Punct}]+ so that all punctuation marks are removed. Change 'text' to lower(col('text')) if the lookup is case-insensitive.
from pyspark.ml.feature import CountVectorizerModel
from pyspark.sql.functions import split, udf, regexp_replace, lower
df2.show()
+---+------------+
| id|fruit_lookup|
+---+------------+
| 0| apples|
| 1| oranges|
| 2| bananas|
| 3| tomatoes|
| 4|dragon fruit|
+---+------------+
Edit-2: Added the following df1 pre-processing step to create an array column containing all N-gram combinations. For each string with L words, N=2 adds (L-1) more items to the array; N=3 adds (L-1)+(L-2) more.
# max number of words in a single entry of the lookup table df2
N = 2

# Pre-process the `text` field up to N-grams,
# example: ngram_str('oranges are good', 3)
#     --> ['oranges', 'are', 'good', 'oranges are', 'are good', 'oranges are good']
def ngram_str(s_t_r, N):
    arr = s_t_r.split()
    L = len(arr)
    for i in range(2, N+1):
        if L - i < 0: break
        arr += [ ' '.join(arr[j:j+i]) for j in range(L-i+1) ]
    return arr

udf_ngram_str = udf(lambda x: ngram_str(x, N), 'array<string>')

df1_processed = df1.withColumn('words_arr', udf_ngram_str(lower(regexp_replace('text', r'[\s\p{Punct}]+', ' '))))
Apply the model to the processed df1:
lst = [ r.fruit_lookup for r in df2.collect() ]
model = CountVectorizerModel.from_vocabulary(lst, inputCol='words_arr', outputCol='fruits_vec')
df3 = model.transform(df1_processed)
df3.show(20,40)
#+----------------------------------------+----------------------------------------+-------------------+
#| text| words_arr| fruits_vec|
#+----------------------------------------+----------------------------------------+-------------------+
#| I like apples| [i, like, apples, i like, like apples]| (5,[0],[1.0])|
#| oranges are good|[oranges, are, good, oranges are, are...| (5,[1],[1.0])|
#| eating bananas is healthy|[eating, bananas, is, healthy, eating...| (5,[2],[1.0])|
#| tomatoes are red, bananas are yellow|[tomatoes, are, red, bananas, are, ye...|(5,[2,3],[1.0,1.0])|
#| test| [test]| (5,[],[])|
#|I have dragon fruit and apples in my bag|[i, have, dragon, fruit, and, apples,...|(5,[0,4],[1.0,1.0])|
#+----------------------------------------+----------------------------------------+-------------------+
Then you can map the fruits_vec back to the fruits using model.vocabulary
vocabulary = model.vocabulary
#['apples', 'oranges', 'bananas', 'tomatoes', 'dragon fruit']
to_match = udf(lambda v: [ vocabulary[i] for i in v.indices ], 'array<string>')
df_new = df3.withColumn('extracted_fruits', to_match('fruits_vec')).drop('words_arr', 'fruits_vec')
df_new.show(truncate=False)
#+----------------------------------------+----------------------+
#|text |extracted_fruits |
#+----------------------------------------+----------------------+
#|I like apples |[apples] |
#|oranges are good |[oranges] |
#|eating bananas is healthy |[bananas] |
#|tomatoes are red, bananas are yellow |[bananas, tomatoes] |
#|test |[] |
#|I have dragon fruit and apples in my bag|[apples, dragon fruit]|
#+----------------------------------------+----------------------+
Method-2: As your datasets are not huge by Spark standards, the following might work; it also handles lookup values containing multiple words, as per your comment:
from pyspark.sql.functions import expr, collect_set
df1.alias('d1').join(
df2.alias('d2')
, expr('d1.text rlike concat("\\\\b", d2.fruit_lookup, "\\\\b")')
, 'left'
).groupby('text') \
.agg(collect_set('fruit_lookup').alias('extracted_fruits')) \
.show()
+--------------------+-------------------+
| text| extracted_fruits|
+--------------------+-------------------+
| oranges are good| [oranges]|
| I like apples| [apples]|
|tomatoes are red,...|[tomatoes, bananas]|
|eating bananas is...| [bananas]|
| test| []|
+--------------------+-------------------+
Where: "\\\\b is word boundary so that the lookup values do not mess up with their contexts.
Note: you may need to clean up all punctuation marks and redundant spaces on both columns before the dataframe join.
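For example, a minimal sketch of that clean-up, reusing the regex from above (the column names mirror the ones in the question):

from pyspark.sql.functions import regexp_replace, trim, lower

# collapse whitespace/punctuation into single spaces and lower-case both sides
df1_clean = df1.withColumn('text', trim(lower(regexp_replace('text', r'[\s\p{Punct}]+', ' '))))
df2_clean = df2.withColumn('fruit_lookup', trim(lower(regexp_replace('fruit_lookup', r'[\s\p{Punct}]+', ' '))))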
For example, say I have these data points:
feature: color class
---------------------------------------------
red, green A
yellow, orange B
blue, green, red A
yellow B
The categorical feature column would be [red, blue, green, yellow, orange], but each sample can belong to multiple categories (such as (red, green)).
One approach would be to represent each category (color) as its own column, and then perform a binary encoding on top of that (1 or 0 for true or false), as sketched below.
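For concreteness, a minimal multi-hot encoding sketch in plain Python (the sample data just mirrors the table above):

categories = ['red', 'blue', 'green', 'yellow', 'orange']

samples = [
    {'color': ['red', 'green'],         'class': 'A'},
    {'color': ['yellow', 'orange'],     'class': 'B'},
    {'color': ['blue', 'green', 'red'], 'class': 'A'},
    {'color': ['yellow'],               'class': 'B'},
]

# one 0/1 column per category ("multi-hot" encoding)
encoded = [[1 if c in s['color'] else 0 for c in categories] for s in samples]
# encoded[0] -> [1, 0, 1, 0, 0]   (red, green)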
Would this be the best approach in Tensorflow, or is there a better way to do this?
Here is what I want to do:
I have 2 tables in an Oracle database. The first one is for my data, and the second one is for mapping.
The mapping table has 2 columns and looks like this:
olive - green
marine - green
grass - green
green - green
navy - blue
sky - blue
light blue - blue
etc.
The first table (for my data) has more columns, and I need to fill every row in that table with a varying number of colors from the mapping table.
And I have a txt file with this:
olive,
navy,
green,
#
blue,
sky,
olive,
marine,
#
blue,
light blue,
#
etc.
So, I should basically write that data to the first table in the database, but not with the raw values from the txt file; instead, with the corresponding values from the mapping table.
I hope I explained it well enough. Help, please?
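For illustration only, a minimal sketch of the translation step described above, parsing the '#'-separated groups and looking each color up in the mapping table (the file name and the in-memory dict are hypothetical stand-ins for the real tables):

# hypothetical mapping loaded from the mapping table
mapping = {'olive': 'green', 'marine': 'green', 'grass': 'green',
           'green': 'green', 'navy': 'blue', 'sky': 'blue', 'light blue': 'blue'}

groups, current = [], []
with open('colors.txt') as f:          # hypothetical file name
    for line in f:
        value = line.strip().rstrip(',')
        if value == '#':               # '#' closes the current group
            groups.append(current)
            current = []
        elif value:
            current.append(mapping.get(value, value))

# groups[0] -> ['green', 'blue', 'green']   (olive, navy, green mapped)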
Let's say I have a set of objects with properties:
Object Quantity Color Shape Kind
----------------------------------------
APPLE 12 RED ROUND FRUIT
APPLE 3 GREEN ROUND FRUIT
ORANGE 6 ORANGE ROUND FRUIT
CARROT 0 RED CONICAL VEGETABLE
RADISH 24 RED ROUND VEGETABLE
Object and all properties except quantity are represented as strings. Quantity is a number.
I must compose a random list of objects based on a user's query.
The query contains values for all string properties (that is, all properties except quantity).
A value in the query may be either an exact property value, a wildcard (meaning "any value would do for this property"), or a negation ("NOT this exact property value").
The query result is an object picked by weighted random from all objects with matching properties. The weight for the random pick is the quantity.
For example:
Query -> Probabilities -> Example
random result
-----------------------------------------------------------------------------
* ROUND FRUIT -> APPLE 12 / APPLE 3 = APPLE 15 -> APPLE
!GREEN ROUND FRUIT -> APPLE 12 / ORANGE 6 -> ORANGE
RED * * -> CARROT 0 / APPLE 12 / RADISH 24
= APPLE 12 / RADISH 24 -> RADISH
RED CONICAL VEGETABLE -> CARROT 0
= (none) -> (none)
For self-education purposes, I would like to build this system using Redis for data storage.
The question is: how do I do this elegantly and with the least amount of application logic (as opposed to in-Redis operations)? Weights and negation kind of spoil the picture; otherwise it would be nicely doable with sets.
Any hints are welcome.
Since Redis can only query keys and not values, a good option is to index the objects in separate Redis sets, one per property value.
For example, when you add the object ...
APPLE 12 RED ROUND FRUIT
you would store it as
hmset obj:1 name apple qty 12 color red shape round kind fruit
and then ...
sadd name:apple obj:1
sadd color:red obj:1
sadd shape:round obj:1
This way you can interrogate the sets directly and pick an object using a random number based on, for example, the total number of items in the returned set.
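For instance, a minimal sketch of one query with a quantity-weighted pick using redis-py (the key names follow the scheme above; the negation case could be handled with SDIFF; this is an illustration, not a complete solution):

import random
import redis

r = redis.Redis()

# "* ROUND FRUIT": intersect the sets for the fixed properties
candidates = r.sinter('shape:round', 'kind:fruit')

# weighted random pick, weight = qty stored in each object's hash
weighted = [(key, int(r.hget(key, 'qty') or 0)) for key in candidates]
weighted = [(key, qty) for key, qty in weighted if qty > 0]

if weighted:
    keys, weights = zip(*weighted)
    pick = random.choices(keys, weights=weights, k=1)[0]
    print(r.hgetall(pick))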
Hope that helps. If you need more explanation, hit me up.