How to remove stopwords from text during preprocessing in Spark - apache-spark-sql

I have a requirement to preprocess data in Spark before running the algorithms.
One of the preprocessing steps is to remove stopwords from the text. I tried Spark's StopWordsRemover, which requires both its input and output columns to be Array[String]. After running the program, the final column is shown as a collection of strings, but I need a plain string.
My code is as follows:
val tokenizer: RegexTokenizer = new RegexTokenizer().setInputCol("raw").setOutputCol("token")
val stopWordsRemover = new StopWordsRemover().setInputCol("token").setOutputCol("final")
stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
val tokenized: DataFrame = tokenizer.transform(hiveDF)
val transformDF = stopWordsRemover.transform(tokenized)
Actual Output:
["rt messy support need help with bill"]
Required Output:
rt messy support need help with bill
My output should be a string, not an array of strings. Is there any way to do this? I need the column in the DataFrame to hold a string.
I would also like suggestions on the following options for removing stopwords from text in a Spark program:
StopWordsRemover from Spark MLlib
Stanford CoreNLP library
Which option gives better performance when parsing huge files?
Any help appreciated.
Thanks in advance.

To get a string instead of an array of strings, you can use df.collect()[0] if you are sure that only the first item is of interest.
However, this should not be an issue as long as you traverse the array and read each item in it.
Ultimately, HiveDF will give you an RDD[String], which becomes an Array[String] when you convert from the RDD.
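As a rough illustration of the logic being asked for (tokenize, drop stopwords, join the surviving tokens back into one string), here is a minimal plain-Python sketch. It does not use Spark; the stopword set and the function name remove_stopwords are made up for this example. Inside Spark itself, the joining step can be expressed with concat_ws from org.apache.spark.sql.functions, which accepts an array-of-strings column and a separator.

```python
import re

# Hypothetical stopword list for illustration only;
# Spark's StopWordsRemover ships with its own defaults.
STOPWORDS = {"a", "an", "the", "is", "to", "my"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize on whitespace, drop stopwords, and join back into one string."""
    tokens = re.split(r"\s+", text.strip().lower())
    kept = [t for t in tokens if t and t not in stopwords]
    # Joining with a space turns the list of tokens back into a plain string
    return " ".join(kept)


cleaned = remove_stopwords("RT the messy support need help with my bill")
print(cleaned)
```

The final join is the step that is missing from the question's pipeline: StopWordsRemover stops at the list of tokens.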

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary & end results were previously stored in an SQL database using sqlalchemy, but we need to transform it to delta.
After a lot of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
Where the _parseBytes() method takes a binary stream and outputs a dictionary of variables
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict : vars(Spectrum(inputDict)),SpectrumDelta.getSchema())
However the Spectrum() init method generates multiple Pandas Dataframes as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the pandas API on Spark, but to me that seems to be something to do within the package methods themselves. Is that maybe the solution: rewriting the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when wanting to output (with show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
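import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}  # named result to avoid shadowing the built-in dict
    for key, value in vars(spectrum).items():
        if isinstance(value, pd.Series):
            # A pandas.Series cannot be unpickled on the Spark side; pass a plain list
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            # Convert a DataFrame to a dict of column name -> list of values
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(getSpectrumDict, SpectrumDelta.getSchema())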

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts, it says object. So, I reasoned that the problem might be that something is inferring the type of that Series to be float, and I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object.
I tried casting the entire dataframe that includes "excerpts" to dtype str using pandas.DataFrame.astype(), but "excerpts" stubbornly keeps dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest a way to see which entries in "excerpts" are causing problems or, alternatively, to simply ignore them (leaving out their contribution to the TF/IDF)?
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts of shape (n_samples, n_features), such as the one CountVectorizer produces. The correct usage is more like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# First turn the raw strings into token counts, then reweight the counts
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).
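To make the two stages concrete, here is a from-scratch sketch in plain Python (no scikit-learn) of why the transformer needs counts rather than strings. The function names count_tokens and tfidf and the sample excerpts are invented for illustration. The weighting uses a simple unsmoothed IDF, ln(n / df) + 1, with no normalization, so the numbers will not match scikit-learn's defaults, which apply smoothing and L2 normalization. In scikit-learn itself, TfidfVectorizer combines both stages and accepts raw strings directly.

```python
import math
from collections import Counter


def count_tokens(docs):
    """Stage 1: turn raw strings into a document-term count matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = []
    for toks in tokenized:
        c = Counter(toks)
        counts.append([c[term] for term in vocab])
    return vocab, counts


def tfidf(counts):
    """Stage 2: reweight the counts by inverse document frequency."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Unsmoothed IDF; every vocabulary term occurs at least once, so df > 0
    idf = [math.log(n_docs / df[j]) + 1.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Made-up sample data standing in for the "excerpts" Series
excerpts = ["spark removes stopwords", "spark joins tokens", "pandas series of text"]
vocab, counts = count_tokens(excerpts)
weights = tfidf(counts)
```

Stage 2 only ever does arithmetic on numbers, which is why passing strings to it raises the "could not convert string to float" error from the question.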

Create new column from existing column in Dataset - Apache Spark Java

I am new to Spark ML and am stuck on a task that requires some data normalization. There is very little documentation available online for Spark ML in Java. Any help is much appreciated.
Problem description:
I have a Dataset that contains encoded URLs in a column (ENCODED_URL), and I want to add a new column (DECODED_URL) to the existing Dataset that contains the decoded version of ENCODED_URL.
For example:
Current Dataset
ENCODED_URL
https%3A%2F%2Fmywebsite
New Dataset
ENCODED_URL | DECODED_URL
https%3A%2F%2Fmywebsite | https://mywebsite
I tried using withColumn but had no clue what I should pass as the second argument:
Dataset<Row> newDs = ds.withColumn("new_col",?);
After reading the Spark documentation, I got the idea that it may be possible with SQLTransformer, but I couldn't figure out how to customize it to decode the URL.
This is how I read the data from the CSV file:
Dataset<Row> urlDataset = s_spark.read().option("header", true).csv(CSV_FILE).persist(StorageLevel.MEMORY_ONLY());
A Spark primer
The first thing to know is that Spark Datasets are effectively immutable. Whenever you do a transformation, a new Dataset is created and returned. Another thing to keep in mind is the difference between actions and transformations: actions cause Spark to actually start crunching numbers and compute your DataFrame, while transformations add to the definition of a DataFrame but are not computed until an action is called. An example of an action is DataFrame#count, while an example of a transformation is DataFrame#withColumn. See the full list of actions and transformations in the Spark Scala documentation.
A solution
withColumn allows you to either create a new column or replace an existing column in a Dataset (if the first argument is an existing column's name). The docs for withColumn will tell you that the second argument is supposed to be a Column object. Unfortunately, the Column documentation only describes the methods available on Column objects and does not link to the other ways of creating them, so it's not your fault that you're at a loss for what to do next.
The thing you're looking for is org.apache.spark.sql.functions#regexp_replace. Putting it all together, your code should look something like this:
...
import org.apache.spark.sql.functions;
Dataset<Row> ds = ... // reading from your csv file
// "^" anchors the match at the start of the string, so it must not be escaped
ds = ds.withColumn(
    "decoded_url",
    functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://"));
regexp_replace requires a Column object as its first argument, but nothing requires that the column exist on any particular Dataset: Column objects are essentially instructions for how to compute a column and don't contain any real data themselves. To illustrate this principle, we could write the above snippet as:
...
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;
Dataset<Row> ds = ... // reading from your csv file
Column myColExpression = functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://");
ds = ds.withColumn("decoded_url", myColExpression);
If you wanted, you could reuse myColExpression on other datasets that have an encoded_url column.
Suggestion
If you haven't already, you should familiarize yourself with the org.apache.spark.sql.functions class. It's a util class that's effectively the Spark standard lib for transformations.
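One caveat worth illustrating: a prefix-only regex rewrite decodes just the scheme and leaves any other percent-encoded characters untouched. The plain-Python sketch below (not Spark code) contrasts that with full percent-decoding via urllib.parse.unquote. The sample URL is made up for this example. In a Java Spark job, full decoding would presumably mean wrapping something like java.net.URLDecoder.decode in a UDF, which is an assumption about how you might structure it rather than something the answer above prescribes.

```python
import re
from urllib.parse import unquote

# Hypothetical encoded URL with more than just an encoded scheme
encoded = "https%3A%2F%2Fmywebsite%2Fsearch%3Fq%3Dspark%20ml"

# Prefix-only rewrite, mirroring the regexp_replace approach
prefix_only = re.sub(r"^https%3A%2F%2F", "https://", encoded)

# Full percent-decoding of every escaped character
fully_decoded = unquote(encoded)

print(prefix_only)
print(fully_decoded)
```

For the single value shown in the question (https%3A%2F%2Fmywebsite) both approaches give the same result; they only diverge once the rest of the URL contains encoded characters.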

Parsing a SQL spatial column in Python

I am struggling a bit as I am new to programming. I am currently writing a Python script and I am a bit stuck. The goal is to parse some spatial information that gets pulled from SQL into a format that is usable for my script down the line.
I was able to CAST through a SQL query and fetchall using the ODBC module. However, once I fetch the data, that is where it gets tricky for me. Here is an example of a print from the fetchall:
[(u'POLYGON ((7014.186279296875 6602.99658203125 1612.5, 7015.984375 6600.416015625 1612.5))',), (u'POLYGON ((6730.962646484375 6715.2490234375 1522.5, 6730.0869140625 6714.13916015625 1522.5))',)]
I am not exactly sure what I am getting here; it looks like a list of tuples. I have tried converting it to a list of lists, but there must be something I am missing.
Here is the usable format I am looking for:
[[7014.186279296875, 6602.99658203125, 1612.5], [7015.984375, 6600.416015625, 1612.5]]
[[6730.962646484375, 6715.2490234375, 1522.5], [6730.0869140625, 6714.13916015625, 1522.5]]
Any ideas on how I can accomplish this? Maybe there is a better way to CAST in SQL, or a module in Python that would be easier to use instead of just doing a cursor.fetchall() and parsing? Any parsing help would be useful too. Thanks.
If you want to do the parsing yourself, that should be straightforward. For the example you've provided, the following code would do the job:
result = []
for element in data:
    # Strip the leading 'POLYGON ((' and the trailing '))' from the string
    single_elements = element[0][10:-2].split(', ')
    polygon = []
    for se in single_elements:
        row = str(se).split(' ')
        polygon.append([float(a) for a in row])
    result.append(polygon)
result will then contain one list of [x, y, z] points per polygon, which is the format you asked for. If parsing is not an option, then paste some of your code so I can see how you're fetching the data.
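A slightly more defensive variant of the same idea is sketched below: instead of relying on fixed slice offsets, it locates the double parentheses and splits on any whitespace. The function name parse_polygon is made up for this example, and it assumes single-ring polygons like the ones shown in the question. If an extra dependency is acceptable, a geometry library such as Shapely can also parse this text format.

```python
def parse_polygon(wkt):
    """Return the vertices of a single-ring polygon string as lists of floats."""
    # Find the coordinate section between '((' and '))'
    start = wkt.index("((") + 2
    end = wkt.rindex("))")
    points = []
    for vertex in wkt[start:end].split(","):
        # split() with no argument tolerates irregular whitespace
        points.append([float(n) for n in vertex.split()])
    return points


# Rows shaped like the fetchall() output in the question
rows = [
    ("POLYGON ((7014.186279296875 6602.99658203125 1612.5, 7015.984375 6600.416015625 1612.5))",),
    ("POLYGON ((6730.962646484375 6715.2490234375 1522.5, 6730.0869140625 6714.13916015625 1522.5))",),
]

# Each row is a one-element tuple, so row[0] is the polygon text
polygons = [parse_polygon(row[0]) for row in rows]
print(polygons)
```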

Can we do string manipulation and conditional check in smooks?

I want to use Smooks to manipulate a large text file that arrives as TEXT. The file contains a large number of lines, and from each line I have to split out the characters and extract information.
E.g., I do the following in Java:
row.substring(0, 4)
row.substring(4, 64)
I have to convert the text content to a CSV file.
Can we do exactly the same string manipulation in Smooks, i.e. in the Smooks configuration? I believe I can use fixed-length processing for that?
How do I add an IF/ELSE condition to a Smooks configuration?
Like in Java:
if (row.length() == 900) {
    // DO
} else {
    // DO
}
We can do the string manipulation using the fixed-length reader [1], but I still cannot find a way to do a condition check,
e.g. if/else.
[1]http://www.smooks.org/mediawiki/index.php?title=V1.4:Smooks_v1.4_User_Guide#XML
If the format does not fit the flatfile reader, then you might be able to use the regex reader: https://github.com/smooks/smooks/tree/v1.5.1/smooks-examples/flatfile-to-xml-regex/
As for the conditional stuff... you really need to bind the data fragments into a Java model of some sort (real or virtual) and then conditionally process those fragments by either adding elements on the visitors being applied, or process the fragments by routing them to another process that processes them in parallel, which is a far better way of processing a huge data stream.