Cannot run multiple SparkContexts: PySpark filter() application for filtering stopwords using isin() - apache-spark-sql

For this PySpark task I am looking for ideas and to verify my usage of 'filter'.
I am looking for advice on filtering stopwords out of word chunks.
My first step is to read the CSV. Then I use (import re) to split() the text into word chunks. Then I apply the stopwords with filter(~storm_word_chunks.isin(stopwords)). Is this usage of filter correct for x.isin()?
Before sqlContext.spark.createDataFrame() runs, do we have to define sqlContext? I assume I first need something like:
sc = SparkContext("local", "first app")
followed by the from ... import:
from pyspark.context import SparkContext
sc = SparkContext("local", "first app")
df = spark.read.csv("StormEvents.csv")
storm_word_chunks = sc.spark.createDataFrame(df.split())
stopwords = ["i", "me", "my", "myself",...]
dfStorm_words = filter(~storm_word_chunks.isin(stopwords))
Now when this runs I get a ValueError: Cannot run multiple SparkContexts at once; existing SparkContext created by getOrCreate()...
OK, so how do I solve this SparkContext error?
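For what it's worth, here is a minimal sketch of how this could be wired up with a single SparkSession obtained via getOrCreate() (so a second SparkContext is never constructed) and with isin() used inside DataFrame.filter(). The column name EVENT_NARRATIVE and the whitespace tokenization are assumptions of mine, not something taken from the original code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split
# getOrCreate() reuses any existing context instead of building a new
# SparkContext, which is what triggers the "Cannot run multiple SparkContexts at once" ValueError.
spark = SparkSession.builder.master("local").appName("first app").getOrCreate()
df = spark.read.csv("StormEvents.csv", header=True)
stopwords = ["i", "me", "my", "myself"]  # truncated list, as in the question
# Assumption: the free-text column is called EVENT_NARRATIVE.
# split() turns it into an array of words; explode() gives one word per row.
words = df.select(explode(split(col("EVENT_NARRATIVE"), r"\s+")).alias("word"))
# filter() is called on the DataFrame; isin() builds the membership test on the column.
storm_word_chunks = words.filter(~col("word").isin(stopwords))
storm_word_chunks.show()
With this pattern there is no separate sc = SparkContext(...) line at all, so the multiple-contexts error cannot occur.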

Related

I want to get the Excel file from the data frame created, which automatically updates as written in the code

I have tried two methods, and both show a different location from the one I gave, as in this image:
apikey = 'abcd'
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import time
ts = TimeSeries(key=apikey, output_format='pandas')
data, metadata = ts.get_intraday(symbol='name', interval='1min', outputsize='full')
data
while True:
    data, metadata = ts.get_intraday(symbol='TCS', interval='1min', outputsize='full')
    data.to_excel('livedat.xlsx')
    time.sleep(60)
The code runs properly, but I don't know how to get the data file in Excel.
Important: the method should produce a file that is updated on time, i.e. every 1 minute, automatically.
Also, I am using IBM Watson Studio to write the code.
I am not familiar with the alpha_vantage wrapper that you are using; however, this is how I would approach your question. The code works and I have included comments.
To read the file back in a Python script I would use pd.read_excel(filepath).
import requests
import pandas as pd
import time
import datetime
# Your API key and the URL we will request from
API_KEY = "YOUR API KEY"
url = "https://www.alphavantage.co/query?"
def Generate_file(symbol="IBM", interval="1min"):
    # URL parameters
    parameters = {"function": "TIME_SERIES_INTRADAY",
                  "symbol": symbol,
                  "interval": interval,
                  "apikey": API_KEY,
                  "outputsize": "compact"}
    # get the JSON response from AlphaVantage
    response = requests.get(url, params=parameters)
    data = response.json()
    # filter the response to only get the time series data we want
    time_series_interval = f"Time Series ({interval})"
    prices = data[time_series_interval]
    # convert the filtered response to a Pandas DataFrame
    df = pd.DataFrame.from_dict(prices, orient="index").reset_index()
    df = df.rename(columns={"index": time_series_interval})
    # create a timestamp for our Excel file, so that the file does not get overwritten with new data each time
    current_time = datetime.datetime.now()
    file_timestamp = current_time.strftime("%Y%m%d_%H.%M")
    filename = f"livedat_{file_timestamp}.xlsx"
    df.to_excel(filename)
# set a limit on the number of calls we make to prevent an infinite loop
call_limit = 3
number_of_calls = 0
while number_of_calls < call_limit:
    Generate_file()  # our function
    number_of_calls += 1
    time.sleep(60)
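If you then want to pull the most recently written workbook back into the script (the pd.read_excel route mentioned above), here is a small sketch; the glob pattern simply assumes the livedat_*.xlsx naming used in Generate_file():
import glob
import pandas as pd
# pick the newest livedat_*.xlsx produced by Generate_file(); the
# YYYYMMDD_HH.MM timestamp in the name makes a lexicographic sort chronological
files = sorted(glob.glob("livedat_*.xlsx"))
if files:
    latest = files[-1]
    df = pd.read_excel(latest)
    print(latest, df.shape)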

LIKE operator working on AWS lambda function but not =

I have a small CSV file that looks like this:
is_employee,candidate_id,gender,hesa_type,university
FALSE,b9bb80,Male,Mathematical sciences,Birmingham
FALSE,8e552d,Female,Computer science,Swansea
TRUE,2bc475,Male,Engineering & technology,Aston
TRUE,c3ac8d,Female,Mathematical sciences,Heriot-Watt
FALSE,ceb2fa,Female,Mathematical sciences,Imperial College London
The following Lambda function is used to query it from an S3 bucket.
import boto3
import os
import json
def lambda_handler(event, context):
    BUCKET_NAME = 'foo'
    KEY = 'bar/data.csv'
    s3 = boto3.client('s3', 'eu-west-1')
    response = s3.select_object_content(
        Bucket=BUCKET_NAME,
        Key=KEY,
        ExpressionType='SQL',
        Expression='Select count(*) from s3object s where s.gender like \'%Female%\'',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )
    for i in response['Payload']:
        if 'Records' in i:
            query_result = i['Records']['Payload'].decode('utf-8')
            print(list(json.loads(query_result).values())[0])
Now, this works great as I get back a result of 3.
But for some reason the same code does not work when changing the LIKE operator to =: the result drops to 0, so no match is found. What's happening here?
So I found the problem. The problem was that the items of the last column were followed by a newline character, which was not understood by the AWS S3 interpreter. So really, a university name was not Swansea, but rather Swansea\n.
So s.university = 'Swansea' does not work; however, s.university LIKE 'Swansea%' does work, and is still a sargable expression.
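To make the difference concrete, here is a minimal sketch of the two expressions side by side, reusing the select_object_content call from the question (bucket and key are placeholders); only the query string changes between the failing and the working call:
import boto3
import json
s3 = boto3.client('s3', 'eu-west-1')
# '=' compares the full stored value, which is really 'Swansea\n', so it never matches
failing_expr = "Select count(*) from s3object s where s.university = 'Swansea'"
# LIKE with a trailing wildcard tolerates the invisible trailing newline
working_expr = "Select count(*) from s3object s where s.university LIKE 'Swansea%'"
def count_rows(expression):
    response = s3.select_object_content(
        Bucket='foo',
        Key='bar/data.csv',
        ExpressionType='SQL',
        Expression=expression,
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )
    for event in response['Payload']:
        if 'Records' in event:
            payload = event['Records']['Payload'].decode('utf-8')
            return list(json.loads(payload).values())[0]
print(count_rows(failing_expr))   # 0
print(count_rows(working_expr))   # 1 (the Swansea row)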

Unable to load a CSV file as a dataframe in Spark

I am trying to load a CSV file into a data frame, and my objective is to use the first row of the CSV file as the column names. But when using the code below, I get the error:
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
Code:
def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("SparkSessioncsvExample")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
  val df = spark.read
    .format("csv")
    .option("header", "true") // reading the headers
    .load("D:/Scala/C2ImportCalEventSample.csv")
}
But I'm able to load the file with the code:
val df = spark.sparkContext
.textFile("D:/Scala/C2ImportCalEventSample1.csv")
//.flatMap(header='true')
.map(line => line.split(","))
// .map(line => line.map()
.toDF()
With the second snippet the file loads successfully, but the first row is not used as the column names of the data frame.
spark version is: spark-2.3.2
scala 2.11.3
jdk1.8.0_20
sbt-1.2.7
Thanks to anyone who can help me with this.
java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only one version of Scala) on the classpath.

Custom SPARQL functions in rdflib

What is a good way to hook a custom SPARQL function into rdflib?
I have been looking around in rdflib for an entry point for custom functions. I found no dedicated entry point, but found that rdflib.plugins.sparql.CUSTOM_EVALS might be a place to add a custom function.
So far I have made an attempt with the code below. It seems "dirty" to me: I am calling a "hidden" function (_eval) and I am not sure I got all the argument updating correct. Beyond the custom_eval.py example code (which forms the basis for my code) I found little other code or documentation about CUSTOM_EVALS.
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
NAMESPACE = Namespace('//custom/')
LENGTH = rdflib.term.URIRef(NAMESPACE + 'length')
def customEval(ctx, part):
"""Evaluate custom function."""
if part.name == 'Extend':
cs = []
for c in evalPart(ctx, part.p):
if hasattr(part.expr, 'iri'):
# A function
argument = _eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars))
if part.expr.iri == LENGTH:
e = Literal(len(argument))
else:
raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
else:
e = _eval(part.expr, c.forget(ctx, _except=part._vars))
if isinstance(e, SPARQLError):
raise e
cs.append(c.merge({part.var: e}))
return cs
raise NotImplementedError()
QUERY = """
PREFIX custom: <%s>
SELECT ?s ?length WHERE {
BIND("Hello, World" AS ?s)
BIND(custom:length(?s) AS ?length)
}
""" % (NAMESPACE,)
rdflib.plugins.sparql.CUSTOM_EVALS['exampleEval'] = customEval
for row in rdflib.Graph().query(QUERY):
print(row)
So first off, I want to thank you for showing how you implemented a new SPARQL function.
Secondly, by using your code I was able to create a SPARQL function that compares two strings using the Levenshtein distance. It has been really insightful, and I wish to share it, as it holds additional documentation that could help other developers create their own custom SPARQL functions.
# Import needed to introduce new SPARQL function
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
# Import for custom function calculation
from Levenshtein import distance as levenshtein_distance # python-Levenshtein==0.12.2
def SPARQL_levenshtein(ctx:object, part:object) -> object:
"""
The first two variables retrieved from a SPARQL-query are compared using the Levenshtein distance.
The distance value is then stored in Literal object and added to the query results.
Example:
Query:
PREFIX custom: //custom/ # Note: this part references the custom function
SELECT ?label1 ?label2 ?levenshtein WHERE {
BIND("Hello" AS ?label1)
BIND("World" AS ?label2)
BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
Retrieve:
?label1 ?label2
Calculation:
levenshtein_distance(?label1, ?label2) = distance
Output:
Save distance in Literal object.
:param ctx: <class 'rdflib.plugins.sparql.sparql.QueryContext'>
:param part: <class 'rdflib.plugins.sparql.parserutils.CompValue'>
:return: <class 'rdflib.plugins.sparql.processor.SPARQLResult'>
"""
# This part holds basic implementation for adding new functions
if part.name == 'Extend':
cs = []
# Information is retrieved and stored and passed through a generator
for c in evalPart(ctx, part.p):
# Checks if the function holds an internationalized resource identifier
# This will check if any custom functions are added.
if hasattr(part.expr, 'iri'):
# From here the real calculations begin.
# First we get the variable arguments, for example ?label1 and ?label2
argument1 = str(_eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars)))
argument2 = str(_eval(part.expr.expr[1], c.forget(ctx, _except=part.expr._vars)))
# Here it checks if it can find our levenshtein IRI (example: //custom/levenshtein)
# Please note that IRI and URI are almost the same.
# Earlier this has been defined with the following:
# namespace = Namespace('//custom/')
# levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
if part.expr.iri == levenshtein:
# After finding the correct path for the custom SPARQL function the evaluation can begin.
# Here the levenshtein distance is calculated using ?label1 and ?label2 and stored as a Literal object.
# This object is then stored as an output value of the SPARQL-query (example: ?levenshtein)
evaluation = Literal(levenshtein_distance(argument1, argument2))
# Standard error handling and return statements
else:
raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
else:
evaluation = _eval(part.expr, c.forget(ctx, _except=part._vars))
if isinstance(evaluation, SPARQLError):
raise evaluation
cs.append(c.merge({part.var: evaluation}))
return cs
raise NotImplementedError()
namespace = Namespace('//custom/')
levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
query = """
PREFIX custom: <%s>
SELECT ?label1 ?label2 ?levenshtein WHERE {
BIND("Hello" AS ?label1)
BIND("World" AS ?label2)
BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
""" % (namespace,)
# Save custom function in custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
for row in rdflib.Graph().query(query):
print(row)
To answer your question: "What is a good way to hook a custom SPARQL function into rdflib?"
Currently I'm developing a class that handles RDF data, and I believe it might be best to put the following code in the __init__ function.
For example:
class ClassName():
    """DOCSTRING"""
    def __init__(self):
        """DOCSTRING"""
        # Save custom function in custom evaluation dictionary.
        rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
Please note, this SPARQL function will only work for the endpoint on which it is implemented. Even though the SPARQL syntax in the query is correct, it is not possible to apply the function in SPARQL queries run against external endpoints such as DBpedia. The DBpedia endpoint does not support this custom function (yet).
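As a side note, and as an assumption on my part that you should verify against your rdflib version: recent rdflib releases also ship a dedicated helper, rdflib.plugins.sparql.operators.register_custom_function, which registers a plain Python function under an IRI without touching CUSTOM_EVALS. A minimal sketch of the length example from the question using it:
import rdflib
from rdflib import Literal, URIRef
from rdflib.plugins.sparql.operators import register_custom_function
LENGTH = URIRef('//custom/length')
def length_func(s):
    # the function receives the already-evaluated argument node(s) and returns a new node
    return Literal(len(s))
register_custom_function(LENGTH, length_func)
query = """
PREFIX custom: <//custom/>
SELECT ?s ?length WHERE {
    BIND("Hello, World" AS ?s)
    BIND(custom:length(?s) AS ?length)
}
"""
for row in rdflib.Graph().query(query):
    print(row)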

NiFi: Remove fixed number of header lines from file

I'm processing a file and I'd like to remove (trim) the first X header lines to keep only data, possibly avoiding using regular expressions.
Thanks
You can remove the first X header lines by using the ExecuteScript processor in NiFi.
The following is an example Jython script which I wrote for myself:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
        for line in text[3:]:
            outputStream.write(line + "\n")
flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, PyStreamCallback())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
This obviously removes the first 3 lines but you can easily modify it to remove more or less lines.
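If you would rather not hardcode the 3, a small variant of the same script (my own untested sketch, reusing the imports above) could take the number of header lines from a flow file attribute; the attribute name header.line.count is made up for this example:
class TrimHeaderCallback(StreamCallback):
    def __init__(self, skip):
        self.skip = skip
    def process(self, inputStream, outputStream):
        lines = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
        # keep everything after the first `skip` lines
        for line in lines[self.skip:]:
            outputStream.write(line + "\n")
flowFile = session.get()
if (flowFile != None):
    # hypothetical attribute; fall back to 3 header lines when it is absent
    skip = int(flowFile.getAttribute('header.line.count') or 3)
    flowFile = session.write(flowFile, TrimHeaderCallback(skip))
    session.transfer(flowFile, REL_SUCCESS)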
Hope that helps.