For this PySpark job I am looking for ideas, and I want to verify my use of 'filter'.
I am also looking for advice on splitting text into word chunks so that stopwords can be filtered out.
My first step is to read the CSV. Then I use re (import re) and split() to break the text into word chunks. Then I apply the stopwords with filter(~storm_word_chunks.isin(stopwords)). Is this usage of filter correct for x.isin()?
Before sqlContext.spark.createDataFrame() runs, do we need to define sqlContext? I think I first need something like:
sc = SparkContext("local", "first app")
Following the from ... import
from pyspark.context import SparkContext
sc = SparkContext("local", "first app")
df ="StormEvents.csv")
storm_word_chunks = sc.spark.createDataFrame(df.split()))
stopwords = ["i", "me", "my", "myself",...]
dfStorm_words = filter(~storm_word_chunks.isin(stopwords))
Now when this runs I get a ValueError: Cannot run multiple SparkContexts at once; existing SparkContext created by getOrCreate()...
OK, so how do I solve this SparkContext error?
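Independent of the SparkContext error, the split-then-filter logic described above can be checked in plain Python first. This is only a sketch: it does not use Spark, the sample sentence and the short stopword set are made up for illustration, and `word_chunks` and `remove_stopwords` are hypothetical helper names.

```python
import re

# Short, made-up stopword set; the real list in the question is longer.
STOPWORDS = {"i", "me", "my", "myself"}


def word_chunks(text):
    """Lowercase the text and return its alphabetic word chunks."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep only the words that are not in the stopword set."""
    return [word for word in words if word not in stopwords]


sample = "I saw the storm myself from my window"
print(remove_stopwords(word_chunks(sample)))
```

In PySpark itself, `isin` is a method on a Column and `filter` is a method on a DataFrame, so the equivalent call would look like `words_df.filter(~words_df["word"].isin(stopwords))`, where `words_df` and the column name `word` are assumptions. `SparkContext.getOrCreate()` returns the context that is already running instead of raising when one exists.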


I want to get an Excel file from the data frame I create, and the file should update automatically as the code runs.

I have tried two methods, and both save to a different location than the one I specified.
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import time
while True:
    data, metadata = ts.get_intraday(symbol='TCS', interval='1min', outputsize='full')
The code runs properly, but I don't know how to get the data into an Excel file.
Important: the method should produce a file that is updated automatically every minute.
Also, I am using IBM Watson Studio to write the code.
I am not familiar with the alpha_vantage wrapper that you are using; however, this is how I would approach your question. The code works and I have included comments.
To read the file back in a Python script I would use pd.read_excel(filepath).
import requests
import pandas as pd
import time
import datetime
# Your API KEY and the URL we will request from
API_KEY = "YOUR_API_KEY"  # placeholder: replace with your own key
url = ""
def Generate_file(symbol="IBM", interval="1min"):
    # URL parameters
    parameters = {"function": "TIME_SERIES_INTRADAY",
                  "symbol": symbol,
                  "interval": interval,
                  "apikey": API_KEY,
                  "outputsize": "compact"}
    # get the JSON response from AlphaVantage
    response = requests.get(url, params=parameters)
    data = response.json()
    # filter the response to only get the time series data we want
    time_series_interval = f"Time Series ({interval})"
    prices = data[time_series_interval]
    # convert the filtered response to a Pandas DataFrame
    df = pd.DataFrame.from_dict(prices, orient="index").reset_index()
    df = df.rename(columns={"index": time_series_interval})
    # create a timestamp for our Excel file, so that the file is not overwritten with new data each time
    current_time = datetime.datetime.now()
    file_timestamp = current_time.strftime("%Y%m%d_%H.%M")
    filename = f"livedat_{file_timestamp}.xlsx"
    # write the DataFrame to the Excel file
    df.to_excel(filename, index=False)
# set a limit on the number of calls we make to prevent an infinite loop
call_limit = 3
number_of_calls = 0
while number_of_calls < call_limit:
    Generate_file()  # our function
    number_of_calls += 1
    time.sleep(60)  # wait one minute so the next call gets a new timestamp
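The timestamp in the filename is what keeps each export separate. A minimal sketch of just that step follows; `timestamped_filename` is a hypothetical helper, and the optional `now` argument exists only so the result can be checked against a fixed time.

```python
import datetime


def timestamped_filename(prefix="livedat", now=None):
    """Build an .xlsx filename that embeds the time down to the minute."""
    if now is None:
        now = datetime.datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H.%M')}.xlsx"


# A fixed time makes the result reproducible.
fixed = datetime.datetime(2021, 3, 5, 9, 7)
print(timestamped_filename(now=fixed))  # livedat_20210305_09.07.xlsx
```

Because the format stops at minutes, two calls within the same minute produce the same name and the later file replaces the earlier one. Adding `%S` to the format string is one way to avoid that.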

LIKE operator working on AWS lambda function but not =

I have a small CSV file that looks like this:
FALSE,b9bb80,Male,Mathematical sciences,Birmingham
FALSE,8e552d,Female,Computer science,Swansea
TRUE,2bc475,Male,Engineering & technology,Aston
TRUE,c3ac8d,Female,Mathematical sciences,Heriot-Watt
FALSE,ceb2fa,Female,Mathematical sciences,Imperial College London
The following lambda function is used to query from an s3bucket.
import boto3
import os
import json
def lambda_handler(event, context):
    KEY = 'bar/data.csv'
    s3 = boto3.client('s3','eu-west-1')
    response = s3.select_object_content(
        Key = KEY,
        ExpressionType = 'SQL',
        Expression = 'Select count(*) from s3object s where s.gender like \'%Female%\'',
        InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization = {'JSON': {}},
    )
    # collect the query result from the event stream
    for i in response['Payload']:
        if 'Records' in i:
            query_result = i['Records']['Payload'].decode('utf-8')
Now, this works great as I get back a result of 3.
But for some reason the same code does not work when I change the LIKE operator to =. The result drops to 0, so no match is found. What's happening here?
So I found the problem: the items in the last column were followed by a newline character, which the AWS S3 interpreter did not handle. So a university name was really not Swansea but Swansea\n.
So = 'Swansea' does not work; however, LIKE 'Swansea%' does work, and is still a sargable expression.
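The effect can be reproduced in plain Python. The sketch below is not S3 Select; it only mimics the two comparisons on made-up field values that keep a trailing newline, and the helper names are hypothetical.

```python
# Made-up values for the last column, each still carrying its line ending.
universities = ["Birmingham\n", "Swansea\n", "Aston\n"]


def count_equal(values, target):
    """Mimic `= 'target'`: the whole value must match exactly."""
    return sum(1 for value in values if value == target)


def count_prefix(values, prefix):
    """Mimic `LIKE 'prefix%'`: only the start of the value must match."""
    return sum(1 for value in values if value.startswith(prefix))


def count_equal_stripped(values, target):
    """Exact match after removing trailing line-ending characters."""
    return sum(1 for value in values if value.rstrip("\r\n") == target)


print(count_equal(universities, "Swansea"))           # exact match fails
print(count_prefix(universities, "Swansea"))          # prefix match succeeds
print(count_equal_stripped(universities, "Swansea"))  # match after stripping
```

Stripping the line ending before comparing is the general fix. Whether that is done by cleaning the file or inside the query depends on the setup.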

Unable to load a CSV file as a DataFrame in Spark

I am trying to load a CSV file into a data frame, and my objective is to use the first row of the CSV file as the column names. But with the code below I get the error
Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
def main(args : Array[String]): Unit = {
val spark : SparkSession = SparkSession
.config("spark.some.config.option", "some-value")
val df =
.option("header", "true") //reading the headers
But I'm able to load the file with the code:
val df = spark.sparkContext
.map(line => line.split(","))
// .map(line =>
With the second snippet the file loads successfully, but the first row is not used as the column names of the data frame.
spark version is: spark-2.3.2
scala 2.11.3
Thanks to anyone who can help me with this.
java.lang.AbstractMethodError almost always means that the libraries on the classpath at runtime differ from the ones used at compilation time. In this case I would check that you have the correct version of Scala (and only one version of Scala) on the classpath.

Custom SPARQL functions in rdflib

What is a good way to hook a custom SPARQL function into rdflib?
I have been looking around in rdflib for an entry point for custom functions. I found no dedicated entry point but found that rdflib.plugins.sparql.CUSTOM_EVALS might be a place to add the custom function.
So far I have made an attempt with the code below. It seems "dirty" to me. I am calling a "hidden" function (_eval) and I am not sure I got all the argument updating correct. Beyond the example code (which forms the basis for my code) I found little other code or documentation about CUSTOM_EVALS.
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
NAMESPACE = Namespace('//custom/')
LENGTH = rdflib.term.URIRef(NAMESPACE + 'length')
def customEval(ctx, part):
    """Evaluate custom function."""
    if part.name == 'Extend':
        cs = []
        for c in evalPart(ctx, part.p):
            if hasattr(part.expr, 'iri'):
                # A function
                argument = _eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars))
                if part.expr.iri == LENGTH:
                    e = Literal(len(argument))
                else:
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                # Not a function call: evaluate the expression as usual
                e = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(e, SPARQLError):
                    raise e
            cs.append(c.merge({part.var: e}))
        return cs
    # Let rdflib handle every other part of the query
    raise NotImplementedError()
QUERY = """
PREFIX custom: <%s>
SELECT ?s ?length WHERE {
  BIND("Hello, World" AS ?s)
  BIND(custom:length(?s) AS ?length)
}
""" % (NAMESPACE,)
rdflib.plugins.sparql.CUSTOM_EVALS['exampleEval'] = customEval
for row in rdflib.Graph().query(QUERY):
    print(row)
So first off, I want to thank you for showing how you implemented a new SPARQL function.
Secondly, by using your code I was able to create a SPARQL function that evaluates two strings by using the Levenshtein distance. It has been really insightful and I wish to share it for it holds additional documentation that could help other developers creating their own custom SPARQL functions.
# Import needed to introduce new SPARQL function
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
# Import for custom function calculation
from Levenshtein import distance as levenshtein_distance # python-Levenshtein==0.12.2
def SPARQL_levenshtein(ctx:object, part:object) -> object:
    """
    The first two variables retrieved from a SPARQL-query are compared using the Levenshtein distance.
    The distance value is then stored in a Literal object and added to the query results.

    PREFIX custom: //custom/ # Note: this part references the custom function
    SELECT ?label1 ?label2 ?levenshtein WHERE {
      BIND("Hello" AS ?label1)
      BIND("World" AS ?label2)
      BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
    }

    ?label1 ?label2
    levenshtein_distance(?label1, ?label2) = distance
    Save distance in Literal object.

    :param ctx: <class 'rdflib.plugins.sparql.sparql.QueryContext'>
    :param part: <class 'rdflib.plugins.sparql.parserutils.CompValue'>
    :return: <class 'rdflib.plugins.sparql.processor.SPARQLResult'>
    """
    # This part holds basic implementation for adding new functions
    if part.name == 'Extend':
        cs = []
        # Information is retrieved and stored and passed through a generator
        for c in evalPart(ctx, part.p):
            # Checks if the function holds an internationalized resource identifier
            # This will check if any custom functions are added.
            if hasattr(part.expr, 'iri'):
                # From here the real calculations begin.
                # First we get the variable arguments, for example ?label1 and ?label2
                argument1 = str(_eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars)))
                argument2 = str(_eval(part.expr.expr[1], c.forget(ctx, _except=part.expr._vars)))
                # Here it checks if it can find our levenshtein IRI (example: //custom/levenshtein)
                # Please note that IRI and URI are almost the same.
                # Earlier this has been defined with the following:
                # namespace = Namespace('//custom/')
                # levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
                if part.expr.iri == levenshtein:
                    # After finding the correct path for the custom SPARQL function the evaluation can begin.
                    # Here the levenshtein distance is calculated using ?label1 and ?label2 and stored as a Literal object.
                    # This object is then stored as an output value of the SPARQL-query (example: ?levenshtein)
                    evaluation = Literal(levenshtein_distance(argument1, argument2))
                else:
                    # Standard error handling and return statements
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                evaluation = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(evaluation, SPARQLError):
                    raise evaluation
            cs.append(c.merge({part.var: evaluation}))
        return cs
    raise NotImplementedError()
namespace = Namespace('//custom/')
levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
query = """
PREFIX custom: <%s>
SELECT ?label1 ?label2 ?levenshtein WHERE {
  BIND("Hello" AS ?label1)
  BIND("World" AS ?label2)
  BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
""" % (namespace,)
# Save custom function in custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
for row in rdflib.Graph().query(query):
    print(row)
To answer your question: "What is a good way to hook a custom SPARQL function into rdflib?"
Currently I'm developing a class that handles RDF data, and I believe it might be best to put the following code into the __init__ function.
For example:
class ClassName():
    def __init__(self):
        # Save custom function in custom evaluation dictionary.
        rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
Please note, this SPARQL function will only work for the endpoint on which it is implemented. Even though the SPARQL syntax in the query is correct, it is not possible to apply the function in SPARQL queries sent to databases like DBpedia. The DBpedia endpoint does not support this custom function (yet).
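For readers without the python-Levenshtein package, the distance that the custom function returns can be sketched in plain Python. This is a textbook dynamic-programming version written for illustration, not the package's implementation, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("Hello", "World"))  # 4
```

Only one previous row is kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.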

NiFi: Remove fixed number of header lines from file

I'm processing a file and I'd like to remove (trim) the first X header lines to keep only data, possibly avoiding using regular expressions.
You can remove the first X header lines by using the ExecuteScript processor in NiFi.
The following is an example Jython script which I wrote for myself:
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        # read all lines, then write back everything after the first 3
        text = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
        for line in text[3:]:
            outputStream.write(line + "\n")
flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile,PyStreamCallback())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0]+'_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
This obviously removes the first 3 lines, but you can easily modify it to remove more or fewer lines.
Hope that helps.
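Outside NiFi, the same idea can be sketched in plain Python with the number of header lines as a parameter. This is an illustration on in-memory text streams, not an ExecuteScript body, and `skip_header_lines` and `header_lines` are hypothetical names. It copies line by line instead of reading the whole file into memory first.

```python
import io
from itertools import islice


def skip_header_lines(in_stream, out_stream, header_lines=3):
    """Copy in_stream to out_stream, dropping the first header_lines lines."""
    for line in islice(in_stream, header_lines, None):
        out_stream.write(line)


# Made-up input: three header lines followed by two data rows.
source = io.StringIO("h1\nh2\nh3\nrow1\nrow2\n")
target = io.StringIO()
skip_header_lines(source, target)
print(target.getvalue())
```

Passing a different `header_lines` value changes how many lines are dropped, and a file with fewer lines than that simply yields empty output.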