How to ignore other instances of a word when looping it via pandas.groupby.agg? - pandas

I have code (see below) that I use to count word occurrences per Location. My problem is that it counts every instance of the word, including words that merely contain it.
For example, this is the output I was hoping for, but the code below counted all occurrences of 'help', including 'helping' and 'helped':
tidytext2 | Location | occurrences
she used to help me | Aus | 1
help is on the way | UK | 1
Helping is a kind gift | UK | 0
She helped me when I needed it | Japan | 0
Why dont u help me? | SA | 1
Help me! Im hungry help | Rwanda | 2
words = [i[0] for i in pos_freq.most_common()]
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(i)
funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
I got pos_freq.most_common() using the following code:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string

def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens

tokenizer = TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct
pos_lines = list(positivedf.tidytext2)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
[('help', 7)]

You need to use a regex for this, so that only whole (whitespace-delimited) words are counted:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?<!\S)' + i + r'(?!\S)')
If you want the match to be case-insensitive:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')
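For reference, here is a minimal end-to-end sketch on the sample data from the question (the frame and column names follow the question, and the case-insensitive pattern is the one from this answer); it reproduces the expected per-row counts and then sums them per Location:
import pandas as pd

positivedf = pd.DataFrame({
    'tidytext2': ['she used to help me', 'help is on the way',
                  'Helping is a kind gift', 'She helped me when I needed it',
                  'Why dont u help me?', 'Help me! Im hungry help'],
    'Location': ['Aus', 'UK', 'UK', 'Japan', 'SA', 'Rwanda'],
})

words = ['help']
for i in words:
    # count only standalone, case-insensitive occurrences of each word
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')

# per-row counts: 1, 1, 0, 0, 1, 2 -- 'Helping' and 'helped' are ignored
groupedpos = positivedf.groupby('Location').agg({i: 'sum' for i in words})
print(groupedpos)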

Related

fbprophet at scale with applyInPandas resulting in unexpected count values [PySpark]

I am using applyInPandas to implement a forecast function over sampled data, using groupBy on ID. The end goal is to calculate the MAPE for each ID.
def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:
    anonym_cis = history_pd.at[0, 'ID']

    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )

    # fit the model
    model.fit(history_pd)

    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )

    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis
    # . . .

    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]

results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
)
I am getting the expected prediction results. However, when I count the number of rows for each ID in the input data and the output data, they don't match. I would like to know where/how these extra 30 (292 - 262) rows per ID are getting created in the process.
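The two count tables below were presumably produced with something like this (a hedged guess, not shown in the original post):
fr_sample.groupBy('ID').count().show()   # rows per ID in the input data  -> 262
results.groupBy('ID').count().show()     # rows per ID in the output data -> 292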
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 262|
| 482769| 262|
| 483946| 262|
| 484124| 262|
| 484364| 262|
| 485103| 262|
+----------+-----+
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 292|
| 482769| 292|
| 483946| 292|
| 484124| 292|
| 484364| 292|
| 485103| 292|
+----------+-----+
Note:
This is how I am calculating the MAPE at the moment; it is not per ID but over all the data, hence it results in a single value (e.g. 1.4382).
from datetime import date
import pandas as pd
# assuming scikit-learn's implementation of MAPE is the one being used
from sklearn.metrics import mean_absolute_percentage_error

def gr_mape_val(pd_sample_df, result_df):
    result_df = result_df.toPandas()
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape
To do this per ID in a groupBy, I need the two row counts mentioned above to match, but I can't figure out how.
I just found out what was going on there:
Basically, with make_future_dataframe I am creating 30 extra data points per ID (the 262 history dates plus 30 future dates = 292), which changes the total row count of predicted_pd.
This can simply be solved by using df.na.drop() after an outer join:
pd_sample_df.join(result_df, on=['ID', 'ds'], how='outer').na.drop()
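Building on that, here is a minimal sketch (not from the original post) of computing the MAPE per ID once actuals and predictions are aligned. It assumes pd_sample_df is a pandas frame with columns ID, ds and y, that result_df is the Spark prediction DataFrame (results above), and that scikit-learn's mean_absolute_percentage_error is used:
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

# align actuals and predictions; the inner merge drops the 30 future dates
# per ID, which only exist on the prediction side
result_pdf = result_df.toPandas()
merged = pd_sample_df.merge(result_pdf, on=['ID', 'ds'], how='inner')

# one MAPE value per ID instead of a single global value
mape_per_id = merged.groupby('ID').apply(
    lambda g: mean_absolute_percentage_error(g['y'], g['yhat'])
)
print(mape_per_id)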

Spark - scan data frame based on a value

I'm trying to find a column (I do not know the name of the column) based on a value. For example, in the dataframe below, I'd like to know which row has a column containing yellow for Category = A. The thing is, I don't know the column name (colour) in advance, so I couldn't do select * where Category = 'A' and colour = 'yellow'. How can I scan the columns and achieve this? Many thanks for your help.
+--------+-------+------+
|Category|colour |name  |
+--------+-------+------+
|A       |blue   |Elmo  |
|A       |yellow |Alex  |
|B       |desc   |Erin  |
+--------+-------+------+
You can loop that check over the list of column names. You can also wrap the loop in a function for readability (see the sketch after the code below). Please note that the per-column checks run in sequence.
from pyspark.sql import functions as F

cols = df.columns
for c in cols:
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
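As mentioned above, here is a hedged sketch of wrapping the same check in a helper; find_columns_with_value is a made-up name, and df is assumed to be the dataframe from the question:
from pyspark.sql import functions as F

def find_columns_with_value(df, category, value):
    # return the names of all columns that contain `value` in at least
    # one row where Category == `category`
    matches = []
    for c in df.columns:
        cnt = df.where((F.col('Category') == category) & (F.col(c) == value)).count()
        if cnt > 0:
            matches.append(c)
    return matches

# e.g. find_columns_with_value(df, 'A', 'yellow') should return ['colour']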

Splitting a composite variable into two variables

I have a string variable called country with a value which can be, for example, Afghanistan2008, but it can also be Brasil2012. I would like to create two new variables, one holding the country part and one the year part.
Because the string always ends in four digits, I know the position it should be split at from the right side, but not from the left side.
Could I use something like:
gen(substr("country",-4,.))
If not, could anyone tell me how to split an entire column of such variables into a country and a year variable? I would also like to keep the original variable.
You can use a regular expression:
clear
set obs 2
generate string = ""
replace string = "Afghanistan2008" in 1
replace string = "Brasil2012" in 2
generate country = regexs(0) if regexm(string, "[a-zA-Z]+")
generate year = regexs(1) + regexs(2) if regexm(string, "(19|20)([0-9][0-9])")
list
+--------------------------------------+
| string country year |
|--------------------------------------|
1. | Afghanistan2008 Afghanistan 2008 |
2. | Brasil2012 Brasil 2012 |
+--------------------------------------+
Type help regex in Stata's command prompt for more information.
Alternatively you could do the following:
generate len = length(string) - 3
generate country2 = substr(string, 1, len - 1)
generate year2 = substr(string, len, .)
list country2 year2
+---------------------+
| country2 year2 |
|---------------------|
1. | Afghanistan 2008 |
2. | Brasil 2012 |
+---------------------+
For my specific situation the following makes a new year variable:
gen spyear = real(substr(country, -4, .))
I took the other part from @PearlySpencer:
generate len = length(country) - 3
generate spcountry = substr(country, 1, len - 1)
which creates an extra variable (len) that then needs to be dropped.
EDIT (Nick Cox) This can be simplified to
gen spyear = real(substr(country, -4, 4))
gen spcountry = substr(country, 1, length(country) - 4)
showing that
There is no need to create a variable containing the string length.
The puzzling split 4 = 3 + 1 is not needed either.

How to apply a custom filtering function on a Spark DataFrame

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the Cartesian product of A_DF and B_DF, e.g.:
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then keep just the records that have at least two of the substrings in A_DF("concatCSV") (e.g. StringA, StringB) appearing in B_DF("triplet"), i.e. use a filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step--as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
  .join(B_DF, "right")
  .where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried filter instead of where, but still no luck. Also, for some strange reason, the inferred type of cartesianRDD is SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete, as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.
CROSS JOIN is implemented in Hive, so you could first do the cross join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should be really a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
Then you can filter down to your expected output using two UDFs: one that converts your string into an array of words, and a second one that gives the length of the intersection of the resulting array column and the existing triplet column:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
val commonLen = udf { (a: WrappedArray[String],
                       b: WrappedArray[String]) => a.intersect(b).length }

val temp = (result
  .withColumn("concatArr", splitArr(col("concatCSV")))
  .select(col("*"), commonLen(col("triplet"), col("concatArr")).alias("comm"))
  .filter(col("comm") >= 2)
  .drop("comm")
  .drop("concatArr"))

temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+
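For comparison, here is a hedged PySpark sketch of the same idea (not part of the original answer): cross join, split concatCSV into an array, and keep the rows whose intersection with triplet has at least two elements. It assumes Spark 2.4+ DataFrames named A_DF and B_DF:
from pyspark.sql import functions as F

# cartesian product of the two frames
crossed = A_DF.crossJoin(B_DF)

filtered = (
    crossed
    # "StringD, StringB, ..." -> ["StringD", "StringB", ...]
    .withColumn("concatArr", F.split(F.col("concatCSV"), r",\s*"))
    # keep rows sharing at least two strings with the triplet column
    .where(F.size(F.array_intersect(F.col("concatArr"), F.col("triplet"))) >= 2)
    .drop("concatArr")
)
filtered.show()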

How can I efficiently create unique relationships in Neo4j?

Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same "neighborhood" name, but have each uniquely point to the particular city in which it resides.
As encouraged in user2194039's answer, I am using the following index:
CREATE INDEX ON :Neighborhood(name)
Also, I have the following constraint:
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
The following code fails to create unique relationships, and takes an excessively long period of time:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});
Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).
Profile with LIMIT 10,000:
+--------------+-------+--------+----------------------------------+-------------------------+
| Operator     |  Rows | DbHits | Identifiers                      | Other                   |
+--------------+-------+--------+----------------------------------+-------------------------+
| EmptyResult  |     0 |      0 |                                  |                         |
| UpdateGraph  |  9750 |   3360 | anon[307], b, neighborhood, line | MergePattern            |
| SchemaIndex  |  9750 |  19500 | b, line                          | line.City; :City(name)  |
| ColumnFilter |  9750 |      0 | line                             | keep columns line       |
| Filter       |  9750 |      0 | anon[220], line                  | anon[220]               |
| Extract      | 10000 |      0 | anon[220], line                  | anon[220]               |
| Slice        | 10000 |      0 | line                             | { AUTOINT0}             |
| LoadCSV      | 10000 |      0 | line                             |                         |
+--------------+-------+--------+----------------------------------+-------------------------+
Total database accesses: 22860
Following Guilherme's recommendation below, I implemented the helper, yet it is raising the error py2neo.error.Finished. I've searched the documentation and wasn't able to determine a workaround for this. It looks like there's an open SO post about this exception.
def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
main:
queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
# city_neighborhood_map is a defaultdict that maps city -> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c += 1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]
print c
if queries:
    s = time.time()*1000
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
If you want to create unique relationships you have 2 options:
Prevent the path from being duplicated, using MERGE, just like @user2194039 suggested. I think this is the simplest and best approach you can take.
Turn your relationship into a node, and create a unique constraint on it (a hedged sketch of this follows below), but it's hardly necessary in most cases.
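For completeness, here is a hedged, hypothetical sketch of option 2 (not from the original answer): model the relationship as its own node with a composite key and a uniqueness constraint on it. The Membership label and key property are made-up names, the key is built client-side, and the statement is sent through the run_batch_query helper shown further down:
# run once, e.g. in the Neo4j browser, before importing:
#   CREATE CONSTRAINT ON (m:Membership) ASSERT m.key IS UNIQUE
statement = (
    "MERGE (c:City {name: {city}}) "
    "MERGE (n:Neighborhood {name: {neighborhood}}) "
    "MERGE (m:Membership {key: {key}}) "
    "MERGE (c)<-[:IN]-(m)<-[:OF]-(n)"
)
# composite key computed client-side so the same (city, neighbourhood)
# pair can never be linked twice
key = "{0}|{1}".format("Boston", "3")
queries = [(statement, {"city": "Boston", "neighborhood": "3", "key": key})]
run_batch_query(queries)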
If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through LOAD CSV in 2.2.1, and it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process and probably gain a performance boost.
I managed to import 1M rows in just under 11 minutes.
I used an INDEX for Neighbourhood(name) and a unique constraint for City(name).
Give it a try and see if it works for you.
Edit:
The transactional endpoint is a RESTful endpoint that allows you to execute transactions in batches. You can read about it here.
Basically, it allows you to stream a bunch of queries to the server at once.
I don't know what programming language/stack you're using, but in python, using a package like py2neo, it would be something like this:
with open("city.csv", "r") as fp:
reader = csv.reader(fp)
queries = []
template = ["MERGE (c :`City` {name: {city}})",
"MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
for row in reader:
city, neighborhood = row
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
if c % batch == 0:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
c += 1
if queries:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
Helper functions:
def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph(uri)  # "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
You can monitor how long each transaction takes, and tweak the number of queries per transaction, as well as the timeout.
To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speeds on even small databases.
CREATE INDEX ON :Neighborhood(name)
And for the import:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})
If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
Best of luck!