Creating a .p file from an InvertedIndex - indexing

I am trying to construct a Python search engine that will pull queried tweets from the Twitter API. I cannot use tweepy. I have constructed the classes TwitterWrapper and Engine, and have also created the (pickled) corpus from my query term of "raptor" with no problem. I have created the class InvertedIndex but cannot seem to construct the index of lemmatized words and save it alongside my Jupyter notebook as a pickle file (.p) (which I must do) in order to use my TwittIR.py interface:
t = TwittIR.Engine("raptor.p", "index (to be named).p")
results = t.query("raptor")
for result in results:
    print(result)
So, my question is: how do I make an index (using my inverted index) and save it next to my Jupyter notebook as a .p file? I am, of course, somewhat of a Python beginner.
def build_index(self, corpus):
    index = {}
    if self.lemmatizer is None:
        self.lemmatizer = nltk.WordNetLemmatizer()
    if self.stop_words is None:
        self.stop_words = [self.wordtotoken(word) for word in
                           nltk.corpus.stopwords.words('english')]
    for doc in corpus:
        self.add_document(doc['text'], self.ndocs)
        self.ndocs = self.ndocs + 1

def __getitem__(self, index):
    return self.index[self.normalize_document(index)[0]]
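For what it's worth, here is a minimal sketch of how such an index could be built and then pickled so the notebook (or TwittIR.py) can load it later. It assumes build_index populates a self.index attribute and that the corpus pickle is a list of tweet dicts, as suggested by the code above; the output file name raptor_index.p is just an example:

import pickle

# Load the pickled corpus created from the "raptor" query.
with open('raptor.p', 'rb') as f:
    corpus = pickle.load(f)

# Build the inverted index from the corpus.
idx = InvertedIndex()
idx.build_index(corpus)

# Save the index (or the whole InvertedIndex object) next to the notebook.
with open('raptor_index.p', 'wb') as f:
    pickle.dump(idx, f)

# The engine can then be constructed from both pickle files:
# t = TwittIR.Engine('raptor.p', 'raptor_index.p')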

Related

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Set up a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label='ENVO') for match_id, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are two entries for the 'tissue culture' phrase in the ENVO database, so the matcher produces overlapping spans for the same tokens.
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late by now but, to complement Sofie VL's answer a little, and for anyone who might still be interested: what I (another spaCy newbie) have done to get rid of overlapping spans goes as follows:
import spacy
from spacy.util import filter_spans

# [Code to obtain 'ents']...
# 'ents' should be a list of Span objects whose texts may overlap, e.g.
# spans covering "Carolina" and "North Carolina"
pat_orig = len(ents)
filtered = filter_spans(ents)  # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered

print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
It is important to mention that I am using the most recent version of spaCy as of this date, v3.1.1. Additionally, this only works if you do not mind overlapping spans being removed; if you do mind, then you might want to give this thread a look. More info regarding filter_spans here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.
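For concreteness, here is a minimal sketch of option 1, reusing the component structure from the question. The patterns are illustrative, and filter_spans (which keeps the longest, earliest span when spans overlap) is only one possible conflict policy; option 2 is shown as the commented-out doc.spans line:

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
# Two overlapping patterns, standing in for the duplicate ENVO entries
matcher.add("ENVO", [nlp.make_doc("tissue culture"), nlp.make_doc("culture")])

@Language.component("envo_extractor_v2")
def envo_extractor_v2(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ENVO") for match_id, start, end in matches]
    # Option 1: resolve overlaps before assigning, so no token belongs to two entities.
    doc.ents = filter_spans(spans)
    # Option 2: keep every candidate, overlaps included, in a span group instead:
    # doc.spans["envo"] = spans
    return doc

nlp.add_pipe("envo_extractor_v2", after="ner")
doc = nlp("tissue culture")
print([(ent.text, ent.label_) for ent in doc.ents])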

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd

for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying to do this I realized I have no idea how to change the variable being assigned to on each pass of the loop.
That's what prompts my question. I managed to find another way to get my original job done, but this issue of assigning to a different variable on each iteration is something I haven't been able to find clear answers on.
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib, which was added in Python 3.4+, instead of os.
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of csv paths
# use Path(your_path) instead of Path.cwd() if the script is in a different location

dfs = {}  # let's hold the csvs in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path('location/of/your/files').glob('*.csv')}
This will return a dictionary of dataframes keyed by the csv file name; .stem gives the name without the extension. Much like:
{
    'csv_1': dataframe,
    'csv_2': dataframe
}
If you want to concat these into a single dataframe, do
df = pd.concat(dfs)
The index will then be the csv file name (as the first level of a MultiIndex).
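Since the original goal was just the first row of each file, here is a small sketch of that variant using the same pathlib approach; the folder path is a placeholder and nrows=1 grabs only the first data row of each csv:

import pandas as pd
from pathlib import Path

folder = Path('path/to/your/csvs')  # adjust to your folder
first_rows = {f.stem: pd.read_csv(f, nrows=1) for f in folder.glob('*.csv')}

# One dataframe with one row per file, indexed by the file name:
summary = pd.concat(first_rows).droplevel(1)
print(summary.head())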

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
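For context, the loading pattern described above looks roughly like the sketch below. File and column names are placeholders; note that pandas also accepts a list for data_columns, which makes only those columns individually queryable rather than all or none:

import glob
import pandas as pd

csv_files = glob.glob('csv_dir/*.csv')   # ~6000 files in the real case
for csv in csv_files:
    df = pd.read_csv(csv)
    df.to_hdf('tbl_main.h5', key='tbl_main', mode='a', append=True, format='table',
              data_columns=['str_col1', 'str_col2'],        # only these become queryable data columns
              min_itemsize={'str_col1': 64, 'str_col2': 64})  # placeholder string sizes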
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
1. Use a loop to read the CSV files with np.genfromtxt into a NumPy array.
2. After reading the first CSV file, write the data with the .create_table() method, referencing the array created in Step 1.
3. For additional CSV files, write the data with the .append() method, referencing the array created in Step 1.
4. End of loop.
Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert it to a datetime object. Note the changes to the genfromtxt() arguments! The data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime

csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float), ('d', float), ('e', 'S20')])

with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
        # modify date in fifth field 'e'
        for row in csv_data:
            datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        if h5f.__contains__('/CSV_Data'):
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()

h5f.close()
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999
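To sanity-check the result, here is a short sketch of reading the table back with pytables and pandas; the condition on my_float is only an example. read_where works here because the columns were written as named fields rather than values_block_* blocks:

import tables as tb
import pandas as pd

with tb.open_file('SO_56387241.h5', mode='r') as h5f:
    table = h5f.root.CSV_Data
    print(table.colnames)                      # real column names, not values_block_*
    rows = table.read_where('my_float > 5.0')  # returns a structured NumPy array
    df = pd.DataFrame.from_records(rows)
    print(df.head())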

Python Dynamic Test Plan generation

I am using Sphinx for documentation and pytest for testing.
I need to generate a test plan but I really don't want to generate it by hand.
It occurred to me that a neat solution would be to embed test metadata in the tests themselves, within their respective docstrings. This metadata would include things like % complete, time remaining, etc. I could then run through all of the tests (which would at this point include mostly placeholders) and generate a test plan from them. This would then guarantee that the test plan and the tests themselves stay in sync.
I was thinking of making either a pytest plugin or a sphinx plugin to handle this.
Using pytest, the closest hook I can see looks like pytest_collection_modifyitems which gets called after all of the tests are collected.
Alternatively, I was thinking of using Sphinx and perhaps copying/modifying the todolist plugin, as it seems like the closest match to this idea. Its output would be more useful, as it would slot nicely into the existing Sphinx-based docs I have, though there is a lot going on in that plugin and I don't really have the time to invest in understanding it.
The docstrings could have something like this within it:
:plan_complete: 50 #% indicator of how complete this test is
:plan_remaining: 2 #the number of hours estimated to complete this test
:plan_focus: something #what is the test focused on testing
The idea is to then generate a simple markdown/rst or similar table based on the function's name, docstring and embedded plan info and use that as the test plan.
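For example, a placeholder test carrying this metadata might look like the following (the test name and field values are made up; the field meanings follow the comments above):

import pytest

def test_dll_call_returns_status_code():
    """Verify that the DLL call returns a valid status code.

    :plan_complete: 50 #% indicator of how complete this test is
    :plan_remaining: 2 #the number of hours estimated to complete this test
    :plan_focus: DLL call #what is the test focused on testing
    """
    pytest.skip("placeholder - not implemented yet")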
Does something like this already exist?
In the end I went with a pytest based plugin as it was just so much simpler to code.
If anyone else is interested, below is the plugin:
"""Module to generate a test plan table based upon metadata extracted from test
docstrings. The test description is extracted from the first sentence or up to
the first blank line. The data which is extracted from the docstrings are of the
format:
:test_remaining: 10 #number of hours remaining for this test to be complete. If
not present, assumed to be 0
:test_complete: #the percentage of the test that is complete. If not
present, assumed to be 100
:test_focus: The item the test is focusing on such as a DLL call.
"""
import pytest
import re
from functools import partial
from operator import itemgetter
from pathlib import Path

whitespace_re = re.compile(r'\s+')
cut_whitespace = partial(whitespace_re.sub, ' ')
plan_re = re.compile(r':plan_(\w+?):')
plan_handlers = {
    'remaining': lambda x: int(x.split('#')[0]),
    'complete': lambda x: int(x.strip().split('#')[0]),
    'focus': lambda x: x.strip().split('#')[0]
}
csv_template = """.. csv-table:: Test Plan
   :header: "Name", "Focus", "% Complete", "Hours remaining", "description", "path"
   :widths: 20, 20, 10, 10, 60, 100

{tests}

Overall hours remaining: {hours_remaining:.2f}

Overall % complete: {complete:.2f}
"""


class GeneratePlan:
    def __init__(self, output_file=Path('test_plan.rst')):
        self.output_file = output_file

    def pytest_collection_modifyitems(self, session, config, items):
        #breakpoint()
        items_to_parse = {i.nodeid.split('[')[0]: i for i in self.item_filter(items)}
        #parsed = map(parse_item, items_to_parse.items())
        parsed = [self.parse_item(n, i) for (n, i) in items_to_parse.items()]
        complete, hours_remaining = self.get_summary_data(parsed)
        self.output_file.write_text(csv_template.format(
            tests='\n'.join(self.generate_rst_table(parsed)),
            complete=complete,
            hours_remaining=hours_remaining))

    def item_filter(self, items):
        return items  # override me

    def get_summary_data(self, parsed):
        completes = [p['complete'] for p in parsed]
        overall_complete = sum(completes) / len(completes)
        overall_hours_remaining = sum(p['remaining'] for p in parsed)
        return overall_complete, overall_hours_remaining

    def generate_rst_table(self, items):
        "Use CSV type for simplicity"
        sorted_items = sorted(items, key=lambda x: x['name'])
        quoter = lambda x: '"{}"'.format(x)
        getter = itemgetter(*'name focus complete remaining description path'.split())
        for item in sorted_items:
            yield 3 * ' ' + ', '.join(map(quoter, getter(item)))

    def parse_item(self, path, item):
        "Process a pytest provided item"
        data = {
            'name': item.name.split('[')[0],
            'path': path.split('::')[0],
            'description': '',
            'remaining': 0,
            'complete': 100,
            'focus': ''
        }
        doc = item.function.__doc__
        if doc:
            desc = self.extract_description(doc)
            data['description'] = desc
            plan_info = self.extract_info(doc)
            data.update(plan_info)
        return data

    def extract_description(self, doc):
        first_sentence = doc.split('\n\n')[0].replace('\n', ' ')
        return cut_whitespace(first_sentence)

    def extract_info(self, doc):
        plan_info = {}
        for sub_str in doc.split('\n\n'):
            cleaned = cut_whitespace(sub_str.replace('\n', ' '))
            splitted = plan_re.split(cleaned)
            if len(splitted) > 1:
                i = iter(splitted[1:])  # splitter starts at index 1
                while True:
                    try:
                        key = next(i)
                        val = next(i)
                    except StopIteration:
                        break
                    assert key
                    if key in plan_handlers:
                        plan_info[key] = plan_handlers[key](val)
        return plan_info
From my conftest.py file, I have a command line argument configured within a pytest_addoption function:
parser.addoption('--generate_test_plan', action='store_true', default=False, help="Generate test plan")
And I then configure the plugin within this function:
def pytest_configure(config):
    output_test_plan_file = Path('docs/source/test_plan.rst')

    class CustomPlan(GeneratePlan):
        def item_filter(self, items):
            return (i for i in items if 'tests/hw_regression_tests' in i.nodeid)

    if config.getoption('generate_test_plan'):
        config.pluginmanager.register(CustomPlan(output_file=output_test_plan_file))
        #config.pluginmanager.register(GeneratePlan())
Finally, in one of my sphinx documentation source files I then just include the output rst file:
Autogenerated test_plan
=======================

The below test_data is extracted from the individual tests in the suite.

.. include:: test_plan.rst
We have done something similar in our company by using Sphinx-needs and Sphinx-Test-Reports.
Inside a test file we use the docstring to store our test case, including its metadata:
def my_test():
    """
    .. test:: My test case
       :id: TEST_001
       :status: in progress
       :author: me

       This test case checks for **awesome** stuff.
    """
    a = 2
    b = 5
    # ToDo: check if a + b == 7
Then we document the test cases by using autodoc.

My tests
========

.. automodule:: test.my_tests
   :members:

This results in some nice test-case objects in Sphinx, which we can filter, link and present in tables and flowcharts. See Sphinx-Needs.
With Sphinx-Test-Reports we load the results into the docs as well:

.. test-report:: My Test report
   :id: REPORT_1
   :file: ../pytest_junit_results.xml
   :links: [[tr_link('case_name', 'signature')]]

This will create objects for each test case, which we can also filter and link.
Thanks to tr_link, the result objects get automatically linked to the test case objects.
After that we have all the needed information in Sphinx and can use e.g. .. needtable:: to get custom views on it, as in the sketch below.
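A small sketch of what such a table definition might look like; the option names follow the Sphinx-Needs needtable directive, and the title, columns and filter expression are only examples:

.. needtable:: Test cases in progress
   :types: test
   :columns: id;title;status;author
   :filter: status == "in progress"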

Python: Accessing indices of iterable passed into `Pool.map(function(), iterable)`

I have code that looks something like this:
def downloadImages(self, path):
    for link in self.images:  # self.images = list of links opened via requests & written to file
        if stuff in link:
            # parse link a certain way to get `name`
        else:
            # parse link a different way to get `name`
        r = requests.get(link)
        with open(path + name, 'wb') as f:
            f.write(r.content)

pool = Pool(2)
scraper.prepPage(url)
scraper.downloadImages('path/to/directory')
I want to change downloadImages to parallelize the function. Here's my attempt:
def downloadImage(self, path, index of self.images currently being processed):
    if stuff in link:  # because of multiprocessing, there's no longer a loop providing a reference to each index...the index is necessary to parse and complete the function.
        # parse link a certain way to get `name`
    else:
        # parse link a different way to get `name`
    r = requests.get(link)
    with open(path + name, 'wb') as f:
        f.write(r.content)

pool = Pool(2)
scraper.prepPage(url)
pool.map(scraper.downloadImages('path/to/directory', ***some way to grab index of self.images****), self.images)
How can I refer to the index of the item currently being processed from the iterable passed into pool.map()?
I'm completely new to multiprocessing and couldn't find what I was looking for in the documentation... I also couldn't find a similar question via Google or Stack Overflow.
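One common pattern, sketched here with made-up names rather than the question's actual class: pass (index, link) pairs built with enumerate, and bind the fixed path argument with functools.partial, so each worker call receives both the item and its position. The parsing step is left as a placeholder:

from functools import partial
from multiprocessing import Pool

import requests

def download_image(args, path):
    index, link = args                    # index = position of this link in the original list
    name = f"image_{index}.jpg"           # placeholder: parse `link` (and use `index`) to build the real name
    r = requests.get(link)
    with open(path + name, 'wb') as f:
        f.write(r.content)

if __name__ == '__main__':
    images = ['https://example.com/a.jpg', 'https://example.com/b.jpg']  # stand-in for self.images
    worker = partial(download_image, path='path/to/directory/')
    with Pool(2) as pool:
        pool.map(worker, enumerate(images))  # each call gets an (index, link) tuple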