MongoDB delete documents not containing custom words - pandas

I have a news article scraper that pulls articles based on certain content. Occasionally, the crawlers pull back articles that are irrelevant to what they're supposed to collect.
I want to delete documents that DO NOT contain the relevant keywords. I ran the code below in pandas and successfully deleted the unwanted documents:
relevant_words = ['Bitcoin', 'bitcoin', 'Ethereum', 'ethereum', 'Tether', 'tether', 'Cardano', 'cardano', 'XRP', 'xrp']
articles['content'] = articles['content'][articles['content'].str.contains('|'.join(relevant_words))].str.lower()
articles.dropna(subset=['content'], inplace=True)
My DB structure is as follows:
_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:
The content field can contain anything from one sentence to several paragraphs. I'm thinking of a Python script that iterates over the content field, looking for documents without the relevant words and deleting them.

Filter into a new dataframe.
You were on the right track, except you have to go in the following order:
Join and convert to lower case:
'|'.join(relevant_words).lower()
Filter:
m = articles['content'].str.contains('|'.join(relevant_words).lower())
Mask the filter:
articles[m]
Combined code:
new_articles = articles[articles['content'].str.contains('|'.join(relevant_words).lower())]
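If you would rather drop the irrelevant documents from MongoDB itself rather than only filtering the dataframe, here is a minimal sketch with pymongo; the connection string and the news/articles database and collection names are assumptions, so adjust them to your setup:
import re
from pymongo import MongoClient

# The same keyword filter, applied in MongoDB itself. A case-insensitive
# regex makes the duplicated upper/lower-case entries unnecessary.
relevant_words = ['Bitcoin', 'Ethereum', 'Tether', 'Cardano', 'XRP']
pattern = re.compile('|'.join(relevant_words), re.IGNORECASE)

client = MongoClient('mongodb://localhost:27017')
articles = client['news']['articles']

# Delete every document whose content does NOT match any relevant word.
result = articles.delete_many({'content': {'$not': pattern}})
print(f'{result.deleted_count} documents deleted')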

Related

Solr: indexing nested JSON files + some fields independent of UniqueKey (need new core?)

I am working on an NLP project and I have a large amount of text data to index with Solr. I have already created an initial index (Solr core) with the fields title, authors, publication date, abstract. There is an ID that is unique to each article (PMID). Since then, I have extracted more information from the dataset and I am stuck on how to incorporate this new info into the existing index. I don't know how to approach the problem and I would appreciate suggestions.
The new information is currently stored in JSON files that look like this:
{id: {entity: [[33, 39, 0, subj], [103, 115, 1, obj], ...],
      another_entity: [[88, 95, 0, subj], [444, 449, 1, obj], ...],
      ...},
 another id,
 ...}
where the integers are the character span and the index of the sentence the entity appears in.
Is there a way to have something like subfields in Solr? Since the id is the same as the unique key in the main index I was thinking of adding a field entities, but then this field would need to have its own subfields start character, end character, sentence index, dependency tag. I have come across Nested Child Documents and I am considering changing the structure of the extracted information to:
{id: {entity: [{start: 33, end: 39, sent_idx: 0, dep_tag: 'subj'},
               {start: 103, end: 115, sent_idx: 1, dep_tag: 'obj'}, ...],
      another_entity: [{}, {}, ...],
      ...},
 another id,
 ...}
Having keys for the nested values, I should be able to use the methods linked above, though I am still unsure whether I am on the right track here. Is there a better way to approach this? All fields should be searchable. I am familiar with Python, and so far I have been using the subprocess library to post documents to Solr via a Python script:
sp.Popen(f"./post -c {core_name} {json_path}", shell=True, cwd=SOLR_BIN_DIR)
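As a side note on the posting step: a hedged alternative to shelling out to bin/post is to send the JSON straight to the core's update handler with requests. The localhost URL and the child-document field names below are illustrative assumptions, not my actual schema:
import requests

# Post a parent document with nested child documents to Solr's update
# handler. Each entity mention becomes a child with its own fields.
doc = {
    "id": "PMID123",
    "title": "Example article",
    "entities": [
        {"id": "PMID123.e1", "entity": "some_entity",
         "start": 33, "end": 39, "sent_idx": 0, "dep_tag": "subj"},
        {"id": "PMID123.e2", "entity": "some_entity",
         "start": 103, "end": 115, "sent_idx": 1, "dep_tag": "obj"},
    ],
}

# core_name as in the post command above
resp = requests.post(
    f"http://localhost:8983/solr/{core_name}/update?commit=true",
    json=[doc],
)
resp.raise_for_status()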
Additionally, I want to index some information that is not linked to a specific PMID (does not have the same unique key), so I assume I need to create a new Solr core for it? Does it mean I have to switch to SolrCloud mode? So far I have been using a simple, single core.
Example of such information (abbreviations and the respective long form - also stored in a JSON file):
{"IEOP": "immunoelectroosmophoresis",
"ELISA": "enzyme-linked immunosorbent assay",
"GAGs": "glycosaminoglycans",
...}
I would appreciate any input - thank you!
S.

Python write function saves dataframe.__repr__ output but truncated?

I have a dataframe output as a result of running some code, like so:
df = pd.DataFrame({
    "i": self.direct_hit_i,
    "domain name": self.domain_list,
    "j": self.direct_hit_j,
    "domain name 2": self.domain_list2,
    "domain name cleaned": self.clean_domain_list,
    "domain name cleaned 2": self.clean_domain_list2
})
All I was really looking for was a way to save this data to a file (e.g. txt, csv) in a way where the columns of data align with the headers. I was using df.to_csv() with a \t delimiter, but because the data contain strings and numbers of different lengths, the elements within each row never quite line up as a column with the corresponding header. So I resorted to using
with open('./filename.txt', 'w') as fo:
    fo.write(df.__repr__())
But bear in mind that the columns in the dataframe are really long lists. For small lists it returns the full, aligned table, which is exactly what I want. However, when I have very big lists the output is truncated. I would like it not to be truncated, since I'll need to manually scroll down and verify things.
Try the syntax:
with open('./filename.txt', 'w') as fo:
    fo.write(f'{df!r}')
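Note that f'{df!r}' goes through the same repr, which pandas truncates according to its display options. If you need the full, untruncated table, a minimal sketch (reusing the df from the question):
import pandas as pd

# to_string() renders every row and column, so nothing is truncated.
with open('./filename.txt', 'w') as fo:
    fo.write(df.to_string())

# Alternatively, lift the display limits temporarily and keep the repr:
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.width', None):
    with open('./filename.txt', 'w') as fo:
        fo.write(repr(df))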
Another way of doing this export to csv would be to use a tool like Mito, which, full disclosure, I'm the author of. It should allow you to export to CSV more easily than the process here!

Woocommerce, update ACF field and short description with SQL

I've got a WooCommerce setup where I currently need to do two things. The first is to move data that is currently in the post_excerpt field to an ACF field I've created specifically for the purpose; the second is to update the post_excerpt field with new data. All the product data is in SQL Server Express, because the product data came from another website that we're replacing with the WooCommerce one. I exported all the WooCommerce products with basic info like the product ID, SKU, title and post_content, and wrote a query in SQL Server to match the two together. That's been exported as a flat file and imported into MySQL. Now I've written a query to update the post_excerpt field, but what I can't find is a way to update the ACF field in the same query (or another one).
UPDATE wp_posts
JOIN updatelist ON wp_posts.ID = updatelist.product_id
SET wp_posts.post_excerpt = updatelist.excerpt;
Can anyone help? Please don't suggest importing a CSV file. There are 180,000 products; a CSV import is about 10% of the way through and has taken 24 hours so far.
To update ACF fields, I would usually first prepare an array of key-value pairs of ACF fields to loop over and update:
# First prepare your array of the ACF fields you need to update.
$acf_fields = [
    'field_5f70*********' => 'product_name',
    'field_5f80*********' => 'product_color',
    'field_5f90*********' => 'product_price',
];
# To find the key values for your own ACF fields, go to the admin dashboard
# under Custom Fields, select your group of ACF fields, and on the
# "Edit Field Group" screen you'll see those keys. If you don't see them,
# choose "Screen Options" and enable "Field Keys".

# Now we're going to loop over them and update each field.
foreach ($acf_fields as $key => $name) {
    update_field($key, $value, $post_id);
    # $key     = the ACF field key
    # $value   = the value to be written, which comes from your old list
    #            (from querying your list)
    # $post_id = which post it belongs to (custom query for your custom
    #            post type that contains the ACF fields)
}
That's how I update my ACF fields; there are other methods using the WordPress REST API too.

Query for substrings from freeform STT input

I have a PostgreSQL database with vocabulary in a table.
I want to receive Speech to Text (STT) input and query my vocabulary table for matches.
This is tricky since STT is somewhat free-form.
Let's say the table contains the following vocabulary and phrases:
How are you?
Hi
Nice to meet you
Hill
Nice
And the user is prompted to speak: "Hi, nice to meet you"
I transcribe their input as it comes in as "Hi nice to meet you" and query my database for individual vocabulary matches. I want to return:
[
  {
    id: 2,
    word: "Hi"
  },
  {
    id: 3,
    word: "Nice to meet you"
  }
]
I could query with wildcards, where word ilike '%${term}%', but then I'd need to pass in the correct substring so it'd find the match, e.g. where word ilike '%Hi%', but this may incorrectly return Hill. I could also split the spoken input by spaces, giving me ["Hi", "nice", "to", "meet", "you"], and loop through each word looking for a match, but this may return Nice rather than the phrase Nice to meet you.
Q: How can I correctly pass substrings to a query and return accurate results for free-form speech?
Two PostgreSQL functions could help you here:
to_tsvector: creates a text-search list of tokens (lexemes: units of lexical meaning).
to_tsquery: queries the vector for occurrences of certain words or phrases.
See Mastering PostgreSQL Tools: Full-Text Search and Phrase Search.
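For instance, a minimal sketch that checks which vocabulary entries occur in the transcript; the vocabulary table with id and word columns matches the question, while the connection string and psycopg2 on the Python side are assumptions:
import psycopg2

transcript = "Hi nice to meet you"

conn = psycopg2.connect("dbname=mydb")  # connection string is an assumption
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, word
        FROM vocabulary
        WHERE to_tsvector('english', %s) @@ phraseto_tsquery('english', word)
        """,
        (transcript,),
    )
    print(cur.fetchall())  # e.g. [(2, 'Hi'), (3, 'Nice to meet you')]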
If that's not enough, you need to turn to natural language processing (NLP).
Something like PyTextRank could help, since it goes beyond the bag-of-words technique:
import spacy
import pytextrank

text = "Hi, how are you?"

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
# (spaCy 2.x style; with spaCy 3.x / pytextrank 3.x use nlp.add_pipe("textrank"))
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, due to what looks to be a pretty benign column-naming rule.
For example, I have a table whose schema includes a __key__ RECORD field, for which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
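Plugging the generated selectors into the view definition then looks something like this (a sketch; the dataset and table names are from the example above):
# Build the view's SELECT from the leaf selectors instead of SELECT *,
# so nested names like __key__.namespace are aliased to legal column
# names such as __key___namespace.
selectors = get_leaf_column_selectors('2014_05_17', 'AccountDeletionRequest')
view_query = "SELECT\n%s\nFROM [2014_05_17.AccountDeletionRequest]" % selectors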
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. names that don't have '.' in them).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.