I am using pymongo to query a JSON file.
I have queried the JSON file to find the most popular book and the number of copies of that book that were sold.
I am using pymongo and aggregation pipelines for my query. This is the format of my query:
books = list(db.books.aggregate(pipeline1))
I have stored this result list in a variable called "books".
When I print books, i.e. print(books)
it prints the following:
{'_id': 'Harry Potter ', 'sold': 456289}
Is there any way to print just the name "Harry Potter" without "_id", "sold", or "456289"?
Have you added the projection stage to your pipeline to disable the _id field?
https://docs.mongodb.com/manual/reference/operator/aggregation/project/
In detail, from the documentation:
_id: <0 or false>
Specifies the suppression of the _id field.
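A minimal sketch of what that could look like for this pipeline, assuming pipeline1 groups on the title (the $group stage and the title/copies_sold field names below are guesses about your pipeline); the grouping key is copied into a name field before _id is suppressed, so the title survives the projection:

pipeline1 = [
    {"$group": {"_id": "$title", "sold": {"$sum": "$copies_sold"}}},  # assumed grouping stage
    {"$sort": {"sold": -1}},
    {"$limit": 1},
    # rename the grouping key to "name", then suppress _id and leave "sold" out of the output
    {"$project": {"_id": 0, "name": "$_id"}},
]

books = list(db.books.aggregate(pipeline1))
print(books[0]["name"])  # -> 'Harry Potter '

Even without changing the pipeline, print(books[0]['_id']) already prints just the name, since _id holds the title in the result shown above.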
I am working on an NLP project and I have a large amount of text data to index with Solr. I have already created an initial index (Solr core) with the fields title, authors, publication date, and abstract. There is also an ID that is unique to each article (PMID). Since then, I have extracted more information from the dataset and I am stuck on how to incorporate this new info into the existing index. I don't know how to approach the problem and I would appreciate suggestions.
The new information is currently stored in JSON files that look like this:
{"id": {"entity": [[33, 39, 0, "subj"], [103, 115, 1, "obj"], ...],
        "another_entity": [[88, 95, 0, "subj"], [444, 449, 1, "obj"], ...],
        ...},
 "another_id": {...},
 ...}
where the integers are the character span and the index of the sentence the entity appears in.
Is there a way to have something like subfields in Solr? Since the id is the same as the unique key in the main index I was thinking of adding a field entities, but then this field would need to have its own subfields start character, end character, sentence index, dependency tag. I have come across Nested Child Documents and I am considering changing the structure of the extracted information to:
{"id": {"entity": [{"start": 33, "end": 39, "sent_idx": 0, "dep_tag": "subj"},
                   {"start": 103, "end": 115, "sent_idx": 1, "dep_tag": "obj"}, ...],
        "another_entity": [{}, {}, ...],
        ...},
 "another_id": {...},
 ...}
Having keys for the nested values, I should be able to use the methods linked above, though I am still unsure if I am on the right track here. Is there a better way to approach this? All fields should be searchable. I am familiar with Python, and so far I have been using the subprocess library to post documents to Solr via a Python script:
sp.Popen(f"./post -c {core_name} {json_path}", shell=True, cwd=SOLR_BIN_DIR)
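For illustration, the nested structure could also be posted straight to the update handler with requests instead of the post tool. This is only a rough sketch: the Solr URL, core name, the child field name "entities", and the child id scheme are assumptions, and named child documents need the _root_ and _nest_path_ fields defined in the schema (Solr 8+).

import json
import requests

SOLR_URL = "http://localhost:8983/solr"   # assumption: local standalone Solr
core_name = "articles"                    # assumption: the existing core

# One parent article (keyed by its PMID) with its entity mentions as nested child documents.
doc = {
    "id": "12345678",                     # PMID, the uniqueKey (placeholder value)
    "entities": [
        {"id": "12345678/ent/1", "start": 33, "end": 39, "sent_idx": 0, "dep_tag": "subj"},
        {"id": "12345678/ent/2", "start": 103, "end": 115, "sent_idx": 1, "dep_tag": "obj"},
    ],
}

resp = requests.post(
    f"{SOLR_URL}/{core_name}/update?commit=true",
    headers={"Content-Type": "application/json"},
    data=json.dumps([doc]),
)
resp.raise_for_status()

Re-posting a parent with an existing id replaces the whole stored document rather than merging into it, so the parent would need to carry all of its fields (or use atomic updates).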
Additionally, I want to index some information that is not linked to a specific PMID (it does not have the same unique key), so I assume I need to create a new Solr core for it? Does that mean I have to switch to SolrCloud mode? So far I have been using a simple, single core.
Example of such information (abbreviations and their respective long forms, also stored in a JSON file):
{"IEOP": "immunoelectroosmophoresis",
"ELISA": "enzyme-linked immunosorbent assay",
"GAGs": "glycosaminoglycans",
...}
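If this does end up in its own core (a standalone Solr instance can host several cores side by side), the map could be flattened into one document per abbreviation before posting; a small sketch with assumed file, core, and field names:

import json
import requests

SOLR_URL = "http://localhost:8983/solr"   # assumption: same standalone instance
abbrev_core = "abbreviations"             # assumption: a second core created beforehand

with open("abbreviations.json") as fh:    # assumption: the JSON file shown above
    abbrevs = json.load(fh)

# One document per short form / long form pair.
docs = [{"id": short, "short_form": short, "long_form": long_form}
        for short, long_form in abbrevs.items()]

requests.post(
    f"{SOLR_URL}/{abbrev_core}/update?commit=true",
    headers={"Content-Type": "application/json"},
    data=json.dumps(docs),
).raise_for_status()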
I would appreciate any input - thank you!
S.
I log messages that are JSON objects. The JSON has an array that contains key/value pairs:
{
...
"arr": [{"key": "foo", "value": "bar"}, ...],
...
}
Now I want to filter the results that contain a specific key and extract the value for that key from the array.
I've tried using regex, something like parse @message /.*"key":"my_specific_key","value":(?<value>.*}).*/ which extracts the value but also returns the rest of the message. It also doesn't filter the results.
How can I filter results and extract the values for a specific key?
If the log entries in your CloudWatch log group actually show up as JSON, you can reference the key directly anywhere you would use a field.
(You don't need the @ for your own keys; CloudWatch adds that prefix automatically to its default fields only.)
If you are using Python, you can use aws_lambda_powertools to emit structured JSON logs as well, in a very slick way (and it's an actual AWS product).
If they show up in your log as a string, then it may be an escaped string and you'll have to match it exactly, including spaces and whatnot. When you parse, you will want to do something like this:
If this is the string of your log message: '{"AKey" : "AValue", "Key2" : "Value2"}'
parse @message "{\"*\" : \"*\", \"*\" : \"*\"}" as akey, akey_value, key2, key2_value
Then you can filter, count, or do anything else against those variables. parse is specifically a statement to match a pattern and assign each wildcard to a variable, one at a time, in order.
Though with complex JSON, if your regex above works, then all you need is a filter statement:
fields @message
| parse @message ... your regex ... as value_var
| filter value_var like /some more regex/
If it's not a string in the log entry but actual JSON, you can just reference against the key:
filter a_key = "some value" (or a regex, e.g. filter a_key like /some value/)
See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html for more info.
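If you ever want to run that kind of query from Python rather than the console, here is a rough boto3 sketch; the log group name, time range, and the parsed key are placeholders:

import time
import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, @message
| parse @message /"key":"my_specific_key","value":"(?<value>[^"]*)"/
| filter ispresent(value)
| display @timestamp, value
"""

start = logs.start_query(
    logGroupName="/aws/lambda/my-function",   # placeholder log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the matched values.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(result["results"])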
I have a news article scraper that pulls articles based on certain content. On occasion, the crawlers pull back articles irrelevant to what they're supposed to collect.
I want to delete documents that DO NOT contain the relevant keywords. I ran the code below in pandas and successfully deleted the unwanted documents:
relevant_words = ['Bitcoin', 'bitcoin', 'Ethereum', 'ethereum', 'Tether', 'tether', 'Cardano', 'cardano', 'XRP', 'xrp']
articles['content'] = articles['content'][articles['content'].str.contains('|'.join(relevant_words))].str.lower()
articles.dropna(subset=['content'], inplace=True)
My DB structure is as follows:
_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:
The content field can contain anything from one sentence to several paragraphs. I'm thinking of a Python script that iterates over the content field looking for documents without the relevant words and deletes them.
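A rough sketch of that idea with PyMongo, where the connection, database, and collection names are placeholders; $not with a regex matches the documents whose content does not contain any of the keywords:

import re
from pymongo import MongoClient

relevant_words = ['Bitcoin', 'bitcoin', 'Ethereum', 'ethereum', 'Tether', 'tether',
                  'Cardano', 'cardano', 'XRP', 'xrp']

client = MongoClient()                    # assumption: local MongoDB instance
articles = client["news"]["articles"]     # assumption: database and collection names

# Same alternation as the pandas version, applied server-side.
pattern = re.compile('|'.join(relevant_words))

# Delete every document whose content field matches none of the keywords.
result = articles.delete_many({"content": {"$not": pattern}})
print(result.deleted_count, "irrelevant documents removed")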
Filter into a new dataframe.
You were on the right track, except you have to go in the following order:
Join and convert to lower case:
'|'.join(relevant_words).lower()
Filter:
m = articles['content'].str.contains('|'.join(relevant_words).lower())
Mask the filter:
articles[m]
Combined code:
new_articles = articles[articles['content'].str.contains('|'.join(relevant_words).lower())]
I've got a WooCommerce setup where I currently need to do two things. The first is to move data that is currently in the post_excerpt field to an ACF field that I've created specifically for the purpose, while the second is to update the post_excerpt field with new data. All the product data is in SQL Server Express because it came from another website that we're replacing with the WooCommerce one. I exported all the WooCommerce products with basic info like the product ID, SKU, Title, and Post_Content, and wrote a query in SQL Server to match the two together. That's been exported as a flat file and imported into MySQL. Now I've written a query to update the post_excerpt field, but what I can't find is a way to update the ACF field in the same query (or another one).
UPDATE `wp_posts`
JOIN `updatelist` ON `wp_posts`.`ID` = `updatelist`.`product_id`
SET `wp_posts`.`post_excerpt` = `updatelist`.`excerpt`;
Can anyone help? Please don't suggest importing a CSV file. There are 180,000 products, and using a CSV it's only about 10% of the way through and has already taken 24 hours.
To update ACF fields, I would usually first prepare an array of key-value pairs of ACF fields to loop over and update:
# first prepare your array of the ACF fields you need to update
$acf_fields = [
    'field_5f70*********' => 'product_name',
    'field_5f80*********' => 'product_color',
    'field_5f90*********' => 'product_price'
];
# To find the key values for your own ACF fields, go to the admin dashboard under Custom Fields, select your group of ACF fields, and on the "Edit Field Group" screen you will see those keys. If you don't see them, choose "Screen Options" and enable "Field Keys".
# Now we're going to loop over them and update each field
foreach ($acf_fields as $key => $name) {
    update_field($key, $value, $post_id);
    # $key     = the ACF field key
    # $value   = the value to be updated, which comes from your old list (from querying your list)
    # $post_id = the post it belongs to (custom query for your custom post type that contains the ACF fields)
}
That's how I update my ACF fields; there are other methods using the WordPress REST API too.
I have a Python dictionary of the form {objectID, 'GUID', {'A': "A's score"}, {'B': "B's score"}}. I have got 5 GUIDs in the dictionary.
A document with the same format is already stored in MongoDB.
I want to check whether the GUIDs in the Python dict are already in the MongoDB collection. If a GUID exists, its document has to be updated; otherwise it should be inserted into MongoDB.
How can I do this using pyMongo?
Assuming that RESULT is a dict whose keys match your IDs and that contains your scores, you can do the following:
ids = RESULT.keys()
for id in ids:
    collection.update_one({"ID": id}, {"$set": {id: RESULT[id]}}, upsert=True)
The upsert=True will insert the document if it doesn't already exist.
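As a variation on the same idea, the upserts can be batched into a single round trip (same assumptions about RESULT and collection):

from pymongo import UpdateOne

# One UpdateOne operation per ID, all sent to the server in a single bulk_write call.
ops = [UpdateOne({"ID": id}, {"$set": {id: RESULT[id]}}, upsert=True) for id in RESULT]
collection.bulk_write(ops)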
Here are the pymongo docs