Solr: indexing nested JSON files + some fields independent of UniqueKey (need new core?) - indexing

I am working on an NLP project and I have a large amount of text data to index with Solr. I have already created an initial index (Solr core) with fields title, authors, publication date, authors, abstract. The is an ID that is unique to each article (PMID). Since then, I have extracted more information from the dataset and I am stuck with how to incorporate this new info into the existing index. I don't know how to approach the problem and I would appreciate suggestions.
The new information is currently stored in JSON files that look like this:
{id: {entity: [[33, 39, 0, subj], [103, 115, 1, obj], ...],
another_entity: [[88, 95, 0, subj], [444, 449, 1, obj], ...],
...},
another id,
...}
where the integers are the character span and the index of the sentence the entity appears in.
Is there a way to have something like subfields in Solr? Since the id is the same as the unique key in the main index I was thinking of adding a field entities, but then this field would need to have its own subfields start character, end character, sentence index, dependency tag. I have come across Nested Child Documents and I am considering changing the structure of the extracted information to:
{id: {entity: [{start:33, end:39, sent_idx:0, dep_tag:'subj'},
{start:103, end:115, sent_idx:1, dep_tag:'obj'}, ...],
another_entity: [{}, {}, ...],
...},
another id,
...}
Having keys for the nested values, I should be able to use the methods linked above - though I am still unsure if I am on the right track here. Is there a better way to approach this? All fields should be searchable. I am familiar with Python, and so far I have been using the library subprocess to post documents to Solr via Python script
sp.Popen(f"./post -c {core_name} {json_path}", shell=True, cwd=SOLR_BIN_DIR)
Additionally, I want to index some information that is not linked to a specific PMID (does not have the same unique key), so I assume I need to create a new Solr core for it? Does it mean I have to switch to SolrCloud mode? So far I have been using a simple, single core.
Example of such information (abbreviations and the respective long form - also stored in a JSON file):
{"IEOP": "immunoelectroosmophoresis",
"ELISA": "enzyme-linked immunosorbent assay",
"GAGs": "glycosaminoglycans",
...}
I would appreciate any input - thank you!
S.

Related

MongoDB delete documents not containing custom words

I have a news article scraper that pulls articles based on certain content. On occasions, the crawlers pull back articles irrelevant to what they're supposed to.
I want to delete documents that DO NOT contain the relevant keywords. I ran the below code in pandas and was successful in deleting the unwanted documents:
relevant_words = ['Bitcoin', 'bitcoin', 'Ethereum', 'ethereum', 'Tether', 'tether', 'Cardano', 'cardano', 'XRP', 'xrp']
articles['content'] = articles['content'][articles['content'].str.contains('|'.join(relevant_words))].str.lower()
articles.dropna(subset=['content'], inplace=True)
My DB structure is as follows:
_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:
The content field can contain anyone from one sentence to several paragraphs. I'm thinking a python script that iterates over the content field looking for documents without the relevant words and deleting them.
Filter into a new dataframe.
You were on the right track except you have to go in the following order;
join and convert to lower case:
'|'.join(relevant_words).lower()
Filter
m= articles['content'].str.contains('|'.join(relevant_words).lower())
Mask the filter
articles[m]
Combined code
new_articles=articles[articles['content'].str.contains('|'.join(relevant_words).lower())]

How to programmatically list available Google BigQuery locations?

How to programmatically list available Google BigQuery locations? I need a result similar to what is in the table of this page: https://cloud.google.com/bigquery/docs/locations.
As #shollyman has mentioned
The BigQuery API does not expose the equivalent of a list locations call at this time.
So, you should consider filing a feature request on the issue tracker.
Meantime, I wanted to add Option 3 to those two already proposed by #Tamir
This is a little naïve option with its pros and cons, but depends on your specific use case can be useful and easy adapted to your application
Step 1 - load page (https://cloud.google.com/bigquery/docs/locations) html
Step 2 - parse and extract needed info
Obviously, this is super simple to implement in any client of your choice
As I am huge BigQuery fan - I went through "prove of concept" using BigQuery Tool - Magnus
I've created workflow with just two Tasks:
API Task - to load page's HTML into variable var_payload
and
BigQuery Task - to parse and extract wanted info out of html
The "whole" workflow is as simple as it looks in below screenshot
The query I used in BigQuery Task is
CREATE TEMP FUNCTION decode(x STRING) RETURNS STRING
LANGUAGE js AS """
return he.decode(x);
"""
OPTIONS (library="gs://my_bucket/he.js");
WITH t AS (
SELECT html,
REGEXP_EXTRACT_ALL(
REGEXP_REPLACE(html,
r'\n|<strong>|</strong>|<code>|</code>', ''),
r'<table>(.*?)</table>'
)[OFFSET(0)] x
FROM (SELECT'''<var_payload>''' AS html)
)
SELECT pos,
line[SAFE_OFFSET(0)] Area,
line[SAFE_OFFSET(1)] Region_Name,
decode(line[SAFE_OFFSET(2)]) Region_Description
FROM (
SELECT
pos, REGEXP_EXTRACT_ALL(line, '<td>(.*?)</td>') line
FROM t,
UNNEST(REGEXP_EXTRACT_ALL(x, r'<tr>(.*?)</tr>')) line
WITH OFFSET pos
WHERE pos > 0
)
As you can see, i used he library. From its README:
he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would ...
After workflow is executed and those two steps are done - result is in project.dataset.location_extraction and we can query this table to make sure we've got what is expected
Note: obviously parsing and extracting needed locations info is quite simplified and surely can be improved to be more flexible in terms of changing source page layout
Unfortunately, There is no API which provides BigQuery supported location list.
I see two options which might be good for you:
Option 1
You can manually manage a list and expose this list to your client via an API or any other means your application support (You will need to follow BigQuery product updates to follow on updates on this list)
Option 2
If your use case is to provide a list of the location you are using to store your own data you can call dataset.list to get a list of location and display/use it in your app
{
"kind": "bigquery#dataset",
"id": "id1",
"datasetReference": {
"datasetId": "datasetId",
"projectId": "projectId"
},
"location": "US"
}

How to get parents by IDs when using polymorphic associations

I have a many-to-many relation table site_sections with the following columns:
id
site_id
section_id
which is used a join table between sections and sites tables. So one site has many sections and a section is available in many sites.
Sites table has the following columns:
id
store_number
The sites_sections table is used in a polymorphic association with parameters table.
I'd like to find all the parameters corresponding to the site sections for a specific site by its store_number. Is it possible to pass in an array of site_settings.id to SQL using the IN clause, something like that:
Parameter.where("parent_id IN (" + [1, 2, 3, 4] + ") and parent_type ='com.models.SiteSection'");
where [1, 2, 3, 4] should be an array of IDs from sites_sections table or there is a better solution ?
Your solution is correct:
Site aSite = Site.findFirst("store_number=?", STORE_NUMBER);
List<SiteSection> siteSections= SiteSection.where("site_id=?", aSite.getId()).include(Parameter.class);
for (SiteSection siteSection : siteSections) {
List<Parameter> siteParams = siteSection.getAll(Parameter.class);
for (Parameter siteParam : siteParams) { ... }
}
In addition, by using the include(), you are also avoiding an N+1 problem: http://javalite.io/lazy_and_eager#improve-efficiency-with-eager-loading
However there can be a catch. If you have a very large number of parameters, you will be using a lot of memory, since include() loads all results into heap at once. If your result sets are relatively small, you are saving resources by running a single query. If your result sets are large, you are wasting heap space.
See docs: LazyList#include()
Side note: use aSite.getId() or aSite.getLongId() instead of aSite.get("id")

Neo4j 2.0.1 Cypher performance difference between using start and match with a predicate

Started using Cypher about a week ago (really like it). In the 'browser' interface I'm running two queries:
1) start n=node:Node(name="foo") match (n)-[r*..4]-(m) return n,m
2) match (n{name:"foo"})-[r*..4]-(m) return n,m
The first query returns almost immediately, the second query more than an hour and counting. Naively I would think these would be equivalent, clearly they are not. I ran a 'smaller' (path just up to 1) version of both in the neo-shell so I could profile them.
profile start n=node:Node(name="foo") match (n)-[r*..1]-(m) return n,m;
ColumnFilter(symKeys=["m", "n", " UNNAMED51", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
TraversalMatcher(start={"expr": "Literal(foo)", "identifiers": ["n"], "key": "Literal(name)",
"idxName": "Node", "producer": "NodeByIndex"}, trail="(n)-[*1..1]-(m)", _rows=4, _db_hits=5)
.
profile match (n{name:"foo"})-[r*..1]-(m) return n,m
ColumnFilter(symKeys=["n", "m", " UNNAMED33", "r"], returnItemNames=["n", "m"], _rows=4, _db_hits=0)
Filter(pred="Property(n,name(0)) == Literal(foo)", _rows=4, _db_hits=196870)
TraversalMatcher(start={"producer": "AllNodes", "identifiers": ["m"]},
trail="(m)-[*1..1]-(n)", _rows=196870, _db_hits=396980)
From other stackoverflow questions I understand db_hits is good to look at, so it looks like the second query has basically done a linear scan (my db is almost 400k nodes). This seems to be confirmed by the "producer" value of "AllNodes" instead of "NodeByIndex".
Obviously I need to specify the match (predicate) differently so that it hits the index. The index is called 'Node' on parameter 'name'. My googling, stacko search is failing me.. how do I specify the conditional in the match so that it hits the index?
Update:
After some poking around it appears I'm using a 'legacy' index? and then trying to hit that with the 'new style (don't use start) query... (kinda extrapolating here). So I can do the following:
create index ON :label(name)
and that would provide an index for a particular label on the name property, but I really want an index (I guess non-legacy index) on ALL the node names. I have use cases where that's important (user may not know the label but does know the name).
Any suggestions or guidance is much appreciated.
Right now there is no global schema index, so you would probably want to create an index on a generic label like Entity or Node and create an index like this:
create index on :Entity(name);
And add that Entity label to all your nodes.
match (n) set n:Entity;

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do MLT search by id for similar content, the original document(which id is the searched id) have a score 5.241327.
There is 1:1 duplicated document and for the duplicated content it is returning score = 1.5258181. Why? Why it is not 5.241327 when it is 100% duplicate.
Another question is can I in any way to get similarity documents by content by passing some text in the query.
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if there is similar content uploaded and the check must be performed at new content upload time.
It might be worth to try some different parameters. I also use MLT on only one field, I use the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf might be important and I see the default mintf (term frequency) is 2, so I assume an ID is only one term, so this is probably ignored!
First, how does Solr More-Like-This works?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&.....".
For each document returned by the above query, More-Like-This conduct More like this query...
So, the first result set "response", is just like any Solr query results set.
The More-Like-This appears below and start with something like that (Json format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation about More Like This algorithm, please read that:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you didn't solved the problem yet, please let me know and I will guide you through.