Best practice to keep RSS feeds unique in a SQL database

I am working on a project that shows RSS feeds from different sites.
I keep them in the database; every 3 hours my program fetches the feeds and inserts them into the SQL database.
I want the records to be unique per provider, so that duplicate content is never shown.
The problem is that some providers do not give a GUID field, some give a GUID field but no pubDate, and some give neither GUID nor pubDate, just a title and a link.
So what would be the best way to keep RSS feeds unique in SQL Server?
Should I check first for guid, then pubDate, then link, then title? Would it be good practice to compare link fields in SQL to check uniqueness?
Thanks.

I would develop a routine that takes certain key parameters, such as the title, source, and body, and combines them to create a CRC hash. Then store the hash as an attribute of the feed and check for a matching hash before adding a new feed.
I'm not sure what your environment constraints are, but here is an example of calculating CRC-32 in C#: http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
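In case C# is not your environment, here is a minimal sketch of the same idea in Python, using zlib.crc32 from the standard library (the title/source/body parameters are assumptions for illustration, not fields from any particular feed format):

import zlib

def feed_hash(title, source, body):
    # Combine the key fields and compute a CRC-32 checksum over them.
    # The '\x1f' separator guards against accidental collisions when
    # fields are concatenated (e.g. 'ab' + 'c' vs 'a' + 'bc').
    combined = '\x1f'.join((title, source, body))
    return zlib.crc32(combined.encode('utf-8'))

You could then store the returned integer in an indexed column and test for its existence before inserting a new item.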

Currently, this is what I am doing:

# If we have a GUID in the feed item, use it as the feed_item_id, else use the link
# http://www.詹姆斯.com/blog/2006/08/rss-dup-detection
import hashlib

def build_feed_item_id(entry):
    guid = entry.get('id', '').strip()
    if guid:
        feed_item_id = guid
    else:
        feed_item_id = entry.get('link', '').strip()
    return hashlib.md5(feed_item_id.encode('utf-8')).hexdigest()

It is based on the reasoning in the blog post linked in the snippet, which I'll quote here in case the post gets taken down:
RSS 2.0 has a guid element that fits the bill perfectly, but it’s not
a required element and many feeds don’t use it.
I can’t say for sure what algorithms applications are using, but after
running 150 tests on more than 20 different aggregators, I think I have
a fair idea how many of them work.
As you would expect, for most the guid is considered the key element
for determining duplicates. This is pretty straightforward. If two
items have the same guid they are considered duplicates; if their
guids differ then they are considered different.
If a feed doesn’t contain guids, though, aggregators will most likely
resort to one of three general strategies – all of which involve the
link element in some way.
Technique 1
The guid must be unique.
If a post doesn't have a guid, consider the link, title, description, or any combination of them to derive a unique hash.
Technique 2
The link must be unique.
If both link and guid are missing, check other elements such as the title or description.
Technique 3
The combination of link + title, or link + description, must be unique.
The most obvious recommendation is that you should always include
guids in your feeds.
In addition, I would recommend you also include a unique link element
for each item in your feed, to allow for aggregators that don’t handle
guids very well. No two items should ever have the same link element,
and ideally a link should never change (if you do update a link, be
aware that it could show up as a new item for some aggregators).
Finally, although this is not essential, it is advisable that you
refrain from updating your article titles if at all possible. There
are at least two aggregators that will consider an entry with an
altered title to be a completely new post – somewhat annoying to
readers when all you’ve done is make a spelling correction in your
title.
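Putting those techniques together, here is a hedged sketch of how the earlier routine could fall back from guid to link to a title + description combination (the feedparser-style entry keys are assumptions; adapt them to whatever your parser returns):

import hashlib

def build_feed_item_id_with_fallback(entry):
    # Technique 1: prefer the guid when the feed provides one.
    key = entry.get('id', '').strip()
    if not key:
        # Technique 2: fall back to the link.
        key = entry.get('link', '').strip()
    if not key:
        # Technique 3: last resort, combine title and description.
        key = entry.get('title', '').strip() + '\x1f' + entry.get('description', '').strip()
    return hashlib.md5(key.encode('utf-8')).hexdigest()

Storing this digest in a column with a unique index then lets the database enforce deduplication at insert time.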


How to get all hashes in foo:* using a single id counter instead of a set/array

Introduction
My domain has articles, which have a title and text. Each article has revisions (like the SVN concept), so every time it is changed/edited, those changes are stored as a revision. A revision is composed of the changes and a description of those changes.
I want to be able to obtain all revision descriptions at once.
What's the problem?
I'm certain that I would store each revision as a hash at articles:revisions:<id>, storing the changes and the description in it.
What I'm not certain of is how to get all of the descriptions at once.
I have many options for doing this, but none of them convinces me.
Store the revision ids for an article as a set, and use SORT articles:revisions:idSet BY NOSORT GET articles:revisions:*->description. This means I would store a set for each article. If every article had 50 revisions and we had 10,000 articles, we would have 500,000 ids stored.
Is this the best way? Isn't this eating up too much RAM?
I have other ideas in mind, but I don't consider them good either:
Iterate from 0 to the last revision's id, doing an HGET for each id using MULTI.
Create the idSet for a specific article if it doesn't exist and is requested, and expire it after some time.
Isn't there a way for Redis to do a SORT array BY NOSORT GET, with array being an ad hoc array in the form [0, MAX]?
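For reference, here is a minimal sketch of the first option using redis-py (the key names follow the ones above; hset with a mapping argument assumes redis-py 3.5 or later):

import redis

r = redis.Redis()

def add_revision(article_id, revision_id, description, changes):
    # Store the revision as a hash and remember its id in the article's set.
    r.hset('articles:revisions:%d' % revision_id,
           mapping={'description': description, 'changes': changes})
    r.sadd('articles:%d:revisionIds' % article_id, revision_id)

def get_descriptions(article_id):
    # SORT <idSet> BY nosort GET articles:revisions:*->description
    return r.sort('articles:%d:revisionIds' % article_id,
                  by='nosort',
                  get='articles:revisions:*->description')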
Seems like you have a good solution.
As long as you keep those id numbers below 10,000 and your sets under 512 elements (set-max-intset-entries), your memory consumption will be much lower than you think.
Here's a good explanation of it.
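You can verify the compact encoding yourself (a small redis-py sketch; OBJECT ENCODING reports intset for sets that contain only small integers and stay under set-max-intset-entries):

import redis

r = redis.Redis()
r.delete('articles:1:revisionIds')
r.sadd('articles:1:revisionIds', 1, 2, 3)
print(r.object('encoding', 'articles:1:revisionIds'))  # b'intset'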
This can be solved in an optimized way using a TRIE or a DAWG, better than what Redis provides. I don't know your application or other details of your search problem (e.g. construction time, unsuccessful searches, update performance).
If you search much more often than you update/insert into your lookup storage, I'd suggest you have a look at DAWGDIC [1] as a library, and construct "search paths" (similar to what you already described) using a string format that can be search-completed later:
articleID:revisionID:"changeDescription":"change"
Example (I assume you have one description per revision and n changes; this isn't clear to me from your question):
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
2:4:"Advertisement changes":"Added this, removed that"
Note: Even though you construct these strings with duplicate prefixes, the DAWG will store them in a very space efficient way (simply put, it will append the right side of the string to the data structure and create a shortcut for the common prefix, see also [2] for a comparison of TRIE data structures).
To list changes of article 1, revision 2, set the common prefix for your lookup:
completer.Start(index, "1:2");
Now you can simply call completer.Next() to look up the next record that shares the same prefix, and completer.value() to get the record's value. In our example we'll get:
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
Of course you need to parse the strings yourself into your data object.
Maybe that's not what you're looking for, and overkill. But it can be a very space- and search-efficient approach, if it meets your requirements.
[1] https://code.google.com/p/dawgdic/
[2] http://kmike.ru/python-data-structures/
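If pulling in DAWGDIC is overkill, the same prefix-completion access pattern can be sketched in plain Python with a sorted list and binary search (far less space-efficient than a DAWG, but it shows the idea):

import bisect

records = sorted([
    '1:2:"Some changes":"Added two sentences here, removed one sentence there"',
    '1:2:"Some changes":"Fixed article title"',
    '2:4:"Advertisement changes":"Added this, removed that"',
])

def complete(prefix):
    # Yield every record sharing the prefix, like completer.Start/Next.
    i = bisect.bisect_left(records, prefix)
    while i < len(records) and records[i].startswith(prefix):
        yield records[i]
        i += 1

for rec in complete('1:2'):
    print(rec)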

What is the meaning of the different fields returned by the get login form call?

I am looking for the specific meaning of the following fields:
valueIdentifier
valueMask
fieldType
FieldInfoMultiFixed
AutoRegFieldInfoSingle
FieldInfoMultiVariable
and in most cases we are getting a numerical value for helpText. How do we identify whether helpText is present or not?
A lot of the stuff like FieldInfoMultiFixed/Variable is discussed in the Yodlee SDK Developer Guide; search for either one. They're basically just silly combos where people break up a single value into multiple fields (like a phone number or SSN split into three textboxes).
As for helpText, every time I've seen a Yodlee tech respond, they have said no. The number corresponds to an internal resource identifier that is apparently not exposed through the API. I want to say I saw somebody mention that it might be available for things like forum signup/registration (where it would be more useful). The SDK makes it sound as if it works as you would expect, but that is an error.
Currently Yodlee does not have helpText populated for any field; hence a numerical value is associated with it. In the future, if any helpText gets added, then instead of a numerical value you will have text in that field.
Hence, if you are receiving numerical values, you should take it as helpText not being present.
Shreyans
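In other words, a hedged client-side check could look like this (a Python sketch; the dict-style field access is an assumption about how you have deserialized the response):

def help_text_or_none(field):
    # Yodlee currently returns an internal numeric resource id when no
    # help text exists, so treat an all-digit value as 'not present'.
    value = str(field.get('helpText', '')).strip()
    return None if value == '' or value.isdigit() else value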

Getting the exact edited data from SQL Server

I have two Tables:
Articles(artID, artContents, artPublishDate, artCategoryID, publisherID).
ArticleUpdated(upArtID, upArtContents, upArtEditedData, upArtPublishDate, upArtCategory, upArtOriginalArticleID, upPublisherID)
A user logs in to the application and updates an article's contents in the (artContents) column. I want to know:
Which changes did the user make to the article's contents?
I want to store both versions of the article, the original version and the edited version!
What should I do to accomplish the two tasks above?
Any necessary changes to the tables?
The query for getting the exact edited data of (artContents).
(By "exact edited data" I mean: there may be 5000 characters in the column, and the user may edit 200 characters in the middle or somewhere else among the column's characters; I want exactly those edited characters, before the edit and after the edit.)
Note: I am using ASP.NET with C# for development.
You are not going to be able to get the exact edits using SQL alone. You need an algorithm such as the Unix diff on files (which works at the line level). At the character level, the algorithm would be some variation of Levenshtein distance. If diff meets your needs, you could download it, write a stored procedure to call it, and then use it in the database. This would be rather expensive.
The part of your question about maintaining the different versions is much easier. I would add two columns, EffDate and EndDate, to each record. You can get the most recent version by looking for EndDate IS NULL, and find the version active at any given time. MERGE is generally useful for maintaining such a table.
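To make the character-level diff idea concrete, here is a sketch using Python's standard difflib (not the implementation you would ship in ASP.NET/C#, where you would use an equivalent diff library, but the logic is the same):

import difflib

def edited_spans(before, after):
    # Yields (operation, old_text, new_text) for each edited region,
    # at character granularity.
    matcher = difflib.SequenceMatcher(None, before, after)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != 'equal':
            yield op, before[i1:i2], after[j1:j2]

for op, old, new in edited_spans('The quick brown fox', 'The slow brown fox'):
    print(op, repr(old), '->', repr(new))
    # prints: replace 'quick' -> 'slow'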
Basically, this type of requirement needs custom logging.
The example you provided, i.e. "the exact edited data means that there may be 5000 characters in the column, the user may edit 200 characters in the middle or somewhere else among the column's characters, and I want exactly those edited characters, before and after the edit",
can have a case where the user updates particular words in different places in the text.
You can use http://nlog-project.org/ for logging; it's a fast and robust tool that we normally use for .NET logging.
Also you can take a look at:
http://www.codeproject.com/Articles/38756/Two-Simple-Approaches-to-WinForms-Dirty-Tracking
Asp.net Event for change tracking of entities
What would be the best way to implement change tracking on an object
The URLs above will clear some air on how to do it.
You would obviously need to track down and store every change.

What's the difference between an inverted index and a plain old index?

In software engineering we create indexes all the time (e.g., in databases) but I also hear a lot of people talk about inverted indices. Is there something fundamentally different between the two? They sound like the same thing.
One common use is "...to allow fast full-text searching."
The two types denote directionality. One takes you forward through the index, and the other takes you backward (the inverse) through the index. That's it. There's no mystery to uncover here. Otherwise the two types are identical, it's just a question of what information you have, and as a result what information you're trying to find.
To address your inquiry, I don't think there's actually a way to know why the use is what it is today. The only reason it's important to define which is forward and which one is inverted is so that we can all have a conversation about them, and everyone knows which direction we're talking about. Think about the terms "left" and "right": they are relative. Which is which doesn't matter, except that everyone needs to agree which one is "left" and which one is "right" in order for the words to have meaning. If, as a culture, we decided to flip left and right, then you'd have the same issue figuring out what a "right turn" vs a "left turn" is since the agreed upon meaning had changed. However, the naming is arbitrary, so which one is which (in and of itself) doesn't matter - what matters is that we all agree on the meaning.
In your comment where you ask, "please don't just define the terms", you're missing the point, and I think you're just getting hung up on the wording when there is absolutely no difference between them.
For the benefit of future readers, I will now provide several "forward" and "inverted" index examples:
Example 1: Web search
If you're thinking that the inverse of an index is something like the inverse of a function in mathematics, where the inverse is a special thing that has a different form, then you're mistaken: that's not the case here.
In a search engine you have a list of documents (pages on web sites), where you enter some keywords and get results back.
A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page.
The inverted index is the list of words, and the documents in which they appear. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).
They are both indexes - it's just a question of which direction you're going. Forward is from documents->to->words, inverted is from words->to->documents.
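To make the two directions concrete, here is a tiny sketch in Python:

# Forward index: document -> the words it contains.
docs = {
    'doc1': 'the quick brown fox',
    'doc2': 'the lazy dog',
}
forward_index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

# Inverted index: word -> the documents it appears in.
inverted_index = {}
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index.setdefault(word, set()).add(doc_id)

print(forward_index['doc1'])   # which words are in doc1?
print(inverted_index['the'])   # which docs contain 'the'? -> {'doc1', 'doc2'}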
Example 2: DNS
Another example is a DNS lookup (which takes a host name, and returns an IP address) and a reverse lookup (which takes an IP address, and gives you the host name).
Example 3: A book
The index in the back of a book is actually an inverted index, as defined by the examples above - a list of words, and where to find them in the book. In a book, the table of contents is like a forward index: it's a list of documents (chapters) which the book contains, except instead of listing the words in those sections, the table of contents just gives a name/general description of what's contained in those documents (chapters).
Example 4: Your cell phone
The forward index in your cell phone is your list of contacts, and which phone numbers (cell, home, work) are associated with those contacts. The inverted index is what allows you to manually enter a phone number, and when you hit "dial" you see the person's name, rather than the number, because your phone has taken the phone number and found you the contact associated with it.
They called it inverted just because there is already a forward index. Take the example of a search engine: it is composed of two parts. The first part is the web crawler and parser, which builds an index from document to word; the second part is the search database, which builds an index from word to document. Because the first index exists, we naturally call the second one the inverted index.
If you call the TOC (table of contents) of a book an index, then you should call the index at the end of the book an "inverted index". Or, seen the other way around, you could call the TOC the inverted index.
Typically, when speaking about an index, you mean some added calculations or stored results of procedures that have been done in order to speed up an application (e.g. in MySQL or another RDBMS; consult the MySQL docs). Indexing can also be related to caching, etc.
An inverted index creates a file with a structure that is primarily intended for (full-text) searching.
An inverted index consists of two main files:
Vocabulary
Occurrences
The vocabulary contains the common words extracted from the text (after filtering out blacklisted words like pronouns). The occurrences file holds the connections between words and documents (word1 appears in doc1 and doc2, but not in doc3). It is represented in the form of a matrix.
[The original answer included an image showing the process of creating the two files mentioned.]
If you are further interested in this topic, I can recommend a great book by Ricardo Baeza-Yates - Modern Information Retrieval (see it on Amazon) - around page 200, I think.
Hope it helps :-)
normalocity has already wonderfully differentiated between a forward and an inverted index, but as for the question of why one is called a forward index and the other an inverted index, maybe this is why:
Taking the example of search-engine crawling and indexing (or building an index for a book), a forward index can be built simultaneously while you are crawling the web pages (or reading the book), that is, while going forward. So if you have 10 web pages to crawl (or 10 chapters in a book), you can crawl the first web page (read the first chapter), then make a list of the words that appear in it, and continue this process for the other web pages (chapters), so that by the time you have crawled all 10 web pages (read all 10 chapters), your forward index is complete, with each web page (chapter) pointing to the list of words it contains.
But to make an inverted index, you have to crawl all 10 web pages (read all 10 chapters) first, and then take each word from each document's list and figure out which documents contain that word. This is like going backward once you have crawled the web pages (read the chapters of the book). So it's called an inverted index.
This is just my speculation.
The term "Inverted Word Index" refers to the change in relationship of
a single-document containing many-words, to each unique word containing
(or identifying) a list of many-documents. This is effectively taking a One-to-Many Relationship (Docs to Words) and Inverting (or reversing) it such that a new "Inverted" One-to-Many Relationship now exists, which is each-unique-word relating to Many-Documents (i.e., all that contain that word). It's origin really is that simple, and the term "inverted index" was used to describe manual indexes of the same type long before computers and electronic high-speed indexing even existed (yes, admittedly, I'm an old, geezer programmer, almost old enough to have considered Grace Hopper a "sweet young lady" age appropriate for courting back when COBOL was a shiny new language). Please don't discard us geezers just yet, as we may occasionally provide a useful, and possibly even valuable, historical tid-bit or two - when our personal RAM is still working, that is. [grin]
There are many types of index: for example, B-tree, R-tree, hash... For different purposes, we must choose the correct index.
The inverted index is a special one, usually used in full-text search engines. Using an inverted index, we can find the locations of a word in a document (or a set of documents) as fast as possible. Given the limits of memory and CPU, other indexes can't do this job.
You can read the Lucene documentation for more details. It's an open-source search engine. http://lucene.apache.org/java/docs/index.html
In inverted indexes, we have the following form:
word1 -> list of docs it occurs in (sorted order)
word2 -> list of docs it occurs in (sorted order)
It is very useful for search-engine query processing, as it allows us to find the docs that a word occurs in.
You can use supervised machine learning to build this inverted index.
One more difference:
Handling updates with an inverted index is expensive in comparison with a forward index.
A forward index handles updates easily by reflecting the changes only in the corresponding document's index, whereas in an inverted index the same change has to be reflected in multiple positions across the index.

Combining hits from multiple documents into a single hit in Lucene

I am trying to get a particular search to work and it is proving problematic. The actual source data is quite complex but can be summarised by the following example:
I have articles that are indexed so that they can be searched. Each article also has multiple properties associated with it, which are also indexed and searchable. When users search, they can get hits in either the main article or the associated properties. Regardless of where a hit is achieved, the article is returned as a search hit (i.e. the properties are never a hit in their own right).
Now for the complexity:
Each property has security on it, which means that for any given user, they may or may not be able to see the property. If a user cannot see a property, they obviously do not get a search hit in it. This security check is proprietary and cannot be done using the typical mechanism of storing a role in the index alongside the other fields in the document.
I currently have an index that contains the articles and properties indexed separately (ie. an article is indexed as a document, and each property has its own document). When a search happens, a hit in article A or a hit in any of the properties of article A should be classed as hit for article A alone, with the scores combined.
Originally, this was achieved by modifying Lucene v1.3: BooleanQuery was changed to use a custom Scorer that could apply the logic of the security check and treat two hits in different documents as a hit in a single document. I am trying to upgrade this version to the latest (v2.3.2 - I am using Lucene.Net), but ideally without having to modify Lucene in any way.
An additional problem occurs if I do an AND search. If an article contains the word foo and one of its properties contains the word bar, then searching for "foo AND bar" will return the article as a hit. My current code deals with this inside the custom Scorer.
Any ideas how/if this can be done?
I am thinking along the lines of using a custom HitCollector and passing that into the search, but when doing the boolean search "foo AND bar", execution never reaches my HitCollector as the ConjunctionScorer filters out all of the results from the sub-queries before getting there.
EDIT:
Whether or not a user can see a property is not based on the property itself, but on the value of the property. I cannot therefore put the extra security conditions into the query upfront as I don't know the value to filter by.
As an example:
+---------+------------+------------+
| Article | Property 1 | Property 2 |
+---------+------------+------------+
| A | X | J |
| B | Y | K |
| C | Z | L |
+---------+------------+------------+
If a user can see everything, then searching for "B and Y" will return a single search result for article B.
If another user cannot see a property if its value contains Y, then searching for "B and Y" will return no hits.
I have no way of knowing upfront which values a user can and cannot see. The only way to tell is to perform the security check (currently done at the time of filtering a hit from a field in the document), which I obviously cannot do for every possible data value for each user.
Having now implemented this (after a lot of head-scratching and stepping through Lucene searches), I thought I'd post back on how I achieved it.
Because I am interested in all of the results (ie. not a page at a time), I can avoid using the Hits object (which has been deprecated in later versions of Lucene anyway). This means I can do my own hit collection using the Search(Weight, Filter, HitCollector) method of IndexSearcher, iterating over all possible results and combining document hits as appropriate. To do this, I had to hook into Lucene's querying mechanism, but only when AND and NOT clauses are present. This is achieved by:
Creating a custom QueryParser and overriding GetBooleanQuery(ArrayList, bool) to return my own implementation.
Creating a custom BooleanQuery (returned from the custom QueryParser) and overriding CreateWeight(Searcher) to return my own implementation.
Creating a custom Weight (returned from the custom BooleanQuery) and overriding Scorer(IndexReader) to return my own implementation.
Creating a custom BooleanScorer2 (returned from the custom Weight) and overriding the Score(HitCollector) method. This is what deals with the custom logic.
This might seem like a lot of classes, but most of them derive from a Lucene class and just override a single method.
The implementation of the Score(HitCollector) method in the custom BooleanScorer2 class now has the responsibility of doing the custom logic. If there are no required sub-scorers, the scoring can be passed to the base Score method and run as normal. If there are required sub-scorers, it means there was a NOT or an AND clause in the query. In this case, the special combination logic mentioned in the question comes into play. I have a class called ConjunctionScorer that does this (this is not related to the ConjunctionScorer in Lucene).
The ConjunctionScorer takes a list of scorers and iterates over them. For each one, I extract the hits and their scores (using the Doc() and Score() methods) and create my own search hits collection containing only those hits that the current user can see after performing the relevant security checks. If a hit has already been found by another scorer, I combine them together (using the mean of their scores for their new score). If a hit is from a prohibited scorer, I remove the hit if it was already found.
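Stripped of the Lucene plumbing, the combination logic described above looks roughly like this (a Python sketch; scorer.hits(), scorer.prohibited, resolve_article_id and user_can_see are hypothetical stand-ins for the Lucene iteration, the clause type, the property-to-article mapping, and the proprietary security check):

def combine_hits(scorers, resolve_article_id, user_can_see):
    scores = {}   # article_id -> list of scores from the sub-scorers
    banned = set()
    for scorer in scorers:
        for doc_id, score in scorer.hits():
            if not user_can_see(doc_id):
                continue                             # security check per hit
            article_id = resolve_article_id(doc_id)  # property hit -> its article
            if scorer.prohibited:                    # NOT clause: drop the article
                banned.add(article_id)
            else:
                scores.setdefault(article_id, []).append(score)
    # The mean of the sub-scores becomes the article's combined score.
    return {a: sum(s) / len(s) for a, s in scores.items() if a not in banned}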
At the end of all of this, I set the hits onto the HitCollector passed into the BooleanScorer2.Score(HitCollector) method. This is a custom HitCollector that I passed into the IndexSearcher.Search(Query, HitCollector) method to originally perform the search. When this method returns, my custom HitCollector now contains my search results combined together as I wanted.
Hopefully this information will be useful to someone else faced with the same problem. It sounds like a lot of effort, but it is actually pretty trivial. Most of the work is done in combining the hits together in the ConjunctionScorer. Note that this is for Lucene v2.3.2, and may be different in later versions.
To suggest just another way:
This security check is proprietary and cannot be done using the typical mechanism of storing a role in the index alongside the other fields in the document.
What about checking the permission of a property at the query-building stage? Then, if a property is explicitly hidden from the user, you avoid including it in the result tree.