Content-based reco system in neo4j for large dataset - sql

I am trying to build a book recommendation website. I have crawled some book sites and have around 15 million separate books in the DB, which is in neo4j.
Some genres, such as mystery and thriller, contain at least 1 million books each, and I have to produce a top-20 list of recommendations. My current approach is:
get the books
run a similarity comparison (vec-cosine or pearsons)
sort and display
These steps are expensive and slow, which makes them unsuitable for a real-time system. I thought about keeping a sorted list per genre by linking neo4j to a traditional DB and fetching the top entries from that DB via neo4j, but that is also slow (it takes tens of seconds). Is there a simpler and more intuitive way to do this? Any ideas will help.

It would be good to know what other criteria you would like to base your recommendations on, e.g. how exactly you measure similarities between books. I'm assuming it's not purely genre based.
One approach we have been taking with these dense nodes (such as your genres, or cities people live in, etc.), is to find recommendations first based on some other criteria, then boost the relevance score of the recommendation if it is connected to the correct dense node. Such a query is much more performant.
For example, when recommending 20 people you should be friends with, I'd find 100 candidates based on all other criteria and then boost the scores of candidates living in the same location as the user we're recommending for. That's 100 single-hop traversals, which will be very quick.
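As a rough illustration of the candidate-then-boost idea, here is a minimal Python sketch. The function name, the boost factor of 1.5 and the in-memory `mystery_books` set are all assumptions made for the example; in a real system the membership check would be a single-hop traversal from each candidate to the dense genre node.

```python
def recommend(candidates, in_same_genre, boost=1.5, top_n=20):
    """Rank candidates found by other criteria, boosting those in the target genre.

    candidates: dict of book_id -> base relevance score (from other criteria).
    in_same_genre: callable book_id -> bool; a stand-in for a single-hop
    check against the dense genre node (hypothetical helper).
    """
    boosted = {
        book: score * boost if in_same_genre(book) else score
        for book, score in candidates.items()
    }
    return sorted(boosted, key=boosted.get, reverse=True)[:top_n]


# Toy data: four candidates, two of which belong to the target genre.
mystery_books = {"b2", "b3"}
candidates = {"b1": 0.9, "b2": 0.7, "b3": 0.5, "b4": 0.6}
top = recommend(candidates, lambda b: b in mystery_books, boost=1.5, top_n=3)
print(top)
```

The point of the design is that the dense node is only touched once per candidate, never expanded.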
Have a look at this recent webinar recording, you may find some inspiration in it.
Regarding similarity measures, these may need to be pre-computed, linking similar books together by SIMILAR_TO relationships. Such pre-computation might be done using the Runtime of GraphAware Framework, which only executes this background computation during quiet periods, thus not interfering with your regular transactional processing. Look at the NodeRank module, which computes PageRank in Neo4j during quiet periods.
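To make the pre-computation idea concrete, here is a small Python sketch that computes, for each book, its k most similar books by cosine similarity. The resulting pairs are what you would store as SIMILAR_TO relationships. This is a brute-force all-pairs loop over toy feature vectors, not the GraphAware API; at 15 million books you would need to restrict the comparison (for example per genre) or use an approximate nearest-neighbour method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precompute_similar(vectors, k=2):
    """Return book_id -> list of the k most similar other book_ids.

    Brute force (quadratic in the number of books); meant to run offline.
    """
    similar = {}
    for book, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != book
        ]
        scored.sort(reverse=True)
        similar[book] = [other for _, other in scored[:k]]
    return similar


# Hypothetical two-dimensional feature vectors for three books.
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
similar = precompute_similar(vectors, k=1)
print(similar)
```

At query time the top-20 list is then a lookup of stored neighbours rather than a fresh similarity computation.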

Related

Data model guidance, database choice for aggregations on changing filter criteria

Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use-case, meaning there are no inserts, deletes or updates happening; instead, every X weeks a new complete dump is imported to replace the old data.
We would like to have an upper bound of 10 seconds to answer our queries. The faster the better; a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries, growth rate is linear.
The system has to serve a limited amount of concurrent users, max 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy to configure graphs and filters). BI tools are often used together with a data warehouse: The data is modeled, cleaned, and stored inside of the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would allow you a variety of filtering & calculation options, while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
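A minimal Python sketch of that kind of filtered, grouped count over a denormalized dataset follows. The column names and rows are made up for the example, and `initiator_id` is an extra column I am assuming so that results can be grouped per person; a warehouse or BI tool would do the same thing with a GROUP BY over pre-aggregated data.

```python
from collections import Counter

# Hypothetical denormalized interaction rows (one row per interaction).
COLUMNS = ("year", "initiator_id", "initiator_country", "responder_country", "responder_interest")
rows = [
    dict(zip(COLUMNS, values))
    for values in [
        (2018, "u1", "US", "DE", "ACDC"),
        (2018, "u1", "US", "DE", "ACDC"),
        (2018, "u2", "US", "DE", "ACDC"),
        (2018, "u2", "US", "FR", "ACDC"),
        (2017, "u2", "US", "DE", "ACDC"),
    ]
]


def count_by(rows, group_cols, **filters):
    """Count rows matching all equality filters, grouped by group_cols."""
    counts = Counter()
    for row in rows:
        if all(row[col] == value for col, value in filters.items()):
            counts[tuple(row[col] for col in group_cols)] += 1
    return counts


counts = count_by(
    rows,
    ["initiator_id"],
    year=2018,
    initiator_country="US",
    responder_country="DE",
    responder_interest="ACDC",
)
print(counts.most_common(1))
```

Because every filter column lives on the same row, changing the filter criteria only changes the keyword arguments, not the data model.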
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)

How to encode inputs like artist or actor

I am currently developing a neural network that tries to make a suggestion for a specific user based on his recent activities. I will try to illustrate my problem with an example.
Now, let's say I'm trying to suggest new music to a user based on the music he recently listened to. Since people often listen to artists they know, one input of such a neural network might be the artists he recently listened to.
The problem is the encoding of this feature. As the id of the artist in the database has no meaning for the neural network, the only other option that comes to my mind would be one-hot encoding every artist, but that doesn't sound too promising either, given the thousands of different artists out there.
My question is: How can I encode such a feature?
The approach you describe is called content-based filtering. The intuition is to recommend items to customer A similar to previous items liked by A. An advantage to this approach is that you only need data about one user, which tends to result in a "personalized" approach for recommendation. But some disadvantages include the construction of features (the problem you're dealing with now), the difficulty to build an interesting profile for new users, plus it will also never recommend items outside a user's content profile. As for the difficulty of representation, features are usually handcrafted and abstracted afterwards. For music specifically, features would be things like 'artist', 'genre', etc. and abstraction for informative keywords (if necessary) is widely done using tf-idf.
This may go outside the scope of the question, but I think it is also worth mentioning an alternative approach to this: collaborative filtering. Rather than similar items, here we instead try to find users with similar tastes and recommend products that they liked. The only data you need here are some sort of user ratings or values of how much they (dis)liked some data - eliminating the need for feature design. Furthermore, since we analyze similar persons rather than items for recommendation, this approach tends to also work well for new users. The general flow for collaborative filtering looks like:
Measure similarity between user of interest and all other users
(optional) Select a smaller subset consisting of most similar users
Predict ratings as a weighted combination of "nearest neighbors"
Return the highest rated items
A popular approach for the similarity weighting in the algorithm is based on the Pearson correlation coefficient.
Finally, something to consider here is the need for performance/scalability: calculating pairwise similarities for millions of users is not really light-weight on a normal computer.
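The collaborative filtering flow described above can be sketched in plain Python as follows. The ratings are toy data, and choices such as ignoring neighbours with non-positive correlation and computing Pearson only over co-rated items are assumptions of this sketch rather than the only way to do it.

```python
import math


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def recommend(ratings, user, k=2, top_n=3):
    """Predict ratings for unseen items from the k most similar users."""
    # Step 1: similarity between the user of interest and all other users.
    sims = {o: pearson(ratings[user], ratings[o]) for o in ratings if o != user}
    # Step 2: keep the k most similar users with positive correlation.
    neighbours = sorted((o for o in sims if sims[o] > 0), key=sims.get, reverse=True)[:k]
    # Step 3: similarity-weighted average of the neighbours' ratings.
    totals, weights = {}, {}
    for other in neighbours:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sims[other] * rating
                weights[item] = weights.get(item, 0.0) + sims[other]
    predicted = {item: totals[item] / weights[item] for item in totals}
    # Step 4: return the highest predicted items.
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


ratings = {
    "ann": {"a": 5, "b": 3, "c": 1},
    "bob": {"a": 5, "b": 3, "c": 1, "d": 5, "e": 2},
    "cat": {"a": 1, "b": 3, "c": 5, "d": 1, "e": 5},
}
print(recommend(ratings, "ann"))
```

Note that step 1 is the part that scales poorly: it compares the target user with every other user.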

Horizontal scaling of search query

We are building a CV scoring service, and we are using Postgres to run complex queries that find the CVs that best match a vacancy.
The problem is that we use a really complex set of heuristics to score a CV against a vacancy, and the average number of CVs to be scored per query is growing.
I want to move this kind of load out of the database, and I am looking for existing solutions for scaling such a load horizontally.
Each query should execute in a fraction of a second, and there can be hundreds of concurrent queries. Each query scores 10k CVs on average. Each CV is about 50 records in maybe 10 tables in its current relational form.
I want a clustered system that runs each query in multiple parallel processes (on many servers) and returns an aggregated result. It should be fast and fault tolerant.
I looked at Hadoop, but it seems to be designed for batch processing, not for real-time, low-latency load. There is Apache Storm, but it is designed for continuous stream processing. So I am not sure :)
What kind of tool would suit my needs?
Thank you!
1) Make sure you are not redoing work: if a CV has been scored, tag it as scored and don't reprocess it unless necessary.
2) If you are not already partitioning the data in Postgres, you might want to do that. Usually not all rows need to be accessed regularly.
3) It sounds like you primarily want to scale reads; in that case a Postgres read-only cluster could be an option.
4) Take a look at Elasticsearch; it is designed to do weighted scoring, faceting, etc. It should also scale, though I haven't tried that myself.
I would definitely start with 1 though: don't do work unless you have to.

Solr Relevancy - How to A/B Test for Search Quality?

I am looking to perform live A/B and controlled side-by-side experiments to help understand how changes affect search quality. I will be testing variables such as boost values and fuzzy queries.
What other metrics are used to determine whether users prefer A vs B? Here are 2 metrics I found online...
In Google Analytics, “% Search Exits” is a metric you can use to measure the quality of your site-search results
Another way to measure search quality is to measure the number of search result pages the visitor views.
Search quality is not easily measurable. To measure relevance you need a couple of things:
A competitor to measure relevance against. In your case, the different instances of your search engine will be competitors for each other: one instance would run the basic algorithm, another would have fuzzy matching enabled, another would have both fuzzy matching and boosting, and so on.
You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and then for the holes (i.e. query/url pairs that were not rated) you can use a dynamic ranking function based on a "Learning to Rank" algorithm: http://en.wikipedia.org/wiki/Learning_to_rank. Don't be surprised by that; it really is how it is done (see the Google/Bing example below).
Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world, and invest millions in them, to rate their results for queries. For each query, generally the top 3 or top 5 query/url pairs are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.
According to wikipedia:
Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
Wikipedia explains NDCG in a great manner. It is a short article, please go through that.
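For reference, here is a small Python sketch of DCG and NDCG over a list of graded relevance judgments, given in the order the engine returned the results. It uses the common rel / log2(rank + 1) discount. As a simplification, the ideal ordering is taken from the same judged list; a fuller evaluation would build it from all judged documents for the query.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Judgments for the top 3 results of one query under two engine variants.
variant_a = [3, 2, 1]  # best result first
variant_b = [1, 2, 3]  # best result last
print(ndcg(variant_a), ndcg(variant_b))
```

Averaging NDCG over the rated query set gives one number per variant that can be compared across A and B.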
As you have mentioned, you can also use click-through rate/data, which gives you a kind of wisdom-of-the-crowd algorithm, and tweak the relevance based on that. It is a very good approach, but it attracts spamming, so it has to be coupled with a metric such as NDCG or MAP to solve your relevance problem.
I can provide more details if you need to know more about how all of this would fit together in your case.

How to evaluate a search engine?

I am a student carrying out a study to enhance a search engine's existing algorithm.
I want to know how I can evaluate the search engine - which I have improved - to quantify how much the algorithm was improved.
How should I go about comparing the old and new algorithm?
Thanks
This is normally done by creating a test suite of questions and then evaluating how well the search response answers those questions. In some cases the responses should be unambiguous (if you type slashdot into a search engine you expect to get slashdot.org as your top hit), so you can think of these as a class of hard queries with 'correct' answers.
Most other queries are inherently subjective. To minimise bias you should ask multiple users to try your search engine and rate the results for comparison with the original. Here is an example of a computer science paper that does something similar:
http://www.cs.uic.edu/~liub/searchEval/SearchEngineEvaluation.htm
Regarding specific comparison of the algorithms, although obvious, what you measure depends on what you're interested in knowing. For example, you can compare efficiency in computation, memory usage, crawling overhead or time to return results. If you are trying to produce very specific behaviour, such as running specialist searches (e.g. a literature search) for certain parameters, then you need to explicitly test this.
Heuristics for relevance are also a useful check. For example, when someone uses search terms that are probably 'programming-related', do you tend to get more results from stackoverflow.com? Would your search results be better if you did? If you are providing a set of trust weightings for specific sites or domains (e.g. rating .edu or .ac.uk domains as more trustworthy for technical results), then you need to test the effectiveness of these weightings.
First, let me start out by saying kudos to you for attempting to apply traditional research methods to search engine results. Many SEOs have done this before you, and they generally keep it to themselves, as sharing "amazing findings" usually means you can no longer exploit them or have the upper hand. That said, I will share as best I can some pointers and things to look for.
Identify which part of the algorithm you are trying to improve.
Different searches execute different algorithms.
Broad Searches
For instance, in a broad-term search, engines tend to return a variety of results. Common parts of these results include
News Feeds
Products
Images
Blog Posts
Local Results (this is based off of a Geo IP lookup).
Which of these result types are thrown into the mix can vary based on the word.
Example: Cats returns images of cats, and news, Shoes returns local shopping for shoes. (this is based on my IP in Chicago on October 6th)
The goal in returning results for a broad term is to provide a little bit of everything for everyone so that everyone is happy.
Regional Modifiers
Generally, any time a regional term is attached to a search, it will modify the results greatly. If you search for "Chicago web design", because the word Chicago is attached, the results will start with a top 10 of regional results (these are the one-liners to the right of the map); after that, 10 listings will display in general "result fashion".
The results in the "top ten local" tend to be drastically different than those in organic listing below. This is because the local results (from google maps) rely on entirely different data for ranking.
Example: Having a phone number on your website with the area code of Chicago will help in local results... but NOT in the general results. Same with address, yellow book listing and so forth.
Results Speed
Currently (as of 10/06/09) Google is beta testing "caffeine" The main highlight of this engine build is that it returns results in almost half the time. Although you may not consider Google to be slow now... speeding up an algorithm is important when millions of searches happen every hour.
Reducing Spam Listings
We have all experienced a search that was riddled with spam. The new release of Google Caffeine http://www2.sandbox.google.com/ is a good example. Over the last 10+ years, one of the largest battles online has been between Search Engine Optimizers and Search Engines. Gaming Google (and other engines) is highly profitable, and combating it is what Google spends most of its time on.
A good example is again the new release of Google Caffeine. So far my research and also a few others in the SEO field are finding this to be the first build in over 5 years to put more weight on Onsite elements (such as keywords, internal site linking, etc) than prior builds. Before this, each "release" seemed to favor inbound links more and more... this is the first to take a step back towards "content".
Ways to test an algorithm.
Compare two builds of the same engine. This is currently possible by comparing Caffeine (see link above or google, google caffeine) and the current Google.
Compare local results in different regions. Try finding search terms like web design that return local results without a local keyword modifier. Then, use a proxy (found via Google) to search from various locations. You will want to make sure you know the proxy's location (find a site on Google that will tell you your IP address's geo-IP zip code or city). Then you can see how different regions return different results.
Warning... DON'T pick the term locksmith... and be wary of any terms that, when returning results, have LOTS of spammy listings. Google local is fairly easy to spam, especially in competitive markets.
Do as mentioned in a prior answer, compare how many "click backs" users require to find a result. You should know, currently, no major engines use "bounce rates" as indicators of sites accuracy. This is PROBABLY because it would be EASY to make it look like your result has a bounce rate in the 4-8% range without actually having one that low... in other words it would be easy to game.
Track how many search variations users use on average for a given term in order to find the result that is desired. This is a good indicator of how well an engine is smart guessing the query type (as mentioned WAY up in this answer).
**Disclaimer. These views are based on my industry experience as of October 6th, 2009. One thing about SEO and engines is they change EVERY DAY. Google could release Caffeine tomorrow, and this would change a lot... that said, this is the fun of SEO research!
Cheers
In order to evaluate something, you have to define what you expect from it. This will help to define how to measure it.
Then, you'll be able to measure the improvement.
Concerning a search engine, I guess that you might be able to measure its ability to find things and its accuracy in returning what is relevant.
It's an interesting challenge.
I don't think you will find a final mathematical solution if that is your goal. In order to rate a given algorithm, you require standards and goals that must be accomplished.
What is your baseline to compare against?
What do you classify as "improved"?
What do you consider a "successful search"?
How large is your test group?
What are your tests?
For example, if your goal is to improve the process of page ranking then decide if you are judging the efficiency of the algorithm or the accuracy. Judging efficiency means that you time your code for a consistent large data set and record results. You would then work with your algorithm to improve the time.
If your goal is to improve accuracy then you need to define what is "inaccurate". If you search for "Cup" you can only say that the first site provided is the "best" if you yourself can accurately define what is the best answer for "Cup".
My suggestion for you would be to narrow the scope of your experiment. Define one or two qualities of a search engine that you feel need refinement and work towards improving them.
In the comments you've said "I have heard about a way to measure the quality of the search engines by counting how many times a user needs to click a back button before finding the link he wants, but I can't use this technique because you need users to test your search engine and that is a headache itself". Well, if you put your engine on the web for free for a few days and advertise a little, you will probably get at least a couple dozen tries. Provide these users with the old or new version at random, and measure those clicks.
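A minimal Python sketch of that experiment is below. Instead of a fresh random draw per visit, it hashes a user identifier so that the same user always sees the same version; that substitution, the function names and the log format are all assumptions of this example.

```python
import hashlib
from statistics import mean


def assign_variant(user_id, variants=("old", "new")):
    """Stable pseudo-random assignment: hash the user id and pick a variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def mean_back_clicks(log):
    """log: iterable of (variant, back_clicks_before_success) pairs."""
    by_variant = {}
    for variant, clicks in log:
        by_variant.setdefault(variant, []).append(clicks)
    return {variant: mean(clicks) for variant, clicks in by_variant.items()}


# Hypothetical measurements collected over a few days.
log = [("old", 3), ("old", 1), ("new", 1), ("new", 0)]
print(assign_variant("user-42"), mean_back_clicks(log))
```

With only a couple dozen tries the difference between the means can easily be noise, so treat small gaps with caution.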
Another possibility: assume Google is by definition perfect, and compare your answers to its answers for certain queries. (Maybe use the sum of the distances of your top ten links to their counterparts at Google; for example, if your second link is Google's twelfth link, that's a distance of 10.) That's a huge assumption, but far easier to implement.
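The summed rank distance could be sketched in Python like this. The penalty applied when one of your links does not appear in the reference list at all is my own assumption (the length of the reference list); the answer above does not say how to handle that case.

```python
def rank_distance(mine, reference, missing_penalty=None):
    """Sum of |my rank - reference rank| over my results.

    mine, reference: lists of URLs in ranked order.
    missing_penalty: distance charged for a URL absent from the reference
    list (assumed here to default to len(reference)).
    """
    if missing_penalty is None:
        missing_penalty = len(reference)
    ref_pos = {url: i for i, url in enumerate(reference)}
    total = 0
    for i, url in enumerate(mine):
        total += abs(i - ref_pos[url]) if url in ref_pos else missing_penalty
    return total


# Toy rankings: the first two results are swapped relative to the reference.
print(rank_distance(["a", "b", "c"], ["b", "a", "c", "d"]))
```

A lower total means your ranking is closer to the reference engine's ranking for that query.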
Information scientists commonly use precision and recall as two competing measures of quality for an information retrieval system (like a search engine).
So you could measure your search engine's performance relative to Google's by, for example, counting what fraction of the top 10 results are relevant (call that precision) and what fraction of the important pages for that query, the ones you think should have been in the top 10, actually are there (call that recall).
You'll still need to compare the results from each search engine by hand on some set of queries, but at least you'll have one metric to evaluate them on. And the balance of these two is important too: otherwise you can trivially get perfect precision by not returning any results or perfect recall by returning every page on the web as a result.
The Wikipedia article on precision and recall is quite good (and defines the F-measure which takes into account both).
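As a quick Python sketch, precision, recall and the balanced F-measure (F1) for one query can be computed from the set of returned results and a hand-judged set of relevant pages. Returning 0.0 for an empty result set is a convention chosen for this example.

```python
def precision_recall_f1(returned, relevant):
    """returned: results the engine gave; relevant: pages judged relevant by hand."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Empty sets are scored 0.0 here by convention (an assumption of this sketch).
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy judgments: 2 of the 4 returned pages are relevant; 1 relevant page was missed.
precision, recall, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall, f1)
```

Because F1 is the harmonic mean of the two, it stays low whenever either precision or recall is low, which guards against the trivial strategies mentioned above.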
I have had to test a search engine professionally. This is what I did.
The search included fuzzy logic. The user would type into a web page "Kari Trigger", and the search engine would retrieve entries like "Gary Trager", "Trager, C", "Corey Trager", etc, each with a score from 0->100 so that I could rank them from most likely to least likely.
First, I re-architected the code so that it could be executed removed from the web page, in a batch mode using a big file of search queries as input. For each line in the input file, the batch mode would write out the top search result and its score. I harvested thousands of actual search queries from our production system and ran them thru the batch setup in order to establish a baseline.
From then on, each time I modified the search logic, I would run the batch again and then diff the new results against the baseline. I also wrote tools to make it easier to see the interesting parts of the diff. For example, I didn't really care if the old logic returned "Corey Trager" as an 82 and the new logic returned it as an 83, so my tools would filter those out.
I could not have accomplished as much by hand-crafting test cases. I just wouldn't have had the imagination and insight to have created good test data. The real world data was so much richer.
So, to recap:
1) Create a mechanism that lets you diff the results of running new logic versus the results of prior logic.
2) Test with lots of realistic data.
3) Create tools that help you work with the diff, filtering out the noise, enhancing the signal.
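The baseline-diff workflow in this recap might look roughly like the following Python sketch. The data layout (query mapped to top hit and score) and the tolerance of 2 points are assumptions for illustration, not the tooling the answer actually used.

```python
def diff_results(baseline, current, score_tolerance=2):
    """Return queries whose top hit changed or whose score moved noticeably.

    baseline, current: dict of query -> (top_hit, score).
    Score changes within score_tolerance are treated as noise and dropped.
    """
    changes = {}
    for query, old in baseline.items():
        new = current.get(query)
        if new is None:
            # The query produced no result in the new run.
            changes[query] = (old, None)
            continue
        if new[0] != old[0] or abs(new[1] - old[1]) > score_tolerance:
            changes[query] = (old, new)
    return changes


# Toy batch outputs for three harvested queries.
baseline = {
    "kari trigger": ("Corey Trager", 82),
    "jon smith": ("John Smith", 90),
    "ann lee": ("Anne Lee", 70),
}
current = {
    "kari trigger": ("Corey Trager", 83),  # 82 -> 83: filtered out as noise
    "jon smith": ("Jon Smyth", 88),        # different top hit: reported
    "ann lee": ("Anne Lee", 40),           # large score drop: reported
}
changes = diff_results(baseline, current)
print(sorted(changes))
```

Running this after every change to the search logic turns thousands of real queries into a short list of differences worth inspecting.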
You have to clearly identify positive and negative qualities such as how fast one gets the answer they are seeking or how many "wrong" answers they get on the way there. Is it an improvement if the right answer is #5 but the results are returned 20 times faster? Things like that will be different for each application. The correct answer may be more important in a corporate knowledge base search but a fast answer may be needed for a phone support application.
Without parameters no test can be claimed to be a victory.
Embrace the fact that the quality of search results is ultimately subjective. You should have multiple scoring algorithms for your comparison: the old one, the new one, and a few control groups (e.g. scoring by URI length or page size or some similarly intentionally broken concept). Now pick a bunch of queries that exercise your algorithms, say a hundred or so. Let's say you end up with 4 algorithms total. Make a 4x5 table, displaying the first 5 results of a query across each algorithm. (You could do top ten, but the first five are way more important.) Be sure to randomize which algorithm appears in each column. Then plop a human in front of this thing and have them pick which of the 4 result sets they like best. Repeat across your entire query set. Repeat for as many more humans as you can stand. This should give you a fair comparison based on total wins for each algorithm.
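A bare-bones Python sketch of that blind side-by-side test follows: shuffle the column order per query, record which column the judge picked, and map the pick back to the algorithm when tallying wins. The function names and data shapes are hypothetical.

```python
import random
from collections import Counter


def build_trial(results_by_algorithm, rng):
    """Return (column order, columns) with algorithms in a shuffled order.

    results_by_algorithm: dict of algorithm name -> ranked result list.
    Only the first 5 results of each algorithm are shown.
    """
    order = list(results_by_algorithm)
    rng.shuffle(order)
    columns = [results_by_algorithm[name][:5] for name in order]
    return order, columns


def tally(judgements):
    """judgements: iterable of (column order, index of the chosen column)."""
    wins = Counter()
    for order, chosen_column in judgements:
        wins[order[chosen_column]] += 1
    return wins


rng = random.Random(0)  # seeded only to make the example repeatable
results = {
    "old": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "new": ["u2", "u1", "u7", "u3", "u8", "u9"],
    "by_uri_length": ["u9", "u8", "u7", "u6", "u5", "u4"],
}
order, columns = build_trial(results, rng)
wins = tally([(["old", "new"], 1), (["new", "old"], 0), (["old", "new"], 0)])
print(order, wins)
```

Keeping the column order alongside each judgement is what lets the display stay blind while the tally stays correct.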
http://www.bingandgoogle.com/
Create an app like this that compares and extracts the data. Then run a test with 50 different things you need to look for and then compare with the results you want.