Text Analysis Services/Libraries - text-mining

Right now I'm getting tweets from Twitter Streaming API and doing some semantic analysis. Basically I want to extract tweets about people leaving home and going somewhere (a place) for a vacation or business trip or some related matters, so I can recommend them a weather app that can show the real-time weather of the place they're going to.
Now I'm using key words like: going to, heading for, leaving for, trip and travel for the streaming API and then I feed the filtered tweets to AYLIEN to do semantic labeling. I'm currently using labels: trip, travel, vacation and holiday. As long as any of those labels has a score higher than 0.01, I consider the corresponding tweet as the one I want.
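For reference, the acceptance rule described above boils down to something like the following sketch. The `{label: score}` dictionary is a hypothetical stand-in for whatever the labeling service returns (its real response format is not shown here), and the function name is made up for illustration.

```python
# Hypothetical input shape: {label: score}. The real service response may differ.
WANTED_LABELS = ("trip", "travel", "vacation", "holiday")
THRESHOLD = 0.01


def is_travel_tweet(label_scores, labels=WANTED_LABELS, threshold=THRESHOLD):
    """Return True if any wanted label scores strictly above the threshold."""
    return any(label_scores.get(label, 0.0) > threshold for label in labels)


print(is_travel_tweet({"trip": 0.02, "work": 0.90}))  # accepted despite the low score
```

With a cut-off this low, a tweet is accepted as soon as any of the four labels gets even a weak score, which may explain part of the noise.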
However, AYLIEN is not working well for this task: I still get a lot of tweets such as "I am leaving for a while", "Babe is leaving for 2 weeks" and "I'm leaving for work", which are not what I want.
So I want to know if anybody knows of any decent text analysis services or libraries that can help me achieve my goal. Thanks.

Related

Number of results google (or other) search programmatically

I am making a little personal project.
Ideally I would like to run a Google search programmatically and get the result count. (My goal is to compare the result counts for a large number (100000+) of different phrases.)
Is there a free way to run web searches and compare the popularity of different texts, using Google, Bing or anything else (the source is not really important)?
I tried Google, but it seems the free tier only allows 10 requests per day.
Bing is more permissive (5000 free requests per month).
Are there other tools or ways to get the number of results for a particular sentence for free?
Thanks in advance.
There are several things you're going to need if you're seeking to create a simple search engine.
First of all, you should understand where the field of information retrieval started by reading G. Salton's paper, or at least the wiki page on the vector space model. This requires at least some undergraduate linear algebra; I suggest Gilbert Strang's MIT video lectures for that.
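To give a feel for the vector space model, here is a minimal sketch that ranks documents by cosine similarity between raw term-count vectors. It is only an illustration: the function names are made up, tokenization is a plain whitespace split, and a real system would add tf-idf weighting, stemming and an inverted index.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (no stemming, no tf-idf weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


for score, doc in rank("paris weather", ["weather in paris", "cats and dogs"]):
    print(round(score, 3), doc)
```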
You can then move on to the Brin/Page PageRank paper, which lays out the original concept behind the hyperlink matrix and quickly calculating eigenvectors for ranking, or read the wiki page.
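As a rough illustration of the eigenvector idea, the sketch below runs a plain power iteration over a tiny link graph. The damping value of 0.85, the fixed iteration count and the handling of pages without outgoing links are assumptions of this sketch, not details taken from any particular implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```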
You may also be interested in looking at the code for Apache Lucene.
To get into contemporary search techniques you need calculus and regression analysis in order to learn machine learning and deep learning, as the current Google search has moved away from PageRank and relies on these. This is partially due to how link farming enabled people to artificially engineer search results, and to the huge amount of metadata that modern browsers and web servers allow to be collected.
EDIT:
For the webcrawler only portion I'd recommend WebSPHINX. I used this in my senior research in college in conjunction with Lucene.

Foursquare API: Getting an exhaustive list of venues in a given area

I'm using Foursquare API to get a list of venues of a certain category.
One important requirement is that the list is exhaustive, i.e. includes all relevant points. The v2/venues/search API endpoint enforces a limit of 50 venues on the output.
So the first idea that comes to mind is splitting the area into several sections (using "sw" and "ne" params) and then combining the results.
Clearly, the density of points will vary dramatically depending on location, so we'll need to use some kind of adaptive algorithm to flexibly adjust the size of the search window so that it contains all points. Also, there's an increased risk of running into the rate limit, so we might need the algorithm to stop when it's used up its quota of requests.
Finally, it seems that the only way to tell if a search window should be shrunk even further is to count the number of points in the result: if we have less than 50, then we've got a complete list for this section and can move on to the next one; otherwise, we should split it further. It seems to be wasteful as we'll be throwing away the intermediate results (i.e. all results in our search tree except for the leaves).
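The adaptive splitting described above could be sketched roughly as follows. `search(sw, ne)` is a hypothetical callable wrapping the venue search request (it is assumed to return a list of dicts with an "id" key); the request budget and the minimum window size are made-up parameters. Note that results from windows that hit the cap are kept and de-duplicated by id rather than thrown away.

```python
def collect_venues(search, sw, ne, limit=50, max_requests=500, min_size=1e-4):
    """Split a bounding box into quadrants until each returns fewer than `limit`.

    sw and ne are (lat, lng) tuples. `search` is a hypothetical wrapper around
    the venue search call and must return a list of dicts with an "id" key.
    """
    venues = {}
    pending = [(sw, ne)]
    requests = 0
    while pending and requests < max_requests:
        (s_lat, w_lng), (n_lat, e_lng) = pending.pop()
        results = search((s_lat, w_lng), (n_lat, e_lng))
        requests += 1
        # Keep intermediate results too: they are valid venues, just not all of them.
        for venue in results:
            venues[venue["id"]] = venue
        too_small = (n_lat - s_lat) < min_size or (e_lng - w_lng) < min_size
        if len(results) >= limit and not too_small:
            mid_lat = (s_lat + n_lat) / 2
            mid_lng = (w_lng + e_lng) / 2
            pending.extend([
                ((s_lat, w_lng), (mid_lat, mid_lng)),
                ((s_lat, mid_lng), (mid_lat, e_lng)),
                ((mid_lat, w_lng), (n_lat, mid_lng)),
                ((mid_lat, mid_lng), (n_lat, e_lng)),
            ])
    return list(venues.values()), requests
```

If the loop stops because the request budget ran out, `pending` still holds unexplored windows, so a real implementation would probably want to return or persist them in order to resume later.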
So here are some questions that I have:
Is it the best way to put together an exhaustive list? Maybe I'm missing some API functionality?
Is there any specific algorithm you'd use in this case?
How would you go about reducing the number of results that have to be thrown away?
Thanks in advance!
An important disclaimer would be that foursquare does not like it when you perform a lot of searches in the same area.
Having said that, you should look into experimenting with the categoryId filter in the venue search API. Most of the data on foursquare is food (restaurants) and nightlife related.
So if you leave these out (by listing the other categories, since there is no way to exclude a category) you can search a larger area and still get fewer than 50 results.
I never really tried such an algorithm because the categoryId filtering worked well enough, but in theory the algorithm is simple: 0.001 degrees of latitude is ~111 meters (0.001 degrees of longitude is shorter the further you are from the equator).
Search using a small radius (~200 meters for large metropolitan areas) and scan the area step by step.
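A rough sketch of such a scan is below. It only generates the search centers; the actual API call is left out. The constant of about 111,320 meters per degree of latitude is an approximation, and the grid spacing of radius times sqrt(2) is chosen so that circles on a square grid roughly cover the whole box. The function name is made up.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate; 0.001 degrees is ~111 m


def scan_centers(sw, ne, radius_m=200.0):
    """Return (lat, lng) search centers that roughly cover the box sw..ne."""
    step_m = radius_m * math.sqrt(2)  # spacing at which neighbouring circles overlap
    lat_step = step_m / METERS_PER_DEGREE_LAT
    rows = math.ceil((ne[0] - sw[0]) / lat_step) + 1
    centers = []
    for i in range(rows):
        lat = sw[0] + i * lat_step
        # A degree of longitude shrinks with the cosine of the latitude.
        lng_step = step_m / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
        cols = math.ceil((ne[1] - sw[1]) / lng_step) + 1
        for j in range(cols):
            centers.append((lat, sw[1] + j * lng_step))
    return centers


print(len(scan_centers((40.0, -74.0), (40.01, -73.99))))
```

Each center still costs one request, so the number of centers is a quick way to estimate how fast a scan would run into the rate limit.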
What originally got us to perform a lot of searches (and later stop doing so) is that foursquare sometimes filters out results without telling you (to me it looks like a bug; to them it is part of the algorithm). For example, I would search with a 50 meter radius and find the place I wanted (I knew what I was searching for), expand to 500 meters and not find it (while getting fewer than 50 results, so it was not dropped because I hit the cap; it was dropped for some unknown reason), then move my search location ~300 meters north and find it again. The behavior is sporadic.
My point (and the reason we stopped making a lot of searches and changed our approach) is that 'complete coverage' is very hard to achieve given the current API and the current usage policy, and it is not really important. After a few months of playing with it, we figured out that we should query foursquare for what our users are looking for at the moment and cache the results. Over time we will have complete coverage; we may miss a few spots at the start, but in the long run it's not really important.
Hopefully this is not what you're doing, but as a friendly reminder: scraping foursquare's website and/or API is very much prohibited by its terms of service.

I am looking for a radio advertising scheduling algorithm / example / experience

Tried doing a bit of research on the following with no luck. Thought I'd ask here in case someone has come across it before.
I help a volunteer-run radio station with their technology needs. One of the main things that have come up is they would like to schedule their advertising programmatically.
There are a lot of neat and complex rule engines out there for advertising, but all we need is something pretty simple (along with any experience that's worth thinking about).
I would like to write something in SQL if possible to deal with these entities. Ideally if someone has written something like this for other advertising mediums (web, etc.,) it would be really helpful.
Entities:
Ads (consisting of a category, # of plays per day, start date, end date or permanent play)
Ad Category (Restaurant, Health, Food store, etc.)
To over-simplify the problem, this will be an elegant SQL statement. Getting there... :)
I would like to be able to generate a playlist per day using the above two entities where:
No two ads in the same category are played within x number of ads of each other.
(nice to have) high promotion ads can be pushed
At this time, there are no "ad slots" to fill. There are no "time of day" considerations.
We queue up the ads for the day and go through them between songs/shows, etc. We know how many per hour we have to fill, etc.
Any thoughts/ideas/links/examples? I'm going to keep on looking and hopefully come across something instead of learning it the long way.
Very interesting question, SMO. Right now it looks like a constraint programming problem because you aren't looking for an optimal solution, just one that satisfies all the constraints you have specified. In response to those who wanted to close the question, I'd say they need to check out constraint programming a bit. It's far closer to stackoverflow than any operations research sites.
Look into constraint programming and scheduling - I'll bet you'll find an analogous problem toot sweet !
Keep us posted on your progress, please.
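To make the constraint-satisfaction view concrete, here is a small backtracking sketch (not a real constraint solver). It assumes each ad is a `(name, category, plays_per_day)` tuple and reads "within x ads of each other" as "at least `gap` other ads in between"; both are assumptions of this sketch. It uses recursion, so very long daily playlists would need an iterative version, and the worst case is exponential.

```python
def build_playlist(ads, gap):
    """Order the day's ad plays so that same-category ads are at least
    `gap` other ads apart. Returns a list of ad names, or None if impossible.

    ads: list of (name, category, plays_per_day) tuples (hypothetical shape).
    """
    remaining = {name: plays for name, _, plays in ads}
    category = {name: cat for name, cat, _ in ads}
    total = sum(remaining.values())
    playlist = []

    def allowed(name):
        recent = playlist[-gap:] if gap > 0 else []
        return all(category[r] != category[name] for r in recent)

    def extend():
        if len(playlist) == total:
            return True
        # Try ads with the most plays left first; they are the hardest to place.
        candidates = sorted((n for n in remaining if remaining[n] > 0),
                            key=lambda n: -remaining[n])
        for name in candidates:
            if allowed(name):
                playlist.append(name)
                remaining[name] -= 1
                if extend():
                    return True
                remaining[name] += 1  # undo and try the next candidate
                playlist.pop()
        return False

    return playlist if extend() else None


ads = [("pizza", "Restaurant", 2), ("gym", "Health", 2), ("grocer", "Food store", 2)]
print(build_playlist(ads, gap=2))
```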
Ignoring the T-SQL request for the moment since that's unlikely to be the best language to write this in ...
One of my favorite approaches to tough 'layout' problems like this is Simulated Annealing. It's a good approach because you don't need to think HOW to solve the actual problem: all you define is a measure of how good the current layout is (a score if you will) and then you allow random changes that either increase or decrease that score. Over many iterations you gradually reduce the probability of moving to a worse score. This 'simulated annealing' approach reduces the probability of getting stuck in a local minimum.
So in your case the scoring function for a given layout might be based on the distance to the next advert in the same category and the distance to another advert of the same series. If you later have time of day considerations you can easily add them to the score function.
Initially you allocate the adverts sequentially, evenly or randomly within their time window (doesn't really matter which). Now you pick two slots and consider what happens to the score when you switch the contents of those two slots. If either advert moves out of its allowed range you can reject the change immediately. If both are still in range, does it move you to a better overall score? Initially you take changes randomly even if they make it worse but over time you reduce the probability of that happening so that by the end you are moving monotonically towards a better score.
Easy to implement, easy to add new 'rules' that affect score, can easily adjust run-time to accept a 'good enough' answer, ...
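A minimal sketch of the swap-based annealing loop described above. The score here is a penalty (lower is better) that counts same-category ads sitting within `gap` positions of each other; the linear cooling schedule, the step count and the starting temperature are arbitrary assumptions, and time windows are left out.

```python
import math
import random


def penalty(order, category, gap):
    """Count pairs of same-category ads within `gap` positions of each other."""
    bad = 0
    for i, name in enumerate(order):
        for other in order[i + 1:i + 1 + gap]:
            if category[other] == category[name]:
                bad += 1
    return bad


def anneal(order, category, gap, steps=20000, start_temp=2.0, seed=None):
    """Swap random pairs of slots, accepting worse layouts less often over time."""
    rng = random.Random(seed)
    current = list(order)
    score = penalty(current, category, gap)
    best, best_score = list(current), score
    if len(current) < 2:
        return best, best_score
    for step in range(steps):
        if best_score == 0:
            break  # every constraint is satisfied
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.randrange(len(current)), rng.randrange(len(current))
        current[i], current[j] = current[j], current[i]
        new_score = penalty(current, category, gap)
        delta = new_score - score
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            score = new_score
            if score < best_score:
                best, best_score = list(current), score
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_score


category = {"r1": "Restaurant", "r2": "Restaurant", "h1": "Health", "f1": "Food store"}
start = ["r1", "r1", "r2", "r2", "h1", "h1", "f1", "f1"]
print(anneal(start, category, gap=1, seed=1))
```

Adding a new rule (high-promotion ads, time of day) would only mean adding another term to `penalty`.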
Another approach would be to use a genetic algorithm; see this similar question: Best Fit Scheduling Algorithm. This is likely harder to program but will probably converge more quickly on a good answer.

Automatic music rating based on listening habits

I've created a Winamp-like music player in Delphi. Not so complex, of course. Just a simple one.
But now I would like to add a more complex feature: Songs in the library should be automatically rated based on the user's listening habits.
This means: The application should "understand" if the user likes a song or not. And not only whether he/she likes it but also how much.
My approach so far (data which could be used):
Simply measure how often a song was played per time. Start counting time when the song was added to the library so that recent songs don't have any disadvantage.
Measure how long a song was played on average (minutes).
Starting a song and then immediately switching to another one should lower the ranking, since the user apparently didn't like the song.
...
Could you please help me with this problem? I would just like to have some ideas. I don't need the implementation in Delphi.
I would track all of your users' listening habits in a central database, so you can make recommendations based on what other people like too ("people that liked this song, also liked these other songs")
some other metrics to consider:
proportion of times that the song was immediately replayed (ex. this song was immediately replayed 12% of the times it was played)
did they turn on the "repeat this song" button during play?
times played per hour, day, week, month
proportion of times this song was skipped. (ex. this song was played, but immediately skipped 99% of the time)
proportion of song listened to (the user listened to 50% of this song on average, versus 100% of some other song)
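A few of these proportions could be computed from a play log along these lines. The `(listened_s, length_s, replayed)` tuple shape and the 10% cut-off for what counts as a skip are assumptions made for this sketch.

```python
def play_stats(plays, skip_fraction=0.1):
    """Summarise one song's play history.

    plays: list of (listened_s, length_s, replayed) tuples (hypothetical shape).
    A play counts as skipped when less than `skip_fraction` of the song was heard.
    """
    if not plays:
        return {"skip_rate": 0.0, "replay_rate": 0.0, "avg_fraction": 0.0}
    fractions = [min(listened / length, 1.0) for listened, length, _ in plays]
    n = len(plays)
    return {
        "skip_rate": sum(1 for f in fractions if f < skip_fraction) / n,
        "replay_rate": sum(1 for _, _, replayed in plays if replayed) / n,
        "avg_fraction": sum(fractions) / n,
    }


log = [(200, 200, True), (5, 200, False), (100, 200, False), (200, 200, False)]
print(play_stats(log))
```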
also:
listen in on the user's microphone. do they sing along? :D
what volume do they play the song? do they crank it up?
Put in a "recommend this song to friends" button (that emails song title to friend or something). Songs they recommend, they probably like.
You might want to do some feature extraction on the audio stream, and find similar songs. This is hard, but you can read more about it here:
"Automatic Feature Extraction for Classifying Audio Data "
Link
"Understandable models Of music collections based on exhaustive feature generation with temporal statistics"
http://portal.acm.org/citation.cfm?id=1150523
"Collaborative Use of Features in a Distributed System for the Organization of Music Collections"
http://www.idea-group.com/Bookstore/Chapter.aspx?TitleId=24432
Measure how long a song was played on average (minutes).
I don't think this is a good metric, because a long song would gain an unfair advantage over a short song. You should use a percentage instead:
avg. time played / total song length
Please let the rating decay over time. You probably like songs better if you heard them often during the last n days, while older songs should only get an occasional mention, since you like them but have probably heard them too often.
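One simple way to do this is exponential decay, where each play loses half of its weight after a fixed number of days. The 30-day half-life below is an arbitrary assumption; the sketch only shows the shape of the idea.

```python
def decayed_score(play_ages_days, half_life_days=30.0):
    """Sum of play weights; a play loses half its weight every half_life_days."""
    return sum(0.5 ** (age / half_life_days) for age in play_ages_days)


# Three plays this week outweigh five plays from about a year ago.
print(decayed_score([1, 2, 3]), decayed_score([360, 361, 362, 363, 364]))
```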
Last but not least, you could add beat detection (and maybe frequency spectrum analysis) to find similar songs, which could provide you with more data than what the user supplies by listening to songs.
I would also group songs that have the same ID3 tag (genre, for example), since this also gives a hint about what the user is currently into. And if you want to provide some autoplay function, it would also help. After hearing a great Goa song, switching to Punk is strange, even if I like songs of both worlds.
Concerning your additional metrics: Shouldn't one combine metric #4 and metric #5? If a song is immediately skipped, then the proportion listened to is just 1% or so, right? – marco92w May 21 at 15:08
These should be separate. Skipping should result in negative rating for the song that was skipped. However, if the user closes the application when a song begins, you should not consider it as negative rating, even though only a low percentage of the song was played.
(ListenPartCount * (ListenFullCount ^ 2)) + (AverageTotalListenTime * ListenPartTimeAverage)
--------------------------------------------------------------------------------------------
((AverageTotalListenTime - ListenPartTimeAverage) + 0.0001f)
This formula will produce a nice result: a user could really like just part of a song, and this should show in the score; if the user likes the full song, full listens should weigh more (the formula squares their count).
You can tweak this formula in various ways, for example by including the user's listening history, such as when the user listens to one song and then plays another song several times.
Use the date the song was added to the library as a starting point.
Measure how often the song/genre/artist/album is played (fully, or in part or skipped) - this will also allow you to measure how often a song/genre/artist/album is not played.
Come up with a weighting based on these parameters: when a song, its genre, artist or album has not been played frequently, it should rank poorly. When an artist is played every day, songs from this artist should get a boost, but if one of the artist's songs is never played, that song should still rank pretty low.
Simply measure how often a song was played per time.
Often, I go to play a particular song, and then just let my iPod run until the end of an album. So this method would give an unfair advantage to songs late in an album. Something you might want to compensate for if your music player works the same way.
What about applying artificial intelligence to this problem?
Starting from scratch, it could be really fun to use a network of clients, each with its own "intelligence", and finally collect the client results in a central "intelligence".
Each client could produce its own "user ratings" based on the user's habits (as already said: average listening time, listening count, etc.).
Then a central "intelligent" collector could merge the individual ratings into "global ratings" showing trends, suggestions and any high-level rating you need.
Training such a "brain" means that you have to solve the problem analytically first, but it really could be fun to build such a cloud of small interconnected brains to produce higher-level "intelligence".
As I don't know your skills, I suggest taking a look at neural networks, genetic algorithms, fuzzy logic, pattern recognition and similar topics for a deeper understanding.
You can use some simple function like:
listened_time_of_song/(length_of_song + 15s)
or
listened_time_of_song/(length_of_song * 1.1)
This means that a song stopped after 15 seconds gets a low score. The second variant may be even better: when the user listens to the whole song, the length of the song has no effect on the final rating.
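To compare the two variants side by side, here is a small sketch with made-up function names. With the fixed 15-second offset, a fully played long song scores slightly higher than a fully played short one; with the 1.1 factor, every fully played song gets the same value.

```python
def rating_with_offset(listened_s, length_s, offset_s=15.0):
    """listened_time / (length + 15s)"""
    return listened_s / (length_s + offset_s)


def rating_with_factor(listened_s, length_s, factor=1.1):
    """listened_time / (length * 1.1)"""
    return listened_s / (length_s * factor)


for length in (180, 600):
    print(length,
          round(rating_with_offset(length, length), 3),   # depends on song length
          round(rating_with_factor(length, length), 3),   # same for every full listen
          round(rating_with_offset(15, length), 3))       # stopped after 15 seconds
```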
Another way may be to use neural networks, if you are familiar with the subject.

How to evaluate a search engine?

I am a student carrying out a study to enhance a search engine's existing algorithm.
I want to know how I can evaluate the search engine - which I have improved - to quantify how much the algorithm was improved.
How should I go about comparing the old and new algorithm?
Thanks
This is normally done by creating a test suite of questions and then evaluating how well the search response answers those questions. In some cases the responses should be unambiguous (if you type slashdot into a search engine you expect to get slashdot.org as your top hit), so you can think of these as a class of hard queries with 'correct' answers.
Most other queries are inherently subjective. To minimise bias you should ask multiple users to try your search engine and rate the results for comparison with the original. Here is an example of a computer science paper that does something similar:
http://www.cs.uic.edu/~liub/searchEval/SearchEngineEvaluation.htm
Regarding specific comparison of the algorithms, although obvious, what you measure depends on what you're interested in knowing. For example, you can compare efficiency in computation, memory usage, crawling overhead or time to return results. If you are trying to produce very specific behaviour, such as running specialist searches (e.g. a literature search) for certain parameters, then you need to explicitly test this.
Heuristics for relevance are also a useful check. For example, when someone uses search terms that are probably 'programming-related', do you tend to get more results from stackoverflow.com? Would your search results be better if you did? If you are providing a set of trust weightings for specific sites or domains (e.g. rating .edu or .ac.uk domains as more trustworthy for technical results), then you need to test the effectiveness of these weightings.
First, let me start out by saying, kudos to you for attempting to apply traditional research methods to search engine results. Many SEOs have done this before you, and they generally keep it to themselves, as sharing "amazing findings" usually means you can no longer exploit them or keep the upper hand. That said, I will share as best I can some pointers and things to look for.
Identify which part of the algorithm you are trying to improve.
Different searches execute different algorithms.
Broad Searches
For instance, in a broad term search, engines tend to return a variety of results. Common parts of these results include:
News Feeds
Products
Images
Blog Posts
Local Results (this is based off of a Geo IP lookup).
Which of these result types are thrown into the mix can vary based on the word.
Example: Cats returns images of cats, and news, Shoes returns local shopping for shoes. (this is based on my IP in Chicago on October 6th)
The goal in returning results for a broad term is to provide a little bit of everything for everyone so that everyone is happy.
Regional Modifiers
Generally any time a regional term is attached to a search, it will modify the results greatly. If you search for "Chicago web design", because the word Chicago is attached, the results will start with the top 10 regional results (these are the one-liners to the right of the map); after that, 10 listings will display in general "result fashion".
The results in the "top ten local" tend to be drastically different from those in the organic listings below. This is because the local results (from Google Maps) rely on entirely different data for ranking.
Example: Having a phone number on your website with the area code of Chicago will help in local results... but NOT in the general results. Same with address, yellow book listing and so forth.
Results Speed
Currently (as of 10/06/09) Google is beta testing "Caffeine". The main highlight of this engine build is that it returns results in almost half the time. Although you may not consider Google to be slow now... speeding up an algorithm is important when millions of searches happen every hour.
Reducing Spam Listings
We have all experienced a search that was riddled with spam. The new release of Google Caffeine http://www2.sandbox.google.com/ is a good example. Over the last 10+ years, one of the largest battles online has been between Search Engine Optimizers and Search Engines. Gaming Google (and other engines) is highly profitable, and it is what Google spends most of its time combating.
A good example is again the new release of Google Caffeine. So far my research and also a few others in the SEO field are finding this to be the first build in over 5 years to put more weight on Onsite elements (such as keywords, internal site linking, etc) than prior builds. Before this, each "release" seemed to favor inbound links more and more... this is the first to take a step back towards "content".
Ways to test an algorithm.
Compare two builds of the same engine. This is currently possible by comparing Caffeine (see link above or google, google caffeine) and the current Google.
Compare local results in different regions. Try finding search terms like web design that return local results without a local keyword modifier. Then use a proxy (found via Google) to search from various locations. Make sure you know the proxy's location (find a site on Google that will tell you the geo-IP zip code or city for your IP address). Then you can see how different regions return different results.
Warning... DON'T pick the term locksmith, and be wary of any terms whose results have LOTS of spammy listings. Google local is fairly easy to spam, especially in competitive markets.
As mentioned in a prior answer, compare how many "click backs" users require to find a result. You should know that currently no major engine uses "bounce rates" as an indicator of a site's accuracy. This is PROBABLY because it would be EASY to make it look like your result has a bounce rate in the 4-8% range without actually having one that low; in other words, it would be easy to game.
Track how many search variations users use on average for a given term in order to find the result that is desired. This is a good indicator of how well an engine is smart guessing the query type (as mentioned WAY up in this answer).
**Disclaimer. These views are based on my industry experience as of October 6th, 2009. One thing about SEO and engines is they change EVERY DAY. Google could release Caffeine tomorrow, and this would change a lot... that said, this is the fun of SEO research!
Cheers
In order to evaluate something, you have to define what you expect from it. This will help to define how to measure it.
Then, you'll be able to measure the improvement.
Concerning a search engine, I guess that you might be able to measure its ability to find things and its accuracy in returning what is relevant.
It's an interesting challenge.
I don't think you will find a final mathematical solution if that is your goal. In order to rate a given algorithm, you require standards and goals that must be accomplished.
What is your baseline to compare against?
What do you classify as "improved"?
What do you consider a "successful search"?
How large is your test group?
What are your tests?
For example, if your goal is to improve the process of page ranking then decide if you are judging the efficiency of the algorithm or the accuracy. Judging efficiency means that you time your code for a consistent large data set and record results. You would then work with your algorithm to improve the time.
If your goal is to improve accuracy then you need to define what is "inaccurate". If you search for "Cup" you can only say that the first site provided is the "best" if you yourself can accurately define what is the best answer for "Cup".
My suggestion for you would be to narrow the scope of your experiment. Define one or two qualities of a search engine that you feel need refinement and work towards improving them.
In the comments you've said "I have heard about a way to measure the quality of search engines by counting how many times a user needs to click the back button before finding the link he wants, but I can't use this technique because you need users to test your search engine and that is a headache in itself". Well, if you put your engine on the web for free for a few days and advertise a little, you will probably get at least a couple dozen tries. Provide these users with the old or new version at random, and measure those clicks.
Another possibility: assume Google is by definition perfect, and compare your answers to its answers for certain queries. (Maybe the sum of the distances of your top ten links to their counterparts at Google; for example, if your second link is Google's twelfth link, that's a distance of 10.) That's a huge assumption, but far easier to implement.
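The distance idea can be sketched in a few lines. How to treat a link that the reference engine does not return at all is not specified above, so the `missing_penalty` parameter (defaulting to the length of the reference list) is an assumption of this sketch.

```python
def rank_distance(mine, reference, top_n=10, missing_penalty=None):
    """Sum of |my rank - reference rank| over my top_n links (ranks start at 1)."""
    if missing_penalty is None:
        missing_penalty = len(reference)
    ref_rank = {url: i for i, url in enumerate(reference, start=1)}
    total = 0
    for my_rank, url in enumerate(mine[:top_n], start=1):
        if url in ref_rank:
            total += abs(my_rank - ref_rank[url])
        else:
            total += missing_penalty  # link absent from the reference list
    return total


reference = ["u%d" % i for i in range(1, 21)]
print(rank_distance(["u1", "u12"], reference))  # second link is the reference's twelfth
```

A lower total means the two rankings agree more closely.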
Information scientists commonly use precision and recall as two competing measures of quality for an information retrieval system (like a search engine).
So you could measure your search engine's performance relative to Google's by, for example, taking the fraction of the top 10 results that are relevant (call that precision) and the fraction of the important pages for that query, the ones you think should be in the top 10, that actually appear there (call that recall).
You'll still need to compare the results from each search engine by hand on some set of queries, but at least you'll have one metric to evaluate them on. And the balance of these two is important too: otherwise you can trivially get perfect precision by not returning any results or perfect recall by returning every page on the web as a result.
The Wikipedia article on precision and recall is quite good (and defines the F-measure which takes into account both).
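For a single query, the three numbers can be computed from the returned results and a hand-made list of relevant pages. This is a minimal sketch; the balanced F1 form of the F-measure is used, and reporting 0.0 when a value is undefined is a choice made for this sketch.

```python
def precision_recall_f1(returned, relevant):
    """Precision, recall and balanced F-measure for one query."""
    returned_set, relevant_set = set(returned), set(relevant)
    hits = len(returned_set & relevant_set)
    # Precision is undefined for an empty result list; this sketch reports 0.0.
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"]))
```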
I have had to test a search engine professionally. This is what I did.
The search included fuzzy logic. The user would type into a web page "Kari Trigger", and the search engine would retrieve entries like "Gary Trager", "Trager, C", "Corey Trager", etc, each with a score from 0->100 so that I could rank them from most likely to least likely.
First, I re-architected the code so that it could be executed removed from the web page, in a batch mode using a big file of search queries as input. For each line in the input file, the batch mode would write out the top search result and its score. I harvested thousands of actual search queries from our production system and ran them thru the batch setup in order to establish a baseline.
From then on, each time I modified the search logic, I would run the batch again and then diff the new results against the baseline. I also wrote tools to make it easier to see the interesting parts of the diff. For example, I didn't really care if the old logic returned "Corey Trager" as an 82 and the new logic returned it as an 83, so my tools would filter those out.
I could not have accomplished as much by hand-crafting test cases. I just wouldn't have had the imagination and insight to have created good test data. The real world data was so much richer.
So, to recap:
1) Create a mechanism that lets you diff the results of running new logic versus the results of prior logic.
2) Test with lots of realistic data.
3) Create tools that help you work with the diff, filtering out the noise, enhancing the signal.
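The diff-with-noise-filter step could look something like this sketch. It assumes each batch run is stored as a mapping from query to `(top_result, score)`; that storage format and the tolerance of 2 points are assumptions, not a description of the original tools.

```python
def diff_results(baseline, current, score_tolerance=2):
    """Return (query, old, new) for queries whose top result changed,
    appeared or disappeared, or whose score moved by more than the tolerance.

    baseline/current: {query: (top_result, score)} (hypothetical format).
    """
    changes = []
    for query in sorted(baseline.keys() | current.keys()):
        old = baseline.get(query)
        new = current.get(query)
        if old is None or new is None:
            changes.append((query, old, new))
        elif old[0] != new[0] or abs(old[1] - new[1]) > score_tolerance:
            changes.append((query, old, new))
    return changes


baseline = {"Kari Trigger": ("Corey Trager", 82), "Gary T": ("Gary Trager", 90)}
current = {"Kari Trigger": ("Corey Trager", 83), "Gary T": ("Trager, C", 70)}
for change in diff_results(baseline, current):
    print(change)  # the 82 -> 83 change is filtered out as noise
```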
You have to clearly identify positive and negative qualities such as how fast one gets the answer they are seeking or how many "wrong" answers they get on the way there. Is it an improvement if the right answer is #5 but the results are returned 20 times faster? Things like that will be different for each application. The correct answer may be more important in a corporate knowledge base search but a fast answer may be needed for a phone support application.
Without parameters no test can be claimed to be a victory.
Embrace the fact that the quality of search results is ultimately subjective. You should have multiple scoring algorithms for your comparison: The old one, the new one, and a few control groups (e.g. scoring by URI length or page size or some similarly intentionally broken concept). Now pick a bunch of queries that exercise your algorithms, say a hundred or so. Let's say you end up with 4 algorithms total. Make a 4x5 table, displaying the first 5 results of a query across each algorithm. (You could do top ten, but the first five are way more important.) Be sure to randomize which algorithm appears in each column. Then plop a human in front of this thing and have them pick which of the 4 result sets they like best. Repeat across your entire query set. Repeat for as many more humans as you can stand. This should give you a fair comparison based on total wins for each algorithm.
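The bookkeeping for such a blind comparison might be sketched like this. `run_query` and `choose` are hypothetical callables: the first returns the results of every algorithm for a query, the second stands in for the human judge, who only sees unlabeled columns and returns the index of the preferred one.

```python
import random
from collections import Counter


def tally_wins(queries, run_query, choose, top_n=5, seed=None):
    """Count how often each algorithm's column is picked by the judge.

    run_query(query) -> {algorithm_name: [results]}       (hypothetical)
    choose(columns)  -> index of the preferred column     (hypothetical judge)
    """
    rng = random.Random(seed)
    wins = Counter()
    for query in queries:
        columns = list(run_query(query).items())
        rng.shuffle(columns)  # hide which algorithm is in which column
        shown = [results[:top_n] for _, results in columns]
        picked = choose(shown)
        wins[columns[picked][0]] += 1
    return wins


def demo_run(query):
    return {"old": ["meh"], "new": ["good"], "control": ["zzz"]}


def demo_judge(shown):
    return next(i for i, col in enumerate(shown) if "good" in col)


print(tally_wins(["q1", "q2", "q3"], demo_run, demo_judge, seed=0))
```

Running the same tally once per human judge and summing the counters gives the total wins per algorithm.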
http://www.bingandgoogle.com/
Create an app like this that compares and extracts the data. Then run a test with 50 different things you need to look for and then compare with the results you want.