algorithms to select price ranges - e-commerce

What is the best way to represent a series of item price ranges so as to reduce noise for the end user?
E-commerce sites typically display a histogram of price ranges alongside a list of items. Are there standard algorithms that these sites use for this display?

Well it seems to me that you would first and foremost need a way to aggregate this data. That having been said, if you have that data and need to create a histogram it can be fairly simple in the programming language R (here is some documentation: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html ). There is also an R extension I've read about that allows you to post/run R code in wiki-like pages ( http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/R_extension_-_Mediawiki ).
If you already have this data (in this case prices), I don't think you need an algorithm so much as a way to display it as a graph. I think R should be useful. I hope this helps!
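If you do want the price ranges computed rather than only plotted, one simple option is equal-frequency (quantile) buckets, so that each range holds roughly the same number of items. The sketch below is only an illustration of that idea, not a description of what any particular site does; the function name `price_buckets` and the default of four buckets are assumptions.

```python
def price_buckets(prices, n_buckets=4):
    """Split prices into up to n_buckets equal-frequency ranges.

    Returns a list of (low, high, count) tuples. Every range is
    half-open [low, high) except the last one, which includes high.
    Hypothetical helper, not taken from any library.
    """
    if not prices:
        return []
    ordered = sorted(prices)
    n = len(ordered)
    lowest, highest = ordered[0], ordered[-1]
    # Candidate cut points at evenly spaced ranks. The set drops the
    # duplicate edges that appear when many items share one price.
    cuts = sorted({ordered[(i * n) // n_buckets] for i in range(1, n_buckets)})
    edges = [lowest] + [c for c in cuts if lowest < c < highest] + [highest]
    buckets = []
    for i in range(len(edges) - 1):
        low, high = edges[i], edges[i + 1]
        is_last = i == len(edges) - 2
        count = sum(
            1 for p in ordered
            if low <= p < high or (is_last and p == high)
        )
        buckets.append((low, high, count))
    return buckets


if __name__ == "__main__":
    sample = [5, 7, 9, 10, 12, 15, 20, 25, 40, 60, 120, 500]
    for low, high, count in price_buckets(sample):
        print(f"{low} to {high}: {count} items")
```

Equal-frequency buckets cope better with a few very expensive outliers than equal-width buckets do, which is usually what reduces the noise for the user.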

Related

MS SSAS - Need to return a measure in a calculated member based on a tuple set and a max of underlying ID

I require some more advanced MDX knowledge than mine.
I need to get the RepoRate_MAX for repo products at book and instrument level. However, the Java code I'm replacing always uses the max MurexId, so I need to do the same.
How can I perform the below (I've placed MAX in here on the dimension but this is wrong) and I need the combo of the dimensions and also the MAX MurexId:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],MAX([Deal].[MurexId])),[Measures].[RepoRate_MAX])
I'm sure it's a simple one but my mind is part way between the Java OO and MDX worlds currently haha :D
Thanks
Leigh
So after some experimenting I found out about the TAIL and Item MDX functions.
I think at one point I did get it working, but didn't make a note of what worked. I was playing around with this and variants of it, but most versions ended up with unusable query times:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],TAIL(EXISTING([Deal].[MurexId].[MurexId])).Item(0)),[Measures].[RepoRate_MAX])
So I then decided to push the RepoRate calculation back to the SQL data preparation script. Cleaner/smoother data is always better, and it keeps the calculated members simple.
I used SQL to determine the RepoRate at trade level with MAX(MurexId) and GROUP BY on Book, Instrument, and then updated my main fact table to ensure that the correct RepoRate was set at Book, Instrument level.
Thus the calculated member is then:
[Measures].[RepoRate_VAL] = (([Deal].[Book],[Deal].[Instrument]),[Measures].[RepoRate_MAX])
Fast data prep and a fast calculated member on the Excel/Pivot/UI layer.
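For readers who want to see the data-preparation rule spelled out, here is a minimal sketch of it in Python rather than SQL: for each (Book, Instrument) pair, keep the RepoRate of the trade with the highest MurexId. The dictionary keys (`book`, `instrument`, `murex_id`, `repo_rate`) are assumed names for illustration, not the real table columns.

```python
def repo_rate_by_book_instrument(trades):
    """Return {(book, instrument): repo_rate} using the max-MurexId trade.

    trades is an iterable of dicts; the key names are assumptions made
    for this sketch and do not come from the original schema.
    """
    latest = {}
    for trade in trades:
        key = (trade["book"], trade["instrument"])
        # Keep only the trade with the highest MurexId per group.
        if key not in latest or trade["murex_id"] > latest[key]["murex_id"]:
            latest[key] = trade
    return {key: trade["repo_rate"] for key, trade in latest.items()}


if __name__ == "__main__":
    rows = [
        {"book": "B1", "instrument": "I1", "murex_id": 10, "repo_rate": 0.50},
        {"book": "B1", "instrument": "I1", "murex_id": 12, "repo_rate": 0.75},
        {"book": "B2", "instrument": "I1", "murex_id": 11, "repo_rate": 0.25},
    ]
    print(repo_rate_by_book_instrument(rows))
```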

Infinite scroll algorithm for random items with different weight ( probability to show to the user )

I have a web / mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically in a dynamic way). Each item has a weight: the bigger an item's weight is in comparison to the weights of the other items, the higher the probability that the item is loaded and displayed in the list for the users. The items should be loaded randomly; only the chances of the items being in the list should differ.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth mentioning:
the weight has these boundaries: 0 <= w < infinity.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when a user scrolls and performs multiple requests to the API, he/she should not see duplicate items, or at least the chance should be low.
I use a SQL Database (PostgreSQL) for storing items so the solution should be efficient for this type of database. (It shouldn't be a purely SQL solution)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight greater than 0 (itself its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
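As a rough illustration of the log(R) / W key described above, the following sketch computes the keys in application code and takes the highest ones. It is not a PostgreSQL solution; the function names `sampling_key` and `weighted_page` and the in-memory `items` mapping are assumptions made for the example.

```python
import math
import random


def sampling_key(weight, rng=random):
    """Return log(R) / W for one record, with R uniform in (0, 1)."""
    if weight <= 0:
        raise ValueError("weight must be greater than 0")
    r = rng.random()
    while r == 0.0:  # random() can return 0.0, and log(0) is undefined
        r = rng.random()
    return math.log(r) / weight


def weighted_page(items, k, rng=random):
    """Pick up to k distinct item ids; items maps id -> weight.

    Items with weight 0 are skipped, since the key is only defined
    for weights greater than 0.
    """
    keyed = [
        (sampling_key(weight, rng), item_id)
        for item_id, weight in items.items()
        if weight > 0
    ]
    keyed.sort(reverse=True)  # highest keys first
    return [item_id for _, item_id in keyed[:k]]


if __name__ == "__main__":
    catalogue = {"a": 1.0, "b": 5.0, "c": 0.0, "d": 0.5}
    print(weighted_page(catalogue, 2))
```

Because log(R) is negative, dividing by a larger weight moves the key closer to zero, so heavier items tend to sort first while every item with a positive weight still has some chance. Taking the next k keys for the next page avoids duplicates as long as the stored keys do not change between requests.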
Peter O had a good idea, but it has some issues. I would expand on it to allow better per-user shuffling, at a higher database space cost:
1. Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value. I would say roughly log(U) + log(P) fields, where U is the number of users and P is the number of items, with a minimum of probably 5. Add an index over all the fields within the JSONB. Add more fields as the number of users/items gets high enough.
2. Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you only rotate a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low, especially if you query for data backwards and forwards and stitch/dedup the data together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it's an issue.
3. Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in log(P) + log(U) chance of displaying the same data to the user. Ideally each user would get a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and be practical. A random shuffle of the index and sorting by that might be practical, though, if the randomized weights are normalized such that the sort order matters.

What is the best way to organise a complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
For a more concrete example, below is a typical query, which results from a customer setting some of the filter options and moving on to the second page. It's written in pseudo-SQL, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above, I calculate the intersection of the sets of document IDs and apply the sorting using the result of the timestamp query. Then finally I can apply the slice, resolving the document IDs of rows 15-30, and download their content using a bulk get operation.
Needless to say, it's not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part where I calculate the intersection of the sets can take around 4 seconds, so obviously I need to optimize it further. I'm afraid to think how slow it's going to get in a few months when my dataset doubles, triples, etc.
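To make the intersect, sort and slice steps concrete, here is a minimal client-side sketch of them. It assumes the view results have already been fetched; the function name `page_of_ids` and the shapes of `timestamp_rows` and `other_id_sets` are assumptions made for illustration.

```python
def page_of_ids(timestamp_rows, other_id_sets, skip, limit):
    """Return the document ids for one page, newest first.

    timestamp_rows: list of (timestamp, doc_id) pairs from the
        timestamp range query.
    other_id_sets: list of sets of doc ids, one per remaining filter.
    Both shapes are assumptions made for this sketch.
    """
    allowed = None
    # Intersect the smallest sets first so the working set shrinks early.
    for ids in sorted(other_id_sets, key=len):
        allowed = set(ids) if allowed is None else allowed & ids
    matching = [
        (timestamp, doc_id)
        for timestamp, doc_id in timestamp_rows
        if allowed is None or doc_id in allowed
    ]
    matching.sort(reverse=True)  # SORT BY timestamp DESC
    return [doc_id for _, doc_id in matching[skip:skip + limit]]


if __name__ == "__main__":
    rows = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]
    filters = [{"a", "b", "d"}, {"b", "c", "d"}]
    print(page_of_ids(rows, filters, skip=0, limit=15))
```

The returned ids are what would then be passed to the bulk get. Note that this still touches every matching id before slicing, which is exactly the scalability concern.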
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without losing the flexibility of the tool?
Is the view structure I've used optimal? At some point I was considering using a separate map() function to generate the values of each field. This would result in smaller b-trees but more work for the view server to generate the index. Can I benefit this way?
The part of the algorithm where I have to calculate intersections of big sets just to later get a slice of the result bothers me. It's not a scalable approach. Does anyone know a better algorithm for this?
With a map function like this:
function(doc){
  if(doc.marked_as_test) return;
  emit([doc.dataspace, doc.timestamp, doc.fico], null);
}
You can make a similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the mismatch between CouchDB and the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.

selecting "similar" groups - where to start with probabilities?

Let's say I have a table with 10,000 lines (representing 10,000 persons) and the following columns:
id qualification gender age income
When I select all persons having a certain qualification (say "plumber") I get 100 lines, having a certain gender, age and income distribution.
What I now want to do is select some kind of test group to check if the income is influenced by qualification or by the distribution of the other attributes.
That means (and now I come to my question) I want to get another set of 100 lines, having the same gender and age distribution (but a different qualification value). These 100 lines should of course be chosen at random.
My primary problem is that I don't know how to write an SQL command that would take care of the distributions (which of course could and maybe should be seen as probabilities in this context) when I select random lines.
Thank you in advance!
You seem to be trying to solve something that is tightly related to this extremely thorny problem.
The wiki page describes a number of approaches for detecting correlations in a database, complete with references to prior pg-hacker discussions (here's another), a variety of (rejected) patch proposals, and scientific papers that discuss the topic.
If it sounds too thorny, I'd second Catcall's pl/r suggestion. Or another applicable pl, for that matter.
As an aside, you might find pg-kmeans useful too:
http://pgxn.org/dist/kmeans/doc/kmeans.html
As well as PostStat (never tried it myself):
http://poststat.projects.postgresql.org/
Might be better on stats.stackexchange.com.
Selecting random rows is easy; matching the distribution is hard.
You could write a stored procedure that
repeatedly selects 100 rows at random,
calculates the statistics,
and returns when it finds 100 rows that fit.
But that seems a lot like kicking dead whales down the beach. And, depending on your data, it might never return.
Before you spend much time trying to do this in SQL, consider spending a little time to see how hard (or how easy) this is to do with statistical software, like R.
Later
Just discovered that there's a package called pl/R.
PL/R is a loadable procedural language that enables you to write PostgreSQL functions and triggers in the R programming language. PL/R offers most (if not all) of the capabilities a function writer has in the R language.
Google postgresql +statistics +r +pl for additional links to papers and tutorials.
SELECT * FROM Table1 ORDER BY random() LIMIT 100;
random() is valid for PostgreSQL. For MySQL, use RAND() instead of random().
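One way to make a random selection respect the gender and age distribution is stratified sampling: count how many people the reference group has in each (gender, age band) cell, then draw the same number at random from everyone else in that cell. The sketch below does this in Python rather than SQL; it is only one possible approach, and the record keys, the 10-year age bands and the function names are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratum(person, age_band=10):
    """Cell used for matching: gender plus an age band (assumed 10 years)."""
    return (person["gender"], person["age"] // age_band)


def matched_sample(people, qualification, rng=random, age_band=10):
    """Draw a comparison group with the same gender/age-band counts.

    people is a list of dicts with the keys qualification, gender and
    age (assumed names). If a cell has too few candidates, all of them
    are taken and the sample ends up smaller than the reference group.
    """
    reference = [p for p in people if p["qualification"] == qualification]
    needed = Counter(stratum(p, age_band) for p in reference)
    pool = defaultdict(list)
    for p in people:
        if p["qualification"] != qualification:
            pool[stratum(p, age_band)].append(p)
    sample = []
    for cell, count in needed.items():
        candidates = pool.get(cell, [])
        sample.extend(rng.sample(candidates, min(count, len(candidates))))
    return sample


if __name__ == "__main__":
    population = (
        [{"qualification": "plumber", "gender": "m", "age": 35}] * 2
        + [{"qualification": "teacher", "gender": "m", "age": 31 + i} for i in range(5)]
    )
    print(matched_sample(population, "plumber"))
```

Income is deliberately left out of the matching, since it is the variable being compared between the two groups.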

What formula is used for building a list of related items in a tag-based system?

There are a lot of sites out there that use 'tags' to categorize items in their system. For example, YouTube uses keywords to categorize videos, Stack Overflow uses tags to categorize questions, etc.
What formulas do these sites use (especially SO) to build a list of items related to another item based on the tags it has? I'm building a system much like the one on SO and I'd like to find a way to generate a list of 20 items or so based on the tags of one item, but also make it spread enough so that each photo generates a vastly different list, and so that clicking an item in any given related list could eventually lead you to almost every item in the database.
The technical term for an organization based on user tags is a folksonomy. A google search for that term brings up a huge amount of material on how these systems are put together. A good place to start is the Wikipedia article.
I had to solve this exact problem for a contract a few years back, and the company was nice enough to let me blog about how I did it at http://bentilly.blogspot.com/2011/02/finding-related-items.html.
You'll note that if you get a decent volume of data then you'll really, really want to do this out of the database.
Similarity between items is often represented as dot products between the vectors representing the items. So if you have a tag based system, each tag will define one dimension. The vector then for an item becomes 1 in dimension i if tag i is set for this item (or higher numbers if you allow multiple tagging). If you calculate the dot product of the vectors of two items you will get the similarity for those items (N.b. the vectors have to be normalized so that the absolute value is 1).
Note that the dimensionality will get very large (several tens of thousands of tags are common). This sounds like a show stopper for this kind of thing. But you will also note that the vectors are really sparse, and multiple dot products become one big matrix multiplication of a sparse matrix with its own transpose. Using efficient algorithms for sparse matrix multiplication, this can be done relatively fast.
Also note that most systems rely not only on tags but also on "user behavior" (whatever that means). For example, for YouTube, user behavior would be "watching a video", "subscribing to a channel", "looking for similar videos as video X" or "tagging video x with tag y".
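For binary tag vectors, the normalized dot product described above reduces to the number of shared tags divided by the square root of the product of the two tag counts. The following sketch uses Python sets as the sparse representation; the function names and the `tags_by_item` mapping are assumptions made for illustration, and a real system with many items would use a sparse matrix library instead of this pairwise loop.

```python
import math


def tag_similarity(tags_a, tags_b):
    """Cosine similarity of two binary tag vectors stored as sets."""
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    # Dot product = shared tags; each vector's length = sqrt(tag count).
    return len(a & b) / math.sqrt(len(a) * len(b))


def related_items(item_id, tags_by_item, limit=20):
    """Return up to limit other item ids, most similar first.

    tags_by_item maps item id -> set of tags (an assumed structure).
    Items sharing no tag with item_id are left out.
    """
    base = tags_by_item[item_id]
    scored = []
    for other_id, tags in tags_by_item.items():
        if other_id == item_id:
            continue
        score = tag_similarity(base, tags)
        if score > 0:
            scored.append((score, other_id))
    # Highest score first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]


if __name__ == "__main__":
    catalogue = {
        "q1": {"sql", "postgres"},
        "q2": {"sql", "postgres", "random"},
        "q3": {"sql"},
        "q4": {"python"},
    }
    print(related_items("q1", catalogue))
```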
I ended up using the following code (with different names), which finds all other items with at least one tag in common, and orders the results by number of common tags, descending, and subsorts by other criteria specific to my problem:
SELECT PT.WidgetID, COUNT(*) AS CommonTags, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
FROM WidgetTags PT
INNER JOIN WidgetStatistics PS ON PT.WidgetID = PS.WidgetID
WHERE PT.TagID IN (SELECT PTInner.TagID FROM WidgetTags PTInner WHERE PTInner.WidgetID = #WidgetID)
AND PT.WidgetID != #WidgetID
GROUP BY PT.WidgetID, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
ORDER BY CommonTags DESC, PS.OtherOrderingCriteria1 DESC, PS.OtherOrderingCriteria2 DESC, PS.OtherOrderingCriteria3 DESC, PS.Date DESC, PT.WidgetID DESC