Dirichlet process group selection - process

I understand the Dirichlet process group selection as explained here: How to decide group assignments in Dirichlet process clustering
But I don't understand why the DP group selection algorithm doesn't use a new item's features to determine the best group membership. How will a DP find the distinct groups if it doesn't use the members' features to guide group membership?

The "generative model" is not a program to label new items.
It's a hypothetical program to generate 'fake' data. If you are generating data, you have to first choose the group, then generate the attributes. There are no "existing" features that you could use.
To label observed data, you have to infer the parameters that are most likely to have generated this new data if it had been randomly generated.

After putting the items into random clusters initially, the training phase moves items one at a time to the cluster that they are closest to or puts them in a new cluster if there isn't a close fit to an existing cluster. The training phase is run until convergence (there are no movements of items to different clusters).
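The difference between the two directions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm from the linked question: features are one-dimensional, each group is a Gaussian with a fixed mean, and the constants ALPHA, SIGMA and PRIOR_SIGMA as well as all function names are made up for the example. The generative step picks a group from the counts alone; the labelling step multiplies that same prior weight by the likelihood of the observed feature.

```python
import math
import random

ALPHA = 1.0         # concentration parameter (assumed value)
SIGMA = 1.0         # within-group standard deviation (assumed value)
PRIOR_SIGMA = 10.0  # standard deviation of the prior over group means (assumed value)


def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def crp_prior_weights(counts, alpha=ALPHA):
    """Unnormalised prior weights: one per existing group, plus one for a new group."""
    return [float(n) for n in counts] + [alpha]


def generate(n_items, rng):
    """Generative direction: choose the group first, then generate the feature."""
    means, counts, data = [], [], []
    for _ in range(n_items):
        weights = crp_prior_weights(counts)
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(means):  # the "new group" option was drawn
            means.append(rng.gauss(0.0, PRIOR_SIGMA))
            counts.append(0)
        counts[k] += 1
        data.append((k, rng.gauss(means[k], SIGMA)))
    return data


def assignment_weights(x, means, counts):
    """Labelling direction: prior weight times likelihood of the observed feature."""
    prior = crp_prior_weights(counts)
    like = [normal_pdf(x, m, SIGMA) for m in means]
    # For a new group the mean is unknown, so x is scored against the
    # marginal N(0, SIGMA^2 + PRIOR_SIGMA^2).
    like.append(normal_pdf(x, 0.0, math.sqrt(SIGMA ** 2 + PRIOR_SIGMA ** 2)))
    return [p * l for p, l in zip(prior, like)]


def best_group(x, means, counts):
    """Index of the highest-weight group; len(means) means 'open a new group'."""
    w = assignment_weights(x, means, counts)
    return max(range(len(w)), key=w.__getitem__)
```

With two existing groups at means 0.0 and 5.0, best_group(4.8, [0.0, 5.0], [10, 10]) picks the second group, while a far-away value such as 30.0 is sent to a new group. A Gibbs sampler would draw the group at random in proportion to these weights instead of always taking the largest one.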


How to see absolute values instead of percents on Events graphs?

I use custom events to track how often some deprecated modules are still used by users. I want to remove the migrations from a deprecated module to its replacement once the number of usages drops below a "waterline".
So it is not convenient enough to track this by clicking a date on the graph and checking the number of events for that date. Can I somehow switch the graph to show absolute values?
Mike from Fabric here. For the graphs, we will either show the percentage if the custom attribute is a string or the 25th, median and 75th percentiles if the custom attribute is a number. However, the top 10 custom attribute count will be present below the graph.

Optaplanner, How to make the lecture start time to be chosen within time window according to the constraints in DRL and not an input specific time?

In the curriculum course example, I want each lecture to start within a time window, with the constraints between two consecutive lectures set according to lecture duration. Defining the constraints in the DRL file is not the problem: I have already tried it with a fixed lecture start time. What I want is for each lecture's start time to be chosen from a period (time window) while satisfying the constraints.
I want earliestStartTime and latestStartTime to be the inputs, and the start time used for each lecture to be chosen between these two boundaries.
That is, lecture time x belongs to [earliestStartTime, latestStartTime], and x should be chosen by the program within this range while, of course, satisfying the defined constraints.
What I have managed so far is to make x an input and obtain a schedule that orders the lectures according to the defined constraints. Now I want x to be a variable determined by the program, with earliestStartTime and latestStartTime as the inputs.

Managing PerformancePoint Filters With Slowly Changing Dimensions

Just a bit of background info:
I have a dimension table which uses SCD2 to track user changes in our company (team changes, job title changes etc.). See example below:
I've built an Analysis Services Cube and created all the necessary hierarchies for the dimensions, and it works well when navigating and drilling down through the fact table.
The problem I have is with the filters on the PerformancePoint dashboard. As I'm using the User Dimension table with its multiple instances of users, it's showing duplicates in the list. I can understand why, as the surrogate ID is being referenced on the Dimension. But if I choose the first instance of the A-team I will see all their sales for a particular period, and if I choose the second instance I will see all their sales for a different period.
What is the best way to handle this type of behavior? Ideally I'd like to see a distinct list of teams in alphabetical order and when I choose the team name it shows all of their data over time.
I've considered using MDX query filters but I'd like to see if there's anything I haven't thought about.
I realise this isn't an easy and quick question but any help would be appreciated!
The answer was simple after having a trawl through my User Dimension table on the Cube.
Under my user dimension I added 2 duplicate attributes to my attributes list ("Team Filter" is a copy of "Team", "User Filter" a copy of "User Name"). These will be used only for filtering the dashboard.
Under the attribute properties for each duplicate I then set AttributeHierarchyOptimizedState to "Not Optimized", I also set their AttributeHierarchyVisible to false as I'd shown the two duplicate attributes in the hierarchy window in the middle.
Deploy your Cube to the server and go in to PerformancePoint. Create a new MDX Filter (this image shows the finished filter)
This is the code I used, it only shows dimension members which have a fact against them (reduces the list a considerable amount) and by using allmembers at the dimension it also gives me the option to show "All" at the top of the list.
Deploy the new filters and you can now see the distinct list of users and teams. It works perfectly and selects every instance (regardless of the SCD2 row).

Qlikview line chart with multiple expressions over time period dimension

I am new to Qlikview and after several failed attempts I have to ask for some guidance regarding charts in Qlikview. I want to create a line chart which will have:
One dimension – a time period of one month, broken down by day
One expression – Number of created tasks per day
Second expression – Number of closed tasks per day
Third expression – Number of open tasks per day
This is a very basic example and I couldn’t find a solution for it. To be honest, I think I don’t understand how I should set up my time period dimension and expressions. Each time I try to introduce more than one expression, things go south. Maybe it’s because I have multiple dates, or my dimension is wrong.
Here is my simple data:
http://pastebin.com/Lv0CFQPm
I have been reading about helper tables like Master Calendar or “Date Island” but I couldn’t grasp them. I have tried to follow the guide here: https://community.qlik.com/docs/DOC-8642 but that only worked for one date (for me at least).
How should I set up the dimension and expressions on my chart, so I can count the ID field if Created Date matches one from the dimension and Status is appropriate?
I have the personal edition, so I am unable to open qvw files from other authors.
Thank you in advance, kind regards!
My solution to this would be to change from a single line per Call with associated dates to a concatenated list of Call Events with a single date each. i.e. each Call will have a creation event and a resolution event. This is how I achieve that. (I turned your data into a spreadsheet but the concept is the same for any data source.)
// Creation events: one row per call, tagged with a dummy 'New' status.
Calls:
LOAD Type,
     Id,
     Priority,
     'New' as Status,
     date(floor(Created)) as [Date],
     time(Created) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Created>0;

// Resolution events: same field names, so this load auto-concatenates onto Calls.
LOAD Type,
     Id,
     Priority,
     Status,
     date(floor(Resolved)) as [Date],
     time(Resolved) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Resolved>0;
There are three key concepts here. The first is allowing QlikView's auto-concatenate to do its job by making the field names of both load statements exactly the same, including capitalisation. The second is splitting the timestamp into a Date and a Time. This allows you to have a dimension of Date only and group the events for the day. (In big data sets the resource saving is also significant.) The third is creating the dummy 'New' status for each call on its creation date.
With just this data and these expressions
Created = count(if(Status='New',Id))
Resolved = count(if(Status='Resolved',Id))
and then
Open = Created-Resolved
with full accumulation ticked for Open (to give you a running total rather than a daily total, which might go negative and look odd), you could draw this graph.
For extra completeness you could add this to the load script to fill in the missing dates and create the Master Calendar you spoke of. There are many other ways of achieving this.
MINMAX:
load floor(num(min([Date]))) as MINTRANS,
     floor(num(max([Date]))) as MAXTRANS
Resident Calls;
let zDateMin=FieldValue('MINTRANS',1);
let zDateMax=FieldValue('MAXTRANS',1);
//complete calendar
Dates:
LOAD
     Date($(zDateMin) + IterNo() - 1, '$(DateFormat)') as [Date]
AUTOGENERATE 1
WHILE $(zDateMin)+IterNo()-1<= $(zDateMax);
Then you could draw this chart. Don't forget to turn Suppress Zero Values on the Presentation tab off.
But my suggestion would be to use a combo chart rather than a line chart, so that the calls per day are shown as discrete buckets (bars) and the running total of Open calls is a line.
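To make the arithmetic behind the three expressions easier to check outside QlikView, here is a small plain-Python sketch of the same idea: count creation and resolution events per day, fill in the days with no events, and accumulate Created minus Resolved into a running Open total. The function names and the (id, created, resolved) tuple layout are assumptions made for the illustration; they are not part of the load script above.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(calls):
    """calls: iterable of (call_id, created_date, resolved_date_or_None)."""
    created, resolved = Counter(), Counter()
    for _call_id, created_on, resolved_on in calls:
        created[created_on] += 1
        if resolved_on is not None:
            resolved[resolved_on] += 1
    return created, resolved


def open_running_total(calls):
    """Return (day, created, resolved, open_so_far) for every day in the range."""
    created, resolved = daily_counts(calls)
    days = sorted(set(created) | set(resolved))
    if not days:
        return []
    rows, open_calls = [], 0
    day = days[0]
    while day <= days[-1]:  # walk every calendar day, including empty ones
        open_calls += created[day] - resolved[day]
        rows.append((day, created[day], resolved[day], open_calls))
        day += timedelta(days=1)
    return rows
```

The while loop plays the role of the calendar table: days without any event still get a row, so the Open line stays flat on those days instead of disappearing.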

Show hits from the same series together in Lucene

Some articles are written in several parts.
For example, I have these articles from IBM developerWorks:
Distributed data processing with Hadoop, Part 1: Getting started
Distributed data processing with Hadoop, Part 2: Going further
Distributed data processing with Hadoop, Part 3: Application development
I will index these three articles separately. When someone searches for certain keywords, Part 3 may be the top hit while Part 1 is ranked 32nd. So if I list the results page by page, Part 1 and Part 3 will be displayed on different pages.
How can I make sure that hits from the same series are displayed together?
I guess that in SQL we could use "group by".
I believe what you are asking for is Field Collapsing, which is currently a trunk feature in Solr, and will be incorporated into the next Solr version.
If you want to roll your own, one possible way to do this is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check to see if it has a series id; If it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
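The roll-your-own steps above can be sketched in Python, independent of any Lucene API. Everything here is hypothetical scaffolding: hits are plain dicts, "series_id" and "part" are assumed field names, and fetch_series stands in for the second query by series id that you would issue against your index.

```python
def collapse_series(hits, fetch_series):
    """Reorder a ranked hit list so that members of a series appear together.

    hits: documents in rank order, each a dict with an optional "series_id".
    fetch_series: callable(series_id) -> all documents of that series
                  (a stand-in for the follow-up query against the index).
    """
    seen_series = set()
    ordered = []
    for hit in hits:
        series_id = hit.get("series_id")
        if series_id is None:
            ordered.append(hit)  # stand-alone document keeps its rank
        elif series_id not in seen_series:
            seen_series.add(series_id)
            # Pull in the whole series at the rank of its best hit,
            # ordered by part number.
            members = fetch_series(series_id)
            ordered.extend(sorted(members, key=lambda doc: doc["part"]))
    return ordered
```

The series is placed at the position of its highest-ranked member, so a strong hit on Part 3 brings Parts 1 and 2 up with it. Paging would then be applied to the collapsed list rather than to the raw hit list.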