This should be an easy question, but I cannot solve it...
I have a simple metric, let's say it is called "a", with entries like
a = (B,A,A,B,C,A,A)
Now what I would like to get is the frequency of each entry, thus
A: 4
B: 2
C: 1
Can somebody help me? Thanks :)
Could you convert "a" to an attribute, and then use the Row Count metric against it?
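For checking the expected result outside the reporting tool, here is a minimal sketch in plain Python. It only illustrates the counting itself, not the attribute/Row Count approach suggested above; the list `a` is simply the sample from the question.

```python
from collections import Counter

# Sample entries from the question; in practice these would come from the report data.
a = ["B", "A", "A", "B", "C", "A", "A"]

# Counter tallies how many times each distinct entry occurs.
frequencies = Counter(a)

for entry, count in sorted(frequencies.items()):
    print(f"{entry}: {count}")
```

This prints A: 4, B: 2 and C: 1, matching the frequencies asked for.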
I have a Tableau table as follows:
This data can be visualized as follows:
I'd like to flag cases that have lumps/clusters. This would flag items B, C and D because there are spikes only in certain weeks of the 13 weeks. Items A and E would not be flagged as they mostly have a 'flat' profile.
How can I create such a flag in Tableau or SQL to isolate this kind of a case?
What I have tried so far:
I've tried a logic where for each item I calculate the MAX and MEDIAN. Items that need to be flagged will have a larger (MAX - MEDIAN) value than items that have a fairly 'flat' profile.
Please let me know if there's a better way to create this flag.
Thanks!
Agree with the other commenters that this question could be answered in many different ways and you might need a PhD in Stats to come up with an ideal answer. However, given your basic requirements this might be the easiest/simplest solution you can implement.
Here is what I did to get here:
Create a parameter to define your "spike". If it is always going to be a fixed number, you can hardcode it in your formulas. I called mine "Min Spike Value".
Create a formula for the Median Values in each bucket. {fixed [Buckets]: MEDIAN([Values])} . (A, B, ... E = "Buckets"). This gives you one value for each letter/bucket that you can compare against.
Create a formula to calculate the difference of each number against the median. abs(sum([Values])-sum([Median Values])). We use the absolute value here because a spike can either be negative or positive (again, if you want to define it that way...). I called this "Spike to Current Value abs difference"
Create a calculated field that evaluates to a boolean to see if the current value is above the threshold for a spike. [Spike to Current Value abs difference] > min([Min Spike Value])
Set up your viz to use this boolean to highlight the spikes. The beauty of the parameter is that you can change the value for what a spike should be and it will highlight accordingly. Above the value was 4, but if you change it to 8:
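The same logic can be sketched outside Tableau. The following is a minimal Python illustration of the steps above, assuming made-up weekly values (the original table is not reproduced here); `weekly_values`, `spike_flags` and `MIN_SPIKE_VALUE` are names chosen for this example only.

```python
from statistics import median

# Hypothetical weekly values per bucket; the original table is not shown in the post.
weekly_values = {
    "A": [5, 5, 6, 5, 5, 6, 5],
    "B": [1, 1, 12, 1, 1, 1, 1],
}

MIN_SPIKE_VALUE = 4  # plays the role of the "Min Spike Value" parameter


def spike_flags(values, threshold):
    """Return one boolean per value: True where |value - median| exceeds threshold."""
    bucket_median = median(values)
    return [abs(v - bucket_median) > threshold for v in values]


# One list of booleans per bucket, then the buckets that contain at least one spike.
flags = {bucket: spike_flags(vals, MIN_SPIKE_VALUE) for bucket, vals in weekly_values.items()}
flagged_buckets = [bucket for bucket, f in flags.items() if any(f)]
print(flagged_buckets)
```

With these sample numbers only bucket B is flagged, because its one large week sits far from the bucket median while A stays within the threshold.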
I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income value. He offers the following piece of code to create an income category attribute.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
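To make the arithmetic concrete, here is a minimal sketch in plain Python of what the two lines compute for a single value. It is an illustration, not the author's code: `income_category` is a hypothetical helper and the income values are made up. Dividing by 1.5 and rounding up means category k covers incomes in the range (1.5(k-1), 1.5k], so a cap of 15 would give up to 10 categories before the merge.

```python
import math


def income_category(median_income, divisor=1.5, max_category=5):
    """Bucket an income value: ceil(income / divisor), capped at max_category."""
    category = math.ceil(median_income / divisor)
    # Everything above max_category is merged into max_category.
    return float(min(category, max_category))


# Made-up income values for illustration only.
for income in [0.9, 1.5, 2.8, 4.4, 6.1, 9.0, 15.0]:
    print(income, "->", income_category(income))
```

A larger divisor would give wider and therefore fewer categories; the choice of 1.5 and of 5 categories is a modelling decision rather than something the arithmetic forces.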
What I don't understand is
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure if this is the right Stack Overflow category for this question, so if I made a mistake by posting here, please let me know what the appropriate forum might be.
Thank you!
You may be the right person to analyze this further based on your data set. But I can help you understand stratified sampling, so that you will have an idea.
STRATIFIED SAMPLING: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). If you just sample rows at random from the data set, there is a possibility that the sample will not cover all the categories, or will cover them in very different proportions from the full data, which is very bad when you train on it. To avoid this scenario we have a method called stratified sampling: the data is split by category (the strata) and each category is sampled separately, usually in proportion to its size, so that every category is represented and we do not miss any useful data.
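A minimal sketch of the idea in plain Python, assuming made-up fruit data and a hypothetical helper `stratified_sample` (libraries such as scikit-learn offer their own tools for this; this is only to show the principle):

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum so category proportions are preserved."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Take at least one row so that no category disappears from the sample.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up data: 80 apple eaters, 15 orange eaters, 5 grape eaters.
consumers = (
    [("apple", i) for i in range(80)]
    + [("orange", i) for i in range(15)]
    + [("grapes", i) for i in range(5)]
)
sample = stratified_sample(consumers, key=lambda row: row[0], fraction=0.2)
counts = {fruit: sum(1 for f, _ in sample if f == fruit) for fruit in ("apple", "orange", "grapes")}
print(counts)
```

Even the rare category keeps a seat in the sample, which a plain random draw of 20 rows would not guarantee.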
Please let me know if you still have any questions, I would be very happy to help you.
I am working with a large data set and want to run a logit regression on monthly data. For this I create a DataFrame and use the GLM package in Julia.
My code looks something like this:
f=glm((Y ~ Age + Duration + Gender + Nationality + MonthIn), Data2000, Binomial(), LogitLink())
My question is this: since I have monthly data, I want to create dummy variables for the 12 months, or eleven if I want to include a constant. MonthIn is just a column that holds the number of the month (e.g. 3 for March). I do not want to run the regression on this column as it is; I only included it here to make the explanation easier.
When I tried to find out how this is done, I only learned that in R this is built into some regression methods, so that they can create monthly dummies automatically. I don't think this is the case in Julia.
One guess of mine would be to use the pooled data functionality built into DataFrames.jl to create an indicator matrix, but I am not sure how this or something similar would be done, or simply how to create the dummies by hand.
I highly appreciate any help and please feel free to ask if my question is not clear.
Cheers
PS: From this question I know that I have to create a Pooled Data Array but I am not sure how it is done.
Dummy Variables in Julia
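As an illustration of what "creating the dummies by hand" means, here is a minimal sketch. It is written in plain Python rather than Julia, purely to show the encoding; `month_dummies` and the `Month_<n>` column names are hypothetical, and one month is left out as the reference level so that the eleven indicators can sit next to a constant.

```python
def month_dummies(months, reference=1):
    """Build 11 indicator columns for months 1..12, leaving out the reference month."""
    levels = [m for m in range(1, 13) if m != reference]
    return {f"Month_{m}": [1 if value == m else 0 for value in months] for m in levels}


# Made-up MonthIn values (3 stands for March, as in the question).
month_in = [3, 1, 12, 3]
dummies = month_dummies(month_in)
print(dummies["Month_3"])   # [1, 0, 0, 1]
print(len(dummies))         # 11
```

Rows from the reference month (January here) are all zeros across the eleven columns, which is what lets the intercept absorb that month.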
I'm new to Pig. I need to do some calculation for all fields/columns in a table. However, I can't find a way to do it by searching online. It would be great if someone here can give some help!
For example: I have a table with 100 fields/columns, most of them numeric. I need to find the average of each field/column. Is there an elegant way to do it without repeating AVG(column_xxx) 100 times?
If there's just one or two columns, then I can do
-- Put every row of A into a single group, then average the columns of that group.
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.column_1), AVG(A.column_2);
However, if there are 100 fields, it's really tedious to write AVG 100 times, and it's easy to make errors.
One way I can think of is to embed Pig in Python, use Python to generate a string like that, and pass it to compile. However, that still sounds weird even if it works.
Thank you in advance for help!
I don't think there is a nice way to do this with Pig. However, this should work well enough and can be done in 5 minutes:
Describe the table (or alias) in question
Copy the output, and reorganize it manually into the script part you need (for example with Excel)
Finish and store the script
If you need to be able to cope with columns that can suddenly change etc., there is probably no good way to do it in Pig. Perhaps you could read in all the columns elsewhere (in R, for example) and do your operation there.
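The manual reorganizing step could also be scripted. Here is a minimal sketch in Python, assuming you already have the column names as a list (for instance copied from the DESCRIBE output); `build_avg_statement` and the alias defaults are hypothetical names for this example, and the generated text still has to be pasted into or passed to your Pig script.

```python
def build_avg_statement(columns, grouped_alias="B", source_alias="A", result_alias="C"):
    """Return a Pig FOREACH statement that averages every listed column."""
    averages = ", ".join(f"AVG({source_alias}.{col}) AS avg_{col}" for col in columns)
    return f"{result_alias} = FOREACH {grouped_alias} GENERATE {averages};"


# Hypothetical column names; in practice these would be copied from the DESCRIBE output.
columns = [f"column_{i}" for i in range(1, 4)]
statement = build_avg_statement(columns)
print(statement)
```

Non-numeric columns would need to be left out of the list by hand, since the sketch does not look at types.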
I'm setting up an SSAS project for our websites, but I can't manage to get the right value, even though it's quite simple in a plain SQL query.
Here's my setup: I have a data warehouse filled with user connection facts for my sites, along with a Member dimension and a Date dimension. Here's the KPI I'm looking for: "On how many days, on average, does a user come to see our site?"
Let's take an example :
Member Day
a 1
a 1
a 2
b 2
a 4
a 5
b 5
a 6
In this case the KPI should give 3,5 (a=5, b=2). In plain SQL I would have done an average on a group by on a group by (it's the first request I've got in mind, maybe there's a better one).
But as soon as I try to assemble dimensions and facts together, I can't find the right measure.
Am I looking for the wrong thing? Should I abandon my SQL way of thinking? How would you get the value I need?
I understand now! It was just an internationalisation problem. To me 3,5 means the numbers 3 and 5, I'd write it as 3.5 :)
SELECT
    -- Multiplying by 1.0 keeps SQL Server from truncating the average of integer counts
    AVG(CountOfDay * 1.0) AS AverageDays
FROM
    (SELECT Member, COUNT(DISTINCT Day) AS CountOfDay FROM YourTable GROUP BY Member) AS UniqueDaysByMember
Really you don't need "Member" in the sub-query's SELECT. It just makes it "mean" something to me, so I don't get confused if I come back and look at the code later!
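As a quick sanity check of the logic (distinct days per member first, then the average of those counts), here is a minimal Python sketch using the example rows from the question. It only mirrors the query's two steps and is not SSAS or MDX.

```python
from collections import defaultdict

# Connection facts from the example in the question: (member, day).
connections = [("a", 1), ("a", 1), ("a", 2), ("b", 2), ("a", 4), ("a", 5), ("b", 5), ("a", 6)]

# Collect the distinct days per member (the inner GROUP BY with COUNT(DISTINCT Day)).
days_by_member = defaultdict(set)
for member, day in connections:
    days_by_member[member].add(day)

# Average the per-member counts (the outer AVG).
counts = {member: len(days) for member, days in days_by_member.items()}
average_days = sum(counts.values()) / len(counts)
print(counts, average_days)
```

This gives a=5 and b=2, so the average is 3.5, the value expected in the question.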