Indicator matrix for categorical data in GLM.jl with DataFrames.jl - dataframe

I am working with a large data set and want to run a logit regression on monthly data. For this I create a DataFrame and use the GLM package in Julia.
My code looks something like this:
f=glm((Y ~ Age + Duration + Gender + Nationality + MonthIn), Data2000, Binomial(), LogitLink())
My question: since I have monthly data, I want to create dummy variables for the 12 months (or 11 if I include a constant). MonthIn is just a column holding the month number (e.g. 3 for March). I do not want to regress on it directly; I only included it here to make the explanation easier.
When I searched for how to do this, I only learned that in R this is built into some regression methods, so that monthly dummies are created automatically. As far as I can tell, this is not the case in Julia.
One guess would be to use the data pooling functionality built into DataFrames.jl to create an indicator matrix, but I am not sure how this, or something similar, would be done. Alternatively, how would I create the dummies by hand?
I highly appreciate any help and please feel free to ask if my question is not clear.
Cheers
PS: From this question I know that I have to create a Pooled Data Array but I am not sure how it is done.
Dummy Variables in Julia
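The "by hand" route is the same idea in any language, so here is a minimal sketch in plain Python rather than Julia. It is not the GLM.jl or DataFrames.jl API. The function name month_dummies, the Month<k> column names, and the sample values are all made up for illustration. One month is dropped as the reference level so that a constant can stay in the model.

```python
def month_dummies(months, reference=1):
    """Build 0/1 indicator columns for months 1..12, skipping `reference`.

    Returns (names, rows): 11 column names and one list of 11 indicators
    per input value. Rows for the reference month are all zeros.
    """
    levels = [m for m in range(1, 13) if m != reference]
    names = [f"Month{m}" for m in levels]
    rows = [[1 if value == m else 0 for m in levels] for value in months]
    return names, rows


# Made-up values standing in for the MonthIn column.
month_in = [3, 1, 12, 3]
names, rows = month_dummies(month_in)

for value, row in zip(month_in, rows):
    print(value, row)
```

The eleven columns would then replace MonthIn on the right-hand side of the formula.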

Related

Splitting the sum of several datasets

My workspace has several datasets; specifically dataset 1 and dataset 2. Each dataset has a dollar value fact that I’m plotting. My aim is to make an insight that splits the sum of dataset 1 value and dataset 2 value. Is it possible to create such an insight directly in GoodData, or does my model need to calculate the totals outside of GoodData and import them into another dataset?
It is possible to sum two facts from different datasets, but make sure your Logical Data Model (LDM) structure allows you to do so, i.e., both datasets must be connected/referenced correctly. Please check Connection Points in Logical Data Models. Once you do this, you will need to create a slightly complex metric that may look as follows:
SELECT SUM (Fact1 + Fact2)
You could then reference this metric in a compound metric, for example:
SELECT FactsSum / Revenue_Last_Year

python - pandas - dataframe - data padding multidimensional statistics

i have a dataframe with columns accounting for different characteristics of stars and rows accounting for measurements of different stars. (something like this)
property    A    A_error    B    B_error    C    C_error    ...
star1
star2
star3
...
in some measurements the error for a specific property is -1.00 which means the measurement was faulty.
in such case i want to discard the measurement.
one way to do so is by eliminating the entire row (along with other properties whose error was not -1.00)
i think it's possible to fill in the faulty measurement with a value generated by the distribution based on all the other measurements, meaning - given the other properties which are fine, this property should have this value in order to reduce the error of the entire dataset.
is there a proper name to the idea i'm referring to?
how would you apply such an algorithm?
i'm a student on a solo project so would really appreciate answers that also elaborate on theory (:
edit
after further reading, i think what i was referring to is called regression imputation.
so i guess my question is - how can i implement multidimensional linear regression in a dataframe in the most efficient way???
thanks!
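a minimal sketch of regression imputation, in plain Python so it does not depend on pandas: fit a line on the rows where the measurement is good, then predict the faulty ones. it uses a single predictor (A) to fill in B, whereas the multidimensional case would use several predictors. the function names, the dict layout and the star values are made up for illustration; only the -1.00 error sentinel comes from the question.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_b(rows, bad=-1.0):
    """Return copies of rows where B is predicted from A if B_error == bad."""
    good = [r for r in rows if r["B_error"] != bad]
    intercept, slope = fit_line([r["A"] for r in good], [r["B"] for r in good])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows stay untouched
        if r["B_error"] == bad:
            r["B"] = intercept + slope * r["A"]
        filled.append(r)
    return filled


# Made-up measurements; the last star has a faulty B (error of -1.00).
stars = [
    {"A": 1.0, "B": 2.0, "B_error": 0.1},
    {"A": 2.0, "B": 4.0, "B_error": 0.1},
    {"A": 3.0, "B": 6.0, "B_error": 0.2},
    {"A": 4.0, "B": 0.0, "B_error": -1.0},
]
filled = impute_b(stars)
print(filled[3]["B"])
```

note that the imputed values sit exactly on the fitted line, so they understate the real scatter in the data.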

How to do sampling in sql query to get dataframe with pandas

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can use the pandas sample function with a frac option, like
df_sample = df.sample(frac=0.01)
However, I do not want to generate the original df at that size. I wonder what the best practice is for generating a dataframe from data that has already been sampled.
I've read some SQL posts where the sample data was generated from a slice, which is not acceptable in my case. The sample data needs to be as evenly distributed as possible.
Can anyone shed more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see that the majority of records have a very small reputation.
I don't want to work with a dataframe containing all the records. I want the sampled data to look like the un-sampled data, for example with a similar histogram; that's what I meant by "evenly".
I hope this clarifies a bit.
A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach
http://sqlfiddle.com/#!9/21d1ee/2
On average, random sampling will provide a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to 'force' the sample to be more representative, or force it to be a certain size, we need to look at something a little more advanced.
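To make the difference concrete, here is a minimal sketch in plain Python of the two ideas mentioned above: the per-row 10% coin flip that the rand() filter performs, and a stratified sample that takes the same fraction from every bin. The function names and the skewed reputation values are made up for illustration, and in practice the sampling would be pushed into the SQL query rather than done client-side.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p (size is random)."""
    return [row for row in rows if rng.random() < p]


def stratified_sample(rows, key, frac, rng):
    """Take roughly `frac` of every group defined by key(row), at least one row."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(frac * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up, heavily skewed reputation values.
reputations = [1] * 9000 + [500] * 900 + [20000] * 100

simple = bernoulli_sample(reputations, 0.1, rng)
strat = stratified_sample(reputations, key=lambda r: r, frac=0.1, rng=rng)

print(len(simple), simple.count(20000))
print(len(strat), strat.count(20000))
```

The simple sample only matches the original histogram on average, while the stratified one keeps every bin at its original share by construction.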

How to replace the missing data from AMELIA results

I have run an Amelia imputation on a data set that includes missing data. I need to replace the missing points with the result of amelia(), but it contains 5 groups of imputed values. How can I choose the best one to replace the missing values (to plot a graph of the data set after imputing)?
You use all 5.
You have to perform whatever you wanted to do with the data on all 5 sets of data and then combine the results of that.
i.e. you run a t-test on all 5 datasets and then combine the results..somehow.. I have not yet looked into that, but from what I have heard you can use the zelig R package to do it somewhat easily. I also noted a reference to papers that should describe methods to combine those, but have not looked into that either: King et al. (2001) and Schafer (1997).
My guess is that you just average out the p-values gained from the analysis?
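For what it is worth, the usual way to combine results from multiply imputed data sets is Rubin's rules, which pool the estimates and their variances rather than averaging p-values. A minimal sketch in plain Python follows. The function name pool_rubin and the five estimates and variances are made up for illustration; they are not output from amelia().

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one estimate and its squared standard error per imputed data set.

    Returns (pooled_estimate, total_variance) following Rubin's rules.
    """
    m = len(estimates)
    pooled = mean(estimates)            # average of the m estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from the same analysis run on 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

A test statistic or p-value would then be computed from the pooled estimate and its standard error.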

How would I calculate EXPECTED income if I have PAST income data in mySQL?

Ok, I'm just curious what the formula would be for calculating an expected income over the next X weeks/months/etc, if the only data I have in mySQL DB is all past transactions (dates of transactions, amounts, etc)
I am thinking of taking some averages and whatnot, but I can't think of a specific formula (there must be something along those lines) to take, say, the average rise of income over time (weekly/monthly), apply it to a selected future period, and display it weekly/monthly/etc.
Any suggestions?
Use AVG() on the past income and divide it into the proper weekly/monthly amounts if necessary.
see http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG()
Linear regression + simple integration is probably sufficient for your needs. I leave sorting out the exact implementation for your DB up to you, but follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
EDIT:
For more detail: you're trying to model INCOME = BASE + SCALING*T where we are assuming that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two value linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which things you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g., having equally spaced time periods + no missing data), but you'll need to math a bit more yourself first if you want to do that (and you'll be less flexible in the face of changing db assumptions).
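As a rough sketch of that model in plain Python (not SQL): fit INCOME = BASE + SCALING*T by ordinary least squares on equally spaced periods, then sum the predicted values over the future periods. The function names and the weekly figures are made up for illustration, and it assumes one total per period with no gaps and at least two periods.

```python
def fit_trend(incomes):
    """Least-squares fit of income = base + scaling * t, with t = 0, 1, 2, ..."""
    n = len(incomes)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(incomes) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, incomes))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    scaling = sxy / sxx
    base = mean_y - scaling * mean_t
    return base, scaling


def forecast(incomes, periods_ahead):
    """Predicted income for each of the next `periods_ahead` periods."""
    base, scaling = fit_trend(incomes)
    n = len(incomes)
    return [base + scaling * t for t in range(n, n + periods_ahead)]


# Made-up weekly totals, e.g. from a GROUP BY week query.
weekly = [100.0, 110.0, 120.0, 130.0]

next_weeks = forecast(weekly, 3)
expected_total = sum(next_weeks)  # the "simple integration" step
print(next_weeks, expected_total)
```

The weekly totals themselves could come from a SUM() grouped by week in MySQL, with only the fit done outside the database.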