My workspace has several datasets, specifically dataset 1 and dataset 2. Each dataset has a dollar-value fact that I’m plotting. My aim is to build an insight that plots the sum of the dataset 1 value and the dataset 2 value. Is it possible to create such an insight directly in GoodData, or do I need to calculate the totals outside of GoodData and import them into another dataset?
It is possible to sum two facts from different datasets, but make sure your Logical Data Model (LDM) structure allows it, i.e., both datasets must be connected/referenced correctly. Please check Connection Points in Logical Data Models. Once that is in place, you will need to create a slightly more complex metric that may look as follows:
SELECT SUM (Fact1 + Fact2)
You could then reference this metric in a compound metric, for example:
SELECT FactsSum / Revenue_Last_Year
I'm using BigQuery at my new position, and I'm totally new to SQL/BigQuery.
I'm testing a machine learning model and monitoring an A/B test with an uneven split, e.g., 3 vs. 10. To compare the A/B results, e.g., the number of page views, I want to equalize the group sizes first. For example, say we have a table with 13 records (3 from A and 10 from B). In addition, each row contains an id field that is identical. What I want to do is extract only 3 of the 10 B records so that the sample size matches A.
I'm trying to use the FARM_FINGERPRINT function to map fields to integers. Then I'm taking ABS and then calculating MOD to convert the integer numbers to a specific range, e.g., [0, 10). Eventually, I would like to get 3 in 10 items using the following line:
MOD(ABS(FARM_FINGERPRINT(field)), 10) < 3
However, I found that even when I run A and B with exactly the same ML model and only a different A/B ratio, the results differ between A and B. (They should be the same, because A and B run the same ML model with just a different ratio.) This makes me suspect that the above implementation introduces biased sampling. I also read this post, which suggests that FARM_FINGERPRINT might not give a randomly distributed result.
*There's a critical reason why I cannot simply multiply B by 3/10, which is confidential and which I cannot disclose here.
Is there a better way to accomplish the equally distributed sampling?
Thank you in advance. (I'm sorry if the question is vague, as I'm hiding the confidential parts.)
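One way to check whether a hash-bucket filter like the one above samples evenly is to count how many keys land in each bucket. The following is only a minimal Python sketch of that check: SHA-256 stands in for FARM_FINGERPRINT (it is not the same hash), and the `user-N` ids are hypothetical. If the bucket counts are badly uneven on your real key, the key itself (for example a field with few distinct values) is a more likely culprit than the hash.

```python
import hashlib
from collections import Counter


def bucket(value: str, n_buckets: int = 10) -> int:
    """Map a string to a bucket in [0, n_buckets) with a stable hash.

    SHA-256 is used here as a stand-in; it is NOT FARM_FINGERPRINT.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, so no ABS is needed.
    return int.from_bytes(digest[:8], "big") % n_buckets


# Hypothetical ids; replace with the real field being hashed.
ids = [f"user-{i}" for i in range(10_000)]

# How many ids fall into each of the 10 buckets.
counts = Counter(bucket(i) for i in ids)

# Keep roughly 3 in 10 ids, mirroring MOD(...) < 3.
sampled = [i for i in ids if bucket(i) < 3]
sample_rate = len(sampled) / len(ids)

print(sorted(counts.items()))
print(f"sample rate: {sample_rate:.3f}")
```

Because the bucket depends only on the id, the same id is always kept or always dropped, so the sample is repeatable across runs.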
I am working on a research problem and, because my dataset of subjects is small, I am trying to implement leave-N-out style analyses.
Currently I am doing this ad hoc, and I stumbled upon scikit-learn's LeavePGroupsOut class.
I read the docs but I am unable to understand how to use it with a multidimensional array.
My data are the following: I have 50 subjects, around 20 entries per subject (not fixed) and 20 features per entry with ground-truth value (0 or 1) for every entry.
Well the documentation is actually pretty clear:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut
In your case you need to concatenate your arrays so that every entry (row) has a group index identifying its subject. With 50 subjects and about 20 entries each, your feature array will have roughly 50*20 = 1000 rows and 20 features, i.e. shape (1000, 20), so your group array needs to have shape (1000,).
Then you need to define the cross validation via
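from sklearn.model_selection import LeavePGroupsOut
lpgo = LeavePGroupsOut(n_groups=n_groups)  # n_groups = number of subjects held out per split
It's important to note that this iterates over every possible combination of n_groups left-out test groups, so the number of splits grows quickly with n_groups.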
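To make the shapes concrete, here is a small sketch with synthetic data that follows the description in the question (50 subjects, a variable number of entries per subject, 20 features, binary labels). The random data, the range of 15 to 25 entries per subject, and the choice of `n_groups=2` are assumptions for illustration only.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

rng = np.random.default_rng(0)
n_subjects, n_features = 50, 20

# Entries per subject are not fixed: draw between 15 and 25 for each subject.
entries_per_subject = rng.integers(15, 26, size=n_subjects)

# One group label (subject index) per row of X.
groups = np.repeat(np.arange(n_subjects), entries_per_subject)

# Synthetic features and binary ground truth, stacked over all subjects.
X = rng.normal(size=(groups.size, n_features))
y = rng.integers(0, 2, size=groups.size)

lpgo = LeavePGroupsOut(n_groups=2)

# Every pair of subjects is held out once: C(50, 2) splits.
n_splits = lpgo.get_n_splits(X, y, groups)
print(n_splits, comb(n_subjects, 2))

# Inspect the first split: the test rows belong to exactly two subjects.
train_idx, test_idx = next(lpgo.split(X, y, groups))
held_out = np.unique(groups[test_idx])
print(held_out)
```

With 50 subjects, leaving out 2 at a time already gives 1225 splits, so larger values of `n_groups` can become expensive to run exhaustively.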
i have a dataframe with columns accounting for different characteristics of stars and rows accounting for measurements of different stars. (something like this)
property    A    A_error    B    B_error    C    C_error    ...
star1
star2
star3
...
in some measurements the error for a specific property is -1.00, which means the measurement was faulty.
in such a case i want to discard the measurement.
one way to do so is by eliminating the entire row (along with the other properties whose error was not -1.00)
i think it's possible to fill in the faulty measurement with a value generated by the distribution based on all the other measurements, meaning - given the other properties which are fine, this property should have this value in order to reduce the error of the entire dataset.
is there a proper name to the idea i'm referring to?
how would you apply such an algorithm?
i'm a student on a solo project so would really appreciate answers that also elaborate on theory (:
edit
after further reading, i think what i was referring to is called regression imputation.
so i guess my question is - how can i implement multidimensional linear regression on a dataframe in the most efficient way?
thanks!
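A minimal sketch of regression imputation on a dataframe, assuming pandas and NumPy: fit a least-squares linear model for one property on the rows where it is valid, then predict it for the rows flagged with an error of -1.00. The column names A, A_error, B, C, the toy relation between them, and the helper `regression_impute` are all hypothetical; real data would be noisy and the predictors themselves may contain faulty values that need handling first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Toy data: property A is an exact linear function of B and C (no noise).
B = rng.normal(size=n)
C = rng.normal(size=n)
A_true = 1.0 + 2.0 * B + 3.0 * C
df = pd.DataFrame({"A": A_true, "A_error": 0.1, "B": B, "C": C})

# Flag 10 rows as faulty: error of -1.00, measurement treated as missing.
faulty_rows = rng.choice(n, size=10, replace=False)
df.loc[faulty_rows, "A_error"] = -1.0
df.loc[df["A_error"] == -1.0, "A"] = np.nan


def regression_impute(frame, target, predictors):
    """Fill missing values of `target` using a linear fit on `predictors`."""
    out = frame.copy()
    missing = out[target].isna().to_numpy()
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy()])
    # Ordinary least squares on the rows where the target is observed.
    coef, *_ = np.linalg.lstsq(
        X[~missing], out[target].to_numpy()[~missing], rcond=None
    )
    # Predict the target for the rows where it is missing.
    out.loc[missing, target] = X[missing] @ coef
    return out


imputed = regression_impute(df, "A", ["B", "C"])
print(imputed.loc[faulty_rows, ["A", "B", "C"]].head())
```

On theory: plain regression imputation fills in the conditional mean, so the imputed values have less spread than real measurements would. Multiple imputation addresses this by drawing several plausible values per gap instead of a single prediction.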
I understand there's already another post, but it's a bit old and it doesn't really answer the question.
I understand that we can use the parameter DATA_SPLIT_METHOD to separate the dataset into training and evaluation sets. But how do I make sure that the two sets are actually different?
So for example, I set DATA_SPLIT_METHOD to AUTO_SPLIT, and my dataset has between 500 and 50,000 rows, so 20% of the data will be used for evaluation. How do I make sure that only the remaining 80% was used for training when I run my evaluation (ML.EVALUATE)?
The short answer is BigQuery does it for you.
The long answer is that DATA_SPLIT_METHOD is a parameter of CREATE MODEL, which, when run, creates and trains the model using the split defined by DATA_SPLIT_METHOD.
When you run ML.EVALUATE, you run it against that model, which was created with DATA_SPLIT_METHOD as a parameter. It therefore already knows which part of the dataset to evaluate on, and it uses the already trained model.
Very interesting question I would say.
As stated in the BigQuery documentation for the CREATE MODEL parameters, DATA_SPLIT_METHOD is "the method to split input data into training and evaluation sets", so that is done for you.
But in case you would still like to split your data yourself, here is one way to do it with random sampling:
-- Create a new table with an extra column that splits the existing data into
-- training (80%), evaluation (10%), and prediction (10%)
CREATE OR REPLACE TABLE `project_name.table_name` AS
SELECT *,
  CASE
    WHEN split_field < 0.8 THEN 'training'
    WHEN split_field < 0.9 THEN 'evaluation'
    ELSE 'prediction'
  END AS churn_dataframe
FROM (
  -- RAND() is left unrounded: rounding to one decimal would skew the shares
  SELECT *,
    RAND() AS split_field
  FROM `project_name.table_name`
)
split_field = a field holding a random number between 0 and 1 for each row (the generated numbers are assumed to be uniformly distributed)
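As a quick sanity check of the proportions, the split can be simulated outside BigQuery. This is only a sketch in Python, assuming uniform draws cut at 0.8 and 0.9; the function names and the seed are arbitrary. For comparison, it also shows what happens if the draw is rounded to one decimal place before being compared against 0.8.

```python
import random


def assign_split(u: float) -> str:
    """Cut an unrounded uniform draw at 0.8 and 0.9."""
    if u < 0.8:
        return "training"
    if u < 0.9:
        return "evaluation"
    return "prediction"


def assign_split_rounded(u: float) -> str:
    """Same idea, but the draw is rounded to one decimal place first."""
    r = round(u, 1)
    if r < 0.8:
        return "training"
    if r == 0.8:
        return "evaluation"
    return "prediction"


rng = random.Random(42)
draws = [rng.random() for _ in range(100_000)]


def shares(assign):
    """Fraction of draws that each split label receives."""
    labels = [assign(u) for u in draws]
    names = ("training", "evaluation", "prediction")
    return {name: labels.count(name) / len(labels) for name in names}


exact = shares(assign_split)
rounded = shares(assign_split_rounded)
print(exact)    # close to 0.80 / 0.10 / 0.10
print(rounded)  # close to 0.75 / 0.10 / 0.15
```

The rounded variant sends only the draws below 0.75 to training, because values from 0.75 up to 0.85 all round to 0.8, which is why the unrounded comparison is the safer choice for an 80/10/10 split.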
Hope this could help.
Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning. For example, take a variable giving the ratio of satisfactory accounts to total accounts: a blank response means the individual does not have any accounts, whereas a response of 0 means the individual has no satisfactory accounts.
Currently, these records are not included in logistic regression analyses because they have missing values for one or more fields. Is there a way to include these records in a logistic regression model?
I am aware that I can assign these blank fields a value that is outside the range of the data. For the ratio variable above, for example, we could use 9999 or -1, since a ratio only takes values from 0 to 1. I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions on your experimental design and the fact that it will somewhat weaken the power of your experiment relative to the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging a few numbers in. See this page for more information. Ultimately this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.