Insert zeros instead of interpolating with ARIMA_PLUS in BigQuery - google-bigquery
I want to do ARIMA_PLUS forecasting on a series of sale records. The problem is that the sale records only contain actual sales. When doing the forecast, we need to insert, for every product, the "non-sales": essentially, rows with the import column set to zero for every day the product has not been sold. We have two options here:
Fill the database with those zero rows (this uses a lot of space)
When doing the forecasting with ARIMA_PLUS in BigQuery, tell the model to fill with zeros instead of interpolating (the default, and seemingly the only, option).
I want to follow the second option, yet I don't see how. Here you can see a screenshot of the documentation: Google info about interpolation
The first option could be carried out with a MERGE; nevertheless, I would prefer to discard it, since it increases the size of the sales table.
I have scanned the documentation and haven't seen any solution.
You need to provide an input dataset that covers the missing values with the right method for your use case.
In other words, the SQL query must solve the interpolation so that the input for the model already contains the expected data.
You can, for example, create a query that applies linear interpolation; in your case, it would fill the missing days with zeros instead.
So, the first approach you mentioned can be solved using that input SQL (rather than adding the data to the source table), and the second approach is not valid in BigQuery, as far as I know.
Here you have an example: https://justrocketscience.com/post/interpolation_sql/
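For the zero-filling itself, here is a minimal sketch of such an input query feeding the model directly. The table name my_dataset.sales, its columns (product_id, sale_date, import) and the hardcoded date range are assumptions for illustration; adapt them to your schema:

-- Sketch only: table, column names and date range are assumptions.
CREATE OR REPLACE MODEL my_dataset.sales_forecast
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'sale_date',
  time_series_data_col = 'import',
  time_series_id_col = 'product_id'
) AS
WITH calendar AS (
  -- One row per day in the period you want to train on.
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2022-01-01', '2022-12-31')) AS day
),
products AS (
  SELECT DISTINCT product_id FROM my_dataset.sales
)
SELECT
  p.product_id,
  c.day AS sale_date,
  -- Days without a sale get an explicit zero instead of being interpolated.
  IFNULL(s.import, 0) AS import
FROM products p
CROSS JOIN calendar c
LEFT JOIN my_dataset.sales s
  ON s.product_id = p.product_id
  AND s.sale_date = c.day;

The zero rows exist only in the query result that feeds the model, so the sales table itself never grows.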
Related
How to populate all possible combination of values in columns, using Spark/normal SQL
I have a scenario where my original dataset looks like the below.

Data:

Country,Commodity,Year,Type,Amount
US,Vegetable,2010,Harvested,2.44
US,Vegetable,2010,Yield,15.8
US,Vegetable,2010,Production,6.48
US,Vegetable,2011,Harvested,6
US,Vegetable,2011,Yield,18
US,Vegetable,2011,Production,3
Argentina,Vegetable,2010,Harvested,15.2
Argentina,Vegetable,2010,Yield,40.5
Argentina,Vegetable,2010,Production,2.66
Argentina,Vegetable,2011,Harvested,15.2
Argentina,Vegetable,2011,Yield,40.5
Argentina,Vegetable,2011,Production,2.66
Bhutan,Vegetable,2010,Harvested,7
Bhutan,Vegetable,2010,Yield,35
Bhutan,Vegetable,2010,Production,5
Bhutan,Vegetable,2011,Harvested,2
Bhutan,Vegetable,2011,Yield,6
Bhutan,Vegetable,2011,Production,3

Now there is a very small country lookup table which lists all possible countries the source data can come with. I want the output data's number of columns to always be fixed (this is to ensure the reporting/visualization tool doesn't get a dynamic number of columns with every day's new source data ingestion, depending on the varying number of distinct countries present). So I have to somehow join the source data with the country_lookup csv and populate all those columns with the default value F. Every country column would be binary, with T or F being the possible values. The original dataset above has to be converted into the below (I've kept the Amount field unsolved for rows whose Type is Derived Yield, rather than calculating them, for better understanding and so you can match them against the formulae):

Country,Commodity,Year,Type,Amount,US,Argentina,Bhutan,India,Nepal,Bangladesh
US,Vegetable,2010,Harvested,2.44,T,F,F,F,F,F
US,Vegetable,2010,Yield,15.8,T,F,F,F,F,F
US,Vegetable,2010,Production,6.48,T,F,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
US,Vegetable,2011,Harvested,6,T,F,F,F,F,F
US,Vegetable,2011,Yield,18,T,F,F,F,F,F
US,Vegetable,2011,Production,3,T,F,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
US,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Argentina,Vegetable,2010,Harvested,15.2,F,T,F,F,F,F
Argentina,Vegetable,2010,Yield,40.5,F,T,F,F,F,F
Argentina,Vegetable,2010,Production,2.66,F,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Argentina,Vegetable,2011,Harvested,10,F,T,F,F,F,F
Argentina,Vegetable,2011,Yield,90,F,T,F,F,F,F
Argentina,Vegetable,2011,Production,9,F,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Bhutan,Vegetable,2010,Harvested,7,F,F,T,F,F,F
Bhutan,Vegetable,2010,Yield,35,F,F,T,F,F,F
Bhutan,Vegetable,2010,Production,5,F,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Bhutan,Vegetable,2011,Harvested,2,F,F,T,F,F,F
Bhutan,Vegetable,2011,Yield,6,F,F,T,F,F,F
Bhutan,Vegetable,2011,Production,3,F,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F

Formula for populating the Amount field for the Derived Yield type:

Derived Amount = (sum of Harvested of all countries flagged T) / (sum of Production of all countries flagged T), grouped by the Year and Commodity columns.

So the target is to have a combination of all the countries from the source and calculate the sums of the respective Harvested and Production values, which then have to be divided. There can be more than one commodity for any given country in the actual scenario, but that should not matter, as the summation of the amount happens per commodity and year group.

Note: the users in the frontend can select any combination of countries. The sole purpose of doing this in the backend, rather than dynamically in the frontend, is that AWS QuickSight (our visualisation tool), even though it can populate sums on selected column filters, doesn't yet support calculations on those derived summed fields. Hence the entire calculation for all combinations of countries has to be pre-populated (a very naive approach) in order to make it available in the report on the users' dynamic selection of countries.

Also, if you have any better approach than the naive one mentioned in the note, you are most welcome to guide me. I've also posted a question on the same problem without describing my expected approach, for experts to show me how this kind of problem can be solved better than with this naive approach; if you want to help solve it with some other technique, here is the link to that question. Any help shall be greatly acknowledged.
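As a starting point for the fixed-column part only (not the derived-yield combinations), here is a minimal Spark SQL sketch. The table name source_data and the set of lookup countries (US, Argentina, Bhutan, India, Nepal, Bangladesh) are assumptions taken from the example above:

-- One fixed T/F flag column per country in the lookup table,
-- so the output schema never changes with the day's ingested data.
SELECT
  s.country, s.commodity, s.year, s.type, s.amount,
  CASE WHEN s.country = 'US'         THEN 'T' ELSE 'F' END AS US,
  CASE WHEN s.country = 'Argentina'  THEN 'T' ELSE 'F' END AS Argentina,
  CASE WHEN s.country = 'Bhutan'     THEN 'T' ELSE 'F' END AS Bhutan,
  CASE WHEN s.country = 'India'      THEN 'T' ELSE 'F' END AS India,
  CASE WHEN s.country = 'Nepal'      THEN 'T' ELSE 'F' END AS Nepal,
  CASE WHEN s.country = 'Bangladesh' THEN 'T' ELSE 'F' END AS Bangladesh
FROM source_data s;

The Derived Yield rows would still have to be generated separately, e.g. by expanding each (year, commodity) group over every subset of countries present and aggregating Harvested and Production per subset before dividing.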
Grand totals row not summing in Google Data Studio
Well, I'm an absolute newbie in Google Data Studio, but for whatever reason, my grand totals row is not working. I'm learning to use this tool, and I made an easy table with just countries and sessions. Piece of cake. Now I just want to add a total row that sums all sessions. That's all. I activated the option Show Summary Row, but it shows nothing. Things I've done that have not worked:

Update and refresh.
Changed the time period and tried different dates, just in case.
Deleted and created the full table again.
Checked the connection. I get data and the data is right, I just cannot sum it.
Changed the size and format of the table, just in case it was a problem of margins or font color.

And I know it can be done, because of different sources. I've read this question here: Grand Total is wrong in Google Data Studio. But it did not help. In that question, a user posted an image in the comments: as you can see, he managed to get what I'm trying to do. So I must be doing something wrong, and I don't know why. UPDATE 2: If I apply a filter, I get no totals. You can see my config on the right side of the image. Can anybody give me a clue about how to make a grand totals row in Google Data Studio? Thanks
Sounds like a bug. It should just be a case of selecting that tick box. Strangely, I looked at an existing table I have with totals, and when I unticked the box and then ticked it again, the totals didn't reappear, and they disappeared off another table on the page (like in your example). They did reappear eventually after some refreshing of the data and the page, but it seems like there's something wrong with them.
I don't think this is a bug; I think it's part of the design. I actually just discovered the reason this is happening, at least for me: it doesn't actually sum the values in the table. The grand total summary of a table is a sum of whatever metric is being used, not of the actual rows shown in the chart. So if you have a dimension (like age/gender) where data thresholding is applied internally by Google, but are using a metric such as users, you will see the grand total from the metric value without the thresholding from the dimension applied. Proof below. You can see the grand total for column 2 should be 453.6, not 953.6, and if I look at a non-thresholded dimension (country), you can see where the 953.6 comes from: since the data source supplied to the table uses 80% of all users, 1192 * 0.8 gives me 953.6, which is what the grand total is displaying. Conclusion: when using a thresholded dimension in a table with a metric, there will be a discrepancy, since the grand total value is not coming from the table values but rather from the metric source data, which does not have the table's dimension applied, for some odd reason.
What is the difference between Pentaho DI "variables" and "fields"?
I could not find much information about this. I can see that fields can have a different value in each row of a transformation. But what are variables? Are they unique across all the rows a transformation produces? But then, by the name, variables are meant to vary. What is the difference between fields and variables exactly? Can someone enlighten me, please? Thank you
PDI transformations work with a stream of rows that pass through all the steps. The rows consist of a number of fields that the steps can act on: converting them, filtering them, sorting, etc. Variables are more like a configuration help and have a single value in the transformation. It's very important to remember that they can NOT be set/changed and used within the same transformation, because all the steps execute in parallel!

Example: in your transformation you have a variable called "last_staging_run" and its value is "2017/01/19 05:00:00". This one has been passed to the transformation from the parent job. You then use it in a Table Input (note the quotes, since the variable is substituted as literal text):

SELECT id, product_id, price, number
FROM sales
WHERE purchase_date > '${last_staging_run}'

This will give you the new rows since the last staging run, with the fields id, product_id, price and number. You might then look up the product names or filter out products with a zero price with other steps, then store the result in a table again.
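To make the substitution concrete: PDI replaces the variable reference with its literal text before the statement is sent to the database, so with the value above the database would receive (a sketch, assuming variable substitution is enabled in the step):

SELECT id, product_id, price, number
FROM sales
WHERE purchase_date > '2017/01/19 05:00:00'

A field, by contrast, appears once per row in the stream; a variable resolves to the same single value everywhere it is referenced during that transformation run.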
Build a Kibana Histogram with buckets dynamically created by ElasticSearch terms aggregation
I want to be able to combine the functionality of the Kibana Terms graph (being able to create buckets based on the unique values of a particular attribute) and the Histogram graph (separating data into buckets based on queries and then illustrating the data over time). Overall, I want to create a Histogram, but I only want to create the Histogram based on the results of one query, not multiple queries like it's being done in the Kibana demo app. Instead, I want each bucket to be dynamically created per unique value of my particular field. For example, consider the following data returned by my query:

{"myValueType": "New York"}
{"myValueType": "New York"}
{"myValueType": "New York"}
{"myValueType": "San Francisco"}
{"myValueType": "San Francisco"}

Also assume that each record has a timestamp field for separating histogram data by date. For that particular date, I want the data to be communicated as a count of 3 in the New York bucket and a count of 2 in the San Francisco bucket. However, I am only able to show a count of 5 for my one linked query. When I configure the Histogram, I am able to specify a field to use for my timestamp, but not one to create buckets from. I could have set a field to compute a total/min/max/mean, but that field would have had to be numeric, so that is not the solution either. If I were to use a Terms graph to create a pie or bar graph, I would indeed be able to separate my data into buckets based on the unique values of my specified field (in this case, "myValueType"), but this would total up the data for all time, not split it up by timestamp. Although this is good information to know, it is not ideal, because I wouldn't be able to detect trends in my data. I am looking for a solution that will do one of the following:

Let me dynamically create queries in my Kibana dashboard to create "buckets" in a Histogram.
Allow me to run an Elasticsearch terms aggregation to split up my data into buckets based on "myValueType" and integrate these results into my Histogram.
Customize the JSON of my dashboard, but this doesn't look possible to me.
Create my own custom panel, but this is not desirable.
Link a Kibana "TopN" query in Kibana. Actually, this has proven to be a workaround for my problem, because the TopN query dynamically creates one query per unique value/term from the specified field name. However, the problem is that I can only link one colour to this TopN query, and each unique term will be placed in a bucket that uses a different shade of that colour. Ideally, every bucket in my Histogram would have a completely different colour associated with it. Imagine how difficult it will be to distinguish unique terms as the number of buckets grows.

If all else fails, I'll make one query per unique value of my search field. This will allow me to have one unique colour per bucket, but as the number of unique terms in the "myValueType" field changes, I need to keep adding/removing queries from Kibana, which can get quite messy. I'm sure there is something that I am missing here. Please help me out. Many thanks. A highly related SOF question: Is it Possible to Use Histogram Facet or Its Curl Response in Kibana
This would be a great feature. It looks like it will be supported in Kibana 4, but there doesn't seem to be much more info out there than that. For reference: https://github.com/elasticsearch/kibana/issues/1249
Maybe a little late, but it is actually possible in the newest beta release: Kibana 4 Beta 3 installation download
SSRS: Dynamic number of subreports in a report
It might be that I'm searching with the wrong keywords, but so far I couldn't find anything useful. My problem is quite simple: at the moment I get a list of individual IDs through a report parameter, pass them to a procedure, and show the results. The new request is like this: instead of showing the list for all individuals at once, there should be one list per individual ID. Since I'm quite a beginner in SSRS, I thought the easiest approach would be the best: create a subreport, copy the shown list into it, and create one subreport per individual ID. The number of these IDs is dynamic, so I would have to create a dynamic number of subreports. Funnily enough, this doesn't seem to be possible. This http://forums.asp.net/t/1397645.aspx URL doesn't show exactly my problem, but it shows the limitation of subreports. I even ran through the whole MSDN pages starting at http://technet.microsoft.com/en-us/library/dd220581.aspx, but I couldn't find anything there. So is there a possibility to create a loop like: for each individual ID in the list, create a subreport and pass ONE ID to it? Or is there another approach I should use to make this work? I tried to create a 'fake' dataset with no SQL query, just for iterating over the ID list, but it seems a dataset needs a data source... As usual, thanks so far for all answers! Matthias Müller
"Or is there another approach I should use to make this work?" You didn't provide much detail about what sort of information needs to be included in the subreport, but assuming it's a small amount of data (say, showing a personnel record) and not a huge amount (such as a person's sales for the last year), a List might be the way to go. "I tried to create a 'Fake'-Dataset with no sql query but just for iterating the id list, but it seems the dataset needs a data-source..." All datasets require a data source, though if you're merely hard-coding some fake return data, any data source will do, even a local SQL instance with nothing in it.
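If the ID list arrives as a single comma-separated report parameter, one option for the "iterate the ID list" dataset is to split it in the query itself. A minimal T-SQL sketch, assuming SQL Server 2016+ (for STRING_SPLIT) and a hypothetical @IndividualIds parameter:

-- Turns 'ID1,ID2,ID3' into one row per ID; each row can then drive
-- a List instance that receives a single ID.
SELECT LTRIM(value) AS IndividualId
FROM STRING_SPLIT(@IndividualIds, ',');

Bind the List to this dataset and group it on IndividualId; the detail contents then render once per ID without needing a dynamic number of subreports.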