How to populate all possible combinations of values in columns, using Spark/plain SQL
I have a scenario where my original dataset looks like the one below.
Data:
Country,Commodity,Year,Type,Amount
US,Vegetable,2010,Harvested,2.44
US,Vegetable,2010,Yield,15.8
US,Vegetable,2010,Production,6.48
US,Vegetable,2011,Harvested,6
US,Vegetable,2011,Yield,18
US,Vegetable,2011,Production,3
Argentina,Vegetable,2010,Harvested,15.2
Argentina,Vegetable,2010,Yield,40.5
Argentina,Vegetable,2010,Production,2.66
Argentina,Vegetable,2011,Harvested,10
Argentina,Vegetable,2011,Yield,90
Argentina,Vegetable,2011,Production,9
Bhutan,Vegetable,2010,Harvested,7
Bhutan,Vegetable,2010,Yield,35
Bhutan,Vegetable,2010,Production,5
Bhutan,Vegetable,2011,Harvested,2
Bhutan,Vegetable,2011,Yield,6
Bhutan,Vegetable,2011,Production,3
Now there is a very small country lookup table which lists all possible countries the source data can arrive with; per the output columns below, it contains: US, Argentina, Bhutan, India, Nepal, Bangladesh.
I want the output data to always have a fixed number of columns (this ensures the reporting/visualization tool doesn't receive a varying number of columns with each day's new source-data ingestion, depending on how many distinct countries happen to be present).
So I have to somehow join the source data with the country_lookup CSV and add one column per lookup country, defaulting to F. Every country column is binary, with T and F as the only possible values.
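To sketch what I mean (this is only an illustration, not working code I already have): in Spark SQL, with the source registered as a view named data (the view name is just for the example), the flag columns could be hard-coded CASE expressions, which is exactly what keeps the output width fixed:

-- Illustrative view name: data(Country, Commodity, Year, Type, Amount).
-- One CASE expression per lookup country; non-matching rows default to F.
SELECT
  Country, Commodity, Year, Type, Amount,
  CASE WHEN Country = 'US'         THEN 'T' ELSE 'F' END AS US,
  CASE WHEN Country = 'Argentina'  THEN 'T' ELSE 'F' END AS Argentina,
  CASE WHEN Country = 'Bhutan'     THEN 'T' ELSE 'F' END AS Bhutan,
  CASE WHEN Country = 'India'      THEN 'T' ELSE 'F' END AS India,
  CASE WHEN Country = 'Nepal'      THEN 'T' ELSE 'F' END AS Nepal,
  CASE WHEN Country = 'Bangladesh' THEN 'T' ELSE 'F' END AS Bangladesh
FROM data

(If the lookup ever changes, the CASE list would have to be generated from the country_lookup rows instead of hard-coded.)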
The original dataset above has to be converted into the output below.
Data (for rows where Type is Derived Yield I've left Amount as an unevaluated expression rather than computing it, so you can match each value against the formula further down):
Country,Commodity,Year,Type,Amount,US,Argentina,Bhutan,India,Nepal,Bangladesh
US,Vegetable,2010,Harvested,2.44,T,F,F,F,F,F
US,Vegetable,2010,Yield,15.8,T,F,F,F,F,F
US,Vegetable,2010,Production,6.48,T,F,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
US,Vegetable,2011,Harvested,6,T,F,F,F,F,F
US,Vegetable,2011,Yield,18,T,F,F,F,F,F
US,Vegetable,2011,Production,3,T,F,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
US,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Argentina,Vegetable,2010,Harvested,15.2,F,T,F,F,F,F
Argentina,Vegetable,2010,Yield,40.5,F,T,F,F,F,F
Argentina,Vegetable,2010,Production,2.66,F,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Argentina,Vegetable,2011,Harvested,10,F,T,F,F,F,F
Argentina,Vegetable,2011,Yield,90,F,T,F,F,F,F
Argentina,Vegetable,2011,Production,9,F,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Bhutan,Vegetable,2010,Harvested,7,F,F,T,F,F,F
Bhutan,Vegetable,2010,Yield,35,F,F,T,F,F,F
Bhutan,Vegetable,2010,Production,5,F,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Bhutan,Vegetable,2011,Harvested,2,F,F,T,F,F,F
Bhutan,Vegetable,2011,Yield,6,F,F,T,F,F,F
Bhutan,Vegetable,2011,Production,3,F,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Formula for populating the Amount field for the Derived Yield type:
Derived Amount = (sum of Harvested across all countries flagged T) / (sum of Production across all countries flagged T), with both sums grouped by the Year and Commodity columns.
So the target is to generate every combination of the countries present in the source and, for each combination, sum the respective Harvested and Production values and divide the former by the latter. In the actual scenario any given country can have more than one commodity, but that should not matter, since the summation of Amount is grouped by Commodity and Year.
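For the Derived Yield rows, the rough sketch I have in mind (again Spark SQL, with illustrative view names) first aggregates each country's Harvested and Production sums, then self-joins to build the two-country combinations; three-country combinations follow the same pattern with one more join:

-- One row per country with its Harvested and Production sums.
CREATE OR REPLACE TEMPORARY VIEW totals AS
SELECT Country, Commodity, Year,
       SUM(CASE WHEN Type = 'Harvested'  THEN Amount END) AS harvested,
       SUM(CASE WHEN Type = 'Production' THEN Amount END) AS production
FROM data
GROUP BY Country, Commodity, Year;

-- Every unordered pair of distinct countries; t1.Country < t2.Country
-- keeps each pair exactly once. The T/F columns would use the same
-- CASE pattern as above, testing membership in the pair.
SELECT t1.Country, t1.Commodity, t1.Year,
       'Derived Yield' AS Type,
       (t1.harvested + t2.harvested) / (t1.production + t2.production) AS Amount
FROM totals t1
JOIN totals t2
  ON t1.Commodity = t2.Commodity
 AND t1.Year      = t2.Year
 AND t1.Country   < t2.Country

Since the expected output repeats each combination once per member country, the same result would be emitted a second time with the roles of t1 and t2 swapped, and a final UNION ALL with the flagged base rows gives the fixed-width output.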
Note: users in the frontend can select any combination of countries. The sole reason for doing this in the backend rather than dynamically in the frontend is that AWS QuickSight (our visualization tool), although it can compute sums over selected column filters, doesn't yet support calculations on those derived summed fields. Hence the calculation for every combination of countries has to be pre-populated (a very naive approach) so that the report can serve whatever country selection the user makes.
If you have a better approach than the naive one described in the note, you are most welcome to guide me. I've also posted a question on the same problem without describing my expected approach, so that experts can show a better way to solve this kind of problem; if you want to help with some other technique, here is the link to that question.
Any help will be greatly appreciated.
Related
Insert zeros instead of interpolating with ARIMA_PLUS in BigQuery
I want to do ARIMA_PLUS forecasting on a series of sale records. The problem is that the sale records only contain actual sales. When doing the forecast we need to insert, for every product, the "non-sales": rows with the import column set to zero for every day the product has not been sold. We have two options here: (1) fill the database with those zero rows, which uses a lot of space, or (2) when forecasting with ARIMA_PLUS in BigQuery, tell the model to fill gaps with zeros instead of interpolating (interpolation being the default and seemingly only option). I want to follow the second option, yet I don't see how (I have read the documentation's section on interpolation). The first option could be carried out with a merge, but I would prefer to discard it since it increases the size of the sales table. I have scanned the documentation and haven't seen any solution.
You need to provide an input dataset covering the missing values with the right method for your use case. In other words, the SQL query must solve the interpolation so that the input for the model already contains the expected data. You can, for example, create a query that adds a linear-interpolation (or zero-fill) solution for your use case. So the first approach you mentioned can be solved using that input SQL (rather than adding the data to the source table), and the second approach is not valid in BigQuery, as far as I know. Here you have an example: https://justrocketscience.com/post/interpolation_sql/
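For instance, here is a sketch of such an input query in BigQuery that fills the gaps with zeros rather than interpolating (the table and column names sales, product_id, sale_date and import are assumptions based on the question, and the date range is arbitrary):

-- Assumed source table: sales(product_id, sale_date, import).
WITH calendar AS (
  -- One row per day of the training window.
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-12-31')) AS day
),
products AS (
  SELECT DISTINCT product_id FROM sales
)
SELECT p.product_id,
       c.day AS sale_date,
       COALESCE(s.import, 0) AS import   -- zero on days with no sale
FROM products p
CROSS JOIN calendar c
LEFT JOIN sales s
  ON s.product_id = p.product_id
 AND s.sale_date  = c.day

Feeding this query to CREATE MODEL gives the forecaster the zero rows without ever storing them in the source table.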
Need to divide a DataFrame into various tables using multiple categories and datetime
This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big, clean dataset (29000+ rows, 24 columns). I have to calculate the churn rate based on 4 different categorical columns, and I'm given just one column that contains the subscribers for a given period, plus a date column. My idea for calculating churn is: churn_rate = (sub_start_period - sub_end_period) / sub_start_period * 100. The problem: I don't know how to group the data using these 4 categorical variables, and even if I managed to, I would end up with more than 200 different tables, which I don't believe is a good approach. My goal is to predict the churn rate using the information in the table, but first I have to derive the churn rate from these variables, since it isn't given.
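For reference, in SQL terms the grouped calculation being described would look something like the sketch below; there is no need for 200 separate tables, since one grouped aggregation covers all category combinations at once. All names (subscriptions, cat1..cat4, period_date, subs) are placeholders, and MIN_BY/MAX_BY are as available in engines such as Spark SQL:

-- Assumed table: subscriptions(cat1, cat2, cat3, cat4, period_date, subs).
WITH bounds AS (
  SELECT cat1, cat2, cat3, cat4,
         MIN_BY(subs, period_date) AS subs_start,  -- subs at the earliest date
         MAX_BY(subs, period_date) AS subs_end     -- subs at the latest date
  FROM subscriptions
  GROUP BY cat1, cat2, cat3, cat4
)
SELECT cat1, cat2, cat3, cat4,
       100.0 * (subs_start - subs_end) / subs_start AS churn_rate
FROM bounds

The equivalent pandas approach is a single groupby over the four categorical columns with first/last aggregation on the date-sorted frame.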
Merge two CSVs and collate data
I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example. The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID (e.g. GRAPE, GUM, SEASHELL appear together 340 times; ORANGE and STICK 89 times; etc.). Then I need to see whether there is any change/deviation from the general results when grouped by SUBJECT. The tools I'm familiar with are Excel and SQL, but I also have Power BI and Alteryx at my disposal. Full disclosure: not homework or work, but a volunteer project, hence my unfamiliarity with this kind of data manipulation. Thanks in advance.
An Alteryx solution:
1. Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture); Alteryx will create "Input" tools for you.
2. Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect the "Right_ID" output since it's merely a duplicate of "ID".
3. Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the output fields and add them as a "group by", then add the ID column with a "count".
4. Drag a Browse tool on and connect the Summary's output to the Browse tool's input.
5. Run the workflow.
After all that, click on the Browse tool and you should see what is seen in my screenshot (which shows just the first ten rows of output).
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause. I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated); it should be one-to-many. Then I would add a Calculated Column to the Book1 table to concatenate the related ITEM values, e.g.
Items =
CALCULATE (
    CONCATENATEX (
        DISTINCT ( 'Book2'[ITEM] ),
        'Book2'[ITEM],
        ", ",
        'Book2'[ITEM],
        ASC
    )
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency. Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject slicer. As you will be comparing subsets of varying size, I would change Count of ID to "Show value as % of grand total".
A slightly different solution using Alteryx. With this dataset there are very few repeating 3- or 4-item groups. You can do the two-item affinity analysis and get a probability of 3- or 4-item groups, or you can count the 3- and 4-item groups individually. I believe what you want is the latter, as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not. Anyway, I did not join in the subject until after finding all of the combinations. I found all the combinations by taking the Cartesian join of two, then three, then four copies of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row, and counted occurrences of each combination; more joins can be added in the same pattern to count groups of 5, 6, 7, and so on. Once you have the counts of occurrences, join back with the subjects, perform this analysis on each group, and compare to the overall results. I'm supposed to disclose that I work for Alteryx.
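The same dedupe-by-alphabetical-order idea translates directly to SQL, which the asker already knows. A sketch for pairs, using the table and column names from the sample data (triples follow by adding one more self-join with b.ITEM < c.ITEM):

-- Count how often each unordered pair of items shares an ID.
SELECT a.ITEM AS item1,
       b.ITEM AS item2,
       COUNT(*) AS times_together
FROM Book2 a
JOIN Book2 b
  ON a.ID = b.ID
 AND a.ITEM < b.ITEM   -- alphabetical order: each pair counted once
GROUP BY a.ITEM, b.ITEM
ORDER BY times_together DESC

Joining Book1 on a.ID = Book1.ID and adding SUBJECT to the GROUP BY gives the per-subject breakdown for comparison against the overall results.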
First of all, if you are using Windows, just navigate to the directory that contains the CSVs and run the following command:
copy pattern newfileName.csv
for example:
copy *.csv merged.csv
Now you have created one CSV file. If the file is too large to process at once, use an appropriate method for your programming language: in Python you can use generators to process it line by line, or read it chunk by chunk with pandas. I hope this helps you.
What is the difference between Pentaho DI "variables" and "fields"?
I could not find much information about this. I can see that fields can have multiple copies per row in a transformation. But what are variables? Are they unique across all rows a transformation produces? Yet, by the name, variables are meant to vary. What exactly is the difference between fields and variables? Can someone enlighten me, please? Thank you.
PDI transformations work with a stream of rows that pass through all the steps. The rows consist of a number of fields that the steps can act on: converting them, filtering them, sorting, etc. Variables are more like a configuration help and have a single value in the transformation. It's very important to remember that they can NOT be set/changed and used within the same transformation, because all the steps execute in parallel! Example: in your transformation you have a variable called "last_staging_run" and its value is "2017/01/19 05:00:00". This one has been passed to the transformation from the parent job. You then use it in a Table Input:
SELECT id, product_id, price, number
FROM sales
WHERE purchase_date > ${last_staging_run}
This will give you the new rows since the last staging run, with the fields id, product_id, price and number. You might then look up the product names or filter out products with a zero price in other steps, then store the result in a table again.
Modeling a product pricing structure
I need to model a rather complex pricing structure for some of our products; today we look up the prices manually. Here's a picture with explanations of the "matrix" we use today: Sample model (sorry for the link, but I'm not allowed to post images because I've just opened my account). Now I need to transfer this model to an RDBMS (SQL Server 2008 R2). The entry point when looking up a price is the Category, then the yearly interval, and finally the interval that depends on how many products we're selling on the order. The result of the query should be two prices. Do you have any suggestions on how to model this? I was thinking of modeling it as a matrix with a RowNumber, CellNumber and a CellValue, but then I'd need another table describing what is contained in each cell (by referencing the row and cell numbers); if doing that, I could just include the prices in that description table. That doesn't seem like the best solution. Do you have any hints on how to model this problem best?
I think I would make something like this: categories are separated into their own table. Each row in the Prices table is uniquely identified by the category and the starting points of the sold and shipped ranges. I don't think you need to specify the ending point in the table, since the end point of a range is the starting point of the next range minus one. Edit: with this model you will need to add a row in the Prices table for each combination of category, units-sold interval and units-shipped interval; right now I can't think of an easier way.
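A sketch of that model in T-SQL (all table, column and parameter names are my assumptions). Because every interval combination has a row, the lookup just picks the greatest starting points that don't exceed the requested quantities:

CREATE TABLE Categories (
    CategoryId INT IDENTITY PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL
);

CREATE TABLE Prices (
    CategoryId  INT NOT NULL REFERENCES Categories (CategoryId),
    SoldFrom    INT NOT NULL,   -- start of the units-sold range
    ShippedFrom INT NOT NULL,   -- start of the units-shipped range
    Price1      DECIMAL(10, 2) NOT NULL,
    Price2      DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (CategoryId, SoldFrom, ShippedFrom)
);

-- Look up the two prices for a given category and quantities.
-- Correct as long as the intervals form a full grid, i.e. every
-- (SoldFrom, ShippedFrom) combination exists per category.
SELECT TOP 1 Price1, Price2
FROM Prices
WHERE CategoryId  = @CategoryId
  AND SoldFrom    <= @UnitsSold
  AND ShippedFrom <= @UnitsShipped
ORDER BY SoldFrom DESC, ShippedFrom DESC;

The yearly interval from the original question could be added as a third "YearFrom" starting-point column following the same pattern.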