How do I write a SQL script that adds a new column to a dataset and fills in a subcategory for each item that matches another table? - google-bigquery

My business task is to match our sales data with subcategories based on UPCs in our master spreadsheet. That way, we can make better data viz comparing product sales within each subcategory. What is the best way to approach this problem?
I have four years' worth of sales data that carries product name, quantity sold, and main category per line. It does not have subcategories. I view the data through Connected Sheets, generate a pivot table with the desired outputs in the rows and columns, and then use a FILTER/VLOOKUP function in Google Sheets to look up each product's subcategory by its UPC. Because the data set is huge, I am running into problems: no subcategory is returned if the item is discontinued or if the master list contains duplicate entries for a product.
I think it would be more efficient to create a table of our items from the master spreadsheet and load it into BigQuery, so that each UPC has a subcategory matched to it. From there, I'd write a query that adds a new column to the sales data and populates the subcategory IF the UPC on the sales line matches a UPC in the table I've added.
Does that make sense? How would I go about doing that? What would the SQL look like to perform such a task?
I haven't started the task yet, still exploring options on BigQuery and/or Connected Sheets. Any help would be appreciated!
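The approach described above (a UPC-to-subcategory table joined onto the sales lines) can be sketched as follows. This is only a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for BigQuery; the table names (sales, master), the column names and the sample rows are all hypothetical. The query uses only standard SQL (WITH, GROUP BY, LEFT JOIN, COALESCE), so the same shape should carry over to BigQuery with dataset-qualified table names.

```python
import sqlite3

# In-memory database standing in for BigQuery; all names and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales  (upc TEXT, product_name TEXT, qty INTEGER, category TEXT);
    CREATE TABLE master (upc TEXT, subcategory TEXT);
    INSERT INTO sales VALUES
        ('111', 'Cola 12oz',    5, 'Beverages'),
        ('222', 'Potato Chips', 2, 'Snacks'),
        ('333', 'Old Candy',    1, 'Snacks');   -- discontinued: not in master
    INSERT INTO master VALUES
        ('111', 'Soda'),
        ('111', 'Soda'),                        -- duplicate row in the master list
        ('222', 'Chips');
""")

query = """
    WITH master_dedup AS (
        -- Collapse duplicates so each UPC yields exactly one subcategory.
        SELECT upc, MIN(subcategory) AS subcategory
        FROM master
        GROUP BY upc
    )
    SELECT s.upc, s.product_name, s.qty, s.category,
           COALESCE(m.subcategory, 'Unknown') AS subcategory
    FROM sales AS s
    LEFT JOIN master_dedup AS m ON m.upc = s.upc   -- keeps unmatched sales rows
    ORDER BY s.upc
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps sales lines whose UPC is missing from the master list (for example discontinued items), and the de-duplication step stops duplicate master rows from multiplying sales lines. Picking MIN(subcategory) is an arbitrary tie-break chosen for this sketch; conflicting duplicates would need a real business rule.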

Related

How to populate all possible combinations of values in columns, using Spark/normal SQL

I have a scenario where my original dataset looks like this:
Data:
Country,Commodity,Year,Type,Amount
US,Vegetable,2010,Harvested,2.44
US,Vegetable,2010,Yield,15.8
US,Vegetable,2010,Production,6.48
US,Vegetable,2011,Harvested,6
US,Vegetable,2011,Yield,18
US,Vegetable,2011,Production,3
Argentina,Vegetable,2010,Harvested,15.2
Argentina,Vegetable,2010,Yield,40.5
Argentina,Vegetable,2010,Production,2.66
Argentina,Vegetable,2011,Harvested,10
Argentina,Vegetable,2011,Yield,90
Argentina,Vegetable,2011,Production,9
Bhutan,Vegetable,2010,Harvested,7
Bhutan,Vegetable,2010,Yield,35
Bhutan,Vegetable,2010,Production,5
Bhutan,Vegetable,2011,Harvested,2
Bhutan,Vegetable,2011,Yield,6
Bhutan,Vegetable,2011,Production,3
There is also a very small country lookup table that lists every country the source data can contain (the expected output below has columns for US, Argentina, Bhutan, India, Nepal and Bangladesh).
I want the output data to always have a fixed number of columns, so that the reporting/visualization tool doesn't receive a varying number of columns with each day's ingestion, depending on how many distinct countries are present in the source data.
So I have to somehow join the source data with the country_lookup csv and populate all of those columns with a default value of F. Every country column is binary, with T or F as the possible values.
The original dataset above has to be converted into the following:
Data (for rows whose Type is Derived Yield, I've left the Amount field as an unevaluated expression rather than calculating it, so that you can match it against the formula):
Country,Commodity,Year,Type,Amount,US,Argentina,Bhutan,India,Nepal,Bangladesh
US,Vegetable,2010,Harvested,2.44,T,F,F,F,F,F
US,Vegetable,2010,Yield,15.8,T,F,F,F,F,F
US,Vegetable,2010,Production,6.48,T,F,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
US,Vegetable,2011,Harvested,6,T,F,F,F,F,F
US,Vegetable,2011,Yield,18,T,F,F,F,F,F
US,Vegetable,2011,Production,3,T,F,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
US,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Argentina,Vegetable,2010,Harvested,15.2,F,T,F,F,F,F
Argentina,Vegetable,2010,Yield,40.5,F,T,F,F,F,F
Argentina,Vegetable,2010,Production,2.66,F,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Argentina,Vegetable,2011,Harvested,10,F,T,F,F,F,F
Argentina,Vegetable,2011,Yield,90,F,T,F,F,F,F
Argentina,Vegetable,2011,Production,9,F,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Bhutan,Vegetable,2010,Harvested,7,F,F,T,F,F,F
Bhutan,Vegetable,2010,Yield,35,F,F,T,F,F,F
Bhutan,Vegetable,2010,Production,5,F,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Bhutan,Vegetable,2011,Harvested,2,F,F,T,F,F,F
Bhutan,Vegetable,2011,Yield,6,F,F,T,F,F,F
Bhutan,Vegetable,2011,Production,3,F,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Formula for populating the Amount field for the Derived type:
Derived Amount = (sum of Harvested over all countries flagged T, grouped by Year and Commodity) divided by (sum of Production over all countries flagged T, grouped by Year and Commodity).
So the target is to build every combination of the countries in the source and, for each one, compute the sums of the respective Harvested and Production values and divide them. In the actual scenario a country can have more than one commodity, but that should not matter, because the amounts are summed per commodity and year.
Note: Users in the frontend can select any combination of countries. The sole reason for doing this in the backend rather than dynamically in the frontend is that AWS QuickSight (our visualisation tool) can compute sums over the selected column filters but doesn't yet support calculations on those derived sums. Hence the calculation for every combination of countries has to be pre-populated (a very naive approach) to make it available in the report when users select countries dynamically.
Also, if you have a better approach than the naive one mentioned in the note, you are most welcome to guide me. I've also posted a question on the same problem without describing my expected approach, so that experts can show me a better way to solve this kind of problem. If you want to help solve it with some other technique, here is the link to that question.
Any help would be greatly appreciated.
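To make the Derived Yield formula concrete, here is a minimal plain-Python sketch (not Spark) that enumerates every combination of two or more countries present in a (commodity, year) group and computes sum(Harvested) / sum(Production) for it. It uses only the 2010 Harvested and Production rows from the sample data. As a simplification it emits one row per combination, whereas the expected output above repeats the row once for each member country. The variable names are illustrative assumptions.

```python
from itertools import combinations

# 2010 sample rows: (country, commodity, year, type, amount)
rows = [
    ("US", "Vegetable", 2010, "Harvested", 2.44),
    ("US", "Vegetable", 2010, "Production", 6.48),
    ("Argentina", "Vegetable", 2010, "Harvested", 15.2),
    ("Argentina", "Vegetable", 2010, "Production", 2.66),
    ("Bhutan", "Vegetable", 2010, "Harvested", 7.0),
    ("Bhutan", "Vegetable", 2010, "Production", 5.0),
]
# Fixed column order taken from the country lookup table.
lookup = ["US", "Argentina", "Bhutan", "India", "Nepal", "Bangladesh"]

# Index amounts as (commodity, year) -> country -> type -> amount.
groups = {}
for country, commodity, year, kind, amount in rows:
    groups.setdefault((commodity, year), {}).setdefault(country, {})[kind] = amount

derived = []
for (commodity, year), by_country in groups.items():
    present = [c for c in lookup if c in by_country]
    # Every subset of size >= 2 of the countries that actually have data.
    for size in range(2, len(present) + 1):
        for combo in combinations(present, size):
            harvested = sum(by_country[c]["Harvested"] for c in combo)
            production = sum(by_country[c]["Production"] for c in combo)
            # One T/F flag per lookup country, so the column count stays fixed.
            flags = ["T" if c in combo else "F" for c in lookup]
            derived.append((commodity, year, "Derived Yield",
                            harvested / production, *flags))

for row in derived:
    print(row)
```

Note that the number of combinations grows as 2^n in the number of countries with data, so pre-populating every combination is only practical while that number stays small.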

DAX Calculate function to sumif many values across data sources

I've been able to get the CALCULATE/SUM/FILTER/EARLIER pattern to work on items in a single data table, but I need to extend these values into a separate table. Here is the situation:
There are two data sources, PoList and MinList. PoList is a list of purchase orders and open units by item barcode. MinList is a list of every store/barcode combination. I need to see total open units per barcode in MinList. The problem is that each barcode appears many times on PoList, since each barcode is on many orders. Here was my approach:
1) Create a new column on PoList to calculate the total units per barcode, which I did successfully using
= CALCULATE(SUM('PoList'[Open Units]), FILTER('PoList', 'PoList'[Barcode] = EARLIER ('PoList'[Barcode])))
2) Then pull the first instance of the [Open Units] into MinList for each Barcode value (like a VLOOKUP, but I can't create a relationship since there are multiple instances of each barcode on PoList).
Notes:
- I'm trying to avoid creating a PivotTable of PoList to eliminate duplicates. Since I can't subsequently use that PivotTable as a data source, I would be creating something that would need to be updated manually.
Any thoughts? Again, I can get it to work in a single data source; I just need that aggregate value to be calculated in a different data source from the one being analyzed to calculate the sum.
Thanks in advance!
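The logic being asked for (total the open units per barcode in PoList, then look that total up for each MinList row) can be illustrated with a small language-neutral sketch. This is Python, not DAX, and the sample rows and field names are made up; it only shows that aggregating first removes the need for a one-to-one relationship between the two lists.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two data sources.
po_list = [
    {"order": "PO-1", "barcode": "A100", "open_units": 10},
    {"order": "PO-2", "barcode": "A100", "open_units": 5},
    {"order": "PO-3", "barcode": "B200", "open_units": 7},
]
min_list = [
    {"store": "S1", "barcode": "A100"},
    {"store": "S2", "barcode": "A100"},
    {"store": "S1", "barcode": "C300"},  # barcode with no open orders
]

# Step 1: aggregate PoList so that each barcode appears exactly once.
totals = defaultdict(int)
for po in po_list:
    totals[po["barcode"]] += po["open_units"]

# Step 2: look the total up for every store/barcode row (0 when there is no order).
for row in min_list:
    row["open_units"] = totals.get(row["barcode"], 0)

for row in min_list:
    print(row)
```

Because the lookup runs against the aggregated totals rather than the raw order lines, duplicate barcodes on PoList no longer matter.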

Database design and composite primary key

I have a table named minibar_bill and I use it to keep a record of clients' expenditure. I'm trying to build a hotel/pension management system.
I thought that I could make a table
Minibar_bill with (id_bill, id_minibar_product, id_client)
And I would like to add that info to an invoice based on bill_id...
How should I do it?
I mean I want to have something like this:
Id_bill(1)
id_minibar_product(1,2,3)
id_client(123)
So the first 3 records will be:
1, 1, 123
1, 2, 123
1, 3, 123
And I want the id_bill to be on the invoice... maybe I could switch id_product with id_bill
where id_bill(1) would be the first bill record in the database,
id_minibar_product(1,2,3) would be products 1, 2 and 3, which have been consumed by the client, and
id_client(123) is the client id which we use on the invoice to collect data from the Client table in order to print it on the invoice (I will use C# for the UI).
What I have tried:
I've tried to make a db with fields id_bill and id_product, but I think it's the wrong approach, since I made them a composite primary key and I cannot add them as a foreign key in the Invoice table.
Here are some suggestions for your design:
It's a good idea to name things descriptively, but if you create a table called Minibar_bill, that's going to be inconsistent and short-sighted if you want to start charging in-room movies, in-room dining, services etc. to the room. I suggest you call it something more generic: remove Minibar from all of your table names.
You must never put comma-separated values into a single field.
There are a million sales data models online, including, as already suggested, templates in MS Access. There's no point reinventing the wheel.
I suggest you have something like this:
Client: a list of clients
Products: a list of products you can be billed for (not just minibar)
Bill: a client has zero or more bills (usually one)
BillLine: a bill has zero or more lines; each line represents one product being charged for on a bill
So Bill is the header. It's up to you whether you add a column indicating when / if it is invoiced, paid etc., or whether you want to create a separate invoicing module.
With regards to this comment:
What i wish for is to link Invoice to minibar_bill in order to have the status on a single Invoice of all products from minibar which have been bought by a customer.
If you have a separate invoice table you can write the BillID to it to link it.
I'm not sure whether you understand that all this info exists across different tables and that when, for example, you print an invoice, you collect all the info from across the tables at that time.
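The header/line structure suggested above can be sketched like this. It is only an illustration using Python's built-in sqlite3 module; the column names (ClientID, BillID, BillLineID, ProductID, Price) and the sample rows are assumptions made for the sketch, not part of the original suggestion. The query at the end shows the "collect the info from across the tables at print time" step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE Client   (ClientID  INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Product  (ProductID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
                           Price REAL NOT NULL);
    -- Bill is the header: one row per bill, with a single-column key that an
    -- Invoice table could reference.
    CREATE TABLE Bill     (BillID INTEGER PRIMARY KEY,
                           ClientID INTEGER NOT NULL REFERENCES Client(ClientID));
    -- BillLine: one row per product charged on a bill.
    CREATE TABLE BillLine (BillLineID INTEGER PRIMARY KEY,
                           BillID    INTEGER NOT NULL REFERENCES Bill(BillID),
                           ProductID INTEGER NOT NULL REFERENCES Product(ProductID));
    INSERT INTO Client  VALUES (123, 'Example Guest');
    INSERT INTO Product VALUES (1, 'Water', 2.0), (2, 'Beer', 4.0), (3, 'Peanuts', 3.0);
    INSERT INTO Bill    VALUES (1, 123);
    INSERT INTO BillLine (BillID, ProductID) VALUES (1, 1), (1, 2), (1, 3);
""")

# Gather everything needed to print the invoice for bill 1.
invoice_lines = conn.execute("""
    SELECT b.BillID, c.Name, p.Name, p.Price
    FROM Bill AS b
    JOIN Client   AS c ON c.ClientID  = b.ClientID
    JOIN BillLine AS l ON l.BillID    = b.BillID
    JOIN Product  AS p ON p.ProductID = l.ProductID
    WHERE b.BillID = ?
    ORDER BY l.BillLineID
""", (1,)).fetchall()
total = sum(price for _, _, _, price in invoice_lines)
print(invoice_lines, total)
```

With this layout the client id is stored once on the bill header rather than on every line, and BillID alone identifies the bill, which sidesteps the composite-key problem from the question.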

Searching through data with multiple conditions VBA/Excel

I have a list of data with columns indicating the test a product underwent and the product itself. Each product undergoes several tests, for example hot, medium and cold. The data for a specific product may look like
A B
hot product1
medium product1
cold product1
I have many products that underwent testing, so the spreadsheet is extensive (A400 = cold, B400 = productX). What I am trying to do is see whether each product underwent the hot, medium and cold testing. I made an additional column to eliminate repeated product listings and tried to search the spreadsheet for the tests (no success). The end goal is to create an additional column with all the parts that did not go through all of the testing.
Make four new columns with these formulas:
C: COUNTIFS(B:B,$B1,A:A,"hot")
D: COUNTIFS(B:B,$B1,A:A,"medium")
E: COUNTIFS(B:B,$B1,A:A,"cold")
These will show you how many times each of those products has been tested in the respective category. Then, use this in the last column:
F: IF(AND($C1>0,$D1>0,$E1>0),"",$B1)
This will account for items that were tested in a category more than once. If a product has a category it was not tested in, the formula will show that product's name. There might be multiple values in this column.
EDIT: If you absolutely had to have only one column, you could easily combine these together into a superformula.
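For comparison, the same check (does every product have at least one hot, one medium and one cold test?) can be written outside the spreadsheet. This is a small Python sketch with made-up sample rows, not VBA; each tuple mirrors columns A (test) and B (product).

```python
REQUIRED = {"hot", "medium", "cold"}

# (test, product) pairs mirroring columns A and B; sample data is hypothetical.
rows = [
    ("hot", "product1"), ("medium", "product1"), ("cold", "product1"),
    ("hot", "product2"), ("cold", "product2"), ("hot", "product2"),
]

# Collect the set of tests seen for each product; repeated tests collapse.
seen = {}
for test, product in rows:
    seen.setdefault(product, set()).add(test)

# Keep only the products missing at least one required test.
incomplete = {
    product: sorted(REQUIRED - tests)
    for product, tests in seen.items()
    if not REQUIRED <= tests
}
print(incomplete)
```

Using a set per product plays the same role as the ">0" comparisons in the formulas: repeated tests do not matter, only whether each category appears at least once.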

Modeling products pricing structure

I need to model a rather complex pricing structure for some of our products.
Today we lookup the prices manually. Here's a picture with explanations of the "matrix" that we use today: Sample model (sorry for the link - but I'm not allowed to post images because I've just opened my account.)
Now I need to transfer this model to an RDBMS (SQL Server 2008 R2). The entry point when looking up a price is the Category, then the yearly interval and finally the interval that depends on how many products we're selling on this order. The result of the query should be two prices.
Do you have any suggestions on how to model this? I was thinking of modeling it as a matrix with a RowNumber, CellNumber and a CellValue. But then I need another table for describing what is contained in each cell (by referencing the row and cell numbers). If doing that, I could just include the prices in that description table. But that doesn't seem like the best solution.
Do you have any hints/solutions on how to model this problem the best way?
I think I would make something like this:
Categories are separated into their own table.
Each row in the Prices table is uniquely identified by the category and the starting points of the sold and shipped ranges. I don't think you need to store the ending point in the table (since the end of a range should be the starting point of the next range minus one).
Edit: With this model, you will need to add a row in the Prices table for each combination of category, units-sold interval and units-shipped interval, but right now I can't think of an easier way.
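A minimal sketch of this model, using Python's built-in sqlite3 module rather than SQL Server, might look like the following. The column names (SoldFrom, ShippedFrom, PriceA, PriceB) and all values are assumptions made for illustration. The lookup relies on the point made above that only range starting points are stored: the matching row is the one with the largest starting points that do not exceed the requested quantities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Prices (
        CategoryID  INTEGER NOT NULL REFERENCES Categories(CategoryID),
        SoldFrom    INTEGER NOT NULL,   -- start of the units-sold interval
        ShippedFrom INTEGER NOT NULL,   -- start of the units-shipped interval
        PriceA      REAL NOT NULL,
        PriceB      REAL NOT NULL,
        PRIMARY KEY (CategoryID, SoldFrom, ShippedFrom)
    );
    INSERT INTO Categories VALUES (1, 'Example category');
    -- One row per combination of category, sold interval and shipped interval.
    INSERT INTO Prices VALUES
        (1,    0,  1, 10.0, 12.0), (1,    0, 11, 9.0, 11.0),
        (1, 1000,  1,  8.0, 10.0), (1, 1000, 11, 7.0,  9.0);
""")

def lookup_prices(category_id, units_sold, units_shipped):
    """Return (PriceA, PriceB) for the intervals containing the given quantities."""
    return conn.execute("""
        SELECT PriceA, PriceB
        FROM Prices
        WHERE CategoryID = ? AND SoldFrom <= ? AND ShippedFrom <= ?
        ORDER BY SoldFrom DESC, ShippedFrom DESC
        LIMIT 1
    """, (category_id, units_sold, units_shipped)).fetchone()

print(lookup_prices(1, 500, 5))
print(lookup_prices(1, 1500, 20))
```

The ORDER BY ... LIMIT 1 lookup is only correct because a row exists for every combination of intervals, as noted in the edit above. On SQL Server the same query would use TOP 1 instead of LIMIT 1.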