Merge tables in Power BI - SQL

I have a problem creating a table in Power BI/SQL
Basically, I have a CSV file with a dataset of crime reports from a given year.
Each crime has a date (in 3 columns: day, month, and year), a location (lat and long coordinates), a type of crime, a neighborhood, and so on.
To make things less "dense" I created a few tables (for example, a "Location_ID" table with a PK and the combined lat and long for each ID), and the same for dates, types of crime, neighborhoods, etc.
The thing is that now my main table is empty, and I need to "replace" each piece of data with the aforementioned PK from each new table. For example, crime N°121 happened in Buenos Aires, Argentina (that's "3" in the new ID_LOCATION table) on 4/3/2022 (that's "Z" in the new ID_DATE table), and so on.
I don't know how to reassign every value in these columns to the correct key from the tables I created without doing it manually (there are over 80k entries; it would take forever).
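In SQL, this kind of key reassignment can be done as one set-based UPDATE with a correlated lookup rather than row by row; a minimal sketch using Python's sqlite3 (all table and column names here are illustrative, not from the actual dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema: the main table still holds raw lat/long,
# plus an empty location_id column to be filled from the lookup table.
conn.executescript("""
CREATE TABLE Crime    (crime_id INT, lat REAL, long REAL, location_id INT);
CREATE TABLE Location (location_id INT, lat REAL, long REAL);
INSERT INTO Crime    VALUES (121, -34.6, -58.4, NULL), (122, -31.4, -64.2, NULL);
INSERT INTO Location VALUES (3, -34.6, -58.4), (4, -31.4, -64.2);
""")
# One set-based UPDATE resolves the PK for every row at once --
# 80k rows are handled the same as 2.
conn.execute("""
UPDATE Crime
SET location_id = (SELECT l.location_id
                   FROM Location l
                   WHERE l.lat = Crime.lat AND l.long = Crime.long)
""")
ids = [r[0] for r in conn.execute("SELECT location_id FROM Crime ORDER BY crime_id")]
```

In the Power Query editor, the equivalent is a Merge Queries step joining on the lat/long columns and then expanding only the ID column.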
Thanks in advance

Related

Is it possible to recursively combine similar records (keeping - and adding - only specific columns) using a select?

I've been wracking my brain here trying to figure out a way to achieve a solution to the following without external applications (such as Excel).
I'll set up the structure: We are using a 3rd party ERP that provides a nicely configured conversion system for product packaging types. I'm trying to create a query that will take all conversions for a given product and return them inline. Because the number of conversion records is indeterminate, the query would need to be recursive.
To make things simple, let's use package quantities for this scenario. If a product can be shipped in [eaches, pairs, sets, packages, and cartons], the conversion table records would look something like this:
pkConvKey   fkProdID  childUnit  parentUnit  chPerParent
ConvRec001  Prod123   each       pair        2
ConvRec002  Prod123   pair       set         3
ConvRec003  Prod123   set        pack        7
ConvRec004  Prod123   pack       carton      24
Using the table above, I can determine how many pairs of Prod123 are contained in a carton by following the math:
24 packs x 7 sets x 3 pairs = 504 pairs per carton.
I could further multiply that by 2 to get the count of individual pieces in a carton (1,008). That's the idea behind the conversion table but here's my actual problem.
I'd like to return a table of records where associated conversions are in-line, thusly:
fkProdID  unit1  unit2  qtyInUnit2  unit3  qtyInUnit3  unit4  qtyInUnit4  unit5   qtyInUnit5
Prod123   each   pair   2           set    3           pack   7           carton  24
Complicating the matter is that the unit types are unknown (arbitrary) values and there is no requirement to have a full, intact chain from unit A to unit Z. (For example, there might be a conversion record from each to pair, and another from set to pack, but not one from pair to set).
In this scenario, the select can't recursively link the records, and they would appear in the resulting table as two separate records - which is fine.
I have attempted to join the table to itself on t1.parentUnit = t2.childUnit, but that obviously doesn't work recursively.
I fear my only solution is to left join the table over and over - as many as 20 times in the query, settling for NULL values if additional conversions do not exist but then I would also have many duplicate rows (with incomplete conversion chains) to weed out.
Can this be done in a select query?
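For reference, a recursive CTE can walk the chain in a single SELECT without repeated self-joins; a sketch using Python's sqlite3 with the sample records (SQL Server supports the same pattern, spelled `WITH` rather than `WITH RECURSIVE`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conv (pkConvKey TEXT, fkProdID TEXT, childUnit TEXT,
                   parentUnit TEXT, chPerParent INTEGER);
INSERT INTO conv VALUES
  ('ConvRec001', 'Prod123', 'each', 'pair',   2),
  ('ConvRec002', 'Prod123', 'pair', 'set',    3),
  ('ConvRec003', 'Prod123', 'set',  'pack',   7),
  ('ConvRec004', 'Prod123', 'pack', 'carton', 24);
""")
# Anchor on units that are never a parent (the chain roots), then follow
# child -> parent links, multiplying quantities as we go. Broken chains
# simply start their own walk from a new root, as the question allows.
rows = conn.execute("""
WITH RECURSIVE chain(fkProdID, unit, perBase) AS (
    SELECT fkProdID, childUnit, 1
    FROM conv
    WHERE childUnit NOT IN (SELECT parentUnit FROM conv)
  UNION ALL
    SELECT c.fkProdID, c.parentUnit, chain.perBase * c.chPerParent
    FROM conv AS c
    JOIN chain ON c.childUnit = chain.unit AND c.fkProdID = chain.fkProdID
)
SELECT unit, perBase FROM chain ORDER BY perBase
""").fetchall()
```

This yields each unit with its count per base unit (e.g. a carton holds 2 × 3 × 7 × 24 = 1,008 eaches, matching the arithmetic above); pivoting those rows into the in-line unit1…unit5 layout is then an ordinary PIVOT/conditional-aggregation step.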
Thanks in advance!
-Dan

How to create deltas in BigQuery

I have a table in BQ which I refresh on a daily basis. It's a full snapshot every day.
I have a business requirement to create deltas of that feed.
Table Details :
Table contains 10 columns
Out of 10 columns, 5 columns change on daily basis. How do I identify which columns changed and only create a snapshot for that?
For example, here are the columns in tableA (columns which frequently change are in bold):
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - ebook
last_product_purchase_date - 2018-05-01
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - null
third_product_purchase_date - null
fourth_product - null
fourth_product_purchase_date - null
After more purchase Data will look like this:
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - Hardbook
last_product_purchase_date - 2018-05-17
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - CD
third_product_purchase_date - 2017-01-01
fourth_product - null
fourth_product_purchase_date - null
first_product = first product ever purchased
last_product = most recent product purchased
This is just one row of records for one customer. I have millions of customers with all these columns, and let's say half a million of the rows are updated on a daily basis.
In my delta, I just want the rows where any of the column value changed.
It seems like you have a column for each product bought and its repetition; perhaps this comes from a de-normalized dimensional model. To find the last "update" you would have to compare each column with the previous row using the LEAD/LAG window functions. This would use a lot of computation and might not be optimal.
I recommend using repeated fields. The product and product_purchase_date would be repeated fields, and you could simply query using WHERE product_purchase_date = CURRENT_DATE(), which would use much less computation.
De-normalized dimensional models are meant to reduce computation on traditional data warehouses. BigQuery, being a fast, highly scalable enterprise data warehouse, has plenty of computing power.
To get a better understanding of how BigQuery works under the hood, I recommend reviewing this document.
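For the delta itself, comparing two full snapshots is a set difference; a sketch using Python's sqlite3 with EXCEPT (BigQuery's equivalent is EXCEPT DISTINCT; the snapshot tables are illustrative and trimmed to three columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two full daily snapshots of the same table (illustrative names/columns).
conn.executescript("""
CREATE TABLE snap_yesterday (custid TEXT, last_product TEXT, third_product TEXT);
CREATE TABLE snap_today     (custid TEXT, last_product TEXT, third_product TEXT);
INSERT INTO snap_yesterday VALUES ('ABC', 'ebook', NULL), ('DEF', 'toy', NULL);
INSERT INTO snap_today     VALUES ('ABC', 'Hardbook', 'CD'), ('DEF', 'toy', NULL);
""")
# The delta: rows in today's snapshot with no identical row yesterday,
# i.e. rows that are new or changed in at least one column.
delta = conn.execute("""
SELECT * FROM snap_today
EXCEPT
SELECT * FROM snap_yesterday
""").fetchall()
```

In BigQuery the same shape is `SELECT * FROM today EXCEPT DISTINCT SELECT * FROM yesterday`; rows deleted since yesterday fall out of the reverse difference.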

CUBE dimension reduction with sequential slices

I have data in a cube, organized across 5 axes:
Source (data provider)
GEO (country)
Product (A or B)
Item (Sales, Production, Sales_national)
Date
In short, I have multiple data providers for different Product, Item, GEO and Date, i.e. for different slices of the cube.
Not all "sources" cover all dates, product, countries. Some will have more up to date information, but it will be preliminary.
The core of the problem is to have a synthesis of what all sources say.
Importantly, the choice of data provider for each "slice" of the cube is made by the user/analyst and needs to be so (business knowledge of provider methodology, quality, etc.).
What I am looking for, is a way to create a 'central dictionary' with all the calculation-types.
Such dictionary would be organized like this:
Operation           Source   GEO  Item   Product    Date_start  Date_end
Assign              Source3  ITA  Sales  Product_A  01/01/2016  01/01/2017
Assign              Source1  ITA  Sales  Product_A  01/01/2017  last
Assign with %delta  Source2  ITA  Sales  Product_A  01/01/2018  last
This means:
From Jan 2016 to Jan 2017, for Product A Sales in Italy, take Source 3
From Jan 2017 to the last available date, take Source 1
From Jan 2018 to the last available date, take the existing values and apply the % difference over time from Source 2
The data and calculation are examples, there are other more complex, but the gist of it is putting slices of the "Source" 5-dimensional cube into a "Target" 4-dimensional cube, with a set of sequential calculations.
In SQL, it is the equivalent of a bunch of filtered SELECTs + INSERT, but the complexity of the calculations will probably lead to lots of nested JOINS.
The solution will most likely be custom functions, but I was wondering if anyone is aware of a language or software other than DAX/MDX which would allow doing this with minimal customization?
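As a baseline, the sequential slice-assignment is straightforward in a general-purpose language; a minimal pure-Python sketch of the "Assign" operation over a long-format cube (all names and values are illustrative, and the "%delta" operation is omitted but would follow the same filter-then-write pattern):

```python
from datetime import date

# Toy long-format source cube: (Source, GEO, Item, Product, Date) -> value.
cube = {
    ("Source3", "ITA", "Sales", "Product_A", date(2016, 6, 1)): 100.0,
    ("Source1", "ITA", "Sales", "Product_A", date(2017, 6, 1)): 110.0,
    ("Source1", "ITA", "Sales", "Product_A", date(2018, 6, 1)): 120.0,
}
# Sequential operations mirroring the question's dictionary;
# end=None stands for "last available".
rules = [
    ("Assign", "Source3", date(2016, 1, 1), date(2017, 1, 1)),
    ("Assign", "Source1", date(2017, 1, 1), None),
]

# Target cube has one axis fewer: the Source dimension is resolved away.
target = {}
for op, src, start, end in rules:
    for (source, geo, item, prod, d), v in cube.items():
        if source == src and d >= start and (end is None or d < end):
            target[(geo, item, prod, d)] = v  # later rules overwrite earlier ones
```

Because the rules run in order, a later rule silently wins wherever two slices overlap, which matches the "sequential calculations" semantics described above.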
Many thanks

SSAS & SCD2 - how to deal with IsActive row in Dim

I am using SQL Server 2014 and Visual Studio 2015.
I have an SCD2 for staff names, for example
SK  AltKey  Name           Gender  IsActive
1   15      Sven Svensson  M       1
2   16      Jo Jonsson     M       1
and in the fact table
SK   AgentSK  CallDuration  DateKey
100  1        335           20160808
101  2        235           20160809
So, you can see the cube is currently linked on FctAgentSK and DimSK. This works as planned. However, when Jo changes gender the SCD2 makes the row inactive (0) and inserts a new row with the new gender and IsActive of '1'.
The problem I face is that fact SK 101 still references the 'OLD' details for the agent. How should I deal with this so that I can still report on the call, but also reference the "correct" details of the agent, reflecting their current gender?
When a new fact is inserted it will have the 'NEW' SK assigned, but basically I would need to report on ALL calls that have happened either side of the gender change.
Any suggestions please?
Thank you.
As Nick.McDermaid suggested, if you don't want SCD2 functionality, you could remove it from the dimension design (I've often seen it over-implemented when it's not actually wanted: perhaps you've inherited that kind of setup?).
If you want to, or must, keep the SCD2 design but still report on current staff attributes (gender and any other SCD2 attributes), there are two options:
1. Kimball documents a "Type 6" here: SCD types 0,4,5,6,7. You add a "current" value of each attribute to the existing Type 2 design and then report on the "current" attributes only.
2. I'm assuming that the "AltKey" is the durable staff-member key that stays the same through changes in staff attributes. You could build a slightly different Employee dimension (or a hierarchy inside the Employee dimension) that has AltKey as its leaf-level key. If you don't keep SK as a dimension attribute, the dimension table will "collapse" into one member per AltKey rather than one member per SK. Obviously, you can't add any SCD2 attributes to this AltKey hierarchy, as there won't be a single value per key; this also raises the question of what to call the durable "employee" (i.e. what the name column of the leaf level will be), since Employee Name is itself an obvious SCD2 attribute that will not stay the same. This approach is probably best combined with the underlying "Type 6" inclusion of current values described in (1) above.
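The "current attributes" lookup can also be sketched as a double join through the durable key; an illustrative Python + SQLite example based on the sample rows (a third dimension row is added here to represent Jo's change):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimAgent (SK INT, AltKey INT, Name TEXT, Gender TEXT, IsActive INT);
CREATE TABLE FactCall (SK INT, AgentSK INT, CallDuration INT, DateKey INT);
INSERT INTO DimAgent VALUES
  (1, 15, 'Sven Svensson', 'M', 1),
  (2, 16, 'Jo Jonsson',    'M', 0),   -- historical row after the change
  (3, 16, 'Jo Jonsson',    'F', 1);   -- current row
INSERT INTO FactCall VALUES (100, 1, 335, 20160808), (101, 2, 235, 20160809);
""")
# Resolve each fact to the *current* dimension row: fact -> historical SK
# -> durable AltKey -> the active row for that AltKey.
rows = conn.execute("""
SELECT f.SK, cur.Name, cur.Gender, f.CallDuration
FROM FactCall f
JOIN DimAgent hist ON hist.SK = f.AgentSK
JOIN DimAgent cur  ON cur.AltKey = hist.AltKey AND cur.IsActive = 1
""").fetchall()
```

All calls on either side of the change now report Jo's current gender, while the historical SCD2 rows remain untouched for point-in-time reporting.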

Pig Latin: using field in one table as position value to access data in another table

Let's say we have two tables. The first table has following description:
animal_count: {zoo_name:chararray, counts:()}
The meaning of the "zoo_name" field is obvious. The "counts" field contains counts of each specific animal species. In order to know what exact species a given position in the "counts" tuple represents, we use another table:
species_position : {species:chararray, position:int}
Let's assume we have the following data in the "species_position" table:
"tiger", 0
"elephant", 1
"lion", 2
This data means the first field in animal_count.counts is the number of tigers in a given zoo. The second field in that tuple is the number of elephants, and so on. So, if we want to represent the fact that "san diego zoo" has 2 tigers, 4 elephants, and no lions, we will have the following data in the "animal_count" table:
"san diego zoo", (2, 4, 0)
Given this setup, how can I write a query to extract the number of a given species in all zoos? I was hoping for something like:
FOREACH species_position GENERATE species, animal_count.counts.$position;
Of course, the "animal_count.counts.$position" won't work.
Is this possible without resorting to UDF?
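Outside Pig, the intended positional lookup is trivial to express; a pure-Python sketch with the question's sample data, just to pin down the semantics the query is after:

```python
# Sample data copied from the question: a position table mapping each
# species to its index in the per-zoo counts tuple.
species_position = {"tiger": 0, "elephant": 1, "lion": 2}
animal_count = {"san diego zoo": (2, 4, 0)}

# For each species, pull the count at its position from every zoo's tuple --
# the equivalent of "animal_count.counts.$position" in the attempted query.
counts_by_species = {
    species: {zoo: cnt[pos] for zoo, cnt in animal_count.items()}
    for species, pos in species_position.items()
}
```

In Pig itself, positional projection like `$0` must be a constant known at script-compile time, which is why a field value can't be used as a position without dropping down to a UDF or first flattening the counts tuple into (zoo, position, count) rows and joining on position.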