Applying a left merge when loading data from BigQuery - sql

I have an input table: input and one or more maptables, where input contains data for multiple identifiers and dates stacked under each other. The schemas are as follows:
#input
Id: string (might contain empty values)
Id2: string (might contain empty values)
Id3: string (might contain empty values)
Date: datetime
Value: number
#maptable_1
Id: string
Id2: string
Target_1: string
#maptable_2
Id3: string
Target_2: string
What I do now is run a pipeline that, for each date/(id, id2, id3) combination, loads the data from input and applies a left merge in Python against one or more maptables (both as DataFrames). I then stream the results to a third table named output with the schema:
#output
Id: string
Id2: string
Id3: string
Date: datetime
Value: number
Target_1: string (from maptable_1)
Target_2: string (from maptable_2)
Target_x: ...
Now I was thinking that this is not really efficient: if I change one value in a maptable, I have to rerun the pipeline for every date/(id, id2, id3) combination.
Therefore I was wondering if it's possible to apply a left merge directly when loading the data. What would such a query look like?
In the case of multiple maptables and target columns, would it also be beneficial to do the same? Would the query not become too complex or unreadable, in particular since the id columns are not the same?

What would such a query look like?
Below is for BigQuery Standard SQL
INSERT `project.dataset.output`
SELECT *
FROM `project.dataset.input` i
LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2)
LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)
In the case of multiple maptables and target columns ...
If all your map tables are the same as or similar to the two maps in your question, it is just an extra LEFT JOIN for each extra map table.
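To see what the single-query approach produces, here is a small sketch using Python's stdlib sqlite3 module. The table names follow the question's schemas, the sample values are invented, and SQLite's LEFT JOIN ... USING semantics are close enough to BigQuery's for this illustration:

```python
# Demo of the answer's single-query approach with stdlib sqlite3.
# Sample values are made up; schemas follow the question.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE input      (Id TEXT, Id2 TEXT, Id3 TEXT, Date TEXT, Value REAL);
CREATE TABLE maptable_1 (Id TEXT, Id2 TEXT, Target_1 TEXT);
CREATE TABLE maptable_2 (Id3 TEXT, Target_2 TEXT);

INSERT INTO input      VALUES ('a', 'b', 'c',  '2020-01-01', 1.0),
                              ('a', 'b', NULL, '2020-01-02', 2.0);
INSERT INTO maptable_1 VALUES ('a', 'b', 't1');
INSERT INTO maptable_2 VALUES ('c', 't2');
""")

rows = con.execute("""
    SELECT i.*, m1.Target_1, m2.Target_2
    FROM input i
    LEFT JOIN maptable_1 m1 USING (Id, Id2)
    LEFT JOIN maptable_2 m2 USING (Id3)
""").fetchall()

# The second row keeps its input columns but gets Target_2 = NULL,
# because its Id3 is NULL and NULL never matches in a join.
for row in rows:
    print(row)
```

Note how the left joins preserve every input row even where an id column is empty, which is exactly the pandas left-merge behavior the pipeline currently implements per date.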


Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that could have values or be null, and which I can't nest for now. So the goal is to transform that column into an array/record... that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes referred to as univ, which are composed of sub-universes referred to as suniv. Sub-universes contain families referred to as fam, which may or may not have sub-families referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor: the amount of time he/she spent on particular pages, actions taken, and so on. The resulting table gives me the sum of time spent on those pages, the sum of the total number of visits for a single day, and a breakdown of which category it belongs to, thus the univ, suniv, fam and sfam columns, which are of type string (the sfam could be null, since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I post my code and problem in hope of finding guidance and a solution!
A simpler query would be:
select
  prd.key_web
  , dat_log
  , prd.nrb_fp
  , prd.tps_fp
  , prd.univ
  , prd.suniv
  , prd.fam
  , prd.sfam
from product as prd
left join cart as cart
  on prd.key_web = cart.key_web
  and prd.dat_log = cart.dat_log
  and prd.univ = cart.univ
  and prd.suniv = cart.suniv
  and prd.fam = cart.fam
  and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns in text and images:
Again, I want to get sfam as a column of arrays where I have all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row, key_web and dat_log. Each fam is composed of several sfam or none (null). I want to be able to do nesting on either the fam or sfam.
I want to get sfam as a column of arrays where I have all the string values of sfam, even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get the error: Array cannot have a null element; error in writing field arr1
Whereas if you try the following
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:
Row   arr.val1   arr.val2
1     a          x
      b          null
      null       null
As you can see, with this approach you can even have both elements as NULL.
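The OFFSET join above simply pairs the two arrays element by element. Purely as an illustration (Python lists have no trouble holding None, unlike BigQuery output arrays), the pairing is a zip of the two lists, with each tuple playing the role of the STRUCT that is allowed to carry NULL fields:

```python
# Pairing the two arrays by position, keeping the NULL (None) elements
# inside the resulting "structs" (tuples), as ARRAY_AGG(STRUCT(...)) does.
arr1 = ["a", "b", None]
arr2 = ["x", None, None]

arr = list(zip(arr1, arr2))
print(arr)  # [('a', 'x'), ('b', None), (None, None)]
```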

How to map each parameter in firebase analytics sql to a separate column?

We use Firebase Analytics and BigQuery to run SQL queries on collected data. This is turning out to be complex, as some fields like event_params are repeated records. I want to map each of these repeated fields to a separate column.
I want to write queries in the above dataset like finding the difference between minIso and maxIso. How can I define a UDF or a view which can return me the table in the column schema?
I want to map each of these repeated fields to a separate column.
Going in the direction of pivoting parameters into columns is conceptually doable, but (in my strong opinion) a "dead end" in most practical cases.
There are many posts here on SO showing how to pivot/transpose rows to columns, and the patterns are: 1) you hardcode all possible keys in your query (and obviously no one likes this), or 2) you create a utility query that extracts all keys for you and constructs the needed query, which you then need to execute. So either you do it manually in two steps, or you use a client of your choice to script those two steps to run in an automated way.
As I mentioned, there are plenty of examples of this here on SO.
I want to write queries in the above dataset like finding the difference between minIso and maxIso
If all you need is to do some math with a few parameters in the record, see the example below.
Dummy Example: for each app_instance_id, find the diff between coins_awarded and xp_awarded
#standardSQL
SELECT user_dim.app_info.app_instance_id, ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
with result as
Row app_instance_id event_dim.name event_dim.diff_awarded
1 02B6879DF2639C9E2244AD0783924CFC round_completed 226
2 02B6879DF2639C9E2244AD0783924CFC round_completed 171
3 0DE9DCDF2C407377AE3E779FB05864E7 round_completed 25
...
Dummy Example: leave the whole user_dim intact but replace event_dim with just the calculated values
#standardSQL
SELECT * REPLACE(ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim)
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
This is turning out to be complex as some fields like event_params are repeated records. I want to map each of these repeated fields to a separate column.
I hope that from the above examples you can see how simple it really is to deal with repeated fields. I do recommend learning and practicing work with arrays to gain long-term benefits, rather than looking for what [wrongly] looks like a shortcut.
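For readers more comfortable outside SQL, the correlated subqueries in the answer's queries amount to a keyed lookup inside each event's repeated params. The event layout below is a simplified, hypothetical stand-in for the Firebase export schema (real exports nest this more deeply):

```python
# Rough Python analogue of:
#   (SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = '...')
# applied per event, then subtracting two keyed values.
events = [
    {"name": "round_completed",
     "params": [{"key": "coins_awarded", "int_value": 250},
                {"key": "xp_awarded",    "int_value": 24}]},
    {"name": "session_start", "params": []},
]

def param(event, key):
    # First matching key's value, or None if the key is absent
    return next((p["int_value"] for p in event["params"] if p["key"] == key), None)

diffs = [param(e, "coins_awarded") - param(e, "xp_awarded")
         for e in events if e["name"] == "round_completed"]
print(diffs)  # [226]
```

The point is the same as in the SQL: the repeated record is consumed in place, with no need to pivot every key into its own column first.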

Transposing JSON array values to rows in Stream Analytics yields no output

I'm streaming JSON input from blob storage. Most data in the JSON is stored as name/value pairs in an array. I need to send each input as a single output where each name/value pair is transposed to a column in the output. I have code that works when using the "Test" feature while editing the query. However when testing live, only the debugblob1 output receives data.
Why would the live test work differently from the query test? Is there a better way to transpose array data to columns?
Note: The array's name/value pairs are always the same, though I don't want a solution that depends on their order always being the same, since that is out of my control.
QUERY
-- Get one row per input and array value
WITH OneRowPerArrayValue AS
(SELECT
INPUT.id AS id,
ARRAYVALUE.ArrayValue.value1 AS value1,
ARRAYVALUE.ArrayValue.value2 AS value2
FROM
[inputblob] INPUT
CROSS APPLY GetElements(INPUT.arrayValues) as ARRAYVALUE),
-- Get one row per input, transposing the array values to columns.
OneRowPerInput AS
(SELECT
INPUT.id as id,
ORPAV_value1.value1 as value1,
ORPAV_value2.value2 as value2
FROM
[inputblob] INPUT
left join OneRowPerArrayValue ORPAV_value1 ON ORPAV_value1.id = INPUT.id AND ORPAV_value1.value1 IS NOT NULL AND DATEDIFF(microsecond, INPUT, ORPAV_value1) = 0
left join OneRowPerArrayValue ORPAV_value2 ON ORPAV_value2.id = INPUT.id AND ORPAV_value2.value2 IS NOT NULL AND DATEDIFF(microsecond, INPUT, ORPAV_value2) = 0
WHERE
-- This is so that we only get one row per input, instead of one row per input multiplied by number of array values
ORPAV_value1.value1 is not null)
SELECT * INTO debugblob1 FROM OneRowPerArrayValue
SELECT * INTO debugblob2 FROM OneRowPerInput
DATA
{"id":"1","arrayValues":[{"value1":"1"},{"value2":"2"}]}
{"id":"2","arrayValues":[{"value1":"3"},{"value2":"4"}]}
See my generic example below. I believe this is what you're asking: you have a JSON object that contains an array of JSON objects.
WITH MyValues AS
(
SELECT
arrayElement.ArrayIndex,
arrayElement.ArrayValue
FROM Input as event
CROSS APPLY GetArrayElements(event.<JSON Array Name>) AS arrayElement
)
SELECT ArrayValue.Value1, CAST(ArrayValue.Value2 AS FLOAT) AS Value
INTO Output
FROM MyValues
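Independent of the Stream Analytics plumbing, the transposition the question asks for is easy to state in plain Python, using the two sample events from DATA above. Each arrayValues element carries one name/value pair, and they are folded into a single flat row per input without relying on element order:

```python
# Collapse each input's arrayValues (one key per element) into one flat
# row, regardless of the order in which the elements appear.
import json

lines = [
    '{"id":"1","arrayValues":[{"value1":"1"},{"value2":"2"}]}',
    '{"id":"2","arrayValues":[{"value1":"3"},{"value2":"4"}]}',
]

rows = []
for line in lines:
    event = json.loads(line)
    row = {"id": event["id"]}
    for element in event["arrayValues"]:
        row.update(element)  # each element contributes its one name/value pair
    rows.append(row)

print(rows)
```

This is the shape the self-joined OneRowPerInput CTE is trying to reconstruct: one output row per input, with the array values promoted to columns.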

Parameters being passed in are in the wrong format for the table

I have a query whereby parameters are passed in in the format 'GRB','MIN','OSH' and so on. These are called "Divisions". They may be passed in singly or in a list of any number of them. These are used in one of the tables in a report.
The problem is that the other table I am going to read is going to have the divisions in numeric format, so GRB = 02, MIN = 04 and OSH = 08. There are 12 different values.
So in essence, the SSRS report might pass in all these for an "IN" statement. They have to be presented in this manner.
I need to find a way to convert all of these alphanumeric codes into their corresponding numeric codes.
Main query with alphanumeric:
dbo.TQMNCR.PlantID IN (@PlantID)
Temp table with (hopefully) corresponding numeric values:
ProdTable.DIMENSION2_ in (????????)
Presuming that dbo.TQMNCR has a PlantNumber column for the numeric key, you could use a join like this.
SELECT ...
FROM dbo.TQMNCR AS Source
INNER JOIN ProdTable AS Mapping
ON Mapping.DIMENSION2_ = Source.THE_NUMERIC_KEY_COLUMN_YOU_SPEAK_OF
WHERE Mapping.PlantID IN (@PlantID)
Alternatively, if you don't want to use a permanent table to map them, you can also use a common table expression (numerous other options).
WITH Mapping_CTE AS (
SELECT 'GRB' AS PlantID, '02' AS PlantNumber
UNION ALL
SELECT 'MIN' AS PlantID, '04' AS PlantNumber
UNION ALL
...
)
SELECT ...
FROM dbo.TQMNCR AS Source
INNER JOIN Mapping_CTE AS Mapping
ON Mapping.PlantNumber = Source.THE_NUMERIC_KEY_COLUMN_YOU_SPEAK_OF
WHERE Mapping.PlantID IN (@PlantID)
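Whichever variant you pick, the mapping itself is just an inline lookup table. A minimal Python sketch of the same idea, using the three codes given in the question (the remaining nine values are assumed to follow the same pattern and are omitted):

```python
# The mapping CTE expressed as a lookup table: division code -> numeric code.
# Only GRB/MIN/OSH are stated in the question; the rest are assumed.
division_to_number = {"GRB": "02", "MIN": "04", "OSH": "08"}

selected = ["GRB", "OSH"]  # e.g. the divisions the SSRS report passes in
numeric = [division_to_number[d] for d in selected]
print(numeric)  # ['02', '08']
```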

Splitting text in SQL Server stored procedure

I'm working with a database, where one of the fields I extract is something like:
1-117 3-134 3-133
Each of these number sets represents a different set of data in another table. Taking 1-117 as an example, 1 = equipment ID, and 117 = equipment settings.
I have another table from which I need to extract data based on the previous field. It has two columns that split equipment ID and settings. Essentially, I need a way to go from the queried column 1-117 and run a query to extract data from another table where 1 and 117 are two separate corresponding columns.
So, is there anyway to split this number to run this query?
Also, how would I split those three numbers (1-117 3-134 3-133) into three different query sets?
The tricky part here is that this column can have any number of sets here (such as 1-117 3-133 or 1-117 3-134 3-133 2-131).
I'm creating these queries in a stored procedure as part of a larger document to display the extracted data.
Thanks for any help.
Since you didn't provide the DB vendor, here are two posts that answer this question for SQL Server and Oracle respectively...
T-SQL: Opposite to string concatenation - how to split string into multiple records
Splitting comma separated string in a PL/SQL stored proc
And if you're using some other DBMS, go search for "splitting text ". I can almost guarantee you're not the first one to ask, and there are answers for every DBMS flavor out there.
Since you said the format is constant, though, you could also do something simpler using a SUBSTRING function.
EDIT in response to OP comment...
Since you're using SQL Server, and you said that these values are always in a consistent format, you can do something as simple as using SUBSTRING to get each part of the value and assign them to T-SQL variables, where you can then use them to do whatever you want, like using them in the predicate of a query.
Assuming that what you said is true about the format always being #-### (exactly 1 digit, a dash, and 3 digits) this is fairly easy.
WITH EquipmentSettings AS (
SELECT
S.*,
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 5, 1)) AS EquipmentID,
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 3, 3)) AS Settings
FROM
SourceTable S
INNER JOIN master.dbo.spt_values V
ON V.Value BETWEEN 1 AND (Len(S.AwfulMultivalue) + 1) / 6
WHERE
V.type = 'P'
)
SELECT
E.Whatever,
D.Whatever
FROM
EquipmentSettings E
INNER JOIN DestinationTable D
ON E.EquipmentID = D.EquipmentID
AND E.Settings = D.Settings
In SQL Server 2005+ this query will support 1365 values in the string.
If the length of the digits can vary, then it's a little harder. Let me know.
If there are no more than 4 sets, you can use PARSENAME to retrieve the result.
Declare @Num varchar(20)
Set @Num = '1-117 3-134 3-133'
select parsename(replace(@Num, ' ', '.'), 3)
Result: 1-117
Now use parsename again on the same result:
Select parsename(replace(parsename(replace(@Num, ' ', '.'), 3), '-', '.'), 1)
Result: 117
If there are more than 4 values, then use split functions.
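For comparison, if the splitting can happen outside the stored procedure, it is a one-liner in most general-purpose languages. A Python sketch that handles any number of sets and variable digit widths (unlike the fixed-width SUBSTRING approach):

```python
# Split "1-117 3-134 3-133" into (equipment ID, settings) integer pairs:
# one pair per whitespace-separated chunk, any number of chunks.
s = "1-117 3-134 3-133"

pairs = [tuple(int(part) for part in chunk.split("-")) for chunk in s.split()]
print(pairs)  # [(1, 117), (3, 134), (3, 133)]
```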