Convert the latest data pull of a raw VARIANT table into a normal table: Snowflake SQL

I have a variant table where raw JSON data is stored in a column called "raw", as shown here.
Each row of this table is a full data pull from an API, ingested via Snowpipe. Within the JSON there is a 'pxQueryTimestamp' key-value pair. The latest value for this field should have the most up-to-date data. How would I go about normalizing only this row?
Usually my way around this is to only pipe over the latest data from S3 so that this table has only one row, and then I normalize that.
I'd like to have a historic table of all data pulls as shown below, but when normalizing we only care about the most relevant, up-to-date data.
Any help is appreciated!

If you are saying that you want to flatten and retain everything in the most current variant record, then I'd suggest leveraging a STREAM object in Snowflake, which would contain only the latest variant record. You could then TRUNCATE your flattened table and run an INSERT from the STREAM object into it; that consumes the stream, moving its offset forward, so the STREAM is empty again afterwards.
Take a look at the documentation here:
https://docs.snowflake.net/manuals/user-guide/streams.html
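As a rough sketch of that pattern (the table name raw_pulls, the stream name raw_pulls_stream, the target flattened_pulls, and the raw:pxResults array are all assumptions; only pxQueryTimestamp comes from the question):

-- Stream that tracks new rows landing in the raw table via Snowpipe
CREATE OR REPLACE STREAM raw_pulls_stream ON TABLE raw_pulls;

-- Rebuild the normalized table from the latest pull
TRUNCATE TABLE flattened_pulls;

INSERT INTO flattened_pulls
SELECT
    s.raw:pxQueryTimestamp::string AS pull_ts,    -- assumes the timestamp string sorts lexicographically
    f.value                        AS record
FROM raw_pulls_stream s,
     LATERAL FLATTEN(input => s.raw:pxResults) f  -- pxResults: hypothetical array holding the records
QUALIFY s.raw:pxQueryTimestamp::string =
        MAX(s.raw:pxQueryTimestamp::string) OVER ();

Consuming the stream in the INSERT advances its offset, so the stream is empty until the next pull lands.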

Related

Is it possible to insert rows with different fields into a BigQuery table?

Using the BigQuery UI I've created a new table, free_schema_table, without setting any schema; then I tried to execute:
insert into my_dataset.free_schema_table (chatSessionId, chatRequestId, senderType, senderFriendlyName)
values ("123", "1234", "CUSTOMER", "Player")
But the BigQuery UI showed me a popup that said:
Column chatSessionId is not present in table my_dataset.free_schema_table at [1:43]
I expected that BigQuery is a NoSQL storage system and that I should be able to insert rows with different columns.
How could I achieve this?
P.S.
schema:
BigQuery requires a schema with strong types.
If you need a free schema, the closest thing in BigQuery is to define a single column of STRING type and store JSON inside it.
JSON functions will help you extract fields from the JSON string later, but you don't get the optimizations BigQuery provides when you predefine your schema and save the data in separate columns.
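For illustration, a minimal sketch of that approach, assuming a hypothetical table my_dataset.free_schema_table with a single STRING column named payload:

INSERT INTO my_dataset.free_schema_table (payload)
VALUES ('{"chatSessionId":"123","chatRequestId":"1234","senderType":"CUSTOMER","senderFriendlyName":"Player"}');

-- later, pull individual fields out of the JSON string
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.chatSessionId')      AS chatSessionId,
  JSON_EXTRACT_SCALAR(payload, '$.senderFriendlyName') AS senderFriendlyName
FROM my_dataset.free_schema_table;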

Create table schema and load data into a BigQuery table using Google Drive as the source

I am creating a table using Google Drive as the source and Google Sheets as the format.
I have selected "Drive" as the value for "Create table from". For the file format, I selected Google Sheet.
I also selected auto-detect for the schema and input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of being used as the table fields.
Kindly tell me what I need to do to get the first row of the sheet treated as the table column names, not as data.
It would have been helpful if you could include a screenshot of the top few rows of the file you're trying to upload, at least to see the data types you have in there. BigQuery, at least as of when this response was composed, cannot differentiate between column names and data rows if both have similar data types while schema auto-detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically using the UI options alone), since the headers and the row data are all strings. The "Header rows to skip" option would not help with this.
Schema auto-detection should be able to detect and differentiate column names from data rows when the columns have different data types, though.
You have an option to skip the header row under Advanced options. Simply put 1 as the number of header rows to skip (your first row is where your header is). It will skip the first row as data and use its values as your column names.
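For reference, a sketch of the equivalent DDL if you prefer to define the external table in SQL rather than through the UI; the dataset, table name, and spreadsheet URL are placeholders:

CREATE OR REPLACE EXTERNAL TABLE my_dataset.sheet_table
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/SPREADSHEET_ID'],
  skip_leading_rows = 1  -- treat the first row as the header, not as data
);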

Pixel tracking to BQ: how to save querystring parameter values directly to BQ table fields

I was setting up a serverless tracking pixel using this article: https://cloud.google.com/solutions/serverless-pixel-tracking-tutorial
This works, but it saves the entire pixel GET URL into one single field in BQ. Since the pixel URL carries multiple querystring parameter values with it, it would be best for these to go into individual fields in BQ: I want to tweak it to save each querystring parameter value of the GET tracking pixel into its own BQ table field.
Assuming the names and number of the querystring parameters are known and they match 1-to-1 with the BQ table columns, what would be the recommended way to achieve this?
I was looking in the article to see if the logs query can be tuned to do this.
I also saw that Dataflow may be the way to go, but I'm wondering if it is possible in a more direct and simple way.
The simple, direct way is to create a query in BigQuery that exposes the parameters as columns.
You can save this query as a VIEW and query the view instead of repeating the full query.
Then you can set up a scheduled query in BigQuery to run this query regularly and save the results into a new table. This way the old table keeps receiving the URLs as they are, and the scheduled query materializes the parsed data into a new table.
You can tune it to remove already-processed rows from the incoming table and append only new rows to the materialized table.
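As a rough sketch of such a view (the raw table my_dataset.pixel_hits, its STRING column url holding the pixel GET URL, and the parameter names campaign and user_id are all assumptions):

CREATE OR REPLACE VIEW my_dataset.pixel_hits_parsed AS
SELECT
  REGEXP_EXTRACT(url, r'[?&]campaign=([^&]*)') AS campaign,
  REGEXP_EXTRACT(url, r'[?&]user_id=([^&]*)')  AS user_id,
  url                                          AS raw_url  -- keep the original URL for reference
FROM my_dataset.pixel_hits;

The scheduled query can then simply select from this view into the target table.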

Add new column to existing table in Pentaho

I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
to do the calculation and then feed it back. Obviously, that appended the new data to the old data.
to do the calculation and then feed it back, but truncating the table first. As the process got stuck at some point, I assume what happened is that I was truncating the table while the data was still being extracted from it.
to use a stream lookup and then feed it back. Of course, that also appended the data on top of the existing data.
to use a stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the 'Update' step.
As it has been running for a while, I am positive that is not the option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table Input step and then add a Calculator step where you create the third column. After writing the logic, right-click the Calculator step and click Preview; you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct in saying you shouldn't use the same table. Instead, leave the input untouched and just upload / insert the new values into a new table. It doesn't have to be a new table EVERY TIME; rather than truncating the original, you truncate the staging (output) table.
How many 'new columns' will you need? Will every iteration of this run create a new column in the output, or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating the 'C' column with a calculation based on A and B can be done directly in most relational DBMSs with a simple UPDATE clause, as sketched below. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
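A minimal sketch of doing that directly in the database; the table name my_table and the columns a, b, and c are hypothetical, and the exact ALTER syntax varies slightly by DBMS:

-- add the new column once
ALTER TABLE my_table ADD COLUMN c NUMERIC;

-- populate it from the existing columns
UPDATE my_table
SET c = a + b;  -- or whatever calculation you need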

Kettle Pentaho backup transformation by latest data

I need to synchronize some data from one database to another using a Kettle/Spoon transformation. The logic is: I need to select the latest date that already exists in the destination DB, then select from the source DB everything after that last date. What transformation element do I need to do this?
Thank you.
There can be many solutions:
If you have timestamp columns in both the source and destination tables, then you can take two Table Input steps. In the first one, just select the max last-updated timestamp; then use it as a variable (or parameter) in the next Table Input, taking it as a filter for the source data. You can do something like this:
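For illustration, a sketch of the two Table Input queries; the table and column names are hypothetical, and the ? assumes the second Table Input has 'Insert data from step' pointed at the first one:

-- Table Input 1 (against the destination DB): latest loaded timestamp
SELECT MAX(last_updated) AS last_updated
FROM destination_table;

-- Table Input 2 (against the source DB): only rows newer than that
SELECT *
FROM source_table
WHERE last_updated > ?;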
If you just want the new data to be updated in the destination table and you don't care much about timestamps, I would suggest you use the Insert/Update step for output. It will bring all the data into the stream; if it finds a match, it won't insert anything, and if it doesn't find a match, it will insert a new row. If it finds any modifications to an existing row in the destination table, it will update it accordingly.