When does Hive URL-encode partition key names?

I have tried inserting partitions like 'a:b' into Hive (0.13), and sometimes it ends up with "a%3Ab". Is that something Hive does (if so, when), or is there some other layer along the way?
This issue would indicate that Hive maintainers have seen this before.
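For reference, "%3A" is simply the percent-encoded form of ":". The minimal sketch below only demonstrates that encoding with Python's standard library. It does not reproduce Hive's own escaping rules, and the helper name `encode_partition_value` is made up for illustration.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Percent-encode a value; hypothetical helper, not Hive's implementation."""
    # safe="" means no reserved character (':' or '/', for example) is left as-is.
    return quote(value, safe="")


encoded = encode_partition_value("a:b")
decoded = unquote(encoded)
print(encoded)  # a%3Ab
print(decoded)  # a:b
```

If a partition value shows up encoded like this, decoding it with `unquote` recovers the original string.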

Related

In BigQuery, what is the behaviour of WRITE_APPEND

If I have a date-partitioned table with a WRITE_APPEND policy, what happens if I write data into existing partitions? Is the write simply ignored, or is the data appended as the name indicates? My understanding is that it appends to the existing data in the same partition, but I am not 100% sure.
The doc only says: "WRITE_APPEND: If the table already exists, BigQuery appends the data to the table." This is ambiguous and does not mention partitioned tables at all.

Table without date and Primary Key

I have 9 million records. This is the current process:
Daily we receive the entire file of 9M records, about 150 GB in size.
It is loaded into Snowflake with truncate-and-load: every day we delete all 9M records and reload them.
We want to send only an incremental file load to Snowflake instead.
For example, out of the 9 million records, only roughly 0.5 million would change (0.1M inserts, 0.3M deletes, and 0.2M updates). How can we compare the files, extract only the delta, and load it to Snowflake? How can we do this quickly and cost-effectively with AWS-native tools, landing the result in S3?
P.S. The data doesn't have any date column. The design is old, dating from 2012, and we need to optimize it. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle into Snowflake. You want to do an incremental load from Oracle, but the table has no incremental key to identify the changed rows. You have two options:
Work with your data owners and put in the effort to identify an incremental key. There needs to be one; people sometimes skip this effort. This is the best option.
If you cannot, look for a CDC (change data capture) solution such as GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) that make a row unique. Doing CDC on all columns is not recommended, because a DataStage job consumes more resources as you add more change columns to the CDC stage.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a similar problem. In my case there was no primary key and no date column to identify the difference, so I used AWS Athena (managed Presto) to calculate the difference between the source and the destination. The process is:
Copy the source data to S3.
Create a source table in Athena pointing to the data copied from the source.
Create a destination table in Athena pointing to the destination data.
Use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the difference as well, and create the result table with a column that says "update/insert/delete".
This option is AWS native and also cheap, as Athena costs $5 per TB scanned. With this method, do not forget to write file rotation scripts to cut down your S3 costs.
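The EXCEPT approach above can be tried locally before running it in Athena. The sketch below is only an illustration under assumptions: it uses an in-memory SQLite database rather than Athena, and the column names `id` and `payload` and the function name `snapshot_diff` are hypothetical. It runs the query in both directions, because `destination EXCEPT source` only returns rows that disappeared or changed, while `source EXCEPT destination` returns rows that are new or changed.

```python
import sqlite3


def snapshot_diff(source_rows, destination_rows):
    """Return (added, removed) row lists by comparing whole rows with EXCEPT."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column layout; the real table would mirror the file columns.
    conn.execute("CREATE TABLE table_source (id TEXT, payload TEXT)")
    conn.execute("CREATE TABLE table_destination (id TEXT, payload TEXT)")
    conn.executemany("INSERT INTO table_source VALUES (?, ?)", source_rows)
    conn.executemany("INSERT INTO table_destination VALUES (?, ?)", destination_rows)

    # Rows already loaded that are no longer in the new file (deleted or old versions).
    removed = conn.execute(
        "SELECT * FROM table_destination EXCEPT SELECT * FROM table_source"
    ).fetchall()
    # Rows in the new file that were never loaded (inserted or new versions).
    added = conn.execute(
        "SELECT * FROM table_source EXCEPT SELECT * FROM table_destination"
    ).fetchall()
    conn.close()
    return sorted(added), sorted(removed)


todays_file = [("1", "alice"), ("2", "bob-new"), ("4", "dave")]
already_loaded = [("1", "alice"), ("2", "bob"), ("3", "carol")]
added, removed = snapshot_diff(todays_file, already_loaded)
print(added)    # [('2', 'bob-new'), ('4', 'dave')]
print(removed)  # [('2', 'bob'), ('3', 'carol')]
```

Without a key, an update appears as one removed row plus one added row. With a key, rows whose key occurs in both lists can be labelled as updates.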

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we will eventually end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Then the batch jobs that we run every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of just two columns (timestamp + JSON), I would add one partitioning column and up to four clustering columns (BigQuery's limit for clustering) as well. You could eventually even use yearly suffixed tables. This gives you several dimensions for limiting the number of rows scanned during rematerialization.
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write them to BigQuery under a new schema. That pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that is rematerialization, where you rewrite the same table with a query that omits them. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The Dataflow steps would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly, based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations step and write them to a file per event type, using windowing suited to your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
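The flatten, filter, and route steps can be sketched in plain Python to make the idea concrete. This is not the Beam or Dataflow API. The event shape (`type` plus `payload`), the `events_<type>` table naming, and all function names are assumptions made up for this illustration.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def route_events(events, allowed_fields):
    """Group flattened, filtered rows by a per-event-type destination table."""
    tables = defaultdict(list)
    for event in events:
        flat = flatten(event["payload"])
        row = {k: v for k, v in flat.items() if k in allowed_fields}
        tables[f"events_{event['type']}"].append(row)
    return dict(tables)


def infer_schema(rows):
    """Derive the column list for a destination from the fields actually seen."""
    return sorted({field for row in rows for field in row})


events = [
    {"type": "signup", "payload": {"user": {"id": 1, "email": "a@example.com"}, "debug": "x"}},
    {"type": "click", "payload": {"user": {"id": 1}, "button": "buy"}},
]
routed = route_events(events, allowed_fields={"user_id", "user_email", "button"})
click_schema = infer_schema(routed["events_click"])
print(routed["events_signup"])  # [{'user_id': 1, 'user_email': 'a@example.com'}]
print(click_schema)             # ['button', 'user_id']
```

In a real pipeline the routing function would feed the sink's per-element table choice, and rows that fail to insert would go to the side file described in the steps above.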

Truncate and Insert in ClickHouse Database

I have a particular scenario where I need to truncate and batch insert into a table in ClickHouse every 30 minutes or so. I could find no reference to a truncate option in ClickHouse.
However, I could find suggestions that we can indirectly achieve this by dropping the old table, creating a new table with same name and inserting data into it.
With respect to that, I have a few questions.
How is this achieved ? What is the sequence of steps in this process ?
What happens to other queries such as Select during the time when the table is being dropped and recreated ?
How long does it usually take for a table to be dropped and recreated in ClickHouse ?
Is there a better and clean way this can be achieved ?
How is this achieved ? What is the sequence of steps in this process ?
TRUNCATE is supported. There is no need to drop and recreate the table now.
What happens to other queries such as Select during the time when the table is being dropped and recreated ?
That depends on which table engine you use. For the MergeTree family you get snapshot-like behavior for SELECT.
How long does it usually take for a table to be dropped and recreated in ClickHouse ?
I would assume it depends on how fast the underlying file system can handle file deletions. A large table might contain millions of data part files, which makes truncation slow. However, in your case I wouldn't worry much.
Is there a better and clean way this can be achieved ?
I suggest using partitions keyed on a (DateTime / 60) column (per minute), along with a user script that regularly drops out-of-date partitions.
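The harvesting script could look roughly like the sketch below. It is an illustration under assumptions: the partition value is taken to be an integer minute number (epoch seconds divided by 60), the table name `events` is hypothetical, and the list of existing partitions is passed in rather than queried. The function only builds `ALTER TABLE ... DROP PARTITION` statements. Executing them is left to whichever client you use.

```python
def expired_partition_statements(table, partitions, now_epoch, keep_minutes):
    """Build DROP PARTITION statements for partitions older than the retention window."""
    # Partition values are assumed to be minute numbers: epoch seconds // 60.
    cutoff = now_epoch // 60 - keep_minutes
    return [
        f"ALTER TABLE {table} DROP PARTITION {partition}"
        for partition in sorted(partitions)
        if partition < cutoff
    ]


# now_epoch=6000 is minute 100; keeping 30 minutes puts the cutoff at minute 70.
statements = expired_partition_statements(
    "events", partitions=[99, 65, 70, 69], now_epoch=6000, keep_minutes=30
)
for statement in statements:
    print(statement)
# ALTER TABLE events DROP PARTITION 65
# ALTER TABLE events DROP PARTITION 69
```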

How do you append to a Hive array?

I have a Hive table where, for each user ID, I have a ts column that holds a timeseries stored as an array. I want to maintain the timeseries as a window of the most recent values.
(a) how do I append a new number to the end of each column from another table joined by ID?
(b) how do I drop the leading number?
Data in Hive is typically stored in HDFS. HDFS has limited append capabilities. If the constant modification of data is at the core of your analytics systems, then perhaps you should consider using alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of its SQL-like functionality) but reconsider your design for storing these updates.
A quick solution to your problem would be to have more than one record per user ID in your table. Each record would hold a timeseries entry for that user ID. When you want to do your last-N analysis on the timeseries, select from the table using Distribute By on the user ID column. Your custom reducer then simply picks out the last N (or fewer, if the timeseries is shorter than N) timestamps and returns them.
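The logic of that reducer, and of the sliding window asked about in (a) and (b), can be sketched in plain Python. This is not Hive code. The record layout `(user_id, timestamp, value)` and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def slide(window, new_value):
    """Drop the leading element and append the new one, keeping the length fixed."""
    return window[1:] + [new_value]


def last_n_per_user(records, n):
    """Group (user_id, timestamp, value) records and keep each user's latest n values."""
    by_user = defaultdict(list)
    for user_id, timestamp, value in records:
        by_user[user_id].append((timestamp, value))
    # Sort each user's points by timestamp, then keep the values of the last n.
    return {
        user_id: [value for _, value in sorted(points)[-n:]]
        for user_id, points in by_user.items()
    }


records = [("u1", 1, 10.0), ("u1", 3, 30.0), ("u1", 2, 20.0), ("u2", 1, 5.0)]
windows = last_n_per_user(records, n=2)
print(windows)              # {'u1': [20.0, 30.0], 'u2': [5.0]}
print(slide([1, 2, 3], 4))  # [2, 3, 4]
```

Grouping by user ID here plays the role that Distribute By plays in the Hive query: all records for one user end up in the same place before the last N are picked.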
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!