How do you append to a Hive array? - hive

I have a Hive table where for a user ID I have a ts column, which is a timeseries, stored as array. I want to maintain the timeseries as a recentmost window.
(a) how do I append a new number to the end of each column from another table joined by ID?
(b) how do I drop the leading number?

Data in Hive is typically stored in HDFS. HDFS has limited append capabilities. If the constant modification of data is at the core of your analytics systems, then perhaps you should consider using alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of it's SQL like functionality) but reconsider your design for storing these updates.
A quick solution to your above problem would be to have more than one record per user ID in your table. Each record would have a timeseries corresponding to the User ID. When you want to do your last N analysis on the timeseries, you should do a select from the table by using by Distribute By on User ID column. Your custom reducer will simply pick out the last N (or less, if the size of the timeseries is less than N) timestamps and return them.
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!

Related

Table without date and Primary Key

I have 9M records. We needed to do the following operations:-
daily we receive the entire file of 9M records with 150GB of file size
It is truncate and loads in Snowflake. Daily deleting the entire 9B records and loading
We would want to send only incremental file load to Snowflake. Meaning that:
For example, out of 9Million records, we would only have an update in 0.5Million records(0.1 M Inserts,0.3 Deletes, and 0.2 Updates). How we will be able to compare the file and extract only delta file and load to the snowflake. How to do it cost-effectively and fast way in AWS native tools and load to S3.
P.s data doesn't have any date column. It is a pretty old concept written in 2012. We need to optimize this. The file format is fixed width. Attaching sample RAW data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only Insert, Updates, and Deletes into a File. How do you classify this best and cost-efficient way.
Your tags and the question content does not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle but you do not have an incremental key in the table to identify the incremental rows. You have two options.
Work with your data owners and put the effort to identify the incremental key. There needs to be one. People are sometimes lazy to put this effort. This will be the most optimal option
If you cannot, then look for a CDC(change data capture) solution like golden gate
CDC stage comes by default in DataStage.
Using CDC stage in combination of Transformer stage, is best approach to identify new rows, changed rows and rows for deletion.
You need to identify column(s) which makes row unique, doing CDC with all columns is not recommended, DataStage job with CDC stage consumes more resources if you add more change columns in CDC stage.
Work with your BA to identifying column(s) which makes row unique in the data.
I had the similar problem what you have. In my case, there are no Primary key and there is no date column to identify the difference. So what I did is actually, I used AWS Athena (presto managed) to calculate the difference between source and the destination. Below is the process:
Copy the source data to s3.
Create Source Table in athena pointing the data copied from source.
Create Destination table in athena pointing to the destination data.
Now use, SQL in athena to find out the difference. As I did not have the both primary key and date column, I used the below script:
select * from table_destination
except
select * from table_source;
If you have primary key, you can use that to find the difference as well and create the result table with the column which says "update/insert/delete"
This option is aws native and then it will be cheaper as well, as it costs 5$ per TB in athena. Also, in this method, do not forget to write file rotation scripts, to cut down your s3 costs.

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing
table, and compare with the new data that comes in
IMHO entire data compare to load new data is not performant.
Option 1:
Instead you can create google-bigquery partition table and create a partition column to load the data and also while loading new data you can check whether the new data has same partition column.
Hitting partition level data in hive or bigquery is more useful/efficient than selecting entire data and comparing in spark.
Same is applicable for hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
Option 2:
Another alternative is with GOOGLE bigquery we have merge statement, if your requirement is to merge the data with out comparision, then you can go ahead with MERGE statement .. see doc link below
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, We can get performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement to update changes in the target table.
There are many ways this problem can be solved, one of the less expensive, performant and scalable way is to use a datastore on the file system to determine true new data.
As data comes in for the 1st time write it to 2 places - database and to a file (say in s3). If data is already on the database then you need to initialize the local/s3 file with table data.
As data comes in 2nd time onwards, check if it is new based its presence on local/s3 file.
Mark delta data as new or updated. Export this to database as insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use deltalake to maintain it too.
One downside is that whenever database is updated this file may need to be updated based on relevant data is being Changed or not. You can maintain a marker on the database table to signify sync date. Index that column too. Read changed records based on this column and update the file/deltalake.
This way your sparl app will be less dependent on a database. The database operations are not very scalable so keeping them away from critical path is better
Shouldnt you have a last update time in you DB? The approach you are using doesnt sound scalable so if you had a way to set update time to each row in the table it will solve the problem.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field to the website would correspond to new columns for in BigQuery. Also if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in Bigquery.
So we're going to eventually result in tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as json (for example where each Bigquery table will just have two columns, one for timestamp and another for the json data). Then batch jobs that we have running every 10minutes will perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our bigquery schema based off the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document storage instead, but we use Bigquery as both a data lake and also as a data warehouse for BI and building Tableau reports off of. So we have jobs that aggregates raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared, you layout several options in your question.
You could go with the JSON table and to maintain low costs
you can use a partition table
you can cluster your table
so instead of having just two timestamp+json column I would add 1 partitioned column and 5 cluster colums as well. Eventually even use yearly suffixed tables. This way you have at least 6 dimensions to scan only limited number of rows for rematerialization.
The other would be to change your model, and do an event processing middle-layer. You could first wire all your events either to Dataflow or Pub/Sub then process it there and write to bigquery as a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
Btw you can remove columns, that's rematerialization, you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemeted using Dataflow (or Apache Beam) with Dynamic Destination feature in it. The steps of dataflow would be like:
read the event/json from pubsub
flattened the events and put filter on the columns which you want to insert into BQ table.
With Dynamic Destination you will be able to insert the data into the respective tables
(if you have various event of various types). In Dynamic destination
you can specify the schema on the fly based on the fields in your
json
Get the failed insert records from the Dynamic
Destination and write it to a file of specific event type following some windowing based on your use case (How frequently you observe such issues).
read the file and update the schema once and load the file to that BQ table
I have implemented this logic in my use case and it is working perfectly fine.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).

How can we use same partition schema with different partition function?

I'm learning table partitioning.
When I read this page, it said that
The TransactionHistoryArchive table must have the same design schema as the TransactionHistory table. There must also be an empty partition to receive the new data. In this case, TransactionHistoryArchive is a partitioned table that consists of just two partitions.
And with the following picture, we can see that TransactionHistory has 12 partitions, but TransactionHistoryArchive just has 2 partitions.
Illustration http://i.msdn.microsoft.com/dynimg/IC38652.gif
How could it possible? Please help me to understand it.
As long as two individual partitions have identical schema and the same boundary values you can switch them. They don't need to have the same partition scheme or function.
This is because SQL Server ensures that the binary data of those partitions on disk is compatible. That's the magic of partitioning and why you can move arbitrary amounts of data as a quick metadata-only operation.