Using HBase in place of Hive

Today we use Hive as our data warehouse, mainly for batch/bulk data processing: Hive analytics queries, joins, etc., as part of an ETL pipeline.
Recently we ran into a problem while trying to expose our Hive-based ETL pipeline as a service. The problem is related to Hive's fixed table schema. We have a situation where the table schema is not fixed; it can change, e.g. new columns can be added (at any position in the schema, not necessarily at the end), deleted, or renamed.
In Hive, once partitions are created, I believe they cannot be changed, i.e. we cannot add a new column to an older partition and populate just that column with data. We have to re-create the partition with the new schema and populate data in all columns. New partitions, however, can have the new schema and would contain data for the new column (I am not sure whether a new column can be inserted at any position in the schema?). Trying to read the value of the new column from an older (unmodified) partition returns NULL.
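For illustration, a rough HiveQL sketch of what we see today (table and column names are made up):
-- existing partitioned table
CREATE TABLE events (id STRING, amount DOUBLE)
PARTITIONED BY (dt STRING);
-- adding a column only takes effect for new or rewritten partitions
ALTER TABLE events ADD COLUMNS (new_col STRING);
-- older, unmodified partitions return NULL for new_col
SELECT new_col FROM events WHERE dt = '2019-01-01';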
I want to know whether I can use HBase in this scenario and whether it will solve the problems above:
1. Insert new columns at any position in the schema, delete columns, rename columns.
2. Backfill data in the new column, i.e. for older data (in older partitions), populate data only in the new column without re-creating the partition or re-populating data in the other columns.
I understand that HBase is schema-less (schema-free), i.e. each record/row can have a different number of columns. I am not sure whether HBase has a concept of partitions?

You are right that HBase is a semi-schema-less database (column families are still fixed).
You will be able to create new columns.
You will be able to populate data only in the new column without re-creating partitions or re-populating data in the other columns.
but
Unfortunately, HBase does not support partitions (in Hive terms); you can see this discussion. That means that if the partition date is not part of the row key, each query will do a full table scan (see the sketch below).
Renaming a column is not a trivial operation at all.
Frequently updating existing records between major compaction intervals will increase query response time.
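For the row-key point, a rough sketch (table, column family, and column names are made up) of mapping an HBase table into Hive with the date leading the row key, so that date-bounded queries become row-key range scans instead of full scans:
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,   -- e.g. '2019-01-01#record_id'
  old_col STRING,
  new_col STRING    -- may exist only for some rows in HBase
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:old_col,d:new_col')
TBLPROPERTIES ('hbase.table.name' = 'events');
-- a range on the leading part of the row key lets HBase scan only that key range
SELECT * FROM hbase_events
WHERE rowkey >= '2019-01-01' AND rowkey < '2019-02-01';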
I hope it is helpful.

Related

Table without date and Primary Key

I have 9M records. We need to do the following operations:
Daily, we receive the entire file of 9M records, about 150 GB in size.
It is a truncate-and-load into Snowflake: daily, the entire 9M records are deleted and reloaded.
We want to send only an incremental load to Snowflake. Meaning that:
For example, out of 9 million records, we would only have changes in 0.5 million records (0.1 M inserts, 0.3 M deletes, and 0.2 M updates). How will we be able to compare the files, extract only the delta file, and load it to Snowflake? How can we do this in a cost-effective and fast way with AWS-native tools, loading to S3?
P.S. The data doesn't have any date column. It is a pretty old design, written in 2012, and we need to optimize it. The file format is fixed-width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the incremental rows. You have two options.
Work with your data owners and put in the effort to identify the incremental key. There needs to be one; people are sometimes too lazy to put in this effort. This will be the most optimal option.
If you cannot, then look for a CDC (change data capture) solution like GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) which make a row unique; doing CDC with all columns is not recommended, as a DataStage job with a CDC stage consumes more resources the more change columns you add to the CDC stage.
Work with your BA to identify the column(s) which make a row unique in the data.
I had a problem similar to yours. In my case there was no primary key and no date column to identify the difference, so what I did was use AWS Athena (managed Presto) to calculate the difference between the source and the destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing to the data copied from the source.
Create a destination table in Athena pointing to the destination data.
Now use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the difference as well, and create a result table with a column that says "insert/update/delete".
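For example, a sketch of that classification in Athena, assuming a primary key column named id and two hypothetical non-key columns col1 and col2:
SELECT
  COALESCE(s.id, d.id) AS id,
  CASE
    WHEN d.id IS NULL THEN 'insert'   -- present only in the source
    WHEN s.id IS NULL THEN 'delete'   -- present only in the destination
    ELSE 'update'                     -- present in both, but different
  END AS change_type
FROM table_source s
FULL OUTER JOIN table_destination d ON s.id = d.id
WHERE d.id IS NULL
   OR s.id IS NULL
   OR s.col1 IS DISTINCT FROM d.col1
   OR s.col2 IS DISTINCT FROM d.col2;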
This option is AWS-native and cheap as well, as Athena costs $5 per TB scanned. Also, with this method, do not forget to write file rotation scripts to cut down your S3 costs.

The best way to Update the database table through a pyspark job

I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table and compare it with the new data that comes in. The comparison happens in the Spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in
IMHO, comparing the entire dataset in order to load the new data is not performant.
Option 1:
Instead, you can create a google-bigquery partitioned table with a partition column used when loading the data; while loading new data, you can check whether the new data belongs to the same partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire data and comparing it in Spark.
The same is applicable for Hive as well.
See Creating partitioned tables or Creating and using integer range partitioned tables.
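A minimal BigQuery sketch of the idea (dataset, table, and column names are hypothetical):
-- table partitioned by a load date column
CREATE TABLE IF NOT EXISTS my_dataset.aggregated_events
(
  event_id  STRING,
  amount    NUMERIC,
  load_date DATE
)
PARTITION BY load_date;
-- the Spark job only needs to read/compare the partition it is about to write
SELECT event_id, amount
FROM my_dataset.aggregated_events
WHERE load_date = CURRENT_DATE();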
Option 2:
Another alternative: in Google BigQuery we have the MERGE statement. If your requirement is to merge the data without doing the comparison yourself, you can go ahead with the MERGE statement; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write individual statements to apply changes to the target table.
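A rough MERGE sketch (table and column names are hypothetical), assuming the Spark job lands its output in a staging table first:
MERGE my_dataset.target_table T
USING my_dataset.staging_table S
ON T.event_id = S.event_id
WHEN MATCHED THEN
  UPDATE SET amount = S.amount, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, amount, updated_at)
  VALUES (S.event_id, S.amount, S.updated_at)
-- removes target rows missing from the staging data; omit if staging is incremental
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;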
There are many ways this problem can be solved; one of the less expensive, performant, and scalable ways is to use a datastore on the file system to determine the truly new data.
As data comes in for the first time, write it to two places: the database and a file (say in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
As data comes in from the second time onwards, check whether it is new based on its presence in the local/S3 file (see the sketch below).
Mark the delta data as new or updated. Export this to the database as inserts or updates.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use Delta Lake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated, depending on whether the relevant data changed or not. You can maintain a marker on the database table to signify the sync date, and index that column too. Read changed records based on this column and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
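A rough Spark SQL sketch of the "is it new?" check, assuming the incoming batch and the local/S3 (or Delta Lake) store are registered as views named incoming and seen, keyed on a hypothetical record_key column:
-- rows in the incoming batch that are not yet present in the file/Delta Lake store
SELECT i.*
FROM incoming i
LEFT ANTI JOIN seen s
  ON i.record_key = s.record_key;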
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate the raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partitioning column and up to four clustering columns as well. You could eventually even use yearly suffixed tables. This way you have several dimensions to scan only a limited number of rows for rematerialization.
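A sketch of that layout (dataset, table, and column names are hypothetical):
-- raw events: a JSON payload plus a few stable columns to partition/cluster on
CREATE TABLE IF NOT EXISTS my_dataset.raw_events
(
  event_ts   TIMESTAMP,
  event_type STRING,
  user_id    STRING,
  payload    STRING     -- the JSON blob
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id;
-- rematerialization jobs then scan only the partitions/clusters they need
SELECT JSON_EXTRACT_SCALAR(payload, '$.form.name') AS form_name
FROM my_dataset.raw_events
WHERE DATE(event_ts) = '2020-01-01'
  AND event_type = 'form_submit';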
The other option would be to change your model and build an event-processing middle layer. You could first wire all your events either to Dataflow or to Pub/Sub, process them there, and write to BigQuery with the new schema. That pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, i.e. you rewrite the same table with a query (see the sketch below). You can rematerialize to remove duplicate rows as well.
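A minimal rematerialization sketch (table and column names are hypothetical):
-- rewrite the table without the deprecated columns
CREATE OR REPLACE TABLE my_dataset.events AS
SELECT * EXCEPT(deprecated_col_1, deprecated_col_2)
FROM my_dataset.events;
-- note: re-declare partitioning/clustering in the CREATE statement if the table uses them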
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The Dataflow steps would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns which you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destination and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.

SQL Server Data Type Change

In SQL Server, we have a table with 80 million records, and 3 columns are currently of the float data type. We now need to change these float columns to decimal. How do we proceed with minimum downtime?
We executed the usual ALTER statement to change the data type, but the log file filled up and we ran into an out-of-memory exception. So kindly suggest a better way to solve this issue.
We can't use the technique of creating 3 new temp columns, updating the existing data batch-wise, dropping the existing columns, and renaming the temp columns to the live column names.
I have done exactly the same thing in one of my projects. Here are the steps you can follow, giving minimal logging and minimum downtime with the least complexity.
Create a new table with the new data types and the same column names, but a different name (without indexes; if the table requires any indexes, create them once the data has been loaded into the new table). For example, if the existing table is EMPLOYEE, then the new table name should be EMPLOYEE_1. All constraints, like foreign keys, can be created before or after loading; it does not impact performance either way. However, I recommend not creating them with the same names as on the existing table, since otherwise you would have to rename the constraints after renaming the table.
Keep in mind that the precision of the new data type must cover the maximum precision present in your existing table.
Load data from your existing table into the new table using SSIS with the fast-load option, so that the load is minimally logged.
During the downtime, rename the old table to EMPLOYEE_2 and rename EMPLOYEE_1 to EMPLOYEE.
Alter the table definition for foreign key, default, or any other constraints.
Go live, and create the indexes on the table during a low-load window.
Using this approach we have changed data types on tables with more than a billion records, with minimum downtime and minimal logging, as the SSIS fast-load option keeps logging to a minimum.
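A minimal T-SQL sketch of the swap step (using the example names above):
-- during the downtime window
EXEC sp_rename 'dbo.EMPLOYEE', 'EMPLOYEE_2';
EXEC sp_rename 'dbo.EMPLOYEE_1', 'EMPLOYEE';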
This is too long for a comment.
One of these approaches might work.
The simplest approach is to give up. Well almost, but do the following:
Rename the existing column to something else (say, _col)
Add a computed column that does the conversion.
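A minimal T-SQL sketch of those two steps, assuming a hypothetical table dbo.t with a float column named col:
-- rename the existing float column out of the way
EXEC sp_rename 'dbo.t.col', '_col', 'COLUMN';
-- add a computed column with the old name that converts on read
-- (decimal(18, 6) is a placeholder precision/scale)
ALTER TABLE dbo.t ADD col AS CAST(_col AS decimal(18, 6));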
If the data is infrequently modified (so you only need read access), then create a new table. That is, copy the data into a new table with the correct type, then drop the original table and rename the new table to the old name.
You can do something similar by swapping table spaces rather than renaming the table.
Another way could be:
Add a new column to the table with the correct datatype.
Copy the values from the float column into the decimal column, but do it in batches of x records.
When everything is done, remove the float column and rename the decimal column to what the float column name was.
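A sketch of the batched copy, assuming a hypothetical table dbo.t with the existing float column col_f and the new decimal column col_d:
DECLARE @batch int = 50000;
WHILE 1 = 1
BEGIN
    -- convert only rows that have not been converted yet
    UPDATE TOP (@batch) t
    SET    col_d = CAST(col_f AS decimal(18, 6))
    FROM   dbo.t AS t
    WHERE  col_d IS NULL AND col_f IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to convert
END;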

How to effectively save & restore data from the last three months and delete the old data?

I am using PostgreSQL. I need to delete all transaction data from the database (except the transaction data from the last three months), then restore the data into a new database with the created/updated timestamps set to the current timestamp. Also, the data older than three months must be recapped into one record (for example, all invoices from party A must be grouped into one invoice for party A). Another rule: if a row is still referenced by a foreign key from the last three months' data, it must not be deleted; only its created/updated timestamps are changed to the current timestamp.
I am not good at SQL queries, so for now I am using this strategy:
First create the recap data (saved in another temporary table) before the delete (based on all data).
Then delete all data except the last three months.
Next create the recap data after the delete.
Compute the recap data as (all data - data remaining after the delete), so I get recap data whose totals exactly match the data older than three months.
Then insert the recap data into the table. So the old data is cleaned up and the recap data is in the database.
So my strategy only uses the same database and does not create a new database, because importing data through the program is very slow (there are 900+ tables).
But the client doesn't want to use this strategy because he wants the data to be created in a new database and told me to use another way. So the question is: what is the correct procedure to clean a database up to some date (filtering by date) and recap the old data?
First of all, there is no way to find out when a row was added to a table unless you track it with a timestamp column.
That's the first change you'll have to make: add a timestamp column to all relevant tables that tracks when the row was created (or updated, depending on the requirement).
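A minimal sketch, assuming a hypothetical transactions table (on a large table this ALTER rewrites the table, so schedule it accordingly):
ALTER TABLE transactions
  ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();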
Then you have two choices:
Partition the tables by the timestamp column so that you have (for example) one partition per month.
Advantage: it is easy to get rid of old data: just drop the partition.
Disadvantage: Partitioning is tricky in PostgreSQL. It will become somewhat easier to handle in PostgreSQL v10, but the underlying problems remain.
Use mass DELETEs to get rid of old rows. That's easy to implement, but mass deletes really hurt (table and index bloat, which might necessitate VACUUM (FULL) or REINDEX, both of which impair availability).
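A sketch of both choices, assuming PostgreSQL 10+ declarative partitioning and a hypothetical transactions table with the created_at column from above:
-- partition by month on the timestamp column
CREATE TABLE transactions (
    id         bigint,
    party_id   bigint,
    amount     numeric,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE transactions_2017_06 PARTITION OF transactions
    FOR VALUES FROM ('2017-06-01') TO ('2017-07-01');
-- dropping a month that has fallen out of the three-month window is then just:
DROP TABLE transactions_2017_06;
-- the mass-DELETE alternative:
DELETE FROM transactions WHERE created_at < now() - interval '3 months';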