How to maintain history data whose schema changes quarterly using Hadoop - pandas

I have a JSON input file which stores survey data (feedback from the customers).
The columns in the JSON file can vary: for example, in the first quarter there may be 70 columns, in the next quarter 100 columns, and so on.
I want to store all of this quarterly data in the same table on HDFS.
Is there a way to maintain the history, for example by dropping and re-creating the table with the changing schema?
How will it behave if the number of columns goes down, say we get only 30 columns in the third quarter?

The first point is that in HDFS you don't store tables, just files; you create tables in Hive, Impala, etc. on top of those files.
Some file formats support schema merging at read time, for example Parquet.
In general you will be able to re-create your table with a superset of the columns, and Impala has similar capabilities for schema evolution.
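As a rough illustration, here is what that could look like in HiveQL, assuming the quarterly JSON is converted to Parquet under a fixed HDFS location and using made-up column names. Dropping and re-creating an external table only changes the metadata; the files stay where they are, and rows from older quarters simply return NULL for columns they never had:

    -- External table: dropping it does not delete the Parquet files in /data/survey/
    DROP TABLE IF EXISTS survey_history;

    -- Re-create with the superset of all columns seen so far.
    -- Older quarters return NULL for the columns they lack; if a quarter stops
    -- sending a column (e.g. only 30 columns in Q3), those columns also read as NULL.
    CREATE EXTERNAL TABLE survey_history (
      customer_id STRING,
      q1_rating   INT,     -- present since Q1
      q2_comment  STRING,  -- added in Q2
      nps_score   INT      -- added in Q3
    )
    STORED AS PARQUET
    LOCATION '/data/survey/';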

Related

Create table in hive from data folder in HDFS - remove duplicated rows

I have a folder in HDFS, let's call it /data/users/
Inside that folder, a new csv file is added every 10 days. Basically the new file will contain only active users, so, for example
file_01Jan2020.csv: contains data for 1000 users who are currently active
file_10Jan2020.csv: contains data for 950 users who are currently active (same data as file_01Jan2020.csv with 50 fewer records)
file_20Jan2020.csv: contains data for 920 users who are currently active (same data as file_10Jan2020.csv with 30 fewer records)
In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never have records that don't exist in the older files; it will just have fewer records.
I want to create a table in hive using the data in this folder. What I am doing now is:
Create External table from the data in the folder /data/users/
Create Internal table with the same structure
Write the data from the external table into the internal table, where:
Duplicates are removed
If a record doesn't exist in one of the files, I mark it as deleted by setting a 'deleted' flag in a new column that I defined in the internal table
I am concerned about the step where I create the external table: since the data is really big, that table will become huge after some time, and I was wondering if there is a more efficient way of doing this instead of loading all the files in the folder each time.
So my question is: what is the best possible way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplication?
I'd suggest partitioning the data by date; that way you don't have to go through all the records every time you read the table.
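A minimal HiveQL sketch of that idea, with made-up table and column names: keep an external table over the raw folder, partition the managed table by snapshot date, and load each new file into its own partition, de-duplicating on the way in. Queries that filter on the partition column then read one snapshot instead of the whole folder:

    -- External table over the raw CSV folder (columns are illustrative)
    CREATE EXTERNAL TABLE users_raw (
      user_id STRING,
      name    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/users/';

    -- Managed table partitioned by the snapshot date
    CREATE TABLE users_snapshots (
      user_id STRING,
      name    STRING
    )
    PARTITIONED BY (snapshot_date STRING)
    STORED AS ORC;

    -- Load only the newest file into its own partition, removing duplicates.
    -- INPUT__FILE__NAME is Hive's virtual column holding the source file path.
    INSERT OVERWRITE TABLE users_snapshots PARTITION (snapshot_date = '2020-01-20')
    SELECT DISTINCT user_id, name
    FROM users_raw
    WHERE INPUT__FILE__NAME LIKE '%file_20Jan2020.csv';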

How to move a table from a dataset in London to EU in BigQuery on a rolling basis?

I essentially need to create a query that transfers the contents of the London table table_london to a table table_eu in a dataset in the EU, dataset_eu. I will then schedule this query on a daily basis to overwrite the created table table_eu.
I have looked into using the transfer option in BigQuery but this will transfer the contents of the entire dataset containing table_london rather than just the one table I need.
It is not possible to export a table across regions or to copy a table directly to a different region.
The situation is similar when copying a dataset between regions: even though there is a beta feature for this, it is limited to certain regions. In addition, as far as I understand, this is not an option for you since it would copy all the tables.
Below, I'm listing a couple of possibilities that might help you:
Move query results:
As you are querying a table, you can save the results to a local file or to Cloud Storage, and then load the data from that file into the other dataset.
Export a table:
If you want to move the entire table's data (not a query result), you would need to use Cloud Storage as the intermediate service, and in that case you should consider the regional and multi-regional locations of the destination bucket. See exporting tables and its limitations.
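If the SQL EXPORT DATA and LOAD DATA statements are available to you, the Cloud Storage round trip could look roughly like this (the bucket name, source dataset name and AVRO format are placeholders, and the bucket's location has to satisfy the export/load colocation rules mentioned above; the bq extract / bq load CLI commands can be used instead):

    -- 1) Run in the London dataset: export the table (or any query result) to GCS.
    EXPORT DATA OPTIONS (
      uri = 'gs://my_transfer_bucket/table_london/*.avro',  -- placeholder bucket
      format = 'AVRO',
      overwrite = true
    ) AS
    SELECT * FROM dataset_london.table_london;  -- dataset_london is a placeholder name

    -- 2) Run against the EU dataset: load the exported files, overwriting table_eu daily.
    LOAD DATA OVERWRITE dataset_eu.table_eu
    FROM FILES (
      format = 'AVRO',
      uris = ['gs://my_transfer_bucket/table_london/*.avro']
    );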

How to handle a large dimension in BigQuery

I have a dimension table in my current warehouse (Netezza) which has 10 million records and which is being updated on a daily basis.
Should we keep this dimension table as it is in BigQuery, given that we are planning to migrate to BigQuery?
How can we redesign this large dimension in BigQuery?
Because BigQuery is not intended for updates, it's not that easy to implement a dimension table. The proper answer depends on your use case.
But here are some alternatives:
Have an append-only dimension table with an "UpdatedAt" field. Then use a window function to get the latest version (you can even create a view that exposes only the latest version; see the sketch after this list)
Truncate the dimension table daily and reload it with the latest version of your data.
Create an external table based on GCS / Bigtable / Cloud SQL, and have the dimensions updated there.
Save your dimension table in a separate database, and use Cloud Dataflow to perform the join
Save the dimension data together with the fact table (Yes, there will be a lot of duplications, but sometimes it's worth the cost)
Simply update the dimension table whenever there is a change (there are limits on doing that)
All of these approaches have drawbacks. The solution can even be a mix of more than one approach.
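To make the first alternative concrete, here is a rough sketch in BigQuery standard SQL, using made-up dataset and column names (a customer_dim table keyed by customer_id with an UpdatedAt timestamp):

    -- View that exposes only the latest version of each dimension row.
    CREATE OR REPLACE VIEW warehouse.customer_dim_current AS
    SELECT * EXCEPT (rn)
    FROM (
      SELECT
        d.*,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY UpdatedAt DESC) AS rn
      FROM warehouse.customer_dim AS d
    )
    WHERE rn = 1;

Fact tables can then join to customer_dim_current as if it were a normal, updated dimension, while the underlying table stays append-only.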

Pentaho Data Integration - Star schema (PostgreSQL)

I have a CSV file with data and the database I need for the star schema.
However, the CSV file doesn't have the IDs of the dimension tables (the primary keys), which means I only get those IDs after inserting the data into the dimension tables (the ID is an auto-increment value).
This means that first I need to load the data into the dimensions, and after that I need to read the dimension tables (to know the IDs) and the remaining data from the CSV file, and load all of that into the fact table.
To load the data into the dimensions, I made this Transformation and it works perfectly.
The problem is in getting the IDs from the tables (and, simultaneously, the remaining data from the CSV file) and loading all of that into the fact table.
I don't know if it is even possible to do all this in a single Transformation.
Any suggestions?
I would really appreciate any help you could provide. (A sketch of the correct Transformation would be great)
This entire thing is possible in one job, not in one transformation.
Create a job, and inside it add two transformations.
First load the dimension tables in the first transformation, then load the fact table in the second transformation.
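The fact-table transformation essentially has to look the surrogate keys back up by the natural keys that are in the CSV (in PDI that is typically a Database lookup or Dimension lookup/update step, or a Table input step with a join). Expressed as a plain PostgreSQL sketch, with made-up staging and dimension names, the logic is:

    -- Assumes the CSV rows were first loaded into a staging table and that each
    -- dimension has a unique natural/business key; all names are illustrative.
    INSERT INTO fact_sales (date_id, product_id, customer_id, amount)
    SELECT d.date_id,
           p.product_id,
           c.customer_id,
           s.amount
    FROM staging_csv  AS s
    JOIN dim_date     AS d ON d.date_value    = s.sale_date
    JOIN dim_product  AS p ON p.product_code  = s.product_code
    JOIN dim_customer AS c ON c.customer_code = s.customer_code;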

Using HBase in place of Hive

Today we are using Hive as our data warehouse, mainly for batch/bulk data processing: Hive analytics queries, joins, etc. in an ETL pipeline.
Recently we have been facing a problem while trying to expose our Hive-based ETL pipeline as a service. The problem is related to Hive's fixed table schema: in our situation the table schema is not fixed and can change, e.g. new columns can be added (at any position in the schema, not necessarily at the end), deleted, or renamed.
In Hive, once partitions are created, I guess they cannot be changed, i.e. we cannot add a new column to an older partition and populate just that column with data; we have to re-create the partition with the new schema and populate data in all columns. However, new partitions can have the new schema and would contain data for the new column (I'm not sure whether a new column can be inserted at any position in the schema?). Trying to read the value of the new column from an older (unmodified) partition would return NULL.
I want to know whether I can use HBase in this scenario and whether it will solve my problems above:
1. Insert new columns at any position in the schema, delete columns, rename columns.
2. Backfill data in a new column, i.e. for older data (in older partitions), populate data only in the new column without re-creating the partition or re-populating data in the other columns.
I understand that HBase is schema-less (schema-free), i.e. each record/row can have a different number of columns. I'm not sure whether HBase has a concept of partitions?
You are right that HBase is a semi schema-less database (column families are still fixed).
You will be able to create new columns.
You will be able to populate data only in the new column without re-creating partitions or re-populating data in the other columns.
but
Unfortunately, HBase does not support partitions (in the Hive sense); you can see this discussion. That means that if the partition date is not part of the row key, each query will do a full table scan.
Renaming a column is not a trivial operation at all.
Frequently updating existing records between major compaction intervals will increase query response time.
I hope it is helpful.