PySpark working with both Hive and HBase together - capstone

I have two tables in Hive (patient vital and patient contact) and one table in HBase (threshold). I am required to compare the patient vital data against both the threshold and contact data in a PySpark application. The question: is it OK to keep the threshold table in HBase, or is it better to bring it into a Hive external table so that PySpark can work with it? In other words, can PySpark work with tables in different formats?
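PySpark can read both sources in the same job; here is a minimal sketch, assuming the Hive tables are visible through the metastore and the Hortonworks Spark-HBase connector (shc-core) is on the classpath. The table, column, and row-key names below are hypothetical.

```python
from pyspark.sql import SparkSession

# Hive support lets spark.sql() see the metastore tables directly.
spark = (SparkSession.builder
         .appName("vitals-vs-threshold")
         .enableHiveSupport()
         .getOrCreate())

# Hive tables (hypothetical names) are read with plain SQL.
vitals = spark.sql("SELECT patient_id, vital_name, vital_value FROM patient_vital")
contact = spark.sql("SELECT patient_id, phone FROM patient_contact")

# HBase table read through the SHC connector; the catalog maps
# HBase column families/qualifiers to DataFrame columns.
threshold_catalog = """{
  "table": {"namespace": "default", "name": "threshold"},
  "rowkey": "key",
  "columns": {
    "vital_name": {"cf": "rowkey", "col": "key", "type": "string"},
    "max_value":  {"cf": "d",      "col": "max", "type": "double"}
  }
}"""
threshold = (spark.read
             .format("org.apache.spark.sql.execution.datasources.hbase")
             .option("catalog", threshold_catalog)
             .load())

# Once loaded, the source format no longer matters: join and compare.
breaches = (vitals.join(threshold, "vital_name")
                  .where(vitals.vital_value > threshold.max_value)
                  .join(contact, "patient_id"))
breaches.show()
```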

Related

Create New Bigquery Table Partitioned on Different Column

I have some data streaming into a BigQuery table partitioned on timestamp column A (defined in the streaming service). Now, for analysis, we want to query the data with filters on timestamp column B. So it would be great if there were some way to create a view or table (kept in sync with the source table) partitioned on column B. I looked into materialized views, but they only support the same partitioning column as the source table.
Any workaround or suggestion is appreciated.
Thanks in advance.
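One possible workaround, sketched below with the google-cloud-bigquery Python client, is a periodically scheduled job that rewrites the source into a second table partitioned on column B. This assumes column-based (DATE/TIMESTAMP) partitioning is available in your project; the dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Rebuild a copy of the source table, partitioned on column B.
# Run this on a schedule (e.g. a scheduled query or Cloud Scheduler job)
# to keep the copy roughly in sync with the streaming source.
ddl = """
CREATE OR REPLACE TABLE my_dataset.events_by_b
PARTITION BY DATE(column_b) AS
SELECT * FROM my_dataset.events_by_a
"""
client.query(ddl).result()  # waits for the job to finish
```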

How to maintain history data whose schema changes quarterly using Hadoop

I have a JSON input file which stores survey data (feedback from the customers).
The columns in the JSON file can vary: e.g. in the first quarter there can be 70 columns, and in the next quarter it can have 100 columns, and so on.
I want to store all this quarterly data in the same table on HDFS.
Is there a way to maintain history, either by dropping and re-creating the table with the changing schema?
How will it behave if the column count goes down, let's say in the 3rd quarter we get only 30 columns?
The first point is that in HDFS you don't store tables, just files. You create tables in Hive, Impala, etc. on top of files.
Some of the formats support schema merging at read time, for example Parquet.
In general you will be able to recreate your table with a superset of columns. Impala has similar capabilities for schema evolution.
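A minimal PySpark sketch of the Parquet schema-merge approach mentioned above, assuming each quarter's JSON feedback is landed under its own HDFS directory; the paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("survey-history").getOrCreate()

# Each quarter is converted from JSON to Parquet under its own directory,
# so the schema is allowed to differ from quarter to quarter.
q1 = spark.read.json("hdfs:///surveys/raw/2023Q1/")
q1.write.mode("overwrite").parquet("hdfs:///surveys/parquet/quarter=2023Q1/")

q2 = spark.read.json("hdfs:///surveys/raw/2023Q2/")
q2.write.mode("overwrite").parquet("hdfs:///surveys/parquet/quarter=2023Q2/")

# mergeSchema builds a superset of all columns at read time; columns a
# quarter never had (e.g. a 30-column quarter) simply come back as NULL.
history = (spark.read
           .option("mergeSchema", "true")
           .parquet("hdfs:///surveys/parquet/"))
history.printSchema()
```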

How to partition a datetime column when creating Athena table

I have some log files in S3 with the following CSV format (sample data in parentheses):
userid (15678),
datetime (2017-09-14T00:21:10),
tag1 (some random text),
tag2 (some random text)
I want to load these into Athena tables and partition the data based on the datetime in a day/month/year format. Is there a way to split the datetime at table creation, or do I need to run some job beforehand to separate the columns and then import?
Athena supports only Hive external tables. In external tables, to partition the data, your data must be in different folders.
There are two ways in which you can do that. Both are mentioned here.
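A sketch of the "run some job beforehand" option: a PySpark job that splits the datetime into year/month/day folders in S3, which an Athena table declared with matching partition columns can then read. The bucket paths and schema are hypothetical, based on the sample rows above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth, to_timestamp

spark = SparkSession.builder.appName("athena-partitioner").getOrCreate()

logs = spark.read.csv(
    "s3://my-bucket/raw-logs/",
    schema="userid INT, datetime STRING, tag1 STRING, tag2 STRING")

# Derive partition columns from the ISO datetime string.
ts = to_timestamp(col("datetime"), "yyyy-MM-dd'T'HH:mm:ss")
partitioned = (logs
               .withColumn("year", year(ts))
               .withColumn("month", month(ts))
               .withColumn("day", dayofmonth(ts)))

# partitionBy produces s3://.../year=2017/month=9/day=14/ folders, which an
# Athena table PARTITIONED BY (year, month, day) can read after
# MSCK REPAIR TABLE (or ALTER TABLE ... ADD PARTITION).
(partitioned.write
 .mode("overwrite")
 .partitionBy("year", "month", "day")
 .parquet("s3://my-bucket/partitioned-logs/"))
```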

Using HBase in place of Hive

Today we are using Hive as our data warehouse, mainly for batch/bulk data processing - Hive analytics queries/joins etc. - in an ETL pipeline.
Recently we have been facing a problem while trying to expose our Hive-based ETL pipeline as a service. The problem is related to the fixed table schema nature of Hive. We have a situation where the table schema is not fixed; it could change, e.g. new columns could be added (at any position in the schema, not necessarily at the end), deleted, or renamed.
In Hive, once the partitions are created, I guess they cannot be changed, i.e. we cannot add a new column to an older partition and populate just that column with data. We have to re-create the partition with the new schema and populate data in all columns. However, new partitions can have a new schema and would contain data for the new column (not sure if a new column can be inserted at any position in the schema?). Trying to read the value of the new column from an older (un-modified) partition would return NULL.
I want to know if I can use HBase in this scenario and whether it will solve the above problems:
1. Insert new columns at any position in the schema, delete columns, rename columns.
2. Backfill data in a new column, i.e. for older data (in older partitions) populate data only in the new column without re-creating the partition/re-populating data in other columns.
I understand that HBase is schema-less (schema-free), i.e. each record/row can have a different number of columns. Not sure if HBase has a concept of partitions?
You are right, HBase is a semi schema-less database (column families are still fixed).
You will be able to create new columns.
You will be able to populate data only in the new column without re-creating the partition/re-populating data in other columns.
but
Unfortunately, HBase does not support partitions (talking in Hive terms); you can see this discussion. That means that if the partition date is not part of the row key, each query will do a full table scan.
Renaming a column is not a trivial operation at all.
Frequently updating existing records between major compaction intervals will increase query response time.
I hope it is helpful.
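To illustrate the row-key point above, here is a minimal sketch using the happybase Python client, assuming a hypothetical table whose row keys are prefixed with the partition date (e.g. 20230101|patient-42). Only rows in the scanned key range are touched, which is the closest HBase equivalent of partition pruning.

```python
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("patient_vitals")       # hypothetical table

# Backfilling a brand-new column for an old row: just write the cell.
# No partition or other columns need to be rewritten.
table.put(b"20230101|patient-42", {b"d:new_metric": b"7.5"})

# Because the date is the row-key prefix, this scan only reads
# January 2023 rows instead of doing a full table scan.
for key, data in table.scan(row_start=b"20230101", row_stop=b"20230201"):
    print(key, data)

connection.close()
```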

Import CSV to partitioned table on BigQuery using specific timestamp column?

I want to import a large CSV into a BigQuery partitioned table that has a timestamp-type column which is actually the date of some transaction. The problem is that when I load the data, it imports everything into one partition, with today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?
In BigQuery, currently, partitioning based on a specific column is not supported, even if that column is date-related (a timestamp).
You either rely on the time of insertion, so the BigQuery engine will insert into the respective partition, or you specify exactly which partition you want to insert your data into.
See more about Creating and Updating Date-Partitioned Tables.
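A minimal sketch of the second option (targeting a specific partition) with the google-cloud-bigquery Python client, assuming the CSV has already been split by transaction date; the dataset, table, and bucket names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)

# "$YYYYMMDD" is BigQuery's partition decorator: rows from this file land
# in the 2017-09-14 partition regardless of when the load job runs.
load_job = client.load_table_from_uri(
    "gs://my-bucket/transactions/2017-09-14.csv",
    "my_dataset.transactions$20170914",
    job_config=job_config,
)
load_job.result()  # wait for completion
```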
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which will read the file from a Google Cloud Storage bucket and insert the rows into BigQuery's table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow right now doesn't support creating partitioned tables.
There are multiple examples available at [3].
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples
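A rough Apache Beam (Dataflow) sketch of such a pipeline, assuming the partitioned table already exists and using hypothetical bucket and column names; the callable table destination routes each row to the partition matching its transaction timestamp.

```python
import apache_beam as beam

def parse_csv(line):
    # Hypothetical CSV layout: transaction_ts,amount
    ts, amount = line.split(",")
    return {"transaction_ts": ts, "amount": float(amount)}

def to_partition(row):
    # "2017-09-14T00:21:10" -> "my_project:my_dataset.transactions$20170914"
    day = row["transaction_ts"][:10].replace("-", "")
    return "my_project:my_dataset.transactions$" + day

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/transactions.csv", skip_header_lines=1)
     | beam.Map(parse_csv)
     | beam.io.WriteToBigQuery(
           table=to_partition,  # callable: one destination per element
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```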