Partitions in Hive interview questions - Hive

1) If the partition column doesn't have data, what error will you get when you query on it?
2) If some rows don't have the partition column, how will those rows be handled? Will there be any data loss?
3) Why does bucketing need to be done on a numeric column? Can we use a string column as well? What is the process, and on what basis do you choose the bucketing column?
4) Will internal table details also be stored in the metastore, or only external table details?
5) What type of queries run only on the mapper side and not in the reducer, and vice versa?

Short answers:
1. If the partition column doesn't have data, what error will you get when you query on it?
A partition in Hive is a folder named key=value with data files inside. If the partition column has no data, no partition folders exist and the table is empty: no error is displayed and no data is returned.
When you insert NULL into the partition column using dynamic partitioning, all NULL values in the partition column (and all values that do not conform to the field type) are loaded into the __HIVE_DEFAULT_PARTITION__ partition. If the partition column type is numeric, a type-cast error will then be thrown during SELECT, something like "cannot cast TextWritable to IntWritable".
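A minimal sketch of how this plays out; the table and column names here are made up for illustration:

    -- Enable dynamic partitioning for the example insert.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Hypothetical tables, only to show where NULL partition values end up.
    CREATE TABLE events_part (id INT, payload STRING)
    PARTITIONED BY (event_date STRING);

    -- Rows whose event_date is NULL are written to the
    -- event_date=__HIVE_DEFAULT_PARTITION__ folder.
    INSERT OVERWRITE TABLE events_part PARTITION (event_date)
    SELECT id, payload, event_date
    FROM events_staging;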
2. If some rows don't have the partition column, how will those rows be handled? Will there be any data loss?
If "does not have" means NULLs, those rows are loaded into the __HIVE_DEFAULT_PARTITION__ partition. It is still possible to read that data back, so no loss happens.
3. Why does bucketing need to be done on a numeric column? It does not need to be numeric. Can we use a string column as well? Yes. What is the process, and on what basis do you choose the bucketing column?
Columns for bucketing should be chosen based on join/filter columns. Values are hashed, distributed and sorted (clustered), and rows with the same hash are written (during INSERT OVERWRITE) into the same buckets (files). The number of buckets and the bucketing columns are specified in the table DDL.
Bucketed tables and bucket map joins are a somewhat outdated concept; you can achieve the same thing using DISTRIBUTE BY + sort + ORC, which is more flexible. A sketch of both approaches follows.
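Both variants in rough form; the table names (clicks_raw, clicks_bucketed, clicks_sorted) and the bucket count are assumptions for illustration only:

    -- Bucketed table: rows are hashed on the join key into 32 bucket files.
    CREATE TABLE clicks_bucketed (user_id BIGINT, url STRING, ts TIMESTAMP)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;

    SET hive.enforce.bucketing=true;   -- needed on older Hive versions; always enforced in Hive 2.x
    INSERT OVERWRITE TABLE clicks_bucketed
    SELECT user_id, url, ts FROM clicks_raw;

    -- The more flexible alternative: no bucketing metadata, just co-locate and
    -- sort the data on write into an ORC table (clicks_sorted is assumed to
    -- exist with the same columns).
    INSERT OVERWRITE TABLE clicks_sorted
    SELECT user_id, url, ts FROM clicks_raw
    DISTRIBUTE BY user_id SORT BY user_id;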
4. Will internal table details also be stored in the metastore, or only external table details?
It does not matter whether the table is external or managed: the table schema, grants, and statistics are stored in the metastore either way.
5. What type of queries run only on the mapper side and not in the reducer, and vice versa?
Queries without aggregations, map joins (when the small table fits in memory), simple column transformations (simple column UDFs like regexp_replace, split, substr, trim, concat, etc.), filters in WHERE, and SORT BY can be executed on the mapper alone.
Aggregations and analytics, common joins, ORDER BY, DISTRIBUTE BY, and UDAFs are executed on mapper + reducer.
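Two quick illustrations; the users table and its columns are hypothetical:

    -- Map-only: filter plus simple column UDFs, no shuffle needed.
    SELECT id, upper(trim(name)) AS name
    FROM users
    WHERE country = 'US';

    -- Map + reduce: the GROUP BY aggregation forces a shuffle to reducers.
    SELECT country, count(*) AS cnt
    FROM users
    GROUP BY country;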
As for "runs only at the mapper side, not in the reducer, and vice versa":
Vice versa is not possible. The mapper is used to read the data files; the reducer is the next, optional step, which cannot exist without a mapper. However, map -> reduce -> reduce -> ... is possible when running on the Tez execution engine. Tez can represent a complex query as a single DAG and run it as a single job, removing unnecessary steps of the MR engine such as writing intermediate results to HDFS and reading them back with a mapper. Even in MR, map-only jobs are possible.

Related

Is there vertical sharding in PostgreSQL (or other)?

I'm using an RDS database in AWS.
I inserted 5 billion rows into one table and need to add an index on a varchar column. (To speed up the inserts, I removed the index beforehand.)
However, the following error is displayed and it takes too long:
ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp707.0.sharedfileset/0.41": No space left on device
(I am using the db.r6g.2xlarge instance class.)
So I'm considering managing the table in separate pieces.
However, I want to split the table by the row's primary key id number (nextval('table_id'::regclass)) rather than dividing it by column.
And when querying with SQL, I want to use it as one table.
It's like a vertical sharding(?) concept...!
Is this possible in PostgreSQL?
As an additional question:
Is it possible to split a table that already has data in it this way?
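What the question describes (splitting one table into pieces by id range and still querying it as a single table) is usually called horizontal or range partitioning, and PostgreSQL 10+ supports it natively via declarative partitioning. A minimal sketch, with hypothetical table names and arbitrary boundary values:

    -- Parent table: holds no data itself, only routes rows by id range.
    CREATE TABLE big_table (
        id      bigint NOT NULL DEFAULT nextval('table_id'::regclass),
        payload text
    ) PARTITION BY RANGE (id);

    -- Child partitions; the boundaries here are examples only.
    CREATE TABLE big_table_p0 PARTITION OF big_table
        FOR VALUES FROM (1) TO (1000000001);
    CREATE TABLE big_table_p1 PARTITION OF big_table
        FOR VALUES FROM (1000000001) TO (2000000001);

    -- Queries go against the parent; PostgreSQL prunes to matching partitions.
    SELECT * FROM big_table WHERE id BETWEEN 42 AND 100;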

The best way to update a database table through a PySpark job

I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table and compare it with the new data that comes in. The comparison happens in the Spark layer.
I was wondering if there is any better way to compare that could improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
"One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in"
IMHO, comparing against the entire existing dataset just to load new data is not performant.
Option 1:
Instead, you can create a BigQuery partitioned table, pick a partition column to load the data by, and, while loading new data, check whether the new data belongs to an existing partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire table and comparing it in Spark.
The same applies to Hive as well.
See the BigQuery documentation pages "Creating partitioned tables" and "Creating and using integer range partitioned tables".
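A rough sketch of the idea, with made-up dataset, table, and column names:

    -- Date-partitioned target table.
    CREATE TABLE my_dataset.aggregated_events (
      event_date DATE,
      user_id    STRING,
      payload    STRING
    )
    PARTITION BY event_date;

    -- Checking/loading touches only the partitions of the new batch instead of
    -- scanning the whole table.
    SELECT *
    FROM my_dataset.aggregated_events
    WHERE event_date = '2023-01-15';   -- example partition of the incoming batch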
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the new data without doing the comparison yourself in Spark, you can go ahead with MERGE (see the documentation link below).
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass; we do not need to write individual statements to apply changes to the target table.
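A hedged sketch of such an upsert; the table names, key column, and value columns are assumptions:

    -- Upsert staged rows into the target in one atomic statement.
    MERGE my_dataset.target_table T
    USING my_dataset.staging_new_data S
    ON T.id = S.id
    WHEN MATCHED THEN
      UPDATE SET value = S.value, updated_at = S.updated_at
    WHEN NOT MATCHED THEN
      INSERT (id, value, updated_at) VALUES (S.id, S.value, S.updated_at);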
There are many ways this problem can be solved; one of the less expensive, performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the first time, write it to two places - the database and a file (say, in S3). If data is already in the database, then you need to initialize the local/S3 file with the table's data.
As data comes in from the second time onwards, check whether it is new based on its presence in the local/S3 file.
Mark delta data as new or updated, and export it to the database as inserts or updates.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep data within that time range.
You can also bucket and partition this data, and you can use Delta Lake to maintain it.
One downside is that whenever the database is updated, this file may need to be updated as well, depending on whether the relevant data changed. You can maintain a marker column on the database table to record the sync date, index that column, read changed records based on it, and update the file/Delta Lake accordingly.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.
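A minimal illustration of that idea, assuming a hypothetical last_updated column and a remembered sync watermark:

    -- Read only the rows changed since the last successful sync, instead of
    -- comparing the whole table in Spark.
    SELECT *
    FROM source_table
    WHERE last_updated > TIMESTAMP '2024-01-01 00:00:00';  -- last sync watermark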

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we run every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared - you lay out several options in your question.
You could go with the JSON table, and to maintain low costs:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partition column and a few clustering columns (BigQuery allows up to four), and eventually even use yearly suffixed tables. This way you have several dimensions that let you scan only a limited number of rows for rematerialization.
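A sketch of that layout in BigQuery DDL; the table and column names are illustrative only:

    -- Raw JSON table, partitioned by day and clustered on a few frequently
    -- filtered fields to limit the bytes scanned during rematerialization.
    CREATE TABLE my_dataset.raw_events (
      event_ts   TIMESTAMP,
      event_type STRING,
      user_id    STRING,
      payload    STRING      -- the raw JSON blob
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY event_type, user_id;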
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. This script would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
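For example, a deprecated column can be dropped by rewriting the table in place (table and column names here are hypothetical):

    -- Rematerialize the table without the deprecated column.
    CREATE OR REPLACE TABLE my_dataset.events AS
    SELECT * EXCEPT(deprecated_form_field)
    FROM my_dataset.events;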
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations step and write them to a file per event type, using some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.

How can we use same partition schema with different partition function?

I'm learning table partitioning.
When I read this page, it said that
The TransactionHistoryArchive table must have the same design schema as the TransactionHistory table. There must also be an empty partition to receive the new data. In this case, TransactionHistoryArchive is a partitioned table that consists of just two partitions.
And from the following picture we can see that TransactionHistory has 12 partitions, while TransactionHistoryArchive has just 2.
Illustration: http://i.msdn.microsoft.com/dynimg/IC38652.gif
How is that possible? Please help me understand it.
As long as the two individual partitions have an identical schema and the same boundary values, you can switch them. They don't need to belong to the same partition scheme or function.
This is because SQL Server ensures that the binary data of those partitions on disk is compatible. That's the magic of partitioning, and it's why you can move arbitrary amounts of data as a quick, metadata-only operation.
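The switch itself is a single statement; a minimal sketch against the tables from the article (the partition numbers here are assumptions, not taken from the linked page):

    -- Move the oldest partition of the main table into the empty partition of
    -- the archive table as a metadata-only operation.
    ALTER TABLE TransactionHistory
    SWITCH PARTITION 1 TO TransactionHistoryArchive PARTITION 2;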

How do you append to a Hive array?

I have a Hive table where, for each user ID, I have a ts column which is a time series stored as an array. I want to maintain the time series as a most-recent window.
(a) How do I append a new number to the end of each array from another table joined by ID?
(b) How do I drop the leading number?
Data in Hive is typically stored in HDFS, and HDFS has limited append capabilities. If constant modification of data is at the core of your analytics system, then perhaps you should consider alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of its SQL-like functionality), but reconsider your design for storing these updates.
A quick solution to your problem would be to have more than one record per user ID in your table, each record holding one timestamp of that user's time series. When you want to do your last-N analysis on the time series, select from the table using DISTRIBUTE BY on the user ID column; your custom reducer can simply pick out the last N timestamps (or fewer, if the time series is shorter than N) and return them. A sketch of this layout follows.
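One way to sketch that multi-row layout, keeping the last N entries per user with a window function instead of a custom reducer; the table name, columns, and N are assumptions:

    -- One row per (user id, timestamp value) instead of an array column.
    -- Rebuild the "most recent window" as an array on demand.
    SELECT id,
           sort_array(collect_list(ts_value)) AS recent_ts   -- ascending order
    FROM (
      SELECT id, ts_value,
             row_number() OVER (PARTITION BY id ORDER BY ts_value DESC) AS rn
      FROM user_ts_events
    ) t
    WHERE rn <= 10        -- keep the 10 most recent entries per user
    GROUP BY id;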
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!