How can I actually check the bucket version in Hive? - hive

As far as I know, the bucket version information is written into the actual bucket file as binary data.
I also believe that when Hive reads that file, it checks the version through the codec. Is this correct?
* Represents format of "bucket" property in Hive 3.0.
* top 3 bits - version code.
* next 1 bit - reserved for future
* next 12 bits - the bucket ID
* next 4 bits reserved for future
* remaining 12 bits - the statement ID - 0-based numbering of all statements within a
* transaction. Each leg of a multi-insert statement gets a separate statement ID.
* The reserved bits align it so that it easier to interpret it in Hex.
* ... omitted
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/BucketCodec.java
This is a part of the Hive BucketCodec source code. Is what I know correct? I'm curious how you actually check the bucket information.
Also, if the bucket file is saved in ORC format, the top 3 bits are ORC format information. In this case, how do I check the bucket version information?
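To make the quoted bit layout concrete, here is a minimal Python sketch that unpacks and packs a 32-bit "bucket" value using only the field widths from the comment above (3 + 1 + 12 + 4 + 12 bits). The function and type names are hypothetical, and the sketch assumes the value is available as a plain 32-bit integer; it does not show where Hive stores the value or how Hive itself reads it.

```python
from typing import NamedTuple


class BucketProperty(NamedTuple):
    version: int       # top 3 bits
    bucket_id: int     # 12 bits after the first reserved bit
    statement_id: int  # lowest 12 bits


def decode_bucket_property(value: int) -> BucketProperty:
    """Split a 32-bit 'bucket' value into the fields named in the quoted comment."""
    value &= 0xFFFFFFFF  # treat the input as unsigned 32-bit
    version = (value >> 29) & 0x7
    bucket_id = (value >> 16) & 0xFFF
    statement_id = value & 0xFFF
    return BucketProperty(version, bucket_id, statement_id)


def encode_bucket_property(version: int, bucket_id: int, statement_id: int) -> int:
    """Inverse of decode_bucket_property; reserved bits are left as zero."""
    if not (0 <= version <= 0x7 and 0 <= bucket_id <= 0xFFF and 0 <= statement_id <= 0xFFF):
        raise ValueError("field out of range for the quoted layout")
    return (version << 29) | (bucket_id << 16) | statement_id


if __name__ == "__main__":
    # 0x20010003 reads directly in hex: version 1, bucket 1, statement 3
    print(decode_bucket_property(0x20010003))
```

Because the reserved bits pad each field to a hex-digit boundary, the same values can be read off the hexadecimal form without any code.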

Related

BigQuery: Limitation on length of list inside "IN" operator?

Is there any limitation on the number of elements that can be inserted on the IN operator?
I am asking this because I have a Hive-partitioned table (connected to a bucket with JSON files) that I need to query hourly to extract some information. To avoid re-processing files I have already processed, I use one of the partition fields as an identifier for the IDs I have already handled, so I can query only the new ones with NOT IN.
I'll show an example.
This is an example of the content of the bucket:
date=2021-05-15/id=ad9isjiodpa/file.jsonl
date=2021-05-15/id=sda0u9dsapo/file.jsonl
date=2021-05-15/id=adsi9ojdsds/file.jsonl
so I can make a query like this, to exclude those I already processed:
SELECT * FROM hive_table where id NOT IN ('ad9isjiodpa', 'sda0u9dsapo')
Usually this query processes around 30 GB per run, and everything works great; everyone is happy. The list usually doesn't have more than 2k elements.
Usually ...
Last time the number of elements exceeded 4k, and this resulted in 2.6 TB of data processed. That was extremely unlikely and made me think that it actually processed ALL the files in the bucket (inside the time range).
Is there some scenario, or documentation I didn't pay enough attention to? Do you know why it did process so much data? What did I do wrong?
My current fix is to split the list of elements into smaller chunks and do something like
SELECT * FROM hive_table where id NOT IN (<chunked_elements1>) AND id NOT IN (<chunked_elements2>) ...
Will this work?
Thank you very much in advance
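For reference, the chunking described above could be generated along these lines. This is only a sketch of how the WHERE clause is assembled; the helper names are hypothetical, IDs are assumed to be plain alphanumeric strings, and nothing here shows whether chunking changes how much data gets scanned.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_not_in_filter(column: str, ids: List[str], chunk_size: int) -> str:
    """Build 'col NOT IN (...) AND col NOT IN (...)' from a list of IDs."""
    clauses = []
    for chunk in chunked(ids, chunk_size):
        for value in chunk:
            # Assumption: IDs are alphanumeric, so no quoting or escaping is needed.
            if not value.isalnum():
                raise ValueError("unexpected characters in id: " + repr(value))
        quoted = ", ".join("'" + value + "'" for value in chunk)
        clauses.append(column + " NOT IN (" + quoted + ")")
    # With nothing to exclude, fall back to a filter that keeps every row.
    return " AND ".join(clauses) if clauses else "TRUE"


if __name__ == "__main__":
    processed = ["ad9isjiodpa", "sda0u9dsapo", "adsi9ojdsds"]
    print("SELECT * FROM hive_table WHERE " + build_not_in_filter("id", processed, 2))
```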

Save large BigQuery results to another project's BigQuery

I need to run a join query in BigQuery in one project that may return a large amount of data (which may not fit in a VM's memory), and then save the results to BigQuery in another project.
Is there an easy way to do this without loading the data into the VM, given that the data size can vary and the VM may not have enough memory to hold it?
One method is to bypass the VM for the operation and use Google Cloud Storage instead.
The process looks like the following:
Create a GS bucket that both projects have access to
Source project - Export the table to the GS bucket (this is possible from the web interface; pretty sure the CLI tools can do it too)
Destination project - Create a new table from the files in the GS bucket
To save the result of a query to a table in any project, you do not need to save it to a VM first. You just need to set the destination property correctly, and of course you need write permissions on the dataset that contains that table.
The destination property varies depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the property below:
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId string [Required]
The ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
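As an illustration of the properties listed above, a jobs.insert request body could be assembled roughly like this. It is a sketch only: the function name and the project, dataset and table values are hypothetical, the table ID check simply mirrors the rule quoted above, and sending the request (authentication, HTTP client) is left out.

```python
import json
import re

# Mirrors the quoted rule: letters, digits or underscores, at most 1,024 characters.
TABLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_]{1,1024}")


def build_query_job_body(sql: str, dest_project: str, dest_dataset: str, dest_table: str) -> dict:
    """Return a request body that sets configuration.query.destinationTable."""
    if not TABLE_ID_PATTERN.fullmatch(dest_table):
        raise ValueError("invalid destination table ID: " + repr(dest_table))
    return {
        "configuration": {
            "query": {
                "query": sql,
                "destinationTable": {
                    "projectId": dest_project,
                    "datasetId": dest_dataset,
                    "tableId": dest_table,
                },
            }
        }
    }


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    body = build_query_job_body(
        "SELECT a.id FROM ds.a AS a JOIN ds.b AS b ON a.id = b.id",
        "destination-project",
        "destination_dataset",
        "join_results",
    )
    print(json.dumps(body, indent=2))
```

The job runs in the project that submits it, while the results land in the project named under destinationTable, which is why write permission on that dataset is needed.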

Increasing Spark Read and Parquet Conversion Performance for Gzipped Text File

Use case:
A> Have Text Gzipped files in AWS s3 location
B> Hive Table created on top of the file, to access the data from the file as Table
C> Using Spark Dataframe to read the table and converting into Parquet Data with Snappy Compression
D> The number of fields in the table is 25, which includes 2 partition columns. The data type is String except for two fields, which have Decimal as their data type.
Used following Spark Option: --executor-memory 37G --executor-cores 5 --num-executors 20
Cluster Size - 10 Data Nodes of type r3.8xLarge
I found that the number of vCores used in AWS EMR is always equal to the number of files, maybe because gzip files are not splittable. The gzipped files come from a different system, and the file sizes are around 8 GB.
Total Time taken is more than 2 hours for Parquet conversion for 6 files with total size 29.8GB.
Is there a way to improve the performance via Spark, using version 2.0.2?
Code Snippet:
val srcDF = spark.sql(stgQuery)
srcDF.write.partitionBy("data_date","batch_number").options(Map("compression"->"snappy","spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"->"2","spark.speculation"->"false")).mode(SaveMode.Overwrite).parquet(finalPath)
It doesn't matter how many nodes you ask for, or how many cores there are: if you have 6 files, six threads will be assigned to work on them. Try one of the following:
save in a splittable format (e.g. snappy inside a container format such as Parquet)
get the source to save its data in many smaller files
do some incremental conversion into a new format as you go along (e.g. a single spark-streaming core polling for new gzip files, then saving elsewhere into snappy files). Maybe try AWS Lambda as the trigger for this, to save dedicating a single VM to the task.
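The point about six files can be put into numbers with a small back-of-the-envelope sketch. It assumes one read task per non-splittable file and uses the executor settings from the question; the function name is hypothetical and the model ignores everything except the read stage.

```python
from typing import Tuple


def read_parallelism(num_unsplittable_files: int, num_executors: int,
                     cores_per_executor: int) -> Tuple[int, int]:
    """Return (busy_cores, idle_cores) while reading non-splittable files.

    Assumption: each non-splittable file is read by exactly one task,
    and each task occupies one core.
    """
    total_cores = num_executors * cores_per_executor
    busy = min(num_unsplittable_files, total_cores)
    return busy, total_cores - busy


if __name__ == "__main__":
    # Settings from the question: --num-executors 20 --executor-cores 5, 6 gzip files
    busy, idle = read_parallelism(6, 20, 5)
    print(f"busy cores: {busy}, idle cores: {idle}")
```

Under these assumptions, adding executors changes only the idle count, which is why the suggestions above all aim at increasing the number of independently readable inputs.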

when unloading a table from amazon redshift to s3, how do I make it generate only one file

When I unload a table from amazon redshift to S3, it always splits the table into two parts no matter how small the table. I have read the redshift documentation regarding unloading, but no answers other than it says sometimes it splits the table (I've never seen it not do that). I have two questions:
Has anybody ever seen a case where only one file is created?
Is there a way to force redshift to unload into a single file?
Amazon recently added support for unloading to a single file by using PARALLEL OFF in the UNLOAD statement. Note that you still can end up with more than one file if it is bigger than 6.2GB.
By default, each slice creates one file (explanation below). There is a known workaround: adding a LIMIT to the outermost query forces the leader node to process the whole response, so it creates only one file.
SELECT * FROM (YOUR_QUERY) LIMIT 2147483647;
This only works as long as your inner query returns at most 2^31 - 1 records, since 2147483647 (2^31 - 1) is the largest value a signed 32-bit integer can hold.
How are files created? http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
Amazon Redshift splits the results of a select statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data.
So now we know that at least one file per slice is created. But what is a slice? http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
The number of slices is equal to the number of processor cores on the node. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices.
It seems that the minimal number of slices is 2, and it grows when more nodes or more powerful nodes are added.
As of May 6, 2014, UNLOAD queries support a new PARALLEL option. Passing PARALLEL OFF will output a single file if your data is less than 6.2 GB (data is split into 6.2 GB chunks).
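As a rough worked example of the 6.2 GB figure, the number of files to expect with PARALLEL OFF can be estimated as below. This is a sketch based only on the chunk size stated in the answers; the function name is hypothetical, and the size of the unloaded output is an input you would have to measure or guess.

```python
import math

MAX_FILE_GB = 6.2  # chunk size quoted in the answers above


def estimated_files_parallel_off(output_size_gb: float) -> int:
    """Estimate how many files UNLOAD ... PARALLEL OFF produces for a given output size."""
    if output_size_gb < 0:
        raise ValueError("size must not be negative")
    # At least one file is assumed even for a tiny result.
    return max(1, math.ceil(output_size_gb / MAX_FILE_GB))


if __name__ == "__main__":
    for size_gb in (0.5, 7.0, 20.0):
        print(size_gb, "GB ->", estimated_files_parallel_off(size_gb), "file(s)")
```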

How should I partition data in s3 for use with hadoop hive?

I have an S3 bucket containing about 300 GB of log files in no particular order.
I want to partition this data for use in hadoop-hive using a date-time stamp so that log lines related to a particular day are grouped together in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:
s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3
etc
What would be the best way for me to transform the data? Am I best off just running a single script that reads in one file at a time and outputs data to the right S3 location?
I'm sure there's a good way to do this using hadoop, could someone tell me what that is?
What I've tried:
I tried using hadoop-streaming by passing in a mapper that collected all log entries for each date and then wrote those directly to S3, returning nothing for the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million.)
Does anyone have any ideas how best to approach this?
If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
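The two properties named above can be passed on the command line as generic -D options, which have to come before the streaming-specific options. The helper below is a hypothetical sketch that only builds the flags. It turns off map-side speculation in every case and adds the reduce-side property when the job has a reducer, which is slightly broader than the either/or wording above.

```python
from typing import List


def speculation_off_flags(has_reducer: bool) -> List[str]:
    """Build -D options that disable speculative execution for a job."""
    flags = ["-D", "mapred.map.tasks.speculative.execution=false"]
    if has_reducer:
        flags += ["-D", "mapred.reduce.tasks.speculative.execution=false"]
    return flags


if __name__ == "__main__":
    # A map-only job, as in the question (the mapper writes to S3 itself).
    print(" ".join(speculation_off_flags(has_reducer=False)))
```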
Why not create an external table over this data, then use hive to create the new table?
create table partitioned (some_field string, `timestamp` string) partitioned by (created_date date);
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table partitioned partition (created_date) select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;
In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.