Parquet compression type via Impala

We have quite a few Impala tables defined and assume they use Snappy compression (the files are Parquet).
However, nobody really knows which compression type the existing tables actually use.
The Impala docs don't seem to explain how to get the compression type of an existing table.
Is there a way to find the compression type in use via Impala?

As of right now there is no command in Impala that tells you the compression type used by a table stored as Parquet, but there is a workaround: look at one of the Parquet files within the table and use the parquet-tools meta command to see the compression being used.
# Step 1: run hdfs dfs -ls to find the location and name of a Parquet file
hdfs dfs -ls /yourTableLocationPath
# Step 2: parquet-tools really only works locally right now, so copy the file to a local directory
hdfs dfs -get /yourTableLocationPath/yourFileName /yourLocalPath
# Step 3: run the parquet-tools meta command
parquet-tools meta /yourLocalPath/yourFileName
The output of the parquet-tools meta command shows the compression codec of each column chunk in the row group section.
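To avoid reading the meta output by eye, the codec names can be tallied with a small filter. This is only a sketch and not part of the original answer: the function name summarize_codecs is hypothetical, and it assumes each column chunk line has the form "name: TYPE CODEC ..." (for example "id: INT64 SNAPPY DO:0 ..."), which may differ between parquet-tools versions.

```shell
#!/usr/bin/env bash
# Tally compression codecs found in `parquet-tools meta` output read from stdin.
# ASSUMPTION: column chunk lines look like
#   id:   INT64 SNAPPY DO:0 FPO:4 SZ:61/59/0.97 VC:3 ENC:PLAIN
# so the first field ends in ":" and the third field is the codec name.
summarize_codecs() {
  awk '
    $1 ~ /:$/ && $3 ~ /^(UNCOMPRESSED|SNAPPY|GZIP|LZO|BROTLI|LZ4|ZSTD)$/ { count[$3]++ }
    END { for (c in count) print c, count[c] }
  ' | sort
}

# Usage (paths are the placeholders from the steps above):
#   parquet-tools meta /yourLocalPath/yourFileName | summarize_codecs
```

Each output line is a codec name followed by the number of column chunks that use it, so a table written entirely with Snappy would produce a single SNAPPY line.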

Related

hive not adding file extensions to file names

I've run the hive query
create table my_schema.my_table stored as parquet as select ...
It created the table and the files, but I do not see a .parq file extension on the files, which is a bit of a problem for me since I wanted to be able to run something like hdfs dfs -ls -R /path/to/directory | grep .parq to list all Parquet files in a directory.
Is there either a way to filter Parquet files regardless of file extension or a way to make Hive include the extension?
I have a similar query using Impala, and there I can see the .parq files without any issue.
Hive will not add an extension to the files it writes. You need to rename them manually:
$ hadoop fs -mv /path/to/directory/000000_0 /path/to/directory/data.parquet
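For the other half of the question, filtering regardless of extension, one option is to test the file contents: the Parquet format places the 4-byte magic string PAR1 at both the start and the end of every file. The sketch below is an illustration rather than something from the thread. It works on local copies of the files, and the function names is_parquet and list_parquet are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds if a LOCAL file starts and ends with the Parquet magic bytes "PAR1".
is_parquet() {
  local f="$1"
  [ -f "$f" ] || return 1
  # tr strips NUL bytes so the shell can hold the result in a variable.
  [ "$(head -c 4 "$f" | tr -d '\0')" = "PAR1" ] &&
    [ "$(tail -c 4 "$f" | tr -d '\0')" = "PAR1" ]
}

# Print every Parquet file directly inside a local directory.
list_parquet() {
  local dir="$1" f
  for f in "$dir"/*; do
    if is_parquet "$f"; then printf '%s\n' "$f"; fi
  done
}

# Usage (after copying the table directory locally with hdfs dfs -get):
#   list_parquet /yourLocalPath
```

For files that are still on HDFS, the leading bytes could presumably be checked the same way on the output of hdfs dfs -cat, at the cost of starting one client process per file.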

Extract BigQuery partitioned table

Is there a way to extract a complete BigQuery partitioned table with one command, so that the data of each partition is extracted into a separate folder named in the format part_col=date_yyyy-mm-dd?
Since a BigQuery partitioned table can read files from Hive-style partitioned directories, is there a way to extract the data in a similar layout? I can extract each partition separately, but that is very cumbersome when I am extracting a lot of partitions.
You could do this programmatically. For instance, you can export a single partition by using a partition decorator such as table$20190801. Then, in the bq extract command, you can use URI patterns (see the example of the workers pattern) for the GCS objects.
Since all objects will be within the same bucket, the folders are just a hierarchical illusion, so you can specify URI patterns on the folders as well, but not on the bucket.
So you would write a script that loops over the DATE value, with something like:
bq extract \
  --destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO] \
  --compression [GZIP, AVRO supports DEFLATE and SNAPPY] \
  --field_delimiter [DELIMITER] \
  --print_header [true, false] \
  '[PROJECT_ID]:[DATASET].[TABLE]$[DATE]' \
  'gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]'
You can't do it automatically with just a bq command. For this it would be better to raise a feature request as suggested by Felipe.
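The loop described above can be sketched as a function that prints one bq extract command per partition date, so the commands can be reviewed before anything runs. This is a minimal illustration under stated assumptions: the function name extract_cmds, the part-*.csv file name pattern, and the choice of CSV are all hypothetical, and the dates are passed in explicitly as YYYYMMDD values rather than discovered from the table.

```shell
#!/usr/bin/env bash
# Print one `bq extract` command per partition date (YYYYMMDD).
# Arguments: TABLE BUCKET DATE [DATE...]
# The folder name is written as part_col=YYYY-MM-DD (Hive-style layout).
extract_cmds() {
  local table="$1" bucket="$2" d iso
  shift 2
  for d in "$@"; do
    # Turn 20190801 into 2019-08-01 for the folder name.
    iso="$(printf '%s\n' "$d" | sed 's/^\(....\)\(..\)\(..\)$/\1-\2-\3/')"
    # Single quotes in the output keep the shell from expanding $DATE or the wildcard.
    printf "bq extract --destination_format=CSV '%s\$%s' 'gs://%s/part_col=%s/part-*.csv'\n" \
      "$table" "$d" "$bucket" "$iso"
  done
}

# Usage: review the printed commands, then pipe them to a shell to run them.
#   extract_cmds mydataset.mytable mybucket 20190801 20190802
#   extract_cmds mydataset.mytable mybucket 20190801 20190802 | sh
```

Printing the commands instead of executing them directly keeps the sketch safe to try and makes it easy to spot a wrong bucket or date before any data is exported.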
Set the project as test_dataset using gcloud init before running the below command.
bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/20210716/test*.csv
This will create a folder with the name 20210716 inside testbucket and write the file there.

LOAD DATA INPATH table files start with some string in Impala

Just a simple question; I'm new to Impala.
I want to load data from HDFS into my data lake using Impala.
I have a CSV file, this_is_my_data.csv, and I want to load it without specifying the full file name, with something like the following:
LOAD DATA INPATH 'user/myuser/this_is.*' INTO TABLE my_table
That is, a string starting with this_is followed by anything.
If you need any additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the
HDFS path of a directory to move all the files inside that directory.
You cannot specify any sort of wildcard to take only some of the files
from a directory.
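The workaround is to put your files into the table directory using the mv or cp command. Find your table directory with the DESCRIBE FORMATTED command, then run mv or cp (in a shell, not in Impala, of course). HDFS shell paths take glob patterns rather than regular expressions, so the pattern is this_is* and not this_is.*:
hdfs dfs -mv "/user/myuser/this_is*" "/user/cloudera/mytabledir"
After moving files this way, run REFRESH my_table in Impala so that it picks up the new files.
Alternatively, put the files you need to load into a separate directory first, then load that whole directory.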
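The second variant, staging the matching files in their own directory and then loading that directory, can be sketched as follows. This is a hedged illustration, not a tested recipe: the function names run and load_matching, the DRY_RUN switch, and the staging path are hypothetical, and the pattern is written as an HDFS glob (this_is*). With DRY_RUN=1 the commands are only printed.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1 (printing drops the quoting).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Move files matching an HDFS glob into a staging directory, then load it.
# Arguments: GLOB STAGING_DIR TABLE
load_matching() {
  local pattern="$1" staging="$2" table="$3"
  run hdfs dfs -mkdir -p "$staging"
  # The glob stays quoted so that HDFS expands it, not the local shell.
  run hdfs dfs -mv "$pattern" "$staging"
  # LOAD DATA accepts a directory and moves every file inside it into the table.
  run impala-shell -q "LOAD DATA INPATH '$staging' INTO TABLE $table"
}

# Usage:
#   DRY_RUN=1 load_matching '/user/myuser/this_is*' /user/myuser/staging my_table
```

Keeping the staging directory separate from the table directory means that only the files you selected are loaded, and Impala itself performs the final move.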

Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. The directories are created one per ingestion date and are added as partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in a Spark job. From the program I can either read these HDFS directories directly by giving the exact directory path (Option 1), or read from Hive into a DataFrame and process it (Option 2).
I would like to know whether there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details.
Thanks in advance.
Selecting a subset of the columns by name is most convenient via spark.sql, since the column names come from the Hive metastore. In your use case I don't think there will be a significant difference.
With Spark SQL on the Hive table you also get partition pruning automatically when you filter on INGEST_DATE.

Where does Hive store its tables?

I am new to Hadoop and have just started working with Hive. In my understanding, it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in HDFS and we want to make a Hive table out of that data, what will be the size of that table and where is it stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table").
With an external table the data itself stays on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), and Hive only records a mapping to it in the metastore, whereas for a managed table Hive keeps the data in its own warehouse directory on HDFS.
When you drop a managed table, the underlying data is dropped as well, whereas dropping an external table removes only the metadata in the metastore that references the data.
Either way you are using only 100 GB as viewed by the user and are taking advantage of HDFS's robustness through replication of the data.
Hive will create a directory on HDFS. If you don't specify a location, it creates the table directory under /user/hive/warehouse on HDFS (the default value of hive.metastore.warehouse.dir). After a LOAD command the files are moved to /user/hive/warehouse/tablename. You can also point to an existing HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.
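To see where a particular table lives and how much space it takes, the Location row of DESCRIBE FORMATTED can be pulled out and passed to hdfs dfs -du. The sketch below is illustrative only: the function name table_location and the table name my_table are hypothetical, and it assumes the output contains a row whose first field is "Location:" followed by the path.

```shell
#!/usr/bin/env bash
# Extract the table path from `DESCRIBE FORMATTED` output read from stdin.
# ASSUMPTION: the output has a row such as
#   Location:    hdfs://namenode:8020/user/hive/warehouse/my_table
table_location() {
  awk '$1 == "Location:" { print $2; exit }'
}

# Usage:
#   loc="$(hive -e 'DESCRIBE FORMATTED my_table' | table_location)"
#   hdfs dfs -du -s -h "$loc"    # total size of the files under the table directory
```

For an external table the reported path is whatever directory was given when the table was created, so the same two commands work for both table types.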