I have a parquet file created from text /dat file using Pig Script.
Now i would like to know how many records in the parquet file without reading the file?
Is there anyway, Parquet file stores the number of rows somewhere in meta-data?
Read from the path using parquet.pig.ParquetLoader. Then the parqet file will be a normal file and then you can go for a count of the records.
LOGS = LOAD '/X/Y/abc.parquet' USING parquet.pig.ParquetLoader ;
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
dump LOG_COUNT;
Related
How can I have impala or hive return the file format of the underlying files on HDFS for a table?
I tried:
SHOW FILES database.table_name
This ilst the files, but the problem is that some people stored parquet files as .parq and others .parquet. Is there anyway to return the file format, such that one could use it in a new create statement?
Use good old show create table mytable.
You can check the output and it clearly mentions file format. It also shows folder inside which file are stored - you should not try to use file name - let impala decide the name. below is a sample result from impala.
result
CREATE TABLE edh.mytable (
column1 STRING
)
STORED AS PARQUET --file format
LOCATION 's3a://cc-mys3/edh/user/hive/warehouse/edh.db/mytable' --folder location
I have a directory in which I have 2 parquet files with same schema but columns order are different
I want to know how spark decides column order when reading the directory
Input directory
Dataframe 1 while reading 1.parquet file
Dataframe 2 while reading 2.parquet file
When reading complete directory
Column order depend of schema metadata , you can use a parquet viewer to inspect each file.
You can also provide a schema when reading parquet file to get all the time the same columns order.
val parquetSchema: Structype = new structype()
.add("id",IntegerType,true)
.add("login",StringType,true)
spark.read.schema(parquetSchema).parquet(...)
I am currently working on a Pyspark application to output daily delta extracts as parquet. These files are to be a single partition (the natural partition will be on the date the data is created/updated, which is how they are being built).
I was planning to then take the outputted parquet folder and files, rename the actual parquet file itself, move it to another location and discard the original *.parquet directory including its _SUCCESS and *.crc files.
While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
If you are having one parquet file and renaming that file to new filename then new file will be a valid parquet file.
If you are combining one or more parquet files and combining them to one then the combined file will not be a valid parquet file.
In case you are combining more parquet files into one then its better to create one file by using spark (using repartition) and write to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through dataprep and create Bigquery tables with corresponding names. This job should be run everyday with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into raw folder. A Cloud function is then run to move 1 file from raw folder into queue folder if it's empty, do nothing otherwise.
Dataprep scans the queue folder as per scheduler. If a csv file is found (eg. data_20180103.csv) the corresponding job is run, output file is put into wrangled folder (eg. data.csv).
Another Cloud function is run whenever a new file is added to wrangled folder. This one will create a new BigQuery table with name according to the timestamp column in csv file (eg. 20180103). It also delete all files in queue and wrangled folder and proceed to move 1 file from raw folder to queue folder if there's any.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.
In hive managed tables is there anyway to input/specify the filename for the data files getting created?
For example, the below data file ends with "000000_0", is it possible to get that file generated with specific name?
hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders_partitioned/order_month=Apr/000000_0
There is no way to specify the file name when you input the data using hive cli or sqoop. But there is a way to input the specified file using copy command
hadoop fs -cp <src_file> <dest_folder>
In this case you have to be careful the data in this source file is to be matched exactly with the partition condition of destination directory.