My Hive query has multiple outer joins and takes a very long time to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.
I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data is the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). Is there a way to get distributed writing of the result of a regular SELECT, instead of a single file? That is, without bucketing or partitioning.
I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts; for instance, the Filter transform (sketched after the link) can keep only the rows you need.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
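A minimal sketch of that idea, assuming a DynamicFrame named dyf that you have already read from the source table and a timestamp column named created_at (both names are hypothetical; if the column arrives as a string you would parse it first):

from awsglue.transforms import Filter
from datetime import datetime, timedelta

# Hypothetical cutoff: only rows created within the last 24 hours.
cutoff = datetime.utcnow() - timedelta(hours=24)

# Keep rows whose (assumed) creation-date column is newer than the cutoff;
# dyf is a DynamicFrame you have already built from the Glue catalog.
recent = Filter.apply(frame=dyf, f=lambda row: row["created_at"] >= cutoff)

# recent can then be written out as CSV with your existing sink logic.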
You haven't mentioned the type of database that you are trying to connect to. In any case, for JDBC connections Spark has the query option, in which you can issue a normal SQL query to fetch only the rows you need.
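A minimal sketch of that approach, assuming Spark 2.4+ (which added the query option; on older versions you can pass a subquery through dbtable instead), a PostgreSQL source, and a created_at timestamp column; the URL, credentials, and names are placeholders:

# Read only the last 24 hours of rows straight from the source database;
# the WHERE clause is executed by the database, not by Spark.
recent_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")   # placeholder URL
    .option("user", "my_user")                               # placeholder credentials
    .option("password", "my_password")
    .option("query",
            "SELECT * FROM my_table "
            "WHERE created_at >= NOW() - INTERVAL '24 hours'")
    .load())

# Write just those rows out as CSV (placeholder path).
recent_df.write.csv("s3://my-bucket/recent-rows/", header=True)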
Is there a way for PXF to select only the columns used in the query, apart from Hive partition filtering?
I have data stored in Hive ORC format and use a PXF external table to execute queries in HAWQ. The biggest tables are stored in Hive, and we cannot make another copy of the data in HAWQ.
Thanks--
P.S. Does the query optimizer collect stats on external tables in HAWQ 2.0?
You can always run a select foo from bar type query on external tables in HAWQ. However, if your question is whether PXF actually does column projection to avoid reading all the columns, then the answer is no. Currently PXF reads all columns from an ORC file and returns the records to HAWQ, which then does the projection filtering on its end. However, https://issues.apache.org/jira/browse/HAWQ-583 is actively being worked on and should ship in an upcoming version of HAWQ; it will push column projection down to ORC to improve read performance of ORC files.
Yes, the query optimizer does collect statistics on external tables; this is also handled by PXF. However, it only works for some data sources: https://issues.apache.org/jira/browse/HAWQ-44
Using the code below, I insert a DataFrame into a Hive table. The resulting HDFS files of the Hive table contain too many small files. How can I merge them when saving to Hive?
myDf.write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
When there are 100 tasks, it produces 100 small files.
Is using coalesce on the DataFrame a good idea?
myDf.coalesce(3).write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
Why do the Hive configurations below not work?
sqlContext.sql("set hive.merge.mapfiles=true")
sqlContext.sql("set hive.merge.sparkfiles=false")
sqlContext.sql("set hive.merge.smallfiles.avgsize=16000000")
sqlContext.sql("set hive.merge.size.per.task=256000000")
Thanks a lot for any help.
I encountered this problem and found SPARK-24940.
Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query will control the number of output files.
In my practice I recommend the second hint (REPARTITION) for users, because it generates a new stage to do this job, while the first one won't, and that can effectively kill the job because of too few tasks in the last stage.
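A minimal sketch of the hint approach (the hints need Spark 2.4 or later; the view name is made up, and the view's columns are assumed to line up with the table, partition column last):

# Expose the DataFrame from the question to Spark SQL.
myDf.createOrReplaceTempView("my_df_view")

# REPARTITION(3) forces a shuffle into 3 partitions, so at most 3 tasks write
# output files, while the stages before it keep their original parallelism.
sqlContext.sql("""
  INSERT INTO myHiveTable
  SELECT /*+ REPARTITION(3) */ *
  FROM my_df_view
""")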
That's because Spark SQL produces a number of files that corresponds to the number of Spark partitions, even when the dynamic partitioning config is on.
I faced the same problem. In my view, the configurations mentioned above only apply to Hive on the MapReduce engine; in my case, HiveQL commands work well (the small files are merged).
See Hive architecture for more detail.
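Given that, a common workaround on the Spark side (a sketch only, reusing the write call from the question) is to repartition by the Hive partition column first, so each datestr value is written by a single task and yields a single file. Note that this can create skewed tasks if one datestr holds most of the data, and on Spark 2.x the partitionBy call should be dropped because insertInto takes the partitioning from the table itself:

# One shuffle keyed on the partition column, then the original write.
myDf.repartition("datestr").write.format("orc").partitionBy("datestr").insertInto("myHiveTable")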
I currently have Hadoop 2, Pig, Hive and HBase.
I have some input data, which I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which big data component (Pig/Hive/HBase) should I create the staging table? It will have data coming in based on a condition. Later, we might want to run MapReduce jobs with complex logic on it.
Please assist
Hive: if you have an OLAP kind of workload and don't need real-time read/write.
HBase: if you have an OLTP kind of workload and need to do real-time/streaming read/write. Some batch or OLAP processing can be done using MapReduce, and SQL-like querying is possible using Apache Phoenix.
You can run MapReduce jobs on both Hive and HBase.
Anywhere you want. Pig is not an option, as it does not have a metastore. Use Hive if you want SQL-like queries; use HBase based on your access patterns.
When you run a Hive query on top of the data, it is converted into MapReduce.
If you create it in Hive, use Hive queries and not MapReduce. If you are using MapReduce, then use Pig; you will not benefit from creating a Hive table on top of the data.