We're using Apache Zeppelin to analyse our datasets. Some of our queries return a large number of results, and we would like to run them in Zeppelin but save the full result set (the display is limited to 1000 rows). Is there an easy way to get Zeppelin to save all the results of a query, maybe to an S3 bucket?
I managed to whip up a notebook that effectively does what I want using the Scala interpreter.
// pull in the spark-csv package so the CSV writer is available
z.load("com.databricks:spark-csv_2.10:1.4.0")

val df = sqlContext.sql("""
  select * from table
""")

// collapse to a single partition so one CSV file is written, then save to S3
df.repartition(1).write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3://amazon.bucket.com/csv_output/")
It's worth mentioning that the z.load call seemed to work for me one day, but when I tried it again I had to declare it in its own paragraph with the %dep interpreter, and then run the remaining code in the standard Scala interpreter.
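For reference, this is roughly what that split looks like as two separate note paragraphs (a sketch reusing the same artifact coordinates and query as above):

%dep
// the dependency paragraph must be run before the Spark interpreter has started
z.load("com.databricks:spark-csv_2.10:1.4.0")

%spark
val df = sqlContext.sql("select * from table")
df.repartition(1).write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3://amazon.bucket.com/csv_output/")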
I have the following script to run a SQL query:
val df_joined_sales_partyid = spark.sql("""
  SELECT a.sales_transaction_id, b.customer_party_id, a.sales_tran_dt
  FROM df_sales_tran a
  JOIN df_sales_tran_party_xref b
    ON a.sales_transaction_id = b.sales_transaction_id
  LIMIT 3""")
I want to know how I can save the output of this query as a permanent DataFrame/table. I noticed that every time I run display(df_joined_sales_partyid), it seems to run the query again. How do I avoid running the query multiple times and keep the results around? I am new to writing Scala, so forgive me if this is an easy question, but I couldn't find a solution online.
import org.apache.spark.storage.StorageLevel

// caches results in memory
df_joined_sales_partyid.cache()

// or, to cache in memory and spill to disk; see
// https://spark.apache.org/docs/2.4.0/api/java/index.html?org/apache/spark/storage/StorageLevel.html
// for other possible values
df_joined_sales_partyid.persist(StorageLevel.MEMORY_AND_DISK)
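Note that caching is lazy: nothing is stored until an action runs against the DataFrame. A minimal sketch of the reuse, using the same DataFrame name as above:

// the first action runs the query and populates the cache
df_joined_sales_partyid.count()
// later actions (including display()) read from the cached data instead of re-running the query
df_joined_sales_partyid.show(3)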
I am new to Spark and I am trying to access a Hive table from Spark.
1) Created a HiveContext from the SparkContext:
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val hivetable = hc.sql("SELECT * FROM test_db.Table")
My question is: I got the table into Spark.
1) Why do we need to register the table?
2) We can perform SQL operations directly, so why do we need DataFrame functions
like join, select, filter, etc.?
What makes the difference between a SQL query and the equivalent DataFrame operations?
3) What is Spark optimization? How does it work?
You don't need to register a temporary table if you are accessing the Hive table using Spark's HiveContext. Registering a DataFrame as a temporary table allows you to run SQL queries over its data. Suppose a scenario where you are reading data from a file at some location and you want to run SQL queries over that data:
then you need to create a DataFrame from the Row RDD and register a temporary table over this DataFrame to run the SQL operations. To perform SQL queries over that data, you need to use Spark's SQLContext in your code.
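A minimal sketch of that file-based scenario, using the 1.x-era API to match the question (the input file "people.csv" and its name/age columns are placeholders for illustration):

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
// build a Row RDD from a hypothetical "name,age" text file
val rowRdd = sc.textFile("people.csv")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val peopleDf = sqlContext.createDataFrame(rowRdd, schema)
// registering makes the data visible to SQL under the name "people"
peopleDf.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18").show()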
Both methods use exactly the same execution engine and internal data structures. At the end of the day it all boils down to the personal preferences of the developer.
Arguably DataFrame queries are much easier to construct programmatically and
provide a minimal degree of type safety.
Plain SQL queries can be significantly more concise and easier to understand.
They are also portable and can be used without any modifications in every supported language. With HiveContext they can also be used to expose some functionality which may be inaccessible in other ways (for example, UDFs without Spark wrappers).
Reference: Spark sql queries vs dataframe functions
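To illustrate the equivalence, here is a small sketch of the same query written both ways (hc is the HiveContext from the question; the table and column names are made up for the example):

import hc.implicits._   // brings in the $"col" syntax

// SQL version
val viaSql = hc.sql(
  "SELECT a.id, b.name FROM t1 a JOIN t2 b ON a.id = b.id WHERE a.amount > 100")

// DataFrame version of the same query
val viaDf = hc.table("t1").as("a")
  .join(hc.table("t2").as("b"), $"a.id" === $"b.id")
  .where($"a.amount" > 100)
  .select($"a.id", $"b.name")

// both should produce equivalent physical plans
viaSql.explain()
viaDf.explain()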
Here is a good read on the performance comparison between Spark RDDs vs DataFrames vs SparkSQL.
Apparently I don't have an answer for that one and will leave it to you to do some research on the net and find a solution :)
With code like the one below, I insert a DataFrame into a Hive table. The output HDFS files of the Hive table end up as too many small files. How can I merge them when saving to Hive?
myDf.write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
When there are 100 tasks, it will produce 100 small files.
Is using coalesce on the DataFrame a good idea?
myDf.coalesce(3).write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
Why do the Hive settings below not work?
sqlContext.sql("set hive.merge.mapfiles=true")
sqlContext.sql("set hive.merge.sparkfiles=false")
sqlContext.sql("set hive.merge.smallfiles.avgsize=16000000")
sqlContext.sql("set hive.merge.size.per.task=256000000")
Thanks a lot for any help.
I encountered this problem and found issue-24940.
Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in the Spark SQL query will control the number of output files.
In my practice I recommend the second hint (REPARTITION), because it generates a new stage to do this work, while the first one does not, which may leave the job stuck because of too few tasks in the last stage.
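A sketch of how that hint might be applied to the case above, assuming a Spark version recent enough to support these hints; the SparkSession name spark, the source_table name, and dynamic partitioning being enabled on the target table are assumptions of the sketch:

// REPARTITION(3) adds a shuffle stage that leaves the result with 3 partitions,
// so the insert below writes at most 3 files per Hive partition instead of one per task
spark.sql("SELECT /*+ REPARTITION(3) */ * FROM source_table")
  .write.format("orc").insertInto("myHiveTable")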
That's because Spark SQL writes a number of files that corresponds to the number of Spark partitions, even if the dynamic partitioning config is on.
I faced the same problem. In my view, the configurations mentioned above are only applicable to Hive running on the MapReduce engine; in my case, HiveQL commands work well (the small files are merged).
See Hive architecture for more detail.
I'm running Hive 1.0 and trying to compute column statistics using the built-in analyze command. The HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
This kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.
I've run Hive on elastic mapreduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an elastic mapreduce cluster and run several Hive commands while getting their output in a parsable form.
I know that spinning up a cluster can take some time, so maybe my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like this somewhat concrete, hypothetical example:
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
You may use MRJOB. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.
An alternative is HiPy, an awesome project which should perhaps be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping together in a single script of query construction, transform scripts and post-processing. This assists in traceability, documentation and re-usability of scripts. Everything appears in one place and Python comments can be used to document the script.

Hive queries are constructed by composing a handful of Python objects, representing things such as Columns, Tables and Select statements. During this process, HiPy keeps track of the schema of the resulting query output.

Transform scripts can be included in the main body of the Python script. HiPy will take care of providing the code of the script to Hive as well as of serialization and de-serialization of data to/from Python data types. If any of the data columns contain JSON, HiPy takes care of converting that to/from Python data types too.
Check out the Documentation for details!