I have a DataFrame in PySpark with data like below (each node is listed with its parent; node 1 is the root):
node  parent
1     (none)
3     1
5     1
2     3
4     2
Now I need to know all the parents (ancestors) of node 4, so that I get an output like
1,3,2
Is this possible using a Hive SQL query?
SQL itself supports neither iteration nor recursion...
With a CONNECT BY clause you could just let the database handle the recursion and pretend it was easy >> not available in Hive.
With a procedural language wrapper (e.g. T-SQL, PL/SQL, PL/pgSQL) you could iterate until each leaf is connected to the root (...verbose code, no fun to test...) >> not available in Hive either, unless you use Python to manage the iterations, running an INSERT query on each iteration and then collecting the results.
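If the data is already in a PySpark DataFrame, that last option is quite workable: let Python drive the loop and climb the hierarchy one level at a time. Below is a minimal sketch, assuming columns named node and parent (the column names are illustrative, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# node -> parent mapping; NULL marks the root
df = spark.createDataFrame(
    [(1, None), (3, 1), (5, 1), (2, 3), (4, 2)],
    "node INT, parent INT",
)

def ancestors(df, start):
    # Climb the parent chain one level per loop iteration
    # (each iteration is one small Spark action).
    chain, current = [], start
    while True:
        row = df.filter(df.node == current).select("parent").first()
        if row is None or row["parent"] is None:
            break
        chain.append(row["parent"])
        current = row["parent"]
    return chain

print(ancestors(df, 4))  # [2, 3, 1] -- the ancestors of node 4, bottom-up

This is fine for shallow hierarchies; for deep or wide ones you would rather do an iterative self-join that resolves a whole level per pass, but the idea (Python handles the recursion Hive lacks) is the same.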
I have built a data engineering pipeline in a Snowflake SQL environment. The output of these pipelines is used in our data science model.
The code is organized as follows (steps 1, 2, and 3 are SQL queries):
Step 1 query - input is raw data table
Step 2 query - input is output of Step 1 view
Step 3 query - input is output of Step 2 view
Output of Step 3 is final output.
For automation, I am planning to create a SQL view for each of steps 1, 2, and 3.
Question: if I use SQL views, then when I do select * from the step 3 view, would it run steps 1 and 2? I want the step 1 and 2 code to run every time I pull data from step 3. Is there any other way of automating this? I am new to the Snowflake environment; do I have to worry about materialized vs. non-materialized views?
Views are evaluated at query time. If you set up the logic using views, the queries will be processed in the order needed to generate the correct result set.
One downside to views is that if they are referenced more than once, then the logic may be repeated.
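As a rough sketch of what that wiring can look like from Python, using the snowflake-connector-python package (the connection parameters, the raw_orders table, and the transformations inside each view are placeholders, not the poster's actual queries):

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder credentials
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# Creating a view only stores the query text; nothing runs yet.
cur.execute("CREATE OR REPLACE VIEW step1 AS SELECT id, amount FROM raw_orders")
cur.execute("CREATE OR REPLACE VIEW step2 AS SELECT id, amount * 2 AS amount2 FROM step1")
cur.execute("CREATE OR REPLACE VIEW step3 AS SELECT id, SUM(amount2) AS total FROM step2 GROUP BY id")

# Selecting from step3 expands step2 and step1 at query time,
# so the whole pipeline re-runs on every pull.
cur.execute("SELECT * FROM step3")
for row in cur.fetchall():
    print(row)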
I have two hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode', with all columns included in the result. However, the following query causes an error:
select content from ode limit 5;
Where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be a lot cheaper, so why does it cause a memory issue? How can I fix it?
When you select the whole table, Hive triggers a fetch task instead of an MR job; the fetch involves no parsing (it is like calling hdfs dfs -cat ... | head -5), while projecting a single column forces each record to be deserialized.
As far as I can see, in your case the Hive client tries to run that map work locally and runs out of heap while doing so.
You can choose one of two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.
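For example, both workarounds can be driven from Python through the hive CLI (assuming it is on the PATH; the 2 GB heap below is just an example value):

import os
import subprocess

query = "SELECT content FROM ode LIMIT 5"

# Option 1: disable fetch-task conversion so the query runs as a real MR job
# instead of being parsed inside the Hive client JVM.
subprocess.run(
    ["hive", "--hiveconf", "hive.fetch.task.conversion=none", "-e", query],
    check=True,
)

# Option 2: keep the fetch task but give the Hive client JVM a bigger heap.
env = dict(os.environ, HADOOP_CLIENT_OPTS="-Xmx2g")
subprocess.run(["hive", "-e", query], env=env, check=True)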
There is Hive 2.1.1 over MR, a table test_table stored as SequenceFile, and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (as a fetch task), sometimes scanning the HDFS files takes longer than triggering a single map job would.
When I want to enforce MR execution, I make the query more complex: e.g., using distinct. The significant drawbacks of this approach are:
Query results may differ from the original query's
Brings meaningless computational load onto the cluster
Is there a recommended way to force MR execution when using Hive-on-MR?
The Hive executor decides whether to run a map task or a fetch task based on the following settings (defaults shown in parentheses):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
This suggests the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I stuck with the second option; it seems fine for ad-hoc queries.
More details regarding these settings can be found in Cloudera forum and Hive wiki.
Just add set hive.execution.engine=mr; before your query and it will force Hive to use MR.
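If the ad-hoc query is issued from Python rather than the CLI, the same settings can be applied per session; here is a sketch using PyHive (the HiveServer2 host and port are placeholders):

from pyhive import hive

conn = hive.connect(
    host="hiveserver2-host",   # placeholder
    port=10000,
    configuration={
        # First answer: never convert the query into a fetch task.
        "hive.fetch.task.conversion": "none",
        # Second answer: select the MR execution engine.
        "hive.execution.engine": "mr",
    },
)
cur = conn.cursor()
cur.execute("SELECT t.* FROM test_table t WHERE t.test_column = 100")
print(cur.fetchall())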
I've run Hive on Elastic MapReduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an Elastic MapReduce cluster and run several Hive commands, getting their output in a parsable form.
I know that spinning up a cluster can take some time, so my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like this somewhat concrete hypothetical example:
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
You may use mrjob. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms, including Amazon EMR.
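A minimal mrjob job looks like this (a word-count sketch, only to show the shape of the API; it is not tied to the Hive tables in the question):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Count words in the input lines.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally with python wordcount.py input.txt, or on EMR with python wordcount.py -r emr s3://mybucket/input/ once your AWS credentials are configured; mrjob handles spinning the cluster up and tearing it down.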
An alternative is HiPy, an excellent project that should perhaps be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping together in a single script of query construction, transform scripts and post-processing. This assists in traceability, documentation and re-usability of scripts. Everything appears in one place and Python comments can be used to document the script.
Hive queries are constructed by composing a handful of Python objects, representing things such as Columns, Tables and Select statements. During this process, HiPy keeps track of the schema of the resulting query output.
Transform scripts can be included in the main body of the Python script. HiPy will take care of providing the code of the script to Hive as well as of serialization and de-serialization of data to/from Python data types. If any of the data columns contain JSON, HiPy takes care of converting that to/from Python data types too.
Check out the Documentation for details!
I'm using a Java driver for SQLite 3 and want to find out if there is any way to get data from multiple databases (db1.db, db2.db) in one query.
Does any driver for SQLite3 support this at the moment?
Say db1 has 100 rows and db2 has 100 rows; my requirement is to get all 200 rows with a single query.
You'll probably want to look at the ATTACH DATABASE command:
http://www.sqlite.org/lang_attach.html
The ATTACH DATABASE statement adds another database file to the current database connection.
Here's a tutorial on how to use it:
http://longweekendmobile.com/2010/05/29/how-to-attach-multiple-sqlite-databases-together/
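For a quick illustration of the same pattern, here it is with Python's built-in sqlite3 module rather than a Java driver (the items table name is made up for the example):

import sqlite3

# Open db1.db, then attach db2.db under the alias "db2".
conn = sqlite3.connect("db1.db")
conn.execute("ATTACH DATABASE 'db2.db' AS db2")

# One statement reads from both files; with 100 rows in each,
# this returns all 200 rows in a single result set.
rows = conn.execute(
    "SELECT * FROM main.items UNION ALL SELECT * FROM db2.items"
).fetchall()
print(len(rows))

The same ATTACH statement should work from a JDBC SQLite driver as well, since it is just SQL executed over the connection.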