select
concat_ws(',', collect_list(cast(col1 as string))) as col1_list,
concat_ws(',', collect_list(cast(col2 as string))) as col2_list,
concat_ws(',', collect_list(cast(col3 as string))) as col3_list,
concat_ws(',', collect_list(cast(col4 as string))) as col4_list
-- about 100 columns here to concat_ws
from
table
group by
id
When I run this query on the Hive table, YARN always raises an error even though I increase the memory size: attempt_1619436551724_1645902_4_01_000014_0:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space.
If I don't use the concat_ws statements, the query runs successfully. Does concat_ws need that much memory?
I am executing SQL that creates two columns using the LISTAGG function and it works just fine in Aginity.
SELECT po_num::numeric AS po_num,
       MIN(tran_date) AS tran_date,
       LISTAGG(tran_date, ',') WITHIN GROUP (ORDER BY tran_date DESC) AS all_rm_dates,
       LISTAGG(rm_num, ',') WITHIN GROUP (ORDER BY tran_date DESC) AS all_rm_numbers
FROM main.curr_rm
GROUP BY 1
I tried a few variations on the above SQL and none would return data for the LISTAGG columns in the ADODB Recordset.Open result set. My standard process is to save the Aginity SQL file as a text file and assign the file contents to a string variable within the VBA code. This process has worked great for the past two years. The problem is that the LISTAGG columns are all empty (but are not when the SQL is run in Aginity). ADODB Recordset.Open statement:
rs.Open sql, CN, adOpenStatic, adLockOptimistic, adCmdText
The Recordset.Open call with the matching SQL executes and does not cause an error. The resulting records match between Aginity and the ADODB recordset except for the LISTAGG columns, which are empty in the recordset. I can't figure out why. I'm unsure if it is a limitation of the Amazon Redshift x64 driver, ADODB, or my arguments in the Recordset.Open statement.
I have 70 columns in my Hive table and I want to fetch all the rows that match exactly on all 70 columns, i.e. if two rows contain the same data in every column, I need to find those rows and count them as '2'. I'm writing the query below.
SELECT (all 70 columns),COUNT(*) AS CountOf FROM tablename GROUP BY (all 70 columns)
HAVING COUNT(*)>1;
but it shows:
Error: Error while compiling statement: FAILED: SemanticException [Error 10411]:
Grouping sets size cannot be
greater than 64 (state=42000,code=10411)
Is there any way to find the count of exact duplicate rows in a Hive table?
This is a bug, HIVE-21135, in Hive 3.1.0. It is fixed in Hive 4.0.0 (see HIVE-21018) and has not been backported.
As a workaround, try concatenating the columns with a delimiter in a subquery before aggregation; I'm not sure whether it will help or not.
Like this, using concat(), concat_ws(), or the || operator:
select concat_ws ('~', col1, col2, col3, col4)
...
group by concat_ws ('~', col1, col2, col3, col4)
or
col1||'~'||col2||'~'||...||colN
NULLs should also be taken care of: replace them with empty strings before concatenation using the NVL function.
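Putting it together, a minimal sketch of the workaround (the table name tablename and the columns col1..col70 are placeholders from the question, and I'm assuming the columns may need casting to string):
SELECT concat_cols, COUNT(*) AS CountOf
FROM (
  SELECT concat_ws('~',
                   nvl(cast(col1 AS string), ''),
                   nvl(cast(col2 AS string), ''),
                   -- ...repeat for the remaining columns up to col70
                   nvl(cast(col70 AS string), '')) AS concat_cols
  FROM tablename
) t
GROUP BY concat_cols
HAVING COUNT(*) > 1;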
I wish to migrate from Legacy SQL to Standard SQL
I had the following code in Legacy SQL
SELECT
hits.page.pageTitle
FROM [mytable]
WHERE hits.page.pageTitle contains '%'
And I tried this in Standard SQL:
SELECT
hits.page.pageTitle
FROM `mytable`
WHERE STRPOS(hits.page.pageTitle, "%")
But it gives me this error:
Error: Cannot access field page on a value with type
ARRAY<STRUCT<...>> at [4:21]
Try this one:
SELECT
hits.page.pageTitle
FROM `table`,
UNNEST(hits) hits
WHERE REGEXP_CONTAINS(hits.page.pageTitle, r'%')
LIMIT 1000
In the ga_sessions schema, "hits" is an ARRAY (that is, a REPEATED field). You need to apply the UNNEST operation in order to work with arrays in BigQuery.
Is it possible to append the results of running a query to a table using the bq command line tool? I can't see any flags to specify this, and when I run it, it fails and states "table already exists":
bq query --allow_large_results --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"
BigQuery error in query operation: Error processing job '':
Already Exists: Table project:DATASET.table
Originally BigQuery did not support the standard SQL idiom
INSERT foo SELECT a,b,c from bar where d>0;
and you had to do it their way with --append_table.
But according to #Will's answer, it works now.
Originally with bq, there was
bq query --append_table ...
The help for the bq query command is
$ bq query --help
And the output shows an append_table option in the top 25% of the output.
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
query Execute a query.
Examples:
bq query 'select count(*) from publicdata:samples.shakespeare'
Usage:
query <sql_query>
Flags for query:
/home/paul/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_large_results: Enables larger destination table sizes.
--[no]append_table: When a destination table is specified, whether or not to
append.
(default: 'false')
--[no]batch: Whether to run the query in batch mode.
(default: 'false')
--destination_table: Name of destination table for query results.
(default: '')
...
Instead of appending two tables together, you might be better off with a UNION ALL, which is SQL's version of concatenation.
In BigQuery, the comma operator between two tables, as in SELECT something FROM tableA, tableB, is a UNION ALL, NOT a JOIN, or at least it was the last time I looked.
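For example, if the goal is just to read both tables together rather than physically append rows, a Standard SQL sketch along these lines could work (the table names reuse the question's, and field1, field2 are placeholder columns):
SELECT field1, field2 FROM `project.DATASET.table`
UNION ALL
SELECT field1, field2 FROM `project.DATASET.another_table`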
Just in case someone ends up finding this question on Google: BigQuery has evolved a lot since this post, and now it does support Standard SQL.
If you want to append the results of a query to a table using the DML syntax feature of the Standard version, you could do something like:
INSERT dataset.Warehouse (warehouse, state)
SELECT *
FROM UNNEST([('warehouse #1', 'WA'),
('warehouse #2', 'CA'),
('warehouse #3', 'WA')])
As presented in the docs.
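Applied to the original question of appending query results to an existing table, a sketch using the same DML syntax might look like this (table names are taken from the question; field1, field2 are placeholder columns):
INSERT `project.DATASET.table` (field1, field2)
SELECT field1, field2
FROM `project.DATASET.another_table`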
For the command line tool it follows the same idea, you just need to add the flag --use_legacy_sql=False, like so:
bq query --use_legacy_sql=False "insert into dataset.table (field1, field2) select field1, field2 from table"
According to the current documentation (March 2018): https://cloud.google.com/bigquery/docs/loading-data-local#appending_to_or_overwriting_a_table_using_a_local_file
You should add:
--noreplace or --replace=false
Let's say I have a large table partitioned by dt field. I want to query this table for data after specific date. E.g.
select * from mytab where dt >= 20140701;
The tricky part is that date is not a constant, but comes from a subquery. So basically I want something like this:
select * from mytab where dt >= (select min(dt) from activedates);
However, Hive can't do it, giving me a ParseException on the subquery (from the docs I'm guessing it's just not supported yet).
So how do I restrict my query based on dynamic subquery?
Note, that performance is key point here. So the faster, the better, even if it looks uglier.
Also note that we haven't switched to Hive 0.13 yet, so solutions that don't use IN subqueries are preferred.
Hive decides on partition pruning when building the execution plan and thus has to have the value of min(dt) prior to execution.
Currently the only way to accomplish something like this is to break the query into two parts: the first will be select min(dt) from activedates, and its result will be put into a variable.
The 2nd query will be: select * from mytab where dt >= ${hiveconf:var}.
Now this is a bit tricky.
You could either capture the result of the 1st query into an OS variable like so:
a=`hive -S -e "select min(dt) from activedates"`
And then run the 2nd query like so (single quotes so the shell doesn't expand ${hiveconf:var} itself):
hive -hiveconf var=$a -e 'select * from mytab where dt >= ${hiveconf:var}'
or even just:
hive -e "select * from mytab where dt >=$a"
Or, if you are using some other scripting language, you can substitute the variable in the code.