HQL/Hive Missing EOF in LIMIT query - sql

I'm new to HQL and was wondering about the reason for the error below.
I was selecting a whole table that has ~9 million records, so I was trying to fetch it chunk by chunk.
Everything worked fine when I used:
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 10; -- select the first 10 rows of the table
However, when I tried to get them with:
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 0,10; -- select 10 rows of the table, starting from row 0
I kept getting the error "FAILED: ParseException line 1:111 missing EOF at ',' near '0')". I tried using LIMIT with OFFSET, and it still showed the same error about EOF.
What could the problem be?

LIMIT with two arguments should work in Hive 2.0.0 or higher. Could you check your Hive version (for example with select version(), or hive --version from the shell) to confirm whether that is the root cause?
If your Hive is older than 2.0.0, you can use the SQL below to get the data you want. It uses row_number() to generate sequential numbers and then filters on them. This may be a little slower than limit x,y, but the difference shouldn't be large.
select id, col_1, col_2...
from (
    select id, col_1, col_2, ... , row_number() OVER (ORDER by id) as rownum from tableABC
) rs
where rownum between 1 and 50  -- row_number() starts at 1
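The mapping from a MySQL-style LIMIT offset,count to the row_number() filter above is a small piece of arithmetic that is easy to get off by one, because row_number() starts at 1. A minimal Python sketch of that mapping follows; the helper names rownum_bounds and build_chunk_query are made up for illustration, and the table and column names are simply the ones from the question.

```python
def rownum_bounds(offset, count):
    """Translate LIMIT offset,count into inclusive row_number() bounds.

    row_number() is 1-based, so the first wanted row is offset + 1.
    """
    if offset < 0 or count <= 0:
        raise ValueError("offset must be >= 0 and count must be > 0")
    return offset + 1, offset + count


def build_chunk_query(table, order_col, offset, count):
    """Build the row_number()-based query text for one chunk.

    Hypothetical helper: table and order_col are interpolated as-is,
    so they must come from trusted code, not from user input.
    """
    first, last = rownum_bounds(offset, count)
    return (
        "select * from ("
        f"select t.*, row_number() over (order by {order_col}) as rownum "
        f"from {table} t"
        ") rs "
        f"where rownum between {first} and {last}"
    )


# LIMIT 0,10 corresponds to rows 1 through 10.
print(build_chunk_query("tableABC", "id", 0, 10))
```

Under these assumptions, the next chunk (LIMIT 10,10) maps to rownum between 11 and 20.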

Related

Athena pagination and performance issue

I have a huge data set in S3 that I am trying to query with AWS Athena. The query takes the three input parameters below.
marketplaceId
startIndex
endIndex
But it took 16 seconds to query just 50 records. (I am using Python to query the data: Athena --> S3.)
What am I doing wrong here? And is the way I implemented pagination right or not?
This is the SQL query I am executing:
SELECT
dataset_date,
marketplace_id,
gl_product_group,
gl_product_group_desc,
browse_node_id,
browse_node_name,
root_browse_node_id,
browse_root_name,
wt_xref_id,
gt_xref_id,
node_path,
total_count_of_asins,
buyable_asin_count,
glance_view_count_t12m,
ord_cnt,
price_p50,
price_p90,
price_p100,
row_num
FROM
(
SELECT
dataset_date,
marketplace_id,
gl_product_group,
gl_product_group_desc,
browse_node_id,
browse_node_name,
root_browse_node_id,
browse_root_name,
wt_xref_id,
gt_xref_id,
node_path,
total_count_of_asins,
buyable_asin_count,
glance_view_count_t12m,
ord_cnt,
price_p50,
price_p90,
price_p100,
row_number() over (
order by
browse_node_id,
gl_product_group,
glance_view_count_t12m desc
) as row_num
from
(
select
*
from
category_info
WHERE
marketplace_id = '<marketplaceId>'
)
)
WHERE
row_num between '<startIndex>'
and '<endIndex>';
Update
After debugging with timestamps I found that one query takes 6 seconds, and I am running two queries:
1st - to get the data (the query shown above).
2nd - to get the total number of rows in my table.
That is why it takes 12-16 seconds.
So is there any way to get the total number of rows without the second query (select count(*) from category_info)?
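One option that might avoid the second round trip is to carry the total along with each page using a window aggregate, count(*) over (), next to row_number(). The sketch below is not Athena code: it uses Python's built-in sqlite3 purely as a stand-in, with a made-up two-column version of category_info, and assumes a SQLite build with window-function support (3.25 or newer). Note that the total here counts the rows matching the marketplace filter, not the whole table, and whether this is actually faster on Athena would need to be measured.

```python
import sqlite3

# Stand-in data: a tiny, hypothetical version of category_info.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table category_info (marketplace_id text, browse_node_id integer)"
)
conn.executemany(
    "insert into category_info values (?, ?)",
    [("1", n) for n in range(1, 8)] + [("2", n) for n in range(1, 4)],
)


def fetch_page(conn, marketplace_id, start_index, end_index):
    """Return one page of rows; each row also carries the filtered total."""
    sql = """
        select browse_node_id, row_num, total_rows
        from (
            select browse_node_id,
                   row_number() over (order by browse_node_id) as row_num,
                   count(*) over () as total_rows
            from category_info
            where marketplace_id = ?
        )
        where row_num between ? and ?
        order by row_num
    """
    return conn.execute(sql, (marketplace_id, start_index, end_index)).fetchall()


page = fetch_page(conn, "1", 3, 5)
print(page)
```

The window aggregate is evaluated before the outer row_num filter, so every returned row reports the size of the full filtered set rather than the size of the page.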

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field with its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple that I must be missing something. Do you have an idea how to handle such a task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so that it incurs less cost for your dev/test work. The query below can help with this: run it in the BigQuery UI with a destination table, which you will then use for development. Make sure you set Allow Large Results and unset Flatten Results so that you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks@1529518619028]
The value 1529518619028 is taken from the query below (at the time of running). The reason I went back four days is that I know the table had just 912 rows at that time, versus 528,858 currently.
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
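The same millisecond timestamp can also be computed outside BigQuery. The following Python sketch does the equivalent arithmetic (current time minus 24*4 hours, expressed as milliseconds since the Unix epoch); the function name snapshot_millis is made up for illustration, and integer arithmetic is used to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snapshot_millis(now, hours_back):
    """Milliseconds since the Unix epoch for hours_back hours before now."""
    moment = now - timedelta(hours=hours_back)
    # Floor-dividing two timedeltas yields an exact integer count.
    return (moment - EPOCH) // timedelta(milliseconds=1)


now = datetime.now(timezone.utc)
print(snapshot_millis(now, 24 * 4))
```

The printed value is the kind of number that goes after the table name in the legacy SQL query above.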
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still contains some nested data, which you might want to reduce to its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

SQL Query Limit for DB2 AS/400 Version 4

I know the version is way too old (yea version 4!), but I have no choice.
How can I limit my query to, for example, 100 rows only on DB2 for AS/400?
FETCH FIRST n ROWS ONLY
and
ROW_NUMBER()
don't work.
Any ideas or workaround?
Here is a sample SQL query (does not work):
SELECT POLNOP FROM ZICACPTF.POLHDR FETCH FIRST 10 ROWS ONLY
It says
[SQL0199] Keyword FETCH not expected. Valid tokens: FOR WITH ORDER UNION OPTIMIZE.
There is no DBMS support for this operation; check the Version 4 DB2 UDB for AS/400 SQL Reference: there are no LIMIT, TOP, FIRST, ... reserved words.
You can try to limit rows via the WHERE clause, e.g. where sequence between 100 and 200. But this is an unrealistic scenario.
The first workaround is via a cursor:
DECLARE ITERROWS INTEGER;
...
SET ITERROWS = 0;
WHILE (SUBSTR(SQLSTATE,1,2) = '00' AND ITERROWS < 100) DO
    ...
    SET ITERROWS = ITERROWS + 1;
END WHILE;
The second one is to do it in your middleware language.
I hope someone posts a clever workaround but, in my opinion, there isn't one.
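To illustrate the middleware workaround: with a Python DB-API driver, the client can simply stop reading after the first n rows. The sketch below uses the built-in sqlite3 module only as a stand-in for whatever driver connects to the AS/400 (that driver is an assumption, not something from the question); the table is a small fake of POLHDR, and fetch_first is a hypothetical helper name. fetchmany is part of the standard DB-API cursor interface.

```python
import sqlite3


def fetch_first(cursor, sql, max_rows):
    """Run sql and return at most max_rows rows, leaving the rest unread."""
    cursor.execute(sql)
    return cursor.fetchmany(max_rows)


# Stand-in database with 250 rows in a fake POLHDR table.
conn = sqlite3.connect(":memory:")
conn.execute("create table POLHDR (POLNOP integer)")
conn.executemany("insert into POLHDR values (?)", [(n,) for n in range(250)])

rows = fetch_first(conn.cursor(), "select POLNOP from POLHDR", 100)
print(len(rows))
```

This limits what the client reads, not what the server evaluates, so how much work is saved depends on how the driver and database stream results.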
Solution only for > V4R4
Using FETCH FIRST [n] ROWS ONLY:
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 10 ROWS ONLY;
Reference: publib.boulder.ibm.com
The difference I can see between your query and this example is that the example uses an ORDER BY clause. If you can add an ORDER BY, it should do the trick. See: https://stackoverflow.com/a/16858430/1581725
To get ranges, or just the first 10 rows, you'd have to use ROW_NUMBER() (since V5R4):
SELECT
    *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY {{table field}}) AS ROWNUM, t.* FROM {{yourtable}} t
) AS {{yourcursor}}
WHERE
    {{yourcursor}}.ROWNUM>0 AND
    {{yourcursor}}.ROWNUM<=10
Reference: blog.zanclus.com

How can I get a specific chunk of results?

Is it possible to retrieve a specific range of results? I know how to do TOP x but the result I will retrieve is WAY too big and will time out. I was hoping to be able to pick say the first 10,000 results then the next 10,000 and so on. Is this possible?
WITH Q AS (
SELECT ROW_NUMBER() OVER (ORDER BY ...some column) AS N, ...other columns
FROM ...some table
) SELECT * FROM Q WHERE N BETWEEN 1 AND 10000;
Read more about ROW_NUMBER() here: http://msdn.microsoft.com/en-us/library/ms186734.aspx
Practically all SQL DB implementations have a way of specifying the starting row to return, as well as the number of rows.
For example, in both MySQL and PostgreSQL it looks like:
SELECT ...
ORDER BY something -- not required, but highly recommended
LIMIT 100 -- only get 100 rows
OFFSET 500; -- skip the first 500 rows
Note that normally you would include an ORDER BY to make sure your chunks are consistent.
MS SQL Server (being a "pretend" DB) doesn't support OFFSET directly in versions before 2012, but it can be coded using ROW_NUMBER(); see this SO post for more detail. From SQL Server 2012 onward, ORDER BY ... OFFSET ... ROWS FETCH NEXT ... ROWS ONLY is available.
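A chunked read built on LIMIT/OFFSET could look roughly like the Python sketch below. It runs against an in-memory SQLite database as a stand-in (SQLite accepts the same LIMIT ... OFFSET form); the items table, its columns, and the iter_chunks helper are all made up for illustration.

```python
import sqlite3


def iter_chunks(conn, chunk_size):
    """Yield lists of rows, chunk_size at a time, in a stable id order."""
    offset = 0
    while True:
        rows = conn.execute(
            "select id, payload from items order by id limit ? offset ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size


# Stand-in table with 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, payload text)")
conn.executemany(
    "insert into items values (?, ?)", [(n, f"row {n}") for n in range(1, 26)]
)

sizes = [len(chunk) for chunk in iter_chunks(conn, 10)]
print(sizes)
```

The ORDER BY on a unique column is what keeps consecutive chunks from overlapping or skipping rows.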

ms-access: runtime error 3354

I'm having a problem running an SQL statement in MS Access. I'm using this code:
SELECT readings_miu_id, ReadDate, ReadTime, RSSI, Firmware, Active, OriginCol, ColID, Ownage, SiteID, PremID, prem_group1, prem_group2
INTO analyzedCopy2
FROM analyzedCopy AS A
WHERE ReadTime = (SELECT TOP 1 analyzedCopy.ReadTime FROM analyzedCopy WHERE analyzedCopy.readings_miu_id = A.readings_miu_id AND analyzedCopy.ReadDate = A.ReadDate ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate, analyzedCopy.ReadTime)
ORDER BY A.readings_miu_id, A.ReadDate ;
Before this, I fill the analyzedCopy table from other tables given certain criteria. For one set of criteria this code works just fine, but for others it keeps giving me run-time error '3354'. The only difference I can see is that with the criteria that work, the table is around 4145 records long, whereas with the criteria that don't work, the table is over 9000 records long. Any suggestions?
Is there any way to tell it to pull only half of the information, then run the same SELECT on the other half of the source table and add those results to the results from the first half?
The full text for run-time error '3354' is "At most one record can be returned by this subquery."
I just tried to run this query on the first 4000 records and it failed again with the same error code, so I would think it can't be the number of records.
See this:
http://allenbrowne.com/subquery-02.html#AtMostOneRecord
What is happening is that your subquery finds two records that are identical on the ORDER BY fields, so TOP 1 actually returns both of them (yes, that is how Access handles ties in TOP). You need to add fields to the ORDER BY to make it unique, preferably a unique ID (you do have a unique PK, don't you?).
As Andomar states below, DISTINCT TOP 1 will work as well.
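The tie behaviour described above can be modelled in a few lines of Python. This is only a model of the semantics, not Access code: access_top is a hypothetical function, and the sample rows, including the record_id column standing in for a unique key, are invented.

```python
def access_top(rows, n, sort_key):
    """Model TOP n where rows tied with the n-th row are returned as well."""
    ordered = sorted(rows, key=sort_key)
    if len(ordered) <= n:
        return ordered
    cutoff = sort_key(ordered[n - 1])
    return [row for row in ordered if sort_key(row) <= cutoff]


# (record_id, readings_miu_id, ReadDate, ReadTime); two rows share a ReadTime.
rows = [
    (1, "A1", "2010-01-05", "08:00"),
    (2, "A1", "2010-01-05", "08:00"),
    (3, "A1", "2010-01-05", "09:30"),
]


def by_time(row):
    return (row[1], row[2], row[3])


def by_time_and_id(row):
    return (row[1], row[2], row[3], row[0])


tied = access_top(rows, 1, by_time)            # both 08:00 rows come back
unique = access_top(rows, 1, by_time_and_id)   # unique key breaks the tie
distinct_times = {row[3] for row in tied}      # what DISTINCT would leave

print(len(tied), len(unique), distinct_times)
```

In this model the original ORDER BY yields two rows for TOP 1, which corresponds to the "at most one record" error, while either a unique tiebreaker or DISTINCT on ReadTime brings it back to a single value.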
What does MS-ACCESS return when you run the subquery?
SELECT TOP 1 analyzedCopy.ReadTime
FROM analyzedCopy
WHERE analyzedCopy.readings_miu_id = A.readings_miu_id
AND analyzedCopy.ReadDate = A.ReadDate
ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate,
analyzedCopy.ReadTime
If it returns multiple rows, maybe it can be fixed with DISTINCT:
SELECT DISTINCT TOP 1 analyzedCopy.ReadTime
FROM ... rest of query ...
I don't know if this would work or not (and I no longer have a copy of Access to test on), so I apologize up front if I'm way off.
First, just do a select on the primary key of analyzedCopy to get the mid-point ID. Something like:
SELECT TOP 4500 readings_miu_id FROM analyzedCopy ORDER BY readings_miu_id, ReadDate;
Then, when you have the mid-point ID, you can add it to the WHERE clause of your original statement:
SELECT ...
INTO ...
FROM ...
WHERE ... AND (readings_miu_id <= {ID from above})
ORDER BY ...
Then SELECT the other half:
SELECT ...
INTO ...
FROM ...
WHERE ... AND (readings_miu_id > {ID from above})
ORDER BY ...
Again, sorry if I'm way off.