Athena pagination and performance issue - SQL

I have a huge data set in S3, and I am trying to query it using AWS Athena; the 3 parameters below are inputs to my query:
marketplaceId
startIndex
endIndex
but it took 16 seconds to query just 50 records. (I am using Python to query the data from Athena --> S3.)
What am I doing wrong here? And is the way I implemented pagination right or not?
The SQL query which I am executing:
SELECT
    dataset_date,
    marketplace_id,
    gl_product_group,
    gl_product_group_desc,
    browse_node_id,
    browse_node_name,
    root_browse_node_id,
    browse_root_name,
    wt_xref_id,
    gt_xref_id,
    node_path,
    total_count_of_asins,
    buyable_asin_count,
    glance_view_count_t12m,
    ord_cnt,
    price_p50,
    price_p90,
    price_p100,
    row_num
FROM (
    SELECT
        dataset_date,
        marketplace_id,
        gl_product_group,
        gl_product_group_desc,
        browse_node_id,
        browse_node_name,
        root_browse_node_id,
        browse_root_name,
        wt_xref_id,
        gt_xref_id,
        node_path,
        total_count_of_asins,
        buyable_asin_count,
        glance_view_count_t12m,
        ord_cnt,
        price_p50,
        price_p90,
        price_p100,
        row_number() OVER (
            ORDER BY browse_node_id, gl_product_group, glance_view_count_t12m DESC
        ) AS row_num
    FROM (
        SELECT *
        FROM category_info
        WHERE marketplace_id = '<marketplaceId>'
    )
)
WHERE row_num BETWEEN '<startIndex>' AND '<endIndex>';
Update
After debugging my issue with timestamps, I found that each query takes about 6 seconds, and I am running two queries:
1st - to get the data (the query mentioned above).
2nd - to get the count of the total number of rows in my table.
That is why it's taking 12-16 seconds.
So is there any way to get the total number of rows without the second query (select count(*) from category_info)?
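One option - a sketch only, not tested against this table, though COUNT(*) as a window function is supported by Athena's Presto engine - is to attach the total row count to every returned row, so a single query serves both purposes:

SELECT *
FROM (
    SELECT
        dataset_date,
        marketplace_id,
        -- ...the remaining columns from the original query...
        row_number() OVER (
            ORDER BY browse_node_id, gl_product_group, glance_view_count_t12m DESC
        ) AS row_num,
        count(*) OVER () AS total_rows  -- total match count, no second query needed
    FROM category_info
    WHERE marketplace_id = '<marketplaceId>'
)
WHERE row_num BETWEEN <startIndex> AND <endIndex>;  -- substitute numeric values

Every row of the page then carries total_rows, so the Python side can read the count from the first record instead of issuing select count(*) from category_info separately.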

Related

HQL/Hive Missing EOF in LIMIT query

I'm new to HQL and was wondering the reason for the error below.
I was selecting from a table which had ~9 million records, so I was trying to get it chunk by chunk.
Everything worked fine when I used:
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 10; -- select everything from the table, returning 10 rows in total
However, when I tried to get them with:
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 0,10; -- select everything from the table, starting at row 0, returning 10 rows in total
I kept getting the error "FAILED: ParseException line 1:111 missing EOF at ',' near '0'". I tried using LIMIT with OFFSET, and it still showed the same error about EOF.
May I know what the problem could be?
LIMIT with two arguments should work in Hive 2.0.0 or higher. Could you please check your Hive version using select version()?
If your Hive is lower than 2, you can use the SQL below to get the data you want. I am using row_number() to generate sequential numbers and then putting a filter on it. This may be a little slower than LIMIT x,y, but shouldn't be too different.
select id, col_1, col_2, ...
from (
    select id, col_1, col_2, ..., row_number() over (order by id) as rownum
    from tableABC
) rs
where rownum between 0 and 50
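For completeness, on Hive 2.0.0 or later the two-argument form itself should parse, so the chunks can be fetched directly; a sketch (untested):

-- Hive 2.0.0+ only: LIMIT <offset>,<rows>
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 0,10;   -- rows 1-10
SELECT * FROM tableABC ORDER BY tableABC.ID LIMIT 10,10;  -- rows 11-20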

Difference of values in the same column in MS Access SQL (mdb)

I have a table which contains two columns with values that are not unique. Those values are generated automatically and I have no way to do anything about it: I cannot edit the table or the db, nor create custom functions.
With that in mind, I've solved this problem in SQL Server, but my solution contains some functions that do not exist in MS Access.
The columns are Volume and ComponentID; here is my code in SQL Server:
with rows as (
    select row_number() over (order by volume) as rownum, volume
    from test
    where componentid = 'S3'
)
select top 10
    rowsMinusOne.volume,
    coalesce(rowsMinusOne.volume - rows.volume, 0) as diff
from rows as rowsMinusOne
left outer join rows
    on rows.rownum = rowsMinusOne.rownum - 1
Sample data:
58.29168
70.57396
85.67902
97.04888
107.7026
108.2022
108.3975
108.5777
109
109.8944
Expected results:
Volume      diff
58.29168    0
70.57396    12.28228
85.67902    15.10506
97.04888    11.36986
107.7026    10.65368
108.2022    0.4996719
108.3975    0.1952896
108.5777    0.1801834
109         0.4223404
109.8944    0.89431
I have solved the coalesce part by replacing it with Nz. I tried to use DCount to replace row_number (How to show the record number in a MS Access report table?), but I receive an error that it cannot find the function (I am reading the data from code; that is the only thing I can do).
I also tried this, but as the answer says, I need a column with a unique value, which I do not have and cannot create: Microsoft Access query to duplicate ROW_NUMBER
Consider:
SELECT TOP 10 Table1.ComponentID,
DCount("*","Table1","ComponentID = 'S3' AND Volume<" & [Volume])+1 AS Seq, Table1.Volume,
Nz(Table1.Volume -
(SELECT Top 1 Dup.Volume FROM Table1 AS Dup
WHERE Dup.ComponentID = Table1.ComponentID AND Dup.Volume<Table1.Volume
ORDER BY Volume DESC),0) AS Diff
FROM Table1
WHERE (((Table1.ComponentID)="S3"))
ORDER BY Table1.Volume;
This will likely perform very slowly with a large dataset.
Alternative solutions:
build a query that calculates the difference, use that query as the source for a report, and use a textbox's RunningSum property to calculate the sequence number
VBA looping through a recordset and saving the results to a 'temp' table
export to Excel

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin, and I would like to extract the first transaction of each block that was mined; in other words, to replace a nested field by its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#standardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple that I must be missing something. Do you have an idea of how to handle such a task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it so it incurs less cost for your dev/test. The query below can help with this - you can run it in the BigQuery UI with a destination table, which you can then use for your dev work. Make sure you set Allow Large Results and unset Flatten Results so that you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value 1529518619028 was taken from the query below (at the time of running) - the reason I took a point four days back is that I know the number of rows in this table at that time was just 912 vs. the current 528,858:
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#standardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data that you might want to flatten to its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

Oracle sampling of data in parallel

I have X million records in a table TABLE_A and want to process these records one by one.
How can I divide the population equally among 10 instances of the same PL/SQL script so they can process it in parallel?
See the query below:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM (
    SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
    FROM TABLE_A
    ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum
  AND CBR.RNUM >= :sqli_start_rownum;
The values are incremented in each iteration of the loop; in the next iteration, sqli_start_rownum becomes sqli_end_rownum.
This query is taking a long time. Does someone have a better way to do it?
You could look into DBMS_PARALLEL_EXECUTE:
http://docs.oracle.com/cd/E11882_01/appdev.112/e40758/d_parallel_ex.htm#ARPLS67331
For example:
https://oracle-base.com/articles/11g/dbms_parallel_execute_11gR2
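For a flavor of the API, here is a minimal sketch (untested; the task name, chunk size, and the processed flag in the UPDATE are illustrative, not from the question):

BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => 'process_table_a');

  -- split TABLE_A into rowid chunks of roughly 10,000 rows each
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => 'process_table_a',
    table_owner => USER,
    table_name  => 'TABLE_A',
    by_row      => TRUE,
    chunk_size  => 10000);

  -- run the statement in 10 parallel jobs; :start_id/:end_id are bound
  -- to each chunk's rowid range automatically
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => 'process_table_a',
    sql_stmt       => 'UPDATE table_a SET processed = 1
                       WHERE rowid BETWEEN :start_id AND :end_id',
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 10);

  DBMS_PARALLEL_EXECUTE.DROP_TASK(task_name => 'process_table_a');
END;
/

CREATE_CHUNKS_BY_ROWID does the bucketing for you, and RUN_TASK executes the statement once per chunk across parallel_level scheduler jobs.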
The poor man's version of this is basically to run a query to generate ranges of rowids. You can then access the rows in the table within a given range.
Step 1: create the number of "buckets" you want to divide the table into, and get a range of rowids for each bucket. Here's an 8-bucket example:
select bucket_num, min(rid) as start_rowid, max(rid) as end_rowid, count(*)
from (select rowid rid
, ntile(8) over (order by rowid) as bucket_num
from table_a
)
group by bucket_num
order by bucket_num;
You'd get an output that looks like this (I'm using 12c - rowids may look different in 11g):
BUCKET_NUM  START_ROWID         END_ROWID           COUNT(*)
         1  AABetTAAIAAB8GCAAA  AABetTAAIAAB8u5AAl     82792
         2  AABetTAAIAAB8u5AAm  AABetTAAIAAB9RrABi     82792
         3  AABetTAAIAAB9RrABj  AABetTAAIAAB96vAAU     82792
         4  AABetTAAIAAB96vAAV  AABetTAAIAAB+gKAAs     82792
         5  AABetTAAIAAB+gKAAt  AABetTAAIAAB+/vABv     82792
         6  AABetTAAIAAB+/vABw  AABetTAAIAAB/hbAB1     82791
         7  AABetTAAIAAB/hbAB2  AABetTAAIAACARDABf     82791
         8  AABetTAAIAACARDABg  AABetTAAIAACBGnABq     82791
(The sum of the counts will be the total number of rows in the table at the time of the query.)
Step 2: grab a set of rows from the table for a given range:
SELECT <whatever you need>
FROM <table>
WHERE rowid BETWEEN 'AABetTAAIAAB8GCAAA' and 'AABetTAAIAAB8u5AAl'
...
Step 3: repeat Step 2 for each of the remaining ranges.
so instead of this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM (
    SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
    FROM TABLE_A
    ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum
  AND CBR.RNUM >= :sqli_start_rownum;
you'll just have this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM table_a
WHERE rowid BETWEEN :start_rowid and :end_rowid
You can use this to run the same job in parallel, but you'll need a separate session for each run (e.g. multiple SQL*Plus sessions). You can also use something like DBMS_JOB/DBMS_SCHEDULER to launch background jobs, for example:
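A minimal sketch (the job name and the my_pkg.process_range procedure are illustrative placeholders, not from the original answer):

BEGIN
  -- one background job per bucket, passing that bucket's rowid range
  DBMS_SCHEDULER.CREATE_JOB(
    job_name   => 'PROCESS_BUCKET_1',
    job_type   => 'PLSQL_BLOCK',
    job_action => 'BEGIN my_pkg.process_range(''AABetTAAIAAB8GCAAA'', ''AABetTAAIAAB8u5AAl''); END;',
    enabled    => TRUE);
END;
/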
(Note: always be aware that if your table is being updated between the time the buckets are calculated and the time you access the rows, you can miss rows.)

How can I get a specific chunk of results?

Is it possible to retrieve a specific range of results? I know how to do TOP x, but the result set I would retrieve is WAY too big and will time out. I was hoping to be able to pick, say, the first 10,000 results, then the next 10,000, and so on. Is this possible?
WITH Q AS (
    SELECT ROW_NUMBER() OVER (ORDER BY ...some column) AS N,
           ...other columns
    FROM ...some table
)
SELECT * FROM Q WHERE N BETWEEN 1 AND 10000;
Read more about ROW_NUMBER() here: http://msdn.microsoft.com/en-us/library/ms186734.aspx
Practically all SQL DB implementations have a way of specifying the starting row to return, as well as the number of rows.
For example, in both MySQL and Postgres it looks like this:
SELECT ...
ORDER BY something -- not required, but highly recommended
LIMIT 100 -- only get 100 rows
OFFSET 500; -- start at row 500
Note that you would normally include an ORDER BY to make sure your chunks are consistent across queries.
MS SQL Server (being a "pretend" DB) doesn't support OFFSET directly, but it can be coded using ROW_NUMBER() - see this SO post for more detail.
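For what it's worth, SQL Server 2012 and later do support OFFSET natively via OFFSET ... FETCH; a minimal sketch with illustrative table/column names:

-- SQL Server 2012+; an ORDER BY is required for OFFSET ... FETCH
SELECT *
FROM some_table
ORDER BY some_column
OFFSET 500 ROWS           -- skip the first 500 rows
FETCH NEXT 100 ROWS ONLY; -- then return 100 rows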