dbGetQuery retrieves fewer rows than expected - sql

I am trying to fetch a large dataset into the R environment using an ODBC connection.
When I try to retrieve data from a large dataset using the dbGetQuery() function, the number of rows is less than what I see in Hive. Sometimes the same code fetches the correct number of rows. Could someone tell me whether I should clear any buffer before fetching the data?
hive_con <- dbConnect(odbc::odbc(), .connection_string = Connection_String)
qry <- "select * from mytable"
rslt <- dbGetQuery(hive_con, qry)
I have tried changing the n parameter of dbGetQuery(), but the problem persists.
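For reference, a chunked fetch with dbSendQuery()/dbFetch() is one way to rule out a client-side row cap; this is only a sketch (the 10,000-row chunk size is arbitrary), not the fix that eventually worked:
library(DBI)

res <- dbSendQuery(hive_con, qry)
chunks <- list()
while (!dbHasCompleted(res)) {
  # fetch 10,000 rows per round trip instead of everything at once
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 10000)
}
dbClearResult(res)
rslt <- do.call(rbind, chunks)
nrow(rslt)  # compare with the row count reported by Hive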

Finally, I found that I was using the HTTP transport protocol for data extraction. This protocol does not show any warning when data loss happens.

Related

Connection Timeout Error while reading the table having more than 100 columns in Mosaic Decisions

I am reading a table via a Snowflake reader node. When the table has a smaller number of columns/attributes (around 50-80), it is read fine on the Mosaic Decisions canvas. But when the number of attributes increases (approx. 385 columns), the Mosaic reader node fails. As a workaround I tried using a WHERE clause with 1=2; in that case it pulls the structure of the table. But when I try to read the records, even by applying a limit (only 10 records) to the query, it throws a connection timeout error.
I faced a similar issue while reading a table with approximately 300 columns, and I managed it with the help of the input parameters available in Mosaic. In your case, you will have to change the copy field variable used in the query to 1=1 at run time.
The steps below can be followed to achieve this (a sketch of the resulting query follows the steps):
Create a parameter (e.g. copy_variable) that will contain the default value 2 for the copy field variable.
In the reader node, write the SQL with 1 = $(copy_variable). While validating, this is the same as the 1=2 condition, so it should validate fine.
Once it is validated and the schema is generated, update the default value of $(copy_variable) to 1 so that at run time you still get all records.
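For illustration, the reader query would then look something like this (the table name is a placeholder; $(copy_variable) is the parameter created in the first step):
-- validates with copy_variable = 2 (pulls only the structure),
-- runs with copy_variable = 1 (pulls all records)
SELECT *
FROM my_wide_table
WHERE 1 = $(copy_variable)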

Google Bigquery results

I am getting only part of the result from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result using iterators.
However, now I'm stuck at some other obstacle.
If I take more than 6-7 columns in a result, I do not get the complete result set.
However, if I take a single column, I get the complete result.
Can there be a size limit as well for results in the BigQuery API?
There are some limits for query jobs.
In particular:
Maximum response size: 128 MB compressed
Of course, there is no such limit when writing large query results to a destination table (and then reading from there).
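For example, with the google-cloud-bigquery Python client the destination-table route looks roughly like this (project, dataset, table and column names are placeholders; a sketch, not necessarily the client library used in the question):
from google.cloud import bigquery

client = bigquery.Client()
dest = bigquery.TableReference.from_string("my-project.my_dataset.query_results")
job_config = bigquery.QueryJobConfig(destination=dest)

# run the query, writing the full result set to the destination table
job = client.query("SELECT col_a, col_b, col_c FROM my_dataset.big_table", job_config=job_config)
job.result()  # wait for the job to finish

# then read the destination table back, page by page
for row in client.list_rows(dest, page_size=10000):
    pass  # process each row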

BigQuery paging issues with tableData.list()

We're trying to load 120,000 rows from a BigQuery table using tableData.list. Our first call,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000
returns a pageToken as expected and the first 22482 rows (1 to 22482). We assume this is due to the 10 MB serialized JSON limit. Our second call, however,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000&pageToken=CIDBB777777QOGQIBCIL6BIQSC7QK===
returns not the next rows, but 22137 rows starting at row 90001 and running through row 112137, without a pageToken.
Strangely, if we change maxResults to 100,000, we get rows starting from row 100,000.
We are able to work around this by using startRowIndex to page. In this case, we start with the first call using startRowIndex=0, and we never get a pageToken in the response. We keep making calls until all rows are retrieved. However, the concern is that without pageTokens, if the row order changes while the calls are being made, the data could be invalid.
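For reference, the index-based calls look like this (the tabledata.list REST parameter is named startIndex; the maxResults value here is arbitrary):
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=10000&startIndex=0
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=10000&startIndex=10000
and so on, until all 120,000 rows have been read.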
We are able to reproduce this behavior on multiple tables with different sizes and structures.
Is there any issue with paging, or should we be structuring our calls differently?
This is a known, high-priority bug that has since been fixed.

SQL - When data are transferred

I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1 Select a first block of data
SELECT * FROM TABLE LIMIT 0, 10000
2 Process that block
while ($row = mysql_fetch_array($result)) {
    // do something with $row
}
3 Get next block
and so on.
Assuming 10000 is an acceptable block size for my system, let us suppose I have 30000 records to get: I perform 3 calls to the remote system.
But my question is: when executing a SELECT, is the result set transmitted and stored locally, so that each fetch is local, or is the result set kept on the remote system, with records coming one by one at each fetch? Because if the real scenario is the second one, I perform not 3 calls but 30000 calls, and that is not what I want.
I hope I explained it well, thanks for the help
bye
First, it's highly recommended to use MySQLi or PDO instead of the deprecated mysql_* functions:
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default with the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value, MYSQLI_STORE_RESULT, causes the entire result set to be transferred to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as requested.
With the MySQL driver, PDO also buffers the full result set by default when using PDO::query() or PDO::prepare(); to load data on demand with PDO::fetch(), set the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute to false.
To retrieve all data from the result set into a PHP array, you can use PDO::fetchAll().
Prepared statements are also affected by the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute, though PDO::fetchAll() is often the simpler choice.
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
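As a sketch of the two unbuffered options described above (host, credentials and table names are placeholders):
// mysqli: MYSQLI_USE_RESULT streams rows from the server as you fetch them
$mysqli = new mysqli('remote-host', 'user', 'pass', 'mydb');
$result = $mysqli->query('SELECT * FROM mytable', MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
    // process $row one row at a time
}
$result->close();

// PDO: disable buffering so fetch() pulls rows on demand
$pdo = new PDO('mysql:host=remote-host;dbname=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT * FROM mytable');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process $row one row at a time
}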
You would be performing 3 calls, not 30,000. That's for sure.
Each batch of 10,000 results is produced on the server by one of the 3 queries. Your while loop iterates through a set of data that has already been returned by MySQL (that's why you don't have 30,000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
//do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.
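Put together, the paging plan from the question looks roughly like this (a sketch that keeps the deprecated mysql_* API from the question; prefer mysqli or PDO in new code):
$blockSize = 10000;
$offset = 0;
do {
    // one call to the remote system per block
    $res = mysql_query("SELECT * FROM TABLE LIMIT $offset, $blockSize");
    $count = 0;
    while ($row = mysql_fetch_array($res)) {
        // do something with $row (already transferred to PHP)
        $count++;
    }
    $offset += $blockSize;
} while ($count == $blockSize); // a short block means the end was reached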
According to the documentation here, all the data is fetched first, and then you go through it.
from the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this function is deprecated, so you might want to use one of the alternatives suggested there.

Slow Datareader with ODP.NET

We are using (ODP.NET) Oracle.DataAccess version 1.102.3.0 with the Oracle 11g client. I am having some problems reading data using the DataReader:
my procedure returns a ref cursor with around 10000 records, but fetching the data takes around 30 to 40 seconds.
Are there any possibilities to improve the performance?
Try setting the FetchSize.
Since this is a procedure and a RefCursor, you can perform these operations:
ExecuteReader
Set the FetchSize -> do this before you start reading
Read the Results
If you follow the above sequence of actions you will be able to obtain the size of a row.
e.g.
int NumberOfRowsToFetchPerRoundTrip = 2000;
var reader = cmd.ExecuteReader();
reader.FetchSize = reader.RowSize * NumberOfRowsToFetchPerRoundTrip;
while (reader.Read())
{
    // Do something
}
This will reduce the number of round trips. The fetch size is in bytes, so you could also just use an arbitrary FetchSize, e.g. 1024*1024 (1 MB). However, I would recommend basing your fetch size on the size of a row multiplied by the number of rows you want to fetch per request.
In addition, I would set these parameters in the connection string:
Enlist=false;Self Tuning=False
You seem to get more consistent performance with these settings, although there may be some variation from one version of ODP.NET to the next.
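For example (data source and credentials are placeholders):
Data Source=MYDB;User Id=myuser;Password=...;Enlist=false;Self Tuning=False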