I am trying to download data from a BigQuery table which has 3 million records. I get the error
"response too large to return, try will allow_large_results = true"
I tried the following command:
df = bq.Query('SELECT * FROM [Test.results]', allow_large_results = True).to_dataframe()
Any help would be greatly appreciated.
The way to retrieve the result of a query that is expected to be bigger than ~128 MB is to issue a query job (jobs.insert API) with a destination table and the allow large results flag set. After the result is stored in that table, you can retrieve it using tabledata.list. Of course, you can then delete that [intermediate] table.
Hopefully you can identify the corresponding syntax in the client you are using.
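For example, with the google-cloud-bigquery Python client the flow looks roughly like this (a minimal sketch; the dataset and temporary table names are placeholders, and use_legacy_sql is only set because the original query uses the legacy [dataset.table] syntax):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination for the intermediate result
dest_table = bigquery.DatasetReference(client.project, "Test").table("results_tmp")

job_config = bigquery.QueryJobConfig(
    destination=dest_table,
    allow_large_results=True,  # large-results flag (legacy SQL)
    use_legacy_sql=True,       # the [dataset.table] syntax is legacy SQL
)

job = client.query("SELECT * FROM [Test.results]", job_config=job_config)
job.result()  # wait for the query job to finish

# Read the stored result (tabledata.list under the hood), then clean up
df = client.list_rows(dest_table).to_dataframe()
client.delete_table(dest_table)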
This is quite old, but for those who land here, the way to do it is:
from google.cloud import bigquery
...
client = bigquery.Client()
job_config = bigquery.job.QueryJobConfig(allow_large_results=True)
# Note: this client defaults to standard SQL, so reference the table with
# backticks rather than the legacy [dataset.table] syntax.
q = client.query("""SELECT * FROM `Test.results`""", job_config=job_config)
r = q.result()
df = r.to_dataframe()
From the docs here.
I have a table as shown
ID (int) | DATA (bytea)
1 | \x800495356.....
The contents of the data column have been stored via a python script
result_dict = {'datapoint1': 100, 'datapoint2': 2.334}
table.data = pickle.dumps(result_dict)
I can easily read the data back using
queried_dict = pickle.loads(table.data)
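(For context, a minimal sketch of how such a value typically gets written and read on the Python side, assuming psycopg2; the connection string and my_table are hypothetical, and the actual script may differ.)

import pickle
import psycopg2

# Hypothetical connection and table, just to illustrate the round trip
conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

result_dict = {'datapoint1': 100, 'datapoint2': 2.334}

# pickle.dumps() returns bytes; psycopg2 adapts them to a bytea parameter
cur.execute("INSERT INTO my_table (id, data) VALUES (%s, %s)",
            (1, pickle.dumps(result_dict)))
conn.commit()

# Reading it back: the bytea value comes out as a bytes-like object
cur.execute("SELECT data FROM my_table WHERE id = 1")
queried_dict = pickle.loads(bytes(cur.fetchone()[0]))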
But I don't know how to query it directly as JSON, or even as plain text, in Postgres alone. I have tried the following query and many variations of it, but it doesn't seem to work:
-- I don't know what should come between SELECT and FROM
SELECT encode(data, 'escape') AS res FROM table WHERE id = 1;
-- I need to get this or somewhere close to this as the query result
res |
{"datapoint1": 100, "datapoint2": 2.33}
Thanks a lot in advance to everyone trying to help.
We have a rather simple query that gets stuck:
UPDATE BRI_PRINT_DOCUMENT pd
SET PRINT_STATUS_ID = 6
WHERE pd.PRINT_STATUS_ID = 1
AND pd.DOCUMENT_TYPE_ID IN (
SELECT jd.DOCUMENT_TYPE_ID
FROM BRI_JOBTYPE_DOCUMENTTYPE jd
WHERE jd.JOB_TYPE_ID = 2);
COMMIT;
The following query doesn't help:
UPDATE (SELECT *
from BRI_PRINT_DOCUMENT PD
INNER JOIN BRI_JOBTYPE_DOCUMENTTYPE BJD
on PD.DOCUMENT_TYPE_ID = BJD.DOCUMENT_TYPE_ID
AND JOB_TYPE_ID = 2
AND PRINT_STATUS_ID = 1) joined
SET joined.PRINT_STATUS_ID = 6;
We have trouble understanding the problem, as the following query is fast:
SELECT * FROM BRI_PRINT_DOCUMENT pd
WHERE PRINT_STATUS_ID = 1
AND pd.DOCUMENT_TYPE_ID IN (
SELECT jd.DOCUMENT_TYPE_ID
FROM BRI_JOBTYPE_DOCUMENTTYPE jd
WHERE jd.JOB_TYPE_ID = 2);
Any idea what causes the problem?
Although my colleague and I had the hanging-query problem through SQL Developer / DataGrip, the problem got magically resolved when the DB admin executed the exact same query with Toad.
There was no table lock, the DB admin didn't receive any alert of problems, and we have no idea what caused it.
Thanks Oracle :-)
Note: there are only a few thousand records in our dev database.
Explain plan:
For starters, I'll admit that I'm quite new to dataframes/Databricks, having worked with them for only a few months.
I have two dataframes read from Parquet files (full format). In reviewing the documentation, it appears that what pandas calls merge is in fact only a join.
In SQL I would write this step as:
ml_RETURNS_U = sqlContext.sql("""
MERGE INTO U2 as target
USING U as source
ON (
target.ITEMNUMBER = source.ITEMNUMBER
and target.PRODUCTCOLORID = source.PRODUCTCOLORID
and target.WEEK_ID = source.WEEK_ID
)
WHEN MATCHED THEN
UPDATE SET target.RETURNSALESQUANTITY = target.RETURNSALESQUANTITY + source.QTY_DELIVERED
WHEN NOT MATCHED THEN
INSERT (ITEMNUMBER, PRODUCTCOLORID, WEEK_ID, RETURNSALESQUANTITY)
VALUES (source.ITEMNUMBER, source.PRODUCTCOLORID, source.WEEK_ID, source.QTY_DELIVERED)
""")
When I run this command I get the following error: u'MERGE destination only supports Delta sources.\n;'
So I have two questions: is there a way I can perform this operation using pandas or PySpark?
If not, how can I resolve this error?
You can create your tables using Delta and perform this operation.
See: https://docs.databricks.com/delta/index.html
Then you can do an upsert using MERGE like this: https://docs.databricks.com/delta/delta-batch.html#write-to-a-table
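For example (a rough sketch, assuming the two dataframes are available as df_u and df_u2, and that the table names are placeholders), write both sides out as Delta tables and the original MERGE then runs unchanged:

# Save both dataframes as Delta tables (names here are placeholders)
df_u2.write.format("delta").mode("overwrite").saveAsTable("U2")
df_u.write.format("delta").mode("overwrite").saveAsTable("U")

# Once the target is a Delta table, the MERGE statement from the question works as-is
spark.sql("""
    MERGE INTO U2 AS target
    USING U AS source
    ON target.ITEMNUMBER = source.ITEMNUMBER
       AND target.PRODUCTCOLORID = source.PRODUCTCOLORID
       AND target.WEEK_ID = source.WEEK_ID
    WHEN MATCHED THEN
      UPDATE SET target.RETURNSALESQUANTITY = target.RETURNSALESQUANTITY + source.QTY_DELIVERED
    WHEN NOT MATCHED THEN
      INSERT (ITEMNUMBER, PRODUCTCOLORID, WEEK_ID, RETURNSALESQUANTITY)
      VALUES (source.ITEMNUMBER, source.PRODUCTCOLORID, source.WEEK_ID, source.QTY_DELIVERED)
""")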
I am using the BigQuery Java API to populate the data from one table into another; the following is the code I am using:
Job insertJob = new Job();
JobConfiguration insertJobConfig = new JobConfiguration();
TableReference destinationTable = new TableReference();
destinationTable.setProjectId(projectId);
destinationTable.setDatasetId(datasetId);
destinationTable.setTableId(destinationBQTable);
JobConfigurationQuery queryConfig = new JobConfigurationQuery();
queryConfig.setQuery("select * from " + datasetId + Constant.PERIOD +sourceBQTable);
queryConfig.setDestinationTable(destinationTable);
queryConfig.setWriteDisposition("WRITE_TRUNCATE");
queryConfig.setPriority("BATCH");
insertJob.setConfiguration(insertJobConfig.setQuery(queryConfig));
Bigquery.Jobs.Insert request = bigqueryService.jobs().insert(projectId, insertJob);
Job response = request.execute();
return response.getJobReference().getJobId();
I am facing an intermittent issue where my destination table is not fully populated; for example, the source table has 189,856 rows but the destination table has only 41,721 rows. I am not seeing any error in the logs.
Has anyone experienced this before with BigQuery: a query appearing to run successfully, but when a destination/reference table was specified the results were not fully populated?
Note: we again faced this problem on July 21 and this time I also logged the Job Id which is: job_8zt5hHdsPhizl2RFZ9g57EMcgy0
Thanks,
Aman
Please consider this model; it's for a fitness center management app.
ADHERANT is the members table
INSCRIPTION is the subscriptions table
SEANCE is the individual sessions table
The SEANCE table contains very few rows (around 7,000).
Now the query:
var q = from n in ctx.SEANCES
select new SeanceJournalType()
{
ID_ADHERANT = n.INSCRIPTION.INS_ID_ADHERANT,
ADH_NOM = n.INSCRIPTION.ADHERANT.ADH_NOM,
ADH_PRENOM = n.INSCRIPTION.ADHERANT.ADH_PRENOM,
ADH_PHOTO = n.INSCRIPTION.ADHERANT.ADH_PHOTO,
SEA_DEBUT = n.SEA_DEBUT
};
var h = q.ToList();
This takes around 3 seconds, which is an eternity; the same generated SQL query is almost instantaneous:
SELECT
1 AS "C1",
"C"."INS_ID_ADHERANT" AS "INS_ID_ADHERANT",
"E"."ADH_NOM" AS "ADH_NOM",
"E"."ADH_PRENOM" AS "ADH_PRENOM",
"E"."ADH_PHOTO" AS "ADH_PHOTO",
"B"."SEA_DEBUT" AS "SEA_DEBUT"
FROM "TMP_SEANCES" AS "B"
LEFT OUTER JOIN "INSCRIPTIONS" AS "C" ON "B"."INS_ID_INSCRIPTION" = "C"."ID_INSCRIPTION"
LEFT OUTER JOIN "ADHERANTS" AS "E" ON "C"."INS_ID_ADHERANT" = "E"."ID_ADHERANT"
Any idea what's going on, or how to fix that?
Thanks
It needs some research to optimize this:
If you neglect the data transfer from the DB to the server, then, as Ivan Stoev suggested, calling the ToList method is the expensive part.
As for improving the performance, it depends on your needs:
1. If you need add/delete functionality on the server side, it is probably best to stick with the List.
2. If there is no need for add/delete, then consider ICollection<T>, or even better:
3. If you have more conditions that will customize the query even further, use IQueryable.
Customizing the query, for example selecting a single record based on a condition:
var q = from n in ctx.SEA.... // your query from above, without ToList()
q = q.Where(x => x.ID_ADHERANT == 1); // e.g. a condition such as a single id
This way only one record is transferred from the database to the server, but with the ToList() conversion all the records are transferred to the server first and the condition is evaluated afterwards.
Although it is not always best to use IQueryable, it depends on your business need.
For more references, check this and this.