I am using the BigQuery Java API to populate data from one table into another. The following is the code I am using:
Job insertJob = new Job();
JobConfiguration insertJobConfig = new JobConfiguration();
TableReference destinationTable = new TableReference();
destinationTable.setProjectId(projectId);
destinationTable.setDatasetId(datasetId);
destinationTable.setTableId(destinationBQTable);
JobConfigurationQuery queryConfig = new JobConfigurationQuery();
queryConfig.setQuery("select * from " + datasetId + Constant.PERIOD +sourceBQTable);
queryConfig.setDestinationTable(destinationTable);
queryConfig.setWriteDisposition("WRITE_TRUNCATE");
queryConfig.setPriority("BATCH");
insertJob.setConfiguration(insertJobConfig.setQuery(queryConfig));
Bigquery.Jobs.Insert request = bigqueryService.jobs().insert(projectId, insertJob);
Job response = request.execute();
return response.getJobReference().getJobId();
I am facing an intermittent issue where my destination table is not fully populated; for example, the source table has 189,856 rows but the destination table has only 41,721 rows. I am not seeing any error in the logs.
Has anyone experienced this before with BigQuery: a query appearing to run successfully, but if a destination/reference table was specified, the results were not fully populated?
Note: we faced this problem again on July 21, and this time I also logged the job ID, which is: job_8zt5hHdsPhizl2RFZ9g57EMcgy0
Thanks,
Aman
I have an SSAS cube that imports a view of data from a source system. Each time it processes, it imports the full view; however, I want to improve performance by only processing the rows that are new or have changed since the last process. Can anyone advise the best way of doing this?
The view has an ID column, along with a created date and a modified date, if that helps.
I have not known what to try, even after googling.
What is your model processing strategy? Why are you doing a Process Full?
There is an option called Process Add, which loads only the new data that has not yet been brought into the cube, based on a condition that lets you segment the modified data that has not yet been added to the model.
Here's a quick snippet of process add logic:
[Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices")
$server = New-Object Microsoft.AnalysisServices.Server
$server.connect("localhost\K12")
$db = $server.Databases.Item("AdventureWorks Tabular Model SQL 2012")
$dsv = $db.DataSourceViews.GetByName("Sandbox")
$cube = $db.Cubes.GetByName("Model")
$measureGroup = $cube.MeasureGroups.GetByName("Internet Sales")
$partition = $measureGroup.Partitions.GetByName("Internet Sales")
$queryBinding = New-Object Microsoft.AnalysisServices.QueryBinding( $dsv.DataSourceID, "SELECT * FROM FactInternetSales WHERE OrderDateKey >= 20120215" )
$partition.Process( "ProcessAdd", $queryBinding )
$server.Disconnect()
Use the query below to filter the records that have been modified recently:
SELECT *
FROM [table] AS t
WHERE t.modified_date = (SELECT MAX(modified_date)
                         FROM [table]
                         WHERE id = t.id)
   OR t.created_date IS NULL;
I am trying to download data from a BigQuery table which has 3 million records. I get the error:
"response too large to return, try with allow_large_results = true"
I tried the command below:
df = bq.Query('SELECT * FROM [Test.results]', allow_large_results = True).to_dataframe()
Any help would be greatly appreciated.
The way to retrieve the result of a query that is expected to be bigger than ~128 MB is to issue a query job (jobs.insert) with a destination table and the allow-large-results flag set. After the result is stored in that table, you can retrieve it using tabledata.list. Of course, you can then delete that [intermediate] table.
Hopefully you can identify the respective syntax in the client you are using.
This is quite old, but for those who land here, the way to do it is:
from google.cloud import bigquery
...
client = bigquery.Client()
job_config = bigquery.job.QueryJobConfig(allow_large_results=True)
q = client.query("""SELECT * FROM [Test.results]""", job_config=job_config)
r = q.result()
df = r.to_dataframe()
From the docs.
I have a problem I don't know how to solve. I have two tables, Staging and Operation. I created a Data Flow Task in SSIS to move columns from Staging ([Account_Num], [MergeFlag], [MergeTo], [StartDate], [EndDate]) to Operation ([ID], [Account_Num], [MergeFlag], [ID_MergeTo], [StartDate], [EndDate]). I want to run a second Data Flow Task to update Operation.[ID_MergeTo] in the Operation table using the Operation ID, if MergeFlag is true and MergeTo is defined.
My logic (with screenshots) for the second Data Flow Task that updates ID_MergeTo:
1- Extract data from Staging where MergeFlag is true and MergeTo is defined, using this SQL:
SELECT Account_Num, MergeFlag, MergeTo, StartDate, EndDate
FROM Tmp_SourceTable
WHERE (MergeFlag = 1) AND (MergeTo IS NOT NULL)
2- Create a Lookup that joins Staging.[Account_Num] to Operation.[Account_Num] to get data from both tables.
3- Data view after the Lookup task.
My question: what task should I use to update Operation.[ID_MergeTo] where ID = Operation.[ID]?
An OLE DB Command Transformation will help you update Operation.[ID_MergeTo].
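For example, the command inside that transformation could be a parameterized UPDATE along these lines (a sketch only; the ? placeholders are mapped to the Lookup's output columns, which show up as Param_0 and Param_1 in the Column Mappings tab):
-- Param_0 = the Operation ID found by the Lookup for the MergeTo account,
-- Param_1 = the Operation ID of the current row.
UPDATE Operation
SET ID_MergeTo = ?
WHERE ID = ?;
Keep in mind that the OLE DB Command fires once per input row, so for large volumes it is usually faster to land the rows in a staging table and run one set-based UPDATE from an Execute SQL Task.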
I have a data adapter with 4 tables in a dataset. When I add a new record to the table, it appears in the SQL database and the associated DataGridView is reloaded with the data in the table, but when I try to read the new record using the following code, it can't find the record.
Dim row As DataRow = dsSrvAV.Tables("ServiceAvailability").Select("ID = " & intRecordID).FirstOrDefault()
The same code is used to read other records that were in the database when the application opened; it's just the new records that it can't read.
This is the code that writes the new records:
Dim newAvailability As DataRow = dsSrvAV.Tables("ServiceAvailability").NewRow()
'Add some data to it
newAvailability("Service_ID") = cboServices.SelectedValue
newAvailability("Date") = Format(dtpDate.Value.ToString, "Short Date")
newAvailability("Downtime") = nudDowntime.Value
newAvailability("Notes") = txtNotes.Text
newAvailability("MajorIncident") = txtMajorIncident.Text
newAvailability("ActionsTaken") = txtActionsTaken.Text
newAvailability("Type") = cboType.SelectedValue
newAvailability("Root_Cause") = txtRootCause.Text
'Add it to the table
dsSrvAV.Tables("ServiceAvailability").Rows.Add(newAvailability)
'Update the adapter
daSrvAv.Update(dsSrvAV, "ServiceAvailability")
dsSrvAV.Tables("ServiceAvailability").AcceptChanges()
Can anyone offer any thoughts as to why this won't allow new records to be read back?
Thanks
Rich
Per comments - this solved the issue.
Close your dsSrvAv dataset, and then re-open it, and then do the select.
Regarding performance: are you adding 1 record per second, or 1,000,000? If it's 1,000,000 then yes, there's an overhead. If it's 1 per second, there isn't any noticeable overhead.
I have an SQL database containing a "feeder" table. I put records into said table, and a 3rd-party package consumes (and deletes) them. All hunky dory - until the 3rd-party package isn't running. In thinking about how to detect that, I thought to myself... "well... what if I read all the keys in the table (it's not very big - max a few dozen records), kept them, and then in, say, 5 minutes, checked whether any were still in the table?"
It may not be a brilliant solution, but it sent me off thinking about Linq and whether you could do such a thing (I haven't used Linq before).
So, if I read all the record keys into a DataTable object and then, five minutes later, read all the records into another DataTable object, I can do a Linq select joining the two DataTable objects on the key column and then look at the result of "Count"; if it is one or more, chances are the data in the table isn't being consumed.
Or... is there a "cleverer" way than that?
Create a DELETE trigger which records in a separate table the timestamp of the last delete. An additional INSERT trigger would record the timestamp of the last insert statement.
Compare the two timestamps.
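A minimal T-SQL sketch of that idea (the audit table and trigger names are made up for illustration; feeder_table is the table from the question):
-- One-row audit table holding the time of the last insert and the last delete.
CREATE TABLE feeder_activity (last_insert datetime NULL, last_delete datetime NULL);
INSERT INTO feeder_activity (last_insert, last_delete) VALUES (NULL, NULL);
GO
CREATE TRIGGER trg_feeder_insert ON feeder_table AFTER INSERT AS
    UPDATE feeder_activity SET last_insert = GETDATE();
GO
CREATE TRIGGER trg_feeder_delete ON feeder_table AFTER DELETE AS
    UPDATE feeder_activity SET last_delete = GETDATE();
GO
-- If last_delete lags well behind last_insert, the consumer has probably stopped.
SELECT last_insert, last_delete FROM feeder_activity;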
You could return the identity column value (assuming there is one) after your insert and record it in a separate table along with its commit datetime (see the sketch at the end of this answer), then just pull outstanding records with:
SELECT * FROM feeder_table F
INNER JOIN other_table T ON (F.id = T.id)
WHERE DATEDIFF(MINUTE, T.commitdate, GETDATE()) > 5
That way you're not persisting data in memory, so it will work between application restarts/across machines.
(If this is just for fault detection you would only need to store the last inserted id.)
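For the recording side, a minimal sketch (the payload column and @payload variable are just placeholders for whatever you insert) would be:
-- Insert the feeder row, then record its identity value and commit time in the tracking table.
INSERT INTO feeder_table (payload) VALUES (@payload);
INSERT INTO other_table (id, commitdate) VALUES (SCOPE_IDENTITY(), GETDATE());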
This is one way:
DataTable t1 = GetData(); // returns a datatable with an Int16 "Id" column
// time passes... a shabby man enters and steals your lamp
DataTable t2 = GetData();
// some data changes have occurred
t2.Rows.Add(null, DateTime.Now.AddSeconds(10), "Some more");
t2.Rows[1].Delete();
EnumerableRowCollection<DataRow> rows1 = t1.AsEnumerable();
EnumerableRowCollection<DataRow> rows2 = t2.AsEnumerable();
var keys1 = rows1.Select(row => (Int16)row["Id"]).ToList();
var keys2 = rows2.Select(row => (Int16)row["Id"]).ToList();
// how many keys from t1 are still in t2
Console.WriteLine("{0} rows still there", keys1.Count(id => keys2.Contains(id)));
But this is more what I had in mind:
DataTable t1 = GetData(); // returns a datatable with an Int16 "Id" column
// time passes... your lamp is getting dim
DataTable t2 = GetData();
// some data changes have occurred
t2.Rows.Add(null, DateTime.Now.AddSeconds(10), "Some more");
t2.Rows[1].Delete();
EnumerableRowCollection<DataRow> rows1 = t1.AsEnumerable();
EnumerableRowCollection<DataRow> rows2 = t2.AsEnumerable();
// how many rows from r1 are still in r2
int n = (from r1 in rows1
join r2 in rows2 on (Int16)r1["Id"] equals (Int16)r2["Id"]
select r1).Count();
...which is the "linq/join" method I alluded to in the original question.