I want to process a CSV file stored in a cloud bucket and insert its data into a BigQuery table. I found the following piece of code, but I am not sure how to instantiate com.google.cloud.bigquery.Table for a given table name:
com.google.cloud.bigquery.Table table = null;
com.google.cloud.bigquery.Job job = table.load(FormatOptions.csv(), sourceUri);
com.google.cloud.bigquery.Job completedJob = job.waitFor(
        WaitForOption.checkEvery(1, TimeUnit.SECONDS),
        WaitForOption.timeout(3, TimeUnit.MINUTES));
if (!(completedJob != null && completedJob.getStatus().getError() == null)) {
    throw new InterruptedException("Unable to load file from bucket into BQ");
}
return job;
Snippet taken from here.
[imports]
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("dataset", "table");
Table table = bigquery.getTable(tableId);
[..]
Side note - that is an Alpha client library you are using. Just so you know.
I am trying to upload JSON data to a table in a BigQuery dataset, but the upload fails with "Google.GoogleApiException: 'Google.Apis.Requests.RequestError
Not found: Table currency-342912:sampleDataset.currencyTable [404]"
The service account was created with the roles BigQuery Admin/DataEditor/DataOwner/DataViewer.
The roles are also applied to the table.
Below is the snippet:
public static void LoadTableGcsJson(string projectId = "currency-342912", string datasetId = "sampleDataset", string tableId= "currencyTable ")
{
//Read the Serviceaccount key json file
string dir = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "currency-342912-ae9b22f23a36.json";
GoogleCredential credential = GoogleCredential.FromFile(dir);
string toFileName = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "sample.json";
BigQueryClient client = BigQueryClient.Create(projectId,credential);
var dataset = client.GetDataset(datasetId);
using (FileStream stream = File.Open(toFileName, FileMode.Open))
{
// Create and run job
BigQueryJob loadJob = client.UploadJson(datasetId, tableId, null, stream); //This throws error
loadJob.PollUntilCompleted();
}
}
Permissions for the table, using the service account "sampleservicenew" from the screenshot
Any leads on this would be much appreciated.
Your issue might lie in your user credentials. Please follow these steps to check your code:
Check that the user running your application has access to the table you want to insert data into.
Check that your JSON keys match your table columns.
Check that your inputs (table name, dataset name) are correct.
Use a dummy table to perform a quick test of your credentials and data integrity.
These steps will help you identify what could be missing on your side. I performed the following operations to reproduce your case:
I created a table on BigQuery based on the values of your json data:
create or replace table `projectid.datasetid.tableid` (
IsFee BOOL,
BlockDateTime timestamp,
Address STRING,
BlockHeight INT64,
Type STRING,
Value INT64
);
Created a .json file with your test data
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r","BlockHeight":98304,"Type":"OUT","Value":1}
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS","BlockHeight":98304,"Type":"IN","Value":18}
Build and run the code below:
using System;
using Google.Cloud.BigQuery.V2;
using Google.Apis.Auth.OAuth2;
using System.IO;

namespace stackoverflow
{
    class Program
    {
        static void Main(string[] args)
        {
            String projectid = "projectid";
            String datasetid = "datasetid";
            String tableid = "tableid";
            String safilepath = "credentials.json";
            // Load the service account key and create the client
            var credentials = GoogleCredential.FromFile(safilepath);
            BigQueryClient client = BigQueryClient.Create(projectid, credentials);
            using (FileStream stream = File.Open("data.json", FileMode.Open))
            {
                // Create and run the load job
                BigQueryJob loadJob = client.UploadJson(datasetid, tableid, null, stream);
                loadJob.PollUntilCompleted();
            }
        }
    }
}
Output:
Row  IsFee  BlockDateTime            Address                               BlockHeight  Type  Value
1    false  2018-09-11 00:12:14 UTC  tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r  98304        OUT   1
2    false  2018-09-11 00:12:14 UTC  tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS  98304        IN    18
Note: You can use the code above to quickly test your credentials and the integrity of the data you want to insert.
I also made use of the following documentation:
Load Credentials from a file
Google.Cloud.BigQuery.V2
Load Json data into a new table
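One detail worth checking against the third step in the checklist: in the question's snippet as posted, the default value of tableId is "currencyTable " with a trailing space, which would name a different table than currencyTable. Whether that is the actual cause of the 404 is not confirmed here. The following is only a minimal Java sketch of such a check; the class and method names are hypothetical, and the values are copied from the question's method signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdentifierCheck {
    /** Returns true if the identifier has leading or trailing whitespace. */
    static boolean hasStrayWhitespace(String id) {
        return id != null && !id.equals(id.trim());
    }

    public static void main(String[] args) {
        // Defaults copied from the question's LoadTableGcsJson signature.
        Map<String, String> ids = new LinkedHashMap<>();
        ids.put("projectId", "currency-342912");
        ids.put("datasetId", "sampleDataset");
        ids.put("tableId", "currencyTable ");

        for (Map.Entry<String, String> e : ids.entrySet()) {
            if (hasStrayWhitespace(e.getValue())) {
                // Brackets make the stray whitespace visible in the output.
                System.out.println(e.getKey() + " has stray whitespace: [" + e.getValue() + "]");
            }
        }
    }
}
```

Run as-is, this reports only tableId. The same check is a one-liner in C# with Trim().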
I am using Databricks spark-avro to convert a DataFrame schema into an Avro schema. The returned Avro schema does not contain default values, which causes issues when I try to create a GenericRecord from the schema. Can anyone help with the right way to use this function?
Dataset<Row> sellableDs = sparkSession.sql("sql query");
SchemaBuilder.RecordBuilder<Schema> rb = SchemaBuilder.record("testrecord").namespace("test_namespace");
Schema sc = SchemaConverters.convertStructToAvro(sellableDs.schema(), rb, "test_namespace");
System.out.println(sc.toString());
System.out.println(sc.getFields().get(0).toString());
String schemaString = sc.toString();
sellableDs.foreach(
(ForeachFunction<Row>) row -> {
Schema scEx = new Schema.Parser().parse(schemaString);
GenericRecord gr;
gr = new GenericData.Record(scEx);
System.out.println("Generic record Created");
int fieldSize = scEx.getFields().size();
for (int i = 0; i < fieldSize; i++ ) {
// System.out.println( row.get(i).toString());
System.out.println("field: " + scEx.getFields().get(i).toString() + "::" + "value:" + row.get(i));
gr.put(scEx.getFields().get(i).toString(), row.get(i));
//i++;
}
}
);
This is the df schema:
StructType(StructField(key,IntegerType,true), StructField(value,DoubleType,true))
This is the avro converted schema:
{"type":"record","name":"testrecord","namespace":"test_namespace","fields":[{"name":"key","type":["int","null"]},{"name":"value","type":["double","null"]}]}
The problem is that the SchemaConverters class does not include default values when it creates the schema. You have two options: modify the schema to add default values before creating the record, or set every field to some value before building the record (this could be the actual values from your row, or for example null). This is an example of how to create a record using your schema:
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.avro.Schema
var schema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"testrecord\",\"namespace\":\"test_namespace\",\"fields\":[{\"name\":\"key\",\"type\":[\"int\",\"null\"]},{\"name\":\"value\",\"type\":[\"double\",\"null\"]}]}")
var builder = new GenericRecordBuilder(schema);
for (i <- 0 to schema.getFields().size() - 1 ) {
builder.set(schema.getFields().get(i).name(), null)
}
var record = builder.build();
print(record.toString())
This is my first question and I hope you can help me. I wrote a script in Groovy (in Oracle Data Integrator 12c) to automate mappings. Here is a description of my procedure:
Step 1: remove the old mapping if it exists.
Step 2: look for the project and the folder (if they don't exist, create new ones).
Step 3: create the new mapping.
Step 4: add the source and target tables.
Step 5: create the expression.
Step 6: link every column.
Now my question: can someone help me make this script work with a dynamic expression? Like this:
Step 1: get the data types of the target columns.
Step 2: put the right data types into the expression.
Step 3: change the wrong types (always Varchar) into the right types (Number, Date, or still Varchar).
Step 4: link every column.
My handicap: I have never worked with Groovy, and I'm not very good at Java, so I can't make this dynamic myself. Almost everything in my script is pieced together from various websites. It would be great to find someone who knows about this problem. I also think it would be a useful script for anyone migrating from OWB to ODI.
Thanks!
//Created by ODI Studio
//
//name of the project
projectName = "SRC_TO_TRG"
//name of the folder
ordnerName = "FEN_TEST"
//name of the mapping
mappingName = "MAP1_FF_TO_TRG"
//name of the model
modelName = "DB_FEN"
//name of the source datastore
sourceDatastoreName = "SRC_TEST_FEN"
//name of the target datastore
targetDatastoreName = "TRG_TEST_FEN"
import oracle.odi.domain.project.finder.IOdiProjectFinder
import oracle.odi.domain.model.finder.IOdiDataStoreFinder
import oracle.odi.domain.project.finder.IOdiFolderFinder
import oracle.odi.domain.project.finder.IOdiKMFinder
import oracle.odi.domain.mapping.finder.IMappingFinder
import oracle.odi.domain.adapter.project.IKnowledgeModule.ProcessingType
import oracle.odi.domain.model.OdiDataStore
import oracle.odi.core.persistence.transaction.support.DefaultTransactionDefinition
//set expression to the component
def createExp(comp, tgtTable, propertyName, expressionText) {
DatastoreComponent.findAttributeForColumn(comp,tgtTable.getColumn(propertyName)) .setExpressionText(expressionText)
}
//delete mapping with the same name
def removeMapping(folder, map_name) {
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
try {
Mapping map = ((IMappingFinder) tme.getFinder(Mapping.class)).findByName(folder, map_name)
if (map != null) {
odiInstance.getTransactionalEntityManager().remove(map);
}
} catch (Exception e) {e.printStackTrace();}
tm.commit(txnStatus)
}
//looking for a project and folder
def find_folder(project_code, folder_name) {
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
pf = (IOdiProjectFinder)tme.getFinder(OdiProject.class)
ff = (IOdiFolderFinder)tme.getFinder(OdiFolder.class)
project = pf.findByCode(project_code)
//if there is no project, create new one
if (project == null) {
project = new OdiProject(project_code, project_code)
tme.persist(project)
}
//if there is no folder, create new one
folderColl = ff.findByName(folder_name, project_code)
OdiFolder folder = null
if (folderColl.size() == 1)
folder = folderColl.iterator().next()
if (folder == null) {
folder = new OdiFolder(project, folder_name)
tme.persist(folder)
}
tm.commit(txnStatus)
return folder
}
//name of the project and the folder
folder = find_folder(projectName,ordnerName)
//delete old mapping
removeMapping(folder, mappingName)
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
dsf = (IOdiDataStoreFinder)tme.getFinder(OdiDataStore.class)
mapf = (IMappingFinder) tme.getFinder(Mapping.class)
//create new mapping
map = new Mapping(mappingName, folder);
tme.persist(map)
//insert source table
boundTo_emp = dsf.findByName(sourceDatastoreName, modelName)
comp_emp = new DatastoreComponent(map, boundTo_emp)
//insert target table
boundTo_tgtemp = dsf.findByName(targetDatastoreName, modelName)
comp_tgtemp = new DatastoreComponent(map, boundTo_tgtemp)
//create expression-operator
comp_expression = new ExpressionComponent(map, "EXPRESSION")
// define expression
comp_expression.addExpression("LAND_KM", "TO_NUMBER(SRC_TEST_FEN.LAND_KM)", null,null,null);
comp_expression.addExpression("DATE_OF_ELECTION", "TO_DATE(SRC_TEST_FEN.DATE_OF_ELECTION, 'DD.MM.YYYY')", null,null,null);
//further transformations can be appended here
//link source table with expression
comp_emp.connectTo(comp_expression)
//link expression with target table
comp_expression.connectTo(comp_tgtemp)
createExp(comp_tgtemp, boundTo_tgtemp, "ABBR", "SRC_TEST_FEN.ABBR")
createExp(comp_tgtemp, boundTo_tgtemp, "NAME", "SRC_TEST_FEN.NAME")
createExp(comp_tgtemp, boundTo_tgtemp, "LAND_KM", "EXPRESSION.LAND_KM")
createExp(comp_tgtemp, boundTo_tgtemp, "DATE_OF_ELECTION", "EXPRESSION.DATE_OF_ELECTION")
tme.persist(map)
tm.commit(txnStatus)
You can pass the Datatype as the third argument of the method addExpression.
You can also pass the size and the scale as fourth and fifth arguments.
For instance, for the LAND_KM expression, replace your line with this:
MapAttribute map_attr = DatastoreComponent.findAttributeForColumn(comp_tgtemp,boundTo_tgtemp.getColumn("LAND_KM"))
comp_expression.addExpression("LAND_KM", "TO_NUMBER(SRC_TEST_FEN.LAND_KM)", map_attr.getDataType(),map_attr.getSize(),map_attr.getScale());
It retrieves the target attribute for LAND_KM using findAttributeForColumn, then reads its data type, size and scale, and uses them when adding the new expression to the Expression component.
If you want to auto map it based on the name, David Allan wrote a post on the official Oracle blog about how to do it and he provides his code : https://blogs.oracle.com/dataintegration/entry/odi_12c_mapping_sdk_auto
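To make the expression text itself dynamic (steps 1 to 3 in the question), the conversion can be chosen from the target column's type. Below is a minimal Java sketch of that string logic only. The class name, the method name, and the use of plain type-name strings such as "NUMBER" or "DATE" are assumptions for illustration; how you turn the ODI data type returned by map_attr.getDataType() into such a name is not shown.

```java
import java.util.Locale;

class ExpressionBuilder {
    /**
     * Builds the expression text for one column.
     * targetType is a plain type name such as "NUMBER", "DATE" or "VARCHAR2"
     * (an assumed representation, not an ODI object).
     */
    static String buildExpression(String sourceTable, String column,
                                  String targetType, String dateFormat) {
        String ref = sourceTable + "." + column;
        String type = targetType.toUpperCase(Locale.ROOT);
        if (type.startsWith("NUMBER")) {
            return "TO_NUMBER(" + ref + ")";
        }
        if (type.startsWith("DATE")) {
            return "TO_DATE(" + ref + ", '" + dateFormat + "')";
        }
        // VARCHAR and anything unrecognised: pass the source column through unchanged.
        return ref;
    }

    public static void main(String[] args) {
        String fmt = "DD.MM.YYYY";
        System.out.println(buildExpression("SRC_TEST_FEN", "LAND_KM", "NUMBER", fmt));
        System.out.println(buildExpression("SRC_TEST_FEN", "DATE_OF_ELECTION", "DATE", fmt));
        System.out.println(buildExpression("SRC_TEST_FEN", "ABBR", "VARCHAR2", fmt));
    }
}
```

For the two converted columns this produces the same expression strings that are hard-coded in the question's script. In the Groovy script, the result would be passed as the second argument of addExpression inside a loop over the target columns.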
I'm using BigQuery's Java API. I'm running a select query and want the result saved to a destination table.
I've called loadConfig.setDestinationTable(), but I am getting "Load configuration must specify at least one source URI".
Could you please explain what I am doing wrong?
You don't want to set the loadConfig destination table, but the queryConfig.setDestinationTable() instead (since this isn't a load job -- it is a query job). As Fh said, if you share the code you're using we can give more detailed help.
This is the code I am using to do this:
public static String copyTable(String project, String dataSet, String table) {
    String newTableName = table + "_copy_" + System.currentTimeMillis();
    try {
        Job copyJob = new Job();

        // Source table
        TableReference source = new TableReference();
        source.setProjectId(project);
        source.setDatasetId(dataSet);
        source.setTableId(table);

        // Destination table
        TableReference destination = new TableReference();
        destination.setProjectId(project);
        destination.setDatasetId(dataSet);
        destination.setTableId(newTableName);

        // Configure and submit the copy job
        JobConfiguration configuration = new JobConfiguration();
        JobConfigurationTableCopy copyConf = new JobConfigurationTableCopy();
        copyConf.setSourceTable(source);
        copyConf.setDestinationTable(destination);
        configuration.setCopy(copyConf);
        copyJob.setConfiguration(configuration);
        bigquery.jobs().insert(project, copyJob).execute();
        return newTableName;
    } catch (Exception e) {
        e.printStackTrace();
        logger.warn("unable to copy table :" + project + "."
                + dataSet + "." + table, e);
        throw new RuntimeException(e);
    }
}
please contact me if you have any more questions
Assuming you are running an interactive asynchronous query, you essentially want to pass the query, destination projectId, destination dataSetId and destination tableId in one request body. Refer to the Java API example here: https://developers.google.com/bigquery/querying-data#asyncqueries
I've tried two methods and both fall flat...
BULK INSERT TEMPUSERIMPORT1357081926
FROM 'C:\uploads\19E0E1.csv'
WITH (FIELDTERMINATOR = ',',ROWTERMINATOR = '\n')
You do not have permission to use the bulk load statement.
but you cannot enable that SQL Role with Amazon RDS?
So I tried... using openrowset but it requires AdHoc Queries to be enabled which I don't have permission to do!
I know this question is really old, but it was the first one that came up when I searched for bulk inserting into an AWS SQL Server RDS instance. Things have changed, and you can now do it after integrating the RDS instance with S3. I answered this in more detail on another question, but the gist is that you set up the instance with the proper role, put your file on S3, and then copy the file over to RDS with the following commands:
exec msdb.dbo.rds_download_from_s3
@s3_arn_of_file='arn:aws:s3:::bucket_name/bulk_data.csv',
@rds_file_path='D:\S3\seed_data\data.csv',
@overwrite_file=1;
Then BULK INSERT will work:
BULK INSERT TEMPUSERIMPORT1357081926 -- target table name taken from the question; use your own
FROM 'D:\S3\seed_data\data.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
AWS doc
You can enable ad hoc distributed queries by going to the AWS Management Console, navigating to the RDS menu, creating a DB parameter group with ad hoc distributed queries set to 1, and then attaching this parameter group to your DB instance.
Don't forget to reboot your DB once you have made these changes.
Here is the source of my information:
http://blogs.lessthandot.com/index.php/datamgmt/dbadmin/turning-on-optimize-for-ad/
Hope this helps you.
2022
I'm adding this for anyone like me who wants to quickly insert data into RDS from C#.
While RDS allows CSV bulk uploads directly from S3, there are times when you just want to upload data straight from your program.
I've written a C# utility method that uses a StringBuilder to concatenate statements and do 2000 inserts per call, which is much faster than an ORM like Dapper doing one insert per call.
This method should handle date, int, double, and varchar fields, but I haven't tested it beyond basic character escaping.
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Reflection;
using System.Text;
using Dapper; // assumed: connection.Execute below is Dapper's extension method

//call as
//FastInsert.Insert(MyDbConnection, new object[] { new { someField = "someValue" } }, "my_table");
class FastInsert
{
    static int rowSize = 2000;
    const string unionAll = " union all ";

    internal static void Insert(IDbConnection connection, object[] data, string targetTable)
    {
        // Column names come from the properties of the first row
        var props = data[0].GetType().GetProperties();
        var names = props.Select(x => x.Name).ToList();
        foreach (var batch in data.Batch(rowSize))
        {
            var sb = new StringBuilder($"insert into {targetTable} ({string.Join(",", names)}) ");
            string lastLine = "";
            foreach (var row in batch)
            {
                sb.Append(lastLine);
                // CreateSQLString already quotes each value, so join with plain commas
                var values = props.Select(prop => CreateSQLString(row, prop));
                lastLine = $"select {string.Join(",", values)}{unionAll}";
            }
            // Drop the trailing " union all " from the final row.
            // Note: "from dual" works on Oracle and MySQL; SQL Server has no DUAL table, so drop it there.
            lastLine = lastLine.Substring(0, lastLine.Length - unionAll.Length) + " from dual";
            sb.Append(lastLine);
            var fullQuery = sb.ToString();
            connection.Execute(fullQuery);
        }
    }

    private static string CreateSQLString(object row, PropertyInfo prop)
    {
        var value = prop.GetValue(row);
        if (value == null) return "null";
        if (prop.PropertyType == typeof(DateTime))
        {
            return $"'{((DateTime)value).ToString("yyyy-MM-dd HH:mm:ss")}'";
        }
        // Everything else is written as a quoted literal with single quotes doubled
        return $"'{value.ToString().Replace("'", "''")}'";
    }
}

static class Extensions
{
    //split an IEnumerable into batches
    public static IEnumerable<T[]> Batch<T>(this IEnumerable<T> source, int size)
    {
        T[] bucket = null;
        var count = 0;
        foreach (var item in source)
        {
            if (bucket == null)
                bucket = new T[size];
            bucket[count++] = item;
            if (count != size)
                continue;
            yield return bucket;
            bucket = null;
            count = 0;
        }
        // Return the last bucket with all remaining elements
        if (bucket != null && count > 0)
        {
            Array.Resize(ref bucket, count);
            yield return bucket;
        }
    }
}
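To see what kind of statement this batching approach sends, here is a minimal Java sketch of the same idea that only builds the SQL strings and does not touch a database. The class name, method names, and sample data are hypothetical. Unlike the C# above it omits the trailing "from dual", on the assumption that the target engine accepts a select without a FROM clause. As with any SQL built by string concatenation, it is only suitable for trusted input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchInsertSketch {
    /** Splits rows into consecutive batches of at most size elements. */
    static <T> List<List<T>> batch(List<T> rows, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            batches.add(new ArrayList<>(rows.subList(i, Math.min(i + size, rows.size()))));
        }
        return batches;
    }

    /** Quotes a value as a SQL string literal, doubling single quotes; null stays unquoted. */
    static String literal(String value) {
        return value == null ? "null" : "'" + value.replace("'", "''") + "'";
    }

    /** Builds one "insert ... select ... union all select ..." statement for a batch. */
    static String buildInsert(String table, List<String> columns, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder(
                "insert into " + table + " (" + String.join(",", columns) + ") ");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                sb.append(" union all ");
            }
            List<String> quoted = new ArrayList<>();
            for (String v : rows.get(r)) {
                quoted.add(literal(v));
            }
            sb.append("select ").append(String.join(",", quoted));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("a", "1"),
                Arrays.asList("O'Brien", null),
                Arrays.asList("c", "3"));
        // A batch size of 2 is used here only to keep the printed statements short.
        for (List<List<String>> b : batch(rows, 2)) {
            System.out.println(buildInsert("my_table", Arrays.asList("name", "val"), b));
        }
    }
}
```

With three rows and a batch size of 2, this prints two statements: one covering the first two rows and one for the remaining row.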