BigQuery: Extract Job does not create file

I am working on a Java application which uses BigQuery as the analytics engine. I was able to run query jobs (and get results) using the code on Insert a Query Job. I had to modify the code to use a service account, following this comment on Stack Overflow.
Now I need to run an extract job to export a table to a bucket on Google Storage. Based on Exporting a Table, I was able to modify the Java code to insert extract jobs (code below). When run, the extract job's status changes from PENDING to RUNNING to DONE. The problem is that no file is actually uploaded to the specified bucket.
Info that might be helpful:
The createAuthorizedClient function returns a Bigquery instance and works for query jobs, so there are probably no issues with the service account, private key, etc.
I also tried creating and running the job manually using Google's API explorer, and the file is successfully created in the bucket. I am using the same values for project, dataset, table, and destination URI as in the code, so these should be correct.
Here is the code (pasting the complete file in case somebody else finds this useful):
import java.io.File;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.List;
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.Bigquery.Jobs.Insert;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationExtract;
import com.google.api.services.bigquery.model.JobReference;
import com.google.api.services.bigquery.model.TableReference;
public class BigQueryJavaGettingStarted {
private static final String PROJECT_ID = "123456789012";
private static final String DATASET_ID = "MY_DATASET_NAME";
private static final String TABLE_TO_EXPORT = "MY_TABLE_NAME";
private static final String SERVICE_ACCOUNT_ID = "123456789012-...@developer.gserviceaccount.com";
private static final File PRIVATE_KEY_FILE = new File("/path/to/privatekey.p12");
private static final String DESTINATION_URI = "gs://mybucket/file.csv";
private static final List<String> SCOPES = Arrays.asList(BigqueryScopes.BIGQUERY);
private static final HttpTransport TRANSPORT = new NetHttpTransport();
private static final JsonFactory JSON_FACTORY = new JacksonFactory();
public static void main (String[] args) {
try {
executeExtractJob();
} catch (Exception e) {
e.printStackTrace();
}
}
public static final void executeExtractJob() throws IOException, InterruptedException, GeneralSecurityException {
Bigquery bigquery = createAuthorizedClient();
//Create a new Extract job
Job job = new Job();
JobConfiguration config = new JobConfiguration();
JobConfigurationExtract extractConfig = new JobConfigurationExtract();
TableReference sourceTable = new TableReference();
sourceTable.setProjectId(PROJECT_ID).setDatasetId(DATASET_ID).setTableId(TABLE_TO_EXPORT);
extractConfig.setSourceTable(sourceTable);
extractConfig.setDestinationUri(DESTINATION_URI);
config.setExtract(extractConfig);
job.setConfiguration(config);
//Insert/Execute the created extract job
Insert insert = bigquery.jobs().insert(PROJECT_ID, job);
insert.setProjectId(PROJECT_ID);
JobReference jobId = insert.execute().getJobReference();
//Now check to see if the job has successfully completed (Optional for extract jobs?)
long startTime = System.currentTimeMillis();
long elapsedTime;
while (true) {
Job pollJob = bigquery.jobs().get(PROJECT_ID, jobId.getJobId()).execute();
elapsedTime = System.currentTimeMillis() - startTime;
System.out.format("Job status (%dms) %s: %s\n", elapsedTime, jobId.getJobId(), pollJob.getStatus().getState());
if (pollJob.getStatus().getState().equals("DONE")) {
break;
}
//Wait a second before rechecking job status
Thread.sleep(1000);
}
}
private static Bigquery createAuthorizedClient() throws GeneralSecurityException, IOException {
GoogleCredential credential = new GoogleCredential.Builder()
.setTransport(TRANSPORT)
.setJsonFactory(JSON_FACTORY)
.setServiceAccountScopes(SCOPES)
.setServiceAccountId(SERVICE_ACCOUNT_ID)
.setServiceAccountPrivateKeyFromP12File(PRIVATE_KEY_FILE)
.build();
return Bigquery.builder(TRANSPORT, JSON_FACTORY)
.setApplicationName("My Reports")
.setHttpRequestInitializer(credential)
.build();
}
}
Here is the output:
Job status (337ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: PENDING
...
Job status (9186ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: PENDING
Job status (10798ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: RUNNING
...
Job status (53952ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: RUNNING
Job status (55531ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: DONE
It is a small table (about 4 MB), so the job taking about a minute seems OK. I have no idea why no file is created in the bucket, or how to go about debugging this. Any help would be appreciated.
As Craig pointed out, I printed the status.errorResult() and status.errors() values:
getErrorResults(): {"message":"Backend error. Job aborted.","reason":"internalError"}
getErrors(): null

It looks like there was an access denied error writing to the path: gs://pixalate_test/from_java.csv. Can you make sure that the user that was performing the export job has write access to the bucket (and that the file doesn't already exist)?
I've filed an internal bigquery bug on this issue ... we should give a better error in this situation.

I believe the problem is with the bucket name you're using -- mybucket above is just an example, you need to replace that with a bucket you actually own in Google Storage. If you've never used GS before, the intro docs will help.
Your second question was how to debug this -- I'd recommend looking at the returned Job object once the status is set to DONE. Jobs that end in an error still make it to DONE state, the difference is that they have an error result attached, so job.getStatus().hasErrorResult() should be true. (I've never used the Java client libraries, so I'm guessing at that method name.) You can find more information in the jobs docs.
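For reference, here is a minimal sketch of that check using the same generated Java client as in the question (the model class ErrorProto comes from com.google.api.services.bigquery.model; treat the exact names as assumptions and verify them against your client version):
// After the polling loop reports DONE, inspect the final job status.
Job doneJob = bigquery.jobs().get(PROJECT_ID, jobId.getJobId()).execute();
if (doneJob.getStatus().getErrorResult() != null) {
    System.out.println("Job failed: " + doneJob.getStatus().getErrorResult().getMessage());
    if (doneJob.getStatus().getErrors() != null) {
        for (ErrorProto error : doneJob.getStatus().getErrors()) {
            System.out.println("  " + error.getReason() + ": " + error.getMessage());
        }
    }
}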

One more difference I notice is that you are not passing the job type with config.setJobType(JOB_TYPE);
where the constant is private static final String JOB_TYPE = "extract";
Also, for JSON output you need to set the destination format as well.
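A minimal sketch of how those two settings could be added to the extract configuration in the question (setDestinationFormat is only needed when exporting something other than CSV; the setter names are taken from the generated Java client, so double-check them against your version):
// Sketch: extend the JobConfiguration / JobConfigurationExtract built above.
config.setJobType("extract"); // job type suggested in this answer
extractConfig.setDestinationFormat("NEWLINE_DELIMITED_JSON"); // only when exporting JSON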

I had the same problem, but it turned out that I had typed the name of the table wrong. However, Google did not generate an error message saying "the table does not exist", which would have helped me locate my problem.
Thanks!
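If you want to fail fast on a mistyped table name, one option is to look the table up before inserting the extract job. A minimal sketch using the same generated client as in the question (tables().get() returns a 404 for a missing table, surfaced as a GoogleJsonResponseException):
// Sketch: verify the source table exists before starting the export.
try {
    bigquery.tables().get(PROJECT_ID, DATASET_ID, TABLE_TO_EXPORT).execute();
} catch (com.google.api.client.googleapis.json.GoogleJsonResponseException e) {
    if (e.getStatusCode() == 404) {
        throw new IllegalArgumentException(
                "Table not found: " + DATASET_ID + "." + TABLE_TO_EXPORT, e);
    }
    throw e;
}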

Related

Netty client server login, how to have channelRead return a boolean

I'm writing client server applications on top of netty.
I'm starting with a simple client login server that validates info sent from the client with the database. This all works fine.
On the client side, I want to use if statements once the response is received from the server, to check whether the login credentials validate or not, which also works fine. My problem is that the channelRead method does not return anything, and I cannot change this. I need it to return a boolean which determines whether the login attempt succeeds or fails.
Once the channelRead() returns, I lose the content of the data.
I tried adding the msg to a List but, for some reason, the message data is not stored in the List.
Any suggestions are welcome. I'm new to this; it is the only way I've figured out to do it. I have also tried using boolean variables inside channelRead(), but these methods are void, so once it returns the boolean variables are cleared.
Following is the last attempt I tried to insert the message data into the list I created...
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;
public class LoginClientHandler extends ChannelInboundHandlerAdapter {
Player player = new Player();
String response;
public volatile boolean loginSuccess;
// Object message = new Object();
private Object msg;
// Use a concrete List implementation so the field actually compiles.
public static final List<Object> incomingMessage = new ArrayList<>();
@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
// incomingMessage.clear();
response = (String) msg;
System.out.println("channel read response = " + response);
incomingMessage.add(0, msg);
System.out.println("incoming message = " + incomingMessage.get(0));
}
}
How can I get the message data "out" of the channelRead() method, or use this method to drive my business logic? I want it either to display a message telling the client the login failed and to try again, or to succeed and load the next scene. I have the business logic working fine, but I can't get it to work with Netty because none of the methods return anything I can use in my business logic.
ChannelInitializer
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.DelimiterBasedFrameDecoder;
import io.netty.handler.codec.Delimiters;
import io.netty.handler.codec.string.StringDecoder;
import io.netty.handler.codec.string.StringEncoder;
public class LoginClientInitializer extends ChannelInitializer <SocketChannel> {
@Override
protected void initChannel(SocketChannel ch) throws Exception {
ChannelPipeline pipeline = ch.pipeline();
pipeline.addLast("framer", new DelimiterBasedFrameDecoder(8192, Delimiters.lineDelimiter()));
pipeline.addLast("decoder", new StringDecoder());
pipeline.addLast("encoder", new StringEncoder());
pipeline.addLast("handler", new LoginClientHandler());
}
}
To get the server to write data to the client, call ctx.write. Here is a basic echo server and client example from the Netty in Action book: https://github.com/normanmaurer/netty-in-action/blob/2.0-SNAPSHOT/chapter2/Server/src/main/java/nia/chapter2/echoserver/EchoServerHandler.java
There are several other good examples in that repo.
I highly recommend reading the "netty in action" book if you're starting out with netty. It will give you a solid foundational understanding of the framework and how it's intended to be used.
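As a concrete illustration of the ctx.write pattern, here is a minimal echo-style handler in the spirit of the linked example (a sketch, not a verbatim copy of the book's code):
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Queue each inbound message for writing back to the peer, then flush once
// the current read burst is complete.
public class EchoServerHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        ctx.write(msg); // queued, not yet flushed to the socket
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) {
        ctx.flush(); // flush all previously queued writes
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        cause.printStackTrace();
        ctx.close();
    }
}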

How to catch any exceptions thrown by BigQueryIO.Write and rescue the data which is failed to output?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each record contains the ID of the table in which the record itself should be saved.
There are various reasons why writing to BigQuery can fail:
The table ID format is wrong.
The dataset does not exist.
The dataset does not allow the pipeline access.
Network failure.
When one of these failures occurs, the streaming job retries the task and stalls. I tried using WriteResult.getFailedInserts() to rescue the bad data and avoid stalling, but it did not work well. Is there a good way to do this?
Here is my code:
public class StarterPipeline {
private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);
public class MyData implements Serializable {
String table_id;
}
public interface MyOptions extends PipelineOptions {
#Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
#Validation.Required
ValueProvider<String> getInputTopic();
void setInputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<MyData> input = p
.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
.via((String text) -> new Gson().fromJson(text, MyData.class)));
WriteResult writeResult = input
.apply("WriteToBigQuery", BigQueryIO.<MyData>write()
.to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
@Override
public TableDestination apply(ValueInSingleWindow<MyData> input) {
MyData myData = input.getValue();
return new TableDestination(myData.table_id, null);
}
})
.withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
add(new TableFieldSchema().setName("table_id").setType("STRING"));
}}))
.withFormatFunction(new SerializableFunction<MyData, TableRow>() {
@Override
public TableRow apply(MyData myData) {
return new TableRow().set("table_id", myData.table_id);
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));
writeResult.getFailedInserts()
.apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
LOG.info(row.get("table_id").toString());
}
}));
p.run();
}
}
There is no easy way to catch exceptions when writing to output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery. However, there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example, you have the failed insert retry policy set to never retry. You can set the policy to always retry. This is only effective during something like an intermittent network failure (4th bullet point).
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition configuration should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue to the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.

BigQuery in Dataflow fails to load data from Cloud Storage: JSON object specified for non-record field

I have a Dataflow pipeline running locally on my machine and writing to BigQuery. In this batch job, BigQuery requires a temporary location, and I have provided one in my Cloud Storage bucket. The relevant parts are:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://folder/temp");
Pipeline p = Pipeline.create(options);
....
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("uuid").setType("STRING"));
fields.add(new TableFieldSchema().setName("start_time").setType("TIMESTAMP"));
fields.add(new TableFieldSchema().setName("end_time").setType("TIMESTAMP"));
TableSchema schema = new TableSchema().setFields(fields);
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to("myproject:db.table"));
Where for FormatAsTableRowFn I have:
static class FormatAsTableRowFn extends DoFn<KV<String, String>, TableRow>
implements RequiresWindowAccess{
@Override
public void processElement(ProcessContext c) {
TableRow row = new TableRow()
.set("uuid", c.element().getKey())
// include a field for the window timestamp
.set("start_time", ((IntervalWindow) c.window()).start().toInstant()) //NOTE: I tried both with and without
.set("end_time", ((IntervalWindow) c.window()).end().toInstant()); // .toInstant receiving the same error
c.output(row);
}
}
If I print out row.toString() I will get legit timestamps:
{uuid=00:00:00:00:00:00, start_time=2016-09-22T07:34:38.000Z, end_time=2016-09-22T07:39:38.000Z}
When I run this code, Java says: Failed to create the load job beam_job_XXX
Manually inspecting the temp folder in GCS, the objects look like:
{"mac":"00:00:00:00:00:00","start_time":{"millis":1474529678000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false},"end_time":{"millis":1474529978000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false}}
Looking at the failed job report in BigQuery, the Error says:
JSON object specified for non-record field: start_time (error code: invalid)
This is very strange, because I am pretty sure I said this is a TIMESTAMP, and I am 100% sure my schema in BigQuery conforms to the TableSchema in the SDK. (NOTE: setting withCreateDisposition to CREATE_IF_NEEDED yields the same result.)
Could someone please tell me how I need to remedy this to get the data inside BigQuery?
Don't use Instant objects. Try using milliseconds/seconds.
https://cloud.google.com/bigquery/data-types
A positive number specifies the number of seconds since the epoch
So, something like this should work:
.getMillis() / 1000
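Applied to the DoFn from the question, the fix might look roughly like this (a sketch: epoch seconds are emitted instead of Joda Instant objects, with the field names taken from the question):
static class FormatAsTableRowFn extends DoFn<KV<String, String>, TableRow>
        implements RequiresWindowAccess {
    @Override
    public void processElement(ProcessContext c) {
        TableRow row = new TableRow()
                .set("uuid", c.element().getKey())
                // Epoch seconds, which BigQuery accepts for TIMESTAMP columns.
                .set("start_time", ((IntervalWindow) c.window()).start().getMillis() / 1000)
                .set("end_time", ((IntervalWindow) c.window()).end().getMillis() / 1000);
        c.output(row);
    }
}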

Multipart Upload Amazon S3

I'm trying to upload a file to Amazon S3 using their APIs. I tried using their sample code, and it creates the various parts of the file. Now, the problem is, how do I pause the upload and then resume it? See the following code, as given in their documentation:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
public class UploadObjectMPULowLevelAPI {
public static void main(String[] args) throws IOException {
String existingBucketName = "*** Provide-Your-Existing-BucketName ***";
String keyName = "*** Provide-Key-Name ***";
String filePath = "*** Provide-File-Path ***";
AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
// Create a list of UploadPartResponse objects. You get one of these
// for each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();
// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest = new
InitiateMultipartUploadRequest(existingBucketName, keyName);
InitiateMultipartUploadResult initResponse =
s3Client.initiateMultipartUpload(initRequest);
File file = new File(filePath);
long contentLength = file.length();
long partSize = 5242880; // Set part size to 5 MB.
try {
// Step 2: Upload parts.
long filePosition = 0;
for (int i = 1; filePosition < contentLength; i++) {
// Last part can be less than 5 MB. Adjust part size.
partSize = Math.min(partSize, (contentLength - filePosition));
// Create request to upload a part.
UploadPartRequest uploadRequest = new UploadPartRequest()
.withBucketName(existingBucketName).withKey(keyName)
.withUploadId(initResponse.getUploadId()).withPartNumber(i)
.withFileOffset(filePosition)
.withFile(file)
.withPartSize(partSize);
// Upload part and add response to our list.
partETags.add(
s3Client.uploadPart(uploadRequest).getPartETag());
filePosition += partSize;
}
// Step 3: Complete.
CompleteMultipartUploadRequest compRequest = new
CompleteMultipartUploadRequest(
existingBucketName,
keyName,
initResponse.getUploadId(),
partETags);
s3Client.completeMultipartUpload(compRequest);
}
catch (Exception e)
{
s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
existingBucketName, keyName, initResponse.getUploadId()));
}
}
}
I have also tried the TransferManager example, which takes an Upload object and calls a tryPause(forceCancel) method. But the problem here is that it gets cancelled every time I try to pause it.
My question is, how do I use the above code with pause and resume functionality? Also, note that I would also like to upload multiple files with the same functionality. Any help would be much appreciated.
I think you should use the Transfer Manager sample if you can. If it's being cancelled, it's likely that it just isn't possible to pause it (with the given configuration of the TransferManager you are using).
This might be because you paused it too early for "pausing" to mean anything besides cancelling, you are trying to use encryption, or the file isn't big enough. I believe the default minimum file size is 16 MB. However, you can change the configuration of the TransferManager to allow you to pause, depending on why tryPause is failing, except in the case of encryption, where I don't think there's anything you can do.
If you want to enable pause/resume for a file smaller than that size, you can call the setMultipartUploadThreshold(long) method in TransferManagerConfiguration. If you want to be able to pause earlier, you can use setMinimumUploadPartSize to set it to use smaller chunks.
In any case, I would advise you to use the TransferManager if possible, since it's made to do this kind of thing for you. It might be helpful to see why the transfer is not being paused when you use tryPause.
TransferManager performs the upload and download asynchronously and doesn't block the current thread. When you call resumeUpload, TransferManager returns immediately with a reference to the Upload. You can use this reference to enquire about the status of the upload.
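For reference, a minimal sketch of the TransferManager-based pause/resume flow described above, using the AWS SDK for Java v1 (the bucket, key, and file path are placeholders; verify the class and method names such as PauseResult and resumeUpload against your SDK version):
import java.io.File;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.transfer.PauseResult;
import com.amazonaws.services.s3.transfer.PersistableUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;
import com.amazonaws.services.s3.transfer.Upload;

public class PausableUploadSketch {
    public static void main(String[] args) throws Exception {
        TransferManager tm = new TransferManager(new ProfileCredentialsProvider());

        // Lower the multipart threshold and part size so smaller files can be paused.
        TransferManagerConfiguration config = new TransferManagerConfiguration();
        config.setMultipartUploadThreshold(5 * 1024 * 1024);
        config.setMinimumUploadPartSize(5 * 1024 * 1024);
        tm.setConfiguration(config);

        Upload upload = tm.upload("your-bucket", "your-key", new File("/path/to/file"));

        // Try to pause without force-cancelling; the pause status tells you
        // whether the pause actually succeeded.
        PauseResult<PersistableUpload> pauseResult = upload.tryPause(false);
        System.out.println("Pause status: " + pauseResult.getPauseStatus());

        // Later (assuming the pause succeeded), resume from the persisted state.
        PersistableUpload persisted = pauseResult.getInfoToResume();
        Upload resumed = tm.resumeUpload(persisted);
        resumed.waitForCompletion();

        tm.shutdownNow();
    }
}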

Simple currency observer

I am trying to use cryptsy.com's API to get the current price of DOGE. This is my code:
package main;
import java.text.DecimalFormat;
import java.util.Date;
import java.util.concurrent.TimeUnit;
import main.Cryptsy.CryptsyException;
import main.Cryptsy.PublicMarket;
public class Main {
public static void main (String [] args) throws CryptsyException, InterruptedException{
Cryptsy cryptsy = new Cryptsy();
while(true){
PublicMarket[] markets = cryptsy.getPublicMarketData();
for(PublicMarket market : markets) {
DecimalFormat df = new DecimalFormat("#.########");
if(market.label.equals("DOGE/BTC"))
System.out.println(new Date() + " " + market.label + " " + df.format(market.lasttradeprice));
}
TimeUnit.SECONDS.sleep(30);
}
}
}
The problem is that the price gets updated too rarely (every 30 minutes or so), and only if I restart my program. Does anyone know how to get the current price? Also, there are connection errors sometimes.
Actually, the connection problems are normal with the Cryptsy API. It's slow and often disconnects without an answer. They are overloaded pretty much all the time.
There is a new API location that should be faster and solve the connection issues, here:
http://pubapi.cryptsy.com/api.php?method=marketdatav2
And also, if you are only interested in one single currency, you can get the market data for only that currency. The whole answer from Cryptsy for all currencies is about 300 KB, so you would waste bandwidth if you polled that every minute or so.
For only one currency it will be like:
http://pubapi.cryptsy.com/api.php?method=singlemarketdata&marketid={MARKET ID}
The market ID can be found inside the answer from the first URL, but you only need the integer ID of the market once; from then on you can always use the direct call.
Every detail is, by the way, available here:
https://www.cryptsy.com/pages/api
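To illustrate the single-market call, here is a minimal polling sketch (the market id is a placeholder to be looked up once in the marketdatav2 answer as described above; parse the returned JSON with whatever library you already use, e.g. Gson):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

public class SingleMarketPoller {
    // Placeholder: replace with the real integer market id for DOGE/BTC.
    private static final String MARKET_ID = "REPLACE_WITH_MARKET_ID";

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://pubapi.cryptsy.com/api.php?method=singlemarketdata&marketid=" + MARKET_ID);
        while (true) {
            StringBuilder json = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    json.append(line);
                }
                // Parse the JSON here and read lasttradeprice for the market.
                System.out.println(json);
            } catch (Exception e) {
                // Cryptsy frequently drops connections; log and retry on the next cycle.
                System.err.println("Request failed: " + e.getMessage());
            }
            TimeUnit.SECONDS.sleep(30);
        }
    }
}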