BigQuery - Uploading GZIP compressed files using Java Client library - google-bigquery

I am trying to upload gzip-compressed files using Google's BigQuery Java client API. I can upload uncompressed files without any issue, but gzip files fail with the error "Invalid content type 'application/x-gzip'. Uploads must have content type 'application/octet-stream'".
Below is my code.
val pid = "****"
val dsid = "****"
val tid = "****"
val br = Source.fromFile(new File("****")).bufferedReader()
val mapper = new ObjectMapper()
val schemaFields = mapper.readValue(br, classOf[util.ArrayList[TableFieldSchema]])
val tschema = new TableSchema().setFields(schemaFields)
val tr = new TableReference().setProjectId(pid).setDatasetId(dsid).setTableId(tid)
val jc = new JobConfigurationLoad().setDestinationTable(tr)
  .setSchema(tschema)
  .setSourceFormat("NEWLINE_DELIMITED_JSON")
  .setCreateDisposition("CREATE_IF_NEEDED")
  .setWriteDisposition("WRITE_APPEND")
  .setIgnoreUnknownValues(true)
val fmr = new SimpleDateFormat("dd-MM-yyyy_HH-mm-ss-SSS")
val now = fmr.format(new Date())
val loadJob = new Job()
  .setJobReference(new JobReference()
    .setJobId(Joiner.on("-").join("INSERT", pid, dsid, tid, now))
    .setProjectId(pid))
  .setConfiguration(new JobConfiguration().setLoad(jc))
// val data = new FileContent(MediaType.OCTET_STREAM.toString, new File("/Users/jegan/sessions/34560-6")) // This works.
val data = new FileContent(MediaType.GZIP.toString, new File("/Users/jegan/sessions/34560-6"))
val bq = BQHelper.createAuthorizedClientWithDefaultCredentials()
val job = bq.jobs().insert(pid, loadJob, data).execute()
And from this link, I see that we need to use a resumable upload to achieve this:
https://cloud.google.com/bigquery/loading-data-post-request#resumable
But the issue is that I am using the Java client library from Google. How do I do a resumable upload using this library? There doesn't seem to be much information on this, or I am missing something. Has anyone done this? Please point me to some documentation or samples. Thanks.

If application/octet-stream works, just use that. We don't use the media type for anything important.
That said, I thought I changed it so that we'd accept any media type. Are you using the most recent version of the Java client library?
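For anyone hitting the same error on an older client, here is a minimal sketch of both workarounds, reusing the pid, loadJob, and bq values from the question. The octet-stream content type follows the answer above; the MediaHttpUploader part is the generic google-api-java-client mechanism for resumable media uploads, so treat it as an untested assumption for BigQuery load jobs rather than a confirmed recipe.

import com.google.api.client.http.FileContent
import com.google.api.client.googleapis.media.MediaHttpUploader

// Workaround 1: label the gzipped file as octet-stream; per the answer above,
// the media type is not used for anything important on the BigQuery side.
val data = new FileContent("application/octet-stream", new File("/Users/jegan/sessions/34560-6"))

// Workaround 2 (assumed, not verified here): make sure the resumable upload
// protocol is used for the media upload (it may already be the default).
val insert = bq.jobs().insert(pid, loadJob, data)
val uploader = insert.getMediaHttpUploader
uploader.setDirectUploadEnabled(false) // false selects the resumable protocol
uploader.setChunkSize(8 * MediaHttpUploader.MINIMUM_CHUNK_SIZE)
val job = insert.execute()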

Related

AWS s3 in Rust: Get and store a file - Invalid file header when opening

What I want to do: Download an S3 file (pdf) in a lambda and extract its text, using Rust.
The Error:
ERROR PDF error: Invalid file header
I checked the pdf file in the bucket, downloaded it from the console and everything looks correct, so something is breaking in the way I store the file.
How I am doing it:
let config = aws_config::load_from_env().await;
let client = s3::Client::new(&config);
// Get uploaded object in raw bucket (serde derived the json)
let key = event.records.get(0).unwrap().s3.object.key.clone();
let key = key.replace('+', " ");
let key = percent_encoding::percent_decode_str(&key).decode_utf8().unwrap().to_string();
let content = client
.get_object()
.bucket(raw_bucket_name)
.key(&key)
// .response_content_type("application/pdf") // this did not make any difference
.send()
.await?;
let mut bytes = content.body.into_async_read();
let file = tempfile::NamedTempFile::new()?;
let path = file.into_temp_path();
let mut file = tokio::fs::File::create(&path).await?;
tokio::io::copy(&mut bytes, &mut file).await?;
let content = pdf_extract::extract_text(path)?; // this line breaks
Versions:
tokio = { version = "1", features = ["macros"] }
aws-sdk-s3 = "0.21.0"
aws-config = "0.51.0"
pdf-extract = "0.6.4"
I feel like I misunderstood something about how to store the byte stream, but e.g. https://stackoverflow.com/a/62003659/4986655 does it the same way, as far as I can see.
Any help or pointers on what the issue might be or how to debug this are very welcome.

Scalding Unit Test - How to Write A Local File?

I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to verify that the binary file is the same prior to conversion and after conversion to the special API.
Given that I cannot share the specific API for the metadata-inclusive writes, I only ask: how can I write a unit test for the .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)
val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
So I wonder if I need to use a MiniDFSCluster something like this:
def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either; somehow I need to link the MiniCluster with the Scalding write.
Note: the Scalding JobTest framework for unit testing isn't going to work, because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers before the writes made by the metadata-inclusive write APIs.
Any ideas how I can write a local file (without using the Scalding REPL), with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file back.)
Answering my own question: there is an example of how to use a mini cluster for exactly this kind of reading and writing to HDFS. I will be able to cross-read my different writes and examine them. It is in the tests for Scalding's TypedParquet type.
HadoopPlatformJobTest is an extension of JobTest that uses a MiniCluster.
With some hand-waving over the details in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}
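As an aside, on the original "write a TypedPipe to local disk without a cluster" attempt: my guess is that the FlowDef in that snippet is never actually run, so nothing reaches disk. A minimal sketch using Scalding's Execution API instead is below; it is untested against your Scalding version, and the output path is invented for illustration, so treat the exact method names as a pointer rather than a recipe.

import com.twitter.scalding._
import scala.util.Try

// Hedged sketch: writeExecution returns an Execution that can be run directly
// in Local mode, without a JobTest, a FlowDef you manage yourself, or a MiniCluster.
val fileStrPath = "/tmp/scalding-test/typed-tsv" // hypothetical output path
val result: Try[Unit] = TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  .writeExecution(TypedTsv[Long](fileStrPath))
  .waitFor(Config.default, Local(strictSources = true))
println(s"local write finished: $result") // Success(()) on a good run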

How to apply a password to a zip file or a CSV in Node.js or JavaScript

var csvString = ['rest','test','age'];
var fileName_CSV = "Report_1.csv";
var fileName_ZIP = "Report_1.zip";
var blob = new Blob(dd, {type: "application/zip"});
var zip = new JSZip();
zip.file(fileName_CSV, csvString);
var content = zip.generate({type:"blob"});
saveAs(content,fileName_ZIP);
I have JSON data which I converted to fit the CSV format, so I created the CSV file with the data, saved it in memory, and then zipped the CSV file. Now I want to apply a password to it, so that when someone opens the zip and tries to open the CSV, it asks for the user-defined password. I want to use either JavaScript or Node.js for this. Please help.
The minizip-asm.js package supports creating zip archives with passwords.
https://www.npmjs.com/package/minizip-asm.js
From the docs:
npm install minizip-asm.js
Example Usage:
var Minizip = require('minizip-asm.js');
var fs = require("fs");
var csvString = Buffer.from("Abc~~~");
var mz = new Minizip();
mz.append("Report_1.csv", csvString, {password: "insert-password"});
fs.writeFileSync("Report_1.zip", Buffer.from(mz.zip()));

Programmatically creating DStreams in Apache Spark

I am writing some self contained integration tests around Apache Spark Streaming.
I want to test that my code can ingest all kinds of edge cases in my simulated test data.
When I was doing this with regular RDDs (not streaming), I could use my inline data and call "parallelize" on it to turn it into a Spark RDD.
However, I can find no such method for creating DStreams. Ideally I would like to call some "push" function once in a while and have the tuple magically appear in my DStream.
At the moment I'm doing this with Apache Kafka: I create a temp queue and write to it. But this seems like overkill. I'd much rather create the test DStream directly from my test data without having to use Kafka as a mediator.
For testing purposes, you can create an input stream from a queue of RDDs.
Pushing more RDDs into the queue simulates more events arriving in the batch interval.
val sc = SparkContextHolder.sc
val ssc = new StreamingContext(sc, Seconds(1))
val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
val inputStream: InputDStream[Int] = ssc.queueStream(inputData)
inputData += sc.makeRDD(List(1, 2)) // Emulate the RDD created during the first batch interval
inputData += sc.makeRDD(List(3, 4)) // 2nd batch interval
// etc
val result = inputStream.map(x => x*x)
result.foreachRDD(rdd => assertSomething(rdd))
ssc.start() // Don't forget to start the streaming context
In addition to Raphael's solution: you can also choose between processing one batch at a time or everything available in the queue at once. You need to set the oneAtATime flag accordingly on queueStream's optional method argument, as shown below:
val slideDuration = Milliseconds(100)
val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[8]")
val sparkSession: SparkSession = SparkSession.builder.config(conf).getOrCreate()
val sparkContext: SparkContext = sparkSession.sparkContext
val queueOfRDDs = mutable.Queue[RDD[String]]()
val streamingContext: StreamingContext = new StreamingContext(sparkContext, slideDuration)
val rddOneQueuesAtATimeDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = true)
val rddFloodOfQueuesDS: DStream[String] = streamingContext.queueStream(queueOfRDDs, oneAtATime = false)
rddOneQueuesAtATimeDS.print(120)
rddFloodOfQueuesDS.print(120)
streamingContext.start()
for (i <- (1 to 10)) {
  queueOfRDDs += sparkContext.makeRDD(simplePurchase(i))
  queueOfRDDs += sparkContext.makeRDD(simplePurchase((i + 3) * (i + 3)))
  Thread.sleep(slideDuration.milliseconds)
}
Thread.sleep(1000L)
I found this base example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala
The key here is calling the "store" method. Replace what you pass to store with whatever data you want to appear in the stream.
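To make that concrete, here is a stripped-down sketch of such a receiver. It is my own simplified variant of the linked CustomReceiver (the class name and the fixed test sequence are invented for illustration), not code copied from the example:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Feeds a fixed sequence of test records into the stream by calling store().
class TestDataReceiver(testData: Seq[String])
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // Do the work on a separate thread so onStart() returns quickly,
    // as the Receiver API documentation recommends.
    new Thread("test-data-feeder") {
      override def run(): Unit = testData.foreach(record => store(record))
    }.start()
  }

  def onStop(): Unit = {} // nothing to clean up
}

// Usage sketch: val lines = ssc.receiverStream(new TestDataReceiver(Seq("a", "b", "c")))

For pure unit tests, though, the queueStream approach above is usually simpler, since a custom receiver is serialized and run by Spark rather than directly by your test code.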

Put an ObjectOutputStream on AWS s3

I use Spark MLlib to create a linear regression model. I then tried to save the model with an ObjectOutputStream, so I could put it on S3 and read it back later. The following is my code:
val algorithm = new LinearRegressionWithSGD()
val model = algorithm.run(trainingData)
val credentials = new BasicAWSCredentials("myKey", "mySecretKey");
val s3Client = new AmazonS3Client(credentials);
val oos = new ObjectOutputStream(new FileOutputStream("myModelFile"));
oos.writeObject(model);
oos.close();
s3Client.putObject("myBucket", "myPath", oos)
Then I got a complaint at this line:
s3Client.putObject("myBucket", "myPath", oos)
What did I miss, and how do I fix it? Thanks a lot!
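No answer is attached to this one, but as a hedged sketch of the likely fix (assuming the AWS SDK for Java v1 AmazonS3Client shown in the question): putObject has no overload that accepts an OutputStream; it takes a File, an InputStream plus ObjectMetadata, or a PutObjectRequest, so the simplest change is to upload the file the model was serialized into.

import java.io.{File, FileOutputStream, ObjectOutputStream}

// Serialize the model to a local file first, exactly as in the question.
val modelFile = new File("myModelFile")
val oos = new ObjectOutputStream(new FileOutputStream(modelFile))
oos.writeObject(model)
oos.close()

// Then upload the file itself; this overload is putObject(bucketName, key, file).
s3Client.putObject("myBucket", "myPath", modelFile)

If your Spark version supports it, MLlib models also expose a save(sc, path) method (with a matching load to read the model back), which sidesteps Java serialization entirely; I mention it only as an alternative, not as what the original code intended.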