We know that when a Hive query runs, it generates MapReduce jobs. Can someone please explain in detail what actually happens at the back end?
Thanks in advance.
I can't seem to find any documentation about this. I have an apache-beam pipeline that takes some information, formats it into TableRows and then writes to BigQuery.
The problem:
The rows are not written to BigQuery until the Dataflow job finishes. If I have a Dataflow job that takes a long time, I'd like to be able to see the rows being inserted into BigQuery as it runs. Can anybody point me in the right direction?
Thanks in advance
Since you're working in batch mode, the data needs to be written to the table all at once; if you're working with partitions, all data belonging to a partition needs to be written at the same time. That's why the insertion is done at the end of the job.
Please note that the WriteDisposition is very important when you work in batches, because you either append data or truncate the table. But does this distinction make sense for a streaming pipeline?
In Java, you can specify the insertion method with the following call:
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
I've not tested it, but I believe it should work as expected. Also note that streaming inserts to BigQuery are not free of charge.
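For completeness, here is roughly how that call sits in a full write step. This is an untested sketch; the project, dataset, table, and schema below are placeholders, not taken from the original pipeline.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class StreamingWriteSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder schema with a single STRING column.
        TableSchema schema = new TableSchema().setFields(Collections.singletonList(
            new TableFieldSchema().setName("message").setType("STRING")));

        p.apply(Create.of(new TableRow().set("message", "hello"))
                .withCoder(TableRowJsonCoder.of()))
         .apply("WriteToBQ",
             BigQueryIO.writeTableRows()
                 .to("my-project:my_dataset.my_table")   // placeholder table
                 .withSchema(schema)
                 .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                 .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                 // Rows become visible as they are inserted instead of at job completion.
                 .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

        p.run();
    }
}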
Depending on how complex your initial transform-and-load operation is, you could just use the BigQuery client directly to do streaming inserts into the table from your own worker pool, rather than loading it via a Dataflow job explicitly.
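If you go down that road, a rough sketch with the google-cloud-bigquery Java client could look like the following; the dataset, table, and column names are made up.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class StreamingInsertSketch {
    public static void main(String[] args) {
        // Uses application-default credentials; dataset and table are placeholders.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");

        // One row per request here for clarity; in practice you would batch rows.
        Map<String, Object> row = new HashMap<>();
        row.put("user_id", "abc-123");
        row.put("event_ts", "2018-01-01T00:00:00Z");

        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(tableId).addRow(row).build());

        if (response.hasErrors()) {
            response.getInsertErrors().forEach((index, errors) ->
                System.err.println("Row " + index + " failed: " + errors));
        }
    }
}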
Alternatively, you could do smaller batches:
N independent jobs, each loading TIME_PERIOD/N of the data
I have a use case
We have a Java framework that parses real-time data from Kinesis into a Hive table every half hour.
I need to access this Hive table and do some processing in near real time. An hour's delay is fine, as I don't have permission to access the Kinesis stream directly.
Once the processing is done in Spark (preferably PySpark), I have to push the data to a new Kinesis stream.
I will then use Splunk to pull it in near real time.
The question is: has anyone done Spark streaming from Hive using Python? I have to do a POC first and then the actual work.
Any help will be highly appreciated.
Thanks in advance!!
There are two ways to go ahead with this:
Use Spark Streaming to obtain messages directly from Kinesis. That will give you something that is truly real time.
Once a file drops into your staging area (either your Hive warehouse or some HDFS location), you can pick it up for processing with Spark's streaming file source; see the sketch below.
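For option 2, here is a rough sketch using Spark Structured Streaming's file source. It is shown in Java to match the rest of the thread; the equivalent readStream/writeStream calls exist in PySpark, and the path and schema are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.StructType;

public class HiveStagingStream {
    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder()
            .appName("hive-staging-stream")
            .enableHiveSupport()
            .getOrCreate();

        // Schema of the files the ingestion framework drops; placeholder columns.
        StructType schema = new StructType()
            .add("event_id", "string")
            .add("payload", "string");

        // Watch the staging directory; new files are picked up as micro-batches.
        Dataset<Row> incoming = spark.readStream()
            .schema(schema)
            .format("parquet")   // or "csv"/"orc", depending on what the framework writes
            .load("hdfs:///warehouse/staging/my_table/");

        // Do the near-real-time processing here, then write out.
        // The console sink is just for the POC; a real job would push to Kinesis.
        StreamingQuery query = incoming.writeStream()
            .format("console")
            .start();

        query.awaitTermination();
    }
}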
Do let us know which approach worked best for you.
I'm testing my setup, and I need to move data from HDFS to a SQL database as soon as it is generated. What I mean is: once the MapReduce job is completed, it sends an ActiveMQ message. When I receive that ActiveMQ message, I need to move the data to SQL automatically using Sqoop. Can someone help me figure out how to achieve this?
Can someone let me know whether ActiveMQ and Sqoop work together?
Thank you.
I am not entirely clear about the use case, but you can set up an Oozie workflow. The Sqoop action will only start once the MapReduce job is complete, and you can actually create a complex DAG using Oozie. The Oozie workflow can in turn be invoked from a remote Java client, as sketched below.
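Here is a rough sketch of that wiring: a JMS listener that kicks off an Oozie workflow wrapping the Sqoop action. The broker URL, queue name, and workflow path are made up, and the Sqoop action itself lives inside the Oozie workflow definition.

import java.util.Properties;
import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.oozie.client.OozieClient;

public class SqoopOnMessage {
    public static void main(String[] args) throws Exception {
        // Listen on the queue the MapReduce job posts its completion message to.
        ActiveMQConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://mq-host:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
            session.createConsumer(session.createQueue("mapreduce.job.complete"));

        consumer.setMessageListener(message -> {
            try {
                // Start the Oozie workflow that contains the Sqoop export action.
                OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode/user/etl/sqoop-export-wf");
                String jobId = oozie.run(conf);
                System.out.println("Started Oozie workflow " + jobId);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        // Keep the listener alive.
        Thread.currentThread().join();
    }
}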
Hope this helped.
For the BQ team: queries that usually work are sometimes failing.
Could you please look into what the issue could be? The only output is:
Error: Unexpected. Please try again.
Job ID: aerobic-forge-504:job_DTFuQEeVGwZbt-PRFbMVE6TCz0U
Sorry for the slow response.
BigQuery replicates data to multiple datacenters and may run queries from any of them. Usually this is transparent ... if your data hasn't replicated everywhere, we will try to find a datacenter that has all of the data necessary to run a query and execute the query from there.
The error you hit was due to BigQuery not being able to find a single datacenter that had all of the data for one of your tables. We try very hard to make sure this doesn't happen, and in principle it should be very rare (we've got a solution designed to make sure it never happens, but haven't finished the implementation yet). We saw an uptick in this issue this morning; we have a bug filed and are currently investigating.
Was this a transient error? If you retry the operation, does it work now? Are you still getting errors on other queries?
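If it helps, a simple retry loop around the query is usually enough for transient errors like this. The sketch below assumes the google-cloud-bigquery Java client library; the query itself is a placeholder.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class RetryQuerySketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration query =
            QueryJobConfiguration.newBuilder("SELECT 1").build();  // placeholder query

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                TableResult result = bigquery.query(query);
                result.iterateAll().forEach(row -> System.out.println(row));
                return;  // success
            } catch (Exception e) {
                // Transient "Unexpected" errors can simply be retried with a short backoff.
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(1000L * attempt);
            }
        }
    }
}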
I am currently analyzing a database and happened to find two datasets whose origin is unknown to me. The problem is that they shouldn't even be in there...
I checked all insertion scripts and I'm sure that these datasets are not inserted by any of them. They are also not in the original dump file.
The only explanation I can come up with at the moment is that they are inserted via some procedure call in the scripts.
Is there any way for me to track down the origin of a single dataset?
best regards,
daZza
Yes, you can.
Assuming your database is running in archivelog mode, you can use LogMiner to find which transactions did the inserts; see "Using LogMiner to Analyze Redo Log Files" in the Oracle documentation. It will take some serious time, but it might be worth it.
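A rough sketch of driving LogMiner from JDBC is below. The connection details, log file path, and table name are placeholders; you need the appropriate LogMiner privileges, and in practice you would add every archived log covering the suspect time window.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FindInsertOrigin {
    public static void main(String[] args) throws Exception {
        // Connect as a user with the privileges required for DBMS_LOGMNR.
        Connection conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@//db-host:1521/ORCL", "miner_user", "secret");
        try (Statement stmt = conn.createStatement()) {
            // Register the archived log(s) covering the suspect time window.
            stmt.execute("BEGIN DBMS_LOGMNR.ADD_LOGFILE("
                + "LOGFILENAME => '/u01/arch/arch_0001.arc', "
                + "OPTIONS => DBMS_LOGMNR.NEW); END;");

            // Start LogMiner using the online catalog as the dictionary.
            stmt.execute("BEGIN DBMS_LOGMNR.START_LOGMNR("
                + "OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG); END;");

            // Look for the inserts into the suspect table.
            ResultSet rs = stmt.executeQuery(
                "SELECT scn, timestamp, username, sql_redo "
                + "FROM v$logmnr_contents "
                + "WHERE operation = 'INSERT' AND seg_name = 'MY_TABLE'");
            while (rs.next()) {
                System.out.println(rs.getLong("scn") + " " + rs.getString("username")
                    + " " + rs.getString("sql_redo"));
            }

            stmt.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;");
        } finally {
            conn.close();
        }
    }
}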