I am sending data to PubSub, and from there I am trying to create a DataFlow job to put the data into BigQuery.
The data has a column of unique identifiers on which I want to run HLL_COUNT.INIT.
Is there an equivalent method on the DataFlow side so that I can directly store the HLL version of the column in BigQuery?
No, DataFlow doesn't support the BigQuery HLL sketch format, but it is clearly something that would be useful. I created a feature request for it in the DataFlow issue tracker: https://issuetracker.google.com/62153424.
Update: A BigQuery-compatible implementation of HyperLogLog++ has been open-sourced at github.com/google/zetasketch, and a design doc (docs.google.com/document/d/…) about integrating it into Apache Beam has been sent to dev@beam.apache.org.
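Until that lands, one workaround is to have the pipeline write the raw column to BigQuery and build the sketch there afterwards, e.g. with a scheduled query. A minimal sketch using the Python BigQuery client, where the dataset, table, and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate the raw values loaded by the pipeline into HLL++ sketches.
# `my_dataset.raw_events` and `user_id` are placeholder names.
sql = """
CREATE OR REPLACE TABLE my_dataset.daily_sketches AS
SELECT
  DATE(event_ts) AS day,
  HLL_COUNT.INIT(user_id) AS user_sketch
FROM my_dataset.raw_events
GROUP BY day
"""
client.query(sql).result()  # wait for the query job to finish
```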
I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics and creating a streaming DataFrame, which is then cleaned and printed to the console. I need this data to be integrated with existing Python scripts that do all of the data operations with Pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive Tables).
Use Spark Structured Streaming ForeachBatch Sink.
I want to mention that the data will be read at a certain interval, and there will be a real-time data dashboard built on this data in the future.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save the data to Hive, then every time you access the newly streamed data through your Python scripts you will also need to refresh the newly added Hive partitions:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of using Hive for the real-time scenario you mention:
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
In contrast, using Spark Structured Streaming looks like the better choice for a near-real-time experience:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
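To illustrate the foreachBatch option, here is a minimal PySpark sketch; the broker, topic, and the hand-off to your pandas code are placeholders for your own setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-pandas").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "my_topic")                       # placeholder topic
    .load()
)

def process_batch(batch_df, batch_id):
    # Each micro-batch arrives as a plain (non-streaming) DataFrame, so it
    # can be converted and handed to the existing pandas-based scripts.
    pdf = batch_df.selectExpr("CAST(value AS STRING) AS value").toPandas()
    # ... call your existing pandas logic on `pdf` here ...

query = (
    stream_df.writeStream
    .foreachBatch(process_batch)
    .trigger(processingTime="5 minutes")  # read at a fixed interval
    .start()
)
query.awaitTermination()
```

Because foreachBatch hands you an ordinary DataFrame per trigger, the same function could later feed a real-time dashboard in addition to the pandas scripts.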
I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I'm not able to find out how to do it properly.
So far I have used Node.js on my own server to receive the data via POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to PubSub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (using either streaming inserts, which incur costs, or batch load jobs, which are free).
You can also log to Stackdriver and from there select and send the data to BigQuery (direct options for this already exist in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by calling the BigQuery API directly. Note that there is a limit on the number of batch loads you can run per day against a particular table (1,000 per day).
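For the streaming-insert option, a minimal sketch using the Python client library (the Node.js client follows the same pattern; the table and row fields are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"  # placeholder destination table

# Rows arrive from the API as JSON-like dicts matching the table schema.
rows = [
    {"user_id": "abc", "payload": '{"clicks": 3}', "received_at": "2020-01-01T00:00:00Z"},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Encountered errors while inserting rows:", errors)
```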
There is a page in the official documentation that lists all the options for loading data into BigQuery.
For simplicity, you can just send the data from your local data source. You should use the Google Cloud client libraries for BigQuery. Here is a guide on how to do that, as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and load it into BigQuery from there. That way the whole process will be more stable.
You can check all the options in the first link I posted and choose whichever fits best with your workflow.
Keep in mind the limitations of this process.
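For the Cloud Storage route, a minimal sketch of a batch load job using the Python client library (bucket, file, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.json",  # placeholder GCS path
    "my_project.my_dataset.events",          # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```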
I am considering BigQuery for my data warehouse requirement. Right now, I have my data in Google Cloud (Cloud SQL and Bigtable), and I have exposed REST APIs to retrieve data from both. I would like to retrieve data from these APIs, do the ETL, and load the data into BigQuery. I am evaluating two ETL options right now (a daily job over hourly data):
Use Java Spring Batch, create a microservice, and use Kubernetes as the deployment environment. Will it scale?
Use Cloud DataFlow for ETL
Then use the BigQuery batch insert API (for the initial load) and the streaming insert API (for incremental loads when new data is available in the source) to load a denormalized BigQuery schema.
Please let me know your opinions.
Without knowing your data volumes, specifically how much new or changed data you have per day or how you are doing paging with your REST APIs, here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up the REST calls across your Spring services? You will also be in the Kubernetes management space and will have to handle retries with the streaming API to BQ.
If you go down the Dataflow route, you will have to write some transform code to call your REST API and perform the paging to populate the PCollection destined for BQ. With the recent addition of Dataflow templates you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to pull only data ?since=latestCall. From there you could execute the BigQuery writes. I recommend doing this in batch mode, as 1) it will scale better if you have millions of rows and 2) it will be less cumbersome to manage (during non-active times).
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
How big are your REST call results in record count?
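To make the Dataflow option more concrete, here is a rough Beam (Python SDK) sketch of a DoFn that pages through a REST API from a since timestamp and hands the rows to a batch BigQuery write; the endpoint, fields, and table are all hypothetical:

```python
import apache_beam as beam
import requests

class FetchSince(beam.DoFn):
    """Pages through a (hypothetical) REST endpoint starting at a timestamp."""

    def process(self, since_timestamp):
        page = 1
        while True:
            resp = requests.get(
                "https://api.example.com/records",  # hypothetical endpoint
                params={"since": since_timestamp, "page": page},
            )
            resp.raise_for_status()
            records = resp.json()
            if not records:
                break
            for record in records:
                yield record  # dicts matching the destination schema
            page += 1

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["2020-01-01T00:00:00Z"])  # e.g. the timestamp of the last successful run
        | beam.ParDo(FetchSince())
        | beam.io.WriteToBigQuery(
            "my_project:my_dataset.events",  # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```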
I have a table of URLs in BigQuery for which I would like to check the PageSpeed Insights score (or even include the whole API response in the BigQuery table). I tried to use UDFs for this purpose, but so far no luck. Is there a way of getting the response from:
https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url=https://google.com/&strategy=mobile&key=yourAPIKey
into a BigQuery table?
You cannot make API calls from BigQuery UDFs for several reasons. See here for more details about that.
Although there are a few ways to achieve what you want to do, I'd recommend using a Cloud Dataflow pipeline:
Read your BigQuery table using the BigQueryIO.Read source
In your ParDo, call the API you want
Write your results back to BigQuery using the BigQueryIO.Write sink
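In the Python SDK the equivalents of those steps are ReadFromBigQuery, a ParDo, and WriteToBigQuery. A rough sketch, where the API key, table names, and the fields kept from the response are assumptions:

```python
import json

import apache_beam as beam
import requests

PAGESPEED_URL = "https://www.googleapis.com/pagespeedonline/v2/runPagespeed"

class CheckPageSpeed(beam.DoFn):
    def process(self, row):
        # Call the PageSpeed Insights API for each URL read from BigQuery.
        resp = requests.get(
            PAGESPEED_URL,
            params={"url": row["url"], "strategy": "mobile", "key": "yourAPIKey"},
        )
        resp.raise_for_status()
        yield {"url": row["url"], "response": json.dumps(resp.json())}

with beam.Pipeline() as p:
    (
        p
        | beam.io.ReadFromBigQuery(
            query="SELECT url FROM `my_project.my_dataset.urls`",  # placeholder table
            use_standard_sql=True,
        )
        | beam.ParDo(CheckPageSpeed())
        | beam.io.WriteToBigQuery(
            "my_project:my_dataset.pagespeed_results",  # placeholder table
            schema="url:STRING,response:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```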
I understand from the Dataproc documentation that it's possible to read data from BigQuery using PySpark, but is there an advantage when running k-means clustering on ndarrays with a shape (xxxxxxx,) over, say, reading a file representation from Cloud Storage instead?
If you are not intending to do any other manipulation of your data in BigQuery, then you absolutely wouldn't gain anything from storing your data in BigQuery for this use case.
Per https://cloud.google.com/hadoop/bigquery-connector,
The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.
In other words, the connector doesn't do predicate push-down or otherwise leverage BigQuery for computation; this connector is just a convenience method to provide access to data that you're already storing or generating in BigQuery.
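If the feature values are already (or can be) exported as files in Cloud Storage, you can read them directly with PySpark and cluster them without BigQuery in the loop. A minimal sketch, where the bucket path, column name, and k are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-from-gcs").getOrCreate()

# Placeholder path and a single numeric column, matching the 1-D ndarray in the question.
df = spark.read.csv("gs://my-bucket/features/*.csv", header=True, inferSchema=True)

features = VectorAssembler(inputCols=["value"], outputCol="features").transform(df)
model = KMeans(k=5, seed=42).fit(features)
print(model.clusterCenters())
```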