Can we upload a text file to BigQuery? - google-bigquery

I have a couple of questions regarding BigQuery:
1. Can we upload text files to Google Cloud Platform and retrieve the required data? If yes, how is it possible?
2. My main aim is to upload a large amount of data to the cloud platform, analyse it, and retrieve the desired information whenever required. The data can be both structured and unstructured.

Can we upload text files in Google Cloud Platform and retrieve the required data?
Yes, you can.
If yes, how is it possible?
It is simple enough to get there; you just need to read these first:
BigQuery Home: https://cloud.google.com/bigquery/
Quickstarts: https://cloud.google.com/bigquery/docs/quickstarts
How-to Guides: https://cloud.google.com/bigquery/docs/how-to
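Once you have been through the docs above, a minimal sketch of loading a local CSV text file into a BigQuery table with the Python client library looks roughly like this; the project, dataset, table, and file names are placeholders, and schema auto-detection is only one option:

# Minimal sketch: load a local CSV text file into BigQuery.
# "my-project", "my_dataset.my_table", and "data.csv" are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

with open("data.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my-project.my_dataset.my_table", job_config=job_config
    )

load_job.result()  # wait for the job to finish
print(f"Loaded {load_job.output_rows} rows.")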

Related

Loading data into Google Colab from a central data repository?

What's the best place to upload data for use in Google Colaboratory notebooks? I'm planning to make some notebooks that load netCDF data using python, and I'd like to be able to send the notebooks to other people and have them load the same data without difficulty.
I know I can load data from my own Google Drive, but if I sent other people the notebooks, then I'd have to send them the data files too, right?
Is it possible to have a central data repository that multiple people can load data from? The files that I'd like to use are ~10-100 MB. I only need to read data, not write it. Thanks!
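One common pattern (a hedged sketch, not from this thread) is to host the files in a publicly readable Cloud Storage bucket and have each notebook download them at startup, so recipients never need the data files themselves; the bucket and file names below are hypothetical:

# Hedged sketch: fetch a shared netCDF file from a public GCS bucket in Colab.
# "my-shared-datasets" and "example.nc" are hypothetical names.
import os
import urllib.request

import xarray as xr  # in Colab, install if needed: pip install xarray netCDF4

DATA_URL = "https://storage.googleapis.com/my-shared-datasets/example.nc"
LOCAL_PATH = "example.nc"

if not os.path.exists(LOCAL_PATH):          # download once per runtime
    urllib.request.urlretrieve(DATA_URL, LOCAL_PATH)

ds = xr.open_dataset(LOCAL_PATH)            # read-only access is enough here
print(ds)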

Handling dynamic keys while loading json data in bigquery

I have a very big gz zipped file of JSON data. Due to some limitations, I am not able to extract and transform the data. Now the JSON data itself is very dynamic in nature.
For example:
{"name": "yourname", "age": "your age", "schooling": {"high-school-name1": "span of years studied"}}
{"name": "yourname", "age": "your age", "schooling": {"high-school-name2": "span of years studied"}}
The problem is the high-school-name field is a dynamic one, which will be different for different sets of users.
Now when I am uploading to bigquery, I am not able to determine which type I should specify for the schooling field or how to handle this upload to bigquery.
I am using a Cloud Function to automate the flow, so as soon as the file is uploaded to Cloud Storage it triggers the function. As the Cloud Function has very little memory, there is no way to transform the data there. I have looked into Dataprep for it, but I am trying to understand whether I am missing something that could make what I am trying to do possible without using any other services.
According to the documentation on Loading JSON data from Cloud Storage and Specifying nested and repeated columns, I think you do indeed need a processing step, which could be well covered by either Dataproc or Dataflow.
You can implement a pipeline to transform your dynamic data as needed and write it to BigQuery. This doc might be of interest to you: there is template source code that you can adapt to load JSON into a BigQuery table. Here is the documentation about loading JSON data from Cloud Storage.
Please note that one of the limitations is:
If you use gzip compression BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
This is one of the reasons why I think you have to implement your solution with an additional product, as you mentioned.
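For illustration, here is a rough Apache Beam (Dataflow) sketch of the kind of pipeline meant above, assuming the file is newline-delimited JSON: it folds the dynamic high-school-name keys into a repeated school/years record before writing to BigQuery. The bucket, project, dataset, and field names are placeholders, and this schema is only one way to model the data.

# Rough Dataflow/Apache Beam sketch (names and schema are assumptions).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TABLE_SCHEMA = {
    "fields": [
        {"name": "name", "type": "STRING"},
        {"name": "age", "type": "STRING"},
        {"name": "schooling", "type": "RECORD", "mode": "REPEATED",
         "fields": [
             {"name": "school", "type": "STRING"},
             {"name": "years", "type": "STRING"},
         ]},
    ]
}

def normalize(line):
    # Turn {"high-school-nameN": "span"} into [{"school": ..., "years": ...}]
    row = json.loads(line)
    schooling = row.get("schooling") or {}
    row["schooling"] = [{"school": k, "years": v} for k, v in schooling.items()]
    return row

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/users.json.gz")  # .gz is decompressed automatically
     | "Normalize" >> beam.Map(normalize)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.users",
           schema=TABLE_SCHEMA,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))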

Getting DoubleClick data from Google Analytics exported to BigQuery

I have GA360 and I have exported raw Google Analytics data to BigQuery through the integration. I'm just wondering if the DCM dimensions and metrics are part of the export?
Ideally these would be linked in, but I can't find them as part of the export. What would be the best way to access these dimensions for all my raw data? The Core Reporting API?
This Google Analytics documentation page shows the BigQuery Export schema for the Google Analytics data that is exported to BigQuery. As you will see, the DoubleClick Campaign Manager (DCM) data is not part of the BigQuery export, unlike the DoubleClick for Publishers (DFP) data, which does appear (these are all the hits.publisher.<DFP_METRIC> metrics).
Therefore, as explained by @BenP in the comments, you may be interested in the BigQuery Data Transfer Service, which does have a feature for DoubleClick Campaign Manager transfers. Unfortunately, I am not an expert in Google Analytics or DCM, and therefore cannot add much relevant information about the process of linking both sets of data, but maybe you can try combining them yourself and then post a new question on that matter if you do not succeed in doing so.
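To confirm what is and is not populated in your export, a small sketch like this pulls the hits.publisher (DFP) record out of one daily export table with the Python client; the project, dataset, and date suffix are placeholders.

# Sketch: inspect the hits.publisher (DFP) record in a GA360 export table.
# Project, dataset, and date suffix are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT h.publisher
FROM `my-project.my_ga_dataset.ga_sessions_20190101` AS s,
     UNNEST(s.hits) AS h
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.publisher)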

Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried to run "mlcp" locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It would be fine for me to have one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to s3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
s3:// is a native file scheme in MarkLogic, so you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to make aggregates, then you may want to marry this idea with Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents, then use a post-back task to run xdmp:save.
Thanks guys for your answers. I did not know about CORB2; this is a great solution! But unfortunately, due to bad I/O, I would prefer a solution that writes directly to S3.
I can use a basic ML query and dump to s3:// with the native connector, but I always face memory errors, even when launching with the "spawn" function to generate a background process.
Do you have any XQuery example to extract each document to S3 one by one without running into memory errors?
Thanks
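One alternative sketch, outside the MarkLogic-native tools mentioned above: page through the documents with MarkLogic's REST API and stream them to S3 one at a time with boto3, so nothing large has to be held in memory on the server. This assumes a REST app server is reachable; the host, credentials, bucket, and key names are placeholders.

# Hedged sketch: copy MarkLogic JSON documents to S3 one by one via the REST API.
# Host, credentials, bucket, and key prefix are placeholders.
import requests
from requests.auth import HTTPDigestAuth
import boto3

ML = "http://marklogic-host:8000"
AUTH = HTTPDigestAuth("user", "password")
s3 = boto3.client("s3")

start, page = 1, 100
while True:
    # Page through document URIs (optionally narrow with a collection or query).
    resp = requests.get(f"{ML}/v1/search",
                        params={"format": "json", "start": start, "pageLength": page},
                        auth=AUTH)
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        break
    for r in results:
        uri = r["uri"]
        doc = requests.get(f"{ML}/v1/documents", params={"uri": uri}, auth=AUTH)
        doc.raise_for_status()
        # One S3 object per document; a follow-up job could concatenate to NDJSON.
        s3.put_object(Bucket="my-export-bucket",
                      Key=f"marklogic{uri}",
                      Body=doc.content)
    start += page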

How to download all data in a Google BigQuery dataset?

Is there an easy way to directly download all the data contained in a certain dataset on Google BigQuery? I'm currently downloading "as CSV", making one query after another, but it doesn't allow me to get more than 15k rows, and the rows I need to download number over 5M.
Thank you
You can run BigQuery extraction jobs using the Web UI, the command line tool, or the BigQuery API. The data can be extracted to a Google Cloud Storage bucket.
For example, using the command line tool:
First install and auth using these instructions:
https://developers.google.com/bigquery/bq-command-line-tool-quickstart
Then make sure you have an available Google Cloud Storage bucket (see Google Cloud Console for this purpose).
Then, run the following command:
bq extract my_dataset.my_table gs://mybucket/myfilename.csv
More on extracting data via API here:
https://developers.google.com/bigquery/exporting-data-from-bigquery
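For completeness, the same extraction job can be started from the Python client; the project, dataset, table, and bucket names are placeholders, and the wildcard and compression settings are optional:

# Sketch: run a BigQuery extract job to Cloud Storage from Python.
# Project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://mybucket/myfilename-*.csv.gz",  # wildcard lets BigQuery split large outputs
    job_config=job_config,
)
extract_job.result()  # wait for completion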
Detailed step-by-step to download large query output
1. Enable billing.
   You have to give your credit card number to Google to export the output, and you might have to pay. But the free quota (1 TB of processed data) should suffice for many hobby projects.
2. Create a project.
3. Associate billing with the project.
4. Do your query.
5. Create a new dataset.
6. Click "Show options" and enable "Allow Large Results" if the output is very large.
7. Export the query result to a table in the dataset.
8. Create a bucket on Cloud Storage.
9. Export the table to the created bucket on Cloud Storage.
   Make sure to select GZIP compression and use a name like <bucket>/prefix.gz. If the output is very large, the file name must have an asterisk (*) and the output will be split into multiple files.
10. Download the table from Cloud Storage to your computer.
    It does not seem possible to download multiple files from the web interface if the large file got split up, but you can install gsutil and run:
    gsutil -m cp -r 'gs://<bucket>/prefix_*' .
    See also: Download files and folders from Google Storage bucket to a local folder.
    There is a gsutil package in Ubuntu 16.04, but it is an unrelated package; you must install and set it up as documented at: https://cloud.google.com/storage/docs/gsutil
11. Unzip locally:
    for f in *.gz; do gunzip "$f"; done
Here is a sample project I needed this for which motivated this answer.
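If you prefer to script the download and unzip steps as well, here is a small sketch using the Cloud Storage Python client; the project, bucket name, and prefix are placeholders:

# Sketch: download the split .gz export files from Cloud Storage and unzip them.
# Project, bucket name, and prefix are placeholders.
import gzip
import shutil
from google.cloud import storage

client = storage.Client(project="my-project")

for blob in client.list_blobs("mybucket", prefix="prefix"):
    local_gz = blob.name.split("/")[-1]          # e.g. prefix_000000000000.gz
    blob.download_to_filename(local_gz)
    with gzip.open(local_gz, "rb") as src, open(local_gz[:-3], "wb") as dst:
        shutil.copyfileobj(src, dst)             # same effect as gunzip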
For Python, you can use the following code; it will download the data as a DataFrame.
from google.cloud import bigquery

def read_from_bqtable(bq_projectname, bq_query):
    client = bigquery.Client(bq_projectname)
    bq_data = client.query(bq_query).to_dataframe()
    return bq_data  # return dataframe

bigQueryTableData_df = read_from_bqtable('gcp-project-id', 'SELECT * FROM `gcp-project-id.dataset-name.table-name`')
Yes, the steps suggested by Michael Manoochehri are correct and an easy way to export data from Google BigQuery.
I have written a bash script so that you are not required to do these steps every time; just use my bash script.
Below is the GitHub URL:
https://github.com/rajnish4dba/GoogleBigQuery_Scripts
Scope:
1. Export data based on your BigQuery SQL.
2. Export data based on your table name.
3. Transfer your export file to an SFTP server.
Try it and let me know your feedback.
For help, use: ExportDataFromBigQuery.sh -h