How to save query results in a Google Cloud Platform? - google-bigquery

I have never used SQL before, and I am trying to do something that should be simple but has taken me hours. I would like to download a table from a project on Google Cloud Platform. When the console asks "Choose where to save the results data from the query", I choose "CSV (Google Drive) Save up to 1GB...". However, I get this message:
Table dataset_reference { project_reference { project_id: "escolas-259115" gaia_id: 777399094185 } dataset_id: "_31e2c29542f3fa3caf4d6d069271a277dce8d215" dataset_uuid: "9872dc9f-2c66-4088-b7b1-b949e7541f07" } table_id: "anon76eccaf0_a96b_4617_8a3c_c2d5ee734662" table_uuid: "76eccaf0-a96b-4617-8a3c-c2d5ee734662" too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
Here is the code that I am using:
SELECT
ano,
estado_abrev,
id_municipio,
causa_basica,
idade,
genero,
raca_cor,
numero_obitos
FROM
`basedosdados.br_ms_sim.municipio_causa_idade_genero_raca`
As I said, I have never worked with SQL before.

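The option you chose exports a table, and the result is too large to fit in a single file. What you actually want is to export a query result.
You can do that with an EXPORT DATA statement and a wildcard URI, so the output can be sharded across several files (replace the bucket path with your own):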
EXPORT DATA OPTIONS(
uri='gs://mybucket/transformed/sales-*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT
ano,
estado_abrev,
id_municipio,
causa_basica,
idade,
genero,
raca_cor,
numero_obitos
FROM
`basedosdados.br_ms_sim.municipio_causa_idade_genero_raca`
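For readers scripting this, the statement above can also be assembled programmatically. The sketch below is only an illustration: `build_export_statement` is a hypothetical helper (not part of any BigQuery library) that rejects URIs without a `*`, mirroring the wildcard requirement from the error message, and the bucket path is a placeholder.

```python
def build_export_statement(query: str, uri: str, delimiter: str = ";") -> str:
    """Wrap a SELECT in an EXPORT DATA statement (hypothetical helper)."""
    # The error message asks for a '*' in the URI so the export can be sharded.
    if "*" not in uri:
        raise ValueError("uri must contain a '*' wildcard, e.g. gs://bucket/out-*.csv")
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{uri}',\n"
        "  format='CSV',\n"
        "  overwrite=true,\n"
        "  header=true,\n"
        f"  field_delimiter='{delimiter}') AS\n"
        f"{query.strip()}"
    )


# Placeholder bucket; replace with a bucket you can write to.
statement = build_export_statement(
    "SELECT ano, numero_obitos "
    "FROM `basedosdados.br_ms_sim.municipio_causa_idade_genero_raca`",
    "gs://mybucket/transformed/sales-*.csv",
)
print(statement)
```

The resulting string can be pasted into the BigQuery console or passed to whichever client you use to run queries.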

Related

BigQuery: Convert a text column to UTF-8

I want to start a Vertex AI AutoML Text Entity Extraction Batch Prediction Job, but in my experience the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements:
Every text's size must be between 10 and 10000 bytes: DONE
Every text must be UTF-8 encoded: UNKNOWN
My original data is stored in BigQuery, so I will have to export it to Google Cloud Storage for the batch prediction. To take advantage of BigQuery's optimization, I want to handle both requirements in the BigQuery source table itself. I have checked Google's official documentation, and the closest related information I found is this; however, it does not match what I need. BTW, the query looks as follows:
WITH mydata AS (
SELECT
CASE
WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly an skill")
ELSE posting
END AS posting
FROM `my-project.Machine_Learning_Datasets.sample-data-source` -- Modified for data protection
)
SELECT
posting as content, -- Something needs to be done here
"text" as mimeType
FROM mydata
And the schema of my-project.Machine_Learning_Datasets.sample-data-source looks as follows:
Field name | Type | Mode | Records
posting | STRING | NULLABLE | 100M
Any ideas?
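The following answer did the job, FYI. It removes every non-ASCII character, so the remaining text is plain ASCII and therefore valid UTF-8: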
WITH
mydata AS (
SELECT
CASE
WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly an skill")
ELSE
posting
END
AS posting
FROM
`my-project.Machine_Learning_Datasets.sample-data-source` )
SELECT
REGEXP_REPLACE(posting, r'[^\x00-\x7F]+', '') AS content,
"text/plain" AS mimeType
FROM
mydata
UPDATE: This case has been considered, for an improved workaround.
Thanks!
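To sanity-check the two constraints on a handful of rows before exporting, the same rules can be mirrored locally. This is a minimal sketch, not the Vertex AI validation itself: `clean_posting` is a hypothetical helper, the 10 and 10000 byte limits come from the question, and unlike the query it strips non-ASCII characters first, so the size check works on bytes (SQL `LEFT` on a STRING counts characters).

```python
import re

MIN_BYTES, MAX_BYTES = 10, 10000  # limits quoted in the question
PADDING = " is possibly an skill"  # padding text used in the query above


def clean_posting(posting: str) -> str:
    """Strip non-ASCII characters, then enforce the byte-size window."""
    # Same pattern as the REGEXP_REPLACE call in the answer.
    text = re.sub(r"[^\x00-\x7F]+", "", posting)
    # After stripping, every character is one byte, so len() equals the byte size.
    if len(text) > MAX_BYTES:
        text = text[:MAX_BYTES]
    elif len(text) < MIN_BYTES:
        text = text + PADDING
    return text


samples = ["café", "x" * 12000, "Senior data engineer"]
cleaned = [clean_posting(s) for s in samples]
for text in cleaned:
    size = len(text.encode("utf-8"))
    print(size, MIN_BYTES <= size <= MAX_BYTES)
```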

GCP Export Data Options gives Option 'uri' value must be a wild card URI

I am trying to export query output to a file in cloud storage.
The query output is always smaller than 1 GB, but the EXPORT DATA statement creates multiple smaller files.
Example:
EXPORT DATA OPTIONS(
uri='gs://test_bucket/test_file_*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `test.test_table`;
When I provide a file name without a wildcard (gs://test_bucket/test_file_1.csv), I see the error "Invalid uri specification. Option 'uri' value must be a wild card URI."
Is there any way to always generate only ONE file using EXPORT DATA OPTIONS?
Cross-reference: the following code,
EXPORT DATA OPTIONS(
uri='gs://test_bucket/test_file_*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT DISTINCT * FROM `test.test_table`;
as answered by nicolas noziere (https://stackoverflow.com/a/66388650/4985705), generates only ONE file: adding DISTINCT forces all the data to be loaded into one worker.
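If a single file is needed regardless of how many shards the export writes, another option (not from the answer above) is to download the shards and concatenate them, keeping the header row only once. A minimal sketch, assuming the shards have already been copied to a local directory, each starts with the same header line, and each ends with a newline; `merge_csv_shards` and the file names are hypothetical.

```python
import glob
import os
import tempfile


def merge_csv_shards(pattern: str, destination: str) -> int:
    """Concatenate CSV shards matching `pattern`; return the number of shards merged."""
    shard_paths = sorted(glob.glob(pattern))
    with open(destination, "w", encoding="utf-8", newline="") as out:
        for index, path in enumerate(shard_paths):
            with open(path, "r", encoding="utf-8", newline="") as shard:
                header = shard.readline()
                # Write the header from the first shard only.
                if index == 0:
                    out.write(header)
                for line in shard:
                    out.write(line)
    return len(shard_paths)


# Demo with two made-up shards in a temporary directory.
workdir = tempfile.mkdtemp()
for name, row in [("test_file_000.csv", "1;a\n"), ("test_file_001.csv", "2;b\n")]:
    with open(os.path.join(workdir, name), "w", encoding="utf-8", newline="") as f:
        f.write("id;label\n" + row)

merged_path = os.path.join(workdir, "merged.csv")
shard_count = merge_csv_shards(os.path.join(workdir, "test_file_*.csv"), merged_path)
with open(merged_path, encoding="utf-8", newline="") as f:
    merged_text = f.read()
print(shard_count, repr(merged_text))
```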

Extract incident details from Service Now in Excel

I am trying to extract ticket details from ServiceNow. Is there a way to extract the details without ODBC? I have also tried the solution described at https://community.servicenow.com/docs/DOC-3844, but I am receiving error 9 (subscript out of range).
Is there a better way to extract the details efficiently? I tried asking this in the ServiceNow forum, but I thought I might get other opinions here.
It has been a while since this question was asked, but hopefully the following is still useful.
I am extracting change data (not incidents), but the process should be the same. You will need to gather the incident table and column information. Then there are a couple of ways to approach the problem.
1) If the data you are extracting has fixed parameters, such as a fixed period, fixed columns, or a fixed group, you can create a report within ServiceNow and then use the REST/SOAP API to get the data in text/CSV format. You can use various Python modules to convert from CSV to XLS or XLSX, depending on your needs. I used openpyxl, csv, xlsreader, xlswriter, etc.
See here for an example:
ServiceNow - How to use SOAP to download reports
2) If the data has dynamic parameters, where you need to change columns, dates, filters, etc., you can still use the SOAP/REST API, but build the query within a Python script instead of relying on a static report. This way you can change it on the fly based on your requirements.
Here is an example query against the database table. You can reuse the example linked above; just replace the URL with the following.
from urllib.parse import urlencode

table_name = 'u_change_table_name'  # SN DB table holding change/INCIDENT info
table_limit = 800
# Alternative filter; not used in the URL below.
table_query = 'active=true&sysparm_display_value=true&planned_start_date=today'
date_query = 'chg_start_date>=javascript:gs.daysAgoStart(1)^active=true^chg_type=normal'
# Actual column names from the DB, not the labels shown in an SN report.
table_fields = 'chg_number,chg_start_date,chg_duration,chg_end_date'
# urlencode escapes characters such as '^', '>' and '=' inside the values.
url = (
    'https://yourcompany.service-now.com/api/now/table/' + table_name
    + '?' + urlencode({
        'sysparm_query': date_query,
        'sysparm_fields': table_fields,
        'sysparm_limit': table_limit,
    })
)
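Once a URL like the one above returns data, the records still have to reach Excel. The sketch below assumes the response is the JSON document the Table API returns, with the rows under a top-level `result` key, and writes the chosen columns to CSV text that Excel can open. The `result_to_csv` helper and the sample payload are made up for illustration; no request is sent.

```python
import csv
import io
import json


def result_to_csv(payload: str, fields: list) -> str:
    """Convert a Table API style JSON payload into CSV text."""
    records = json.loads(payload).get("result", [])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the requested columns; missing values become empty cells.
        writer.writerow({name: record.get(name, "") for name in fields})
    return buffer.getvalue()


# Made-up payload shaped like a Table API response.
sample_payload = json.dumps({
    "result": [
        {"chg_number": "CHG0001", "chg_start_date": "2020-01-01 10:00:00", "sys_id": "abc"},
        {"chg_number": "CHG0002", "chg_start_date": "2020-01-02 11:30:00", "sys_id": "def"},
    ]
})
csv_text = result_to_csv(sample_payload, ["chg_number", "chg_start_date"])
print(csv_text)
```

Writing `csv_text` to a file with a `.csv` extension is enough for Excel to open it; converting to XLSX would need one of the modules mentioned above.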

Export Data from SQL to CSV

I'm using EntityFramework to access a sql server to return data. The data needs to be formatted into a tab delimited file. I then want to compress the data to return to the user.
I can do the select and then iterate over the EF objects, formatting all the data into one big string, but this takes forever (I'm returning about 800k rows). The query itself is quite fast; it is just the creation of the CSV file in memory that is killing it.
I found this post that describes how to use sqlcmd to do this directly as an export (but with csv) with sql which seems very promising, but I'm unclear how to pass the -E and other parameters to ExecuteSqlCommand()... or if it is even meant for this.
I tried to do something like this:
var test = context.Database.ExecuteSqlCommand("select Chromosome c,
StartLocation sl, Endlocation el, GeneName gn from Gencode where c = chr1",
"-E", "-Q", new SqlParameter("-s", "\t"));
But of course that didn't work...
Any suggestions as to how to go about this? I'm using EF 6.1 if that matters.
An alternative option, using a simple method:
F5-->store result--> keep file name
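Whatever tool produces the rows, the slow part described in the question (building one large string) can usually be avoided by streaming each row straight into a compressed writer. The idea is language-agnostic; it is sketched here in Python with made-up rows, and in C# the analogous pieces would be a StreamWriter wrapped around a GZipStream. The `write_tsv_gz` helper is hypothetical.

```python
import csv
import gzip
import io


def write_tsv_gz(rows, header) -> bytes:
    """Stream rows into a gzip-compressed, tab-delimited payload."""
    raw = io.BytesIO()
    with gzip.open(raw, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        # Rows are written one at a time; no intermediate big string is built.
        for row in rows:
            writer.writerow(row)
    return raw.getvalue()


# Made-up rows standing in for the query results (a generator, so nothing is buffered).
rows = (("chr1", start, start + 100, f"gene{start}") for start in range(0, 300, 100))
payload = write_tsv_gz(rows, ["Chromosome", "StartLocation", "EndLocation", "GeneName"])
text = gzip.decompress(payload).decode("utf-8")
print(text)
```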

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have this table:
For which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']
    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])
def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"
    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
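To see what the generated SELECT list looks like, the recursive helper can be exercised without calling the API. The sketch below repeats the leaf-selector logic as a standalone function (named `leaf_selectors` here to keep it separate from the code above) and feeds it a made-up schema with a `__key__` RECORD; the field names other than `__key__.namespace` are illustrative.

```python
def leaf_selectors(prefix, field):
    """Return 'a.b as a_b' selectors for every leaf under `field`."""
    name = f"{prefix}.{field['name']}" if prefix else field["name"]
    if "fields" not in field:
        # Leaf column: alias it so the name contains no '.'.
        return f"{name} as {name.replace('.', '_')}"
    # RECORD column: recurse into its sub-fields.
    return ",\n".join(leaf_selectors(name, sub) for sub in field["fields"])


# Made-up schema in the shape returned by tables.get()['schema'].
schema = {
    "fields": [
        {"name": "email", "type": "STRING"},
        {
            "name": "__key__",
            "type": "RECORD",
            "fields": [
                {"name": "namespace", "type": "STRING"},
                {"name": "id", "type": "INTEGER"},
            ],
        },
    ]
}
select_list = ",\n".join(leaf_selectors("", f) for f in schema["fields"])
print("SELECT\n" + select_list + "\nFROM [2014_05_17.AccountDeletionRequest]")
```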
We had a bug where you needed to select the individual fields in the view and use an 'as' to rename them to something legal (i.e., names that don't contain a '.').
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.