BigQuery: Convert a text column to UTF-8 - google-bigquery

I want to start a Vertex AI AutoML Text Entity Extraction Batch Prediction Job, but from my own experience, the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements:
Every text's size must be between 10 and 10000 bytes: DONE
Every text must be UTF-8 encoded: UNKNOWN
My original data is stored in BigQuery, so I'll have to export it to Google Cloud Storage for the later batch prediction. To take advantage of BigQuery optimization, I want to handle both requirements in the BigQuery data source table itself. I have checked Google's official documentation, and the closest related information I have found is this; however, it is not exactly what I want. By the way, the query looks as follows:
WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting) > 10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting) < 10 THEN CONCAT(posting, " is possibly an skill")
      ELSE posting
    END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source` -- Modified for data protection
)
SELECT
  posting AS content, -- Something needs to be done here
  "text" AS mimeType
FROM mydata
And the my-project.Machine_Learning_Datasets.sample-data-source schema looks as follows:

Field name | Type   | Mode     | Records
posting    | STRING | NULLABLE | 100M
Any ideas?

The following answer did the job, FYI:
WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting) > 10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting) < 10 THEN CONCAT(posting, " is possibly an skill")
      ELSE posting
    END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`
)
SELECT
  REGEXP_REPLACE(posting, r'[^\x00-\x7F]+', '') AS content,
  "text/plain" AS mimeType
FROM mydata
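For the subsequent export to Cloud Storage mentioned in the question, here is a minimal sketch using the google-cloud-bigquery Python client; the destination table and bucket names are placeholders, not from the original post:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Placeholder destination table for the cleaned-up rows.
dest_table = "my-project.Machine_Learning_Datasets.prediction_input"
job_config = bigquery.QueryJobConfig(destination=dest_table,
                                     write_disposition="WRITE_TRUNCATE")
# The BYTE_LENGTH handling from the answer above is omitted here for brevity.
query = r"""
SELECT
  REGEXP_REPLACE(posting, r'[^\x00-\x7F]+', '') AS content,
  "text/plain" AS mimeType
FROM `my-project.Machine_Learning_Datasets.sample-data-source`
"""
client.query(query, job_config=job_config).result()

# Export as newline-delimited JSON, the JSONL layout batch prediction expects.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
client.extract_table(
    dest_table,
    "gs://my-placeholder-bucket/prediction-input/data-*.jsonl",  # placeholder bucket
    job_config=extract_config,
).result()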
UPDATE: This case has been considered for an improved workaround.
Thanks!

Related

Convert a spool into text format

I want to send the spool generated by a Smart Form by email, as an attachment in TXT format.
The issue is getting the spool into a TXT format without the technical stuff, just the characters in the form.
I have used the function module RSPO_RETURN_SPOOLJOB for getting it, but it returns a technical format like this:
//XHPLJIIID 0700 00000+00000+
IN01ES_CA930_DEMO_3 FIRST
OPINCH12 P 144 240 1728020160000010000100001
IN02MAIN
MT0100808400
CP11000000E
FCCOURIER 120 00144 SF001SF001110000144E
UL +0000000000000
ST0201614Dear Customer,
MT0214209000
ST0864060We would like to take this opportunity to confirm the flight
MT0100809360
ST0763253reservations listed below. Thank you for your custom.
...
I want something as follows, without the technical stuff:
Dear Customer,
We would like to take this opportunity to confirm the flight
reservations listed below. Thank you for your custom.
...
This is the code I have used:
PARAMETERS spoolnum type TSP01-RQIDENT.
DATA spool_contents type soli_tab.

CALL FUNCTION 'RSPO_RETURN_SPOOLJOB'
  exporting
    rqident = spoolnum
  tables
    buffer  = spool_contents
  exceptions
    others  = 1.
If the parameter DESIRED_TYPE is not passed or has the value 'OTF', and the spool is of type SAPscript/Smart Form, the function module returns the technical format you have experienced.
Instead, you should use the parameter DESIRED_TYPE = 'RAW' so that all the technical stuff is interpreted and the form is returned as text, the way you request, as follows:
CALL FUNCTION 'RSPO_RETURN_SPOOLJOB'
  exporting
    rqident      = spoolnum
    desired_type = 'RAW'
  tables
    buffer       = spool_contents
  exceptions
    others       = 1.

UPDATE query produces no change

I wrote the following UPDATE statement:
UPDATE print_archive
SET pages = pages*2
WHERE printer ILIKE '%plot%' AND paper_size = 'Arch D';
It will look in a table named print_archive for any printers with "plot" in their name and the paper size 'Arch D'.
If it finds any, it is supposed to multiply the page count by 2.
Below is a sample of the print_archive table data -
(column names)
"name_id","printer","pages","source","date","file_name","duplex","paper_size"
Sample Data:
jane, \\PRINTSRV\plot9, 1, \\COMP-01, 01/21/2017 14:30:39, hello_world.pdf, No, Arch D,
billy, \\PRINTSRV\Plot13, 1, \\COMP-02, 02/20/2016 10:37:23, bye_world.doc, No, Arch D,
But no matter what I change or how many times I run the UPDATE statement, it always returns 0.
What am I doing wrong?
So initially I had added the latin1 encoding when importing the CSV file, as the import process kept producing encoding error messages with the default encoding. I switched the import of the CSV file to 'latin2' and that worked better. There does also seem to be a leading space in some of the fields, so adding '%' wildcards helped as well (though this was not working in latin1). pgAdmin4 (which is how my users will interact) handles latin2 pretty well. From the psql command line, though, I need to set the client encoding to latin2 before it works (SET client_encoding TO 'latin2';).
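For reference, a minimal sketch of the adjusted statement run through psycopg2, with the client encoding set first; the connection details are placeholders:

import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(dbname="printdb", user="postgres", host="localhost")
with conn:
    with conn.cursor() as cur:
        # Match the client encoding to the one the CSV was imported with.
        cur.execute("SET client_encoding TO 'latin2';")
        # Surrounding '%' wildcards tolerate the stray leading spaces in
        # paper_size that kept the original WHERE clause from matching.
        cur.execute(
            "UPDATE print_archive "
            "SET pages = pages * 2 "
            "WHERE printer ILIKE '%plot%' AND paper_size ILIKE '%Arch D%';"
        )
        print(cur.rowcount, "rows updated")
conn.close()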
Thanks guys. All your recommendations and leads helped me.

Extract incident details from Service Now in Excel

I am trying to extract ticket details from Service Now. Is there a way to extract the details without ODBC? I have also tried the solution mentioned in [1]: https://community.servicenow.com/docs/DOC-3844, but I am receiving an error 9 (subscript out of range).
Is there a better way to extract details efficiently? I tried asking this in the service now forum but I thought I might get other opinions from here.
It's been a while since this question was asked. Hopefully the following is still useful.
I am extracting change data (not incidents), but the process should still be the same. You will need to gather the incident table and column information. Then there are a couple of ways to approach the problem.
1) If the data you are extracting has fixed parameters, such as a fixed period, fixed columns, fixed group, etc., then you can create a report within ServiceNow and then use the REST/SOAP API to get the data in text/CSV format. You can use different Python modules to convert from CSV to xls or xlsx depending on your needs. I used openpyxl, csv, xlsreader, xlswriter, etc.
See here for an example:
ServiceNow - How to use SOAP to download reports
2) If the data has dynamic parameters where you need to change columns, dates or filters, etc., you can still use the SOAP/REST API, but form the query within Python scripts instead of having a static report. This way you can change it based on your requirements on the fly.
Here is an example query for the DB; you can use it with the example above, just swapping the URL with the following.
table_name = 'u_change_table_name'  # SN DB table holding change/INCIDENT info
table_limit = 800
table_query = 'active=true&sysparm_display_value=true&planned_start_date=today'
date_query = 'chg_start_date>=javascript:gs.daysAgoStart(1)^active=true^chg_type=normal'
table_fields = 'chg_number,chg_start_date,chg_duration,chg_end_date'  # Actual column names from DB and not from SN report

url = (
    'https://yourcompany.service-now.com/api/now/table/' + table_name +
    '?sysparm_query=' + date_query + '&sysparm_fields=' +
    table_fields + '&sysparm_limit=' + str(table_limit)
)
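Continuing from the url built above, a hedged sketch of fetching the rows with the requests module (the basic-auth credentials are placeholders) and writing them to a CSV file, which Excel opens directly:

import csv
import requests

# Placeholder credentials; replace with a real ServiceNow user or token.
response = requests.get(
    url,
    auth=("api_user", "api_password"),
    headers={"Accept": "application/json"},
)
response.raise_for_status()
records = response.json()["result"]  # the Table API wraps rows in "result"

fields = table_fields.split(",")
with open("changes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for rec in records:
        writer.writerow({k: rec.get(k, "") for k in fields})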

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have a table exported from the Datastore (so its schema includes __key__ RECORD columns), for which I issue this API call:
{
  'tableReference': {
    'projectId': 'redacted',
    'tableId': u'AccountDeletionRequest',
    'datasetId': 'Latest_Production_Data'
  },
  'view': {
    'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
  },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
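And a hedged usage sketch tying it together; table_service and BQ_PROJECT_ID are the same handles used above, while the dataset and table names are illustrative:

# Illustrative only: build the view query from the generated selectors and
# create the view with tables.insert (same discovery-based API client as above).
dataset = "2014_05_17"
table = "AccountDeletionRequest"

selectors = get_leaf_column_selectors(dataset, table)
view_body = {
    "tableReference": {
        "projectId": BQ_PROJECT_ID,
        "datasetId": "Latest_Production_Data",
        "tableId": table,
    },
    "view": {
        "query": "SELECT %s FROM [%s.%s]" % (selectors, dataset, table),
    },
}
table_service.insert(
    projectId=BQ_PROJECT_ID,
    datasetId="Latest_Production_Data",
    body=view_body,
).execute()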
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e., so they don't have '.' in the name).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do an MLT search by id for similar content, the original document (whose id is the searched id) has a score of 5.241327.
There is a 1:1 duplicated document, and for the duplicated content it returns score = 1.5258181. Why? Why is it not 5.241327 when it is a 100% duplicate?
Another question: is there any way to get similar documents by content, by passing some text in the query?
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if similar content has already been uploaded, and the check must be performed when new content is uploaded.
It might be worth trying some different parameters. I also use MLT on only one field; I use the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf might be important, and I see the default mintf (term frequency) is 2; since an ID is only one term, that is probably ignored!
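For reference, a minimal sketch of issuing such an MLT request with Python's requests module; the core name, handler path, and seed document id are assumptions about a local setup:

import requests

# Assumptions: local Solr, a core named "mycore", and the MoreLikeThis
# handler registered at /mlt; the seed id is illustrative.
params = {
    "q": "id:57375",
    "mlt.fl": "content",
    "mlt.mintf": 0,
    "mlt.mindf": 0,
    "mlt.boost": "true",
    "fl": "id,score",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/mlt", params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("score"))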
First, how does Solr More-Like-This work?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&.....").
For each document returned by the above query, More-Like-This conducts a More-Like-This query...
So, the first result set, "response", is just like any Solr query result set.
The More-Like-This results appear below it and start with something like this (JSON format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation of the More Like This algorithm, please read these:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you haven't solved the problem yet, please let me know and I will guide you through it.
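Regarding the second part of the question (finding documents similar to new text rather than to an existing id): depending on the Solr version and configuration, the MoreLikeThisHandler can also take the raw text as a content stream via stream.body instead of a seed document. A hedged sketch, with the same assumed local core and handler path:

import requests

# Assumptions: core "mycore", MLT handler at /mlt, and stream.body permitted
# by the request dispatcher configuration.
new_text = "Some encoded long text that is about to be uploaded"
params = {
    "mlt.fl": "content",
    "mlt.mintf": 1,
    "mlt.mindf": 1,
    "fl": "id,score",
    "rows": 5,
    "wt": "json",
    "stream.body": new_text,
}
resp = requests.get("http://localhost:8983/solr/mycore/mlt", params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("score"))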