Selecting common parts of a string in GBQ - google-bigquery

I have the following column_x in a table.
I would like to write a query that extracts only the common parts of the strings (it is not a split between numeric and non-numeric characters), in other words abc, cbrd abd, and abd from this column.
For this I do not want to use a dimension table, since new strings could emerge in the future. Would you know how I could approach this extraction?

Related

How to split a column value into multiple columns based on a delimiter in Snowflake

I need to split a column value into multiple columns based on a delimiter.
I also need to create the columns dynamically based on the number of delimiters; the delimiter could be a comma or similar. Thanks.
You might be better off working with an array in this case. You haven't specified whether each record will have the same number of delimiters, so dynamically creating columns will take some scripting. If they do, you could potentially use SPLIT, LATERAL FLATTEN, and PIVOT, but the pivot needs static column names, so you'll likely need a stored procedure to deal with that.
To answer your first question, you can use SPLIT and/or SPLIT_PART to split the column into values. The SPLIT function creates an array for you, while the SPLIT_PART function creates the array but outputs a single value from it. A sketch of both follows the reference links below.
https://docs.snowflake.com/en/sql-reference/functions/split.html
https://docs.snowflake.com/en/sql-reference/functions/split_part.html
https://docs.snowflake.com/en/sql-reference/functions/flatten.html
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
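As a rough sketch of both functions (my_table and my_col are hypothetical names standing in for your table and delimited column):

    -- Explode a comma-delimited column into one row per value.
    SELECT t.value::STRING AS part
    FROM my_table,
         LATERAL FLATTEN(input => SPLIT(my_col, ',')) t;

    -- Or pull out a single element by (1-based) position.
    SELECT SPLIT_PART(my_col, ',', 1) AS first_part
    FROM my_table;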

Convert a comma separated string into different rows in a single query in InfluxDB

I have a table in InfluxDB which, for optimization reasons, contains a field that can hold either a single value, for example "FR204", or several values concatenated by commas, as in "FR204, FR301".
I would like to do exactly what is shown in this case (converting a comma-separated string into one row per value): https://rstopup.com/convertir-una-cadena-separada-por-comas-en-cada-una-de-las-filas.html
Is it possible to do this in InfluxDB? Thanks.

Can't migrate to BigQuery because BigQuery column names allow only English characters

BigQuery column names (fields) can only contain English letters, numbers, and underscores.
I am using Python and want to create a script to migrate my data from Postgres to BigQuery, but the Postgres tables have many non-English column names.
I will probably need to encode the column names to some format that BigQuery accepts, but I will also need the ability to later decode them back to the originals.
What is the best way to do this?
You can encode the column names to something like base64 and replace the +, =, and / characters with some kind of placeholder.
If you don't care about field length, you can encode to base32 instead (it's about 20% longer than base64, but it doesn't use '+' or '/', and '=' is used only for padding, so you can discard it without affecting the string).
Alternatively, you can build a small conversion table that maps each non-English character in your language to some combination of English characters; this will only work if you have a small number of non-English characters.
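As a minimal sketch of the base64 round trip, shown here in BigQuery SQL purely for illustration (in practice the renaming would happen in your Python migration script); 'città' stands in for a non-English column name, and the _p/_s/_e placeholders and the c_ prefix are arbitrary choices:

    -- Encode: base64 the name, then swap +, / and = for placeholders
    -- that are legal in BigQuery column names.
    WITH encoded AS (
      SELECT REPLACE(REPLACE(REPLACE(
               TO_BASE64(CAST('città' AS BYTES)),
               '+', '_p'), '/', '_s'), '=', '_e') AS enc
    )
    SELECT
      CONCAT('c_', enc) AS bq_column_name,  -- prefix so the name starts with a letter
      -- Decode: reverse the placeholder swaps, then base64-decode.
      CAST(FROM_BASE64(REPLACE(REPLACE(REPLACE(enc,
             '_e', '='), '_s', '/'), '_p', '+')) AS STRING) AS original_name
    FROM encoded;

Keeping the '=' padding as a placeholder, rather than discarding it, makes the decode a simple reverse substitution.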

Determine if substring corresponds to specific code (character types) in SQL

I have a collection of strings and want to filter out those where the last four characters are: (alpha)(alpha)(number)(number).
I know I can take a substring of each of these and check the characters separately, but what is the method for determining the types of the characters in the sequence?
This is for SQL in Hive.
You can use regular expressions. Something like:
    where col regexp '[a-zA-Z]{2}[0-9]{2}$'
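In Hive, a complete query could look like the following (my_table is a hypothetical table name; RLIKE is an equivalent synonym for REGEXP):

    -- Keep rows whose last four characters are two letters followed by two digits.
    SELECT col
    FROM my_table
    WHERE col REGEXP '[a-zA-Z]{2}[0-9]{2}$';

If "filter out" means you want to exclude those rows instead, negate the condition with NOT.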

Postgres select rows that have a percentage match, and sort by percentage

I have a table of sentences in a Postgres database. The table is called sentences and the column that stores the sentence for each row is called sentence.
How can I compare the sentences to a given sentence and return the ones in which, say, 60% of the words (or even better the roots of the words) match and then sort the results by the quality of the match?
Ideally a 90% match would come before a 70% match, and a 50% match wouldn't show at all.
Ideally it would exclude punctuation as well, but that's not a necessity.
Check out the fuzzystrmatch module, especially the levenshtein function. This calculates the "distance" between two strings, with lower values meaning they are more similar. It's generally used between two words, but as long as the sentences aren't too long (the maximum length for each argument is 255 characters), you could use it with sentences as well.
Then you would sort by the output of the levenshtein function ascending, so the results go from most to least similar.
If you want to exclude punctuation, call regexp_replace on each string with a regex matching the characters you want to remove, replace them with the empty string, and pass those return values as the arguments to levenshtein.
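Putting those pieces together, a minimal sketch, assuming the sentences table and sentence column from the question, fuzzystrmatch installed, and a hypothetical query string:

    -- One-time setup (requires appropriate privileges).
    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Strip punctuation from both strings, compute the edit distance,
    -- and sort from most to least similar.
    SELECT sentence,
           levenshtein(
             regexp_replace(sentence, '[[:punct:]]', '', 'g'),
             regexp_replace('The quick brown fox.', '[[:punct:]]', '', 'g')
           ) AS distance
    FROM sentences
    ORDER BY distance ASC;

To keep weak matches out entirely, you could wrap this in a subquery and filter on distance; note that levenshtein measures character edits rather than the percentage of matching words, so treat any threshold as approximate.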