Spark SQL concat and cast operations resulting in null values for a column

I am doing concat and cast operation inside spark SQL query as follows:
spark.sql("select cast(concat(df_view.col1, ' ') as Long) as new_col from df_view")
But I am getting null values in the resulting DF. If I perform only the cast or only the concat, I get the correct results, but with both operations together I get null values.
Please suggest whether I'm missing something in the syntax. I checked other answers but couldn't figure out the issue. Note that I am using only Spark SQL here, not DataFrame operations.

If you are writing the file as text then just don't cast it to Long, and preferably use pad functions to make sure you are writing the right width.
I take from the comments that your issue is fixed-width files, but as a general matter it makes no sense to concat a space onto a value and then try to cast the result to a number. You've explicitly made it not a number first.
Ideally you deal with the file format as a file format rather than by arbitrarily manipulating each field; however, the latter can work if you handle each field correctly.
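The padding idea can be sketched outside Spark as well. The snippet below is a minimal Python illustration, not the asker's pipeline: the field names and widths are hypothetical, and in Spark SQL itself the same effect would normally come from the built-in lpad and rpad functions. The point it shows is that every value stays a string and is padded to a declared width, with no numeric cast anywhere.

```python
# Hypothetical layout: field name -> fixed column width.
# A real layout would come from the specification of the target file.
FIELD_WIDTHS = {"id": 6, "name": 10}


def to_fixed_width(record):
    """Render one record as a fixed-width line, padding on the right with spaces."""
    parts = []
    for field, width in FIELD_WIDTHS.items():
        value = str(record[field])
        if len(value) > width:
            # Refuse to silently truncate data that does not fit the layout.
            raise ValueError(f"{field!r} value {value!r} exceeds width {width}")
        parts.append(value.ljust(width))
    return "".join(parts)


line = to_fixed_width({"id": 42, "name": "abc"})
```

Each line produced this way has the same total length (16 characters for the hypothetical layout above), which is what a fixed-width reader expects.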

Related

BigQuery - Inferring Datatypes of Column Values

What is the best way to determine the datatype of a column value if the data has already been loaded and the data has been classified as STRING datatype (i.e. BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but not sure if I'm missing any or any of these is substantially more performant. The result should include statistics on the grain of each value, not just per column.
Using a combination of CASE and SAFE_CAST on the STRING value to sum up all the instances where it successfully was able to CAST to X data type (where X is any datatype, like INT64 or DATETIME and having a few lines in query repeat the SAFE_CAST to cover all potential datatypes)
Similar to above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql)
(For above can also use countif(), if statements etc.)
Loading data into a pandas dataframe and using something like pd.api.types.infer_dtype to infer automatically, but this adds overhead and more components
Thanks!
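The first approach (counting how many values cast successfully to each candidate type) can be sketched in Python for illustration. This is only an analogue of the CASE plus SAFE_CAST idea, not BigQuery code: the type names are labels, the parsers are Python's own, and their acceptance rules differ from BigQuery's in edge cases (for example, Python's int() tolerates surrounding whitespace).

```python
from collections import Counter
from datetime import datetime


def _parse_datetime(value):
    # Assumed format for this sketch; real data may need several formats.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


# Candidate type label -> parser that raises ValueError on failure.
CHECKS = {
    "INT64": int,
    "FLOAT64": float,
    "DATETIME": _parse_datetime,
}


def count_castable(values):
    """Count, per candidate type, how many string values parse successfully."""
    counts = Counter()
    for value in values:
        for type_name, parser in CHECKS.items():
            try:
                parser(value)
            except ValueError:
                continue
            counts[type_name] += 1
    return counts


sample = ["1", "2.5", "2021-03-04 10:00:00", "abc"]
counts = count_castable(sample)
```

A value can count toward more than one type ("1" parses as both an integer and a float), so the narrowest type whose count equals the number of non-null values is the usual pick.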

Split multiple points in text format and switch coordinates in postgres column

I have a PostgreSQL column of type text that contains data like shown below
(32.85563, -117.25624)(32.855470000000004, -117.25648000000001)(32.85567, -117.25710000000001)(32.85544, -117.2556)
(37.75363, -121.44142000000001)(37.75292, -121.4414)
I want to convert this into another column of type text like shown below
(-117.25624, 32.85563)(-117.25648000000001,32.855470000000004 )(-117.25710000000001,32.85567 )(-117.2556,32.85544 )
(-121.44142000000001,37.75363 )(-121.4414,37.75292 )
As you can see, the values inside the parentheses have switched around. Also note that I have shown two records here to indicate that not all fields have same number of parenthesized figures.
What I've tried
I tried extracting the column to Java and performing my operations there, but due to the sheer number of records I have, I run out of memory. I also cannot do this in batches due to time constraints.
What I want
A SQL query or a sequence of SQL queries that will achieve the result that I have mentioned above.
I am using PostgreSQL 9.4 with pgAdmin III as the client.
This is the type of problem that should not be solved in SQL, but you are lucky to be using Postgres.
I suggest the following steps in defining your algorithm.
The first part turns your strings into structured data; the second transforms the structured data back into a string in the format you require.
From string to data
First, you need to turn your bracketed values into an array, which can be done with string_to_array function.
Now you can turn this array into rows with unnest function, which will return a row per bracketed value.
Finally, you need to split the values in each row into two fields.
From data to string
You need to group the results of the first query and wrap the values in the string_agg function, which combines the numbers across rows into a single string.
You will need to experiment with brackets to achieve exactly what you want.
PS. I am not providing query here. Once you have some code that you tried, let me know.
Assuming you also have a PK or some unique column, and possibly other columns, you can do as follows:
SELECT id, (...), string_agg(point(pt[1], pt[0])::text, '') AS col_reversed
FROM (
SELECT id, (...), unnest(string_to_array(replace(col, ')(', ');('), ';'))::point AS pt
FROM my_table) sub
GROUP BY id; -- assuming id is PK or no other columns
PostgreSQL has the point type which you can use here. First you need to make sure you can properly divide the long string into individual points (insert ';' between the parentheses), then turn that into an array of individual points in text format, unnest the array into individual rows, and finally cast those rows to the point data type:
unnest(string_to_array(replace(col, ')(', ');('), ';'))::point AS pt
You can then create a new point from the point you just created, but with the coordinates reversed, turn that into a string and aggregate into your desired output:
string_agg(point(pt[1], pt[0])::text, '') AS col_reversed
But you might also move away from the text format and make an array of point values as that will be easier and faster to work with:
array_agg(point(pt[1], pt[0])) AS pt_reversed
As I put in the question, I tried extracting the column to Java and performing my operations there, but due to the sheer number of records I have, I ran out of memory. I also could not do this in batches due to time constraints.
I ran out of memory because I was putting everything in a HashMap of <my_primary_key, the_newly_formatted_text>. As the text was sometimes very long and I had a very large number of records, it wasn't surprising that I got an OOM.
Solution that I used:
As suggested by many folks here, this problem was better solved with code. I wrote a small script that formatted the text to my liking and wrote the primary key and the newly formatted text to a file in TSV format. Then I imported the TSV into a new table and updated the original table from the new one.
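The script itself was not posted. A minimal sketch of what such a script might look like follows, in Python; the function names and the TSV layout (primary key, tab, reformatted text) are assumptions. It swaps the coordinates with a regular expression and writes one row at a time, so nothing accumulates in memory the way the HashMap did.

```python
import io
import re

# Matches one "(a, b)" pair. The numbers are handled as text, so no digits
# are lost to floating-point round-tripping.
PAIR = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")


def swap_pairs(text):
    """Turn every "(a, b)" in text into "(b, a)"."""
    return PAIR.sub(r"(\2, \1)", text)


def write_tsv(rows, out):
    """Write (primary_key, text) rows to a file-like object, one at a time."""
    for pk, text in rows:
        out.write(f"{pk}\t{swap_pairs(text)}\n")


# In-memory stand-in for a real output file, for illustration only.
buf = io.StringIO()
write_tsv([(1, "(32.85563, -117.25624)(32.85544, -117.2556)")], buf)
```

In a real run, rows would come from a database cursor that is iterated lazily and out would be an open file, so memory use stays flat regardless of table size.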

SQL Remove Substring From Query Results

I have a query that is returning data from a database. In a single field there is a rather long text comment with a segment, which is clearly defined with marking tags like !markerstart! and !markerend!. I would like to have a query return with the string segment between the two markers removed (and the markers removed too).
I would normally do this client-side after I get the data back. The problem, however, is that the query is an INSERT query that gets its data from a SELECT statement. I don't want the text segment to be stored in the archival/reporting table (working with an OLTP application here), so I need the SELECT statement to return exactly what is to be inserted. In this case, that means getting the SELECT statement to strip out the unwanted phrase instead of doing it in post-processing client-side.
My only thought is to use some convoluted combination of SUBSTRING, CHARINDEX, and CONCAT. I'm hoping there is a better way but, based on this, I don't see how. Anyone have ideas?
Sample:
This is a long string of text in some field in a database that has a segment that needs to be removed. !markerstart! This is the segment that is to be removed. It's length is unknown and variable. !markerend! The part of this field that appears after the marker should remain.
Result:
This is a long string of text in some field in a database that has a segment that needs to be removed. The part of this field that appears after the marker should remain.
SOLUTION USING STUFF:
I really don't like how verbose this is, but I can put it in a function if I really need to. It isn't ideal, but it is easier and faster than a CLR routine.
SELECT STUFF(CAST(Description AS varchar(MAX)), CHARINDEX('!markerstart!', Description), CHARINDEX('!markerend!', Description) + 11 - CHARINDEX('!markerstart!', Description), '') AS Description
FROM MyTable
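The arithmetic in that STUFF call is easier to follow in a small Python sketch of the same logic (the function name is hypothetical). The constant 11 in the query is the length of '!markerend!', so the removed span runs from the first character of the start marker through the last character of the end marker. One thing to check before relying on the T-SQL version: CHARINDEX returns 0 when a marker is absent, and STUFF returns NULL for a start position of 0, so rows without markers would need separate handling. The sketch returns such rows unchanged.

```python
START = "!markerstart!"
END = "!markerend!"  # 11 characters, matching the "+ 11" in the STUFF call


def strip_marked_segment(text):
    """Remove the first START...END segment, markers included."""
    start = text.find(START)
    if start == -1:
        return text  # no start marker: leave the text untouched
    end = text.find(END, start + len(START))
    if end == -1:
        return text  # unmatched start marker: leave the text untouched
    return text[:start] + text[end + len(END):]


result = strip_marked_segment("keep !markerstart! drop !markerend! tail")
```

Like the STUFF version, this leaves the spaces on both sides of the removed segment in place, so the result contains a double space where the segment used to be.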
You may want to consider implementing a CLR user-defined function that returns the parsed data.
The following link demonstrates how to use a CLR UDF RegEx function for pattern matching and data extraction.
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
You can use the STUFF function or the REPLACE function to replace the unwanted text with ''.
STUFF(expression, start_pos, number_of_chars, replace_expression)

SQL Server - simple select and conversion between int and string

I have a simple select statement like this:
SELECT [dok__Dokument].[dok_Id],
[dok__Dokument].[dok_WartUsNetto],
[dok__Dokument].[dok_WartUsBrutto],
[dok__Dokument].[dok_WartTwNetto],
[dok__Dokument].[dok_WartTwBrutto],
[dok__Dokument].[dok_WartNetto],
[dok__Dokument].[dok_WartVat],
[dok__Dokument].[dok_WartBrutto],
[dok__Dokument].[dok_KwWartosc]
FROM [dok__Dokument]
WHERE [dok_NrPelnyOryg] = 2753
AND [dok_PlatnikId] = 174
AND [dok_OdbiorcaId] = 174
AND [dok_PlatnikAdreshId] = 625
AND [dok_OdbiorcaAdreshId] = 624
Column dok_NrPelnyOryg is of type varchar(30), and not null.
The table contained both integer and string values in this column and this select statement was fired millions of times.
However recently this started crashing with message:
Conversion failed when converting the varchar value 'garbi czerwiec B' to data type int.
Little explanation: the table contains multiple "document" records and the mentioned column contains document original number (which comes from multiple different sources).
I know I can fix this by adding '' around the number, but I'm rather looking for an explanation of why this used to work and now crashes even though nothing was changed.
It's possible that a plan change (due to changed statistics, recompile etc) led to this data being evaluated earlier (full scan for example), or that this particular data was not in the table previously (maybe before this started happening, there wasn't bad data in there). If it is supposed to be a number, then make it a numeric column. If it needs to allow strings as well, then stop treating it like a number. If you properly parameterize your statements and always pass a varchar you shouldn't need to worry about whether the value is enclosed in single quotes.
All those equality comparison operations are subject to the Data Type Precedence rules of SQL Server:
When an operator combines two
expressions of different data types,
the rules for data type precedence
specify that the data type with the
lower precedence is converted to the
data type with the higher precedence.
Since character types have lower precedence than int types, the query is basically the same as:
SELECT ...
FROM [dok__Dokument]
WHERE cast([dok_NrPelnyOryg] as int) = 2753
...
This has two effects:
it makes all indexes on columns involved in the WHERE clause useless
it can cause conversion errors.
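The second effect can be illustrated with a small Python analogue (the row values are made up apart from the one quoted in the error message, and the function names are hypothetical). Converting the column to int means every row the engine touches must convert cleanly, whereas comparing as text never converts anything.

```python
rows = ["2753", "100", "garbi czerwiec B"]


def match_as_int(values, wanted):
    # Analogue of: WHERE cast(col as int) = 2753
    # Every scanned value is converted, so one bad value fails the whole scan.
    return [v for v in values if int(v) == wanted]


def match_as_text(values, wanted):
    # Analogue of: WHERE col = '2753'
    # No conversion takes place, so non-numeric values are simply skipped.
    return [v for v in values if v == wanted]


text_hits = match_as_text(rows, "2753")
```

Calling match_as_int(rows, 2753) raises ValueError on the third value, while the same call on only the first two rows succeeds. That mirrors a query that worked until a non-numeric value appeared among the rows being evaluated.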
You're not the first to have this problem, in fact several CSS cases I faced had me eventually write an article about this: On SQL Server boolean operator short-circuit.
The correct solution to your problem is that if the field value is numeric then the column type should be numeric. Since you say that the data comes from a third-party application you cannot change, the best solution is to abandon the vendor of this application and pick one that knows what it is doing. Short of that, you need to search character columns with character values:
SELECT ...
FROM [dok__Dokument]
WHERE [dok_NrPelnyOryg] = '2753'
...
In managed ADO.NET terms, this means using a SqlCommand as follows:
SqlCommand cmd = new SqlCommand(@" SELECT ...
FROM [dok__Dokument]
WHERE [dok_NrPelnyOryg] = @nrPelnyOryg
... ");
cmd.Parameters.Add("@nrPelnyOryg", SqlDbType.VarChar).Value = "2754";
...
Just make sure you don't fall into the easy trap of passing in an NVARCHAR parameter (Unicode) for comparison with a VARCHAR column, since the same data type precedence rules quoted above will coerce the comparison to occur on the NVARCHAR type, thus rendering indexes, again, useless. The easiest way to fall into this trap is to use the dreaded AddWithValue and pass in a string value.
Your query stopped working because someone inserted a text string into the field that you are querying with an INT. Up until that time it was possible to implicitly convert the data, but that's no longer the case.
I'd go check your data and, more importantly, the model. As Aaron said, do you need to allow strings in that field? If not, change the data type to prevent this happening in the future.

Force numerical order on a SQL Server 2005 varchar column, containing letters and numbers?

I have a column containing the strings 'Operator (1)' and so on until 'Operator (600)' so far.
I want to get them numerically ordered and I've come up with
select colname from table order by
cast(replace(replace(colname,'Operator (',''),')','') as int)
which is very very ugly.
Better suggestions?
It's that, InStr()/SubString(), changing Operator(1) to Operator(001), storing the n in Operator(n) separately, or creating a computed column that hides the ugly string manipulation. What you have seems fine.
If you really have to leave the data in the format you have - and adding a numeric sort order column is the better solution - then consider wrapping the text manipulation up in a user defined function.
select colname from table order by dbo.udfSortOperator(colname)
It's less ugly and gives you some abstraction. There's an additional overhead from the function call, but on a table containing low thousands of rows in a not-too-heavily hit database server it's not a major concern. Make notes in the function to optimise later as required.
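The body of udfSortOperator was not given. As an illustration of what such a function would compute, here is a Python sketch of the same idea (the function name is hypothetical): pull the number out of the parentheses once, in one named place, and sort on it.

```python
import re

# Captures the digits inside the trailing parentheses, e.g. "Operator (42)".
NUMBER_IN_PARENS = re.compile(r"\((\d+)\)\s*$")


def operator_number(name):
    """Return the integer inside the parentheses of an 'Operator (n)' string."""
    match = NUMBER_IN_PARENS.search(name)
    if match is None:
        raise ValueError(f"no operator number in {name!r}")
    return int(match.group(1))


names = ["Operator (10)", "Operator (2)", "Operator (600)", "Operator (1)"]
ordered = sorted(names, key=operator_number)
```

A plain string sort would put 'Operator (10)' before 'Operator (2)'; sorting on the extracted integer gives 1, 2, 10, 600.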
My answer would be to change the problem. I would add an operatorNumber field to the table if that is possible. Change the update/insert routines to extract the number and store it. That way the string conversion hit is only once per record.
Otherwise, the ordering logic requires the string conversion every time the query is run.
Well, first define the meaning of that column. Is operator a name so you can justify using chars? Or is it a number?
If the field is a name then you will use chars, and then you would want to decide on a fixed length. Pad all operator names with zeros on the left. Define naming rules for operators (e.g. no letters, or the codes you would use in a series like "A001").
An index will sort the physical data in the server, and a properly defined text naming scheme will sort them in a query. You would want both.
If the operator is a number, then you got the data type for that column wrong and it needs to be changed.
Indexed computed column
If you find yourself ordering on or otherwise querying the operator column often, consider creating a computed column for its numeric value and adding an index on it. This will give you a computed/persisted column (which sounds like an oxymoron, but isn't).