Issue while converting string data to decimal in proper format in Spark SQL

I am facing an issue in Spark SQL while converting a string to decimal(15,7).
Input data is:
'0.00'
'28.12'
'-39.02'
'28.00'
I have tried converting it to float and then to decimal, but got unexpected results.
sqlContext.sql("select cast(cast('0.00' as float) as decimal(15,7)) from table").show()
The result I received is as follows
0
But I need to have data in the below format:
0.0000000
28.1200000
-39.0200000
28.0000000

You can try using the format_number function. Something like this:
import org.apache.spark.sql.functions.{col, format_number}
df.withColumn("num", format_number(col("value").cast("decimal(15,7)"), 7)).show()
The results should be like this.
+------+-----------+
| value| num|
+------+-----------+
| 0.00| 0.0000000|
| 28.12| 28.1200000|
|-39.02|-39.0200000|
| 28.00| 28.0000000|
+------+-----------+
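format_number is also available directly in SQL, so the asker's sqlContext.sql approach works too (a sketch, assuming the strings sit in a column named value):
sqlContext.sql("select format_number(cast(value as decimal(15,7)), 7) as num from table").show()
One caveat: format_number formats with the pattern #,###,###,##0, so it inserts grouping separators (e.g. 1,234.5600000) once values reach 1,000.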


In Hive, how to compare array of string with hivevar list?

In Hive, I have a column date that looks like the below, an array of arrays of strings. I have another hivevar that looks like this:
set hivevar:sunny = ('2022-12-17', '2022-12-21', '2023-01-15');
|date|
|----|
|[["2022-11-14"],["2022-12-14"]]|
|[["2022-11-14","2022-11-17"],["2022-12-14","2022-12-17"]]|
|[["2022-11-21"],["2022-12-21"]]|
|[["2023-01-08"]]|
|[["2022-11-15"],["2022-12-15"],["2023-01-15"]]|
I want to check, for each row, whether any of the values is part of the sunny list, so I want to get something like the below. I thought of using any and array &&, but they don't work in Hive. Can anyone help?
|result|
|----|
|false|
|true|
|true|
|false|
|true|
explode() can't be nested inside another function call in Hive; it has to go through a LATERAL VIEW. Since date is an array of arrays, flatten it with two LATERAL VIEWs and then aggregate per row. A sketch, assuming your Hive version allows grouping by an array column (otherwise group by a unique row key instead):
SELECT date,
       SUM(CASE WHEN d IN ${hivevar:sunny} THEN 1 ELSE 0 END) > 0 AS result
FROM mytable
LATERAL VIEW explode(date) o AS inner_dates
LATERAL VIEW explode(inner_dates) i AS d
GROUP BY date;

How to use KQL to format datetime stamp in 'yyyy-MM-ddTHH:mm:ss.fffZ'?

I receive the error format_datetime(): failed to parse format string in argument #2 when trying to format_datetime() using ISO8601 format yyyy-MM-ddTHH:mm:ss.fffZ.
If I leave the T and the Z out, it works.
Surely KQL can format datetime stamps in a timezone-aware format; I'm just missing it. I read the docs, and it appears that T and Z are not supported format specifiers or delimiters, yet every example in the docs shows the T and Z present(?).
Example:
StorageBlobLogs
| where
AccountName == 'stgtest'
| project
TimeGenerated = format_datetime(TimeGenerated, 'yyyy-MM-ddTHH:mm:ss.fffZ'), //herein lies the issue
AccountName,
RequestBodySize = format_bytes(RequestBodySize)
| sort by TimeGenerated asc
If the code is changed to...
- `TimeGenerated = format_datetime(TimeGenerated, 'yyyy-MM-dd HH:mm:ss.fff')`
...it works, but this is not a timezone-aware timestamp (something I prefer to work in to reduce confusion).
Relevant docs: datetime_utc_to_local() and Timezones.
I would highly recommend not doing that.
If possible, let the time zone and format be dealt with on the client side.
All datetime values in KQL are UTC. Always.
Even the result of datetime_utc_to_local() is another UTC datetime.
That may lead to (what seems like) unexpected behavior of datetime manipulations, as in the following example.
StorageBlobLogs
| sample 10
| project TimeGenerated
| extend Asia_Jerusalem = datetime_utc_to_local(TimeGenerated, "Asia/Jerusalem")
,Europe_London = datetime_utc_to_local(TimeGenerated, "Europe/London")
,Japan = datetime_utc_to_local(TimeGenerated, "Japan")
| extend Asia_Jerusalem_str = format_datetime(Asia_Jerusalem ,"yyyy-MM-dd HH:mm:ss.fff")
,Europe_London_str = format_datetime(Europe_London ,"yyyy-MM-dd HH:mm:ss.fff")
,Japan_str = format_datetime(Japan ,"yyyy-MM-dd HH:mm:ss.fff")
| project-reorder TimeGenerated, Asia_Jerusalem, Asia_Jerusalem_str, Europe_London, Europe_London_str, Japan, Japan_str
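If you do need the literal T and Z in the output string, one common workaround is to assemble the string with strcat(); since the underlying values are UTC anyway, appending Z is accurate (a sketch against the asker's StorageBlobLogs source):
StorageBlobLogs
| project TimeGenerated = strcat(format_datetime(TimeGenerated, 'yyyy-MM-dd'), 'T', format_datetime(TimeGenerated, 'HH:mm:ss.fff'), 'Z')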

Select on CLOB XML DB2

I have this data as a CLOB field in DB2. I am converting the data to char using CAST:
SELECT CAST(CLOBColumn as VARCHAR(32000))
FROM Schema.MyTable;
Here is how the result XML comes out from the above:
<TreeList TreeNo="ABC">
<Tree ErrorCode="INVALID_TREE" ErrorDescription="Tree doesn’t exist." TreeID="123456"/>
<Tree ErrorCode="INVALID_TREE" ErrorDescription="Tree doesn’t exist." TreeID="1234567"/>
</TreeList>
And this is how I expect my output:
|TreeNo|TreeID|ErrorCode|ErrorDescription|
|----|----|----|----|
|ABC|123456|INVALID_TREE|Tree doesn’t exist|
|ABC|1234567|INVALID_TREE|Tree doesn’t exist|
How do I achieve this?
You need to use the XMLTABLE function, which allows you to map XML data to a table. It expects XML-typed data, so it works if you directly parse the CLOB to XML with XMLPARSE. Note that XPath addresses attributes with @. The SELECT would look like the following (you get the idea):
SELECT x.*
FROM schema.mytable, XMLTABLE(
  '$D/TreeList/Tree' PASSING XMLPARSE(DOCUMENT CLOBColumn) AS "D"
  COLUMNS
    TreeNo VARCHAR(10) PATH '../@TreeNo',
    TreeID INT PATH '@TreeID',
    ...
) AS x
;

Get the last element from Apache Spark SQL split() Function

I want to get the last element from the array returned by the Spark SQL split() function.
split('4:3-2:3-5:4-6:4-5:2', '-')
I know I can get it with
split('4:3-2:3-5:4-6:4-5:2', '-')[4]
But I want another way for when I don't know the length of the array.
Please help me.
You can also use the Spark SQL reverse() function on a column after split().
For example:
SELECT reverse(split(MY_COLUMN,'-'))[0] FROM MY_TABLE
Here [0] gives you the first element of the reversed array, which is the last element of the initial array.
Please check substring_index; it should work exactly as you want:
substring_index(lit("1-2-3-4"), "-", -1) // 4
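Applied in plain Spark SQL to the asker's data, a negative count makes substring_index take everything after the last delimiter (a quick sketch):
SELECT substring_index('4:3-2:3-5:4-6:4-5:2', '-', -1); -- returns '5:2'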
You could use a UDF to do that, as follows:
val df = sc.parallelize(Seq((1L,"one-last1"), (2L,"two-last2"), (3L,"three-last3"))).toDF("key","Value")
+---+-----------+
|key|Value |
+---+-----------+
|1 |one-last1 |
|2 |two-last2 |
|3 |three-last3|
+---+-----------+
import org.apache.spark.sql.functions.{col, split, udf}
import scala.util.Try

val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)
val with_just_last = df.withColumn("Last", get_last(split(col("Value"), "-")))
+---+-----------+--------+
|key|Value |Last |
+---+-----------+--------+
|1 |one-last1 |last1 |
|2 |two-last2 |last2 |
|3 |three-last3|last3 |
+---+-----------+--------+
Remember that the split function from SparkSQL can be applied to a column of the DataFrame.
If you are using Java, you can combine the same reverse trick with getItem: functions.reverse(functions.split(col("MY_COLUMN"), "-")).getItem(0); taking item 0 of the reversed array gives the last element of the original.
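On Spark 2.4 and later there is also element_at, which accepts a negative index counting from the end, so no reversing is needed (a sketch in SQL):
SELECT element_at(split('4:3-2:3-5:4-6:4-5:2', '-'), -1); -- returns '5:2'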

Hive JSON Serde MetaStore Issue

I have an external table with JSON data and I am using a JsonSerDe to populate it. The data is populated properly, and when I query it I can see the results correctly.
But when I use the desc command on that table, I get the text from deserializer for all the column comments.
Below is the table creation ddl.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
field1 string COMMENT 'This is a field1',
field2 int COMMENT 'This is a field2',
field3 string COMMENT 'This is a field3',
field4 double COMMENT 'This is a field4'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
Location '/user/uszszb6/json_test/data';
Entries in the data file.
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
When I use the command desc my_table, I get the below output.
+-----------+------------+--------------------+--+
| col_name | data_type | comment |
+-----------+------------+--------------------+--+
| field1 | string | from deserializer |
| field2 | int | from deserializer |
| field3 | string | from deserializer |
| field4 | double | from deserializer |
+-----------+------------+--------------------+--+
The JsonSerDe is not able to capture the comments properly. I have also tried other JSON SerDes like
org.openx.data.jsonserde.JsonSerDe
org.apache.hive.hcatalog.data.JsonSerDe
com.amazon.elasticmapreduce.JsonSerde
But the desc command output is the same. There is a JIRA ticket for this bug: https://issues.apache.org/jira/browse/HIVE-6681
According to the ticket it was resolved in version 0.13, but I am using Hive 1.2.1 and still facing this issue.
Could anyone share their thoughts on resolving this issue?
Yeah, it looks like it's a Hive bug that affects all the JSON SerDes, but have you tried using DESCRIBE EXTENDED?
DESCRIBE EXTENDED my_table;
hive> describe extended json_serde_test;
OK
browser string from deserializer
device_uuid string from deserializer
custom struct<customer_id:string> from deserializer
Detailed Table Information
Table(tableName:json_serde_test,dbName:default, owner:rcongiu,
createTime:1448477902, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:browser, type:string,
comment:hello), FieldSchema(name:device_uuid, type:string, comment:my
name is elder price), FieldSchema(name:custom,
type:struct<customer_id:string>, comment:null)],
location:hdfs://localhost:9000/user/hive/warehouse/json_serde_test,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.openx.data.jsonserde.JsonSerDe, parameters:
{serialization.format=1, mapping.customer_id=Customer ID}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=1,
transient_lastDdlTime=1448477903, COLUMN_STATS_ACCURATE=true,
totalSize=128, numRows=0, rawDataSize=0}, viewOriginalText:null,
viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.073 seconds, Fetched: 5 row(s)
This will output a JSON-ish detailed description that includes comments. It's kind of hard to read, but it does show the comments, and it may be enough for your purposes, or not.
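A related thing to try: DESCRIBE FORMATTED prints much of the same metastore detail in a more tabular layout, though whether the per-column comments render there or still show "from deserializer" depends on the same SerDe bug.
DESCRIBE FORMATTED my_table;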