How to handle spaces in Column names in spark.sql in Domino tool - apache-spark-sql

I have a temp table created from a parquet file that has spaces in its column names. Although the DataFrame is created successfully, when I try to query the temp table I get an error on the column names: Attribute name "Comm Rule HDR ID" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it. I tried backticks, but that does not work in the Domino tool, although it worked in the Databricks tool.
spark.read.parquet("s3 bucket").createOrReplaceTempView("test")
spark.sql("""select * from test""").show()
spark.sql("""select 'Comm Rule HDR ID' as sample from test""").show()
spark.sql throws the error. If I use backticks alone, they are not recognized and I get the same error; if I wrap the name in single or double quotes, it is treated as a string literal and the same value is printed for every row, as shown below. Like I said, backticks work in the Databricks tool, but this time I need to run in the Domino tool. I'd appreciate any input.
+----------------+
| sample|
+----------------+
|Comm Rule HDR ID|
|Comm Rule HDR ID|
|Comm Rule HDR ID|
|Comm Rule HDR ID|
|Comm Rule HDR ID|
|Comm Rule HDR ID|
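One workaround (a sketch, not something from the original post) is to strip the spaces out of the column names before registering the temp view, so the SQL never has to reference a name that needs backticks:
# assumes the same parquet path as in the question; spaces become underscores
df = spark.read.parquet("s3 bucket")
df = df.toDF(*[c.replace(" ", "_") for c in df.columns])
df.createOrReplaceTempView("test")
spark.sql("""select Comm_Rule_HDR_ID from test""").show()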

Related

How to make a Splunk field contain a double quote

When using Splunk, if we have a log entry
key="hello"
and search in Splunk with
* | table a
we can see the value hello.
We might print out a value containing a double quote. If we don't escape it,
key="hel"lo"
we'll see the key value is hel; the value breaks before the position of the quote.
If we try to escape the double quote with \,
key="hel\"lo"
we'll see the key value is hel\.
If we use single quotes around the value,
key='hel"lo'
we'll see the key value includes the single quotes; it's 'hel"lo'. In this case, the search criteria should be
* key="'ab\"c'" | table a
where the single quotes are part of the value.
The question is: how do we include a double quote as part of the value?
Ideally, there should be a way to escape double quotes, so that the input
key="hel\"lo"
would match the query
key="hel\"lo"
But it does not.
I have had this problem for many years. The Splunk value is dynamic and could contain double quotes. I'm not going to use JSON as my log format.
I'm curious why there is no answer on Splunk's official website.
Can someone help? Thanks.
| makeresults
| eval bub="hell\"o"
| table bub
Puts a double-quote mark right in the middle of the bub field
If you want to search for the double-quote mark, use | where match() like this:
| where match(bub,"\"")
Ideally, the data source would not generate events with embedded quotes without escaping them. Otherwise, how would a reader know the quote is embedded and not mismatched? This is the problem Splunk is struggling with.
The fix is to create your own parser using transforms.
In props.conf:
[mysourcetype]
TRANSFORMS-parseKey = parse_key
In transforms.conf:
[parse_key]
REGEX = (\w+)="(.*\".*)"
FORMAT = $1::$2
Of course, this regex is simplified. You'll need to modify it to match your data.

Spark write CSV not writing unicode character

I have a string containing a Unicode character (Ctrl-B) as the last character in one column of the DataFrame.
After writing it to CSV using Spark, the string no longer has the last Unicode character (Ctrl-B).
df.show()
+------------+-------+
|a | b|
+------------+-------+
| 25|0^B^B0^B|
+------------+-------+
df.write.format("com.databricks.spark.csv").save("/home/test_csv_data")
vim /home/test_csv_data/part*
25,0^B^B0
It doesn't have the last ctrl-B character.
But if I write it in ORC or Parquet format using Spark, the last Ctrl-B is present.
Please guide me on why this is happening. How can I keep the Ctrl-B at the end in the CSV?
'^B' is considered whitespace, and the default setting of ignoreTrailingWhiteSpace is true when writing CSV, which removes it, so you can set it to false:
df.write.option("ignoreTrailingWhiteSpace","false").format("com.databricks.spark.csv").save("/home/test_csv_data")

Getting error "Regex: syntax error in subpattern name (missing terminator)" in Splunk

I have been extracting fields in Splunk and this works fine for all headers, but for the header l-m-n I am getting the error "syntax error in subpattern name (missing terminator)".
I have done similar extractions for other headers and they all work, but this is the only header with a hyphen, and that is what is giving the error. I have tried multiple times but it is not helping.
Headers:
Content-Type: application/json
Accept: application/json,application/problem json
l-m-n: txxxmnoltr
Accept-Encoinding:gzip
The regex I am trying is rex field=u "l-m-n: (?<l-m-n>.*)" in Splunk. Could you please guide me here?
rex cannot extract into a field name with hyphens. However, you can solve this with rename
| rex field=u "l-m-n: (?<lmn>.*)" | rename lmn AS "l-m-n"
In general, I would avoid the use of hyphens in a field name, as it can be mistaken for a minus. If you want to use the field l-m-n, you will need to quote it everywhere, like 'l-m-n'. I would strongly suggest you stick with using the field name lmn.
Try running the following to see what I mean
| makeresults | eval l-m-n=10 | eval l=1 | eval m=1 | eval n=1 | eval result_noquote=l-m-n | eval result_quoted='l-m-n'

Copy csv & json data from S3 to Redshift

I have data in the format below in an S3 bucket.
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into a database table like below.
Table
year | env | desc
2010-9 | aws cloud |{"id":1,"name":"test"}
2010-9 | aws cloud1 |{"id":2,"name":"test2"}
I have written this command, but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;
You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per rfc-4180 - https://www.ietf.org/rfc/rfc4180.txt
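For example, the sample rows above would need to look like this once the embedded quotes are doubled (a sketch, not the original data):
"2010-9","aws cloud","{""id"":1,""name"":""test""}"
"2010-9","aws cloud1","{""id"":2,""name"":""test2""}"
With the quotes escaped this way, the COPY would typically use the CSV parameter (which understands doubled quotes) rather than REMOVEQUOTES.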
I've also loaded JSON into a text field in Redshift and then used the JSON functions to parse the field. Works great.

parse text file and remove white space

I have a file written from a COBOL program that produces a pipe-delimited file. The file contains white space and "null" placeholders that I need to get rid of before rewriting the file. What is the best way to do that? I have SQL Server and Visual Studio that can be used to write the script, but I'm not sure which one to use or exactly how. The script will need to read through many different files in a folder. The data is being converted from an old system into a new one. Also, I need to keep spaces between words, e.g. in a business name or an address. I was going to use SQL, but I can only find examples that read fields from a database.
Example file (one line):
0000000009|LName |FName | | | | | | |1|1|0|000|000|000000000|
1||null null null| | null null|null null null null| |1|0|
Desired output:
0000000009|LName|Fname|||||||1|1|0|000|000|000000000|1||||||1|0|
Thanks!!
You said you can use Visual Studio, so this example uses C#.
I suppose you will load your file content into a string, then you can apply some replaces:
s.Replace("null", string.Empty).Replace(" |", "|").Replace("| ", "|").Replace("| |", "||");
I know there are probably a lot of much more elegant solutions: this is quick and dirty, but it will output the string you need.
Hope this helps.
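Since the question also mentions looping over many files in a folder, here is a minimal standalone sketch of the same cleanup in Python (the folder paths and file pattern are assumptions, not from the original post):
import glob
import os

src_dir = "C:/data/in"    # assumed input folder
dst_dir = "C:/data/out"   # assumed output folder
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.txt")):
    cleaned = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            # drop the "null" tokens, then trim spaces inside each pipe-delimited field
            fields = line.rstrip("\n").replace("null", "").split("|")
            cleaned.append("|".join(fld.strip() for fld in fields))
    with open(os.path.join(dst_dir, os.path.basename(path)), "w", encoding="utf-8") as dst:
        dst.write("\n".join(cleaned) + "\n")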