Select all except a particular column in Spark SQL

I want to select all columns in a table except StudentAddress, and hence I wrote the following query:
select `(StudentAddress)?+.+` from student;
It gives the following error in the SQuirreL SQL client.
org.apache.spark.sql.AnalysisException: cannot resolve '(StudentAddress)?+.+' given input columns

You can use the drop() method in the DataFrame API to drop a particular column and then select all the remaining columns.
For example:
val df = hiveContext.read.table("student")
val dfWithoutStudentAddress = df.drop("StudentAddress")
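A hedged side note, assuming a newer Spark version (2.3 or later): Spark SQL can also be told to treat backquoted identifiers in a SELECT list as column regular expressions (Hive-style) by enabling the spark.sql.parser.quotedRegexColumnNames option, so the backquoted pattern from the question is parsed as a regex rather than as a literal column name. A minimal sketch:

// Assumption: Spark 2.3+ and a SparkSession named `spark`.
// With this option enabled, backquoted identifiers in the SELECT list
// are interpreted as column regular expressions instead of column names.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
spark.sql("select `(StudentAddress)?+.+` from student").show()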

Using Spark SQL, try:
select * except(<columns to be excluded>) from tablename
Example:
select * from tmp
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   a|   b|   c|   d|
+----+----+----+----+

-- exclude col1, col2
select * except(col1, col2) from tmp
+----+----+
|col3|col4|
+----+----+
|   c|   d|
+----+----+

Related

pyspark hive sql convert array(map(varchar, varchar)) to string by rows

I would like to transform a column of
array(map(varchar, varchar))
into rows of strings in a table on Presto DB, programmatically via PySpark/Hive SQL from a Jupyter notebook (Python 3).
Example:
user_id  sport_ids
'aca'    [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]
Expected results:
user_id  sport_ids
'aca'    '5818'
'aca'    '6712'
'aca'    '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got the error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got the error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something?
I have also tried these, but they were not helpful:
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks
Let's try using higher-order functions to extract the map values and explode them into individual rows:
from pyspark.sql.functions import explode, expr

df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+
On the Presto side you can process the JSON data (json_parse, cast to array(json), and json_extract_scalar; see the Presto documentation for more JSON functions) and flatten it with unnest:
-- sample data
WITH dataset(user_id, sport_ids) AS (
VALUES
('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:
user_id | sport_id
--------+---------
aca     | 5818
aca     | 6712
aca     | 1065

Issue using JSON_MODIFY with JSON_QUERY in a CASE statement

Here's the issue. I have a column in my database (type nvarchar(max)) that I am storing JSON in. I am storing both plain strings and objects in that column, like the following:
JsonTable
|--|-------------------|
|Id|JsonValue |
|--|-------------------|
|0 |{"sample":"object"}|
|--|-------------------|
|1 |"plain-string" |
|--|-------------------|
I am trying to use JSON_MODIFY to merge these values with another table's values.
The following works fine for just objects, but not strings:
SELECT JSON_MODIFY('{}', '$.Values', JSON_QUERY(JsonValue))
FROM JsonTable
WHERE Id = 0 -- Fails when string is included in results
-- Result = |------------------------------|
            |{"Values":{"sample":"object"}}|
            |------------------------------|
However, it fails to parse the ordinary string (understandably, since it is not JSON).
So my solution was to add a CASE statement to handle strings. However, this does not work: wrapping it in a CASE statement string-escapes the JSON_QUERY output and garbles it in the final JSON_MODIFY result.
The following does not work as expected:
SELECT JSON_MODIFY('{}', '$.Values',
    CASE
        WHEN ISJSON(JsonValue) > 0 THEN JSON_QUERY(JsonValue)
        ELSE REPLACE(JsonValue, '"', '')
    END)
FROM JsonTable
-- Result = |--------------------------------------|
            |{"Values":"{\"sample\":\"object\"}"}  |
            |--------------------------------------|
            |{"Values":"plain-string"}             |
            |--------------------------------------|
I was unable to figure out why wrapping JSON_QUERY in a CASE statement doesn't return properly, but I started using this workaround instead. It is a bit verbose and messy, but it works perfectly fine:
SELECT
    CASE
        WHEN ISJSON(JsonValue) > 0
            THEN (SELECT JSON_MODIFY('{}', '$.Values', JSON_QUERY(JsonValue)))
        ELSE (SELECT JSON_MODIFY('{}', '$.Values', REPLACE(JsonValue, '"', '')))
    END
FROM JsonTable
-- Result = |-------------------------------------|
            |{"Values":{"sample":"object"}}       |
            |-------------------------------------|
            |{"Values":"plain-string"}            |
            |-------------------------------------|
Have you tried using the STRING_ESCAPE function to format your string for JSON, i.e. assuming your issue is related to correctly escaping the quotes? http://sqlfiddle.com/#!18/9eecb/24391
SELECT JSON_MODIFY('{}', '$.Values',
    CASE
        WHEN ISJSON(JsonValue) = 1 THEN JSON_QUERY(JsonValue)
        ELSE STRING_ESCAPE(JsonValue, 'json')
    END
)
FROM (
    VALUES
        (0, '{"sample":"object"}'),
        (1, 'plain-string'),
        (2, '"plain-string2"')
) JsonTable(Id, JsonValue)
STRING_ESCAPE documentation

How to pivot on arbitrary column?

I use Apache Spark 2.2.0 and Scala.
I'm following the question as a guide to pivot a dataframe without using the pivot function.
I need to pivot the dataframe without using the pivot function, since I have non-numerical data and pivot works with an aggregation function such as sum, min, or max on numerical data only. I've got a non-numerical column I'd like to use in the pivot aggregation.
Here's my data:
+---+-------------+----------+-------------+----------+-------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|
+---+-------------+----------+-------------+----------+-------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)|
| 2|Question2Text| No| abcde1| 0|(x1,y1)|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)|
| 1|Question1Text| No| abcde2| 0|(x2,y2)|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)|
+---+-------------+----------+-------------+----------+-------+
I want to group by ParticipantID, Assessment and GeoTag, "pivot" on the Question column, and take the values from the AnswerText column. In the end, the output should look as follows:
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment|GeoTag |Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
|abcde1       |0         |(x1,y1)|Yes  |No   |3    |
|abcde2       |0         |(x2,y2)|No   |Yes  |null |
+-------------+----------+-------+-----+-----+-----+
I have tried this:
val questions: Array[String] = df.select("Q_id")
  .distinct()
  .collect()
  .map(_.getAs[String]("Q_id"))
  .sortWith(_ < _)

val df2: DataFrame = questions.foldLeft(df) {
  case (data, question) =>
    data.selectExpr("*", s"IF(Q_id = '$question', AnswerText, 0) AS $question")
}
[followed by a GroupBy expression]
But I'm getting the following error, which must have something to do with the syntax of the final AS $question clause:
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: *
17/12/08 16:13:12 INFO SparkSqlParser: Parsing command: IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '?' expecting <EOF>(line 1, pos 104)
== SQL ==
IF(Q_id_string_new_2 = '101_Who_is_with_you_right_now?', AnswerText, 0) AS 101_Who_is_with_you_right_now?
--------------------------------------------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
Any ideas where I am going wrong? Is there a better way? I thought about reverting to Pandas and Python outside Spark if necessary, but I'd rather write all the code within the same framework if possible.
Since $question substitutes the value of the question variable into the SQL statement, you end up with a column name containing '?'. ? is not a valid character in a column name, so at the very least you have to quote it with backticks:
s"IF(Q_id = '$question', AnswerText, 0) AS `$question`"
or use select / withColumn:
import org.apache.spark.sql.functions.when

val df2: DataFrame = questions.foldLeft(df) {
  case (data, question) =>
    data.withColumn(question, when($"Q_id" === question, $"AnswerText"))
}
or sanitize the strings first, using regexp_replace (a sketch follows below).
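A minimal sketch of the sanitizing option, assuming the same df and Q_id column as above; the replacement pattern here is only an illustration:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip anything that is not a letter, digit or underscore from Q_id
// before its distinct values are collected and used as column names.
val sanitized = df.withColumn("Q_id", regexp_replace(col("Q_id"), "[^a-zA-Z0-9_]", "_"))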
As for "need to pivot the dataframe without using the pivot function as I have non-numerical data and df.pivot only works with an aggregation function like sum, min, max on numerical data":
You can use first: How to use pivot and calculate average on a non-numeric column (facing AnalysisException "is not a numeric column")?
data.groupBy($"ParticipantID", $"Assessment", $"GeoTag")
.pivot($"Question", questions).agg(first($"AnswerText"))
Just a note on the accepted answer by @user8371915 to make the query a bit faster.
There is a way to avoid the expensive scan that generates questions for the headers:
simply generate the headers (in the same job and stage!) and then pivot on that column.
// It's a simple and cheap map-like transformation
import org.apache.spark.sql.functions.{concat, lit}
val qid_header = input.withColumn("header", concat(lit("Qid_"), $"Qid"))
scala> qid_header.show
+---+-------------+----------+-------------+----------+-------+------+
|Qid| Question|AnswerText|ParticipantID|Assessment| GeoTag|header|
+---+-------------+----------+-------------+----------+-------+------+
| 1|Question1Text| Yes| abcde1| 0|(x1,y1)| Qid_1|
| 2|Question2Text| No| abcde1| 0|(x1,y1)| Qid_2|
| 3|Question3Text| 3| abcde1| 0|(x1,y1)| Qid_3|
| 1|Question1Text| No| abcde2| 0|(x2,y2)| Qid_1|
| 2|Question2Text| Yes| abcde2| 0|(x2,y2)| Qid_2|
+---+-------------+----------+-------------+----------+-------+------+
With the headers as part of the dataset, let's pivot.
val solution = qid_header
  .groupBy('ParticipantID, 'Assessment, 'GeoTag)
  .pivot('header)
  .agg(first('AnswerText))
scala> solution.show
+-------------+----------+-------+-----+-----+-----+
|ParticipantID|Assessment| GeoTag|Qid_1|Qid_2|Qid_3|
+-------------+----------+-------+-----+-----+-----+
| abcde1| 0|(x1,y1)| Yes| No| 3|
| abcde2| 0|(x2,y2)| No| Yes| null|
+-------------+----------+-------+-----+-----+-----+

Get the last element from Apache Spark SQL split() Function

I want to get the last element of the array returned by the Spark SQL split() function:
split('4:3-2:3-5:4-6:4-5:2', '-')
I know I can get it with
split('4:3-2:3-5:4-6:4-5:2', '-')[4]
but I want another way for when I don't know the length of the array.
Please help me.
You can also use the Spark SQL reverse() function on a column after split().
For example:
SELECT reverse(split(MY_COLUMN,'-'))[0] FROM MY_TABLE
Here [0] gives you the first element of the reversed array, which is the last element of the initial array.
Please check substring_index; it should work exactly as you want:
substring_index(lit("1-2-3-4"), "-", -1) // 4
You could use a UDF to do that, as follows:
val df = sc.parallelize(Seq((1L,"one-last1"), (2L,"two-last2"), (3L,"three-last3"))).toDF("key","Value")
+---+-----------+
|key|Value |
+---+-----------+
|1 |one-last1 |
|2 |two-last2 |
|3 |three-last3|
+---+-----------+
import scala.util.Try
import org.apache.spark.sql.functions.{col, split, udf}
val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)
val with_just_last = df.withColumn("Last", get_last(split(col("Value"), "-")))
+---+-----------+--------+
|key|Value |Last |
+---+-----------+--------+
|1 |one-last1 |last1 |
|2 |two-last2 |last2 |
|3 |three-last3|last3 |
+---+-----------+--------+
Remember that the split function from SparkSQL can be applied to a column of the DataFrame.
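As a small sketch of that, assuming the same df and Value column as in the UDF example above, the last element can also be taken with plain column expressions, without a UDF:

import org.apache.spark.sql.functions.expr

// Index the split array with its own size to take the last element.
val with_last = df.withColumn("Last", expr("split(Value, '-')[size(split(Value, '-')) - 1]"))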
use split(MY_COLUMN,'-').getItem(0) if you are using Java

PostgreSQL: Sub-select inside insert

I have a table called map_tags:
map_id | map_license | map_desc
And another table (widgets) whose records contain a foreign key reference (1-to-1) to a map_tags record:
widget_id | map_id | widget_name
Given the constraint that all map_license values are unique (though they are not set up as keys on map_tags), if I have a map_license and a widget_name I'd like to perform an insert on widgets, all inside the same SQL statement:
INSERT INTO widgets w
(
    map_id,
    widget_name
)
VALUES (
    (
        SELECT mt.map_id
        FROM map_tags mt
        WHERE
            -- This should work and return a single record because map_license is unique
            mt.map_license = '12345'
    ),
    'Bupo'
)
I believe I'm on the right track but know right off the bat that this is incorrect SQL for Postgres. Does anybody know the proper way to achieve such a single query?
Use the INSERT INTO ... SELECT variant, including whatever constants you need right in the SELECT statement.
The PostgreSQL INSERT syntax is:
INSERT INTO table [ ( column [, ...] ) ]
{ DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
Take note of the query option at the end of the second line above.
Here is an example for you.
INSERT INTO widgets
(
    map_id,
    widget_name
)
SELECT
    mt.map_id,
    'Bupo'
FROM
    map_tags mt
WHERE
    mt.map_license = '12345'
INSERT INTO widgets
(
    map_id,
    widget_name
)
SELECT
    mt.map_id, 'Bupo'
FROM
    map_tags mt
WHERE
    mt.map_license = '12345'
Quick answer: you don't have "a single record", you have a "set with 1 record".
If this were JavaScript, you would have an "array with 1 value", not "1 value".
In your example, one record may be returned by the sub-query, but you are still trying to unpack an "array" of records into separate actual parameters in a place that takes only 1 parameter.
It took me a few hours to wrap my head around the "why not", as I was trying to do something very similar. Here are my notes:
tb_table01: (no records)
+---+---+---+
| a | b | c | << column names
+---+---+---+
tb_table02:
+---+---+---+
| a | b | c | << column names
+---+---+---+
|'d'|'d'|'d'| << record #1
+---+---+---+
|'e'|'e'|'e'| << record #2
+---+---+---+
|'f'|'f'|'f'| << record #3
+---+---+---+
-- This statement will fail:
INSERT INTO tb_table01
( a, b, c )
VALUES
( 'record_1.a', 'record_1.b', 'record_1.c' ),
( 'record_2.a', 'record_2.b', 'record_2.c' ),
-- This sub-query returns multiple rows, and they are NOT
-- automatically unpacked like in JavaScript, where you can
-- send an array to a variadic function.
(
    SELECT a, b, c FROM tb_table02
)
;
Basically, don't think of VALUES as a variadic function that can unpack an array of records. There is no argument unpacking here like you would have in a JavaScript function, such as:
function takeValues( ...values ){
    values.forEach((v)=>{ console.log( v ) });
};
var records = [ [1,2,3],[4,5,6],[7,8,9] ];
takeValues( ...records );
//:RESULT:
//: console.log #1 : [1,2,3]
//: console.log #2 : [4,5,6]
//: console.log #3 : [7,8,9]
Back to your SQL question: the fact that this functionality does not exist does not change just because your sub-selection contains only one result. It is a "set with one record", not "a single record".