automatically inferring column list for pivot in spark-sql - apache-spark-sql

In spark-sql, is there any way to automatically infer the distinct column values for the pivot operator?
data_list = [("maaruti",1000,"hyderabad"),
("tata",2000,"mumbai"),
("hyundai",1500,"delhi"),
("mahindra",1200,"chennai"),
("maaruti",1200,"mumbai"),
("tata",1000,"delhi"),
("hyundai",2000,"chennai"),
("mahindra",1500,"hyderabad"),
("tata",1100,"delhi"),
("mahindra",1200,"chennai")
]
df = spark.createDataFrame(data_list).toDF("company", "sales", "city")
df.show()
DataFrame approach
df.groupby("company","city").sum("sales").groupby("company").pivot("city").sum('sum(sales)').show()
Here the distinct values in the city column are automatically inferred.
Spark SQL approach
df.createOrReplaceTempView("tab")
spark.sql("""
select * from tab pivot(sum(sales) as assum
for city in ('delhi','mumbai','hyderabad','chennai'))
""").show()
The above snippet gives the desired output; however, the column list of distinct city values has to be specified manually. Is there any way to do this automatically?
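One possible workaround (a minimal sketch, not from the original post; it assumes the df and tab defined above): collect the distinct city values first, then splice them into the PIVOT clause before calling spark.sql.

# Build the IN list from the data itself
cities = [row["city"] for row in df.select("city").distinct().collect()]
in_list = ", ".join("'{}'".format(c) for c in cities)
spark.sql("""
select * from tab pivot(sum(sales) as assum
for city in ({}))
""".format(in_list)).show()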

Related

Passing a DataFrame list to a WHERE clause in a SQL query embedded in R

I've been working on this for hours and can't find a solution that works.
For simplicity, let's say we have a DataFrame in R, call it df, with a single column of values: 1, 2, 3, 4 and 5.
I want to embed the following query in R:
SELECT DISTINCT column1,
column 2
FROM database
WHERE value IN (1,2,3,4,5)
If I embed the query in R via the following
df5 <- dbGetQuery(db,paste0("
SELECT DISTINCT column1,
column 2
FROM database
WHERE value IN (1,2,3,4,5)"))
then my query works, but I want to reference this DataFrame which I am pulling in from an Excel file. The natural thing to do would be to convert it to a list
val_list=list(df$'values')
and do the following
df5 <- dbGetQuery(db,paste0("
SELECT DISTINCT column1,
column 2
FROM database
WHERE value IN '",vals,"))
This is, however, not working. How can I get it to work as I want?
One should never interpolate data directly into the query, lest accidental SQL injection (or query poisoning) occur. It's better to use bound parameters or similar; see https://db.rstudio.com/best-practices/run-queries-safely/.
For this code, assuming you have a vector of values whose IN-set membership you want to check:
vec <- df$values   # the vector of values to test membership against
qmarks <- paste(rep("?", length(vec)), collapse = ",")
df5 <- dbGetQuery(db, paste("
SELECT DISTINCT column1, column 2
FROM database
WHERE value IN (", qmarks, ")"),
params = as.list(vec))
Suppose we wish to insert the Time column from the built-in data frame BOD into an SQL query.
sprintf("select * from X where Y in (%s)", toString(BOD$Time))
## [1] "select * from X where Y in (1, 2, 3, 4, 5, 7)"
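The same placeholder-building idea carries over to other client libraries. Here is a minimal Python sketch using sqlite3 (the table and column names are hypothetical, purely to illustrate the pattern):

import sqlite3

vals = [1, 2, 3, 4, 5]                      # values pulled from the data frame
qmarks = ",".join("?" * len(vals))          # "?,?,?,?,?"

conn = sqlite3.connect(":memory:")
conn.execute("create table t (column1 text, value integer)")  # hypothetical table
rows = conn.execute(
    "select distinct column1 from t where value in ({})".format(qmarks),
    vals,                                   # values are bound, never interpolated
).fetchall()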

Can we flatten a column which contains JSON values in a Hive table?

I have one Hive column 'events' with JSON values. How can I flatten this JSON to create one Hive table with columns for the key fields of the JSON? Is it even possible?
e.g. I need the Hive table columns to be events, start_date, id, details with the corresponding values.
| events |
|[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] |
|[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]|
Demo:
select events,
get_json_object(element,'$.id') as id,
get_json_object(element,'$.start_date') as start_date,
get_json_object(element,'$.details') as details
from
(
select '[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}]' as events
union all
select '[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]' as events
) s lateral view outer explode (split(regexp_replace(events, '\\[|\\]',''),'(?<=\\}),(?=\\{)')) e as element
The initial string is split on the commas that sit between curly brackets (the lookbehind/lookahead regex), the array is exploded with a lateral view, and the JSON objects are parsed using get_json_object.
Result:
events id start_date details
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245ret 20201230 Imp
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245rtr 20201228 NoImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245ret 20191230 vImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245rwer 20191228 NoImp

selecting columns using below code but getting only top row

Selecting columns using pandas but getting only the top row, using the following code:
data1 = data.loc[:, 'Subject':'Sub-Component']
# column names: "Subject", "Original Product Version", "Software Version", "Software Version #", "Software Release", "Component", "Sub-Component"
I was expecting this to select all of those columns, but it does not; I am only getting the column names/headers as the result.
If you want to select specific columns, add those column names into a list and pass that:
col_list = ['Subject', 'Sub-Component']
data1 = data.loc[:, col_list]
If you want to select a range of columns, use iloc instead:
data1 = data.iloc[:, 3:8]

How can I aggregate Jsonb columns in postgres using another column type

I have the following data in a Postgres table, where data is a jsonb column. I would like to get the result as
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract data from the jsonb column named data. Then aggregate rows with GROUP BY to get one row for each distinct field_type. Each aggregation also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression so that two separate columns are produced for the two counts. On top of that, apply coalesce to return 0 instead of NULL where some information is missing; in your example that would be briefings for Data Science.
In the outer statement, now that we have the results as a table with those fields, we need to build a jsonb object per row and aggregate them all into one row. For that we use jsonb_build_object, passing it pairs consisting of a field name and its value. That leaves 3 rows of data, each with a separate jsonb column holding the data. Since we want only one row (an aggregated JSON array) in the output, we apply jsonb_agg on top of that, which gives the result you're looking for.
Code
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).

SQL command(s) to transform data

For the SQL language gurus...a challenge. Hopefully not too hard. If I have data that contains an asset identifier, followed by 200 data elements for that asset...what SQL snippet would transform that to a vertical format?
Current:
Column names:
Asset ID, Column Header 1, Column Header 2, ... Column Header "n"
Data Row:
abc123, 1234, 2345, 3456, ...
Desired:
Asset ID, Column Header 1, 1234
Asset ID, Column Header 2, 2345
Asset ID, Column Header 3, 3456
...
Asset ID, Column Header n, 9876
The SQL implementation that I am using (DashDB, based on DB2, in Bluemix) does not support a "pivot" command, and I would like the code snippet to work unchanged if column headers are changed or additional columns are added to the "current" data format. I.e., I would prefer not to hard-code a fixed list of columns.
What do you think? Can it be done with an SQL code snippet?
Thanks!
You can do this by composing a pivoted table for each row and performing a cartesian product between the source table and the composed table:
SELECT assetId, colname, colvalue
FROM yourtable T,
TABLE(VALUES ('ColumnHeader1', T.ColumnHeader1),
('ColumnHeader2', T.ColumnHeader2),
('ColumnHeader3', T.ColumnHeader3),
...
('ColumnHeaderN', T.ColumnHeaderN)
) as pivot(colname, colvalue);
This will only require a single scan of yourtable, so it is quite efficient.
The canonical way is union all:
select assetId, 'ColumnHeader1' as colname, ColumnHeader1 as value from t union all
select assetId, 'ColumnHeader2' as colname, ColumnHeader2 as value from t union all
. . .
There are other methods but this is usually the simplest to code. It will require reading the table once for each column, which could be an issue.
Note: you can construct such a query using a spreadsheet and formulas, or even construct it using another SQL query.
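As that note suggests, the repetitive UNION ALL query can be generated rather than typed by hand. A minimal Python sketch (the column names are hard-coded here, but they could equally come from a catalog query such as SYSCAT.COLUMNS in DB2):

# Generate the UNION ALL query from a list of column names
columns = ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]
parts = [
    "select assetId, '{0}' as colname, {0} as value from t".format(c)
    for c in columns
]
print("\nunion all\n".join(parts))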