How/where to format INSERT statement by aligning columns with values - sql

I would like to know how I can format the INSERT statement from:
INSERT tablename ([column1], [column2], [column3], [column4], [column5]) VALUES (2,1,11,NULL,4), (2,1,11,NULL,4), (2,1,11,NULL,4);
to:
INSERT tablename ([column1], [column2], [column3], [column4], [column5])
VALUES ( 2, 1, 11, NULL, 4),
( 2, 1, 11, NULL, 4),
( 2, 1, 11, NULL, 4);
Do you know of a text formatter that does this?
I found several options, but none that format it this way.

I didn't find an existing solution, so I wrote one in PHP.
Let me know if you have any questions.
Enjoy!
#!/usr/bin/env php
<?php
echo preg_replace_callback("/(INSERT)\s+(.+)\s+\((.+)\)\s+(VALUES)\s+\((.+)\);*/", function($matches) {
    $columns = explode(",", $matches[3]);
    $values = explode(",", $matches[5]);
    // length of INSERT + tablename
    $length = strlen("$matches[1] $matches[2]");
    // fills VALUES with leading spaces so it lines up under the column list
    $values_with_spaces = str_pad($matches[4], $length, " ", STR_PAD_LEFT);
    $newColumns = $newValues = [];
    foreach ($columns as $key => $column) {
        $value = !isset($values[$key]) ? 'NULL' : $values[$key];
        // takes the longer of the column length and the value length
        $column_length = max(strlen($column), strlen($value));
        $newColumns[$key] = str_pad($column, $column_length, " ", STR_PAD_LEFT);
        $newValues[$key] = str_pad($value, $column_length, " ", STR_PAD_LEFT);
    }
    return "$matches[1] $matches[2] (" . join(",", $newColumns) . ")\n$values_with_spaces (" . join(",", $newValues) . ");\n";
}, file_get_contents("php://stdin"));

Related

R dbGetQuery insert blob data with other text data

library(DBI)
library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(db, "create table if not exists drug_rank (
  _id integer primary key autoincrement,
  pertData_type text,
  pertData_name text,
  pathway_name text,
  drug_name text,
  drug_rank_RData blob
)")
I want to insert the data using a prepared statement. I have all the drug_rank_RData objects saved as .RData files; each is essentially a list, and I want to put them all in the database. How can I do so?
I tried the following, but it does not work:
df <- list(a = c(1, 2, 3), b = c(2, 4, 6), c = c(3, 6, 9)) |> as.data.frame()
drug_rank_obj <- list(sig_drugs = df, name = "test_drug_rank_obj")
dbGetQuery(db,
  "insert into drug_rank values (?, ?, ?, ?, ?, ?)",
  params = list(
    1,
    "test type",
    "test name",
    "test pathway",
    "test drugname",
    drug_rank_obj
  )
)
Thank you.
The only issue is that you need to serialize your data into a binary format in order to insert it into a blob. This example should work:
library(DBI)
library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(db, "create table if not exists drug_rank (
  _id integer primary key autoincrement,
  name text,
  object blob)"
)
# serialize data to binary
drug_rank_obj <- list(a = c(1, 2, 3), b = c(2, 4, 6), c = c(3, 6, 9)) |>
  as.data.frame() |>
  serialize(NULL)
# insert into database
dbExecute(db,
  "insert into drug_rank values (?, ?, ?)",
  params = list(1, "my data", list(drug_rank_obj))
)
## retrieve data
rows <- dbGetQuery(db, "select * from drug_rank where _id == ?",
  params = list(1))
## un-serialize data back into an R object
data <- unserialize(rows$object[[1]])
When doing an insert, use dbExecute() rather than dbGetQuery().

Array Pair Loading from databases using Qlik Sense

Does anyone have experience with how to load/prepare data:
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
taken from a SQL database (stored there as a value) into a Qlik Sense table:
ID, Value
1, a
2, b
3, c
4, d
Check out the annotated script below.
After it runs, the resulting SQLData table will contain the desired ID/Value rows:
set vSQLData = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')];
SQLData:
Load
// at this point the data will look like: "1, a", "2, b"
// will split the string on "," and will
// get the first value as ID
// and the second one as Value
SubField(TempField2, ',', 1) as ID,
SubField(TempField2, ',', 2) as Value
;
Load
// split the string by ")," and generate N number of rows
// then for each row remove "(", ")" and "'" characters
PurgeChar(SubField(TempField1, '),'), '''()''') as TempField2
;
Load
// remove "[" and "]" characters
PurgeChar('$(vSQLData)', '[]') as TempField1
AutoGenerate(1)
;

Formula to Divide Total of Columns and Format Result Not Valid

I have the following SQL code, which returns an error message stating that the AvgPYs column is not valid. This also affects my Auth column. Is there an issue with my formula? Any help is appreciated.
SELECT [tblCeiling].[Proj Code], [tblCeiling].[Act Code], [tblCeiling].[Cost Ctr], [tblCeiling].[Date], [tblCeiling].[Ref2], [tblCeiling].[Analyst], [tblCeiling].[Type], [tblCeiling].[B or O], [tblCeiling].[Jul], [tblCeiling].[Aug],
[tblCeiling].[Sep], [tblCeiling].[Oct], [tblCeiling].[Nov], [tblCeiling].[Dec], [tblCeiling].[Jan], [tblCeiling].[Feb], [tblCeiling].[Mar], [tblCeiling].[Apr], [tblCeiling].[May], [tblCeiling].[Jun], [tblCeiling].[Perm], [tblCeiling].[Temp], [tblCeiling].[LimitedTerm],
[tblCeiling].[LTDate], [tblCeiling].[Sal_Rate], [tblCeiling].[New], [Perm] + [Temp] + [LimitedTerm] AS Monthly, Format(([tblCeiling].[Jul] + [tblCeiling].[Aug] + [tblCeiling].[Sep] + [tblCeiling].[Oct] + [tblCeiling].[Nov] + [tblCeiling].[Dec] + [tblCeiling].[Jan] +
[tblCeiling].[Feb] + [tblCeiling].[Mar] + [tblCeiling].[Apr] + [tblCeiling].[May] + [tblCeiling].[Jun]) / 12, '0.0####') AS AvgPYs,
((([Sal_Rate] * [AvgPYs]) * 1000) / 1000) AS Auth, [Dollar Adj] + [Auth] AS Budget, [tblCeiling].[Import], [tblCeiling].[Dollar Adj], [tblCeiling].[OngoingOrOneTime], [tblCeiling].[OneTimeEndingDate]
FROM (SELECT DISTINCT *
FROM [tblactcode]) AS [tblactcode] RIGHT JOIN
(SELECT DISTINCT *
FROM [tblCeiling]) AS [tblCeiling] ON [tblactcode].[Act Code] = [tblCeiling].[Act Code]
WHERE ((([tblCeiling].[Jul]) = iif([jul] IS NULL, 0, [jul])) AND (([tblCeiling].[Aug]) = iif([aug] IS NULL, 0, [aug])) AND (([tblCeiling].[Sep]) = iif([sep] IS NULL, 0, [sep])) AND (([tblCeiling].[Oct]) = iif([oct] IS NULL, 0, [oct])) AND (([tblCeiling].[Nov]) = iif([nov] IS NULL,
0, [nov])) AND (([tblCeiling].[Dec]) = iif([dec] IS NULL, 0, [dec])) AND (([tblCeiling].[Jan]) = iif([jan] IS NULL, 0, [jan])) AND (([tblCeiling].[Feb]) = iif([feb] IS NULL, 0, [feb])) AND (([tblCeiling].[Mar]) = iif([mar] IS NULL, 0, [mar])) AND (([tblCeiling].[Apr])
= iif([apr] IS NULL, 0, [apr])) AND (([tblCeiling].[May]) = iif([may] IS NULL, 0, [may])) AND (([tblCeiling].[Jun]) = iif([jun] IS NULL, 0, [jun])) AND (([tblCeiling].[Import]) = 0))
ORDER BY [tblCeiling].[Proj Code], [tblCeiling].[Cost Ctr], [tblCeiling].[Date]
Here is the specific line I'm referring to:
Format(([tblCeiling].[Jul] + [tblCeiling].[Aug] + [tblCeiling].[Sep] + [tblCeiling].[Oct] + [tblCeiling].[Nov] + [tblCeiling].[Dec] + [tblCeiling].[Jan] +
[tblCeiling].[Feb] + [tblCeiling].[Mar] + [tblCeiling].[Apr] + [tblCeiling].[May] + [tblCeiling].[Jun]) / 12, '0.0####') AS AvgPYs,
The specific issue you have run into is attempting to use a calculated column in the same scope where it is calculated, which isn't possible; you can only reference a calculated value in an outer query.
A neat solution is to use CROSS APPLY, which lets you reuse a calculation. In general this is done as follows:
select -- existing columns before AvgPYs
, AvgPYs
-- , some formula which depends on AvgPYs
from (
-- existing query
) C -- C is an acceptable short alias for Ceiling
cross apply (
values (formula)
) X (AvgPYs)
In your case I think the following is correct:
SELECT C.[Proj Code], C.[Act Code], C.[Cost Ctr], C.[Date], C.[Ref2], C.[Analyst], C.[Type], C.[B or O], C.[Jul], C.[Aug]
, C.[Sep], C.[Oct], C.[Nov], C.[Dec], C.[Jan], C.[Feb], C.[Mar], C.[Apr], C.[May], C.[Jun], C.[Perm], C.[Temp], C.[LimitedTerm]
, C.[LTDate], C.[Sal_Rate], C.[New], [Perm] + [Temp] + [LimitedTerm] AS Monthly
, X.AvgPYs
, Y.Auth
, [Dollar Adj] + Y.Auth AS Budget, C.[Import], C.[Dollar Adj], C.[OngoingOrOneTime], C.[OneTimeEndingDate]
FROM (
SELECT DISTINCT *
FROM [tblactcode]
) AS AC
RIGHT JOIN (
SELECT DISTINCT *
FROM [tblCeiling]
) AS C ON AC.[Act Code] = C.[Act Code]
CROSS APPLY (
VALUES (Format((C.[Jul] + C.[Aug] + C.[Sep] + C.[Oct] + C.[Nov] + C.[Dec] + C.[Jan] +
C.[Feb] + C.[Mar] + C.[Apr] + C.[May] + C.[Jun]) / 12, '0.0####'))
) AS X (AvgPYs)
CROSS APPLY (
VALUES (((([Sal_Rate] * X.AvgPYs) * 1000) / 1000))
) Y (Auth)
WHERE (((C.[Jul]) = iif([jul] IS NULL, 0, [jul])) AND ((C.[Aug]) = iif([aug] IS NULL, 0, [aug])) AND ((C.[Sep]) = iif([sep] IS NULL, 0, [sep])) AND ((C.[Oct]) = iif([oct] IS NULL, 0, [oct])) AND ((C.[Nov]) = iif([nov] IS NULL,
0, [nov])) AND ((C.[Dec]) = iif([dec] IS NULL, 0, [dec])) AND ((C.[Jan]) = iif([jan] IS NULL, 0, [jan])) AND ((C.[Feb]) = iif([feb] IS NULL, 0, [feb])) AND ((C.[Mar]) = iif([mar] IS NULL, 0, [mar])) AND ((C.[Apr])
= iif([apr] IS NULL, 0, [apr])) AND ((C.[May]) = iif([may] IS NULL, 0, [may])) AND ((C.[Jun]) = iif([jun] IS NULL, 0, [jun])) AND ((C.[Import]) = 0)
)
ORDER BY C.[Proj Code], C.[Cost Ctr], C.[Date];
Note: A key purpose of a table alias is to have a short reference to the table. See how much easier it is to read with shorter aliases.

Adding a new row to empty dataframe located in data lake

I created an empty dataframe table at a Delta location using the code below:
deltaResultPath = "/ml/streaming-analysis/delta/Result"
# Create Delta Lake table
sqlq = "CREATE TABLE stockDailyPrices_delta USING DELTA LOCATION '" + deltaResultPath + "'"
spark.sql(sqlq)
I am new to Spark and do not fully understand Spark SQL code. Instead of inserting values from another dataframe, I would like to add values generated in a Python script.
Something like modifying the code from:
insert_sql = "insert into stockDailyPrices_delta select f.* from stockDailyPrices f where f.price_date >= '" + price_date_min.strftime('%Y-%m-%d') + "' and f.price_date <= '" + price_date_max.strftime('%Y-%m-%d') + "'"
spark.sql(insert_sql)
to
Time = 10
cpu_temp = 3
dsp_temp = 5
insert_sql = "insert into df (Time, cpu_temp, dsp_temp) values (%s, %s, %s)"
spark.sql(insert_sql)
However, I see the following error:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'Time' expecting {'(', 'SELECT', 'FROM', 'DESC', 'VALUES', 'TABLE', 'INSERT', 'DESCRIBE', 'MAP', 'MERGE', 'UPDATE', 'REDUCE'}(line 1, pos 16)

== SQL ==
insert into df (Time, cpu_temp, dsp_temp) values (%s, %s, %s)
----------------^^^
How can I fix this code?
I could make it work with something like this:
spark.sql("insert into Result_delta select {} as Time, {} as cpu_temp, {} as dsp_temp".format(Time, cpu_temp, dsp_temp))
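Another way to avoid building SQL strings by hand is to put the generated Python values into a one-row DataFrame and append it to the Delta table. A minimal sketch, assuming the table was created at deltaResultPath as above and that its schema consists of exactly these three columns:
from pyspark.sql import Row

# build a single-row DataFrame from the Python variables
new_row = spark.createDataFrame([Row(Time=Time, cpu_temp=cpu_temp, dsp_temp=dsp_temp)])

# append it to the Delta table by path (the same path the table was created with)
(new_row.write
    .format("delta")
    .mode("append")
    .save(deltaResultPath))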

How do I get a SQL row_number equivalent for a Spark RDD?

I need to generate a full list of row_numbers for a data table with many columns.
In SQL, this would look like this:
select
key_value,
col1,
col2,
col3,
row_number() over (partition by key_value order by col1, col2 desc, col3)
from
temp
;
Now, let's say in Spark I have an RDD of the form (K, V), where V=(col1, col2, col3), so my entries are like
(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))
etc.
I want to order these using commands like sortBy(), sortWith(), sortByKey(), zipWithIndex, etc. and have a new RDD with the correct row_number
(key1, (1,2,3), 2)
(key1, (1,4,7), 1)
(key1, (2,2,3), 3)
(key2, (5,5,5), 1)
(key2, (5,5,9), 2)
(key2, (7,5,5), 3)
etc.
(I don't care about the parentheses, so the form can also be (K, (col1,col2,col3,rownum)) instead)
How do I do this?
Here's my first attempt:
val sample_data = Seq(((3,4),5,5,5),((3,4),5,5,9),((3,4),7,5,5),((1,2),1,2,3),((1,2),1,4,7),((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
temp1.collect().foreach(println)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
// ((1,2),1,2,3)
// ((1,2),1,4,7)
// ((1,2),2,2,3)
temp1.map(x => (x, 1)).sortByKey().zipWithIndex.collect().foreach(println)
// ((((1,2),1,2,3),1),0)
// ((((1,2),1,4,7),1),1)
// ((((1,2),2,2,3),1),2)
// ((((3,4),5,5,5),1),3)
// ((((3,4),5,5,9),1),4)
// ((((3,4),7,5,5),1),5)
// note that this isn't ordering with a partition on key value K!
val temp2 = temp1.???
Also note that the function sortBy cannot be applied directly to an RDD, but one must run collect() first, and then the output isn't an RDD, either, but an array
temp1.collect().sortBy(a => a._2 -> -a._3 -> a._4).foreach(println)
// ((1,2),1,4,7)
// ((1,2),1,2,3)
// ((1,2),2,2,3)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
Here's a little more progress, but still not partitioned:
val temp2 = sc.parallelize(temp1.map(a => (a._1,(a._2, a._3, a._4))).collect().sortBy(a => a._2._1 -> -a._2._2 -> a._2._3)).zipWithIndex.map(a => (a._1._1, a._1._2._1, a._1._2._2, a._1._2._3, a._2 + 1))
temp2.collect().foreach(println)
// ((1,2),1,4,7,1)
// ((1,2),1,2,3,2)
// ((1,2),2,2,3,3)
// ((3,4),5,5,5,4)
// ((3,4),5,5,9,5)
// ((3,4),7,5,5,6)
The row_number() over (partition by ... order by ...) functionality was added to Spark 1.4. This answer uses PySpark/DataFrames.
Create a test DataFrame:
from pyspark.sql import Row, functions as F
testDF = sc.parallelize(
(Row(k="key1", v=(1,2,3)),
Row(k="key1", v=(1,4,7)),
Row(k="key1", v=(2,2,3)),
Row(k="key2", v=(5,5,5)),
Row(k="key2", v=(5,5,9)),
Row(k="key2", v=(7,5,5))
)
).toDF()
Add the partitioned row number:
from pyspark.sql.window import Window
(testDF
.select("k", "v",
F.rowNumber()
.over(Window
.partitionBy("k")
.orderBy("k")
)
.alias("rowNum")
)
.show()
)
+----+-------+------+
| k| v|rowNum|
+----+-------+------+
|key1|[1,2,3]| 1|
|key1|[1,4,7]| 2|
|key1|[2,2,3]| 3|
|key2|[5,5,5]| 1|
|key2|[5,5,9]| 2|
|key2|[7,5,5]| 3|
+----+-------+------+
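Note that F.rowNumber() is the old Spark 1.4-era name; in current PySpark the same window function is exposed as F.row_number(). An equivalent call with the newer name, using the same testDF and window as above, would be:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

(testDF
 .select("k", "v",
         F.row_number()  # renamed from rowNumber() in later Spark releases
          .over(Window.partitionBy("k").orderBy("k"))
          .alias("rowNum"))
 .show())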
This is an interesting problem you're bringing up. I will answer it in Python but I'm sure you will be able to translate seamlessly to Scala.
Here is how I would tackle it:
1- Simplify your data:
temp2 = temp1.map(lambda x: (x[0],(x[1],x[2],x[3])))
temp2 is now a "real" key-value RDD. It looks like this:
[
((3, 4), (5, 5, 5)),
((3, 4), (5, 5, 9)),
((3, 4), (7, 5, 5)),
((1, 2), (1, 2, 3)),
((1, 2), (1, 4, 7)),
((1, 2), (2, 2, 3))
]
2- Then, use the group-by function to reproduce the effect of the PARTITION BY:
temp3 = temp2.groupByKey()
temp3 is now an RDD with 2 rows:
[((1, 2), <pyspark.resultiterable.ResultIterable object at 0x15e08d0>),
((3, 4), <pyspark.resultiterable.ResultIterable object at 0x15e0290>)]
3- Now, you need to apply a rank function for each value of the RDD. In Python, I would use the built-in sorted function (enumerate will create your row_number column):
temp4 = temp3.flatMap(lambda x: tuple([(x[0],(i[1],i[0])) for i in enumerate(sorted(x[1]))])).take(10)
Note that to implement your particular ordering, you would need to pass the right "key" argument to sorted (in Python, I would just create a lambda function like this one):
lambda t: (t[0], -t[1], t[2])
In the end (without the key argument), it looks like this:
[
((1, 2), ((1, 2, 3), 0)),
((1, 2), ((1, 4, 7), 1)),
((1, 2), ((2, 2, 3), 2)),
((3, 4), ((5, 5, 5), 0)),
((3, 4), ((5, 5, 9), 1)),
((3, 4), ((7, 5, 5), 2))
]
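Putting those pieces together with the ordering from the original SQL (col1 ascending, col2 descending, col3 ascending), a sketch of the final step might look like the following; the +1 makes the row numbers 1-based, like row_number():
temp4 = temp3.flatMap(
    lambda kv: [
        (kv[0], v, i + 1)  # (key, (col1, col2, col3), row_number)
        for i, v in enumerate(
            sorted(kv[1], key=lambda t: (t[0], -t[1], t[2]))
        )
    ]
)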
Hope that helps!
Good luck.
val test = Seq(("key1", (1,2,3)),("key1",(4,5,6)), ("key2", (7,8,9)), ("key2", (0,1,2)))
test: Seq[(String, (Int, Int, Int))] = List((key1,(1,2,3)), (key1,(4,5,6)), (key2,(7,8,9)), (key2,(0,1,2)))
test.foreach(println)
(key1,(1,2,3))
(key1,(4,5,6))
(key2,(7,8,9))
(key2,(0,1,2))
val rdd = sc.parallelize(test, 2)
rdd: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ParallelCollectionRDD[41] at parallelize at :26
val rdd1 = rdd.groupByKey.map(x => (x._1,x._2.toArray)).map(x => (x._1, x._2.sortBy(x => x._1).zipWithIndex))
rdd1: org.apache.spark.rdd.RDD[(String, Array[((Int, Int, Int), Int)])] = MapPartitionsRDD[44] at map at :25
val rdd2 = rdd1.flatMap{
elem =>
val key = elem._1
elem._2.map(row => (key, row._1, row._2))
}
rdd2: org.apache.spark.rdd.RDD[(String, (Int, Int, Int), Int)] = MapPartitionsRDD[45] at flatMap at :25
rdd2.collect.foreach(println)
(key1,(1,2,3),0)
(key1,(4,5,6),1)
(key2,(0,1,2),0)
(key2,(7,8,9),1)
From Spark SQL, read the data files:
val df = spark.read.json("s3://s3bukcet/key/activity/year=2018/month=12/date=15/*");
The above file has the fields user_id, pageviews, and clicks.
Generate the activity ID (row_number) partitioned by user_id and ordered by clicks:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val output = df.withColumn("activity_id", functions.row_number().over(Window.partitionBy("user_id").orderBy("clicks")).cast(DataTypes.IntegerType))
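For reference, a rough PySpark equivalent of the same snippet (same path and column names assumed):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.json("s3://s3bukcet/key/activity/year=2018/month=12/date=15/*")

# row_number() partitioned by user_id, ordered by clicks, cast to integer
output = df.withColumn(
    "activity_id",
    F.row_number().over(Window.partitionBy("user_id").orderBy("clicks")).cast("int"),
)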