I want to update one row in a table using Spark SQL. How can I do it?
For example, my original query is:
UPDATE student SET marks = 56 WHERE id = 1;
How can I do this in Spark SQL, given that UPDATE is not supported there?
Please help with this.
Spark SQL DataFrames are, at their core, RDDs, which are immutable, meaning you can't change (update) them. However, you can generate a new DataFrame by applying a user-defined function (UDF) to the original DataFrame, something like this:
import org.apache.spark.sql.functions.{col, udf}
val my_udf = udf { (given_id: Int, given_marks: Int) => if (given_id == 1) 56 else given_marks }
val new_df = original_df.select(col("id"), my_udf(col("id"), col("marks")).as("marks"))
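If you prefer to stay in plain Spark SQL, the same conditional replacement can be expressed as a query. This is only a sketch and assumes you first register the DataFrame as a temporary view named student (e.g. original_df.createOrReplaceTempView("student")):
-- CASE WHEN reproduces the conditional "update" as a projection; nothing is modified in place
SELECT id,
       CASE WHEN id = 1 THEN 56 ELSE marks END AS marks
FROM student;
Like the UDF approach, this produces a new result set that you can write back out, rather than updating the original rows.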
Related
I would like to know a way of adding one additional column to a BigQuery table that will populate all the rows of this newly created column with a specific constant value.
I know how to create a column with NULL values:
ALTER TABLE project_id.dataset.table
ADD COLUMN A STRING
But my goal is to also add the ingestion time with the CURRENT_TIMESTAMP() function. Is it even possible with one command? Or do I need to run some second command afterwards?
It seems the solution is to run another query after the one mentioned above:
UPDATE project_id.dataset.table SET A = CAST(CURRENT_TIMESTAMP() AS STRING) WHERE 1=1
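Putting the two statements together, a sketch of the full sequence (using the same placeholder table name) is:
-- step 1: add the new column; existing rows get NULL
ALTER TABLE project_id.dataset.table
ADD COLUMN A STRING;

-- step 2: backfill every existing row (BigQuery's UPDATE requires a WHERE clause)
UPDATE project_id.dataset.table
SET A = CAST(CURRENT_TIMESTAMP() AS STRING)
WHERE TRUE;
As far as I know there is no single statement that both adds the column and backfills existing rows, so the ALTER followed by the UPDATE is the way to go.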
I'm stuck on an issue in Tableau when trying to run a custom query with a string parameter. I'd like to query one column dynamically from a certain table in BigQuery.
My SQL looks like this:
select <Parameters.column for research> as column,
count(*) as N
from table_name
where date=<Parameters.date>
group by 1
Here I'm trying to use the parameter as a column name, but unfortunately I just receive a string column whose only value is the parameter itself.
Is it possible to execute my request? If it's doable, how should the custom SQL be written?
I am using the following code to insert DataFrame data directly into a Databricks Delta table:
eventDataFrame.write.format("delta").mode("append").option("inferSchema","true").insertInto("some delta table")
but if the column order with which the Delta table was created differs from the DataFrame's column order, the values get jumbled up and don't get written to the correct columns. How can I maintain the order? Is there a standard way/best practice to do this?
This is fairly simple:
# in PySpark
df = spark.read.table("TARGET_TABLE")  # the table into which we finally need to insert
# df_increment is the DataFrame whose columns are in a random order and which we want to insert into TARGET_TABLE
df_increment = df_increment.select(df.columns)
df_increment.write.insertInto("TARGET_TABLE")
So for you it will be:
parent_df = spark.read.table("some delta table")
eventDataFrame.select(parent_df.columns).write.format("delta").mode("append").option("inferSchema","true").insertInto("some delta table")
Use saveAsTable; column order doesn't matter with it, as Spark finds the correct column positions by column name.
eventDataFrame.write.format("delta").mode("append").option("inferSchema","true").saveAsTable("foo")
From the Spark documentation:
The column order in the schema of the DataFrame doesn't need to be same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions
I have datasets of the same structure, and I know I can query them like this (they are named by date):
SELECT column
FROM [xx.ga_sessions_20141019] ,[xx.ga_sessions_20141020],[xx.ga_sessions_20141021]
WHERE column = 'condition';
However, I actually want to query several months of this data, so instead of listing them all as above, is there syntax I can use that looks like this:
SELECT column
FROM [xx.ga_sessions_201410*] ,[xx.ga_sessions_201411*]
WHERE column = 'condition';
Take a look at the table wildcard functions section of the BigQuery query reference. TABLE_DATE_RANGE or TABLE_QUERY will work for you here. Something like:
SELECT column
FROM TABLE_DATE_RANGE(xx.ga_sessions_,
TIMESTAMP('2014-10-19'),
TIMESTAMP('2014-10-21'))
WHERE column = 'condition';
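If the tables you need are not a contiguous date range (for example, two whole months), TABLE_QUERY, mentioned above, lets you filter on table_id instead. A rough, untested sketch against the same placeholder dataset:
SELECT column
FROM TABLE_QUERY(xx,
       'table_id CONTAINS "ga_sessions_201410"
        OR table_id CONTAINS "ga_sessions_201411"')
WHERE column = 'condition';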
There is a table 'target' that looks like this:
id val
1 A
2 A
3
5 A
8 A
9
There is also a 'source' table that looks like this:
id val
1
2 B
4 B
8 B
9 B
The directions ask me to use the 'source' table to transform the 'target' table into the 'result' table, which looks like this:
result
id val
1
2 B
3
5 A
8 B
9 B
I understand the logic of what the question is asking; I believe what I need to do is basically say:
IF target.id = source.id
SET target.val = source.val
ELSE target.val = target.val
However, I am not completely sure how to accomplish this kind of conditional update across multiple tables in PostgreSQL.
This looks like homework so I won't give a complete answer.
The first step is to turn these into tables you can actually use. A handy tool for this is provided by http://sqlfiddle.com, with its "text to table" feature. Because of the dodgy formatting of the input, we have to make some fixups before it'll work (assuming empty columns are NULL, not empty strings; fixing whitespace errors), but then we get:
http://sqlfiddle.com/#!15/4a046
(SQLfiddle is far from a model of how you should write DDL - it's a useful debugging tool, that's all).
So now you have something to play with.
At this point, I suggest looking into the UPDATE ... FROM statement, which lets you update a join. Or you can use a subquery in UPDATE to perform the required logic.
UPDATE target
SET val = source.val
FROM /* you get to fill this in */
WHERE /* you get to fill this in */
Merging data
Luckily, the result table they've given you is the result of a simple join-update. Note that rows present in "source" but not in "target" haven't been added to "result".
If you were instead supposed to merge the two, so that entries in source that do not exist in target at all get added to target, this becomes what's called an upsert. This has been written about extensively; see links included in this post. Be glad you don't have to deal with that.