I have tables built on Hadoop. These are Impala tables (not Kudu).
Issue: I have to update a few column values (e.g. load_date, fraud_type) in the final_up_2 table from the ulti_up_2 table, for a set of keys (dw, auth, ulti_date).
I have the below-mentioned use case:
Table 1 :
create table dbo.ulti_up_2 (
dw string
,auth int
,ulti_date string
,load_date string
,fraud_type string
);
insert into dbo.ulti_up_2
values ('b',1,'2021-07-25','2021-07-27','x'),
('c',0,'2021-07-25','2021-07-27','y');
Table 2:
create table dbo.final_up_2 (id int,auth_date string,dw string,auth int,ulti_date string,load_date string,fraud_type string);
insert into dbo.final_up_2 values
(1,'2021-07-24','a',1,'2021-07-25','2021-07-25','p'),
(2,'2021-07-24','b',1,'2021-07-25','2021-07-25','q'),
(3,'2021-07-24','c',0,'2021-07-25','2021-07-25','t'),
(4,'2021-07-24','d',1,'2021-07-25','2021-07-25','r');
create table dbo.refresh_table1 as
select df_prep.id,df_prep.auth_date,df_prep.dw,df_prep.auth,df_prep.ulti_date,
ulti_prep.fraud_type,ulti_prep.load_date
from dbo.final_up_2 df_prep
left join
dbo.ulti_up_2 ulti_prep
on df_prep.dw=ulti_prep.dw and
df_prep.auth=ulti_prep.auth and
df_prep.ulti_date=ulti_prep.ulti_date;
Output I am getting:
id|auth_date|dw|auth|ulti_date|fraud_type|load_date
(1,'2021-07-24','a',1,'2021-07-25',NULL,NULL),
(2,'2021-07-24','b',1,'2021-07-25','x','2021-07-27'),
(3,'2021-07-24','c',0,'2021-07-25','y','2021-07-27'),
(4,'2021-07-24','d',1,'2021-07-25',NULL,NULL);
Output I need:
id|auth_date|dw|auth|ulti_date|fraud_type|load_date
(1,'2021-07-24','a',1,'2021-07-25','p','2021-07-25'),
(2,'2021-07-24','b',1,'2021-07-25','x','2021-07-27'),
(3,'2021-07-24','c',0,'2021-07-25','y','2021-07-27'),
(4,'2021-07-24','d',1,'2021-07-25','r','2021-07-25');
Thanks in Advance. Please Help.
This is because the left join with ulti_up_2 finds no match for some rows. If you handle those cases, you should get the expected data.
create table dbo.refresh_table1 as
select df_prep.id,df_prep.auth_date,df_prep.dw,df_prep.auth,df_prep.ulti_date,
ifnull(ulti_prep.fraud_type,df_prep.fraud_type) as fraud_type , --This will fetch data from final_up_2 in case left join fails.
ifnull(ulti_prep.load_date,df_prep.load_date) as load_date --This will fetch data from final_up_2 in case left join fails.
from dbo.final_up_2 df_prep
left join
dbo.ulti_up_2 ulti_prep
on df_prep.dw=ulti_prep.dw and
df_prep.auth=ulti_prep.auth and
df_prep.ulti_date=ulti_prep.ulti_date;
I am fairly new to SQL. What I am trying to do is create a view from an existing table. I also need to add a new column to the view which maps to the values of an existing column in the table.
So within the view, if the value in a row for Col_1 = A, then the value in the corresponding row for New_Col = C, and so on.
Does this even make sense? Would I use the CASE clause? Is mapping in this way even possible?
Thanks
The best way to do this is to create a mapping or lookup table.
For example, consider the following LOOKUP table:
COL_A NEW_VALUE
---- -----
A C
B D
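If it helps, here is a minimal sketch of creating and populating such a lookup table (the names LOOKUP, COL_A and NEW_VALUE are just illustrative):
CREATE TABLE LOOKUP (
    COL_A     VARCHAR(10),
    NEW_VALUE VARCHAR(10)
);
INSERT INTO LOOKUP (COL_A, NEW_VALUE)
VALUES ('A', 'C'),
       ('B', 'D');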
Then you can have a query like this:
SELECT A.*, LOOK.NEW_VALUE
FROM TABLEA AS A
JOIN LOOKUP AS LOOK ON A.COL_A = LOOK.COL_A
This is what DimaSUN is doing in his query too -- but in his case he is creating the table dynamically in the body of the query.
Also note, I'm using a JOIN (which is an inner join), so only rows that have a match in the lookup table will be returned; this could filter the results. A LEFT JOIN there would return all data from A, but the new column might be null for some rows.
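For completeness, the LEFT JOIN variant (same hypothetical names as above) would be:
SELECT A.*, LOOK.NEW_VALUE
FROM TABLEA AS A
LEFT JOIN LOOKUP AS LOOK ON A.COL_A = LOOK.COL_A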
Generally, a view is a stored query over a table: it reflects the underlying table's data without altering the original table. So, as per your question, you can derive the new column in a view by using CASE:
CREATE VIEW viewname AS
SELECT a.*,
       CASE WHEN a.Col_1 = 'A' THEN 'C'
            WHEN a.Col_1 = 'B' THEN 'D'
            ELSE NULL
       END AS New_Col
FROM (SELECT * FROM your_table) a
If you have a restricted list of replacement values, you can hardcode that list in the query:
select T.*,map.New_Col
from ExistingTable T
left join (
values
('A','C')
,('B','D')
) map (Col_1,New_Col) on map.Col_1 = T.Col_1
In this sample, 'A' is mapped to 'C' and 'B' to 'D'.
In the general case, you are better off using an additional mapping table (see Hogan's answer).
I have 2 BigQuery tables with nested columns. I need to update all the columns in table1 whenever table1.value1 = table2.value; both tables also hold a huge amount of data.
I could update a single nested column with a static value like below:
#standardSQL
UPDATE `ck.table1`
SET promotion_id = ARRAY(
SELECT AS STRUCT * REPLACE (100 AS PromotionId ) FROM UNNEST(promotion_id)
)
WHERE true
But when I try to reuse the same approach to update multiple columns based on table2 data, I get exceptions.
I am trying to update table1 with table2 data whenever the table1.value1=table2.value with all the nested columns.
As of now, both tables are having a similar schema.
I need to update all the columns in table1 whenever table1.value1=table2.value
... both tables are having a similar schema
I assume by similar you meant same
Below is for BigQuery Standard SQL
You can use the below query to get the combined result and save it back to table1, either via a destination table or the CREATE OR REPLACE TABLE syntax (sketched after the query):
#standardSQL
SELECT AS VALUE IF(value IS NULL, t1, t2)
FROM `project.dataset.table1` t1
LEFT JOIN `project.dataset.table2` t2
ON value1 = value
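A minimal sketch of saving it back in place with CREATE OR REPLACE TABLE (assuming the same project.dataset names as above, and that rewriting table1 is acceptable) would be:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table1` AS
SELECT AS VALUE IF(value IS NULL, t1, t2)
FROM `project.dataset.table1` t1
LEFT JOIN `project.dataset.table2` t2
ON value1 = value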
I have not tried this approach with UPDATE syntax - but you can try and let us know :o)
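For reference, an untested sketch of what that UPDATE ... FROM form might look like (promotion_id is only illustrative, and each nested column you want to refresh would need its own SET entry):
#standardSQL
UPDATE `project.dataset.table1` t1
SET promotion_id = t2.promotion_id
FROM `project.dataset.table2` t2
WHERE t1.value1 = t2.value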
I am using Postgresql db. I have data in two tables. Table A has 10 records and Table B 5 records.
I would like to copy Table A data to Table B but only copy the new entries (5 records) and ignore the duplicates/already existing data
I would like to copy data from Table A to Table B so that Table B ends up with 10 records (5 old records + 5 new records from Table A).
Can you please help me as to how can this be done?
Assuming id is your primary key and the table structures are identical (both tables have the same columns, in number and data type), use NOT EXISTS:
insert into TableB
select *
from TableA a
where not exists ( select 0 from TableB b where b.id = a.id )
If you are looking to copy rows unique to A that are not in B, then you can use INSERT...SELECT. The SELECT statement should use the set operator EXCEPT:
INSERT INTO B (column)
SELECT column FROM A
EXCEPT
SELECT column FROM B;
EXCEPT (https://www.postgresql.org/docs/current/queries-union.html) compares the two result sets and returns the distinct rows present in result A but not in B, and these values are then supplied to INSERT. For this to work, both the columns and their respective datatypes must match in the two SELECT queries and your INSERT.
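Applied to the tables in the question (assuming Table A and Table B share the same full column list), the same pattern can be written over all columns:
INSERT INTO B
SELECT * FROM A
EXCEPT
SELECT * FROM B;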
INSERT INTO Table_B
SELECT *
FROM Table_A
ON CONFLICT DO NOTHING;
Here, the conflict is detected based on your primary key (or any other unique constraint on Table_B).
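If you prefer to name the conflict target explicitly, a sketch assuming id is the primary key would be:
INSERT INTO Table_B
SELECT *
FROM Table_A
ON CONFLICT (id) DO NOTHING;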
In SQL something like
SELECT count(id), sum(if(column1 = 1, 1, 0)) from groupedTable
could be formulated to perform a count of the total records as well as filtered records in a single pass.
How can I perform this in spark-data-frame API? i.e. without needing to join back one of the counts to the original data frame.
Just use count for both cases:
df.select(count($"id"), count(when($"column1" === 1, true)))
If column is nullable you should correct for that (for example with coalesce or IS NULL, depending on the desired output).
You can try using Spark with Hive, as Hive supports SQL's sum(if()) functionality.
First you need to create a Hive table on top of your data using the below code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Hive_Test")
val sc = new SparkContext(conf)
// Creation of Hive context
val hsc = new HiveContext(sc)
import hsc.implicits._

hsc.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
hsc.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/test.txt' INTO TABLE emp")
hsc.sql("""select count(id), SUM(v)
from (
select id, IF(name=1, count(*), 0) AS v
from emp
where id>0
group by id,name
) t2""")