Databricks Python Optimization - sql

I need your help, please. I have a simple piece of Python code that lists all the fields of all the tables in all the databases on Databricks. There are nearly 90 tables, and I would like to save the result in a txt or csv file. Here is the code I use. It works, but it takes 8 hours to finish, which is too long. How can I optimize it, or is there another, faster way to do this?
# table containing the names of all databases in databricks
#df_tables = spark.sql("SELECT * FROM bd_xyh_name")
# DynoSQL is a string table that holds the result for the txt file
def discribe():
    try:
        for i in df_tables.collect():
            showTables = """show tables in {};""".format(i.nombd)
            df1 = spark.sql(showTables)
            for j in df1.collect():
                describeTable = """describe table {0}.{1};""".format(j.database, j.tableName)
                df2 = spark.sql(describeTable)
                #df3=df2.collect()
                df3 = df2.rdd.toLocalIterator()
                for k in df3:
                    #df=df2.select(df2.col_name;k.data_type)
                    #spark.sql("insert into NewTable VALUES ("+j.database+";"+j.tableName+";"+k.col_name+";"+k.data_type+");")
                    spark.sql("insert into DynoSQL select \""+j.database+";"+j.tableName+";"+k.col_name+";"+k.data_type+"\"")
                    # request="insert into NewTable VALUES ({};{};{};{});""".format(j.database,j.tableName,k.col_name,k.data_type)
                    #spark.sql(request)
    except:
        raise

You can try the logic below.
Logic:
Get the databases available in the workspace and put their names in a list.
Iterate over the database names, get the tables available in each database, and write them into a temp table. (You should create the temp table as a managed table.)
Advantage: With this logic only one database is processed at a time, so if the process fails you can restart from the failing database instead of redoing the whole workspace.
Code snippet:
df = spark.sql("show databases")
db_names = [x["databaseName"] for x in df.collect()]
for db in db_names:
    spark.sql(f"use {db}")
    df1 = spark.sql("show tables")
    # append this database's table list to the managed result table
    df1.write.insertInto("writeintotable")
    display(df1)
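Since the goal in the question is a txt/csv file of columns rather than a table of table names, the same loop can be taken one step further. The following is only a minimal sketch under some assumptions: `collect_columns` and `write_columns_csv` are hypothetical helper names, `spark` is the session object that Databricks notebooks provide, and the rows returned by SHOW TABLES and DESCRIBE TABLE are assumed to expose `tableName`, `isTemporary`, `col_name` and `data_type` as in the question's code. It gathers every column description in a Python list and writes the file once at the end, instead of issuing one INSERT statement per column.

```python
import csv


def collect_columns(spark, db_names):
    """Return (database, table, column, data_type) tuples for every table."""
    rows = []
    for db in db_names:
        for t in spark.sql(f"SHOW TABLES IN {db}").collect():
            # Temporary views are listed too but do not belong to the database.
            if getattr(t, "isTemporary", False):
                continue
            described = spark.sql(f"DESCRIBE TABLE {db}.{t.tableName}").collect()
            for c in described:
                # Assumption: a blank or '# ...' row starts the partition /
                # metadata section, which only repeats columns already seen.
                if not c.col_name or c.col_name.startswith("#"):
                    break
                rows.append((db, t.tableName, c.col_name, c.data_type))
    return rows


def write_columns_csv(rows, path):
    """Write all collected rows to one ';'-separated file in a single pass."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter=";")
        writer.writerow(["database", "table", "column", "data_type"])
        writer.writerows(rows)
```

In a notebook this could be called as `write_columns_csv(collect_columns(spark, db_names), path)`, where `path` is wherever you want the file to land. With roughly 90 tables the metadata easily fits in driver memory, and dropping the per-column `insert into` (each of which runs as its own Spark job) is likely where most of the 8 hours goes.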

Related

Writing a scalable INSERT statement using cx_Oracle

I am attempting to write a script that will allow me to insert values from an uploaded dataframe into a table inside an Oracle DB, but I have two problems:
too many columns to hard-code
columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they match the columns of my dataframe, and from there use an INSERT ... VALUES SQL statement to insert the values from the dataframe into the ODS table.
So far, these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config
df = pd.read_excel("Employee_data.xlsx")
conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
else:
    # only runs when the connection succeeded
    cursor = conn.cursor()  # cursor is a method, so it must be called
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])
    #instead of using df.columns I use:
    rows = [tuple(x) for x in df.values]
which gives me my ODS column names and lets me conveniently store the rows from the df in a list, but I'm at a loss for how to import them into the ODS. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {"col1": col1_value, "col2": col2_value})
but that would mean hard-coding everything, which wouldn't be scalable. I'm hoping to get some insight. It's difficult because the columns aren't one-to-one and there is some compression/collapsing of columns from the DF to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I always get the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is really strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of the columns that share the same name by running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
This gives me the list of columns that the two datasets share.
I also tried to use this as my engine:
engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con=engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
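One way to avoid hard-coding the column list is to build the INSERT statement from whichever columns the dataframe and the table share, and then pass all rows to `cursor.executemany`. The sketch below is not a complete solution: `common_columns` and `build_insert_sql` are hypothetical helper names, column names are matched exactly (as `np.intersect1d` does above), and it does not handle the columns that have to be collapsed or renamed between the DF and the ODS.

```python
def common_columns(df_columns, table_columns):
    """Columns present in both the dataframe and the table, in table order."""
    wanted = set(df_columns)
    return [c for c in table_columns if c in wanted]


def build_insert_sql(table, columns):
    """Build an INSERT with positional bind variables (:1, :2, ...)."""
    col_list = ", ".join(columns)
    binds = ", ".join(f":{i}" for i in range(1, len(columns) + 1))
    return f"INSERT INTO {table} ({col_list}) VALUES ({binds})"


# Usage with the objects from the question (not executed here):
# cols = common_columns(df.columns, col_names)
# sql = build_insert_sql("ODSMGR.EMPLOYEE_TABLE", cols)
# rows = list(df[cols].itertuples(index=False, name=None))
# cursor.executemany(sql, rows)
# conn.commit()
```

Table and column names are interpolated into the SQL text, so they should come from trusted metadata such as `cursor.description`, while the values themselves go through bind variables. Missing values (NaN) in the dataframe would probably need converting to `None` before the insert.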

SQL - Count All Cells In The Entire Table That Are Not NULL And Not Empty

I have recently been asked to do a count of all the cells in some tables that are not NULL and not empty/blank.
The issue is, I have about 80 tables and some of those tables have dozens of columns and others have hundreds of columns.
Is there a query I could use to count all cells, across all columns, that meet specific criteria (in this case, not NULL and not empty/blank)?
I have done some searching and it seems most answers revolve around single columns or tables that only have like 3-5 columns.
Thanks!
Try connecting to the database from pandas using the pymysql or pyodbc connector, then iterate over each column with a for loop and apply the count function to it.
import pymysql
import pandas as pd
con = pymysql.connect(host='[host name]', user='[user name]', password='[your password]', database='[database name]')
df = pd.read_sql('select * from [table name]', con) # SQL result converted to a pandas dataframe
print(df)
for col in df.columns: # loop through the columns
    count_ = df[col].count()
    print(count_) # count of non-NaN values (empty strings are still counted)
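If pulling every table into pandas is too heavy, the counting can also be pushed into SQL by generating one `SUM(CASE ...)` term per column from the table's metadata. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so that it runs anywhere. The function names are hypothetical, and the metadata lookups (`PRAGMA table_info`, `sqlite_master`) are SQLite-specific; on another database you would read the column list from its own catalog, such as `information_schema.columns`.

```python
import sqlite3

# One term per column: adds 1 for each cell that is neither NULL nor blank.
_TERM = """SUM(CASE WHEN "{c}" IS NOT NULL AND TRIM(CAST("{c}" AS TEXT)) <> '' THEN 1 ELSE 0 END)"""


def count_filled_cells(con, table):
    """Count non-NULL, non-blank cells over every column of one table."""
    cols = [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]
    if not cols:
        return 0
    terms = " + ".join(_TERM.format(c=c) for c in cols)
    result = con.execute('SELECT {} FROM "{}"'.format(terms, table)).fetchone()[0]
    return result or 0  # SUM over an empty table yields NULL


def count_filled_cells_all_tables(con):
    """Map each table name to its count of filled cells."""
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: count_filled_cells(con, t) for t in tables}


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("x", 1), ("", None), (None, 2), ("  ", 0)])
counts = count_filled_cells_all_tables(con)
print(counts)  # {'t': 4}
```

Because the query text is generated per table, the same loop covers tables with three columns or three hundred without any hand-written column lists.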

How do I swap two (or more) columns between two different data tables in pandas?

I'm new here and new to programming.
So, as the title says, I am trying to swap two full columns between two different files (the columns have the same name but different data). I started with this:
import pandas as pd
df = pd.read_csv('table1.csv')
df1 = pd.read_csv('table2.csv')
df1.COL1 = df.COL1
But now I am stuck. How do I select a whole column, and how can I write the new combined table to a new file (i.e. table 3)?
You could perform the swap by copying one column into a temporary one and deleting it afterwards, as follows:
df1['temp'] = df1['COL1']
df1['COL1'] = df['COL1']
df['COL1'] = df1['temp']
del df1['temp']
and then write the result to a third CSV via to_csv:
df1.to_csv('table3.csv')

There are three problems (load database, loop, and append Series)

Unlike when I started, I now find this problem more difficult than I thought.
I want to read a particular column from each table of a SQLite database, make it into a Series, and then combine the Series into a single data frame.
I have tried this, but it failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db") #my sqldb
tmplist = ['A003060','A003070'] #the db contains these tables; I decided to call
#only two for practice.
for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " %(i), con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
The result of that code is only something like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want is this:
   A003060  A003070
0   7150.0  14950.0
1   6770.0  15500.0
2   7450.0  15000.0
3   7240.0  14800.0
4   6710.0  14500.0
I asked a similar question before and got an answer, but that question used predefined variables. Here I must use a loop because I have to deal with a series of large databases. I have already tried other approaches using dataframe.append and transpose(), but they failed.
I would appreciate some small hints. Thank you.
To append pandas Series in a for loop, I think you can create a list, append each Series to it, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s " % i, con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)
df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)

Speed-up Pandas Dataframe insert into a Postgres DB using SQLAlchemy

I have a Postgres table with about 100k rows. I extracted this dataset and applied some transformations, resulting in a new pandas dataframe containing 100k rows. Now I want to load this dataframe as a new table in the database. I used to_sql with a SQLAlchemy connection to write the dataframe to a Postgres table. However, this is very slow and takes several hours. How can I use SQLAlchemy to speed up the dataframe insert, ideally from several hours down to a few seconds? Can someone help me with this?
I have searched through other similar questions on Stack Overflow. Most of them convert the data to a csv file and then use copy_from. I am looking for a solution that uses a SQLAlchemy bulk insert statement with a pandas dataframe.
Here is a small version of my code:
import sqlalchemy
url = 'postgresql://{}:{}@{}:{}/{}'
url = url.format(user, password, localhost, 5432, db)
con = sqlalchemy.create_engine(url, client_encoding='utf8')
# I have a dataframe named 'df' containing 100k rows. I use the following code to insert this dataframe into the database table.
df.to_sql(name='new_table', con=con, if_exists='replace')
Try the approach below if your pandas version is 0.24 or later.
# Alternative to_sql() method for DBs that support COPY FROM
import csv
from io import StringIO
def psql_insert_copy(table, conn, keys, data_iter):
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name
        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
chunksize = 10**4 # it depends on your server configuration; in my case 10**4 to 10**5 works well.
df.to_sql('tablename', con=con, if_exists='replace', method=psql_insert_copy, chunksize=chunksize)
If you use the psql_insert_copy method above and your PostgreSQL server is working normally, the insert should be very fast.
Here is my ETL speed: on average 280~300K tuples per batch (in seconds).