Unable to write PySpark DataFrame into MySQL database [duplicate]

I am attempting to insert records into a MySQL table. The table contains id and name as columns.
I am doing the following in a PySpark shell:
name = 'tester_1'
id = '103'
import pandas as pd
l = [id,name]
df = pd.DataFrame([l])
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost/database_name',
    driver='com.mysql.jdbc.Driver',
    dbtable='DestinationTableName',
    user='your_user_name',
    password='your_password').mode('append').save()
I am getting the below attribute error
AttributeError: 'DataFrame' object has no attribute 'write'
What am I doing wrong? What is the correct method to insert records into a MySQL table from PySpark?

Use a Spark DataFrame instead of a pandas one, as .write is available only on Spark DataFrames.
So the final code could be:
data = [('103', 'tester_1')]  # a list of row tuples, not a flat list of values
df = sc.parallelize(data).toDF(['id', 'name'])
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost/database_name',
    driver='com.mysql.jdbc.Driver',
    dbtable='DestinationTableName',
    user='your_user_name',
    password='your_password').mode('append').save()
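As a side note (not from the original answer): if the data already lives in a pandas DataFrame, as in the question, it can be converted directly with spark.createDataFrame. A minimal sketch, assuming an active SparkSession named spark and the same connection options:
import pandas as pd

pdf = pd.DataFrame([['103', 'tester_1']], columns=['id', 'name'])
# createDataFrame accepts a pandas DataFrame and infers the schema from it
sdf = spark.createDataFrame(pdf)
sdf.write.format('jdbc').options(
    url='jdbc:mysql://localhost/database_name',
    driver='com.mysql.jdbc.Driver',
    dbtable='DestinationTableName',
    user='your_user_name',
    password='your_password').mode('append').save()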

Just to add to @mrsrinivas's answer:
Make sure the MySQL connector JAR is available in your Spark session. This code helps:
spark = SparkSession\
    .builder\
    .config("spark.jars", "/Users/coder/Downloads/mysql-connector-java-8.0.22.jar")\
    .master("local[*]")\
    .appName("pivot and unpivot")\
    .getOrCreate()
Otherwise Spark will throw an error because it cannot find the JDBC driver class.
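Alternatively (my own addition, not part of the original answer), the connector can be pulled from Maven at session start via spark.jars.packages, so no local JAR path is needed; the version below is only illustrative:
from pyspark.sql import SparkSession

# Spark resolves and downloads the Maven coordinates on startup
spark = SparkSession\
    .builder\
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.22")\
    .master("local[*]")\
    .appName("mysql write example")\
    .getOrCreate()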

Related

Synapse Analytics Pyspark: TypeError: list object is not callable

I need to create a new dataframe in Synapse Analytics using column names from another dataframe. The new dataframe will have just one column (column header: col_name), and the column names from the other dataframe are the cell values. Here's my code:
df1 = df.columns
colName = []
for e in df1:
    list1 = [e]
    colName.append(list1)
col = ['col_name']
df2 = spark.createDataFrame(colName, col)
display(df2)
The output dataframe is created as expected (one row per column name). With it, I can run the following count, display, or withColumn commands:
df2.count()
df2=df2.withColumn('index',lit(1))
But when I run the filter command below, I end up with a 'list' object is not callable error message:
display(df2.filter(col('col_name')=='dob'))
I am just wondering if anyone knows what I am missing and how I can solve this. In the end I'd like to add a conditional column based on the value in the col_name column.
The problem is that you have two objects called col.
You did this :
col=['col_name']
therefore, when you do this :
display(df2.filter(col('col_name')=='dob'))
you are no longer calling pyspark.sql.functions.col but the list ['col_name'], hence TypeError: 'list' object is not callable.
Simply replace it here:
# display(df2.filter(col('col_name')=='dob'))
from pyspark.sql import functions as F
display(df2.filter(F.col('col_name')=='dob'))
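Since the question mentions adding a conditional column based on the value in col_name, a short sketch using the same aliased functions import (the new column name and labels are only illustrative):
from pyspark.sql import functions as F

# tag the 'dob' row and give every other row a default label
df2 = df2.withColumn(
    'col_type',
    F.when(F.col('col_name') == 'dob', 'date_of_birth').otherwise('other')
)
display(df2)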

Writing a scalable INSERT statement using cx_Oracle

I am attempting to write a script that will allow me to insert values from an uploaded dataframe into a table inside an Oracle DB, but my issue lies with:
too many columns to hard-code
columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check to see if they sync with the columns of my dataframe and from there use an INSERT VALUES sql statement to input the values from the dataframe to the ODS table.
So far these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config

df = pd.read_excel("Employee_data.xlsx")

conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()  # cursor() is a method and must be called
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])
    # instead of using df.columns I use:
    rows = [tuple(x) for x in df.values]
which prints my ODS column names and lets me conveniently store the rows from the df in an array, but I'm at a loss for how to import these into the ODS. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that would mean I'd have to hard-code everything, which wouldn't be scalable. I'm hoping I can get some insight to help. It's just difficult since the columns aren't 1-to-1 and there is some compression/collapsing of columns from the df to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I am always given the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is really strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of columns that share the same name; so after running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I was able to get a list of columns that the two datasets share.
I also tried to use this as my engine:
engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con=engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
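There is no accepted answer in this excerpt, but a minimal sketch of the dynamic INSERT the question is aiming for, built from the common-column list a computed in the edit, might look like the following. It assumes the shared columns map 1-to-1; any collapsed or derived ODS columns would still need an explicit mapping. It uses cx_Oracle's numbered bind variables and executemany for batching:
common_cols = list(a)  # columns present in both df and ODSMGR.EMPLOYEE_TABLE
col_list = ", ".join(common_cols)
bind_list = ", ".join(":{}".format(i + 1) for i in range(len(common_cols)))
insert_sql = "INSERT INTO ODSMGR.EMPLOYEE_TABLE ({}) VALUES ({})".format(col_list, bind_list)

# restrict the dataframe to the shared columns and convert to a list of row tuples
rows = list(df[common_cols].itertuples(index=False, name=None))

cursor = conn.cursor()
cursor.executemany(insert_sql, rows)  # one round trip per batch instead of one per row
conn.commit()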

How to convert a Python generator to a pandas dataframe

I'm very new to Python and pandas dataframes, and I'm struggling to wrap my head around how to convert a Python generator to a pandas dataframe.
What I want to do is fetch a large table in chunks with this generator function:
def fetch_data_into_chunks(cursor, arraysize=10**5):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
Then I want to append or concat the results to a pandas dataframe:
for data in fetch_data_into_chunks(cursor):
    df.append(data)
But this doesn't work and gives me the error message:
TypeError: cannot concatenate object of type "<class 'pyodbc.Row'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Thanks for the help!
Assuming you have a connection to a SQL database, you can use pandas' built-in read_sql method and specify a chunksize. This returns an iterator of dataframes, which you can loop through to build a single dataframe.
In this example, sql is your sql query and conn is the connection to your database.
def fetch_data(sql, chunksize=10**5):
    df = pd.DataFrame()
    reader = pd.read_sql(sql,
                         conn,
                         chunksize=chunksize)
    for chunk in reader:
        df = pd.concat([df, chunk], ignore_index=True)
    return df
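If you would rather keep the original fetchmany generator from the question, a minimal sketch (column names are pulled from cursor.description; everything else is as in the question):
import pandas as pd

# pd.DataFrame.from_records accepts an iterator of row tuples, so the
# generator can be consumed directly; pyodbc.Row objects behave like tuples
columns = [desc[0] for desc in cursor.description]
df = pd.DataFrame.from_records(fetch_data_into_chunks(cursor), columns=columns)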

Speed-up Pandas Dataframe insert into a Postgres DB using SQLAlchemy

I have a postgres table with about 100k rows. I extracted this dataset and applied some transformations, resulting in a new pandas dataframe containing 100k rows. Now I want to load this dataframe as a new table in the database. I used to_sql to convert the dataframe to a postgres table using a SQLAlchemy connection. However, this is very slow and takes several hours. How can I use SQLAlchemy to speed up the dataframe insert into the database table? I want to increase the insert speed from several hours to a few seconds. Can someone help me with this?
I have searched through other similar questions on Stack Overflow. Most of them convert the data to a CSV file and then use copy_from. I am looking for a solution using a SQLAlchemy bulk insert with a pandas dataframe.
Here is a small version of my code:
import sqlalchemy
url = 'postgresql://{}:{}@{}:{}/{}'
url = url.format(user, password, 'localhost', 5432, db)
con = sqlalchemy.create_engine(url, client_encoding='utf8')
# I have a dataframe named 'df' containing 100k rows. I use the following code to insert this dataframe into the database table.
df.to_sql(name='new_table', con=con, if_exists='replace')
Try the approach below if your pandas version is 0.24 or above:
# Alternative to_sql() method for DBs that support COPY FROM
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

chunksize = 10 ** 4  # it depends on your server configuration; for my case 10**4 ~ 10**5 is OK
df.to_sql('tablename', con=con, if_exists='replace', method=psql_insert_copy, chunksize=chunksize)
If you use the psql_insert_copy method above and your PostgreSQL server is working normally, you should enjoy a dramatic speed-up.
Here is my ETL speed: on average 280~300K tuples per batch (in seconds).
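For databases without COPY support (or if COPY is not an option), pandas' to_sql also accepts method='multi', which batches many rows into each INSERT statement. A small sketch reusing the con and df names from the question; it is slower than COPY but still much faster than the default one-row-per-INSERT behaviour:
df.to_sql('new_table', con=con, if_exists='replace', method='multi', chunksize=1000)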

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta section of a data column inside hdf5 can be used as a hack solution:
import pandas as pd
store = pd.HDFStore('store.h5')
print(store.groups())
# read the values recorded under the column's /meta node and use them in the where filter
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
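If the table fits in memory, a simpler (but non-lazy) alternative is to filter after loading; a small sketch assuming the same store and column names:
import pandas as pd

store = pd.HDFStore('store.h5')
# load the whole table, then keep only rows where the column is non-null
df = store.select('df')
df = df[df['my_data_column'].notnull()]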