Spark Streaming: foreachRDD insert into mongoDB using python? - pymongo

Please help me write the insertMongo function for a Spark Streaming job. It is a word count program:
import pymongo_spark
........
counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts = counts.map(lambda x: {"word": x[0], "count": int(x[1]), "ts": str(uuid1())})

def insertMongo(time, rdd):
    rdd.saveToMongoDB('mongodb://localhost:27017/telematics.test')

counts.foreachRDD(insertMongo)
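A minimal sketch of one way this could work, assuming the mongo-hadoop pymongo_spark package is installed: pymongo_spark.activate() has to be called once so that saveToMongoDB is patched onto RDDs, and empty micro-batches can be skipped inside foreachRDD. The connection URI is the one from the question; everything else is an assumption.
import pymongo_spark

# patch RDDs with saveToMongoDB (assumes the mongo-hadoop pymongo_spark package is installed)
pymongo_spark.activate()

def insertMongo(time, rdd):
    # skip empty micro-batches to avoid needless writes
    if not rdd.isEmpty():
        rdd.saveToMongoDB('mongodb://localhost:27017/telematics.test')

counts.foreachRDD(insertMongo)
If pymongo_spark is not available, an alternative is to open a pymongo.MongoClient inside rdd.foreachPartition and call insert_many on the dicts of each partition.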

Related

Writing a scalable INSERT statement using cx_Oracle

I am attempting to write a script that inserts values from an uploaded dataframe into a table inside an Oracle DB, but my issue lies with:
too many columns to hard-code
columns that aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they sync with the columns of my dataframe, and from there use an INSERT ... VALUES SQL statement to load the values from the dataframe into the ODS table.
So far these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config

df = pd.read_excel("Employee_data.xlsx")

conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])
    # instead of using df.columns I use:
    rows = [tuple(x) for x in df.values]
This gives me my ODS column names and lets me conveniently store the rows from the df in a list, but I'm at a loss for how to load these into the ODS. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that would mean I'd have to hard-code everything, which wouldn't be scalable. I'm hoping I can get some insight to help. It's just difficult since the columns aren't 1-to-1 and there is some compression/collapsing of columns from the DF to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I always get the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of columns that share the same name. After running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I was able to get a list of columns that the two datasets share.
I also tried to use this as my engine:
engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con=engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
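A minimal sketch of one way to avoid hard-coding the column list, assuming the common columns found with np.intersect1d map one-to-one onto ODS columns of the same name (the table name comes from the question; the mapping and the lack of NaN handling are assumptions):
import numpy as np

# columns present in both the dataframe and the ODS table (as in EDIT 1)
common_cols = list(np.intersect1d(df.columns, col_names))

# build the INSERT statement dynamically from the common columns
col_list = ", ".join(common_cols)
bind_list = ", ".join(":{}".format(i + 1) for i in range(len(common_cols)))
sql = "INSERT INTO ODSMGR.EMPLOYEE_TABLE ({}) VALUES ({})".format(col_list, bind_list)

# one plain tuple per dataframe row, restricted to the common columns
rows = list(df[common_cols].itertuples(index=False, name=None))

cursor.executemany(sql, rows)
conn.commit()
Columns that collapse or don't map one-to-one would still have to be derived as new dataframe columns (named after their ODS targets) before building common_cols.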

Databricks Python Optimization

I need your help please. I have a simple Python script that lists all the fields in the tables across all the databases on Databricks; there are nearly 90 tables, and I would like to save the result in a txt or csv file. The code below works, but it takes 8 hours to finish, which is far too long. How can I optimize it, or is there another way to make it faster?
# table containing all names of databases in databricks
#df_tables = spark.sql("SELECT * FROM bd_xyh_name")
#DynoSQL is a string table for the result in txt

def discribe():
    try:
        for i in df_tables.collect():
            showTables = """show tables in {};""".format(i.nombd)
            df1 = spark.sql(showTables)
            for j in df1.collect():
                describeTable = """describe table {0}.{1};""".format(j.database, j.tableName)
                df2 = spark.sql(describeTable)
                #df3 = df2.collect()
                df3 = df2.rdd.toLocalIterator()
                for k in df3:
                    #df = df2.select(df2.col_name, k.data_type)
                    #spark.sql("insert into NewTable VALUES (" + j.database + ";" + j.tableName + ";" + k.col_name + ";" + k.data_type + ");")
                    spark.sql("insert into DynoSQL select \"" + j.database + ";" + j.tableName + ";" + k.col_name + ";" + k.data_type + "\"")
                    #request = """insert into NewTable VALUES ({};{};{};{});""".format(j.database, j.tableName, k.col_name, k.data_type)
                    #spark.sql(request)
    except:
        raise
You can try the logic below.
Logic:
Get the available databases within the workspace and make a list.
Iterate over the database names, get the available tables within each database, and write them into a temp table. (The temp table should be created as a managed table.)
Advantage: with this logic, only one database is processed at a time, and if the process fails, we can restart from the failing database instead of reprocessing the whole workspace.
Code Snippet:
from pyspark.sql.functions import col, concat, lit

df = spark.sql("show databases")
db_names = [x["databaseName"] for x in df.collect()]

for x in db_names:
    spark.sql(f"use {x}")
    df1 = spark.sql("show tables")
    # write the current database's table list into the managed temp table
    # (the target table's schema must match the output of "show tables")
    df1.write.insertInto("writeintotable")
    display(df1)
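If the end goal is still a txt or csv file, the populated temp table could then be exported once the loop finishes. A minimal sketch, assuming the table name from the snippet above and a hypothetical output path:
# read the managed temp table back and write it out as a single CSV file
result_df = spark.table("writeintotable")
result_df.coalesce(1).write.mode("overwrite").option("header", True).csv("/tmp/all_tables_csv")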

psycopg2.errors.InvalidTextRepresentation while using COPY in postgresql

I am using a custom callable with pandas.to_sql(). The snippet below is from the pandas documentation for using it:
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
but while using this copy functionality, I am getting the error
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for integer: "3.0"
This is not a problem with the input, as the same table schemas and values were working initially when I used the to_sql() function without the custom callable psql_insert_copy(). I am using a SQLAlchemy engine for getting the connection cursor.
I would recommend using string fields in the table for such actions, or writing the entire SQL script manually, specifying the types of the table fields.
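Another angle: if the offending column is an integer in Postgres but ends up as a float in the dataframe (which is how "3.0" reaches the CSV stream that COPY parses), casting it back to an integer dtype before calling to_sql can avoid the error. A minimal sketch; the column name some_int_col and table name my_table are hypothetical, and it assumes the column has no missing values:
# hypothetical column: integer in Postgres, float in the dataframe
df["some_int_col"] = df["some_int_col"].astype(int)  # with NaNs, a nullable "Int64" dtype plus NULL handling would be needed

df.to_sql("my_table", engine, if_exists="append", index=False, method=psql_insert_copy)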

Fastest way to iterate Pyarrow Table

I am using the PyArrow library for optimal storage of a pandas DataFrame. I need to process a PyArrow Table row by row as fast as possible without converting it to a pandas DataFrame (it won't fit in memory). Pandas has iterrows()/itertuples() methods. Is there any fast way to iterate over a PyArrow Table other than a for loop with index addressing?
This code worked for me:
for batch in table.to_batches():
    d = batch.to_pydict()
    for c1, c2, c3 in zip(d['c1'], d['c2'], d['c3']):
        # Do something with the row of c1, c2, c3
        pass
If you have a large parquet data set split into multiple files, this seems reasonably fast and memory-efficient.
import argparse
import pyarrow.parquet as pq
from glob import glob

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('parquet_dir')
    return parser.parse_args()

def iter_parquet(dirpath):
    for fpath in glob(f'{dirpath}/*.parquet'):
        tbl = pq.ParquetFile(fpath)
        for group_i in range(tbl.num_row_groups):
            row_group = tbl.read_row_group(group_i)
            for batch in row_group.to_batches():
                for row in zip(*batch.columns):
                    yield row

if __name__ == '__main__':
    args = parse_args()
    total_count = 0
    for row in iter_parquet(args.parquet_dir):
        total_count += 1
    print(total_count)
The software is not optimized at all for this use case at the moment. I would recommend using Cython or C++ if you need to interact with the data row by row. If you have further questions, please reach out on the developer mailing list dev@arrow.apache.org.

Speed-up Pandas Dataframe insert into a Postgres DB using SQLAlchemy

I have a Postgres table with about 100k rows. I extracted this dataset and applied some transformations, resulting in a new pandas dataframe containing 100k rows. Now I want to load this dataframe as a new table in the database. I used to_sql to convert the dataframe to a Postgres table using a SQLAlchemy connection. However, this is very slow and takes several hours. How can I use SQLAlchemy to speed up the dataframe insert into the database table? I want to go from several hours to a few seconds. Can someone help me with this?
I have searched through other similar questions on Stack Overflow. Most of them convert the data to a csv file and then use copy_from. I am looking for a solution that uses a SQLAlchemy bulk insert statement with a pandas dataframe.
Here is a small version of my code:
from sqlalchemy import create_engine

url = 'postgresql://{}:{}@{}:{}/{}'
url = url.format(user, password, localhost, 5432, db)
con = create_engine(url, client_encoding='utf8')

# I have a dataframe named 'df' containing 100k rows. I use the following code to insert this dataframe into the database table.
df.to_sql(name='new_table', con=con, if_exists='replace')
Try the approach below if your pandas version is 0.24 or above.
Alternative to_sql() method for DBs that support COPY FROM:
import csv
from io import StringIO
def psql_insert_copy(table, conn, keys, data_iter):
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
chunksize = 10 ** 4  # it depends on your server configuration; for my case 10**4 ~ 10**5 is OK.
df.to_sql('tablename', con=con, if_exists='replace', method=psql_insert_copy, chunksize=chunksize)
If you use the psql_insert_copy method above and your PostgreSQL server is working normally, you should enjoy blazing speed.
Here is my ETL speed: on average 280~300K tuples per batch (in seconds).