Insert from PySpark into Redshift taking too long - amazon-s3

I'm trying to insert some data into Redshift, but it's taking too long to execute.
I'm running locally with 16 GB of RAM and 12 cores.
The snippet looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.11:4.0.1")
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
# Plain JDBC write to Redshift
my_dataframe.write.format('jdbc').options(
    url=f"jdbc:redshift://{HOST_REDSHIFT}:{PORT_REDSHIFT}/{DATABASE_REDSHIFT}",
    driver="com.amazon.redshift.jdbc42.Driver",
    dbtable="dim_customer_pyspark",
    user=USER_REDSHIFT,
    password=PASSWORD_REDSHIFT,
    batchsize=1000).mode("append").save()
My dataframe has only 2 partitions.
I also have an S3 bucket, and tried adding the tempdir option:
tempdir="https://my_bucket.s3.my_region.amazonaws.com/"
But I'm not sure if it's working; I'm not getting any errors when using this option.
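If I understand correctly, a write that goes through the spark-redshift-community connector itself (rather than plain JDBC) would look roughly like the sketch below, with an s3a:// tempdir; the bucket name, credential setup, and exact options here are my assumptions rather than something I have verified:
# Sketch only: assumes hadoop-aws / S3A credentials are already configured for this Spark session
my_dataframe.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", f"jdbc:redshift://{HOST_REDSHIFT}:{PORT_REDSHIFT}/{DATABASE_REDSHIFT}?user={USER_REDSHIFT}&password={PASSWORD_REDSHIFT}") \
    .option("dbtable", "dim_customer_pyspark") \
    .option("tempdir", "s3a://my_bucket/temp/") \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()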
The problem is that it's taking up to 10 minutes to insert 2000 rows.
Could someone help me here?
Thank you in advance!

Related

Creating a table in BigQuery by uploading a CSV

I am new to BigQuery. I am trying to create a table by uploading a CSV; its size is 290 KB. Even if I fill in all the required information, the three dots beside "create table" keep moving (like loading), but even after waiting for a long time the table doesn't get created.
You can upload the CSV to a bucket and then reference it from the BigQuery table-creation panel.
Here is the official guide from Google, with screenshots; it should be rather simple: https://cloud.google.com/bigquery/docs/schema-detect
In step 4 of the guide, select the path to the file and the CSV format.
In step 5 you can either keep everything as it is or select "External table" (which I recommend), so you can delete the table in case of error without losing the CSV.
BigQuery should automatically handle the rest. Please share more detailed information in case of error.
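If you prefer to script that same flow instead of using the console, a minimal sketch with the BigQuery Python client could look like this (the bucket, project, dataset, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery detect the schema, as in the console flow
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my_file.csv",      # the CSV you uploaded to the bucket
    "my-project.my_dataset.my_table",  # destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish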
There are a couple of ways to upload a CSV file to BigQuery, as given below:
Write an Apache Beam pipeline (Python/Java) to read the data and load it into BigQuery; you can combine sample code for reading a CSV and writing to BigQuery (a rough sketch is shown after the script below).
Write a Python script that loads the data into BigQuery, for example:
import os
import pandas as pd
# Read the CSV into a dataframe
dept_dt = pd.read_csv('dept_data')
#print(dept_dt)
# Replace with your project ID
project = 'xxxxx-aaaaa-de'
# Replace with your service account key path
path_service_account = 'xxxxx-aaaa-jhhhhh9874.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_service_account
# Load the dataframe into BigQuery (dataset.table); fails if the table already exists
dept_dt.to_gbq(destination_table='test1.Emp_data1', project_id=project, if_exists='fail')
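For the Apache Beam option mentioned above, a minimal Python sketch might look like the following; the project, bucket, file, and column names are made up for illustration, and it assumes a simple two-column CSV:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All project/bucket/table names below are placeholders
options = PipelineOptions(
    project='my-project',
    temp_location='gs://my-bucket/tmp',
    region='us-central1',
    runner='DataflowRunner',  # or 'DirectRunner' for a local test
)

def parse_line(line):
    # Assumes a CSV with exactly two columns: dept_id,dept_name
    dept_id, dept_name = line.split(',')
    return {'dept_id': dept_id, 'dept_name': dept_name}

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/dept_data.csv', skip_header_lines=1)
     | 'ParseRows' >> beam.Map(parse_line)
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
         'my-project:test1.dept_data',
         schema='dept_id:STRING,dept_name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))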

Python bulk insert to Teradata? Default is too slow

I was asked for a Python script to read a file, load it into a dataframe, and write it to a table in Teradata. It works, but it takes 3-4 minutes to write a table that's 300 rows.
Now that's a tiny job for a data warehouse, and our Teradata works fine when handling massive data sets, but I find it's such a drag waiting 3 minutes for this script to run. I don't believe it's a system issue.
Is there a better way to load small to medium size tables into Teradata? If we did this in SQL Server it would take seconds but some other data we need to access is already there in Teradata.
The main code looks like this:
import pandas as pd
from sqlalchemy import create_engine
# SQLAlchemy URL for the Teradata dialect with LDAP authentication
conn_string = 'teradata://' + user + ':' + passw + '@' + host + '/?authentication=LDAP'
eng = create_engine(conn_string)
# Insert the dataframe into the TEMP schema, replacing the table if it already exists
df.to_sql(table_name, con=eng, if_exists='replace',
          schema='TEMP', index=False, dtype=df_datatypes)
The above works but when I added the method='multi' parameter to to_sql, I got the error:
CompileError: The 'teradata' dialect with current database version settings does not
support in-place multirow inserts.
Then I added chunksize=100 and I got this error:
DatabaseError: [HY000] [Teradata][ODBC Teradata Driver][Teradata Database](-9719)QVCI feature is disabled.
Is there a way to speed this up? Should I do something outside of Python altogether?
Pedro, if you are trying to load one row at a time through ODBC, you are approaching the problem like trying to fill a swimming pool through a straw.
Teradata is, and has always been, a parallel platform. You need to use bulk loaders or, if you are a Python fan, teradataml (code below).
By the way, I loaded 728,000 records in 12 seconds on my laptop. When I built a Teradata system in Azure, I was able to load the same 728,000 records in 2 seconds using the same code below:
# -*- coding: utf-8 -*-
import pandas as pd
from teradataml import create_context, copy_to_sql, remove_context
# Connect to the Teradata system
con = create_context(host=<host name or IP address>, username=<username>, password=<password>, database='<whatever default database you want>')
colnames = ['storeid', 'custnumber', 'invoicenumber', 'amount']
dataDF = pd.read_csv('datafile.csv', delimiter=',', names=colnames, header=None)
copy_to_sql(dataDF, table_name='store', if_exists='replace')  # this loads the dataframe from your client machine into the Teradata system
remove_context()
Pedro, I can empathize with you. I work at an organization that has massive data warehouses in Teradata, and it's next to impossible to get it to work well. I developed a Python (3.9) script that uses the teradatasql package to interface Python/SQLAlchemy with Teradata, and it works reasonably well.
This script reflects a PostgreSQL database and, table by table, pulls the data from PostgreSQL, saves it as an object in Python, then uses the reflected PostgreSQL tables to create comparable versions in Teradata, and finally pushes the retrieved data into those tables.
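For what it's worth, a rough sketch of that reflect-and-copy flow with SQLAlchemy might look like the following; the connection strings are placeholders, and it assumes the reflected PostgreSQL column types compile cleanly for the Teradata dialect, which is not guaranteed:
from sqlalchemy import create_engine, MetaData, select

# Placeholder connection strings
pg_eng = create_engine('postgresql+psycopg2://user:password@pg_host/source_db')
td_eng = create_engine('teradatasql://user:password@td_host')

src_meta = MetaData()
src_meta.reflect(bind=pg_eng)  # reflect every table in the PostgreSQL schema

for name, src_table in src_meta.tables.items():
    # Pull the source rows into memory
    with pg_eng.connect() as pg_conn:
        rows = [dict(r._mapping) for r in pg_conn.execute(select(src_table))]

    # Create a comparable table on the Teradata side
    dst_table = src_table.to_metadata(MetaData())
    dst_table.create(bind=td_eng, checkfirst=True)

    # Push the retrieved data
    if rows:
        with td_eng.begin() as td_conn:
            td_conn.execute(dst_table.insert(), rows)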
I've quality-checked the script, and here's what I get:
table reflection time in PostgreSQL: 0.95 seconds on average
longest PostgreSQL data retrieval: 4.22 seconds
shortest time to create mirrored tables in Teradata: 12.10 seconds
shortest time to load data to Teradata: 8.39 seconds
longest time to load data to Teradata: approx. 7,212 seconds
The point here is that Teradata is a miserable platform that's ill-equipped for business-critical tasks. I've demonstrated to senior leadership the performance of Teradata against five other ODBC-accessible platforms, and Teradata was far and away the least capable.
Maybe things are better on version 17; however, we're on version 16.20.53.30, which also doesn't support in-place multirow inserts. I'd recommend your organization stop using Teradata and move to something more capable, like Snowflake. Snowflake has EXCELLENT Python support and a great CLI.

Trying to fetch data from DB2 to HBase using Sqoop is very slow

Thanks in advance.
I have been trying to import data from DB2 into an HBase table using Sqoop, and it is taking a very long time even to initiate the map and reduce tasks. I can see only Map 0 and Reduce 0 the whole time.
I can run the same query in DB2 and the results come back faster than I expected, but when I import the same data into HBase it takes a very long time (10 hours). I created sample data in DB2 (150 records) and tried to import it into HBase, and it still takes the same amount of time.
sqoop import --connect jdbc:db2://{hostname}:50001/databasename --username user --password pass --hbase-create-table --hbase-table new_tbl --column-family abc --hbase-row-key=same --query "select a,b,c,d,e, concat(a,e) from table_name where \$CONDITIONS AND a>='2018-08-01 00:00:01' and b<='2018-08-01 00:00:02'" -m 1
I tried adjusting all of the following configurations:
yarn.nodemanager.resource.memory-mb=116800
yarn.scheduler.minimum-allocation-mb=4096
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
mapreduce.map.java.opts=-Xmx3072m
mapreduce.reduce.java.opts=-Xmx6144m
yarn.nodemanager.vmem-pmem-ratio=2.1
On the Sqoop side, I have tried tweaking the query as well as a few configurations:
-m 4 creates some inconsistency in the records
Removing the filter (the timestamps on a and b) still takes a long time (10 hours)
The HBase performance test results are pretty good:
HBase Performance Evaluation
Elapsed time in milliseconds=705914
Row count=1048550
File Input Format Counters
Bytes Read=778810
File Output Format Counters
Bytes Written=618
real 1m29.968s
user 0m10.523s
sys 0m1.140s
It is hard to make suggestions unless you show the sample data and data types. Extra mappers will work correctly and efficiently only when you have a fair distribution of records among the mappers. If you have a primary key available in the table, you can give it as the split column, and the mappers will distribute the workload equally and start fetching slices in a balanced way. While the job is running, you can also see the split-key distribution and record counts in the log itself.
If your cluster does not have enough memory for the requested resources, the job may take longer, and sometimes it sits in submit mode for a long time because YARN cannot allocate memory to run it.
Instead of going straight to HBase, you can first try the import with HDFS as the storage location, check the performance, and look at the job details to understand the MapReduce behaviour.

PySpark dataframe join taking a very long time

I have 2 dataframes in pyspark that I loaded from a hive database using 2 sparksql queries.
When I try to join the 2 dataframes using df1.join(df2, df1.id_1 == df2.id_2), it takes a long time.
Does Spark re-execute the SQL for df1 and df2 when I call the join?
The underlying database is Hive.
PySpark can be slower than Scala when data has to be serialized between the Python process and the JVM and the work is done in Python, for example with Python UDFs.
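As an illustrative sketch of where that serialization cost shows up (the dataframe and column names here are made up):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf_vs_builtin").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# A Python UDF sends every row through the Python worker, paying the
# JVM <-> Python serialization cost on each record.
slow_len = F.udf(lambda s: len(s), IntegerType())
df.withColumn("name_len", slow_len("name")).show()

# The equivalent built-in function runs entirely inside the JVM.
df.withColumn("name_len", F.length("name")).show()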

Saving/Exporting the results of a Spark SQL Zeppelin query

We're using Apache Zeppelin to analyse our datasets. We have some queries we would like to run that return a large number of results, and we would like to run them in Zeppelin but save the results somewhere, since the display is limited to 1000 rows. Is there an easy way to get Zeppelin to save all the results of a query to an S3 bucket, maybe?
I managed to whip up a notebook that effectively does what I want using the Scala interpreter:
z.load("com.databricks:spark-csv_2.10:1.4.0")
val df= sqlContext.sql("""
select * from table
""")
df.repartition(1).write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("s3://amazon.bucket.com/csv_output/")
It's worth mentioning that the z.load function seemed to work for me one day, but then I tried it again and for some reason I had to declare it in its own paragraph with the %dep interpreter, and then run the remaining code in the standard Scala interpreter.