PySpark dataframe: remove duplicates in an AWS Glue script

I have a script in an AWS Glue ETL Job that reads an S3 bucket with a lot of parquet files, sorts by key1, key2 and a timestamp field, then deletes the duplicates and saves a single parquet file in another S3 bucket.
Here is the data I have before the Job runs:
key1        key2  uploadTimestamp
0005541779  10    2021-12-29 14:54:08.753
0005541779  10    2021-12-29 15:06:05.968
The code that does the sort and eliminates the duplicates:
#############################################################
tempDF = S3bucket_node1.toDF() #from Dynamic Frame to Data Frame
sortedDF = tempDF.orderBy(f.col("uploadTimestamp").desc(),"key1","key2").dropDuplicates(["key1","key2"]) #sort and remove duplicates
dynamicFrame = DynamicFrame.fromDF(sortedDF, glueContext, 'salesOrder') #back to Dynamic Frame
#############################################################
Take a look at this image of the data after the order by:
My problem:
In the output file, some rows got the last timestamp and some got the first. I can't understand why it doesn't work for all the data.
Thanks.

It worked with the following code:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

tempDF = S3bucket_node1.toDF()
w = Window.partitionBy("key1", "key2").orderBy(f.desc("uploadTimestamp"))
df = tempDF.withColumn("rn", f.row_number().over(w)).filter("rn = 1").drop("rn")  # keep only the newest row per (key1, key2)
dynamicFrame = DynamicFrame.fromDF(df, glueContext, 'dynamicFrame')
The tip to solve this was found here:
pyspark dataframe drop duplicate values with older time stamp
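Not part of the original answer, but for completeness, a minimal sketch of writing the deduplicated frame back to S3 as one parquet file through the Glue sink; the output path is a placeholder, and repartition(1) assumes the result fits comfortably in a single partition:

# Hypothetical sketch: collapse to one partition, then write a single parquet file.
singleFileDF = df.repartition(1)
outputFrame = DynamicFrame.fromDF(singleFileDF, glueContext, 'outputFrame')
glueContext.write_dynamic_frame.from_options(
    frame=outputFrame,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/sales-orders/"},  # placeholder path
    format="parquet",
)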

Related

Is there any other way to write to an S3 bucket using PySpark without creating multiple partitions while performing an aggregation on a DF?

I have a data frame DF1 with n columns; among those n columns are two columns called (storeid, Date). I am trying to take a subset of DF1 with those two columns, grouping by storeid and taking max(Date), as below:
df2 = (df1.select(['storeid','Date']).withColumn("Date", col("Date").cast("timestamp"))
       .groupBy("storeid").agg(max("Date")))
After that I am renaming the column max(Date), since writing to S3 will not allow brackets:
df2=df2.withColumnRenamed('max(Date)', 'max_date')
Now I am trying to write this to an S3 bucket, and it is creating a parquet file for each row; for example, if df2 has 10 rows it creates 10 parquet files instead of 1 parquet file.
df2.write.mode("overwrite").parquet('s3://path')
Can anyone please help me with this? I need df2 to be written as a single parquet file instead of many, with all the data in it in a table format.
If the DataFrame is not too big, you can try to repartition it to a single partition before writing:
df2.repartition(1).write.mode("overwrite").parquet('s3://path')
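An alternative worth noting (not from the original answer) is coalesce(1), which also produces a single output file and avoids the full shuffle that repartition triggers when only reducing the partition count:

df2.coalesce(1).write.mode("overwrite").parquet('s3://path')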

How do I properly upload a huge pandas dataframe into a PostgreSQL database?

I am trying to upload a very big pandas df (13230 rows x 2502 cols) into a postgres database. I am uploading the dataframe with the function df_to_sql but it gives me this error:
tables can have at most 1600 columns
Therefore I split the df into two dfs (13230 rows x 1251 cols each) with the idea of merging them later (see the sketch after this question). But when I try to upload the first df into the db I receive the following error:
row is too big: size 8168, maximum size 8160
How can I manage this? I would love to upload the df as a whole (13230 rows x 2502 cols) without merging it later.
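Purely as an illustration of the column split described above (the table names and the con engine are placeholders, and splitting alone does not address the per-row size limit), a minimal sketch:

# Hypothetical sketch: split the wide frame by column position before uploading.
df_left = df.iloc[:, :1251]     # first 1251 columns
df_right = df.iloc[:, 1251:]    # remaining columns
df_left.to_sql('wide_table_part1', con=con, if_exists='replace')   # con: a SQLAlchemy engine
df_right.to_sql('wide_table_part2', con=con, if_exists='replace')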

Speed-up Pandas Dataframe insert into a Postgres DB using SQLAlchemy

I have a postgres table with about 100k rows. I extracted this dataset and applied some transformations, resulting in a new pandas dataframe containing 100K rows. Now I want to load this dataframe as a new table in the database. I used to_sql to convert the dataframe to a postgres table using a SQLAlchemy connection. However, this is very slow and takes several hours. How can I use SQLAlchemy to speed up the dataframe insert into the database table? I want to increase the insert speed from several hours to a few seconds. Can someone help me with this?
I have searched through other similar questions on Stack Overflow. Most of them convert the data to a csv file and then use copy_from. I am looking for a solution using a SQLAlchemy bulk insert statement with a pandas dataframe.
Here is a small version of my code:
from sqlalchemy import create_engine

url = 'postgresql://{}:{}@{}:{}/{}'
url = url.format(user, password, 'localhost', 5432, db)
con = create_engine(url, client_encoding='utf8')

# I have a dataframe named 'df' containing 100k rows. I use the following code to insert this dataframe into the database table.
df.to_sql(name='new_table', con=con, if_exists='replace')
Try the method below if your pandas version is 0.24 or above (to_sql gained the method parameter in 0.24):
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """Alternative to_sql() method for DBs that support COPY FROM."""
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

chunksize = 10 ** 4  # depends on your server configuration; for my case 10**4 ~ 10**5 is OK
df.to_sql('tablename', con=con, if_exists='replace', method=psql_insert_copy, chunksize=chunksize)
If you use the psql_insert_copy method above and your PostgreSQL server is working normally, you should see very fast inserts.
Here is my ETL speed: on average 280~300K tuples per batch (in seconds).

KeyError: '3' when extracting data from a Pandas DataFrame

My code plan is as follows:
1) find csv files in a folder using glob and create a list of files
2) convert each csv file into a dataframe
3) extract data from a column location and convert it into a separate dataframe
4) append the new data into a separate summary csv file
The code is as follows:
import glob
import pandas as pd

Result = []

def result(filepath):
    files = glob.glob(filepath)
    print files
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i+1)
        selected_data = df['3'].ix[0:4]
        new_dfb[colname] = selected_data
        Result.append(new_dfb)
    folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
    new_dfb.to_csv(folder)

result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The code error is shown below. The issue seems to be with line 36, which corresponds to selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the dataframe constructor?
Your csv snippet is a bit unclear. But as suggested in the comments, read_csv (from_csv in this case) automatically takes the first row as a list of headers. The behaviour you appear to want is for the columns to be labelled 1, 2, 3, etc. To achieve this you need to have:
[pd.DataFrame.from_csv(f, index_col=None, header=None) for f in files]
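Note (an addition, not in the original answer): with header=None pandas assigns integer column labels, so the later selection would likely need to use the integer label rather than the string '3', along the lines of:

selected_data = df[3].ix[0:4]  # the column label is the integer 3 once header=None is used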

Storing .csv in HDF5 pandas

I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = r'C:\\Users\Documents\\'
(my dataframe with data is called 'export')
from pandas import HDFStore

hdf = HDFStore(loc2 + 'consolidated.h5')
hdf.put('raw', export, format='table', complib='blosc', complevel=9, data_columns=True, append=True)
21 columns and about 12 million rows so far and I will add about 1 million rows per month.
1 Date column [I convert this to datetime64]
2 Datetime columns (one of them for each row and the other one is null about 70% of the time) [I convert this to datetime64]
9 text columns [I convert these to categorical which saves a ton of room]
1 float column
8 integer columns, 3 of these can reach a max of maybe a couple of hundred and the other 5 can only be 1 or 0 values
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data, since I am receiving daily raw .csv files). I received errors showing that the dtypes did not match up for each column, although I used the exact same ipython notebook.
Is my hdf.put code correct? If I have append = True, does that mean it will create the file if it does not exist, but append the data if it does exist? I will basically be appending to this file every day.
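Purely as an illustration (not from the original post), a minimal sketch of a daily append, assuming the incoming frame has already been cast to the same dtypes as the stored table; the filename is a placeholder:

import pandas as pd

# Hypothetical sketch: append one day of data to the existing 'raw' table.
daily = pd.read_csv('daily_export.csv')   # placeholder filename
# cast columns to match the stored table's dtypes before appending, e.g.:
# daily['some_flag'] = daily['some_flag'].astype('int8')
hdf.append('raw', daily, format='table', data_columns=True)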
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space or should I keep it at int64? It looks like some of my columns are randomly float64 (although no decimals) and int64. I guess I need to specify the dtype for each column individually. Any tips?
I have no idea what blosc compression is. Is it the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to.
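On the dtype question above, a minimal hedged sketch of downcasting before the put; the column names are placeholders. int8 is enough for the 0/1 columns and int16 covers values up to a couple of hundred, so both are much smaller than int64:

# Hypothetical sketch: downcast small-range columns before writing to the store.
flag_cols = ['flag_a', 'flag_b']                                # placeholders for the 0/1 columns
export[flag_cols] = export[flag_cols].astype('int8')
export['small_count'] = export['small_count'].astype('int16')   # placeholder for a "couple of hundred" column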