Pandas join is slow - pandas

Edit (Oct 16 2017): I think I found the problem; it seems to be a bug in pandas core. It can't merge/join anything over 145k rows, while 144k rows it handles without an issue. Pandas version 0.20.3, running on Fedora 26.
----Original post----
I have a medium-sized amount of data to process (about 200k rows with about 40 columns). I've optimised a lot of the code, but the only trouble I have now is joining the columns.
I receive the data in an unfortunate structure and need to extract the data in a certain way, then put it all into a dataframe.
Basically I extract 2 arrays at a time (each 200k rows long). One array is the timestamp, the other array is the values.
Here I create a dataframe, and use the timestamp as the index.
When I extract the second block of data, I do the same and create a new dataframe using the new values + timestamp.
I need to join the two dataframes on the index. The timestamps can be slightly different, so I join with how='outer' to keep the new timestamps. Basically I follow the documentation below.
result = left.join(right, how='outer')
https://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
This, however, is way too slow. I left it for about 15 minutes and it still hadn't finished processing, so I killed the process.
Can anyone help? Any hints/tips?
edit:
It's a work thing, so I can't give out the data, sorry. But it's just two long dataframes, each with a timestamp as the index and a single column for the values.
The code is just as described above.
data_df.join(variable_df, how='outer')

I forgot to answer this. It's not really a bug in pandas.
The timestamp was a nanosecond timestamp, and joining on an index of those was causing a massive slowdown. Basically it was better to join on a column instead, which made it all much faster.
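To make that concrete, here is a minimal sketch of the workaround; the data, sizes and column names are made up for illustration, and only the pattern of merging on a column rather than a nanosecond-resolution index comes from the post above:
import numpy as np
import pandas as pd
n = 200000
ts = pd.Series(pd.date_range('2017-01-01', periods=n, freq='ns'))
data_df = pd.DataFrame({'ts': ts, 'value_a': np.random.rand(n)})
variable_df = pd.DataFrame({'ts': ts + pd.Timedelta('1ns'), 'value_b': np.random.rand(n)})
# Instead of setting 'ts' as the index and calling .join(how='outer'),
# merge on the column directly, which the author reports was much faster.
result = data_df.merge(variable_df, on='ts', how='outer')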

Related

adding to multiindexed pandas dataframe by specific indices is incredibly slow

I have a rather large data frame (~400M rows). For each row, I calculate which elements in another data frame (~200M rows) I need to increment by 1. This second data frame is indexed by three parameters. So the line I use to do this looks like:
total_counts[i].loc[zipcode,j,g_durs[j][0]:g_durs[j][-1]+1] += 1
The above line works, but it appears to be a horrendous bottleneck and takes almost 1 second to run. Unfortunately, I cannot wait 11 years for this to finish.
Any insights on how I might speed up this operation? Do I need to split up the 200M row data frame into an array of smaller objects? I don't understand exactly what is making this so slow.
Thanks in advance.
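For context, a hypothetical reconstruction of the pattern being described might look like the sketch below; the index level names and sizes are invented, and only the shape of the .loc call mirrors the question:
import numpy as np
import pandas as pd
# Counts frame indexed by three parameters (all values made up)
idx = pd.MultiIndex.from_product(
    [[10001, 90210], [0, 1, 2], range(10)],
    names=['zipcode', 'j', 'dur'])
counts = pd.DataFrame({'n': np.zeros(len(idx), dtype=int)}, index=idx)
# One increment of the kind described: a label-based slice on the third level
# for a fixed (zipcode, j) pair. Label-based MultiIndex lookups carry a lot of
# per-call overhead, which adds up when repeated hundreds of millions of times.
counts.loc[(10001, 1, slice(3, 7)), 'n'] += 1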

Output Dataframe to CSV File using Repartition and Coalesce

Currently I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe with only one partition to a single CSV file. I discovered several different outcomes depending on how I used repartition.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of globally.
Then I tried dropping the repartition call, but the output contained only part of the records. I realised that without repartition Spark will output 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
What I did next was to place repartition(1), repartition(1, "column of partition") and repartition(20) before orderBy. Yet the output remained the same: 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition or coalesce, or how the processes above affect the output. I'd be grateful if someone could elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like the sort in your case, Spark triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you and did not find any elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions. And every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (things like grouping, joins or window operations).
You can set the number of shuffle partitions in the Spark config (spark.sql.shuffle.partitions); the default is 200.
Calling repartition before a group-by operation is kind of pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupBy operation.
A good way to understand what's going on with your query is to start using the Spark UI - it's normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html
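As a concrete illustration of the single-file pattern discussed above, here is a sketch; df, the sort column and the output path are placeholders, and collapsing to one partition only makes sense while the data stays small:
# Sort globally, then collapse to a single partition so the write
# produces exactly one CSV part file.
(df.orderBy('sort_column')
   .coalesce(1)
   .write.mode('overwrite')
   .csv('/tmp/sorted_output', header=True))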

can i compress a pandas dataframe into one row?

I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column no column has more than one non-null value.
Is it possible to merge the four rows together to give one row, which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station; I will have records at 5-minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect there to be a much more succinct way to do it, but I basically manipulated the dataframe: replaced all NaNs with zero, replaced some strings with ints, and added the columns together, as shown in the code below:
import json
import pandas as pd
with open(fname, 'r') as d:
    ws = json.loads(next(d))
# Flatten the json into a dataframe (4 rows, 60+ columns)
df = pd.json_normalize(ws['sensors'], record_path='data')
# Stand the four rows up side by side as columns a-d
df3 = pd.concat([df.iloc[0], df.iloc[1], df.iloc[2], df.iloc[3]], axis=1)
df3.rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}, inplace=True)
# Replace NaNs (and a few awkward values) with zero so the columns can be added
df3 = df3.fillna(0)
df3.loc['ts', ['b', 'c', 'd']] = 0
df3.loc[['ip_v4_gateway', 'ip_v4_netmask', 'ip_v4_address'], 'c'] = int(0)
# Combine the four columns into one and transpose back to a single row
df3['comb'] = df3['a'] + df3['b'] + df3['c'] + df3['d']
df3.drop(columns=['a', 'b', 'c', 'd'], inplace=True)
df3 = df3.T
As has been said by quite a few people, the documentation on this is very patchy, so I hope this may help someone else who is struggling with this problem!
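If it helps, a possibly more succinct route to the same one-row result, assuming (as described above) that apart from 'ts' no column has more than one non-null value, is to back-fill and keep the first row:
# Back-fill so row 0 holds the first non-null value of every column,
# then keep just that row.
collapsed = df.bfill().iloc[[0]]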

Why is renaming columns in pandas so slow?

Given a large data frame (in my case 250M rows and 30 cols), why is it so slow to just change the name of a column?
I am using df.rename(columns={'oldName':'newName'}, inplace=True), so this should not make any copies of the data, yet it is taking over 30 seconds, while I would have expected this to be on the order of milliseconds (as it's just replacing one string with another).
I know that's a huge table, more than most people have RAM in their machine (hence I'm not going to add example code either), but still this shouldn't take any significant amount of time, as it's not actually touching any of the data. Why does this take so long, i.e. why is renaming a column doing work proportional to the number of rows of my dataframe?
I don't think inplace=True means no copy of your data is made. There are some discussions on SO saying it actually does copy and then assigns back. Also see this GitHub issue.
You can just override the columns with:
df.columns = df.columns.to_series().replace({'a':'b'})
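A tiny illustration of that approach on a toy frame (the column names are placeholders):
import pandas as pd
df = pd.DataFrame({'oldName': [1, 2], 'other': [3, 4]})
# Build a new Index and assign it, instead of going through rename()
df.columns = df.columns.to_series().replace({'oldName': 'newName'})
print(df.columns.tolist())  # ['newName', 'other']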

Loading a spark DF from S3, multiple files. Which of these approaches is best?

I have an S3 bucket with partitioned data underlying Athena. Using Athena I can see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day and by hour, so 07-12-2018-00, 01, 02 ... 23, i.e. 24 partitions for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen - load directly from:
1. files
paths = ['s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
         's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
         # ...
         's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet']
df = spark.read.parquet(*paths)
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.registerTempTable('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
#2 seems inefficient to me because all 104 billion rows (or more accurately, all the partition_datetime fields) have to be traversed to satisfy the SELECT. I've been counselled that this really isn't an issue because of lazy execution, and that there is never a df with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, therefore option 1 is more efficient.
I am interested in other opinions on this. Please chime in.
What you are saying might be true, but it is not practical, as it will never scale. If you want data for three months, you cannot specify 90 lines of code in your load command. It is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a Spark standalone or YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
Thom, you are right. #1 is more efficient and the way to do it. However, you can build a list of the files to read and then ask Spark to read only those files, as sketched below.
This blog might be helpful for your situation.
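A sketch of that list-of-files approach, assuming an existing SparkSession named spark and that every day/hour directory for the example month actually exists:
# Build the paths for the month of interest (July 2018, hours 00-23)
paths = [
    's3://my_bucket/my_schema/my_table_directory/07-{:02d}-2018-{:02d}/'.format(day, hour)
    for day in range(1, 32)
    for hour in range(24)
]
df = spark.read.parquet(*paths)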