I need to apply pd.to_numeric to a long and wide (1000+ columns) dataframe, coercing invalid values to NaN.
Currently I'm using
df.apply(pd.to_numeric, errors="coerce")
which can take a substantial amount of time due to the number of columns.
df.astype()
does not work either, as it does not accept a coerce option.
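For reference, the coercion behaviour I'm after looks like this on a tiny example frame (made up just for illustration):
import pandas as pd
small = pd.DataFrame({'a': ['1', 'x', '3'], 'b': ['4.5', '6', 'oops']})
print(small.apply(pd.to_numeric, errors='coerce'))
#      a    b
# 0  1.0  4.5
# 1  NaN  6.0
# 2  3.0  NaN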
Any suggestions are appreciated.
As has already been pointed out in the comments, the amount of data you're working with makes it hard for pandas transformations not to be extremely slow.
I recommend setting up a PySpark session on your local machine, converting the DataFrame's column types there, and converting back to pandas at the end if you really need to.
In PySpark, you can cast all of your DataFrame's columns to float like this:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Afterwards you can just save your DataFrame back to where you want it to be (or maybe stick to PySpark and join the group!):
df.toPandas().to_csv('my_file.csv')
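Putting the pieces together, a minimal sketch (assuming a local Spark session and that df starts out as your original pandas DataFrame) could look like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master('local[*]').appName('to_float').getOrCreate()

# move the pandas DataFrame into Spark
sdf = spark.createDataFrame(df)

# invalid strings become null when cast to float, which maps back to NaN in pandas
sdf = sdf.select(*(col(c).cast('float').alias(c) for c in sdf.columns))

df_numeric = sdf.toPandas()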
I am trying to assign a value to a column for all rows selected based on a condition. Solutions for achieving this are discussed in several questions like this one.
The standard solution uses the following syntax:
df.loc[row_mask, cols] = assigned_val
Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment to complete.
Update: more info about my dataframe: I have ~2 million rows, and I am trying to update the value of one column for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.
Is it possible to speed up this assignment operation? Also, are there any general guidelines for doing multiple assignments in pandas?
I believe the difference between .loc and .at is what you're looking for; .at is meant to be faster, based on this answer.
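If only a handful of rows match (as in your update), a minimal sketch of the .at route might look like this; note that .at sets a single scalar cell at a time, so cols has to be a single column label here:
# loop over the (few) matching row labels and set one cell at a time
for idx in df.index[row_mask]:
    df.at[idx, cols] = assigned_val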
You could give np.where a try.
Here is a simple example of np.where:
import pandas as pd
import numpy as np
# 100 rows of random integers in four columns
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# replace values of B below 50 with 100000, leave the rest unchanged
df['B'] = np.where(df['B'] < 50, 100000, df['B'])
The question np.where() do nothing if condition fails has another example.
In your case, it might be
df[col] = np.where(df[col]==row_condition, assigned_val, df[col])
I was thinking it might be a little quicker because it goes straight to NumPy instead of going through pandas to reach the underlying NumPy machinery. This article talks about pandas vs NumPy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.
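Applied to the masked assignment in the question, a hypothetical version would look like this (the column names and the condition are placeholders, not from the question):
# build the boolean mask once as a NumPy array, then assign with np.where
row_mask = df['status'].to_numpy() == 'pending'   # placeholder condition, ~10 matches
df['some_col'] = np.where(row_mask, assigned_val, df['some_col'])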
I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, with a different month in each dataset. After preprocessing each dataset individually, I want to save them all to a single CSV file, but when I write the code below using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
It removes the previous data (rows) from that file. Is there any way to keep the previous data in the file as well, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes, like df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
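For example, if each month lives in its own CSV file (the file names below are just placeholders), you could read them all and concatenate once:
import pandas as pd

# placeholder file names for the monthly datasets described in the question
monthly_files = ['january.csv', 'february.csv', 'march.csv']
frames = [pd.read_csv(path) for path in monthly_files]

# stack them vertically and write a single combined file
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('dataset.csv', index=False)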
In Python you can use the built-in open() function with the mode 'a':
open("file", 'a')
The 'a' means "append", so new lines are added at the end of the file.
You can use the same mode parameter with the pandas.DataFrame.to_csv() method.
e.g.:
import pandas as pd
# ... code where you get your data and build `df` ...
df.to_csv("file", mode='a')
@thehand0: Your code works, but it's inefficient, so your script will take longer to run.
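A slightly fuller sketch of the append approach: write the header only when the file doesn't exist yet, so it isn't repeated in the middle of the combined file.
import os
import pandas as pd

out_path = 'dataset.csv'   # placeholder output file

# append this month's preprocessed rows; header only on the first write
df.to_csv(out_path, mode='a', index=False, header=not os.path.exists(out_path))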
I want to do a groupby with Dask.
Using pandas, I would have to write the following to do a basic groupby and filter.
My dataset contains two index columns: ORDER_ID and PROD_ID. Each order, identified by its ORDER_ID, can contain one or more products, each identified by its PROD_ID.
My objective is to remove orders (ORDER_IDs) that contain only one product.
Using pandas I can do it this way:
df = df.groupby('ORDER_ID').filter(lambda x: len(x) >= 2)
I didn't find any suitable solution with dask.
https://docs.dask.org/en/latest/dataframe-best-practices.html discusses the issues with pandas and dask.
For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame. While "Big Data" tools can be exciting, they are almost always worse than normal data tools while those remain appropriate.
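If you do want to stay in Dask, one possible workaround is to replace filter with an aggregate plus a merge. A rough, unbenchmarked sketch, assuming ddf is the Dask version of df:
import dask.dataframe as dd

# count products per order; after reset_index the count column keeps the name PROD_ID
counts = ddf.groupby('ORDER_ID')['PROD_ID'].count().reset_index()
keep = counts[counts['PROD_ID'] >= 2][['ORDER_ID']]

# inner merge keeps only the orders with at least two products
filtered = ddf.merge(keep, on='ORDER_ID', how='inner')
result = filtered.compute()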
So is this task not working in pandas because it takes too much memory?
I'm looking to perform a transform-like operation after a groupby using dask dataframes. Looking through the docs, it doesn't seem like dask currently gives this option, but does anyone have a workaround?
In practice: I'm looking to subtract the means of column B (after doing a groupby on A) from the raw values of B. In pure pandas, it looks like this:
import numpy as np

def demean_and_log(x):
    # log-transform the group's values, then subtract the group mean
    x_log = np.log(x)
    x_log_mean = x_log.mean()
    return x_log - x_log_mean

log_demean_col = X.groupby(['A'])['B'].transform(demean_and_log)
However, this is very slow - as I'm dealing with very large dataframes, and since I'm using a custom transformation function in Python, pandas doesn't release the GIL.
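For illustration, one direction that might work around this, though I haven't benchmarked it, is to replace transform with an aggregate plus a merge on a Dask version of the data. A rough sketch, assuming ddf is a Dask DataFrame with the same 'A' and 'B' columns as X:
import numpy as np
import dask.dataframe as dd

# log-transform B lazily (NumPy ufuncs work on Dask series)
ddf = ddf.assign(B_log=np.log(ddf['B']))

# per-group means of the logged column, keyed by A
means = ddf.groupby('A')['B_log'].mean().reset_index().rename(columns={'B_log': 'B_log_mean'})

# broadcast the group means back onto the rows and subtract
ddf = ddf.merge(means, on='A', how='left')
log_demean_col = (ddf['B_log'] - ddf['B_log_mean']).compute()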
I have a Spark DataFrame stored as a Parquet file that can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. on a pandas dataframe), since I want to convert each small chunk to a pandas dataframe and work on it later. I only know how to take a random sample by fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all the rows. It's not the ideal way to do things, but it works. Also consider filtering your Spark dataframe down to just what you need before converting it to pandas.
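Put together, the loop might look roughly like this (the chunk size and the pandas processing step are placeholders, and df is assumed to already have the 'id' column added above):
chunk_size = 4000
remaining = df

while remaining.count() > 0:
    working_set = remaining.sort('id').limit(chunk_size)

    # convert just this chunk to pandas and work on it
    pdf = working_set.toPandas()
    # ... process pdf here ...

    remaining = remaining.subtract(working_set)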