Serialized Results too large PySpark Left Join

Serialized Results too large PySpark Left Join - optimization

I have a problem with the final left join of a transformation.
It keeps failing with the following error: "There is too much data being sent to the driver. 4.0 GiB of serialized data from 10700 tasks exceeds the limit of 4.0 GiB." I don't know how to fix it.
The desired final dataset contains 4.5 million rows and shouldn't be complicated to obtain. I have already disabled broadcast joins, but to no avail.
Edit with more detail:
The join is the last line of the code (which involves fairly heavy operations), and this is precisely where it fails. Without the join I can build both datasets (df and df2); the error appears only when I execute the join.
df --> ~2,500,000 rows, 3 columns, 24.5 MB. (Result of F.explode of DATE for each ID)
df2 --> ~700,000 rows, 10 columns, 29.5 MB. (Result of a union of several datasets)
df_final = df.join(df2, ['ID', 'DATE'], 'left')
Any help is appreciated. Thank you!

Related

How can I detect similarity of names in the same columns

I have a dataset like this:
`
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
`
It gives this output:
        Name
0       John
1  gal britt
2       mona
3      diana
4      molly
5      merry
6       mony
7      molla
8  johnathon
9       dina
To compare every name against every other and detect similar ones, I thought of using df.merge(df, how="cross").
The problem is that the real data has 40,000 rows, so the cross join would produce 1.6 billion rows, which I don't have the memory for.
Any algorithm or idea would really help, and I'll adapt the logic to my purposes.
I tried vaex instead of pandas to handle this amount of data, but I still run out of memory.
In short: I know that this way of approaching the problem is wrong and inefficient.
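
One way to avoid materializing the cross join is to stream over the pairs and keep only the matches. The following is a minimal sketch, not a definitive solution: it assumes that "similar" can be approximated by the ratio from the standard library's difflib.SequenceMatcher, and the function name similar_pairs and the threshold of 0.7 are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.7):
    """Yield (name_a, name_b, score) for every pair whose similarity
    ratio is at least `threshold`. Pairs are generated lazily, so the
    full cross product is never held in memory."""
    lowered = [n.lower() for n in names]  # compare case-insensitively
    for (i, a), (j, b) in combinations(enumerate(lowered), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            yield names[i], names[j], round(score, 2)


names = ['John', 'gal britt', 'mona', 'diana', 'molly', 'merry',
         'mony', 'molla', 'johnathon', 'dina']
pairs = list(similar_pairs(names))
for pair in pairs:
    print(pair)
```

Memory stays proportional to the number of matches, but the number of comparisons is still quadratic (roughly 800 million for 40,000 names). Comparing only within groups, for example names that share a first letter, is one possible way to cut that down, at the cost of missing matches across groups.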

Summing time series with slight variance in timestamps

Imagine that I have several time series like the following, from different "sources":
   time   events
0  1000  1080000
1  2003  2122386
2  3007  3043985
3  4007  3872544
4  5007  4853763
Here, a monotonically increasing count, events, is sampled every 1000 ms. The sampling is not exact, so most of the timestamps differ from their ideal values by a few ms: for example, the second point is at 2003 instead of 2000.
I want to sum several of these time series. They are all sampled at ~1000 ms intervals but may not agree to the exact millisecond. For example, another time series could be:
   time   events
0  1000  1070000
1  2002  2122486
2  3006  3063985
3  4007  3872544
4  5009  4853763
I'd like something reasonable as the final result: for example, the same number of rows as each of the input dataframes, with a timestamp column equal to that of the first input, or to the average of the input times. As long as the inputs are smooth, the output should be too.

I'd suggest DataFrame.reindex() with the nearest method. Example:
import pandas as pd

def combine_datasources(reference_df, extra_dfs, tolerance_ms=100):
    # Snap each extra dataframe onto the reference timestamps (nearest match within the tolerance)
    reindexed_df_list = [df.reindex(reference_df.index, method='nearest', tolerance=tolerance_ms) for df in extra_dfs]
    combined = pd.concat([reference_df, *reindexed_df_list])
    # Rows that now share a timestamp are summed together
    return combined.groupby(combined.index).sum()

combine_datasources(df_a, [df_b])
This code assumes each dataframe is indexed by time. It changes the index of every dataframe in the extra_dfs list to match the index of the reference dataframe, then concatenates all of the dataframes. It uses groupby to do the sum, which only works because the indexes now match exactly. The timestamps in the result are those of the reference dataframe.
Note that if you have data from a time period not covered by the reference dataframe, that data will be dropped.
Here's the output for the dataset in your question:
       events
time
1000  2150000
2003  4244872
3007  6107970
4007  7745088
5007  9707526
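
To make the setup explicit, here is a small self-contained sketch of the same reindex idea applied to the two series from the question. The names df_a, df_b, aligned_b and total are only illustrative, and it assumes the time column is moved into the index first, since reindex aligns on the index.

```python
import pandas as pd

# The two sample series from the question, indexed by timestamp (ms).
df_a = pd.DataFrame({
    "time": [1000, 2003, 3007, 4007, 5007],
    "events": [1080000, 2122386, 3043985, 3872544, 4853763],
}).set_index("time")
df_b = pd.DataFrame({
    "time": [1000, 2002, 3006, 4007, 5009],
    "events": [1070000, 2122486, 3063985, 3872544, 4853763],
}).set_index("time")

# Snap df_b onto df_a's timestamps; a reference timestamp with no df_b
# sample within 100 ms would get NaN instead of a value.
aligned_b = df_b.reindex(df_a.index, method="nearest", tolerance=100)

# With identical indexes, a plain element-wise add gives the sum.
total = df_a.add(aligned_b)
print(total)
```

Unlike the groupby-based sum, adding the aligned frames directly leaves NaN where a series has no sample within the tolerance, which may or may not be what you want.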

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame with 200 values: prices for products. I want to run some operation on this DataFrame, such as calculating the average of the last 10 prices.
The way I understand it, pandas will go through every single row and calculate the average for each row. That is, the first 9 rows will be NaN, and for rows 10 to 200 it will calculate an average per row.
My issue is that I need to do a lot of these calculations and performance is a concern. For that reason, I want to run the average only on, say, the last 10 values (I don't need more), while keeping all values in the DataFrame. I don't want to get rid of the other values or create a new DataFrame.
Essentially, I just want to do the calculation on less data so that it is faster.
Is something like that possible? Hopefully the question is clear.

Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows gives 200 prices in the range [0, 1000).
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

# Assign through a single .loc call; chained assignment via df["price"].iloc[...] may not write back to df
df.loc[df.index[-12:], "price"] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. If you want to operate over a subset of your data with multiple columns, you simply need to pass an axis parameter to .apply(...), as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂

Given a dataframe like the following:
      Price
0    197.45
1     59.30
2    131.63
3    127.22
4     35.22
..      ...
195   73.05
196   47.73
197  107.58
198  162.31
199  195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
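
To connect this back to the performance concern in the question, the sketch below compares a rolling mean over the whole column with a mean over just the last 10 rows. The data is random and the variable names (rolling_last, last10_mean) are made up for illustration; the point is only that both give the same final value while the second touches 10 rows instead of 200.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
df = pd.DataFrame({"price": (rng.random(200) * 1000).round(2)})

# Rolling mean over the whole column: computes a value for every row,
# even though only the final one is of interest here.
rolling_last = df["price"].rolling(10).mean().iloc[-1]

# Mean of the last 10 rows only: same value, far less work,
# and df itself is left unchanged.
last10_mean = df["price"].tail(10).mean()

print(rolling_last, last10_mean)
```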

Pandas to_sql() performance related to number of columns

I noticed some odd behaviour in a script of mine that uses pandas' to_sql function to insert large numbers of rows into one of my MS SQL Server instances.
Performance decreases dramatically when the number of columns exceeds 10.
For example:
34484 rows x 10 columns => ~10k records per second
34484 rows x 12 columns => ~500 records per second
I set the fast_executemany flag when establishing the connection. Does anyone have an idea what is going on?
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params, fast_executemany=True)
sqlalchemy_connection = engine.connect()
....
df.to_sql(name='TEST', con=sqlalchemy_connection , if_exists='append', index=False)

groupby 2 columns and count into separate columns based on one columns cases

I'm trying to group by 2 columns, of which the first has 5 distinct values and the second has 2.
My data looks like this:
and using
df_counted = (
    df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT')
)
I was able to transform it into the cases I want:
However, I don't want a column for the result, just for the counts.
It's supposed to look like this:
          COUNT_TRUE  COUNT_FALSE
FORWARD           21          182
BACKWARD          34          170
RIGHT             24          298
LEFT              20          242
NEUTRAL           16           82
The best I could do there was this. How do I get there?

Pandas can build a pivot table from a dataframe, and your task can be done that way:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
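
As an alternative sketch that starts from the raw data instead of df_counted, pd.crosstab can produce the counts per TYPE and RESULT in one step. The sample rows below are made up, and the sketch assumes RESULT holds booleans; only the column names TYPE and RESULT and the target names COUNT_TRUE and COUNT_FALSE come from the question.

```python
import pandas as pd

# Hypothetical raw data with the same column names as in the question.
df_analysis = pd.DataFrame({
    "TYPE": ["FORWARD", "FORWARD", "FORWARD", "LEFT", "LEFT", "RIGHT"],
    "RESULT": [True, False, False, True, False, False],
})

# One row per TYPE, one column per RESULT value, cells hold the counts
# (0 where a combination never occurs).
counts = pd.crosstab(df_analysis["TYPE"], df_analysis["RESULT"])

# Rename the True/False column labels to COUNT_TRUE / COUNT_FALSE
# and put them in the order asked for in the question.
counts.columns = [f"COUNT_{str(c).upper()}" for c in counts.columns]
counts = counts[["COUNT_TRUE", "COUNT_FALSE"]]
print(counts)
```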

I solved it with a rather SQL-like approach. It's not elegant, but it works:
df_counted is the last df from the question, with the NaN values.
# keep the first row per TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# keep the last row per TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be very interested to see it.

We Keep Coding
