I used pd.read_csv(my_csv, na_values=['N/A', '--']) so that the strings 'N/A' and '--' get interpreted as NULL, NaN, etc.
With the BigQuery client, I couldn't figure out how to achieve the same thing. I read the quick help for .to_dataframe(), which says "Return a pandas DataFrame from a QueryJob", but it doesn't seem to take any extra argument for this.
Is this possible? Or do I have to do my own custom post-processing to track missing values?
You can achieve the same with the following:
dataFrame.applymap(lambda x: np.nan if x in ['N/A', '--'] else x)
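As a possible alternative, DataFrame.replace can swap the sentinel strings for NaN without a per-cell lambda (newer pandas versions also deprecate applymap in favour of DataFrame.map). This is a minimal sketch: the frame below is a hypothetical stand-in for whatever .to_dataframe() returns, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by .to_dataframe()
df = pd.DataFrame({"city": ["Oslo", "N/A", "Lima"], "temp": ["12", "--", "25"]})

# Replace the sentinel strings with NaN across the whole frame
cleaned = df.replace(["N/A", "--"], np.nan)

# Columns that arrived as strings can then be converted to numbers
cleaned["temp"] = pd.to_numeric(cleaned["temp"])
```

The to_numeric step is only needed for columns that came back as strings because of the sentinel values.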
If you are running a query before loading its results into the dataframe, you can easily do this on the BigQuery side without worrying about filtering your results on the client side.
Something like IF(column in ('N/A', '--'), null, column) as column should do the job for you.
I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is the code that I was using in Pandas:
new = old.groupby(['A', 'B']) \
    .apply(lambda x: pd.Series({
        'v1': x['v1'].sum(),
        'v2': x['v2'].sum(),
        'v3': (x['v1'].sum() / x['v2'].sum()),
        'v4': x['v4'].min()
    })
)
I believe that it is working well and the resulting dataframe appears to be correct value-wise.
I just have a few questions:
Does this warning mean that my method will be deprecated in the future?
/databricks/spark/python/pyspark/sql/pandas/group_ops.py:76: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details.
How can I rename the group-by columns to 'A' and 'B' instead of "__groupkey_0__ __groupkey_1__"?
As you noticed, I had to call pd.Series. Is there a way to do this in Koalas? Calling ks.Series gives me the following error that I am unsure how to resolve:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Thanks for any help that you can provide!
I'm not sure about the warning. I am using koalas==1.2.0 and pandas==1.0.5 and I don't get it, so I wouldn't worry about it.
The groupby columns are already called A and B when I run the code. This again may have been a bug which has since been patched.
For this you have 3 options:
Keep using pd.Series. As long as your original DataFrame is a Koalas DataFrame, your output will also be a Koalas DataFrame (with the pd.Series automatically converted to ks.Series).
Keep the function and the data exactly the same and just convert the final dataframe to Koalas using the from_pandas function.
Do the whole thing in koalas. This is slightly more tricky because you are computing an aggregate column based on two GroupBy columns and koalas doesn't support lambda functions as a valid aggregation. One way we can get around this is by computing the other aggregations together and adding the multi-column aggregation afterwards:
import databricks.koalas as ks
ks.set_option('compute.ops_on_diff_frames', True)
# Dummy data
old = ks.DataFrame({"A":[1,2,3,1,2,3], "B":[1,2,3,3,2,3], "v1":[10,20,30,40,50,60], "v2":[4,5,6,7,8,9], "v4":[0,0,1,1,2,2]})
new = old.groupby(['A', 'B']).agg({'v1':'sum', 'v2':'sum', 'v4': 'min'})
new['v3'] = old.groupby(['A', 'B']).apply(lambda x: x['v1'].sum() / x['v2'].sum())
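Since v3 is just the ratio of the two summed columns, it could also be derived from the aggregated frame itself, which avoids the second groupby. The sketch below uses plain pandas with the same dummy data so that it runs without Spark. Whether the identical two lines behave the same way on a Koalas DataFrame is an assumption that has not been checked here.

```python
import pandas as pd

# Same dummy data as above, but as a plain pandas DataFrame
old = pd.DataFrame({
    "A": [1, 2, 3, 1, 2, 3],
    "B": [1, 2, 3, 3, 2, 3],
    "v1": [10, 20, 30, 40, 50, 60],
    "v2": [4, 5, 6, 7, 8, 9],
    "v4": [0, 0, 1, 1, 2, 2],
})

# Aggregate the single-column statistics first
new = old.groupby(["A", "B"]).agg({"v1": "sum", "v2": "sum", "v4": "min"})

# v3 is sum(v1) / sum(v2), so it can be computed from the aggregated columns
new["v3"] = new["v1"] / new["v2"]
```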
I was able to solve a problem with pandas thanks to the answer provided in Grouping by with Where conditions in Pandas.
I was first trying to make use of the .where() function like the following:
df['X'] = df['Col1'].where(['Col1'] == 'Y').groupby('Z')['S'].transform('max').astype(int)
but got this error: ValueError: Array conditional must be same shape as self
By writing it like
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
it worked.
I'm trying to understand what the difference is, as I thought .where() would do the trick.
You have a typo in your first statement. .where(['Col1'] == 'Y') compares a single-element list with 'Y'. I think you meant .where(df['Col1'] == 'Y'); however, this will not work either, because you have already narrowed the dataframe down to just 'Col1' before calling where, so the 'Z' and 'S' columns are no longer available to groupby. This is what you really wanted to do, in my opinion:
df['X'] = df.where(df['Col1'] == 'Y').groupby('Z')['S'].transform('max')
Which is equivalent to using
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
Also, note that astype(int) is not going to do any good in either of these statements, because any pandas column with an int dtype that receives a NaN is automatically converted to float.
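To make the difference concrete, here is a small sketch with made-up data that reuses the column names from the question. It shows that where keeps the original shape and fills non-matching rows with NaN, while query drops those rows, and that the assigned column ends up as float either way.

```python
import pandas as pd

# Hypothetical data using the column names from the question
df = pd.DataFrame({
    "Col1": ["Y", "N", "Y", "Y"],
    "Z": ["a", "a", "a", "b"],
    "S": [1, 9, 3, 5],
})

# where needs a boolean mask with the same length as the Series;
# it keeps every row and puts NaN where the mask is False
masked = df["S"].where(df["Col1"] == "Y")

# query drops the non-matching rows but keeps the original index labels
group_max = df.query('Col1 == "Y"').groupby("Z")["S"].transform("max")

# Assignment aligns on the index: row 1 has no match and gets NaN,
# which forces the new column to a float dtype
df["X"] = group_max
```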
I have a DataFrame containing a single column with a list of file names. I want to find all rows in the DataFrame whose value has a prefix from a set of known prefixes.
I know I can run a simple for loop, but I want to run in a Dataframe to check speeds and run benchmarks - it's also a nice exercise.
What I had in mind is combining str.slice with str.index but I can't get it to work. This is what I have in mind:
import pandas as pd
file_prefixes = {...}
file_df = pd.DataFrame(list_of_file_names, columns=['file'])
file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/')-1).isin(file_prefixes), :] # this doesn't work as the index returns a dataframe
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use str.startswith with a tuple of prefixes:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
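For the "combining slice and index" part of the question, one possible approach is to split each name at the first '/' and test the leading piece against the prefix set. This is a sketch only: the prefixes and file names below are invented for illustration, and it assumes the prefix is everything before the first '/'.

```python
import pandas as pd

# Hypothetical prefixes and file names, for illustration only
file_prefixes = {"logs", "images"}
file_df = pd.DataFrame(
    {"file": ["logs/a.txt", "tmp/b.txt", "images/c.png", "logsx/d.txt"]}
)

# Take everything before the first '/' and test membership in the prefix set
first_part = file_df["file"].str.split("/", n=1).str[0]
matches = file_df.loc[first_part.isin(file_prefixes)]
```

Note the difference in semantics: this matches the whole leading directory name, so "logsx/d.txt" is excluded, whereas a plain startswith check against "logs" would include it.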
I am new to PySpark and am a bit confused about how to approach this problem.
I have a large dataframe and I would like to filter down every subset of that dataframe based on two columns and run it through the same algorithm.
Here is an example of how I run it (extremely inefficiently) now:
for letter in ['a', 'b', 'c']:
    for number in [1, 2, 3]:
        filtered_DF_1, filtered_DF_2 = filter_func(DF_1, DF_2, letter, number)
        process_function(filtered_DF_1, filtered_DF_2)
Basic filter function:
from pyspark.sql import functions as F

def filter_func(DF_1, DF_2, letter, number):
    # Keep only the rows for this (letter, number) pair in each dataframe
    DF_1 = DF_1.filter(
        (F.col("Letter") == letter) &
        (F.col('Number') == number)
    )
    DF_2 = DF_2.filter(
        (F.col("Letter") == letter) &
        (F.col('Number') == number)
    )
    return DF_1, DF_2
Since this is Pyspark, I would like to parallelize it, since each iteration of the function can run independently.
Do I need to do some sort of mapping to get all my data subsets?
And then do I need to do anything to the process_function to make it available to all nodes as well to run and return an answer?
What is the best way to do this?
EDIT:
The process_function takes the filtered dataset and runs it through about 7 different functions that are already written in 300 lines of PySpark. The end goal is to return a list of timestamps that are overbooked based on a bunch of complicated logic.
I think my plan is to build a dictionary of letter --> [number], then explode that list to get every combination and create a dataset from that. Then I would map over that, and hopefully be able to create a UDF for my process_function.
I don't think you need to worry much about parallelizing or the execution plan, because Spark's Catalyst optimizer handles that in the background for you. It is also better to avoid UDFs; you can do most things with the built-in functions.
Are you doing a transformation or an aggregation inside your process_function?
Please provide some test data and a suitable example of the expected output. That would help in giving a better answer.
I have a dataframe which I am doing some work on
d={'x':[2,8,4,-5,4,5,-3,5],'y':[-.12,.35,.3,.15,.4,-.5,.6,.57]}
df=pd.DataFrame(d)
df['x_even']=df['x']%2==0
For subdf, I take all rows where x is negative, then square x and multiply y by 100:
subdf=df[df.x<0]
subdf['x']=subdf.x**2
subdf['y']=subdf.y*100
subdf's work is completed. I am not sure how I can incorporate these changes to the master dataframe (df).
It looks like your current code should give you a SettingWithCopyWarning.
To avoid this you could do the following:
df.loc[df.x<0, 'y'] = df.loc[df.x<0, 'y']*100
df.loc[df.x<0, 'x'] = df.loc[df.x<0, 'x']**2
This changes df in place without raising a warning, and there is no need to merge anything back.
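One detail worth spelling out: the condition df.x<0 is re-evaluated on each line, so y has to be updated before x is squared, otherwise no rows would match any more. A variant that computes the mask once makes the order irrelevant. This is a minimal sketch using the data from the question; the variable name neg is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [2, 8, 4, -5, 4, 5, -3, 5],
    "y": [-0.12, 0.35, 0.3, 0.15, 0.4, -0.5, 0.6, 0.57],
})

# Compute the mask once, before x is modified
neg = df["x"] < 0

# Both updates use the same stored mask, so their order no longer matters
df.loc[neg, "x"] = df.loc[neg, "x"] ** 2
df.loc[neg, "y"] = df.loc[neg, "y"] * 100
```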
pd.merge(subdf,df,how='outer')
This does what I was asking for. Thanks for the tip Primer