Error when filtering pandas dataframe by column value

I am having a problem with filtering a pandas dataframe. I am trying to filter a dataframe based on column values being equal to a specific list but I am getting a length error.
I tried every possible way of filtering a dataframe but got nowhere. Any help would be appreciated, thanks in advance.
Here is my code:
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4]  # List value to filter
    filtered_df = df.loc[df['timeslot'] == timeslot]
Error : ValueError: ('Lengths must match to compare', (5696,), (4,))
Above image: df. Below image: df_hourly.
In the above image, the dataframe I want to filter is shown. Specifically, I want to filter according to the "timeslot" column.
And the below image shows the dataframe that contains the value I want to filter by, specifically the "date_parsed" column. In the first line of my code, I iterate through every row in this dataframe and assign the first 4 elements of the list value in df_hourly["date_parsed"] to a variable; later in the code, I try to filter the above dataframe by that variable.

When comparing columns using ==, pandas tries to compare them value by value - that is, does the first item equal the first item, the second the second, and so on. This is why you receive this error - pandas expects two columns of the same shape.
If you want to check whether a value is inside a list, you can use .isin (documentation):
df.loc[df['timeslot'].isin(timeslot)]
Depending on what timeslot is exactly, you might need to use timeslot.values or something similar (it's hard to say without an example of your dataframe).
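As a rough sketch of how this fits into the original loop (assuming df_hourly['date_parsed'] holds lists of values and df['timeslot'] holds single values), you could collect one filtered frame per row:
# a minimal sketch, assuming the structures described in the question
filtered_frames = []
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4]                 # first 4 elements of the list
    filtered_frames.append(df.loc[df['timeslot'].isin(timeslot)])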

Related

Nested when in Pyspark

I need to apply lots of when conditions which take their input from a list by index. I wanted to ask if there's a more concise way to write this code that produces the same results without affecting runtime efficiency.
Below is the code I am using:
df = df.withColumn('date_match_label', F.when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[0]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .otherwise('No Match'))
Here date_cols contains six column names. I need to compare the first three columns with the last three columns and return a comment if there's a match.
The problem with the current approach is that as the size of the list increases, I'll have to add more and more lines, which makes my code prone to errors and looks ugly. I was wondering if there's a way to do this where I only need to specify the list indices that should be compared against the other list elements.
Considering you want to compare the 1st half of the list (of column names) to the 2nd half of the list, you can dynamically build the code expression so that there is no need to write more error-prone code each time the list expands.
You can build the code dynamically with the help of indices in the following way:
from itertools import product
from pyspark.sql.functions import when, col

cols = date_cols                     # the six column names from the question
n = len(cols)
req = list(range(0, n))
# every index in the first half paired with every index in the second half
res = list(product(req[:n//2], req[n//2:]))

start = '''df.withColumn('date_match_label','''
whens = []
for i, j in res:
    # the doubled braces keep {cols[i]} inside the generated f-string, so the
    # actual column names appear in the label once the expression is evaluated
    whens.append(f'''when(col(cols[{i}])==col(cols[{j}]), f"{{cols[{i}]}} matches with {{cols[{j}]}}")''')
final_exp = start + '.'.join(whens) + '''.otherwise('No Match'))'''
This will generate the final expression as a single string that chains one when clause per pair of indices (for example, with 4 columns, the 1st half is compared against the 2nd half).
Since it is a string expression, you can execute it with the eval function and get the results as shown below:
df = eval(final_exp)
df.show(truncate=False)
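As an alternative sketch that avoids eval entirely, you can chain .when calls on the Column object in a plain loop. Note that, unlike the product-based version above, this pairs the columns positionally (1st with 4th, 2nd with 5th, ...), which is what the question's original code does:
from pyspark.sql import functions as F

half = len(date_cols) // 2
pairs = list(zip(date_cols[:half], date_cols[half:]))   # (left, right) column names to compare

# start with the first condition, then keep chaining .when on the returned Column
left0, right0 = pairs[0]
expr = F.when(F.col(left0) == F.col(right0), f"{left0} matches with {right0}")
for left, right in pairs[1:]:
    expr = expr.when(F.col(left) == F.col(right), f"{left} matches with {right}")

df = df.withColumn('date_match_label', expr.otherwise('No Match'))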

Combining low frequency values into a single "other" category using pandas

I am using this line of code, which uses the replace method, to combine low frequency values in the column:
psdf['method_name'] = psdf['method_name'].replace(small_categoris, 'Other')
The error I am getting is:
'to_replace' should be one of str, list, tuple, dict, int, float
So I tried to run this line of code before the replace method
psdf['layer'] = psdf['layer'].astype("string")
Now the column is of type string but the same error still appears. For context, I am working with the pandas API on Spark. Also, is there a more efficient way than replace, especially if we want to do the same for more than one column?
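A rough sketch of one way this might be resolved, assuming small_categoris is something like an Index or array (for example the output of a value_counts() filter), which is not among the types to_replace accepts:
# casting to a plain list matches the accepted to_replace types listed in the error
small_list = list(small_categoris)
psdf['method_name'] = psdf['method_name'].replace(small_list, 'Other')

# an isin/where-based alternative; each column would need its own list of
# low-frequency values (the column name here is taken from the question)
mask = psdf['method_name'].isin(small_list)
psdf['method_name'] = psdf['method_name'].where(~mask, 'Other')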

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int  # column number
    stop::Int   # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive columns are working.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator to give
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

Pandas dataframe being treated as a series object after using groupby

I am conducting an analysis of a dataset. To find my results, I use this line of code:
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean()
The object returned by this statement is a Series, when it should be a dataframe. I don't understand why this happened, or how to solve this issue. Also, one of the columns of the new object is missing its name. Here is the github link for the project: https://github.com/louishrm/gundataUS.
Any help would be great.
You are selecting a single column with ['totals'], which returns a Series.
Try this instead:
new_df = df_ncis[['state', 'year', 'totals']].groupby(['state', 'year']).mean()
which will give you a dataframe with your 3 columns.
Or, if you want it as a dataframe with one column (note the double brackets):
new_df = df_ncis.groupby(['state', 'year'])[['totals']].mean()
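If you also want state and year back as ordinary columns (which explains the "missing" column name: those labels end up in the index of the result), a common variant is to reset the index:
# mean per (state, year), returned as a DataFrame with all three names as columns
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean().reset_index()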

How to print the first n rows of a pandas dataframe by default?

When printing a pandas dataframe, how can I print only the first n rows by default?
I find myself frequently doing df.head(10) to view the column names and the first couple of rows.
I would prefer that when I type df, it prints the first n rows by default instead of printing the whole df, in which case I cannot see the column names.
If I understand you correctly, you may set
pd.options.display.max_rows = 10
and whenever you do just df in your notebook, only 10 rows will be displayed.
You can always set it back to the default value by doing
pd.reset_option('display.max_rows')
Check pd.describe_option('display') for more information.
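As a quick sketch of how this behaves (the frame below is hypothetical, just for illustration), note that the truncated view shows the first and last rows with an ellipsis in between:
import pandas as pd

pd.options.display.max_rows = 10
df = pd.DataFrame({'x': range(1000)})   # hypothetical frame for illustration
df                                      # displays a 10-row truncated view with '...'
pd.reset_option('display.max_rows')     # restore the default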
Curry DataFrame.head using functools.partial.
from functools import partial
head10 = partial(pd.DataFrame.head, n=10)
Now you can either call the function passing your DataFrame as an argument,
head10(df)
Or, pass the function to df.pipe (which internally passes df as an argument to your function),
df.pipe(head10)
Either way, you get the first 10 rows by default.
The other option is to create a new class that extends DataFrame and add your own method (e.g., headXX) which internally calls head(n=10) and returns the result.
See the subclassing DataFrame section in the docs.
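A minimal sketch of that subclassing approach (MyDataFrame, head10, and the sample data are hypothetical names for illustration; the _constructor override is the hook described in the subclassing docs):
import pandas as pd

class MyDataFrame(pd.DataFrame):
    # keep pandas operations returning MyDataFrame instead of plain DataFrame
    @property
    def _constructor(self):
        return MyDataFrame

    def head10(self):
        # convenience method that simply delegates to head(n=10)
        return self.head(n=10)

mdf = MyDataFrame({'a': range(100)})
mdf.head10()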