Select columns based on exact row value matches - pandas

I am trying to select rows where a column equals a specific integer value (24). However, my new dataframe includes all rows with values equal to and greater than 24. I have tried converting the column from integer to both float and string, and it gives the same results; writing "24" and 24 also behaves identically. The dataframe is loaded from a .csv file.
data_PM1_query24 = data_PM1.query('hours == "24"' and 'averages_590_nm_minus_blank > 0.3')
data_PM1_sorted24 = data_PM1_query24.sort_values(by=['hours', 'averages_590_nm_minus_blank'])
data_PM1_sorted24
What am I missing here?

Please try the code below. The original call fails because Python's "and" operates on the two query strings themselves: a non-empty string is truthy, so 'hours == "24"' and 'averages_590_nm_minus_blank > 0.3' evaluates to the second string, and only that condition is ever passed to query(). Put both conditions in a single query string instead. I'm assuming the data type of "hours" and "averages_590_nm_minus_blank" is float; if not, convert them to float.
data_PM1_query24 = data_PM1.query('hours == 24 & averages_590_nm_minus_blank > 0.3')
or you can also use,
data_PM1_query24 = data_PM1[(data_PM1.hours == 24) & (data_PM1.averages_590_nm_minus_blank > 0.3)]
Hope this solves your query!
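A minimal sketch of why the original query misbehaves, using an invented toy frame with the question's column names:
import pandas as pd

# Toy stand-in for data_PM1; the values are made up for illustration
data_PM1 = pd.DataFrame({'hours': [12, 24, 24, 48],
                         'averages_590_nm_minus_blank': [0.1, 0.4, 0.5, 0.6]})

# Python's "and" on two non-empty strings returns the second string,
# so only the second condition ever reaches query():
print('hours == "24"' and 'averages_590_nm_minus_blank > 0.3')
# -> averages_590_nm_minus_blank > 0.3

# A single query string with & applies both conditions:
print(data_PM1.query('hours == 24 & averages_590_nm_minus_blank > 0.3'))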

TypeError: '<' not supported between instances of 'int' and 'Timestamp'

I am trying to change the product name when the period between the expiry date and today is less than 6 months. When I try to add the color, the following error appears:
TypeError: '<' not supported between instances of 'int' and 'Timestamp'.
Validade is the column that holds the products' expiry dates. How do I solve it?
epi1 = pd.read_excel('/content/timadatepandasepi.xlsx')
epi2 = epi1.dropna(subset=['Validade'])
pd.DatetimeIndex(epi2['Validade'])
today = pd.to_datetime('today').normalize()
epi2['ate_vencer'] = (epi2['Validade'] - today) / np.timedelta64(1, 'M')

def add_color(x):
    if 0 < x < epi2['ate_vencer']:
        color = 'red'
        return f'background = {color}'

epi2.style.applymap(add_color, subset=['Validade'])
Looking at your data, it seems you're subtracting two dates and using the result inside your comparison. The problem is that epi2['Validade'] - today returns a pandas.Series whose values are of type pandas._libs.tslibs.timedeltas.Timedelta, and that type of object does not allow comparisons with integers. Here's a possible solution:
epi2['ate_vencer'] = (epi2['Validade'] - today).dt.days
# Now you can compare values from "ate_vencer" with integers. For example:
def f(x):  # Dummy function for demonstration purposes
    return 0 < x < 10

epi2['ate_vencer'].apply(f)  # This works
Example 1 (screenshot omitted): a similar error to yours, produced by subtracting dates and calling f without .dt.days.
Example 2 (screenshot omitted): the same code, but using .dt.days, which runs without error.
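A self-contained sketch of both cases, with invented toy dates:
import pandas as pd

s = pd.to_datetime(pd.Series(['2023-01-10', '2023-03-01']))  # toy expiry dates
today = pd.to_datetime('today').normalize()

diff = s - today           # Series of Timedelta values
# diff < 10                # raises a TypeError: timedeltas cannot be
#                          # compared with plain integers
print(diff.dt.days < 10)   # works: .dt.days yields plain integers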

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses from a given match, taking the previous matches into account.
The problem is that I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a computation-time problem: the nested loop is quadratic, so 320,000 rows means about 320² ≈ 100,000 times more iterations than 1,000 rows, i.e. roughly two weeks at the same rate.
I also tried to use a groupby("clubid") first and then run my loop inside each group, but I got lost.
Something else that bothers me: I have at least two rows with the exact same date/hour, because there are at least two identical dates per match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the win_bool vector by group. If the dates are sorted, this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b",
                 "b", "b", "b", "b", "b", "b", "b"],
})

temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
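If you need running totals per match rather than one total per club (closer to what the original loop computes), a vectorized sketch using cumulative sums could look like this. It assumes rows are sorted by date within each club; NW_tot and NL_tot here count wins and losses up to and including the current match:
# sort so cumulative counts follow match order within each club
dat = dat.sort_values(["clubid", "date"])

g = dat.groupby("clubid")["win_bool"]
dat["NW_tot"] = g.cumsum()                        # wins so far
dat["NL_tot"] = g.cumcount() + 1 - dat["NW_tot"]  # matches so far minus wins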

Retain all dataframe columns when using spark map

I am trying to expand the Body JSON structure using map (as below), but I also need to keep the DateTime column. Currently only the expanded JSON columns are kept.
Do you know how to solve this?
jsonRdd = df.select(df.DateTime, df.Body.cast("string").alias("json"))
jsonRdd = jsonRdd.rdd.map(lambda x : x.json)
data = spark.read.json(jsonRdd)
display(data)
The current output looks like:
name     age
j blogg  21
The expected output is:
DateTime    name     age
4/6/2020    j blogg  21
thank you.
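One common way to keep the extra columns, sketched here under assumptions rather than taken from this thread, is to parse the JSON string in place with from_json instead of round-tripping through an RDD, so the other columns survive. The json_schema string below is hypothetical and must match the real Body payload:
from pyspark.sql.functions import from_json, col

# Hypothetical schema for the Body JSON; adjust to the actual structure
json_schema = "name STRING, age INT"

parsed = df.withColumn("json", from_json(col("Body").cast("string"), json_schema))
data = parsed.select("DateTime", "json.*")
display(data)  # DateTime, name, age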

TypeError: 'DataFrame' object is not callable in concatenating different dataframes of certain types

I keep getting the following error.
I read a file that contains time-series data in 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I want to observe meter readings over the whole length of the data in this file, but first I filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule: the first 3 digits are in the range 1 to 730 (365 × 2), and the last 2 digits are in the range 1 to 48. These are 30-minute interval readings covering a 2-year span (though not all meters have the full span). For example, daycode 36547 means day 365, half-hour slot 47.
So I created one file that contains the dates and another that contains the times. I use the daycode digits as indices into these files to recover the corresponding date & time.
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some weird reason, dcodebook was indexed using .iloc function as I understood, but hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid duplicate index Valueerror, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print datcode_df
print test
What I don't understand:
I tested earlier that columns of different dataframes can be merged using simple addition, as shown above.
I initially assigned the result to the existing ['daycode'] column in the test dataframe, so that the previous values would be replaced, and the same error message was returned.
Please advise.
Both DataFrames need to have the same length, so day and hm must be unique.
Then call reset_index(drop=True) on both so their indices match, and finally remove the parentheses in the join: day_df['match'] indexes the column, whereas day_df(['match']) tries to call the DataFrame, which is what raises TypeError: 'DataFrame' object is not callable.
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
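A minimal illustration of the indexing-versus-calling difference, using a toy frame invented for this example:
import pandas as pd

day_df = pd.DataFrame({'match': ['Jan 1', 'Jan 2']})

print(day_df['match'])  # [] indexes, selecting the column
# day_df(['match'])     # () calls the object and raises:
#                       # TypeError: 'DataFrame' object is not callable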

Converting dynamic, nicely formatted tabular data in Python to str.format()

I have the following Python 2.x code, which generates a header row for tabular data:
headers = ['Name', 'Date', 'Age']
maxColumnWidth = 20 # this is just a placeholder
headerRow = "|".join( ["%s" % k.center(maxColumnWidth) for k in headers] )
print(headerRow)
This code outputs the following:
Name | Date | Age
Which is exactly what I want - the data is nicely formatted and centered in columns of width maxColumnWidth. (maxColumnWidth is calculated earlier in the program)
According to the Python docs, you should be able to do the same thing in Python 3 with curly-brace string formatting, as follows:
headerRow = "|".join( ["{:^maxColumnWidth}".format(k) for k in headers] )
However, when I do this, I get the following:
ValueError: Invalid conversion specification
But, if I do this:
headerRow = "|".join( ["{:^30}".format(k) for k in headers] )
Everything works fine.
My question is: how do I use a variable in the format string instead of a hard-coded integer?
headerRow = "|".join( ["{:^maxColumnWidth}".format(k) for k in headers] )
headers = ['Name', 'Date', 'Age']
maxColumnWidth=21
headerRow = "|".join( "{k:^{m}}".format(k=k,m=maxColumnWidth) for k in headers )
print(headerRow)
yields
Name | Date | Age
You can represent the width maxColumnWidth as {m} and substitute its value through a format parameter.
There is no need for brackets (a list comprehension) inside the join; a generator expression (without brackets) suffices.
As the error says, your conversion specification is invalid: "maxColumnWidth" is not a valid conversion specification, because names inside a format spec are not looked up as variables. They can, however, be supplied as nested replacement fields:
>>> "{:^{maxColumnWidth}}".format('foo', maxColumnWidth=10)
'   foo    '
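Applied back to the header row from the question (same names as before):
headers = ['Name', 'Date', 'Age']
maxColumnWidth = 20

headerRow = "|".join("{:^{w}}".format(k, w=maxColumnWidth) for k in headers)
print(headerRow)  # each header centered in a 20-character-wide column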