df.groupby() giving me wrong total calculations

So I was just checking the final results after grouping, but the column sums are not matching. Here's my code (the last logical statement is failing even though the sums should be the same):
dfM = pd.concat([df1,df2])
dfM_V = sum(dfM['SumOfPCS'])
A = ['SOLDTO', 'PICKUP', 'ORIZIP3','ORIGINFACILITYCODE', 'PRODUCT_ID','ACTUALRECIPIENTCOUNTRY', 'LB_BRK','COUNTRY', 'MANIFESTEDDSPPRODUCT']
V = ['SumOfPCS', 'SumOfLBS']
dfM2 = dfM.groupby(A).agg([np.sum])[V]
dfM2 = dfM2.reset_index()
dfM2.columns = dfM2.columns.get_level_values(0)
dfM2_V = sum(dfM2['SumOfPCS'])
print(dfM2_V == dfM_V)
By the way, A + V = list(dfM.columns), and there are no empty rows or cells in the dataset. (When I do the exact same grouping in MS Access, the logical condition tested at the end is met, so there's nothing inherently wrong with the dataset.)
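One common culprit worth checking, offered only as a hedged guess since the data isn't shown: pandas groupby drops any row whose key columns contain NaN, so the grouped total silently comes up short even when no cell looks empty. A minimal sketch (the column names are illustrative):
import numpy as np
import pandas as pd
df = pd.DataFrame({'key': ['a', 'b', np.nan], 'SumOfPCS': [1, 2, 3]})
print(df['SumOfPCS'].sum())                                      # 6
print(df.groupby('key')['SumOfPCS'].sum().sum())                 # 3 -- the NaN-keyed row is dropped
print(df.groupby('key', dropna=False)['SumOfPCS'].sum().sum())   # 6 (dropna requires pandas >= 1.1)
Checking dfM[A].isna().any() would confirm or rule this out.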


Workaround: Google Sheets API does not accept a range request without specifying the final row

My spreadsheet has values laid out as in the screenshot (omitted here), and I need to create a list to use in Python, including the empty fields that exist between values:
CLIENT_SECRET_FILE = 'client_secrets.json'
API_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
service = Create_Service(CLIENT_SECRET_FILE, API_NAME, API_VERSION, SCOPES)
spreadsheet_id = sheet_id
get_page_id = 'Winning_Margin'
range_score = 'O1:O10000'
spreadsheets_match_score = []
range_names2 = get_page_id + '!' + range_score
result2 = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=range_names2,
    valueRenderOption='UNFORMATTED_VALUE').execute()
sheet_output_data2 = result2["values"]
for i, eventao2 in enumerate(sheet_output_data2):
    try:
        spreadsheets_match_score.append(sheet_output_data2[i][0])
    except IndexError:  # row present but the cell in column O is empty
        spreadsheets_match_score.append('')
In this case, the list (spreadsheets_match_score) would end up as:
["0-0","0-0","4-0","0-1","6-0","","","","0-3","2-2","","","","","0-1","","","3-0","1-1","3-1","","","",""]
My spreadsheet currently has 24 rows, but it will grow without a fixed ending value.
So I tried to use the range without the final row number (range_score = 'O1:O'), but the request was rejected; the range had to specify the final row (range_score = 'O1:O10000').
I put 10000 there precisely so I wouldn't have to keep changing it, but this feels very wrong, because it queries a range that doesn't exist, and I'm afraid it will eventually cause an error.
Is there any way to avoid specifying the last row of the worksheet?
To be something like:
range_score = 'O1:O'
The problem is not in the range specification: either range_score = 'O1:O' or range_score = 'O1:O100000000000' works when reading every row of a column.
In my case, the real issue was that row 1 of the desired column had no value (null); the request appeared to fail, but it was actually because of the empty ["values"] return.
In short, I was looking for the error in the wrong place.
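For reference, a defensive pattern that tolerates the empty-range case, reusing the names from the question (a sketch, not the poster's final code): dict.get() returns a default instead of raising KeyError when the response omits the "values" key.
range_names2 = get_page_id + '!' + 'O1:O'  # open-ended range: the whole of column O
result2 = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=range_names2,
    valueRenderOption='UNFORMATTED_VALUE').execute()
# the API omits "values" entirely when the requested range is empty
sheet_output_data2 = result2.get("values", [])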

Filter out entries of datasets based on string matching

I'm working with a dataframe of chemical formulas (str objects). Example:
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, if I want to produce an output which only contains the elements 'Na', 'Cl', 'Rb', the desired output should be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried to do is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it seems not to work.
You can use str.contains with an or-condition (|) for multiple string patterns, keeping rows that match any one of them:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
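A quick check of the difference on the example data (the DataFrame is rebuilt here so the snippet runs on its own):
import pandas as pd
df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1', 'Hg0.04Mg0.2Ag2O4', 'Rb0.2AgO']})
print(df[df['formula'].str.contains("Na|Cl|Rb", na=False)])          # rows 0 and 2: any one element suffices
print(df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')])  # row 0 only: all three required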
Your requirements are unclear, but assuming you want to filter based on a set of elements.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (
    df['formula']
    .str.extractall(regex)[0]
    .groupby(level=0).nunique().eq(len(s))
)
df.loc[mask[mask].index]
output:
formula
0 Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (df['formula']
        .str.extractall('([A-Z][a-z]*)')[0]
        .isin(s)
        .groupby(level=0).all()
)
df[mask]
output: no rows for this dataset

How can I optimize my for loop so that it can run on a 320,000-row DataFrame?

I think I have a runtime problem.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a runtime problem.
I also tried to first use groupby("clubid") and then run my loop inside every group, but I got lost.
Something else bothers me: I have at least two rows with the exact same date/time, because each match produces at least two identical dates. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid":   [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date":     [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
    "othercol": ["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
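If the intent of the original loop is a running tally per match (wins/losses up to and including each one) rather than one total per club, a hedged sketch of a vectorized equivalent: sort within club, then take cumulative counts. This replaces the O(n²) double loop, which at 320,000 rows means on the order of 10^11 comparisons and explains why it never finishes, with a single pass:
# assumption: "NW/NL so far" per match is what the loop was computing
dat = dat.sort_values(["clubid", "date"]).reset_index(drop=True)
g = dat.groupby("clubid")["win_bool"]
dat["NW_tot"] = g.cumsum()                        # wins so far within the club
dat["NL_tot"] = g.cumcount() + 1 - dat["NW_tot"]  # matches so far minus wins
Ties on date behave slightly differently than the loop (the loop counts all same-date matches, the cumulative version only those earlier in sort order), so check that edge case against your data.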

How to return ONLY the top 5% of responses in a column (pandas)

I am looking to return the top 5% of responses in a column using pandas. So, for col_1, I basically want a list of the responses that each make up at least 5% of the responses in that column.
The following returns ALL responses in col_1, both those that meet the condition and those that do not (as boolean True and False):
df['col_1'].value_counts(normalize = True) >= .05
While this is somewhat helpful, I would like to return ONLY those that evaluate to True. Should I use a dictionary and a loop? If so, how do I indicate that I am using value_counts(normalize = True) >= .05 to append to that dictionary?
Thank you for your help!
If you need to filter, use boolean indexing:
s = df['col_1'].value_counts(normalize = True)
L = s.index[s >= .05].tolist()
print (L)
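A quick illustration with made-up data: 25 responses, so a value must appear at least twice to clear the 5% cutoff.
import pandas as pd
df = pd.DataFrame({'col_1': ['a'] * 14 + ['b'] * 10 + ['c'] * 1})
s = df['col_1'].value_counts(normalize=True)   # a: 0.56, b: 0.40, c: 0.04
print(s.index[s >= .05].tolist())              # ['a', 'b']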

Putting dbSendQuery into a function in R

I'm using RJDBC in RStudio to pull a set of data from an Oracle database into R.
After loading the RJDBC package I have the following lines:
drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA INCLUDING REQUEST FOR VARIABLE x"))
masterdata = fetch(rs, n = -1) # extract all rows
Run as part of the usual script, these lines always execute without fail; they can take a few minutes depending on variable x, and may result in 100K or 1M rows being pulled. masterdata ends up holding everything in a dataframe.
I'm now trying to place all of the above into a function with one required argument, variable x, which is a text argument (a city name); this input is also part of the LONG SQL QUERY.
The function I wrote called Data_Grab is as follows:
Data_Grab = function(x) {
  drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
  conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
  rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA,
                                           INCLUDING REQUEST FOR VARIABLE x"))
  masterdata = fetch(rs, n = -1) # extract all rows
  return (masterdata)
}
My function appears to execute in seconds (no error is produced), however I get just the 21 column headings for the dataframe and the line
<0 rows> (or 0-length row.names)
Not sure what is wrong here; I was obviously expecting the function to still take minutes to execute, since the data being pulled is large, but no actual data frame is being returned.
Help is appreciated!
If you want to parameterize your query to a JDBC database, try the gsubfn package. (As a side note on why the original returns zero rows: paste() does not substitute x into the string, so the literal text "VARIABLE x" is sent to the database and nothing matches.) The code might look like this:
library(gsubfn)
library(RJDBC)
Data_Grab = function(x) {
  rd1 = x
  df <- fn$dbGetQuery(conn, "SELECT BLAH1, BLAH2
                             FROM TABLENAME
                             WHERE BLAH1 = '$rd1'")
  return(df)
}
Basically, you need to put a $ before the name of the variable that stores the parameter you wish to pass.