I have a problem computing the standard deviation, sd = √( Σ(X - mean)² / N ). My question is: how can I get the sum of the (X interval - mean) terms from a set of data where one or more criteria have to be met?
For example, the data is:
Gender Grade
M 36
M 32
F 25
F 40
I have already acquired the N needed in the equation via COUNTIFS and the mean via SUMIFS. The problem is getting the sum over the range of (X interval minus mean) without dedicating a cell/column to that range. In the given example, I want to get the standard deviation of Grade with respect to Gender. If I added a helper column for (X interval minus mean), it would break as soon as, say, record 2's gender were changed to 'F'.
Any thoughts on how this may be done?
With a little algebra the sd formula can be rewritten as
sd = √( ( Σ(x²) - (Σx)²/n ) / n )
which can be implemented with SUMIFS, COUNTIFS and SUMPRODUCT
Assuming the gender data is in range A1:A4, the grades in B1:B4, and the criterion in C1, use
=SQRT( (SUMPRODUCT($B$1:$B$4,$B$1:$B$4,--($A$1:$A$4=C1)) -
SUMIFS($B$1:$B$4,$A$1:$A$4,C1)^2/COUNTIFS($A$1:$A$4,C1)) /
COUNTIFS($A$1:$A$4,C1) )
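If you want to sanity-check the algebraic identity behind this, here is a small Python sketch (not part of the Excel solution, just a numeric verification on the question's sample grades):
import math

# Verify that sqrt(sum((x - mean)^2) / n) equals sqrt((sum(x^2) - sum(x)^2 / n) / n)
# using the question's sample data, grouped by gender.
grades = {'M': [36, 32], 'F': [25, 40]}

for gender, xs in grades.items():
    n = len(xs)
    mean = sum(xs) / n
    sd_definition = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    sd_rewritten = math.sqrt((sum(x * x for x in xs) - sum(xs) ** 2 / n) / n)
    print(gender, sd_definition, sd_rewritten)   # both give 2.0 for M and 7.5 for F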
I'm trying to scrape a complex Wikipedia table (I'm not sure if it's appropriate to generalize such tables with the term "pivot table") using Beautiful Soup in hopes of recreating a simpler, more-analyzable version of it in Pandas.
JLPT "Applications and results" table on English Wikipedia
As an overview, moving from the left side: the table lists the years when JLPT was held, which exam levels were open that year, and then the statistics defined by the columns on top. The aggregated columns don't really matter for my purposes, although it'd be nice if there's a way to scrape and reconstruct them as such.
What makes the table difficult to reconstruct is that it has grouped rows (the years under the 'Year' column), but the rows belonging to a year sit at the same hierarchical level as the year header rather than nested under it. Furthermore, instead of each <tr> row carrying a <th> tag with the year, the year <th> is only present in the first row of each year group:
HTML structure of the table
Another problem is that the year headers do not have any sort of defining identifiers in their tags or attributes, so I also can't pick only the rows with years in it.
These things make it impossible to group the rows by year.
So far, the only way I've been able to reconstruct some of the table is by:
scraping the entire table,
appending every <tr> element into a list,
deleting every string that contains a [ (every year has a citation in square brackets), which leaves a uniform number of elements in every row,
converting them into a pandas dataframe (manually adding column names, removing leftover HTML using regex, etc.), without the years:
Row elements in a list
Processed dataframe (minus the years)
After coming this far, I now realize that it's still difficult to group the rows by year without doing so manually. I'm wondering if there is a simpler, more straightforward way of scraping similarly complex tables with BeautifulSoup alone, with little to no postprocessing in pandas. It's okay if the table can't be recovered in its original pivot format; I just want to have the year value for each row. Something like:
Dataframe goal
You do not need to use BeautifulSoup to do this. Instead, you can use pd.read_html directly to get what you need. When you read the HTML from Wikipedia, it pulls all of the page's tables into a list. If you scan through the list, you will see that the one you want is at index 10.
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Japanese-Language_Proficiency_Test')[10]
From there, you'll do some data cleaning to create the table that you need.
# Convert multi-level column into single columns
df.columns = df.columns.map('_'.join)
#Fix column names
df = df.rename({'Year_Year': 'dummy_year',
'Level_Level': 'level',
'JLPT in Japan_Applicants': 'japan_applicants',
'JLPT in Japan_Examinees': 'japan_examinees',
'JLPT in Japan_Certified (%)': 'japan_certified',
'JLPT overseas_Applicants': 'overseas_applicants',
'JLPT overseas_Examinees': 'overseas_examinees',
'JLPT overseas_Certified (%)': 'overseas_certified'},
axis=1)
# Remove text in [], (). Remove commas. Convert to int.
df['japan_certified'] = df['japan_certified'].str.replace(r'\([^)]*\)', '', regex=True).str.replace(',', '').astype(int)
df['overseas_certified'] = df['overseas_certified'].str.replace(r'\([^)]*\)', '', regex=True).str.replace(',', '').astype(int)
df['dummy_year'] = df['dummy_year'].str.replace(r'\[.*?\]', '', regex=True)
Output:
dummy_year level ... overseas_examinees overseas_certified
0 2007 1 kyū ... 110937 28550
1 2007 2 kyū ... 152198 40975
2 2007 3 kyū ... 113526 53806
3 2007 4 kyū ... 53476 27767
4 2008 1 kyū ... 116271 38988
.. ... ... ... ... ...
127 2022-1 N1 ... 49223 17282
128 2022-1 N2 ... 54542 25677
129 2022-1 N3 ... 41264 21058
130 2022-1 N4 ... 40120 19389
131 2022-1 N5 ... 30203 16132
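If you do want to stay with BeautifulSoup alone, the usual pattern for tables like this is to carry the most recent year <th> forward while walking the <tr> elements. Below is a minimal sketch of that idea; the 'wikitable' class, the table index, and the assumption that only year rows carry a <th> among the data rows are guesses about the current Wikipedia markup and may need adjusting:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Japanese-Language_Proficiency_Test'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Placeholder: pick the right 'wikitable' by inspecting the page; the index here is a guess.
table = soup.find_all('table', class_='wikitable')[0]

rows = []
current_year = None
for tr in table.find_all('tr'):
    th = tr.find('th')
    if th is not None:
        # The first row of each year group carries the year in a <th>; strip the [citation].
        current_year = re.sub(r'\[.*?\]', '', th.get_text(strip=True))
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # pure header rows have no <td> and are skipped
        rows.append([current_year] + cells)

# rows now has the year attached to every data row and can be fed to pd.DataFrame.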
So, let's say I have 5 items: A, B, C, D and E. Item A comes in sizes 1 and 2, item B comes in sizes 2 and 3, C comes in 1 and 3, D comes in 1, and E comes in 3. Now, I am considering 2 table options, as follows:
Table 1
Name  Size
A     1
A     2
B     2
B     3
C     1
C     3
D     1
E     3
Another option is Table 2, as follows:
Name
A1
A2
B2
B3
C1
C3
D1
E3
Now, which of these 2 tables is actually a better option? What are the advantages and disadvantages (if any) of each of the 2 tables above? One thing that I can think of is that, if I use table 1, I can easily extract all items by size, no matter what item I want. So, for instance, if I want to analyze this month's sales of items of size 1, it's easy to do it with Table 1. I can't seem to see the same advantage if I use table 2. What do you guys think? Please kindly enlighten me on this matter. Thank you in advance for your kind assistance, everyone. Cheers! :)
I don't even understand why you have the second table option: what purpose does it serve, and how does it help you? Plain and simple, you have a one-to-many relationship; that is, an item comes in one or more different sizes. Just saying that sentence out loud should scream ONLY option 1. Option 2 will make your life a living hell because it goes against normalization guidelines by packing two attributes into one column, and it has no real benefit.
Option 1 says I have an item and it can have one or more sizes associated with it.
Item Size
A 1
A 2
A 3
B 1
C 1
C 2
Then you can do simple queries like: give me all items that have more than 1 size; give me any item that only has 1 size; give me all the sizes of the item with item id A; etc.
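To make that concrete, here is a small sketch of those queries against the option 1 layout, using pandas rather than SQL purely for illustration, with the item/size data from the question:
import pandas as pd

# Option 1: one row per (item, size) combination
items = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
                      'size': [1, 2, 2, 3, 1, 3, 1, 3]})

# All items available in size 1
print(items.loc[items['size'] == 1, 'name'].tolist())        # ['A', 'C', 'D']

# Items that come in more than one size
sizes_per_item = items.groupby('name')['size'].nunique()
print(sizes_per_item[sizes_per_item > 1].index.tolist())     # ['A', 'B', 'C']

# All sizes of item A
print(items.loc[items['name'] == 'A', 'size'].tolist())      # [1, 2]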
I am new to pandas.
I have been trying to solve a problem here
This is the problem statement: I want to drop any row that has a duplicate value in A but a non-duplicate value in B.
Here is the kind of output I want
IIUC, this is what you need
# True where exactly one of A, B differs from the previous row
a = df['A'].ne(df['A'].shift()).ne(df['B'].ne(df['B'].shift()))
# keep only the rows where A and B change (or stay the same) together
df[~a].reset_index(drop=True)
Output
A B
0 2 z
1 3 x
2 3 x
I think you need:
# True where, column by column, the value matches the previous or the next row
cond = (df.eq(df.shift(-1)) | df.eq(df.shift())).all(axis=1)
# for the remaining rows keep the last one per A, then append the rows flagged above
pd.concat([df[~cond].groupby('A').last().reset_index(), df[cond]])
A B
0 2 y
2 3 x
3 3 x
I am having some difficulty writing an awk/sed code for finding the distances between every row and the last row systematically. To be more specific, suppose I have a file f1 as follows.
1 2 3
4 5 6
7 8 9
.
.
.
51 52 53
30 31 32
where the first column is the x coordinate, the second column is the y coordinate, and the third column is the z coordinate. How do I create a file containing the distances between the first row and the last row (i.e. the distance between (1,2,3) and (30,31,32)), the second row and the last row, the third row and the last row, and so on, up to the penultimate row and the last row? If f1 has n rows, then the resulting file (let's call it f2) would therefore have n-1 rows.
I have been stuck on this for a long time, but any help would be much appreciated. Thanks!
Use tac to get last line first:
$ tac file | awk '(NR == 1){ x=$1; y=$2; z=$3; next } {
print sqrt((x-$1)^2 + (y-$2)^2 + (z-$3)^2)
}' | tac
50.2295
45.0333
39.8372
36.3731
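If you would rather avoid the double tac, the same computation is a few lines of Python. This is a sketch that reads all of f1 into memory and writes f2, assuming whitespace-separated coordinates as shown above:
import math

# Compute the Euclidean distance from every row of f1 to its last row and write them to f2.
with open('f1') as fh:
    points = [tuple(map(float, line.split())) for line in fh if line.strip()]

last = points[-1]
with open('f2', 'w') as out:
    for p in points[:-1]:
        out.write('%g\n' % math.sqrt(sum((a - b) ** 2 for a, b in zip(p, last))))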
I have a data frame called EPI.
it looks like this:
It has 104 countries. Each country has values from 1991 to 2008 (18 years).
I want an average for every 9 years, so each country will have 2 averages.
Edit:
This is the command I used to get the average, but it gives only one value (the overall average) for each country:
aver_economic_growth <- aggregate( HDI_growth_rate[,3], list(economic_growth$cname), mean, na.rm=TRUE)
But I need an average for each 9-year period of a country.
Please note that I am a new user of R, and I didn't find pandas among the installable packages!
I think you can first convert the years to datetime, then groupby with resample and take the mean, and finally convert back to years.
import numpy as np
import pandas as pd

# sample data for testing
np.random.seed(100)
start = pd.to_datetime('1991-02-24')
rng = pd.date_range(start, periods=36, freq='A')
df = pd.DataFrame({'cname': ['Albania'] * 18 + ['Argentina'] * 18,
'year': rng.year,
'rgdpna.pop': np.random.choice([0,1,2], size=36)})
#print (df)
df.year = pd.to_datetime(df.year, format='%Y')
df1 = df.set_index('year').groupby('cname').resample('9A',closed='left').mean().reset_index()
df1.year = df1.year.dt.year
print (df1)
cname year rgdpna.pop
0 Albania 1999 1.000000
1 Albania 2008 1.000000
2 Argentina 2017 0.888889
3 Argentina 2026 0.888889
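Since each country has exactly 18 consecutive years, an alternative that skips the datetime handling is to bucket every country's rows into two 9-year blocks by position. A sketch on a synthetic frame with the same column names (it assumes the rows are already sorted by year within each country):
import pandas as pd

df = pd.DataFrame({'cname': ['Albania'] * 18 + ['Argentina'] * 18,
                   'year': list(range(1991, 2009)) * 2,
                   'rgdpna.pop': range(36)})

# 0 for a country's first 9 years, 1 for its last 9
df['block'] = df.groupby('cname').cumcount() // 9
out = df.groupby(['cname', 'block'])['rgdpna.pop'].mean().reset_index()
print(out)   # one mean per country per 9-year block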