How to scrape a pivot table with Beautiful Soup [closed] - pandas

I'm trying to scrape a complex Wikipedia table (I'm not sure if it's appropriate to generalize such tables with the term "pivot table") using Beautiful Soup in hopes of recreating a simpler, more-analyzable version of it in Pandas.
JLPT "Applications and results" table on English Wikipedia
As an overview, moving from the left side: the table lists the years when JLPT was held, which exam levels were open that year, and then the statistics defined by the columns on top. The aggregated columns don't really matter for my purposes, although it'd be nice if there's a way to scrape and reconstruct them as such.
What makes the table difficult to reconstruct is that it has grouped rows (the years under the 'Year' column), but the rows belonging to a year sit at the same hierarchical level as the year header, not under it. Further, instead of each <tr> row having a <th> tag with the year, the <th> is only present in the first row of the year group:
HTML structure of the table
Another problem is that the year headers do not have any sort of defining identifiers in their tags or attributes, so I also can't select only the rows that contain a year.
These things make it impossible to group the rows by year.
So far, the only way I've been able to reconstruct some of the table is by:
scraping the entire table,
appending every <tr> element into a list,
deleting every string containing a [ (every year value carries a citation in square brackets), which leaves every row with a uniform number of elements,
converting them into a pandas dataframe (manually adding column names, removing leftover HTML using regex, etc.), without the years:
Row elements in a list
Processed dataframe (minus the years)
After coming this far, now I realize that it's still difficult to group the rows by years without doing so manually. I'm wondering if there's a simpler, more straightforward way of scraping similarly complex tables with only BeautifulSoup itself, and little to no postprocessing in pandas. In this case, it's okay if it's not possible to get the table in its original pivot format, I just want to have the year value for each row. Something like:
Dataframe goal

You do not need to use BeautifulSoup to do this. Instead, you can use pd.read_html directly to get what you need. When you read the HTML from Wikipedia, it will pull all of the tables into a list. If you scan through the list, you will see that the table you want is at index 10.
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Japanese-Language_Proficiency_Test')[10]
From there, you'll do some data cleaning to create the table that you need.
# Convert multi-level column into single columns
df.columns = df.columns.map('_'.join)

# Fix column names
df = df.rename({'Year_Year': 'dummy_year',
                'Level_Level': 'level',
                'JLPT in Japan_Applicants': 'japan_applicants',
                'JLPT in Japan_Examinees': 'japan_examinees',
                'JLPT in Japan_Certified (%)': 'japan_certified',
                'JLPT overseas_Applicants': 'overseas_applicants',
                'JLPT overseas_Examinees': 'overseas_examinees',
                'JLPT overseas_Certified (%)': 'overseas_certified'},
               axis=1)
# Remove text in [], (). Remove commas. Convert to int.
df['japan_certified'] = df['japan_certified'].str.replace(r'\([^)]*\)', '', regex=True).str.replace(',', '').astype(int)
df['overseas_certified'] = df['overseas_certified'].str.replace(r'\([^)]*\)', '', regex=True).str.replace(',', '').astype(int)
df['dummy_year'] = df['dummy_year'].str.replace(r'\[.*?\]', '', regex=True)
Output:
dummy_year level ... overseas_examinees overseas_certified
0 2007 1 kyū ... 110937 28550
1 2007 2 kyū ... 152198 40975
2 2007 3 kyū ... 113526 53806
3 2007 4 kyū ... 53476 27767
4 2008 1 kyū ... 116271 38988
.. ... ... ... ... ...
127 2022-1 N1 ... 49223 17282
128 2022-1 N2 ... 54542 25677
129 2022-1 N3 ... 41264 21058
130 2022-1 N4 ... 40120 19389
131 2022-1 N5 ... 30203 16132
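For completeness: if you do want to stay closer to BeautifulSoup itself, as the question asks, you can forward-fill the year while walking the <tr> rows, since (per the question) the year <th> only appears in the first row of each group. This is only a rough sketch under that assumption -- the helper name is made up and you still need to pick the right wikitable out of the page:
import requests
from bs4 import BeautifulSoup

def table_rows_with_year(table):
    """Walk a bs4 <table> Tag, forward-filling the year that only
    appears (as a <th>) in the first row of each year group."""
    year = None
    rows = []
    for tr in table.find_all('tr'):
        th = tr.find('th')
        if th is not None:
            # first row of a new year group: remember the year and
            # strip the bracketed citation, e.g. "2007[24]" -> "2007"
            year = th.get_text(strip=True).split('[')[0]
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # skip the pure header rows, which have no <td>
            rows.append([year] + cells)
    return rows

html = requests.get('https://en.wikipedia.org/wiki/Japanese-Language_Proficiency_Test').text
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', class_='wikitable')
# rows = table_rows_with_year(tables[i])  # pick the right i by inspecting the page
# df = pd.DataFrame(rows)                 # then add column names as before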

Related

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame which has 200 values, prices for products. I want to run some operation on this dataframe, like calculating the average price for the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate the average for each row. I.e. the first 9 rows will be NaN, then from rows 10-200 it would calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is an issue. For that reason, I would want to run the average only on, say, the last 10 values (I don't need more) out of all the values, while keeping those values in the dataframe. I.e. I don't want to get rid of those values or create a new DataFrame.
I just essentially want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices within [0, 1000).
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns; you simply need to adjust your .apply(...) call to include an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
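As a small, made-up illustration of that axis parameter (the column names and values below are invented, not from the question):
import pandas as pd

df2 = pd.DataFrame({"price": [100.0, 200.0, 300.0, 400.0],
                    "qty": [1, 2, 3, 4]})

def row_total(row) -> float:
    # combines values from several columns of a single row
    return row["price"] * row["qty"]

# apply the function row-wise, but only over the last two rows
df2.iloc[-2:].apply(row_total, axis=1)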
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
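As a side note (not part of the answer above), .tail gives the same slice, so an equivalent one-liner would be:
print(round(df["Price"].tail(2).mean(), 2))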

Is It A Good Idea or A Huge Mistake to Combine More Than 1 Type of Data Into A Single Column in An SQL Database Table? [closed]

So, let's say I have 5 items: A, B, C, D and E. Item A comes in sizes 1 and 2, item B comes in sizes 2 and 3, C comes in 1 and 3, D comes in 1, and E comes in 3. Now, I am considering 2 table options, as follows:
Table 1
Name  Size
A     1
A     2
B     2
B     3
C     1
C     3
D     1
E     3
Another option is Table 2, as follows:
Name
A1
A2
B2
B3
C1
C3
D1
E3
Now, which of these 2 tables is actually a better option? What are the advantages and disadvantages (if any) of each of the 2 tables above? One thing that I can think of is that, if I use table 1, I can easily extract all items by size, no matter what item I want. So, for instance, if I want to analyze this month's sales of items of size 1, it's easy to do it with Table 1. I can't seem to see the same advantage if I use table 2. What do you guys think? Please kindly enlighten me on this matter. Thank you in advance for your kind assistance, everyone. Cheers! :)
I don't even understand why you have the second table option - what purpose does it have, or how does it help you? Plain and simple, you have a one-to-many relationship; that is, an item comes in 1 or more different sizes. Just saying that sentence should scream ONLY option 1. Option 2 will make your life a living hell because you are going against normalization guidelines by cramming 2 kinds of data into 1 column, and it has no real benefit.
Option 1 says I have an item and it can have one or more sizes associated with it.
Item Size
A 1
A 2
A 3
B 1
C 1
C 2
Then you can do simple queries like: give me all items that have more than 1 size. Give me any item that only has 1 size. Give me all the sizes of the item with item id A, etc.

Pandas Python How to handle question mark that appeared in dataframe

I have these question marks that appeared in my data frame right next to numbers, and I don't know how to erase or replace them. I don't want to drop the whole rows since that may result in inaccurate results.
. Value
0 58
1 82
2 69
3 48
4 8
I agree with the comments above that you should look into how you imported the data. But here is the answer to your question of how to remove the non-numeric characters:
This will remove the non-numeric characters:
df['Value'] = df['Value'].str.extract(r'(\d+)')
Then if you wish to change the datatype to int you can use this:
df['Value'] = pd.to_numeric(df['Value'])
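Putting it together on a small made-up sample (the "?" values below are invented to mimic the problem, since the posted frame doesn't show them):
import pandas as pd

# invented sample mimicking values like "58?" described in the question
df = pd.DataFrame({"Value": ["58?", "82", "69?", "48", "8"]})

# keep only the digits (expand=False returns a Series), then convert to a numeric dtype
df["Value"] = df["Value"].str.extract(r"(\d+)", expand=False)
df["Value"] = pd.to_numeric(df["Value"])
print(df)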

How to select columns based on value they contain pandas

I am working in pandas with a certain dataset that describes the population of a certain country per year. The dataset is structured in a weird way wherein the years aren't the columns themselves but rather the years are values within the first row of the set. The dataset describes every year from 1960 up until now, but I only need 1970, 1980, 1990, etc. For this purpose I've created a list with all those years and tried to make a new dataset which is equivalent to the old one but only has the columns that contain a value from said list, so I don't have all this extra info I'm not using. Online I can only find instructions for removing rows or selecting by column name; since neither of these criteria applies in this situation, I thought I should ask here.
The dataset is a csv file which I've downloaded off some world population site. here a link to a screenshot of the data
As you can see the years are given in scientific notation for some years, which is also how I've added them to my list.
pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv',
header=None, engine='python', skiprows=4)
display(pop)
years = ['1.970000e+03','1.980000e+03','1.990000e+03','2.000000e+03','2.010000e+03','2.015000e+03', 'Country Name']
pop[pop.columns[pop.isin(years).any()]]
This is one of the things I've tried so far which I thought made the most sense, but I am still very new to pandas so any help would be greatly appreciated.
Using the data at https://data.worldbank.org/indicator/sp.pop.totl, copied into pastebin (first time using the service, so apologies if it doesn't work for some reason):
# actual code using CSV file saved to desktop
#df = pd.read_csv(<path to CSV>, skiprows=4)
# pastebin for reproducibility
df = pd.read_csv(r'https://pastebin.com/raw/LmdGySCf',sep='\t')
# manually select years and other columns of interest
colsX = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
'1990', '1995', '2000']
dfX = df[colsX]
# select every fifth year
year_cols = df.filter(regex='19|20', axis=1).columns
colsY = year_cols[[int(col) % 5 == 0 for col in year_cols]]
dfY = df[colsY]
As a general comment:
The dataset is structured in a weird way wherein the years aren't the columns themselves but rather the years are values within the first row of the set.
This is not correct. Viewing the CSV file, it is quite clear that the values in row 5 (Country Name, Country Code, Indicator Name, Indicator Code, 1960, 1961, ...) are indeed column names. You have read the data into pandas in such a way that those values are not column headers, but your first step, before trying to subset your data, should be to ensure you have read in the data properly -- which, in this case, would give you column headers named for each year.
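Concretely, a minimal sketch of that first step, using the file path from the question (and assuming the year headers come through as plain strings like '1970'):
import pandas as pd

# skiprows=4 makes the "Country Name, ..., 1960, 1961, ..." row the header
pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv', skiprows=4)

# with real year headers, picking out the decades is straightforward
keep = ['Country Name', '1970', '1980', '1990', '2000', '2010', '2015']
pop_decades = pop[keep]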

Looping calculations from data frames

I have a large dataset coming in from SQLdf. I use split to order it by an index field from the query and list2env to split these into several data frames. These data frames will have names like 1 through 178. After splitting them, I want to do some calculations on all of them. How should I "call" a calculation for 1 through 178 (the count might change from day to day)?
Simplification: one dataset becomes n data frames split on an index (like this):
return date return benchmark_return index
28-03-2014 0.03 0.05 6095
with typically 252 * 5 obs (i.e. 5 years)
then I want to split these on the index (currently into 178 dfs)
and perform typical risk/return analytics from the PerformanceAnalytics package, for example chart.Histogram or charts.PerformanceSummary.
In the next step I would like to group these (the graphs/results, that is) and insert them into a PDF for each index.
As others have pointed out, the question lacks a proper example, but indexing of environments can be done as with lists. In order to construct a list that has digits as index values one needs to use backticks, and arguments to [[ when accessing environments need to be characters:
> mylist <- list(`1`="a", `2`="b")
> myenv <- list2env(mylist)
> myenv$`1`
[1] "a"
> myenv[[as.character(1)]]
[1] "a"
If you want to extract values (and then possibly put them back into the environment):
sapply(1:2, function(n) get(as.character(n), envir=myenv) )
[1] "a" "b"
myenv$calc <- with(myenv, paste(`1`, `2`))