I am trying to learn data science and found this \n in my output. Can someone help me sort it out? The output is below, followed by my code:
Year\n Team\n GP\n GS\n MPG\n FG%\n 3P%\n FT%\n RPG\n APG\n SPG\n BPG\n PPG\n
0 1984–85\n Chicago\n 82 82 38.3 .515 .173 .845 6.5 5.9 2.4 .8 28.2\n
1 1985–86\n Chicago\n 18 7 25.1 .457 .167 .840 3.6 2.9 2.1 1.2 22.7\n
2 1986–87\n Chicago\n 82 82 40.0 .482 .182 .857 5.2 4.6 2.9 1.5 37.1*\n
3 1987–88\n Chicago\n 82 82 40.4* .535 .132 .841 5.5 5.9 3.2* 1.6 35.0*\n
4 1988–89\n Chicago\n 81 81 40.2* .538 .276 .850 8.0 8.0 2.9 .8 32.5*\n
5 1989–90\n Chicago\n 82 82 39.0 .526 .376 .848 6.9 6.3 2.8* .7 33.6*\n
6 1990–91†\n Chicago\n 82 82 37.0 .539 .312 .851 6.0 5.5 2.7 1.0 31.5*\n
7 1991–92†\n Chicago\n 80 80 38.8 .519 .270 .832 6.4 6.1 2.3 .9 30.1*\n
8 1992–93†\n Chicago\n 78 78 39.3 .495 .352 .837 6.7 5.5 2.8* .8 32.6*\n
9 1994–95\n Chicago\n 17 17 39.3 .411 .500 .801 6.9 5.3 1.8 .8 26.9\n
10 1995–96†\n Chicago\n 82 82 37.7 .495 .427 .834 6.6 4.3 2.2 .5 30.4*\n
11 1996–97†\n Chicago\n 82 82 37.9 .486 .374 .833 5.9 4.3 1.7 .5 29.6*\n
12 1997–98†\n Chicago\n 82 82 38.8 .465 .238 .784 5.8 3.5 1.7 .5 28.7*\n
13 2001–02\n Washington\n 60 53 34.9 .416 .189 .790 5.7 5.2 1.4 .4 22.9\n
14 2002–03\n Washington\n 82 67 37.0 .445 .291 .821 6.1 3.8 1.5 .5 20.0\n
15 Career\n 1,072 1,039 38.3 .497 .327 .835 6.2 5.3 2.3 .8 30.1\n None
16 All-Star\n 13 13 29.4 .472 .273 .750 4.7 4.2 2.8 .5 20.2\n None
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get(links)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', class_='wikitable sortable')
all_rows = table.find_all('tr')
data = []
for row in all_rows:
    row_list = row.find_all('td')
    # For the last row of the table, row_list looks like:
    # [<td colspan="2" style="text-align:center;">All-Star </td>, <td>13</td>, <td>13</td>, <td>29.4</td>, ..., <td>20.2</td>]
    data_row = []
    for cell in row_list:
        data_row.append(cell.text)  # e.g. ['All-Star\n', '13', '13', '29.4', ..., '20.2\n']
    data.append(data_row)
data = data[1:]  # drop the header row, which has no <td> cells
header_list = []
col_header = table.find_all('th')
for col in col_header:
    header_list.append(col.text)
df = pd.DataFrame(data)
df.columns = header_list
df
This should work (replace <column name> with the actual column name; in recent pandas versions, pass regex=False so the '\n' is treated literally):
df['<column name>'] = df['<column name>'].str.replace('\n', '', regex=False)
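If the newline shows up in every column name and many cells, a broader cleanup can be sketched like this (the toy frame below only imitates the scraped table, it is not the real data):

```python
import pandas as pd

# Toy frame mimicking the scraped table: the header text and the first/last
# cell of each row keep the trailing "\n" from the <th>/<td> tags.
df = pd.DataFrame({"Year\n": ["1984–85\n", "1985–86\n"],
                   "PPG\n": ["28.2\n", "22.7\n"]})

# Strip whitespace (including "\n") from the column names...
df.columns = df.columns.str.strip()
# ...and from every string cell.
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

print(df.columns.tolist())  # ['Year', 'PPG']
```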
Here was the original question:
With only being able to import numpy and pandas, I need to do the following: Scale the medianIncome to express the values in $10,000 of dollars (example: 150000 will become 15, 30000 will become 3, 15000 will become 1.5, etc)
Here's the code that works:
(temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float)) / 10000
But when I call the df afterwards, it still shows the original amounts instead of 15 or 1.5. I'm not sure what I'm missing.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But then when I call the df with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
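A likely explanation, sketched below with invented sample values: the expression returns a new Series but is never assigned back to the frame, so the DataFrame keeps its original column. Assigning the result back makes the change stick:

```python
import pandas as pd

# Invented sample values standing in for the real medianIncome column.
temp_housing = pd.DataFrame({"medianIncome": [150000.0, 30000.0, 17250.0]})

# Evaluating the expression alone returns a new Series and leaves the frame
# untouched; assigning it back to the column persists the change.
temp_housing["medianIncome"] = temp_housing["medianIncome"].astype(float) / 10000

print(temp_housing["medianIncome"].tolist())  # [15.0, 3.0, 1.725]
```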
I have a large SQLite database with many tables. I have established a connection to this database in RStudio using RSQLite and DBI packages. (I have named this database db)
library(RSQLite)
library(DBI)
At the moment I have to read in all the tables and assign them a name manually. For example:
country <- dbReadTable(db, "country")
date <- dbReadTable(db, "date")
#...and so on
As you can see, this is very time-consuming when there are many tables.
So I was wondering: is it possible to write a new function, or use existing functions (e.g. lapply()), to do this more efficiently and speed up the process?
Any suggestions are much appreciated :)
Two approaches:
All tables/data into one named-list:
alldat <- lapply(setNames(nm = dbListTables(db)), dbReadTable, conn = db)
The benefit of this is that if the tables have similar meaning, then you can use lapply to apply the same function to each. Another benefit is that all data from one database are stored together.
See How do I make a list of data frames? for working on a list-of-frames.
If you want them as actual variables in the global (or enclosing) environment, then take the previous alldat, and
ign <- list2env(alldat, envir = .GlobalEnv)
The return value from list2env is the environment we passed in, so it's not especially useful in this context (though it is useful at other times). The only reason I capture it in ign is to reduce clutter on the console, which is minor. list2env works primarily by side effect, so the return value here is not critical.
You can use dbListTables() to generate a character vector of all the table names in your SQLite database, then use lapply() to import them into R efficiently. I would first check that all the tables in your database fit into memory.
Below is a reproducible example:
library(RSQLite)
library(DBI)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "mtcars", mtcars)
dbWriteTable(db, "iris", iris)
db_tbls <- dbListTables(db)
tbl_list <- lapply(db_tbls, dbReadTable, conn = db)
tbl_list <- setNames(tbl_list, db_tbls)
dbDisconnect(db)
> lapply(tbl_list, head)
$iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
$mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
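For readers who would rather do this from Python, the same pattern (list all tables, then read each one) can be sketched with the standard-library sqlite3 module; the table names below just mirror the question's examples:

```python
import sqlite3

# Throwaway in-memory database with two tables, mirroring the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE country (name TEXT)")
con.execute("CREATE TABLE date (day TEXT)")

# List every table name, then read each table into a dict keyed by name --
# the Python counterpart of lapply(setNames(nm = dbListTables(db)), ...).
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
alldat = {t: con.execute("SELECT * FROM %s" % t).fetchall() for t in tables}

print(sorted(alldat))  # ['country', 'date']
con.close()
```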
I have a datatable (dt) like the following in SQL:
ID state_id act rd_1 rd_2 rd_3 rd_4 rd_5
11 abc,13.3 1 1. 31 17.4 32.4 0.4
11 afd,23.2 4 1. 42.1 1.3 31.9 0.39
11 dfw,154 7 0. 0.3 4.3 8.21 163.3
12 vab,64.5 8 1. 32.3 11 2.1 21.3
12 avf,542 2 0. 2.12 28.2 8.12 57.5
12 vjg,35 4 1. 5.7 8.64 7.46 0.25
13 vaw,424.2 4 1. 64.3 0.435 4.3 35.3
14 bda,243 1 0. 4.4 4.6 2.4 4.2
15 rbe,24.2 3 1. 43 53.5 4.4 8.5
I want to calculate, for each row, the variance of the values from rd_1 to rd_5 (they are doubles). ID and state_id uniquely identify a row. The desired output is like the following:
ID state_id act rd_1 rd_2 rd_3 rd_4 rd_5 var_rd
11 abc,13.3 1 1. 31 17.4 32.4 0.4 192.6624
11 afd,23.2 4 1. 42.1 1.3 31.9 0.39 323.3181
11 dfw,154 7 0. 0.3 4.3 8.21 163.3 4109.9855
12 vab,64.5 8 1. 32.3 11 2.1 21.3 141.3463
13 vaw,424.2 4 1. 64.3 0.435 4.3 35.3 636.2333
14 bda,243 1 0. 4.4 4.6 2.4 4.2 3.0496
15 rbe,24.2 3 1. 43 53.5 4.4 8.5 473.2456
I know it is possible to use PIVOT to flatten the data and then calculate the variance on a column (rd_value) in the flattened data, but the SQL dialect I use does not support PIVOT. I tried UNION, but it appears to mess up the user_id.
I would approach this by applying the population-variance formula directly (note the division by 5, which reproduces the values in your desired output):
select t.*,
       ( (rd_1 - rd_avg) * (rd_1 - rd_avg) +
         (rd_2 - rd_avg) * (rd_2 - rd_avg) +
         (rd_3 - rd_avg) * (rd_3 - rd_avg) +
         (rd_4 - rd_avg) * (rd_4 - rd_avg) +
         (rd_5 - rd_avg) * (rd_5 - rd_avg)
       ) / 5 as var_rd
from (select t.*,
             (rd_1 + rd_2 + rd_3 + rd_4 + rd_5) / 5 as rd_avg
      from t
     ) t
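As a cross-check on the arithmetic (not part of the SQL itself), the same population variance can be reproduced in pandas for the first data row from the question:

```python
import pandas as pd

# First data row (ID 11, state_id abc,13.3) from the question.
row = pd.Series({"rd_1": 1.0, "rd_2": 31.0, "rd_3": 17.4,
                 "rd_4": 32.4, "rd_5": 0.4})

# Population variance: mean of squared deviations -- the same arithmetic
# the SQL query spells out term by term.
rd_avg = row.mean()
var_rd = ((row - rd_avg) ** 2).sum() / 5

print(round(var_rd, 4))  # 192.6624
```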
I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
from this I would like to create a dataframe B
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike
This will give you a series with the partial sums. Use integer division, and offset the 1-based index so that rows 1-5 land in the same bin:
df['bin'] = (df.index - 1) // 5
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i + 1)) for i in bin_sums.index]
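Putting it together, a runnable sketch with a 1-based index as in the question (the column values here are invented):

```python
import pandas as pd

# 100 rows with a 1-based index, as in the question; values are invented.
df = pd.DataFrame({"A": range(1, 101)}, index=range(1, 101))

# Rows 1-5 map to bin 0, rows 6-10 to bin 1, and so on; sum each bin,
# then relabel the bins with their row ranges.
bin_sums = df.groupby((df.index - 1) // 5)["A"].sum()
bin_sums.index = ["%d-%d" % (5 * i + 1, 5 * (i + 1)) for i in bin_sums.index]

print(bin_sums["1-5"], bin_sums["96-100"])  # 15 490
```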
I have a table in SQLite and I'd like to open it with dplyr. I use SQLite Expert version 35.58.2478 and RStudio version 0.98.1062 on a PC with Windows 7.
After connecting to the database with src_sqlite() and reading with tbl() I get the table, but the character encoding is wrong. Reading the same table from a CSV file works by adding encoding = "UTF-8" to read.csv(), but then a different error appears in the first column name (see the minimal example below).
Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.
I tried changing the encoding in the RStudio options with no success. Changing the region in Windows or in R doesn't have any effect either.
Is there any way to get the characters in the table correctly into R using dplyr?
Minimal Example
library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
##
## KH_ID Einschätzung Prozess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
You can see the wrong encoding in the name of the second column. The same issue occurs in columns containing ä, ö, ü, etc.
With read.csv the name of the second column is displayed correctly, but the first column name is wrong:
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
## X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
I had the same problem and solved it as shown below. I can't guarantee that the solution is rock solid, but give it a try:
library(dplyr)
library(sqldf)
# Modifying built-in mtcars dataset
mtcars$test <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
mtcars$češćžä <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
names(mtcars) <-
iconv(names(mtcars), "cp1250", "utf-8")
# Connecting to sqlite database
my_db <- src_sqlite("my_db.sqlite3", create = T)
# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)
# dbSendQuery(my_db$con, "drop table mtcars")
# getting data from sqlite database
my_mtcars_from_db <-
collect(tbl(my_db, "mtcars"))
# disconnecting from database
dbDisconnect(my_db$con)
convert_to_encoding() function
# a function that encodes
# column names and values in character columns
# with specified encodings
convert_to_encoding <-
function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){
# names of columns are encoded in specified encoding
my_names <-
iconv(names(x), from_encoding, to_encoding)
# if any column name is NA, leave the names
# otherwise replace them with new names
if(any(is.na(my_names))){
names(x)
} else {
names(x) <- my_names
}
# get column classes
x_char_columns <- sapply(x, class)
# identify character columns
x_cols <- names(x_char_columns[x_char_columns == "character"])
# convert all string values in character columns to
# specified encoding
x <-
x %>%
mutate_each_(funs(iconv(., from_encoding, to_encoding)),
x_cols)
# return x
return(x)
}
# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Results
# before conversion
my_mtcars_from_db
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb češćžä test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ÄŤ ÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Ĺľ Ĺľ
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ÄŤ ÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Ĺľ Ĺľ
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb test češćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 č č
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ž ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 č č
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ž ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
Session information
devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, mingw32
ui RStudio (0.99.441)
language (EN)
collate Slovenian_Slovenia.1250
tz Europe/Prague
Packages -----------------------------------------------------------------------
package * version date source
assertthat * 0.1 2013-12-06 CRAN (R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN (R 3.2.0)
dplyr 0.4.1 2015-01-14 CRAN (R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN (R 3.2.0)
lazyeval * 0.1.10 2015-01-02 CRAN (R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
R6 * 2.0.1 2014-10-29 CRAN (R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN (R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)