I would like to ask:
is there a way to get longitude and latitude variables (in a data.frame) from RDS geographic files, downloaded for example from https://gadm.org/download_country_v3.html?
I know we can easily plot from this dataset simply using:
df2 <- readRDS("C:/Users/petr7/Downloads/gadm36_DEU_1_sp.rds")
library(leaflet)
library(ggplot2)
# Using leaflet
leaflet() %>% addProviderTiles("CartoDB.Positron") %>% addPolygons(data = df2, weight = 0.5, fill = FALSE)
# Using ggplot
ggplot() +
  geom_polygon(data = df2, aes(x = long, y = lat, group = group), color = "black", fill = NA)
However, even when I type df2$, no longitude or latitude columns are offered.
I would do something like this:
# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
my_url <- "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_ITA_0_sf.rds"
data <- readRDS(url(my_url))
italy_coordinate <- st_coordinates(data)
head(italy_coordinate)
#> X Y L1 L2 L3
#> [1,] 12.61486 35.49292 1 1 1
#> [2,] 12.61430 35.49292 1 1 1
#> [3,] 12.61430 35.49347 1 1 1
#> [4,] 12.61375 35.49347 1 1 1
#> [5,] 12.61375 35.49403 1 1 1
#> [6,] 12.61347 35.49409 1 1 1
Created on 2019-12-27 by the reprex package (v0.3.0)
Now you just need to change the URL to match your case. Read the help page of the st_coordinates function (i.e. ?sf::st_coordinates) for an explanation of the meaning of the L1, L2 and L3 columns.
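To make the L1, L2 and L3 columns more concrete, here is a minimal plain-Python sketch of the same kind of flattening. It assumes the meaning I understand the sf documentation to give for MULTIPOLYGON geometries (L1 = ring within a polygon, L2 = polygon within the multipolygon, L3 = feature). The nested list `features` and the function `flatten_coordinates` are hypothetical names made up for this illustration, not part of sf.

```python
# Hypothetical nested structure: features -> polygons -> rings -> (x, y) points.
features = [
    [  # feature 1: a multipolygon made of two polygons
        [  # polygon 1: an outer ring plus one hole
            [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 0.0)],
            [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (1.0, 1.0)],
        ],
        [  # polygon 2: a single outer ring
            [(10.0, 10.0), (11.0, 10.0), (11.0, 11.0), (10.0, 10.0)],
        ],
    ],
]


def flatten_coordinates(features):
    """Return one row per vertex, tagged with ring, polygon and feature index."""
    rows = []
    for l3, polygons in enumerate(features, start=1):    # L3: feature
        for l2, rings in enumerate(polygons, start=1):   # L2: polygon
            for l1, ring in enumerate(rings, start=1):   # L1: ring
                for x, y in ring:
                    rows.append({"X": x, "Y": y, "L1": l1, "L2": l2, "L3": l3})
    return rows


rows = flatten_coordinates(features)
for row in rows[:3]:
    print(row)
```

Each row carries one vertex plus the indices needed to reassemble rings and polygons, which is also what a `group` aesthetic needs when plotting.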
I would like to rearrange a facet plot with 3 panels to have them fit better in a poster presentation. Currently, I have A over B over C (one column), and it is important to keep B over C.
What I would like is to have a square (2x2) presentation, with A over nothing, and B over C.
Can I either extract the individual panels of the plot, or create a facet with no axes or other graphics (like plot_grid with a NULL panel)?
A second option would be the ggh4x package which via facet_manual adds some flexibility to place the panels:
library(ggplot2)
library(ggh4x)
design <- "
AB
#C"
ggplot(mtcars, aes(mpg, disp)) +
geom_point() +
facet_manual(~cyl, design = design)
One approach is to create separate plots using nest() and map() from {tidyverse} and then use the {patchwork} package to arrange them as we want.
(Since the OP didn't provide any data or code, I am using the built-in mtcars dataset to show how to do this.) Suppose we have a faceted plot with 3 panels in a 3 x 1 layout.
library(tidyverse)
# 3 x 1 faceted plot
mtcars %>%
ggplot(aes(mpg, disp)) +
geom_point() +
facet_wrap(~cyl, nrow = 3)
Now, to match the question, let's suppose the panel for cyl 4 is plot A, the panel for cyl 6 is plot B and the panel for cyl 8 is plot C.
To do this, we first create a nested dataset with respect to the facet variable using group_by(facet_var) %>% nest(), and then map ggplot over the nested data to get a plot (gg object) for each group.
library(tidyverse)
library(patchwork)
# Say, plotA is cyl 4
# plotB is cyl 6
# plotC is cyl 8
# 2 x 2 facet plot
plot_data <- mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(
plots = map2(
.x = data,
.y = cyl,
.f = ~ ggplot(data = .x, mapping = aes(mpg, disp)) +
geom_point() +
ggtitle(paste0("cyl is ",.y))
)
)
plot_data
#> # A tibble: 3 × 3
#> # Groups: cyl [3]
#> cyl data plots
#> <dbl> <list> <list>
#> 1 6 <tibble [7 × 10]> <gg>
#> 2 4 <tibble [11 × 10]> <gg>
#> 3 8 <tibble [14 × 10]> <gg>
Then simply arrange the plots using {patchwork} syntax as we want. I have used plot_spacer() to create the blank space.
plot_data
plots <- plot_data$plots
plots[[2]] + plots[[1]] + plot_spacer() + plots[[3]] +
plot_annotation(
title = "A 2 X 2 faceted plot"
)
I usually use "${:,.2f}".format(prices) to format numbers with thousands separators and two decimals, but what I'm looking for is different: I want to round the values so that they fall into groups, and then find the mode of those groups.
Let's say I have this list:
0 34,123.45
1 34,456.78
2 34,567.89
3 33,222.22
4 30,123.45
And the replace function would turn the list into:
0 34,500.00
1 34,500.00
2 34,500.00
3 33,200.00
4 30,100.00
That way, when I use stats.mode(prices_rounded), it will show as a result:
Mode Value = 34500.00
Mode Count = 3
Is there a conversion function already available that does the job? I did search for days without luck...
EDIT - WORKING CODE:
#create list
df3 = df_array
print('########## df3: ',df3)
#convert to float
df4 = df3.astype(float)
print('########## df4: ',df4)
#convert list to string
#df5 = ''.join(map(str, df4))
#print('########## df5: ',df5)
#round values
df6 = np.round(df4 /100) * 100
print('######df6',df6)
#get mode stats
df7 = stats.mode(df6)
print('######df7',df7)
#get mode value
df8 = df7[0][0]
print('######df8',df8)
#convert to integer
df9 = int(df8)
print('######df9',df9)
This is exactly what I wanted, thanks!
You can use:
>>> sr
0 34123.45 # <- why 34500.00?
1 34456.78
2 34567.89 # <- why 34500.00?
3 33222.22
4 30123.45
dtype: float64
>>> np.round(sr / 100) * 100
0 34100.0
1 34500.0
2 34600.0
3 33200.0
4 30100.0
dtype: float64
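The same rounding idea can be tried without NumPy or SciPy. The sketch below uses only the standard library. The helper `round_to_step` and the `step` parameter are names I made up for illustration. Note that neither a step of 100 nor a step of 1000 reproduces the 34,500.00 grouping shown in the question, which is the point of the "why 34500.00?" remarks above.

```python
from collections import Counter

prices = [34123.45, 34456.78, 34567.89, 33222.22, 30123.45]


def round_to_step(value, step):
    """Round value to the nearest multiple of step (hypothetical helper)."""
    return round(value / step) * step


# Nearest 100: every rounded value is distinct, so there is no real mode.
rounded_100 = [round_to_step(p, 100) for p in prices]
print(rounded_100)

# Nearest 1000: a coarser step makes values collapse into shared groups.
rounded_1000 = [round_to_step(p, 1000) for p in prices]
mode_value, mode_count = Counter(rounded_1000).most_common(1)[0]
print("Mode Value =", mode_value)
print("Mode Count =", mode_count)
```

How coarse the step should be depends on how the groups are meant to be defined, which the question leaves open.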
I have the following issue when trying to append dataframes containing geometry types. The pandas dataframe I am looking at looks like this:
name x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
As you can see, there are four rows per name, as these represent the corners of polygons. I need this to be in the form of a polygon as defined in geopandas, i.e. I need a GeoDataFrame. To do so, I use the following code for just one of the names (just to check that it works):
name = 'A1'
df = df[df['name'] == name]
x = df['x_zone'].to_list()
y = df['y_zone'].to_list()
polygon_geom = Polygon(zip(x, y))
crs = {'init': "EPSG:4326"}
polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
print(polygon)
which returns:
geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...
polygon.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1 entries, A1 to A1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geometry 1 non-null geometry
dtypes: geometry(1)
memory usage: 16.0+ bytes
So far, so good. For more names, I thought the following would work:
unique_place = list(df['name'].unique())
GE = []
for name in unique_place:
    f = df[df['name'] == name]
    x = f['x_zone'].to_list()
    y = f['y_zone'].to_list()
    polygon_geom = Polygon(zip(x, y))
    crs = {'init': "EPSG:4326"}
    polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
    print(polygon.info())
    GE.append(polygon)
But it returns a list, not a dataframe.
[ geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...,
geometry
A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477...]
This is strange, because *.append(**) works very well if what is to be appended is a pandas dataframe.
What am I missing? Also, even in the first case, I am left with only the geometry column, but that is not an issue because I can write the file to a shp and read it again to have a second column (name).
Grateful for any solution that'll get me going!
I guess you need example code that uses groupby on your data. Let me know if that is not the case.
from io import StringIO
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
import numpy as np
dats_str = """index id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190"""
# read the string, convert to dataframe
df1 = pd.read_csv(StringIO(dats_str), sep=r'\s+', index_col='index')
# Use groupby as an iterator to:
# - collect the items of interest
# - process some data: mean, create Polygon, maybe others
# - collect/append everything as lists
ids = []
counts = []
meanx = []
meany = []
list_x = []
list_y = []
polygon = []
for label, group in df1.groupby('id'):
    # label: 'A1', 'A3'
    # group: dataframe of the rows for 'A1', then for 'A3'
    ids.append(label)
    counts.append(len(group))  # number of rows
    meanx.append(group.x_zone.mean())
    meany.append(group.y_zone.mean())
    # process x,y data of this group -> for polygon
    xs = group.x_zone.values
    ys = group.y_zone.values
    list_x.append(xs)
    list_y.append(ys)
    polygon.append(Polygon(zip(xs, ys)))  # make/collect polygon
# items above are used to create a dataframe here
df_from_groupby = pd.DataFrame({'id': ids, 'counts': counts, \
'meanx': meanx, "meany": meany, \
'list_x': list_x, 'list_y': list_y,
'polygon': polygon
})
If you print the dataframe df_from_groupby, you will get:
id counts meanx meany \
0 A1 4 56.783368 47.761185
1 A3 4 54.699137 52.222007
list_x \
0 [65.42208, 46.635708, 46.597984, 68.4777]
1 [46.635708, 46.635708, 63.30956, 62.215572]
list_y \
0 [48.14785, 51.165745, 47.657444, 44.0737]
1 [54.10819, 51.84477, 48.826878, 54.10819]
polygon
0 POLYGON ((65.42207999999999 48.14785, 46.63570...
1 POLYGON ((46.635708 54.10819, 46.635708 51.844...
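The grouping step on its own (collect the corner rows per id, then build one ring per id) can also be illustrated without pandas or geopandas. The sketch below is plain Python and only produces WKT text, not geopandas geometries. `ring_to_wkt` is a hypothetical helper written for this example, and closing the ring by repeating the first corner is my own assumption about what a polygon ring should look like.

```python
from collections import defaultdict

# Same corner data as above, as (id, x_zone, y_zone) tuples.
rows = [
    ("A1", 65.422080, 48.147850),
    ("A1", 46.635708, 51.165745),
    ("A1", 46.597984, 47.657444),
    ("A1", 68.477700, 44.073700),
    ("A3", 46.635708, 54.108190),
    ("A3", 46.635708, 51.844770),
    ("A3", 63.309560, 48.826878),
    ("A3", 62.215572, 54.108190),
]

# Group the corner points by id, keeping their original order.
corners = defaultdict(list)
for name, x, y in rows:
    corners[name].append((x, y))


def ring_to_wkt(points):
    """Format a list of (x, y) corners as a closed WKT polygon string."""
    closed = points if points[0] == points[-1] else points + [points[0]]
    body = ", ".join(f"{x} {y}" for x, y in closed)
    return f"POLYGON (({body}))"


polygons = {name: ring_to_wkt(points) for name, points in corners.items()}
for name, wkt in polygons.items():
    print(name, wkt)
```

The result is one entry per id rather than a list of one-row tables, which is the shape the question is after.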
I have two dataframes in pandas.
import pandas as pd
inp1 = [{'network':'1.0.0.0/24', 'A':1, 'B':2}, {'network':'5.46.8.0/23', 'A':3, 'B':4}, {'network':'78.212.13.0/24', 'A':5, 'B':6}]
df1 = pd.DataFrame(inp1)
print("df1", df1)
inp2 = [{'ip':'1.0.0.10'}, {'ip':'blahblahblah'}, {'ip':'78.212.13.249'}]
df2 = pd.DataFrame(inp2)
print("df2", df2)
Output:
network A B
0 1.0.0.0/24 1 2
1 5.46.8.0/23 3 4
2 78.212.13.0/24 5 6
ip
0 1.0.0.10
1 blahblahblah
2 78.212.13.249
The ultimate output I want would appear as follows:
ip A B
0 1.0.0.10 1 2
1 blahblahblah NaN NaN
2 78.212.13.249 5 6
I want to iterate through each cell in df2['ip'] and check if it belongs to a network in df1['network']. If it belongs to a network, it would return the corresponding A and B column for the specific ip address. I have referenced this article and considered netaddr, IPNetwork, IPAddress, ipaddress but cannot quite figure it out.
Help appreciated!
You can do it using netaddr + apply(). Here is an example:
import pandas as pd
from netaddr import IPNetwork, IPAddress, AddrFormatError
network_df = pd.DataFrame([
{'network': '1.0.0.0/24', 'A': 1, 'B': 2},
{'network': '5.46.8.0/23', 'A': 3, 'B': 4},
{'network': '78.212.13.0/24', 'A': 5, 'B': 6}
])
ip_df = pd.DataFrame([{'ip': '1.0.0.10'}, {'ip': 'blahblahblah'}, {'ip': '78.212.13.249'}])
# create all networks using netaddr
# use a list, not a generator: a generator is consumed as it is iterated,
# so later lookups would no longer see the networks already passed over
networks = [IPNetwork(n) for n in network_df.network.to_list()]
def find_network(ip):
    # return empty string when bad/wrong IP
    try:
        ip_address = IPAddress(ip)
    except AddrFormatError:
        return ''
    # return network name as string if we found network
    for network in networks:
        if ip_address in network:
            return str(network.cidr)
    return ''
# add network column. set network names by ip column
ip_df['network'] = ip_df['ip'].apply(find_network)
# just merge by network columns(str in both dataframes)
result = pd.merge(ip_df, network_df, how='left', on='network')
# you don't need network column in expected output...
result = result.drop(columns=['network'])
print(result)
# ip A B
# 0 1.0.0.10 1.0 2.0
# 1 blahblahblah NaN NaN
# 2 78.212.13.249 5.0 6.0
See comments. Hope this helps.
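For comparison, the membership check itself can also be done with the standard-library ipaddress module that the question mentions, without netaddr or pandas. This is only a sketch of the lookup logic. The function `lookup` and the dictionary layout are names chosen for this illustration.

```python
import ipaddress

# Network -> values to attach, mirroring the columns A and B of the question.
network_values = {
    "1.0.0.0/24": {"A": 1, "B": 2},
    "5.46.8.0/23": {"A": 3, "B": 4},
    "78.212.13.0/24": {"A": 5, "B": 6},
}

# Parse each network once, up front.
parsed = [(ipaddress.ip_network(cidr), values)
          for cidr, values in network_values.items()]


def lookup(ip_text):
    """Return the values of the first network containing ip_text, else None."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:  # raised for strings that are not valid IP addresses
        return None
    for network, values in parsed:
        if address in network:
            return values
    return None


ips = ["1.0.0.10", "blahblahblah", "78.212.13.249"]
result = [{"ip": ip, **(lookup(ip) or {"A": None, "B": None})} for ip in ips]
for row in result:
    print(row)
```

This is still a linear scan over the networks for every address, so for large tables a smarter index would be needed.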
If you're willing to use R instead of Python, I've written an ipaddress package which can solve this problem. There's still an underlying loop, but it's implemented in C++ (much faster!)
library(tibble)
library(ipaddress)
library(fuzzyjoin)
addr <- tibble(
address = ip_address(c("1.0.0.10", "blahblahblah", "78.212.13.249"))
)
#> Warning: Problem on row 2: blahblahblah
nets <- tibble(
network = ip_network(c("1.0.0.0/24", "5.46.8.0/23", "78.212.13.0/24")),
A = c(1, 3, 5),
B = c(2, 4, 6)
)
fuzzy_left_join(addr, nets, c("address" = "network"), is_within)
#> # A tibble: 3 x 4
#> address network A B
#> <ip_addr> <ip_netwk> <dbl> <dbl>
#> 1 1.0.0.10 1.0.0.0/24 1 2
#> 2 NA NA NA NA
#> 3 78.212.13.249 78.212.13.0/24 5 6
Created on 2020-09-02 by the reprex package (v0.3.0)
I have just found the function facet_grid in ggplot2, it's awesome. The question is: I have a list with 6 countries (column HC) and destination of flights all around the world. My data look like this:
HC Reason Destination freq Perc
<chr> <chr> <chr> <int> <dbl>
1 Germany Study Germany 9 0.3651116
2 Germany Work Germany 3 0.1488095
3 Germany Others Germany 3 0.4901961
4 Hungary Study Germany 105 21.4285714
5 Hungary Work Germany 118 17.6382661
6 Hungary Others Germany 24 5.0955414
7 Luxembourg Study Germany 362 31.5056571
Is there a way to show only the top ten destinations for each country while still using facet_grid? I'm trying to make a scatter plot this way:
Geograp %>%
gather(key=Destination, value=freq, -Reason, -Qcountry) %>%
rename(HC = Qcountry) %>%
group_by(HC,Reason) %>%
mutate(Perc=freq*100/sum(freq)) %>%
ggplot(aes(x=Perc, y=reorder(Destination,Perc))) +
geom_point(size=3) +
theme_bw() +
facet_grid(HC~Reason) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey60", linetype = "dashed"))
This produces a graph in which the y-axis is overplotted, which I want to avoid. Thanks in advance!!!
You could create a variable indicating the rank of each destination by country and then in the ggplot call select rows with ranking <= 10, e.g.
ggplot(data = mydata[mydata$rank <= 10, ], ....)
PS: Currently you create the data and plot it in a single pipeline. I would separate the data creation step from the plotting step.
As you have not posted your data in a reproducible format (check out dput()), I have used sample data. Using the dplyr package, I grouped by the grp variable (group_by(grp); in your case this is the country), selected the top 10 rows per group (top_n(n = 10, ...)) ranked by the x variable (wt = x; in your case this will be freq), and plotted the result (a scatter plot in this case):
library(dplyr)
library(ggplot2)
set.seed(123)
d <- data.frame(x = runif(90),grp = gl(3, 30))
d %>%
group_by(grp) %>%
top_n(n = 10, wt = x) %>%
ggplot(aes(x=x, y=grp)) + geom_point()
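The "top n per group" step is independent of the plotting library. As a language-neutral illustration, here is a small standard-library Python sketch of the same selection. The sample rows and the helper `top_n_per_group` are invented for this example and are not the asker's data.

```python
import heapq
from collections import defaultdict

# Hypothetical (country, destination, freq) rows.
rows = [
    ("Germany", "France", 9),
    ("Germany", "Spain", 3),
    ("Germany", "Italy", 5),
    ("Hungary", "Germany", 105),
    ("Hungary", "Austria", 118),
    ("Hungary", "Spain", 24),
]


def top_n_per_group(rows, n):
    """Keep the n destinations with the highest freq for each country."""
    groups = defaultdict(list)
    for country, destination, freq in rows:
        groups[country].append((destination, freq))
    return {
        country: heapq.nlargest(n, items, key=lambda item: item[1])
        for country, items in groups.items()
    }


top2 = top_n_per_group(rows, 2)
for country, items in top2.items():
    print(country, items)
```

With n = 10 this mirrors the group_by() plus top_n() step, after which only the kept rows are passed on to the plot.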