Circlize chord diagram with multiple levels of data - chord-diagram

I am a bit stuck.
I want to show flows between regions for trafficked species via a chord diagram in circlize, but I am unable to work out how to plot the data when columns 1 and 2 represent the "connection", column 3 is the "factor" of interest, and column 4 holds the values.
I have included a sample of the data below (yes, I am aware Indonesia is listed as a region). As you can see, each species is not unique to a particular region. I would like to produce a plot similar to the one included below, but with the "countries" replaced by "species" for each region. Is this possible?
import_region export_region species flow
North America Europe Acanthosaura armata 0.0104
Southeast Asia Europe Acanthosaura armata 0.0022
Indonesia Europe Acanthosaura armata 0.1971
Indonesia Europe Acrochordus granulatus 0.7846
Southeast Asia Europe Acrochordus granulatus 0.1101
Indonesia Europe Acrochordus javanicus 2.00E-04
Southeast Asia Europe Acrochordus javanicus 0.0015
Indonesia North America Acrochordus javanicus 0.0024
East Asia Europe Acrochordus javanicus 0.0028
Indonesia Europe Ahaetulla prasina 4.00E-04
Southeast Asia Europe Ahaetulla prasina 4.00E-04
Southeast Asia East Asia Amyda cartilaginea 0.0027
Indonesia East Asia Amyda cartilaginea 5.00E-04
Indonesia Europe Amyda cartilaginea 0.004
Indonesia Southeast Asia Amyda cartilaginea 0.0334
Europe North America Amyda cartilaginea 4.00E-04
Indonesia North America Amyda cartilaginea 0.1291
Southeast Asia Southeast Asia Amyda cartilaginea 0.0283
Indonesia West Asia Amyda cartilaginea 0.7614
South Asia Europe Amyda cartilaginea 2.8484
Australasia Europe Apodora papuana 0.0368
Indonesia North America Apodora papuana 0.324
Indonesia Europe Apodora papuana 0.0691
Europe Europe Apodora papuana 0.0106
Indonesia East Asia Apodora papuana 0.0129
Europe North America Apodora papuana 0.0034
East Asia East Asia Apodora papuana 2.00E-04
Indonesia Southeast Asia Apodora papuana 0.0045
East Asia North America Apodora papuans 0.0042
example of diagram similar to what I would like, please click link below:
chord diagram

In the circlize package, the chordDiagram() function only accepts a "from" column, a "to" column and an optional "value" column. In your case, however, the original data frame can be transformed into a three-column data frame.
In your example, you want to distinguish e.g. Acanthosaura_armata in North America from Acanthosaura_armata in Europe. One solution is to merge species names and region names, such as Acanthosaura_armata|North_America, to form a unique identifier. Next I will demonstrate how to visualize this dataset with the circlize package.
Read in the data. Note that I replaced spaces with underscores.
df = read.table(textConnection(
"import_region export_region species flow
North_America Europe Acanthosaura_armata 0.0104
Southeast_Asia Europe Acanthosaura_armata 0.0022
Indonesia Europe Acanthosaura_armata 0.1971
Indonesia Europe Acrochordus_granulatus 0.7846
Southeast_Asia Europe Acrochordus_granulatus 0.1101
Indonesia Europe Acrochordus_javanicus 2.00E-04
Southeast_Asia Europe Acrochordus_javanicus 0.0015
Indonesia North_America Acrochordus_javanicus 0.0024
East_Asia Europe Acrochordus_javanicus 0.0028
Indonesia Europe Ahaetulla_prasina 4.00E-04
Southeast_Asia Europe Ahaetulla_prasina 4.00E-04
Southeast_Asia East_Asia Amyda_cartilaginea 0.0027
Indonesia East_Asia Amyda_cartilaginea 5.00E-04
Indonesia Europe Amyda_cartilaginea 0.004
Indonesia Southeast_Asia Amyda_cartilaginea 0.0334
Europe North_America Amyda_cartilaginea 4.00E-04
Indonesia North_America Amyda_cartilaginea 0.1291
Southeast_Asia Southeast_Asia Amyda_cartilaginea 0.0283
Indonesia West_Asia Amyda_cartilaginea 0.7614
South_Asia Europe Amyda_cartilaginea 2.8484
Australasia Europe Apodora_papuana 0.0368
Indonesia North_America Apodora_papuana 0.324
Indonesia Europe Apodora_papuana 0.0691
Europe Europe Apodora_papuana 0.0106
Indonesia East_Asia Apodora_papuana 0.0129
Europe North_America Apodora_papuana 0.0034
East_Asia East_Asia Apodora_papuana 2.00E-04
Indonesia Southeast_Asia Apodora_papuana 0.0045
East_Asia North_America Apodora_papuans 0.0042"),
header = TRUE, stringsAsFactors = FALSE)
Also, I removed some rows which have very tiny values.
df = df[df[[4]] > 0.01, ]
Assign colors for species and regions.
library(circlize)
library(RColorBrewer)
all_species = unique(df[[3]])
color_species = structure(brewer.pal(length(all_species), "Set1"), names = all_species)
all_regions = unique(c(df[[1]], df[[2]]))
color_regions = structure(brewer.pal(length(all_regions), "Set2"), names = all_regions)
Group by species
First I will demonstrate how to group the chord diagram by species.
As mentioned before, we use species|region as unique identifier.
df2 = data.frame(from = paste(df[[3]], df[[1]], sep = "|"),
to = paste(df[[3]], df[[2]], sep = "|"),
value = df[[4]], stringsAsFactors = FALSE)
Next we adjust the order of all sectors to first order by species, then by regions.
combined = unique(data.frame(regions = c(df[[1]], df[[2]]),
species = c(df[[3]], df[[3]]), stringsAsFactors = FALSE))
combined = combined[order(combined$species, combined$regions), ]
order = paste(combined$species, combined$regions, sep = "|")
We want the color of the links to be the same as the color of the regions.
grid.col = structure(color_regions[combined$regions], names = order)
Since the chord diagram is grouped by species, gaps between species should be larger than inside each species.
gap = rep(1, length(order))
gap[which(!duplicated(combined$species, fromLast = TRUE))] = 5
With all settings ready, we can now make the chord diagram.
In the following code, we set preAllocateTracks so that the circular lines which represent species can be added afterwards.
circos.par(gap.degree = gap)
chordDiagram(df2, order = order, annotationTrack = c("grid", "axis"),
grid.col = grid.col, directional = TRUE,
preAllocateTracks = list(
track.height = 0.04,
track.margin = c(0.05, 0)
)
)
Circular lines are added to represent species:
for(species in unique(combined$species)) {
l = combined$species == species
sn = paste(combined$species[l], combined$regions[l], sep = "|")
highlight.sector(sn, track.index = 1, col = color_species[species],
text = species, niceFacing = TRUE)
}
circos.clear()
And the legends for regions and species:
legend("bottomleft", pch = 15, col = color_regions,
legend = names(color_regions), cex = 0.6)
legend("bottomright", pch = 15, col = color_species,
legend = names(color_species), cex = 0.6)
The plot looks like this:
Group by regions
The code is similar, so I will not explain it and just attach it below. The plot looks like this:
## group by regions
df2 = data.frame(from = paste(df[[1]], df[[3]], sep = "|"),
to = paste(df[[2]], df[[3]], sep = "|"),
value = df[[4]], stringsAsFactors = FALSE)
combined = unique(data.frame(regions = c(df[[1]], df[[2]]),
species = c(df[[3]], df[[3]]), stringsAsFactors = FALSE))
combined = combined[order(combined$regions, combined$species), ]
order = paste(combined$regions, combined$species, sep = "|")
grid.col = structure(color_species[combined$species], names = order)
gap = rep(1, length(order))
gap[which(!duplicated(combined$regions, fromLast = TRUE))] = 5
circos.par(gap.degree = gap)
chordDiagram(df2, order = order, annotationTrack = c("grid", "axis"),
grid.col = grid.col, directional = TRUE,
preAllocateTracks = list(
track.height = 0.04,
track.margin = c(0.05, 0)
)
)
for(region in unique(combined$regions)) {
l = combined$regions == region
sn = paste(combined$regions[l], combined$species[l], sep = "|")
highlight.sector(sn, track.index = 1, col = color_regions[region],
text = region, niceFacing = TRUE)
}
circos.clear()
legend("bottomleft", pch = 15, col = color_regions,
legend = names(color_regions), cex = 0.6)
legend("bottomright", pch = 15, col = color_species,
legend = names(color_species), cex = 0.6)

Map of New York State counties with binned colors and legend

I am trying to make a county-level map of the state of New York. I would like to color each county based on their level of unionization. I need the map and legend to have four discrete colors of red, rather than a red gradient. I need the legend to display these four different colors with non-overlapping labels/ranges (e.g. 0-25; 26-50; 51-75; 76-100).
Here is my data:
fips unionized
1 36001 33.33333
2 36005 86.11111
3 36007 0.00000
4 36017 0.00000
5 36021 0.00000
6 36027 66.66667
7 36029 40.00000
8 36035 50.00000
9 36039 0.00000
10 36047 82.85714
11 36051 0.00000
12 36053 100.00000
13 36055 30.76923
14 36057 0.00000
15 36059 84.37500
16 36061 81.81818
17 36063 60.00000
18 36065 50.00000
19 36067 71.42857
20 36069 0.00000
21 36071 55.55556
22 36073 0.00000
23 36079 100.00000
24 36081 92.15686
25 36083 50.00000
26 36085 100.00000
27 36087 87.50000
28 36101 0.00000
29 36103 63.88889
30 36105 0.00000
31 36107 0.00000
32 36111 50.00000
33 36113 50.00000
34 36115 100.00000
35 36117 0.00000
36 36119 73.33333
37 36121 0.00000
38 36123 0.00000
I have successfully made the map with a gradient of colors, but cannot figure out how to make discrete colors in the map and legend.
Here is my code:
library(usmap)
library(ggplot2)
plot_usmap(regions = "counties", include = c("NY"), data = Z, values = "unionized") +
labs(title = "Percent Unionized", subtitle = "") +
scale_fill_continuous(low = "white", high = "red", na.value="light grey", name = "Unionization") + theme(legend.position = "right")
Thanks!
This could be achieved via scale_fill_binned and guide_bins. Try this:
library(usmap)
library(ggplot2)
plot_usmap(regions = "counties", include = c("NY"), data = Z, values = "unionized") +
labs(title = "Percent Unionized", subtitle = "") +
scale_fill_binned(low = "white", high = "red", na.value="light grey", name = "Unionization", guide = guide_bins(axis = FALSE, show.limits = TRUE)) +
theme(legend.position = "right")
A second option would be to bin the variable manually and use scale_fill_manual to set the fill colors which makes it easy to set the labels and has the advantage that it adds the NAs automatically. For the color scale I make use of colorRampPalette (By default colorRampPalette interpolates in rgb color space. To get fill colors like the one using scale_fill_binned you can add the argument space = "Lab".).
library(usmap)
library(ggplot2)
Z$union_bin <- cut_interval(Z$unionized, n = 4, labels = c("0-25", "26-50", "51-75", "76-100"))
plot_usmap(regions = "counties", include = c("NY"), data = Z, values = "union_bin") +
labs(title = "Percent Unionized", subtitle = "") +
scale_fill_manual(values = colorRampPalette(c("white", "red"))(5)[2:5],
na.value="light grey", name = "Unionization") +
theme(legend.position = "right")

How to find word frequency per country list in pandas?

Let's say I have a .CSV which has three columns: tidytext, location, vader_senti
I was already able to get the number of positive, neutral and negative texts (rather than words) per country using the following code:
data_vis = pd.read_csv(r"csviamcrpreprocessed.csv", usecols=fields)
def print_sentiment_scores(text):
    vadersenti = analyser.polarity_scores(str(text))
    return pd.Series([vadersenti['pos'], vadersenti['neg'], vadersenti['neu'], vadersenti['compound']])
data_vis[['vadersenti_pos', 'vadersenti_neg', 'vadersenti_neu', 'vadersenti_compound']] = data_vis['tidytext'].apply(print_sentiment_scores)
data_vis['vader_senti'] = 'neutral'
data_vis.loc[data_vis['vadersenti_compound'] > 0.3 , 'vader_senti'] = 'positive'
data_vis.loc[data_vis['vadersenti_compound'] < 0.23 , 'vader_senti'] = 'negative'
data_vis['vader_possentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] > 0.3 , 'vader_possentiment'] = 1
data_vis['vader_negsentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] <0.23 , 'vader_negsentiment'] = 1
data_vis['vader_neusentiment'] = 0
data_vis.loc[(data_vis['vadersenti_compound'] <=0.3) & (data_vis['vadersenti_compound'] >=0.23) , 'vader_neusentiment'] = 1
sentimentbylocation = data_vis.groupby(["Location"])['vader_senti'].value_counts()
sentimentbylocation
sentimentbylocation gives me the following results:
Location vader_senti
Afghanistan negative 151
positive 25
neutral 2
Albania negative 6
positive 1
Algeria negative 116
positive 13
neutral 4
To get the most common positive words, I used this code:
def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens
tokenizer=TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct + ['rt','via','...','…','’','—','—:',"‚","â"]
pos_lines = list(data_vis[data_vis.vader_senti == 'positive'].tidytext)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
Running this will give me the most common words and the number of times they appeared, such as
[(good, 1212),
(amazing, 123)
However, what I want to see is how many of these positive words appeared in a country.
For example:
I have a sample CSV here: https://drive.google.com/file/d/112k-6VLB3UyljFFUbeo7KhulcrMedR-l/view?usp=sharing
Create a column for each most_common word, then do a groupby location and use agg to apply a sum for each count:
words = [i[0] for i in pos_freq.most_common()]
# lowering all cases in tidytext
data_vis.tidytext = data_vis.tidytext.str.lower()
for i in words:
    data_vis[i] = data_vis.tidytext.str.count(i)
funs = {i: 'sum' for i in words}
grouped = data_vis.groupby('Location').agg(funs)
Based on the example from the CSV and using most_common as ['good', 'amazing'] the result would be:
grouped
# good amazing
# Location
# Australia 0 1
# Belgium 6 4
# Japan 2 1
# Thailand 2 0
# United States 1 0
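One caveat with str.count(i): it counts regular-expression matches, so 'good' is also counted inside 'goodbye'. Below is a minimal sketch of whole-word counting per location, written in plain Python rather than pandas. The rows, the helper name count_words_by_location and the simple regex tokenizer are assumptions for illustration, not part of the original data or the TweetTokenizer pipeline:

```python
import re
from collections import Counter, defaultdict

def count_words_by_location(rows, words):
    """Count whole-word occurrences of each word per location.

    rows  -- iterable of (location, text) pairs (hypothetical sample shape)
    words -- the words of interest, e.g. the most common positive words
    """
    wanted = {w.lower() for w in words}
    counts = defaultdict(Counter)
    for location, text in rows:
        # Simple tokenizer: runs of letters/apostrophes, lower-cased.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[location].update(t for t in tokens if t in wanted)
    return counts

# Made-up rows for illustration only.
rows = [
    ("Japan", "Good food, good people"),
    ("Japan", "Goodbye for now, it was amazing"),
    ("Belgium", "Amazing chocolate"),
]
result = count_words_by_location(rows, ["good", "amazing"])
for location in sorted(result):
    print(location, dict(result[location]))
```

Because tokens are compared as whole words, "Goodbye" does not add to the count for "good" here, which it would with a plain substring or regex count.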

Add sample size to a panel figure of boxplots

I am trying to add sample size to boxplots (preferably at the top or bottom of them) that are grouped by two levels. I used the facet_grid() function to produce a panel plot. I then tried to use the annotate() function to add the sample sizes, however this couldn't work because it repeated the values in the second panel. Is there a simple way to do this?
head(FeatherData, n=10)
Location Status FeatherD Species ID
## 1 TX Resident -27.41495 Carolina wren CARW (32)
## 2 TX Resident -29.17626 Carolina wren CARW (32)
## 3 TX Resident -31.08070 Carolina wren CARW (32)
## 4 TX Migrant -169.19579 Yellow-rumped warbler YRWA (28)
## 5 TX Migrant -170.42079 Yellow-rumped warbler YRWA (28)
## 6 TX Migrant -158.66925 Yellow-rumped warbler YRWA (28)
## 7 TX Migrant -165.55278 Yellow-rumped warbler YRWA (28)
## 8 TX Migrant -170.43374 Yellow-rumped warbler YRWA (28)
## 9 TX Migrant -170.21801 Yellow-rumped warbler YRWA (28)
## 10 TX Migrant -184.45871 Yellow-rumped warbler YRWA (28)
ggplot(FeatherData, aes(x = Location, y = FeatherD)) +
geom_boxplot(alpha = 0.7, fill='#A4A4A4') +
scale_y_continuous() +
scale_x_discrete(name = "Location") +
theme_bw() +
theme(plot.title = element_text(size = 20, family = "Times", face =
"bold"),
text = element_text(size = 20, family = "Times"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 15)) +
ylab(expression(Feather~delta^2~H["f"]~"‰")) +
facet_grid(. ~ Status)
There are multiple ways to do this sort of task. The most flexible way is to compute your statistic outside the plotting call as a separate dataframe and use it as its own layer:
library(dplyr)
library(ggplot2)
cw_summary <- ChickWeight %>%
group_by(Diet) %>%
tally()
cw_summary
# A tibble: 4 x 2
Diet n
<fctr> <int>
1 1 220
2 2 120
3 3 120
4 4 118
ggplot(ChickWeight, aes(Diet, weight)) +
geom_boxplot() +
facet_grid(~Diet) +
geom_text(data = cw_summary,
aes(Diet, Inf, label = n), vjust = 1)
The other method is to use the summary functions built in, but that can be fiddly. Here's an example:
ggplot(ChickWeight, aes(Diet, weight)) +
geom_boxplot() +
stat_summary(fun.y = median, fun.ymax = length,
geom = "text", aes(label = ..ymax..), vjust = -1) +
facet_grid(~Diet)
Here I used fun.y to position the summary at the median of the y values, and used fun.ymax to compute an internal variable called ..ymax.. with the function length (which just counts the number of observations).

Incorrect Long/Lat Plotting with ggplot

dat <- read.table(text=" 'Country_of_Asylum' 'ISO_3' 'Refugees_1000_inhabitants' Lat Long
Lebanon LBN 208.91 33.8333 35.8333
Jordan JOR 89.55 31.0000 36.0000
Nauru NRU 50.60 -0.5333 166.9167
Chad TCD 30.97 15.0000 19.0000
Turkey TUR 23.72 39.0000 35.0000
'South Sudan' SSD 22.32 4.8500 31.6000
Mauritania MRT 19.36 20.0000 -12.0000
Djibouti DJI 16.88 11.5000 43.0000
Sweden SWE 14.66 62.0000 15.0000
Malta MLT 14.58 35.9000 14.4000", header=TRUE)
data.frame(top_ten_pcapita)
library(ggplot2)
library(maps)
mdat <- map_data('world')
str(mdat)
ggplot() +
geom_polygon(dat=mdat, aes(long, lat, group=group), fill="grey50") +
geom_point(data=top_ten_pcapita,
aes(x=Lat, y=Long, map_id=ISO_3, size=`Refugees_1000_inhabitants`), col="red")
I tried to make a map on ggplot, but the longitudes and latitudes are completely off. I'm not entirely sure what's going on. For example, why is the lat. going over 100 on the map?
You mixed up longitude and latitude. If you added labels to your points you'd see that none of the countries is plotted at the correct place.
So make sure that x = Longitude and y = Latitude and it'll work:
ggplot() +
geom_polygon(dat = mdat, aes(long, lat, group = group), fill = "grey50") +
geom_point(data = top_ten_pcapita,
aes(x = Long, y = Lat, size = Refugees_1000_inhabitants), col = "red")
You switched the x and y axes: x should be longitude, y should be latitude. Note that Nauru's negative value is its latitude (it lies just south of the equator), not its longitude, so the data itself is fine.
Try this:
ggplot() +
geom_polygon(dat=mdat, aes(long, lat, group=group), fill="grey50") +
geom_point(data=top_ten_pcapita,
aes(x=Long, y=Lat, size=`Refugees_1000_inhabitants`), col="red")

Create 33 variables without manually writing them in python

I am creating 33 variables which each execute a db query, and I know there must be a way to make it simpler than writing 33 lines of code manually. What I am doing right now is:
allrtd = c.execute("SELECT Amsterdam from rtdtimes")
allrtd1 = c.execute("SELECT Bucharest from rtdtimes")
allrtd2 = c.execute("SELECT Barcelona from rtdtimes")
allrtd3 = c.execute("SELECT Berlin from rtdtimes")
allrtd4 = c.execute("SELECT Bratislava from rtdtimes")
allrtd5 = c.execute("SELECT Budapest from rtdtimes")
allrtd6 = c.execute("SELECT Copenhagen from rtdtimes")
allrtd7 = c.execute("SELECT Dublin from rtdtimes")
allrtd8 = c.execute("SELECT Dusseldorf from rtdtimes")
allrtd9 = c.execute("SELECT Dubai from rtdtimes")
allrtd10 = c.execute("SELECT Florence from rtdtimes")
allrtd11 = c.execute("SELECT Frankfurt from rtdtimes")
allrtd12 = c.execute("SELECT Geneva from rtdtimes")
allrtd13 = c.execute("SELECT Hamburg from rtdtimes")
allrtd14 = c.execute("SELECT HongKong from rtdtimes")
allrtd15 = c.execute("SELECT Istanbul from rtdtimes")
allrtd16 = c.execute("SELECT LosAngeles from rtdtimes")
allrtd17 = c.execute("SELECT London from rtdtimes")
allrtd18 = c.execute("SELECT Madrid from rtdtimes")
allrtd19 = c.execute("SELECT Milan from rtdtimes")
allrtd20 = c.execute("SELECT Marseille from rtdtimes")
allrtd21 = c.execute("SELECT Moscow from rtdtimes")
allrtd22 = c.execute("SELECT Munich from rtdtimes")
allrtd23 = c.execute("SELECT NewYork from rtdtimes")
allrtd24 = c.execute("SELECT Paris from rtdtimes")
allrtd25 = c.execute("SELECT Prague from rtdtimes")
allrtd26 = c.execute("SELECT Rotterdam from rtdtimes")
allrtd27 = c.execute("SELECT Sofia from rtdtimes")
allrtd28 = c.execute("SELECT Stockholm from rtdtimes")
allrtd29 = c.execute("SELECT Venice from rtdtimes")
allrtd30 = c.execute("SELECT Vienna from rtdtimes")
allrtd31 = c.execute("SELECT Warsaw from rtdtimes")
allrtd32 = c.execute("SELECT Zurich from rtdtimes")
I have a list with all the city names already which is done like this:
city1 = ['Amsterdam', 'Bucharest', 'Barcelona', 'Berlin', 'Bratislava', 'Budapest',
'Copenhagen', 'Dublin', 'Dusseldorf', 'Dubai', 'Florence', 'Frankfurt',
'Geneva', 'Hamburg', 'HongKong', 'Istanbul', 'LosAngeles', 'London',
'Madrid', 'Milan', 'Marseille', 'Moscow', 'Munich', 'NewYork',
'Paris', 'Prague', 'Rotterdam', 'Sofia', 'Stockholm', 'Venice', 'Vienna',
'Warsaw', 'Zurich']
Also I am using Flask so I need to render this variable in the template so that I can use these variables in the html file.
Right now I am rendering it like this:
return render_template("Allresults.html", city1=city1, allrtd=allrtd,
allrtd1=allrtd1, allrtd2=allrtd2, allrtd3=allrtd3, allrtd4=allrtd4,
allrtd5=allrtd5, allrtd6=allrtd6, allrtd7=allrtd7, allrtd8=allrtd8,
allrtd9=allrtd9, allrtd10=allrtd10, allrtd11=allrtd11, allrtd12=allrtd12,
allrtd13=allrtd13, allrtd14=allrtd14, allrtd15=allrtd15, allrtd16=allrtd16,
allrtd17=allrtd17, allrtd18=allrtd18, allrtd19=allrtd19, allrtd20=allrtd20,
allrtd21=allrtd21, allrtd22=allrtd22, allrtd23=allrtd23, allrtd24=allrtd24,
allrtd25=allrtd25, allrtd26=allrtd26, allrtd27=allrtd27, allrtd28=allrtd28,
allrtd29=allrtd29, allrtd30=allrtd30, allrtd31=allrtd31, allrtd32=allrtd32)
I do not know how to use a dictionary to do exactly this as the duplicate question is very different to what I am doing.
You may benefit from a custom jinja2 filter here:
@app.template_filter('city_data')
def city_data(city_name):
    return c.execute("SELECT {city} from rtdtimes".format(city=city_name))
@app.route('/')
def index():
    return render_template("Allresults.html", cities=city1)
Then, in your Allresults.html template use:
{% for city in cities %}
{{ city|city_data }}
{% endfor %}
However, be very careful because you are leaving yourself open to SQL injection attacks if you ever allow a user to specify the city.
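Column names cannot be passed as bound query parameters, so one way to reduce that risk is to check the name against the known list before building the query, and to collect the results in a dictionary keyed by city. Below is a minimal sketch, assuming an sqlite3 connection and a trimmed-down rtdtimes table created here purely for illustration. The helper fetch_city_times and the sample values are hypothetical:

```python
import sqlite3

# Shortened stand-in for the full city1 list from the question.
city1 = ['Amsterdam', 'Bucharest', 'Barcelona']
ALLOWED_CITIES = frozenset(city1)

def fetch_city_times(cursor, city_name):
    """Return all values of one city column, rejecting unknown names."""
    if city_name not in ALLOWED_CITIES:
        raise ValueError("unknown city: {!r}".format(city_name))
    # Safe to interpolate: city_name is one of our own hard-coded names.
    query = 'SELECT "{}" FROM rtdtimes'.format(city_name)
    return [row[0] for row in cursor.execute(query).fetchall()]

# In-memory table with made-up values, for demonstration only.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE rtdtimes (Amsterdam REAL, Bucharest REAL, Barcelona REAL)")
c.executemany("INSERT INTO rtdtimes VALUES (?, ?, ?)",
              [(1.5, 2.0, 3.5), (4.0, 5.5, 6.0)])

# One dictionary replaces the 33 separate variables.
all_rtd = {city: fetch_city_times(c, city) for city in city1}
print(all_rtd)
```

The dictionary could then be passed to the template as a single argument, e.g. render_template("Allresults.html", all_rtd=all_rtd), and looped over with {% for city, times in all_rtd.items() %}.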