I'm trying to build a sort of bar-chart using a simple data file (.example) containing only 0s or 1s. Here is the data contained in .example:
dest P1 P2 P3 P4 P5 NA
D1 0 1 1 0 0 0
D2 0 0 1 0 0 0
D3 0 1 0 1 0 0
""
GPV 1 1 1 1 1 1
and here is the code I'm using:
set style histogram rowstacked title textcolor lt -1
set datafile missing 'nan'
set style data histograms
plot '.example' using ( $2==0 ? 1 : 0 ) ls 17 title 'NA', \
'' using ( $2==1 ? 1 : 0 ) ls 1, \
for [i=3:5] '.example' using ( column(i)==0 ? 1 : 0) ls 17 notitle, \
for [i=3:5] '' using ( column(i)==1 ? 1 : 0) ls i-1
where the last two commands iterate over a potentially large number of columns, stacking white or colored boxes depending on the value of column(i). To keep the same color order among the columns in the histogram, I would need to merge the two iterations into a single one containing both commands.
Is it possible? Any suggestion on how to do that?
You can use nested loops, which I think is what you want. Use an outer loop over your (potentially large) number of columns and an inner loop over the two options (white vs. colored): for [i=3:5] for [j=0:1]. Then tell gnuplot to skip a column whose content doesn't match the value of j by returning 1/0 (or use the trick, valid for histograms, of setting it to 0, as you're already doing):
set style histogram rowstacked title textcolor lt -1
set datafile missing 'nan'
set style data histograms
plot '.example' using ( $2==0 ? 1 : 0 ) ls 17 title 'NA', \
'' using ( $2==1 ? 1 : 0 ) ls 1, \
for [i=3:5] for [j=0:1] '.example' using ( column(i) == j ? 1 : 0 ) \
ls ( j == 0 ? 17 : i-1 ) notitle
The code above plots the same boxes as what you already have; the only difference is that the value of j switches the line style depending on whether the column's value is 0 or 1.
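To see why nesting the loops changes the stacking order, here is a small Python sketch (not gnuplot; the variable names are made up for illustration) that lists the (i, j) pairs in the order each approach would plot them, together with the line style picked by the ternary expression:

```python
# Illustration of iteration order only; no gnuplot is involved.
columns = range(3, 6)  # same range as "for [i=3:5]"

# Two separate iterations: all white (j=0) entries first, then all colored (j=1) ones.
separate = [(i, 0) for i in columns] + [(i, 1) for i in columns]

# Nested iteration: the white and colored entries alternate column by column.
nested = [(i, j) for i in columns for j in (0, 1)]


def line_style(i, j):
    """Mirror the gnuplot expression ls ( j == 0 ? 17 : i-1 )."""
    return 17 if j == 0 else i - 1


nested_styles = [line_style(i, j) for i, j in nested]

print(separate)
print(nested)
print(nested_styles)
```

With the nested form, each column's white and colored boxes are adjacent in the plot order, which is what keeps the color sequence consistent.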
I have a pandas DataFrame which includes columns (amongst others) like this, with RATING being an integer from 0 to 5 and COMMENT being a string:
RATING COMMENT
1 some text
2 more text
3 other text
... ...
I would now like to mine (for lack of a better word) the comments for the key words in a list of strings:
list = ['like', 'trust', 'etc etc etc']
and would like to iterate through COMMENT and count the number of key words by rating, to get a df out like so:
KEYWORD RATING COUNT
like 1 202
like 2 325
like 3 0
like 4 967
like 5 534
...
trust 1 126
....
how can I achieve this?
I am beginner so would really appreciate your help (and the simpler and more understandable the better)
thank you
Hi, at the moment I have been iterating through manually, i.e.:
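import pandas as pd

# DATA_df is the original data
word_list = ['word', 'words', 'words', 'more']
values = [0] * len(word_list)
tot_val = [values] * 5  # one row of zeros per rating
rating_table = pd.DataFrame(tot_val, columns=word_list)
# loop over the words themselves and over the row positions of the comments
for w in word_list:
    for g in range(len(DATA_df['COMMENT'])):
        if w in DATA_df['COMMENT'][g]:
            # row index is rating - 1
            rating_table.loc[DATA_df['RATING'][g] - 1, w] += 1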
This gives a DataFrame like so:
word words words more
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
that I am then trying to add to... it appears really clunky.
I managed to solve it. The key points I learnt: use groupby to pre-select the data by rating, which slices the data so that you can loop over the groups; and str.lower() combined with str.count() works well for the counting.
I would be thankful if more experienced programmers could show me a better solution, but at least this works.
rating = [1,2,3,4,5]
rategroup = tp_clean.groupby('Rating')
#print (rategroup.groups)
results_list =[]
for w in word_list:
    current = [w]
    for r in rating:
        stargroup = rategroup.get_group(str(r))
        found = stargroup['Content'].str.lower().str.count(w)
        c = found.sum()
        current.append(c)
    results_list.append(current)
results_df = pd.DataFrame (results_list, columns=['Keyword','1 Star','2 Star','3 Star','4 Star','5 Star'])
The one thing I am still struggling with is how to use regex to make it look for full words. I believe \b is the right one but how do I put it into str.count function?
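One way to do the whole-word match: Series.str.count interprets its argument as a regular expression, so \b can go directly into the pattern as long as it is written as a raw string. The sketch below uses made-up sample comments and a hypothetical helper count_whole_word; re.escape is there only so that a keyword containing regex metacharacters is taken literally.

```python
import re

import pandas as pd

# Made-up sample data for illustration.
comments = pd.Series([
    "I like it",
    "Unlikely to return, but I LIKE the staff",
    "dislike",
])


def count_whole_word(series, word):
    """Count occurrences of `word` as a whole word, ignoring case."""
    # \b marks a word boundary; the raw string keeps the backslash intact.
    pattern = r"\b" + re.escape(word) + r"\b"
    return series.str.count(pattern, flags=re.IGNORECASE)


counts = count_whole_word(comments, "like")
print(counts.tolist())  # [1, 1, 0]
```

Passing flags=re.IGNORECASE also makes the separate str.lower() step unnecessary, although keeping it would do no harm.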
My code uses a column called bookingstatus that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, dependent on the booking status). There are a lot more no than yes, so I would like to take a sample with all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to take a sample like this?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, so you'll receive an error if you ask for more observations than there are in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, rows whose weight is zero are never picked, so without replacement you cannot draw more rows than there are non-zero weights (here, two). We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to do this in the end; here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Undersample class 0 down to the size of class 1
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
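For what it's worth, the same undersampling can probably be written more compactly with groupby(...).sample, which exists in pandas 1.1 and later. This is only a sketch on made-up data (the value column is hypothetical); it assumes you want every class cut down to the size of the smallest one, which keeps the minority class whole.

```python
import pandas as pd

# Made-up data: 3 booked rows (1) and 7 unbooked rows (0).
df = pd.DataFrame({
    "bookingstatus": [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    "value": range(10),
})

# Size of the smallest class (here, the number of 1s).
n_minority = df["bookingstatus"].value_counts().min()

# Draw n_minority rows from each class without replacement;
# the minority class is therefore kept in full.
balanced = df.groupby("bookingstatus").sample(n=n_minority, random_state=1)

print(balanced["bookingstatus"].value_counts().to_dict())
```

Setting random_state makes the draw reproducible; drop it if you want a different sample of the 0 rows on each run.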
I'm trying to use the sqlSave command to import an R data frame into a SQL database. Below is my code:
> head(final_series)
Price Time FactorID CountryID id
1 5.363334e+01 1980-01-01 1 1 1
2 5.143333e+01 1980-04-01 1 1 16384
3 5.060000e+01 1980-07-01 1 1 32767
4 5.250000e+01 1980-10-01 1 1 49150
5 5.266667e+01 1981-01-01 1 1 65533
6 5.280000e+01 1981-04-01 1 1 81916
> sqlSave(dbhandle, final_series, tablename = "db_time_price", varTypes = c(id="uniqueidentifier", FactorID= "float", CountryID="float", Time="date", Price="float"), append=TRUE, verbose = T, fast = F)
But I got the following error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Does anyone know why? Thanks!
Did you check whether the table already exists? If the table already exists but with different dimensions, you would see this error.
I am working on a heat map with an unusual dataset: the entries contain a symbol ("/").
Here is an example of my dataset, 1q.txt:
one two three
2009 0/0 1 0/0 1 0/0 1
2010 0/0 1 0/0 1 0/0 1
2011 0/0 1 0/0 1 6/179.5 1
2012 0/0 1 2/0.4 1 11/83.0 1
2013 7/0.8 1 7/21.3 1 17/268.5 1
2014 1/3.5 1 4/7.7 1 9/37.9 1
and here is my gnuplot script:
set term pos eps font 20
unset colorbox
unset key
set nocbtics
set cblabel "Score"
set cbtics scale 0
set cbrange [ 0.00000 : 110.00000 ] noreverse nowriteback
set palette defined ( 0.0 "#FFFFFF",\
1 "#FFCCCC",\
2 "#FF9999 ",\
3 "#FF6666")
set size 1, 0.5
set output '1q.eps'
YTICS="`awk 'BEGIN{getline}{printf "%s ",$1}' '1q.dat'`"
XTICS="`head -1 '1q.dat'`"
set for [i=1:words(XTICS)] xtics ( word(XTICS,i) i-1 )
set for [i=1:words(YTICS)] ytics ( word(YTICS,i) i-1 )
set for [i=1:words(XTICS)] xtics ( word(XTICS,i) 2*i-1 )
plot "<awk '{$1=\"\"}1' '1q.dat' | sed '1 d'" matrix every 2::1 w image, \
'' matrix using ($1+1):2:(sprintf('%.f', $3)) every 2 with labels
What I'm trying to do here is display the "0/0" entries as labels in the heat map and use the integer numbers for the heat map color.
The problem I face is that gnuplot only takes the number before the "/" and ignores the rest.
Here is the result of my current plot.
How can I make the heat map show a label like "1/3.5" and have the color based on the integer number?
There is no need to use sprintf at all. Simply use stringcolumn to get the raw content of a column as saved in the data file:
plot "<awk '{$1=\"\"}1' '1q.dat' | sed '1 d'" matrix every 2::1 w image, \
'' matrix using ($1+1):2:(stringcolumn(3)) every 2 with labels
Not sure how else to ask this but, I want to search for a term within several string elements. Here's what my code looks like (but wrong):
inplay = vector(length=nrow(des))
for (ii in 1:nrow(des)) {
if (des[ii] = 'In play%')
inplay[ii] = 1
else inplay[ii] = 0
}
des is a vector that stores strings such as "Swinging Strike", "In play (run(s))", "In play (out(s) recorded)", etc. What I want inplay to store is a vector of 1s and 0s corresponding to the des vector, with a 1 indicating that the des value contains "In play%" and a 0 otherwise.
I believe the 3rd line is incorrect, because all this does is return a vector of 0s with a 1 in the last element.
Thanks in advance!
The data.table package has syntax that is often similar to SQL. The package includes %like%, which is a "convenience function for calling regexpr". Here is an example taken from its help file:
library(data.table)
## Create the data.table:
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
## Subset the DT table where the Name column is like "Mar%":
DT[Name %like% "^Mar"]
## Name Salary
## 1: Mary 2
## 2: Martha 4
The R analog to SQL's LIKE is just R's ordinary indexing syntax.
The 'LIKE' operator selects data rows from a table by matching string values in a specified column against a user-supplied pattern.
> # create a data frame having a character column
> clrs = c("blue", "black", "brown", "beige", "berry", "bronze", "blue-green", "blueberry")
> dfx = data.frame(Velocity=sample(100, 8), Colors=clrs)
> dfx
Velocity Colors
1 90 blue
2 94 black
3 71 brown
4 36 beige
5 75 berry
6 2 bronze
7 89 blue-green
8 93 blueberry
> # create a pattern to use (the same as you would do when using the LIKE operator)
> ptn = '^be.*?' # gets beige and berry but not blueberry
> # execute a pattern-matching function on your data to create an index vector
> ndx = grep(ptn, dfx$Colors, perl=T)
> # use this index vector to extract the rows you want from the data frame:
> selected_rows = dfx[ndx,]
> selected_rows
Velocity Colors
4 36 beige
5 75 berry
In SQL, that would be:
SELECT * FROM dfx WHERE Colors LIKE 'be%'
Something like regexpr?
> d <- c("Swinging Strike", "In play (run(s))", "In play (out(s) recorded)")
> regexpr('In play', d)
[1] -1 1 1
attr(,"match.length")
[1] -1 7 7
>
or grep
> grep('In play', d)
[1] 2 3
>
Since stringr 1.5.0, you can use str_like, which follows the structure of SQL's LIKE:
library(stringr)
fruit <- c("apple", "banana", "pear", "pineapple")
str_like(fruit, "app%")
#[1] TRUE FALSE FALSE FALSE
Not only does it include %, but also several other operators (see ?str_like).
Must match the entire string
_ matches a single character (like .)
% matches any number of characters (like .*)
\% and \_ match literal % and _
The match is case insensitive by default
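To make these rules concrete, here is a rough Python sketch that translates a LIKE pattern into a regular expression following the list above. It only illustrates the rules and is not how stringr implements str_like; the helper names like_to_regex and sql_like are made up for this example.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an equivalent regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # An escaped character (e.g. \% or \_) matches itself literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")  # any number of characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)


def sql_like(strings, pattern, ignore_case=True):
    """Return one boolean per string: does the whole string match the pattern?"""
    flags = re.IGNORECASE if ignore_case else 0
    # DOTALL lets % and _ also span newlines (a choice made for this sketch).
    regex = re.compile(like_to_regex(pattern), flags | re.DOTALL)
    # fullmatch enforces the "must match the entire string" rule.
    return [regex.fullmatch(s) is not None for s in strings]


fruit = ["apple", "banana", "pear", "pineapple"]
print(sql_like(fruit, "app%"))            # [True, False, False, False]
print(sql_like(fruit, "%APP%"))           # [True, False, False, True]
print(sql_like(["50%", "500"], "50\\%"))  # [True, False]
```

The first call reproduces the str_like(fruit, "app%") result shown above: "pineapple" fails because the whole string has to match, not just a part of it.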