How to get similar meaning of text from a CSV file in R

Following is my code:
library(sentiment)
library(dplyr)
library(plyr)
library(plotly)
library(ggplot2)
library(readr)
library(tm)
library(Matrix)
library(syuzhet)
moviedata = read.csv('C:/Users/Sudeer/Desktop/movies1.csv', stringsAsFactors = FALSE)
df <- data.frame(moviedata)
View(df)
class_pol = classify_polarity(moviedata, algorithm = "bayes")
View(class_pol)
# get polarity best fit
polarity = class_pol[,4]
# Create data frame with the results and obtain some general statistics
# data frame with results
sent_df = data.frame(text = moviedata, polarity = polarity, stringsAsFactors = FALSE)
View(sent_df)
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "text") +
  ggtitle("Movie classification")
pos <- subset(sent_df, sent_df$polarity == "positive")
So far I can extract the positive opinions from the CSV file, which contains opinions on films affecting Indian youth. I have separated the positive, negative, and neutral opinions. From the positive opinions, I now need to find text with similar meaning in R. Please clarify how to do this.
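One possible direction (a minimal sketch, not from the original post): build a TF-IDF document-term matrix over the positive opinions with tm, then compare them by cosine similarity; texts whose vectors are close carry similar meaning. The column name pos$text is an assumption about how sent_df was built, so adjust it to whatever your text column is actually called.
library(tm)

# Assumption: pos$text holds the positive opinion strings
corpus <- VCorpus(VectorSource(pos$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# TF-IDF weighted document-term matrix
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
m <- as.matrix(dtm)

# Pairwise cosine similarity between opinions
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %*% t(norms))

# For each opinion, the index of its most similar other opinion
diag(sim) <- 0
most_similar <- apply(sim, 1, which.max)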

Related

How to make a stacked bar chart using ggplot?

I have got this dataset. I am trying to make a stacked bar graph with proportions using ggplot for this data:
I am not really sure how to manipulate it into tables first! I know I just started learning R two weeks ago, and I'm kind of stuck. I made a similar graph before; I attached it here.
I'm not sure if I got your question right, but I'll try to answer it. I see that this is your first question on Stack Overflow, so I'd advise you to post a minimal reproducible example with your next question.
1) "I am not really sure how to manipulate it into tables first!"
Copy the data into an Excel file, save it as CSV, and import it into R with the base R command:
df <- read.csv('your_data.csv')
2) " do a stacking bar graph with proportions"
Your problem is very similar to the one mentioned in this question. Make sure to check it out, but I've already adapted the code below, see if it works.
library(ggplot2)
library(dplyr)
library(tidyr)
df <- read.csv('your_data.csv')
# Add an id variable for the filled regions and reshape
dfm <- df %>%
  mutate(Domain = factor(row_number())) %>%
  gather(variable, value, -Domain)
ggplot(dfm, aes(x = variable, y = value, fill = Domain)) +
  geom_bar(position = "fill", stat = "identity") +
  # or:
  # geom_bar(position = position_fill(), stat = "identity") +
  scale_y_continuous(labels = scales::percent_format())
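If you don't have the CSV handy, a small invented dataset (purely hypothetical, just to make the snippet above runnable with the libraries already loaded) shows what the reshaping produces:
# Hypothetical stand-in for your_data.csv: two domains, three variables
df <- data.frame(var1 = c(10, 20), var2 = c(30, 5), var3 = c(15, 25))
dfm <- df %>%
  mutate(Domain = factor(row_number())) %>%
  gather(variable, value, -Domain)
# dfm is now in long format (Domain, variable, value), one row per
# Domain/variable pair, which is what geom_bar(position = "fill") expects
head(dfm)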

How to filter a very large csv in R prior to opening it?

I'm currently trying to open a 48GB CSV on my computer. Needless to say, my RAM does not support such a huge file, so I'm trying to filter it before opening. From what I've researched, the most appropriate way to do so in R is with the sqldf library, more specifically the read.csv.sql function:
df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")
However, I got the following message:
Error: duplicate column name: Measure
As SQL is case insensitive, having two variables, one named Measure and another named MEASURE, implies duplicate column names. To get around this, I tried using the header = FALSE argument and substituted 'Year' with V9, which yielded the following error instead:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24
How should I proceed in this case?
Thanks in advance!
Here's a Tidyverse solution that reads the CSV in chunks, filters each chunk, and stacks up the resulting rows. The code also does this in parallel, so the whole file gets scanned, but far more quickly (depending on your core count) than if the chunks were processed one at a time, as with apply (or purrr::map, for that matter).
Comments inline.
library(tidyverse)
library(furrr)

# Make a CSV file out of the NASA stock dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)

# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))

# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size <- 1e3

# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core, as usual
plan(multiprocess, workers = availableCores() - 1)

filtered_data <-
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0,
      # Add the batch size to ensure that the last chunk is large enough to
      # grab all the remainder rows
      to = raw_data_nrow + batch_size,
      by = batch_size) %>%
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip = .x,
      n_max = batch_size,
      # Can't read in col. names in each chunk since they're only present in
      # the 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>%
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>%
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>%
  slice(-1)
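When adapting this, the filter(TRUE) placeholder is where the real condition goes. Since col_names = FALSE makes readr name the columns X1, X2, ... and every column was read in as character, the Year filter from the question might look like the line below. X9 is only a guess that Year is the 9th column, echoing the asker's V9 guess; adjust it to the real position.
# Hypothetical drop-in replacement for filter(TRUE) above;
# X9 assumes Year is the 9th column, and values are character strings
filter(X9 %in% c("2014", "2015", "2016", "2017", "2018"))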

Convert the date/time index of an external dataset so that pandas plots it cleanly

When you already have a time series dataset and use an internal dtype to index by date/time, you seem to be able to plot the index cleanly, as here.
But when I already have data files with date & time columns in their own format, such as [2009-01-01T00:00], is there a way to convert them into an object that the plot can read? Currently my plot looks like the following.
Code:
dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding = "utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
# gen['Date/Time'] = gen['Date/Time'][11:] -> caused an error, didn't work
filter = gen[gen['Date/Time'].str.endswith('00') | gen['Date/Time'].str.endswith('30')]
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']
lis = np.arange(35040) # used the number of rows, checked by printing. This is for 2009-2010.
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
My other thought was to use plotly, but its main purpose seems to be feeding in data on the internet. It would be best if I were familiar with all the modules and tried things for myself, but I am learning pandas and matplotlib as I go.
So I would like to ask whether anyone has experienced similar issues.
I think you need to set the labels to not visible in a loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)

Trouble converting UTM to lat/long for the southern hemisphere

I have ~40 points in UTM zone 19 taken from Peru that I would like to convert to lat/long to project onto Google Earth. I am having some problems with PBSmapping and can't seem to figure out the solution. I have searched through the forums and tried several different methods, including the project command in proj4, but still can't get this to work. Here is the code I have currently written:
library(PBSmapping)
# just two example UTM coordinates
data <- as.data.frame(matrix(c(214012, 197036, 8545520, 8567292), nrow = 2))
attr(data, "projection") <- "UTM"
attr(data, "zone") <- 19
colnames(data) <- c("X", "Y")
convUL(data, km = FALSE)
The corresponding lat/longs should be somewhere with lats between -12.9XXXXX and -13.0XXXXX and longs between -71.8XXXX and -71.4XXXX. The values given by convUL seem to be way off.
Once you get the valid pairs of coordinates you could do something like this:
library(rgdal)
data <- data.frame(id = 1:2, x = c(214012, 197036), y = c(8545520, 8567292))
coordinates(data) = ~x+y
Assign the projection:
# Use the appropriate EPSG Code
proj4string(data) = CRS('+init=epsg:24891') # 24891 or 24893
Transform to geographic coordinates:
data_wgs84 <- spTransform(data, CRS('+init=epsg:4326'))
Get some valid background data to plot it against
# Country data
library(dismo)
peru <- getData('GADM', country='PER', level=0)
plot(peru, axes = T)
plot(data_wgs84, add = T)
Write your KML file
# Export kml
tmpd <- 'D:\\'
writeOGR(data_wgs84, paste(tmpd, "peru_data.kml", sep="/"), 'id', driver="KML")
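One hedged footnote, not from the original answer: the 24891/24893 codes are PSAD56-based Peruvian CRSs. If the points were actually recorded against WGS84 (typical for handheld GPS units), then EPSG:32719 (WGS 84 / UTM zone 19S) may be the appropriate source CRS instead; the rest of the workflow is unchanged:
# Assumption: the input coordinates are WGS84-based UTM zone 19S (EPSG:32719)
library(rgdal)
data <- data.frame(id = 1:2, x = c(214012, 197036), y = c(8545520, 8567292))
coordinates(data) <- ~x+y
proj4string(data) <- CRS('+init=epsg:32719')
data_wgs84 <- spTransform(data, CRS('+init=epsg:4326'))
coordinates(data_wgs84) # should land near lat -13, long -71.5 for these points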

Complementary Filter Code Not functioning

I've been scratching my head too long.
The data is coming from a 3D accelerometer and a 3D gyro. I am using a complementary filter to control drift.
I have it working in Excel but can't seem to get this Python code to do the same thing:
r1_angle_cfx = np.zeros(len(r1_angle_ax))
r1_angle_cfx[0] = r1_angle_ax[0]
for i in xrange(len(r1_angle_ax)-1):
    j = i + 1
    # complementary filter
    r1_angle_cfx[j] = 0.98 * (r1_angle_cfx[i] + r1_alpha_x[j]*fs) + (0.02 * r1_angle_ax[j])
In Excel (correct) I get:
In Python (incorrect) I get:
What is going wrong? And is there a better way to do this in Python?
Thanks,
Scott
EDIT: Link to data files -
sample data
1. The CSV file contains the accelerometer and gyro data that are entered into the filter formula, as well as the values that were calculated in Excel.
2. The Excel file contains all the raw data (steps not mentioned above, but I have triple-checked them and they are equivalent up to the point of being entered into the filter formula).
EDIT 2: Update - it turns out my code works; it was sloppy debugging. fs should be fs = 0.01. In my code I had fs = 1/100, which evaluates to 0 in Python 2 because of integer division.
Your Python code looks pretty reasonable. Without example data, I can't do much more than say that.
But I can guess. I looked up "complementary filters" and found a link explaining them:
https://sites.google.com/site/myimuestimationexperience/filters/complementary-filter
This link gives an example equation that is very similar to yours:
angle = (1-alpha)*(angle + gyro * dt) + (alpha)*(acc)
You have fs where this has dt, and dt is computed as 1/sampling_frequency. If fs is the sampling frequency, maybe you should try inverting it?
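Spelled out as a recurrence (my paraphrase of the linked equation, not code from either post), with the time step being the inverse of the sampling frequency:

$$\theta_j = (1 - \alpha)\left(\theta_{j-1} + \omega_j \,\Delta t\right) + \alpha\, \theta_j^{\mathrm{acc}}, \qquad \Delta t = \frac{1}{f_s}$$

where $\theta$ is the filtered angle, $\omega_j$ the gyro rate, $\theta_j^{\mathrm{acc}}$ the accelerometer-derived angle, and $\alpha = 0.02$ in the question's code. Using $f_s = 100$ itself in place of $\Delta t = 0.01$ would scale the gyro term by a factor of 10,000.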
EDIT: Okay, now that you posted the data, I played around with this. Here is my program that gets a correct result.
Your code looks basically correct, so I think you must have made a mistake in your code that collected the values. I'm not quite sure because your variable names confuse me.
I used a namedtuple, and for the field names I used the column headers from the CSV file (with spaces and periods removed to make valid Python identifiers).
import collections as coll
import csv
import matplotlib.pyplot as plt
import numpy as np
import sys

fs = 100.0
dt = 1.0/fs
alpha = 0.02

Sample = coll.namedtuple("Sample",
        "accZ accY accX rotZ rotY rotX r acc_angZ acc_angY acc_angX cfZ cfY cfX")

def samples_from_file(fname):
    with open(fname) as f:
        next(f)  # discard header row
        csv_reader = csv.reader(f, dialect='excel')
        for i, row in enumerate(csv_reader, 1):
            try:
                values = [float(x) for x in row]
                yield Sample(*values)
            except Exception:
                lst = list(row)
                print("Bad line %d: len %d '%s'" % (i, len(lst), str(lst)))

samples = list(samples_from_file("data.csv"))

cfx = np.zeros(len(samples))

# Excel formula: =R12
cfx[0] = samples[0].acc_angX

# Excel formula: =0.98*(U12+N13*0.01)+0.02*R13
# Excel: U is cfX, N is rotX, R is acc_angX
for i, s in enumerate(samples[1:], 1):
    cfx[i] = (1.0 - alpha) * (cfx[i-1] + s.rotX*dt) + (alpha * s.acc_angX)

check_line = [s.cfX - cf for s, cf in zip(samples, cfx)]

plt.figure(1)
plt.plot(check_line)
plt.plot(cfx)
plt.show()
check_line is the difference between the saved cfX value from the CSV file, and the new computed cfx value. As you can see in the plot, this is a straight line at 0, so my calculation is agreeing quite well with yours.
So I guess the mapping of names is:
your_name       my_name
------------------------
r1_angle_cfx    cfx
r1_alpha_x      rotX
r1_angle_ax     acc_angX