Why has the cluster of text mined in R gone fuzzy? - text-mining

Question: Why has the cluster dendrogram of text mined data gone fuzzy /messy (see link to the diagram below)?
Synopsis:I first harvested the original data of approximately 5500 e-scanned articles from a Mongo database, and saved in disk drive as a Json object (ode not shown here, harvested using Cran Mongolite package for R). What is shown here is the standard text processing (using Cran TM package) to clean “the”, “and”, “ing”, “;”, “:” etc.,). That lead to the ensuing hierarchical clustering, which looks fuzzy/MESSY because some of the words in the Json object were very long combinations of letters and not real words that can be separately identified.
Calling two of the libraries
library("tm")
library ("SnowballC")
Creating a path to the data and a corpus of text
cname <- file.path("C:", "texts")
docs <- Corpus(DirSource(cname))
Processing the text
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
tdm <- TermDocumentMatrix(docs)
Thirdly the clustering via dendrogram
d<-dist(tdm,method = "euclidean")
hc<-hclust(d, method="ward.D2")
library("rafalib")
myplclust(hc, labels=hc$labels)
Link to the image:
cluster/dendrogram/text mining

Answer is cut less frequent words, more than just the standard stopwords that I had already done (see how I cut them in the code below)
mystopwords <- findFreqTerms(tdm, 1, 20)
mystpwrds <- paste(mystopwords, collapse = "|")
tdm <- tdm[tdm$dimnames$Terms[!grepl(mystpwrds,tdm$dimnames$Terms)],]
The whole picture and code is published here:
http://rpubs.com/antonyama/180574

Related

How to generate a fits file from the beginning

In this post, they explain how to generate a fits file from ascii file. However, I also would like to know how to define header and data into fits file. (Converting ASCII Table to FITS image)
For example, when I call a spectral fits file with astropy (which is downloaded from a telescope), I can call data and header separately.
I.E
In [1]:hdu = fits.open('observation.fits', memmap=True)
In [2]:header = hdu[0].header
In [3]:header
Out [3]:
SIMPLE = T / conforms to FITS standard
BITPIX = 8
NAXIS = 1
NAXIS1 = 47356
EXTEND = T
DATE = 'date' / file creation date (YYYY-MM-DDThh:mm:ss UT)
ORIGIN = 'XXX ' / European Southern Observatory
TELESCOP= 'XXX' / ESO Telescope Name
INSTRUME= 'Instrument' / Instrument used.
OBJECT = 'ABC ' / Original target.
RA = 30.4993 / xx:xx:xx.x RA (J2000) pointing
DEC = -20.0009 / xx:xx:xx.x DEC (J2000) pointing
CTYPE1 = 'WAVE ' / wavelength axis in nm
CRPIX1 = 0. / Reference pixel in z
CRVAL1 = 298.903594970703 / central wavelength
CDELT1 = 0.0199999995529652 / nm per pixel
CUNIT1 = 'nm ' / spectral unit
..
bla bla
..
END
In [3]:data = hdu[0].data
In [4]:data
Out [4]:array([ 1000, 1001, 1002, ...,
5.18091546e-13, 4.99434453e-13, 4.91280864e-13])
Lets assume, I have data like below
WAVE FLUX
1000 2.02e-12
1001 3.03e-12
1002 4.04e-12
..
bla bla
..
So, I'd like to generate a spectral fits file with my own data (with its own header).
Mini question : Now lets assume, I generate spectral fits file correctly, but I realised that I forgot to take logarithm of WAVE values in X axis (1000, 1001, 1002, ....) . How can I do that without touching FLUX values of Y-axis (2.02e-12, 3.03e-13, 4.04e-13) ?
FITS files are organized as one or more HDUs (Header Data Units) consisting, as the name suggests, as one data object (generally, a single array for an observation, though sometimes something else like a table), and the header of metadata that goes with that data.
To create a file from scratch, especially an image, the simplest way is to directly create an ImageHDU object:
>>> from astropy.io import fits
>>> hdu = fits.ImageHDU()
Just as with an HDU read from an existing file, this HDU has a (mostly empty) header, and an empty data attribute that you can then assign to:
>>> hdu.data = np.array(<some numpy array>)
>>> hdu.header['TELESCOP'] = 'Gemini'
When you're satisfied you can write the HDU out to a file with:
>>> hdu.writeto('filename.fits')
(Note: A lot of the documentation you'll see demonstrates a more complex process of creating an HDUList object, appending the HDU to the HDU list, and then writing the full HDU list. This is only necessary if you're creating a multi-extension FITS file. For a single HDU, you can use hdu.writeto directly and the framework will handle the other structural details.)
In general you don't need to manipulate the headers that describe the format of the data itself--that is automatic and should not be touched by hand (FITS has the unfortunate misfeature of mixing information about data structure with actual metadata). You can see more examples on how to manipulate FITS data here: http://docs.astropy.org/en/stable/generated/examples/index.html#astropy-io
Your other question pertains to manipulating the WCS (World Coordinate System) of the image, and in particular for spectral data this can be non-trivial. I would ask a separate question about that with more details about what you hope to accomplish.

Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon

How do you find the associated words to the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) (NRC Word-Emotion Association Lexicon) when using get_nrc_sentiment of the using the syuzhet package?
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
e.g. we can see in a_sent that one term has been classified as anger, but how do we find what that term was? So I want to list all the sentiments and the terms associated in my example.
Thanks.
library(tidytext)
library(tm)
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
lexicon <- get_sentiments("nrc")
v <- sort(rowSums(a_tmx),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# List the words in common between the text provided and NRC lexicon
intersect(lexicon$word, d$word)
# Show the words in common and their sentiments
s <- cbind(lexicon$word[lexicon$word%in%d$word], lexicon$sentiment[lexicon$word%in%d$word])
print(s)

Importing a TermDocumentMatrix into R

I am working on a qualitative analysis project in the tm package of R. I have built a corpus and created a term document matrix and long story short I need to edit my term document matrix and conflate some of its rows. To do this I have exported it out of R using
write.csv()
I then have imported the csv file back into R but am struggling to figure out how to get R to read it as a TermDocumentMatrix or DocumentTermMatrix.
I tried using the suggestions of the following example code with no avail.
It seems to keep reading my matrix as if it was a corpus and each cell as a single document.
# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
Is there any way to import in a csv matrix that will be read as a termdocumentmatrix or documenttermmatrix without having R read the csv as if each cell is a document?
You're not reading documents, so skip the Corpus() step. This should work directly:
myDTM <- as.DocumentTermMatrix(x, weighting = weightTf)
For next time, consider saving the TDM object as .RData as this will not require conversion, and is also much more efficient.
If you want to keep the format of any data, I would recommend to use the save() function.
You can save any R objects into a .RData file. And when you want to retrieve the data, you can use the load() function.

Handling paginated SQL query results

For my dissertation data collection, one of the sources is an externally-managed system, which is based on Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework, where I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset. But, when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").
So, I have decided to use SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the output that I see is only one pagination progress character (*) and then no more output. I'd appreciate, if you could help me identify the probable cause of the unexpected behavior. I cannot provide reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet would be enough to reveal the issue (or, at least, a direction toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1
# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
# setup SQL pagination
if (rq$where == '') rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
# some code
# add current data frame to the list
dfList <- c(dfList, data)
if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)
It probably falls into the category when presumably MySQL hinders LIMIT OFFSET:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, fetching large data sets over HTTP repeatedly is not very reliable.
Since this is for your dissertation, here is a hand:
## Folder were to save the results to disk.
## Ideally, use a new, empty folder. Easier then to load from disk
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)
## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10
## The limit size per request
RQ_SIZE <- 1000
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())
TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1
## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)
for (g in seq(TotalNumberOfGroups)) {
ret <-
lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
## function(i) is the same code you have
## inside your for loop, but cleaned up.
# setup SQL pagination
if (rq$where == '')
rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*g*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
# retrieve result
data <- srdaGetData()
# some code
if (DEBUG) message("*", appendLF = FALSE)
### DONT ASSIGN TO dfList, JUST RETURN `data`
# xxxxxx DONT DO: xxxxx dfList <- c(dfList, data)
### INSTEAD:
## return
data
})
## save each iteration
file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
saveRDS(do.call(rbind, ret), file=file.out)
## OPTIONAL (this will be slower, but will keep your rams and goats in line)
# rm(ret)
# gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")
## Create an empty list
myData <- vector("list", length=length(files))
## Option 1, using data.frame
for (i in seq(myData))
myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)
## Option 2, using data.table
for (i in seq(myData))
myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as, finally, I have figured out what has been the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason of that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate previous iteration's WHERE clause with the current one. I have fixed the code by processing contents of the WHERE clause and saving it before the loop and then using the saved value in each iteration of the loop safely, as it became independent from the value of the original WHERE clause.
This investigation also helped me to fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. Big thank you to those nice people who helped with this question.

xts objects and split

I have many "problems" but i will try to split them up as good as i can.
First i will present my code:
# Laster pakker
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(Perfor)
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date,fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting SQLen
query <- strwrap(query,width=nchar(query),simplify=TRUE)
# quering
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is=c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
`enter code here`resultt$v_d <- as.Date(as.POSIXct(t$v_d))
#split
y <- split(qt,qt$fk_dim_portfolio)
#making names
new_names <- c("one","two")
for (i in 1:length(y){assign(new_names[i],y[[i]])})
So far so good:
The table that my SQL is running on has approx 178 diff. port_ids, some of which are useless and others that are highly useful. However i want this code to pull all fk_dim_ports (pulling: '6906', '1812964 was just for example purposes). After pulling the data i want to seperate it into n (now 178 sets) and make them xts objects which i have run into some trouble using:
qt <- xts(t[,-1],order.by=t[,1])
But works perfectly well if i don`t split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for every table of my previous created data frames.
If you have any tips on how to split, make timeseries objects and loop the generation of the charts i would highly appreciate this.
I am aware that this post probably don`t comply to your rules/customs etc, but thanks for helping.
Lars