Remove rows with character(0) from a data.frame before proceeding to dtm - text-mining

I'm analyzing a data frame of product reviews that contains some empty entries and some text written in a foreign language. The data also contain customer attributes which can be used as "features" in later analysis.
To begin with, I will convert the reviews column into a DocumentTermMatrix and then convert that to lda format. I then plan to feed the documents and vocab objects generated by the lda process, along with selected columns from the original data frame, into stm's prepDocuments() function, so that I can leverage that package's more versatile estimation functions, using customer attributes as features to predict topic salience.
However, some of the empty cells, punctuation, and foreign characters may be removed during pre-processing, creating character(0) rows in the lda documents object and leaving those reviews unable to match their corresponding rows in the original data frame. Eventually, this will prevent me from generating the desired stm object from prepDocuments().
Methods to remove empty documents certainly exist (such as the methods recommended in this previous thread), but I am wondering whether there are also ways to remove the rows corresponding to the empty documents from the original data frame, so that the number of lda documents and the number of rows of the data frame used as meta in the stm functions stay aligned. Will indexing help?
Part of my data is listed below.
df = data.frame(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))

This is a situation where embracing tidy data principles can really offer a nice solution. To start with, "annotate" the dataframe you presented with a new doc_id column that keeps track of which document each word belongs to, and then use unnest_tokens() to transform it into a tidy data structure.
library(tidyverse)
library(tidytext)
library(stm)
df <- tibble(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)
tidy_df
#> # A tibble: 154 x 4
#> rating source doc_id word
#> <dbl> <chr> <int> <chr>
#> 1 4 amazon 1 buenisimoooooo
#> 2 4 bestbuy 2 excelente
#> 3 4 amazon 3 excelent
#> 4 4 newegg 4 awesome
#> 5 4 newegg 4 phone
#> 6 4 newegg 4 awesome
#> 7 4 newegg 4 price
#> 8 4 newegg 4 almost
#> 9 4 newegg 4 month
#> 10 4 newegg 4 issue
#> # … with 144 more rows
Notice that you still have all the information you had before; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling non-English text however you need, keeping or dropping punctuation, and so on. This is where empty documents get thrown out, if that is appropriate for you.
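For example, here is a minimal filtering sketch (purely illustrative; the rest of the answer continues with the unfiltered tidy_df) that keeps only tokens containing Latin letters, which drops purely numeric tokens such as "1111111" as well as non-Latin text; str_detect() comes from stringr, which is loaded with the tidyverse:
tidy_df_filtered <- tidy_df %>%
  filter(str_detect(word, "[a-z]"))  # keep tokens with at least one Latin letter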
Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.
sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)
colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente" "excelent" "almost"
#> [5] "awesome" "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"
Next, make a dataframe of covariate (i.e. meta) information to use in topic modeling from the tidy dataset you already have.
covariates <- tidy_df %>%
  distinct(doc_id, rating, source)
covariates
#> # A tibble: 18 x 3
#> doc_id rating source
#> <int> <dbl> <chr>
#> 1 1 4 amazon
#> 2 2 4 bestbuy
#> 3 3 4 amazon
#> 4 4 4 newegg
#> 5 5 4 amazon
#> 6 8 4 newegg
#> 7 9 1 amazon
#> 8 10 4 amazon
#> 9 11 3 amazon
#> 10 12 1 amazon
#> 11 13 4 amazon
#> 12 14 3 zappos
#> 13 15 1 amazon
#> 14 16 2 amazon
#> 15 17 4 newegg
#> 16 18 4 amazon
#> 17 19 1 amazon
#> 18 20 1 amazon
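If you want to be extra careful that the sparse matrix and the covariates line up row for row, you can filter and order the covariates by the matrix's rownames. This is just a defensive sketch; building both objects from the same tidy_df already keeps them in the same document order:
covariates <- covariates %>%
  filter(doc_id %in% as.integer(rownames(sparse_reviews))) %>%
  arrange(match(doc_id, as.integer(rownames(sparse_reviews))))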
Now you can put this together into stm(). For example, if you want to train a topic model with the document-level covariates looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:
topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)
Created on 2019-08-03 by the reprex package (v0.3.0)
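Once the model is fitted, you can explore it with tidytext's tidiers for stm objects; a brief sketch (not part of the original reprex):
summary(topic_model)
td_beta  <- tidy(topic_model)                    # word-topic probabilities
td_gamma <- tidy(topic_model, matrix = "gamma")  # document-topic probabilities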

Related

Sum duplicate bigrams in dataframe

I currently have a data frame that contains values such as:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 19
3 (galaxy, nexus) 14
4 (android, google) 12
There are values in there that I want to merge, like (google, android) and (android, google). There are others, like (ice, cream) and (cream, sandwich), but that's a different problem.
In order to sum up the duplicates I tried to do this:
def remove_duplicates(ngrams):
    return {" ".join(sorted(key.split(" "))): ngrams[key] for key in ngrams}

freq_all_tw_pos_bg['Word'] = freq_all_tw_pos_bg['Word'].apply(remove_duplicates)
I looked around and found similar questions with answers marked as accepted, but when I try to do the same I get:
TypeError: tuple indices must be integers or slices, not str
This makes sense, but when I tried converting the tuples to strings it shuffled the bigrams in a weird way, so I wonder: am I missing something that should be easier?
EDIT:
The input is the first set of values I show: a list of bigrams, some of which are repeated because the words in them are reversed, e.g. (google, android) vs (android, google).
I want the same kind of output (a dataframe with the bigrams), but with the frequencies of the reversed pairs summed up. If I take the list above and process it, it should output:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
Notice how it "merged" (google, android) and (android, google) and also summed up the frequencies.
If the values are tuples, sort each one and convert it back to a tuple:
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x:tuple(sorted(x)))
print (freq_all_tw_pos_bg)
Bigram Frequency
0 (cream, ice) 23
1 (cream, sandwich) 21
2 (android, google) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
Then aggregate with sum:
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
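Putting it together, here is a minimal end-to-end sketch that reconstructs a small frame along the lines of the one in the question (the values are illustrative):
import pandas as pd

freq_all_tw_pos_bg = pd.DataFrame({
    'Bigram': [('ice', 'cream'), ('cream', 'sandwich'),
               ('google', 'android'), ('galaxy', 'nexus'), ('android', 'google')],
    'Frequency': [23, 21, 19, 14, 12],
})

# Normalise the word order inside each bigram, then sum the duplicates
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x: tuple(sorted(x)))
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
print(df)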

Extractive Text Summarization: Weighting sentence location in document

I am looking at an extractive text summarization problem. Eventually, I want to generate a list of words (not sentences) that seem to be the most important. One of the ideas I had was to weight the words that appear early in the document more heavily.
I have two dataframes. The first is a set of words with their occurrence counts:
words.head()
words occurrences
0 '' 2
1 11-1 1
2 2nd 1
3 april 1
4 b.
And the second is a set of sentences. 0 is the first sentence in the document, 1 is the second, etc.
sentences.head()
sentences
0 Site Menu expandHave a correction?...
1 This will be a chance for ...
2 The event will include...
3 Further, this...
4 Contact:Share:
I managed to accomplish my goal like this:
weights = []
for value in words.index.values:
    weights.append(((len(sentences) - sentences.index.values) *
                    sentences['sentences'].str.contains(words['words'][value])).sum())
weights
[0,
5,
5,
0,
12,...]
words['occurrences'] *= weights
words.head()
words occurrences
0 '' 0
1 11-1 5
2 2nd 5
3 april 0
4 b. 12
However, this seems sort of sloppy. I know that I could use a list comprehension (I thought it would be easier to read here without one), but other than that, does anyone have thoughts on a more elegant solution to this problem?
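For reference, the same weights can be written as a single list comprehension. This is just a sketch assuming the two frames shown above, with regex=False added so that words such as "b." are matched literally rather than as regular expressions:
weights = [
    ((len(sentences) - sentences.index.values)
     * sentences['sentences'].str.contains(w, regex=False)).sum()
    for w in words['words']
]
words['occurrences'] *= weights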

How can I do SQL like operations on a R data frame?

For example, I have a data frame with data across categories and subcategories, and I want to be able to get the row with the maximum value in a particular column, etc.
SQL is what comes to mind first. But since I am not interested in joins or indices, Python's list comprehensions would do the same thing better with a more modern syntax.
What's best practice in R for such operations?
EDIT:
For now I think I am fine with which.max. The reason I asked the question the way I did is that I have come to learn that in R there are many libraries doing pretty much the same thing, and just by reading the documentation it is very hard to evaluate how well a library fulfills its purpose or how widely it is used. My personal experience with Python is that the day you figure out how to use list comprehensions (with itertools as a bonus), you are pretty much covered. Over time this has evolved into best practice; you don't see lambda and filter that often in the general Python debate these days, as list comprehensions do the same thing more easily and more uniformly.
If you really mean SQL, a pretty straightforward answer is the 'sqldf' package:
http://cran.at.r-project.org/web/packages/sqldf/index.html
From the help for ?sqldf
library(sqldf)
a1s <- sqldf("select * from warpbreaks limit 6")
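For the "row with the maximum value in a particular column" case from the question, a sqldf sketch along the same lines (using the built-in mtcars data purely as an example) might be:
library(sqldf)
sqldf("select * from mtcars where mpg = (select max(mpg) from mtcars)")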
Some additional context would help, but from the sounds of it you may be looking for which.max() or the related functions. For group-by operations, I default to the plyr family of functions, but there are certainly faster alternatives in base R if speed is of utmost importance.
library(plyr)
#Make a local copy of the mtcars data and add the rownames as a column since ddply
#seems to drop them. I've never encountered that before actually...
myCars <- mtcars
myCars$carname <- rownames(myCars)
#Find the max mpg
myCars[which.max(myCars$mpg) ,]
mpg cyl disp hp drat wt qsec vs am gear carb carname
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota Corolla
#Find the max mpg by cylinder category
ddply(myCars, "cyl", function(x) x[which.max(x$mpg) ,])
mpg cyl disp hp drat wt qsec vs am gear carb carname
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
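For completeness, the same group-wise maximum is nowadays often written with dplyr; this sketch is not part of the original answers:
library(dplyr)
mtcars %>%
  tibble::rownames_to_column("carname") %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 1) %>%
  ungroup()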

A programming challenge with Mathematica

I am interfacing an external program with Mathematica. I am creating an input file for the external program. It's about converting geometry data from a Mathematica-generated graphic into a predefined format. Here is an example geometry.
Figure 1
The geometry can be described in many ways in Mathematica. One laborious way is the following.
dat={{1.,-1.,0.},{0.,-1.,0.5},{0.,-1.,-0.5},{1.,-0.3333,0.},{0.,-0.3333,0.5},
{0.,-0.3333,-0.5},{1.,0.3333,0.},{0.,0.3333,0.5},{0.,0.3333,-0.5},{1.,1.,0.},
{0.,1.,0.5},{0.,1.,-0.5},{10.,-1.,0.},{10.,-0.3333,0.},{10.,0.3333,0.},{10.,1.,0.}};
Show[ListPointPlot3D[dat,PlotStyle->{{Red,PointSize[Large]}}],Graphics3D[{Opacity[.8],
Cyan,GraphicsComplex[dat,Polygon[{{1,2,5,4},{1,3,6,4},{2,3,6,5},{4,5,8,7},{4,6,9,7},
{5,6,9,8},{7,8,11,10},{7,9,12,10},{8,9,12,11},{1,2,3},{10,12,11},{1,4,14,13},
{4,7,15,14},{7,10,16,15}}]]}],AspectRatio->GoldenRatio]
This generates the required 3D geometry in MMA's GraphicsComplex format.
This geometry is described by the following input file for my external program.
# GEOMETRY
# x y z [m]
NODES 16
1. -1. 0.
0. -1. 0.5
0. -1. -0.5
1. -0.3333 0.
0. -0.3333 0.5
0. -0.3333 -0.5
1. 0.3333 0.
0. 0.3333 0.5
0. 0.3333 -0.5
1. 1. 0.
0. 1. 0.5
0. 1. -0.5
10. -1. 0.
10. -0.3333 0.
10. 0.3333 0.
10. 1. -0.
# type node_id1 node_id2 node_id3 node_id4 elem_id1 elem_id2 elem_id3 elem_id4
PANELS 14
1 1 4 5 2 4 2 10 0
1 2 5 6 3 1 5 3 10
1 3 6 4 1 2 6 10 0
1 4 7 8 5 7 5 1 0
1 5 8 9 6 4 8 6 2
1 6 9 7 4 5 9 3 0
1 7 10 11 8 8 4 11 0
1 8 11 12 9 7 9 5 11
1 9 12 10 7 8 6 11 0
2 1 2 3 1 2 3
2 10 12 11 9 8 7
10 4 1 13 14 1 3
10 7 4 14 15 4 6
10 10 7 15 16 7 9
# end of input file
Now the description I have from the documentation of this external program is pretty short. I am quoting it here.
First keyword NODES states total number of
nodes. After this line there should be no comment or empty lines. Next lines consist of
three values x, y and z node coordinates and number of lines must be the same as number
of nodes.
Next keyword is PANEL and states how many panels we have. After that we have lines
defining each panel. First integer defines panel type
ID 1 – quadrilateral panel - is defined by four nodes and four neighboring panels.
Neighboring panels are panels that share same sides (pair of nodes) and is needed for
velocity and pressure calculation (methods 1 and 2). Missing neighbors (for example for
panels near the trailing edge) are filled with value 0 (see Figure 1).
ID 2 – triangular panel – is defined by three nodes and three neighboring panels.
ID 10 – wake panel – is quadrilateral panel defined with four nodes and with two
(neighboring) panels which are located on the trailing edge (panels to which wake panel is
applying Kutta condition).
Panel types 1 and 2 must be defined before type 10 in input file.
Important to notice is the surface normal; order of nodes defining panels should be
counter clockwise. By the right-hand rule if fingers are bended to follow numbering,
thumb will show normal vector that should point “outwards” geometry.
Challenge!!
We are given a 3D CAD model in a file called One.obj, and it imports fine into MMA.
cd = Import["One.obj"]
The output is a MMA Graphics3D object
Now I can easily access the geometry data as MMA internally reads them.
{ver1, pol1} = cd[[1]][[2]] /. GraphicsComplex -> List;
MyPol = pol1 // First // First;
Graphics3D[GraphicsComplex[ver1,MyPol],Axes-> True]
How can we use the vertices and polygon information contained in ver1 and pol1 and write them to a text file in the format of the input file example above? In this case we will only have ID 2 type (triangular) panels.
Using the Mathematica triangulation, how do we find the surface area of this 3D object? Is there any built-in function that can compute surface area in MMA?
There is no need to create the wake panels or ID 10 type elements right now. An input file with only triangular elements will be fine.
Sorry for such a long post, but it's a puzzle I have been trying to solve for a long time. I hope some of you experts have the right insight to crack it.
BR
Q1 and Q2 are easy enough that you could drop the "challenge" labels in your question. Q3 could use some clarification.
Q1
edges = cd[[1, 2, 1]];
polygons = cd[[1, 2, 2, 1, 1, 1]];
Update Q1
The main problem is to find the neighbor of each polygon. The following does this:
(* Split every triangle in 3 edges, with nodes in each edge sorted *)
triangleEdges = (Sort /@ Subsets[#, {2}]) & /@ polygons;
(* Generate a list of edges *)
singleEdges = Union[Flatten[triangleEdges, 1]];
(* Define a function which, given an edge (node number list), returns the bordering *)
(* triangle numbers. It's done by working through each of the triangles' edges *)
ClearAll[edgesNeighbors]
edgesNeighbors[_] = {};
MapIndexed[(
edgesNeighbors[#1[[1]]] = Flatten[{edgesNeighbors[#1[[1]]], #2[[1]]}];
edgesNeighbors[#1[[2]]] = Flatten[{edgesNeighbors[#1[[2]]], #2[[1]]}];
edgesNeighbors[#1[[3]]] = Flatten[{edgesNeighbors[#1[[3]]], #2[[1]]}];
) &, triangleEdges
];
(* Build a triangle relation table. Each '1' indicates a triangle relation *)
relations = ConstantArray[0, {triangleEdges // Length, triangleEdges // Length}];
Scan[
(n = edgesNeighbors[##];
If[Length[n] == 2,
{n1, n2} = n;
relations[[n1, n2]] = 1; relations[[n2, n1]] = 1];
) &, singleEdges
]
MatrixPlot[relations]
(* Build a neighborhood list *)
triangleNeigbours =
Table[Flatten[Position[relations[[i]], 1]], {i,triangleEdges // Length}];
(* Test: Which triangles border on triangle number 1? *)
triangleNeigbours[[1]]
(* ==> {32, 61, 83} *)
(* Check this *)
polygons[[{1, 32, 61, 83}]]
(* ==> {{1, 2, 3}, {3, 2, 52}, {1, 3, 50}, {19, 2, 1}} *)
(* Indeed, they all share an edge with #1 *)
You can use the low level output functions described here to output these. I'll leave the details to you (that's my challenge to you).
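As a starting point, here is one possible (untested, purely illustrative) sketch of writing the NODES and PANELS blocks with low-level output functions, assuming ver1, polygons and triangleNeigbours from above and an example file name "One.inp":
stream = OpenWrite["One.inp"];
WriteString[stream, "# GEOMETRY\n# x y z [m]\nNODES ", Length[ver1], "\n"];
Scan[WriteString[stream, StringRiffle[ToString /@ #, " "], "\n"] &, ver1];
WriteString[stream, "PANELS ", Length[polygons], "\n"];
MapIndexed[
  WriteString[stream, "2 ", StringRiffle[ToString /@ #1, " "], " ",
    StringRiffle[ToString /@ PadRight[triangleNeigbours[[First[#2]]], 3], " "], "\n"] &,
  polygons];
WriteString[stream, "# end of input file\n"];
Close[stream];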
Q2
The area of the wing is the summed area of the individual polygons. The individual areas can be calculated as follows:
ClearAll[polygonArea];
polygonArea[pts_List] :=
Module[{dtpts = Append[pts, pts[[1]]]},
If[Length[pts] < 3,
0,
1/2 Sum[Det[{dtpts[[i]], dtpts[[i + 1]]}], {i, 1, Length[dtpts] - 1}]
]
]
based on this Mathworld page.
The area is signed BTW, so you may want to use Abs.
CORRECTION
The above area function is only usable for general polygons in 2D. For the area of a triangle in 3D the following can be used:
ClearAll[polygonArea];
polygonArea[pts_List?(Length[#] == 3 &)] :=
Norm[Cross[pts[[2]] - pts[[1]], pts[[3]] - pts[[1]]]]/2
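The total surface area of the imported object is then just the sum of the per-triangle areas; a one-line sketch using ver1 and polygons from above:
Total[polygonArea[ver1[[#]]] & /@ polygons]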

How do I design a DB Table when my sources are VENN DIAGRAMS?

Imagine 3 circles. Each circle has some numbers
Circle 1 has the following numbers
1, 4, 7, 9
Circle 2 has the following numbers
2, 5, 8, 9
Circle 3 has the following numbers
3, 6, 7, 8, 9
Circle 1 and Circle 2 share the following numbers
10, 9
Circle 1 and Circle 3 share the following numbers
7, 9
Circle 2 and Circle 3 share the following numbers
8, 9
All three circles share
9
Each number represents a symptom, so in my case:
Circle 1's numbers could be symptoms of a short circuit
Circle 2's numbers could be symptoms of a component failure
Circle 3's numbers could be symptoms of external issues
Each of the three issues shares certain symptoms.
If given #9, we wouldn't be able to deduce the problem but could display a list of all issues involving #9
If given more #'s, we can attempt to show relevant issues.
My problem is how to put this into a table so my code can look things up.
My database of choice is SQLite3.
#Vincent, the only issue I have is that there are several variables. I have variables called t1, t2, t3, a1, a2, a3; each of these variables is a symptom. The user interface for my application allows the user to input a value for each variable, and then I want to check the DB. The value of each symptom variable can be any value in the 3 circles (mentioned in the original problem).
Create 3 tables as:
symptom = (symptom_id, descr)
problem = (problem_id, descr)
problem_symptom = (problem_id, symptom_id)
e.g.
Symptom
Symptom_id Descr
1 doda
2 dado
3 dada
Problem
Problem_id Descr
1 Short Circuit
2 Component Failure
Symptom_Problem
Symptom_id Problem_id
1 1 --- doda is a symptom of Short circuit
2 1 --- dado is a symptom of short circuit
etc.
You can then query and join to determine the problems based on the symptoms.
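For example, here is a query sketch for SQLite, with table and column names as in the schema above; the symptom ids 7 and 9 stand in for whatever the user entered, and problems are ranked by how many of the observed symptoms they match:
SELECT p.problem_id, p.descr, COUNT(*) AS matched
FROM problem p
JOIN problem_symptom ps ON ps.problem_id = p.problem_id
WHERE ps.symptom_id IN (7, 9)
GROUP BY p.problem_id, p.descr
ORDER BY matched DESC;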
I would suggest 3 tables: symptom, problem, and problem_symptoms. The latter would be a join table between the first two.