How to update a record in R - problem with sqldf

I would like to change some records in my table. I think the easiest way is to use sqldf and UPDATE. But when I use it, I get a warning (the table b isn't empty):
c<-sqldf("UPDATE b
SET l_all = ''
where id='12293' ")
# In result_fetch(res@ptr, n = n) :
# SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
Can you help me change the chosen records in the easiest way?

The query worked, but there are two points to note:
(1) The message is a spurious warning, not an error, caused by backward-incompatible changes to RSQLite. You can ignore the warning or use the sqldf2 workaround here: https://github.com/ggrothendieck/sqldf/issues/40
(2) The SQL update command does not return anything, so one would not expect the command shown in the question to return anything. To get the updated values back, ask for them with a select.
1) Using the built in BOD data frame, defining sqldf2 from (1) and taking into account (2) we have:
sqldf2(c("update BOD set demand = 0 where Time = 1", "select * from BOD"))
giving:
Time demand
1 1 0.0
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
2) Another approach is to compute the new column in a select, which gives the same result:
sqldf("select Time, iif(Time == 1, 0, demand) demand from BOD")
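The point that an update returns nothing, so the updated rows have to be requested with a separate select, is a property of SQL rather than of sqldf. A minimal sketch of the same two statements, assuming Python's standard sqlite3 module in place of sqldf and a hand-typed copy of the built-in BOD data frame:

```python
import sqlite3

# Hand-typed copy of R's built-in BOD data frame (Time, demand).
bod_rows = [(1, 8.3), (2, 10.3), (3, 19.0), (4, 16.0), (5, 15.6), (7, 19.8)]

con = sqlite3.connect(":memory:")
con.execute("create table BOD (Time integer, demand real)")
con.executemany("insert into BOD values (?, ?)", bod_rows)

# The update itself produces no result rows.
update_result = con.execute("update BOD set demand = 0 where Time = 1").fetchall()

# Asking for the table afterwards returns the updated values.
updated_rows = con.execute("select * from BOD").fetchall()
con.close()
```

Here update_result is an empty list, while updated_rows holds the six rows with demand set to 0 for Time 1.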


sqldf R Error create a table

I am doing some experiments with SQL in R using the sqldf package.
I am trying to test some commands to check the output, in particular I am trying to create tables.
Here the code:
sqldf("CREATE TABLE tbl1 AS
SELECT cut
FROM diamonds")
Very simple code, however I get this error
sqldf("CREATE TABLE tbl1 AS
+ SELECT cut
+ FROM diamonds")
data frame with 0 columns and 0 rows
Warning message:
In result_fetch(res@ptr, n = n) :
Don't need to call dbFetch() for statements, only for queries
Why is it saying that the created table has 0 columns and 0 rows?
Can someone help?
That is a warning, not an error. The warning is caused by a backward incompatibility in recent versions of RSQLite. You can ignore it since it works anyways.
The sqldf statement that is shown in the question
1) creates an empty database
2) uploads the diamonds data frame to a table of the same name in that database
3) runs the create statement, which creates a second table tbl1 in the database
4) returns nothing (actually a 0 column 0 row data frame) since a create statement has no value
5) destroys the database
When using sqldf you don't need create statements. It automatically creates a table in the backend database for any data frame referenced in your sql statement so the following sqldf statement
sqldf("select * from diamonds")
will
1) create an empty database
2) upload diamonds to it
3) run the select statement
4) return the result of the select statement as a data frame
5) destroy the database
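The lifecycle described above can be imitated outside R. The following is only a sketch, assuming Python's standard sqlite3 module; run_like_sqldf is a hypothetical helper written for this illustration (it is not part of sqldf or RSQLite), and the three diamonds rows are a cut-down stand-in for the real data frame:

```python
import sqlite3

def run_like_sqldf(statement, tables):
    """Hypothetical helper: create a database, upload tables, run one statement, destroy."""
    con = sqlite3.connect(":memory:")                  # create an empty database
    try:
        for name, (columns, rows) in tables.items():   # upload each "data frame"
            con.execute(f"create table {name} ({', '.join(columns)})")
            placeholders = ", ".join("?" for _ in columns)
            con.executemany(f"insert into {name} values ({placeholders})", rows)
        return con.execute(statement).fetchall()       # run the statement, return its rows
    finally:
        con.close()                                    # destroy the database

tables = {"diamonds": (["carat", "cut"],
                       [(0.23, "Ideal"), (0.21, "Premium"), (0.23, "Good")])}

# A create statement has no value, so nothing comes back.
created = run_like_sqldf("create table tbl1 as select cut from diamonds", tables)

# A select returns its rows.
selected = run_like_sqldf("select cut from diamonds", tables)

# tbl1 lived only inside the first, already destroyed database.
try:
    run_like_sqldf("select * from tbl1", tables)
    tbl1_survived = True
except sqlite3.OperationalError:
    tbl1_survived = False
```

Because every call starts from a fresh database, a table created in one call is not visible in the next, which is why the create statement in the question appears to do nothing.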
You can use the verbose=TRUE argument to see the individual calls to the lower level RSQLite (or other backend database if you specify a different backend):
sqldf("select * from diamonds limit 3", verbose = TRUE)
giving:
sqldf: library(RSQLite)
sqldf: m <- dbDriver("SQLite")
sqldf: connection <- dbConnect(m, dbname = ":memory:")
sqldf: initExtension(connection)
sqldf: dbWriteTable(connection, 'diamonds', diamonds, row.names = FALSE)
sqldf: dbGetQuery(connection, 'select * from diamonds limit 3')
sqldf: dbDisconnect(connection)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
I suggest you thoroughly review help("sqldf") as well as the information on the sqldf GitHub home page.

How to proceed with my Spark / Scala project

I am new to Spark and Scala. I am working on a Scala project where I will access data from SQL Server.
A table in SQL Server holds information about clothes. itemCode is the primary key, and there are several attributes with Boolean values 0/1 (Designer, Exclusive, Handloom) plus several other columns with attributes of the product.
Code Designer Exclusive Handloom
A 1 0 1
B 1 0 0
C 0 0 1
D 0 1 0
E 0 1 0
F 1 0 1
G 0 1 0
H 0 0 0
I 1 1 1
J 1 1 1
K 0 0 1
L 0 1 0
M 0 1 0
N 1 1 0
O 0 1 1
P 1 1 0
and the list continues.
I have to select a collection of 32 items out of 320 items that has AT LEAST:
8 Designer, 8 Exclusive, 8 Handloom, 8 WeddingStyle, 8 PartyStyle,
8 Silk, 8 Georgette
I solved the problem with the MS Excel Solver (it uses a gradient descent algorithm) by adding an extra column and using the SUMPRODUCT function between the added column and the required columns. That solved the problem and took around 1 minute 30 seconds.
The problem can also be solved by writing an SQL query with 32 joins (far too many). For example, if I want to select 6 items out of the 16 above with at least 4 Designer, 4 Exclusive and 4 Handloom items, the query would look like the one in my post: MYSQL - Select rows fulfilling many count conditions
In production I have to fetch 32 rows this way, so my question is how to proceed further with the project.
I am working in Scala IDE for Eclipse and have added Spark MLlib there. I have fetched the data via JDBC, stored it in a DataFrame, and then created a temporary table:
dataFrame.registerTempTable("Data")
There is an optimizer class in MLlib's optimization package that uses gradient descent (like Excel Solver does) to solve problems. But that is meant for machine learning and takes training data as input.
I do not understand how to proceed with my project. Can I use MLlib, or should I use a simpler version of the SQL with Spark SQL? I need serious help.
I'd recommend using Spark SQL DataFrames (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes) rather than MLlib.
I solved this problem through linear programming. I now use the lpsolver library for Java in my Scala project. It gives almost the same result as Excel Solver.
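The selection is a 0/1 feasibility problem: choose exactly k items so that each attribute column sums to at least a minimum. The sketch below is not the linear programming solution mentioned above; it is a plain brute-force search in Python, shown only to make the constraints concrete on the 16-row sample (6 items, at least 4 of each attribute). It would not scale to 32 out of 320 items, where an integer programming solver is the realistic choice. The function name find_selection is made up for this illustration.

```python
from itertools import combinations

# Sample rows from the question: code -> (Designer, Exclusive, Handloom).
items = {
    "A": (1, 0, 1), "B": (1, 0, 0), "C": (0, 0, 1), "D": (0, 1, 0),
    "E": (0, 1, 0), "F": (1, 0, 1), "G": (0, 1, 0), "H": (0, 0, 0),
    "I": (1, 1, 1), "J": (1, 1, 1), "K": (0, 0, 1), "L": (0, 1, 0),
    "M": (0, 1, 0), "N": (1, 1, 0), "O": (0, 1, 1), "P": (1, 1, 0),
}

def find_selection(items, size, minimum):
    """Return the first combination of `size` codes meeting `minimum` per attribute."""
    n_attributes = len(next(iter(items.values())))
    for combo in combinations(sorted(items), size):
        totals = [sum(items[code][k] for code in combo) for k in range(n_attributes)]
        if all(total >= minimum for total in totals):
            return combo
    return None  # no feasible selection exists

selection = find_selection(items, size=6, minimum=4)
```

With the real data the same constraints (one "at least 8" row per attribute, plus a row fixing the total at 32) would be handed to the solver instead of being enumerated.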

RDLC Sum-Function for distinct values

I have a specific question about my RDLC report and a table whose data comes from a stored procedure. The table looks like this:
Object Price More Data here...
====== ======== =======
X 10 $ ...
X 10 $ ...
Y 50 $ ...
Y 50 $ ...
Y 50 $ ...
Y 50 $ ...
Z 20 $ ...
Z 20 $ ...
Sum(expr)
What I need now is not the total sum of those values, but the sum of the distinct values, one per object. So the result should be 80 $ (10+50+20).
I have no specific row or column groups. Grouping by Object and adding a group row was not a solution for me, because the sum must appear only at the end of the table. And I didn't find out how to sum over group values...
I tried different functions like Previous() (comparing the objects) and RunningValue(). But maybe I used them wrongly, or they told me that this function can't be used in an aggregate function. Maybe Max() for each Object is another idea, but it gave me the same error.
With my tunnel vision I now have no idea what could help me in this case. So maybe one of you can help me.
Thanks in advance
// I am currently working with Visual Studio 2013
It's better to do the distinct sum in the stored procedure, like this, if you are using a MySQL stored procedure:
sum(distinct p.Price)
In RDLC there is an aggregate function called CountDistinct, which counts the distinct values without repetition.
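To check the arithmetic on the sample table, here is a small sketch that uses Python's standard sqlite3 module as a stand-in for the stored procedure's database; the table name report is made up for the illustration. It computes the total in two ways: with sum(distinct ...), and by first reducing to one price per object and then summing.

```python
import sqlite3

# Rows from the question: (Object, Price).
rows = [("X", 10), ("X", 10),
        ("Y", 50), ("Y", 50), ("Y", 50), ("Y", 50),
        ("Z", 20), ("Z", 20)]

con = sqlite3.connect(":memory:")
con.execute("create table report (Object text, Price integer)")
con.executemany("insert into report values (?, ?)", rows)

# Sum each distinct price once.
sum_distinct = con.execute("select sum(distinct Price) from report").fetchone()[0]

# Reduce to one price per object first, then sum.
sum_per_object = con.execute(
    "select sum(Price) from "
    "(select Object, max(Price) as Price from report group by Object)"
).fetchone()[0]
con.close()
```

Both give 80 here. They would differ if two different objects happened to share the same price: sum(distinct Price) would count that price once, while the per-object version would count it once per object, which is closer to what the question describes.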

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['length'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or a copy, in which case the dataframe remains unchanged. It took me quite some time to explain this to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we checked in two different ways whether the assignment took place:
1) By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2) By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
see docs here.
You don't want to set values this way for exactly the reason you pointed out: since you don't know if it's a view, you don't know that you are actually changing the data. 0.13 will raise/warn that you are attempting to do this, but it is easiest/best to just access like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
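Applied to all three ranges, the .loc form could look like the sketch below. The seven-row frame is made-up sample data, not the 12,000-row dataframe from the question, and the boundaries follow the description (0 to 9, 10 to 14, 15 and more).

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the real dataframe.
data = pd.DataFrame({"length": [0, 5, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# One .loc call per range: row mask first, column label second.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] >= 10) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] >= 15, "group"] = 3

# Both checks from the question now agree.
null_count = int(data["group"].isna().sum())
counts = data["group"].value_counts().to_dict()
```

Each assignment goes through a single indexing operation on data itself, so there is no intermediate object that might be a copy.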

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 lines (and many columns, out of which 2 columns are relevant to this problem, I'll call it M1, M2). I have another data.frame where column V1 with about 10,000 elements is essential to this task. My task is this:
For each element in V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this using a for loop, and it is terribly slow! I am used to Matlab and Perl and this is taking for EVER in R! Surely there's a better way. I would appreciate any valuable suggestions for accomplishing this task...
for (x in seq_along(V$V1)) {
start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only 1 element that will match, and so I can use the logical statement to directly get the element in start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example by @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.
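For readers coming from other languages, the idea behind match() and merge() here is a keyed lookup: index M once by M2, then look up every element of V1, instead of rescanning M2 for each element as the loop does. A rough sketch of that idea in Python, using the values from the merge example above (this is an illustration of the technique, not a translation of R's internals):

```python
# Values from the merge example above.
M1 = [1, 2, 3, 4, 10, 3, 15]
M2 = [15, 6, 7, 8, -1, 12, 5]
V1 = [-1, 12, 5, 7]

# Build the lookup once: M2 value -> corresponding M1 value.
# setdefault keeps the first occurrence if an M2 value repeats.
lookup = {}
for m2, m1 in zip(M2, M1):
    lookup.setdefault(m2, m1)

# One dictionary lookup per element of V1; missing values become None.
start = [lookup.get(v) for v in V1]

# A value absent from M2 yields None rather than being dropped.
with_missing = [lookup.get(v) for v in [5, 99]]
```

The None for a missing value plays the same role as the NA produced by match() or by merge() with all.x=T.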