How can I do SQL-like operations on an R data frame?

For example, I have a data frame with data across categories and subcategories, and I want to be able to get the row with the maximum value in a particular column, etc.
SQL is what comes to mind first. But since I am not interested in joins or indexes, Python's list comprehensions would do the same thing better, with a more modern syntax.
What's best practice in R for such operations?
EDIT:
For now I think I am fine with which.max. The reason I asked the question the way I did is that I have come to learn that in R there are many libraries doing pretty much the same thing, and just by reading the documentation it's very hard to evaluate how popular a library is (i.e. how well it fulfills its purpose). My personal experience with Python is that the day you figure out how to use list comprehensions (with itertools as a bonus), you are pretty much covered. Over time this has evolved as best practice: you don't see lambda and filter that often in the general Python debate these days, as list comprehensions do the same thing more easily and more uniformly.
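For concreteness, a small Python sketch of that point about idioms (the data is made up): filter/lambda and a list comprehension doing the same job, plus max with a key function, which is roughly what which.max does in R.
# Made-up (cylinders, mpg) pairs.
rows = [(4, 33.9), (6, 21.4), (8, 19.2)]
# filter/lambda style:
economical = list(filter(lambda r: r[1] > 20, rows))
# List-comprehension style, same result, arguably more readable:
economical = [r for r in rows if r[1] > 20]
# Row with the maximum value in the second column (cf. which.max):
best = max(rows, key=lambda r: r[1])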

If you really mean SQL, a pretty straightforward answer is the 'sqldf' package:
http://cran.at.r-project.org/web/packages/sqldf/index.html
From the help for ?sqldf
library(sqldf)
a1s <- sqldf("select * from warpbreaks limit 6")

Some additional context would help, but from the sounds of it, you may be looking for which.max() or related functions. For group-by operations, I default to the plyr family of functions, but there are certainly faster alternatives in base R if speed is of utmost importance.
library(plyr)
#Make a local copy of the mtcars data and add the rownames as a column, since ddply
#seems to drop them. I've never encountered that before actually...
myCars <- mtcars
myCars$carname <- rownames(myCars)
#Find the max mpg
myCars[which.max(myCars$mpg) ,]
mpg cyl disp hp drat wt qsec vs am gear carb carname
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota Corolla
#Find the max mpg by cylinder category
ddply(myCars, "cyl", function(x) x[which.max(x$mpg) ,])
mpg cyl disp hp drat wt qsec vs am gear carb carname
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
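Since the question frames this against Python idioms, the ddply call above corresponds to something like the following plain-Python sketch (only a few mtcars rows copied in by hand):
# (name, mpg, cyl) for a handful of mtcars rows.
cars = [
    ('Mazda RX4', 21.0, 6),
    ('Hornet 4 Drive', 21.4, 6),
    ('Toyota Corolla', 33.9, 4),
    ('Pontiac Firebird', 19.2, 8),
]
# Keep, per cylinder count, the row with the highest mpg.
best = {}
for name, mpg, cyl in cars:
    if cyl not in best or mpg > best[cyl][1]:
        best[cyl] = (name, mpg, cyl)
print(sorted(best.values(), key=lambda r: r[2]))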

Related

How can I detect similarity of names in the same column

I have a dataset like this:
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
It gives this output:
Name
0 John
1 gal britt
2 mona
3 diana
4 molly
5 merry
6 mony
7 molla
8 johnathon
9 dina
So I imagine that, to compare all names against each other and detect similarity, I could use df.merge(df, how="cross").
The thing is, the real data is 40000 rows, and performing this will result in a very big dataset which I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried working with vaex instead of pandas to handle this huge amount of data, but I still run into the problem of insufficient memory allocation.
In short: I KNOW that this algorithm or way of thinking about such a problem is wrong and inefficient.
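One standard way out of the cross-join memory blow-up (an approach I'm suggesting, not from the thread) is blocking: only compare names that share a cheap key, such as the first letter, so each block stays small. A minimal sketch using pandas and the standard library's difflib; the 0.8 threshold and the first-letter key are assumptions to tune:
import difflib
import pandas as pd

names = pd.Series(['John', 'gal britt', 'mona', 'diana', 'molly',
                   'merry', 'mony', 'molla', 'johnathon', 'dina'])

pairs = []
# Group names by lower-cased first letter so we never materialize
# the full 40000 x 40000 cross join.
for _, block in names.groupby(names.str[0].str.lower()):
    vals = block.tolist()
    for i in range(len(vals)):
        for j in range(i + 1, len(vals)):
            score = difflib.SequenceMatcher(None, vals[i].lower(), vals[j].lower()).ratio()
            if score >= 0.8:  # similarity threshold, tune to taste
                pairs.append((vals[i], vals[j], round(score, 2)))

print(pairs)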

Suitable Clustering Approach

I've got a total of 9 sensors in the ground which measure the water content of the soil. Sensors 1-3 are at a depth of 1 m, sensors 4-6 at 2 m, and sensors 7-9 at 3 m.
My dataset also contains the precipitation at the location. It is hourly data:
Time              Sensor-ID  Precipitation  Soil Water Content
2022-01-01 11:00  1          74             120
2022-01-01 11:00  2          74             100
2022-01-01 11:00  3          74             110
...               ...        ...            ...
2022-01-01 11:00  9          74             30
The goal now is to find out whether the different soil depths behave differently with regard to water content after rainfall (over time).
I thought about a clustering method to find out whether the sensors can be clustered based on the data, and to confirm this. Since I'm not very experienced in data science: would that be the right approach, and is it even possible to analyse this with clustering?
For clustering, you can add a new column with three new classes to your data: Class 1 for sensors 1-3, Class 2 for sensors 4-6, and Class 3 for sensors 7-9, and then perform your analysis using the new classes. This can be done in Python, Power BI, or Excel.
You should start by analyzing the different variables with respect to the sensors at different ground depths: use univariate, bivariate, and multivariate plots to work toward your goal.
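A minimal pandas sketch of that depth-class suggestion (column names copied from the table above; the depth-class formula is my assumption):
import pandas as pd

# A few rows in the shape of the question's table.
df = pd.DataFrame({
    'Time': pd.to_datetime(['2022-01-01 11:00'] * 3),
    'Sensor-ID': [1, 4, 9],
    'Precipitation': [74, 74, 74],
    'Soil Water Content': [120, 60, 30],
})

# Sensors 1-3 -> class 1 (1 m), 4-6 -> class 2 (2 m), 7-9 -> class 3 (3 m).
df['DepthClass'] = (df['Sensor-ID'] - 1) // 3 + 1

# Mean water content per depth class over time; if the classes separate
# clearly, that already says the depths behave differently after rain.
print(df.groupby(['DepthClass', 'Time'])['Soil Water Content'].mean())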

How to access the BIC value in Stata after being calculated by estat ic?

This document explains that the values of AIC and BIC are stored in r(S), but when I try display r(S) it returns "type mismatch", and when I try sum r(S) it returns "r ambiguous abbreviation".
Sorry for misunderstanding what r(S) is, but I'd appreciate it if you could let me know how I can access the calculated BIC value.
The document you refer to mentions that r(S) is a matrix. The display command does not work with matrices. Try matrix list. Also see help matrix.
For example:
clear
sysuse auto
regress mpg weight foreign
estat ic
matrix list r(S)
matrix S=r(S)
scalar aic=S[1,5]
di aic
The same document that you cited explains that r(S) is a matrix. That explains the failure of your commands, as summarize is for summarizing variables and display is for displaying strings and scalar expressions, as their help explains. Matrices are neither.
Note that the document you cited
http://www.stata.com/manuals13/restatic.pdf
is at the time of writing not the most recent version
http://www.stata.com/manuals14/restatic.pdf
although the advice is the same either way.
Copy r(S) to a matrix that will not disappear when you run the next r-class command, and then list it directly. For basic help on matrices, start with
help matrix
Here is a reproducible example. I use the Stata 13 version of the dataset because your question hints that you may be using that version:
. use http://www.stata-press.com/data/r13/sysdsn1
(Health insurance data)
. mlogit insure age male nonwhite
Iteration 0: log likelihood = -555.85446
Iteration 1: log likelihood = -545.60089
Iteration 2: log likelihood = -545.58328
Iteration 3: log likelihood = -545.58328
Multinomial logistic regression Number of obs = 615
LR chi2(6) = 20.54
Prob > chi2 = 0.0022
Log likelihood = -545.58328 Pseudo R2 = 0.0185
------------------------------------------------------------------------------
insure | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity | (base outcome)
-------------+----------------------------------------------------------------
Prepaid |
age | -.0111915 .0060915 -1.84 0.066 -.0231305 .0007475
male | .5739825 .2005221 2.86 0.004 .1809665 .9669985
nonwhite | .7312659 .218978 3.34 0.001 .302077 1.160455
_cons | .1567003 .2828509 0.55 0.580 -.3976773 .7110778
-------------+----------------------------------------------------------------
Uninsure |
age | -.0058414 .0114114 -0.51 0.609 -.0282073 .0165245
male | .5102237 .3639793 1.40 0.161 -.2031626 1.22361
nonwhite | .4333141 .4106255 1.06 0.291 -.371497 1.238125
_cons | -1.811165 .5348606 -3.39 0.001 -2.859473 -.7628578
------------------------------------------------------------------------------
. estat ic
Akaike's information criterion and Bayesian information criterion
-----------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
. | 615 -555.8545 -545.5833 8 1107.167 1142.54
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note.
. ret li
matrices:
r(S) : 1 x 6
. mat S = r(S)
. mat li S
S[1,6]
N ll0 ll df AIC BIC
. 615 -555.85446 -545.58328 8 1107.1666 1142.5395
The BIC value is now in S[1,6].

Grouping nearby data in pandas

Let's say I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, e.g.:
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
but only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e., I want to group nearby points, with "nearby" defined as within some epsilon.
I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z; so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer:
df.groupby((df.a.diff() > 1).cumsum()).mean()
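To unpack that one-liner: df.a.diff() > 1 is True exactly where the gap to the previous row exceeds the epsilon (here 1), and cumsum() turns those True flags into increasing group labels. Note it assumes a is sorted; a sketch with the sort made explicit:
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

eps = 1
df = df.sort_values('a')                  # diff() is only meaningful on sorted values
groups = (df['a'].diff() > eps).cumsum()  # a gap > eps starts a new group id
print(df.groupby(groups).mean())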

Nearest Neighbor Search on large database table - SQL and/or ArcGis

Sorry for posting something that's probably obvious, but I don't have much database experience. Any help would be greatly appreciated - but remember, I'm a beginner :-)
I have a table like this:
Table.fruit
ID type Xcoordinate Ycoordinate Taste Fruitiness
1 Apple 3 3 Good 1,5
2 Orange 5 4 Bad 2,9
3 Apple 7 77 Medium 1,4
4 Banana 4 69 Bad 9,5
5 Pear 9 15 Medium 0,1
6 Apple 3 38 Good -5,8
7 Apple 1 4 Good 3
8 Banana 15 99 Bad 6,8
9 Pear 298 18789 Medium 10,01
… … … … … …
1000 Apple 1344 1388 Bad 5
… … … … … …
1958 Banana 759 1239 Good 1
1959 Banana 3 4 Medium 5,2
I need a table that gives me the n (e.g. n=5) closest points to EACH point in the original table, including the distance: Table.5nearest (please note that the distances below are fake). So the resulting table has ID1, ID2, and the distance between ID1 and ID2 (can't post images yet, unfortunately).
ID.Fruit1 ID.Fruit2 Distance
1 1959 1
1 7 2
1 2 2
1 5 30
1 14 50
2 1959 1
2 1 2
… … …
1000 1958 400
1000 Xxx Xxx
… … …
How can I do this (ideally with SQL/database management) or in ArcGis or similar? Any ideas?
Unfortunately, my table contains 15000 records, so the resulting table will have 75000 records if I choose n=5.
Any suggestions GREATLY appreciated.
EDIT:
Thank you very much for your comments and suggestions so far. Let me expand on it a little:
The first proposed method is sort of a brute-force scan of the whole table, producing huge file sizes or, more likely, crashes, correct?
Now, the fruit is just a dummy; the real table contains a fixed ID, nominal attributes ("fruit types" etc.), X and Y spatial columns (in Gauss-Krueger coordinates), and some numeric attributes.
I guess there is a way to code a "bounding box" into this, so the distance calculation is done only for the point in question (say, point 1) and every other point within a square with a certain edge length. I can imagine (remotely) coding or querying for that, but how do I get the script to do that for EVERY point in my ID column? The way I understand it, this should either create a "subtable" for each record/point in Table.fruit containing all points within the square around that record/point, with a distance field added, or one big new table ("Table.5nearest"). I hope this makes some kind of sense. Any ideas? Thanks again.
To get all the distances between all fruit is fairly straightforward. In Access SQL (although you may need to add parentheses everywhere to get it to work :P):
select fruit1.id as id1,
fruit2.id as id2,
sqr(((fruit2.xcoordinate - fruit1.xcoordinate)^2) + ((fruit2.ycoordinate - fruit1.ycoordinate)^2)) as distance
from fruit as fruit1
inner join fruit as fruit2
on fruit2.id <> fruit1.id
order by distance;
I don't know if Access has the necessary sophistication to limit this to the "top n" records for each fruit; so this query, on your recordset, will return 225 million records (or, more likely, crash while trying)!
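If stepping outside Access is an option, a k-d tree sidesteps the quadratic blow-up entirely; a sketch with scipy (random coordinates stand in for the real table):
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
xy = rng.uniform(0, 20000, size=(15000, 2))  # stand-in for the X/Y columns

tree = cKDTree(xy)
# k=6: the nearest hit for each point is the point itself, so ask for one extra.
dist, idx = tree.query(xy, k=6)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-match

# Row i of idx now holds the 5 nearest neighbors of point i, and row i of
# dist the corresponding distances -- i.e. the 15000 x 5 = 75000 pairs wanted.
print(idx[0], dist[0])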
Thank you for your comments so far; in the meantime, I have gone for a pre-fabricated solution: an add-in for ArcGis called Hawth's Tools. This works like a breeze to find the n closest neighbors of any point feature with an x and y value. I hope it can help someone with similar problems and questions.
However, it leaves me with a more database-related issue now. Do you have an idea how I can get a DBMS (preferably Access) to give me a list of all my combinations? That is, if I have a point feature with 15000 fruits arranged in space, how do I get all "pure banana neighborhoods" (and likewise for apple, lemon, etc.) and all other combinations?
Cheers and best wishes.