I have a table containing user names (~1,000 rows) called "potential_users" and another one called "actual_users" (~10 million rows). All records are made up exclusively of [a-z] characters, with no whitespace. Additionally, I know that none of the potential_users are in the actual_users table.
I would like to calculate, for each row in potential_users, the closest record in actual_users based on the Levenshtein distance. For example:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
and
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
Would return:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
If the tables were short, I could make a cross join that calculates, for each record in potential_users, the Levenshtein distance to every record in actual_users and then returns the one with the lowest value. However, in my case this would create an intermediate table of 1,000 x 10,000,000 rows, which is a little impractical.
Is there a cleaner way to perform such an operation without creating a cross join?
Unfortunately, there's no way to do it without a cross join. At the end of the day, every potential user needs to be tested against every actual user.
However, Trino (formerly known as Presto SQL) will execute the join in parallel across many threads and machines, so it can run very quickly given enough hardware. Note that in Trino, intermediate results are streamed from operator to operator, so there is no "intermediate table" with 10M x 1k rows for this query.
For a query like
SELECT potential, min_by(actual, distance), min(distance)
FROM (
    SELECT *, levenshtein_distance(potential, actual) distance
    FROM actual_users, potential_users
)
GROUP BY potential
This is the query plan:
Query Plan
----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[potential, _col1, _col2]
│ Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ _col1 := min_by
│ _col2 := min
└─ RemoteSource[1]
Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
Fragment 1 [HASH]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(FINAL)[potential]
│ Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ min := min("min_1")
│ min_by := min_by("min_by_0")
└─ LocalExchange[HASH] ("potential")
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[2]
Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
Fragment 2 [SOURCE]
Output layout: [potential, min_1, min_by_0]
Output partitioning: HASH [potential]
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
├─ TableScan[memory:9, grouped = false]
│ Layout: [actual:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
│ actual := 0
└─ LocalExchange[SINGLE] ()
│ Layout: [potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[3]
Layout: [potential:varchar(5)]
Fragment 3 [SOURCE]
Output layout: [potential]
Output partitioning: BROADCAST []
Stage Execution Strategy: UNGROUPED_EXECUTION
TableScan[memory:8, grouped = false]
Layout: [potential:varchar(5)]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
potential := 0
(1 row)
In particular, consider the section of the plan shown below: as soon as a row is produced by the cross join, it is fed into the projection operator that calculates the Levenshtein distance between the two values, and then into the aggregation, which only stores one group per "potential" user. Therefore, the amount of memory required by this query should be low.
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
I think you cannot do it with a simple join; there is a whole algorithm to calculate that. Look at this article, which shows a Levenshtein distance algorithm implementation in SQL:
https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=1
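For reference, the algorithm the article implements is the classic dynamic-programming recurrence. A minimal sketch of it, written here in Julia rather than SQL purely for illustration:

# Single-row dynamic-programming Levenshtein distance.
function levenshtein(a::AbstractString, b::AbstractString)
    dp = collect(0:length(b))              # distances for the empty prefix of a
    for (i, ca) in enumerate(a)
        prev = dp[1]                       # value for (i-1, j-1)
        dp[1] = i
        for (j, cb) in enumerate(b)
            cur = dp[j+1]                  # value for (i-1, j)
            dp[j+1] = min(dp[j+1] + 1,                 # deletion
                          dp[j] + 1,                   # insertion
                          prev + (ca == cb ? 0 : 1))   # substitution
            prev = cur
        end
    end
    return dp[end]
end

levenshtein("kajd", "kaj")   # 1, as in the example above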
I have a DataFrame that is 659x2 in size and is sorted by its Low column. Its first 20 rows can be seen below:
julia> size(dfl)
(659, 2)
julia> first(dfl, 20)
20×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-08-25 0.783125
6 │ 2010-05-25 0.808333
7 │ 2010-06-08 0.820938
8 │ 2010-07-20 0.82375
9 │ 2010-05-21 0.824792
10 │ 2010-08-16 0.842188
11 │ 2010-08-12 0.849688
12 │ 2010-02-25 0.871979
13 │ 2010-02-23 0.879896
14 │ 2010-07-30 0.890729
15 │ 2010-06-01 0.916667
16 │ 2010-08-06 0.949271
17 │ 2010-09-10 0.949792
18 │ 2010-03-04 0.969375
19 │ 2010-05-17 0.9875
20 │ 2010-03-09 1.0349
What I'd like to do is to filter out all rows in this dataframe such that only rows with monotonically increasing dates remain. So if applied to the first 20 rows above, I'd like the output to be the following:
julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-09-10 0.949792
Is there some high-level way to achieve this using Julia and DataFrames.jl?
I should also note that I originally prototyped the solution in Python using Pandas, and because it was just a proof of concept I didn't bother trying to figure out how to achieve this with Pandas either (assuming it's even possible). Instead, I just used a Python for loop to iterate over each row of the dataframe and appended only the rows whose dates were greater than the last date in the growing list.
I'm now trying to write this better in Julia, and I have looked into the filter and subset methods in DataFrames.jl. Intuitively, filter doesn't seem like it would work, since the user-supplied filter function can only access the contents of each passed row; subset might be feasible, since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then I guess I'll just have to stick with using a for loop here too.
In the end you need to use a for loop for this task (you have to loop over all the values).
In Julia loops are fast, so using your own for loop does not hinder performance.
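A minimal sketch of such a loop (the function name is just for illustration; it assumes the column is named Date, as in the question):

using DataFrames

# Keep a row only if its Date is strictly greater than the last Date kept so far.
function increasing_dates(df::AbstractDataFrame)
    nrow(df) == 0 && return df
    keep = falses(nrow(df))
    keep[1] = true                     # the first row is always kept
    last_kept = df.Date[1]
    for i in 2:nrow(df)
        if df.Date[i] > last_kept
            keep[i] = true
            last_kept = df.Date[i]
        end
    end
    return df[keep, :]
end

increasing_dates(first(dfl, 20))       # should return the 5 rows shown in the question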
If you are looking for something that is relatively short to type (but slower than a custom for loop, as it performs the operation in several passes), you can use e.g.:
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> 0, true), :]
Dataframes in pandas are indexed by one or more numerical and/or string columns. In particular, after a groupby operation, the output is a dataframe whose new index is given by the groups.
Similarly, Julia dataframes always have a column named Row, which I think is equivalent to the index in pandas. However, after groupby operations, Julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as the index. I discovered the method setindex! while reading the documentation. However, I wasn't able to use this method. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a Julia dataframe using one or more chosen columns?
DataFrames.jl does not currently allow specifying an index for a data frame. The Row column is just there for printing; it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
The setindex! function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42 is equivalent to setindex!(x, 42, 1). Overloading this method allows you to customize the indexing behavior for types that you create.
The docstrings for Base.setindex! can be found in the Julia documentation.
If you really need a table with an index, you could try IndexedTables.jl.
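If all you need from an index is fast lookup by a key column, a Dict built from that column works as a lightweight stand-in. A minimal sketch, assuming the aggregated income data frame from the question (where the Year values are unique):

income = combine(gdf, "Income" => mean)
# map each Year to its row number; rebuild this if the data frame changes
yearindex = Dict(y => i for (i, y) in enumerate(income.Year))
income[yearindex[1990], :Income_mean]   # mean income for Year == 1990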
I have a vector of vectors containing some indices, and a character vector that I want to use them on.
A←(1 2 3)(3 2 1)
B←'ABC'
I have tried:
B[A]
RANK ERROR
B[A]
∧
A⌷B
LENGTH ERROR
A⌷B
∧
and
A⌷¨B
LENGTH ERROR
A⌷¨B
∧
I would like
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘
to be returned, but if I need to find another way, let me know.
The index function ⌷ is a bit odd. To select multiple major cells from an array, you need to enclose the array of indices:
(⊂3 2 1)⌷'ABC'
CBA
In order to use each of two vectors of indices, the array you're selecting from needs to be distributed among the two. You can use APL's scalar extension for this, but then the array you're selecting from needs to be packaged as a scalar:
(⊂1 2 3)(⊂3 2 1)⌷¨⊂'ABC'
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘
So to use your variables:
A←(1 2 3)(3 2 1)
B←'ABC'
(⊂¨A)⌷¨⊂B
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘
Note that, if you are generating permutations which all have the same length, you may be better off avoiding nested arrays. Nested arrays force the system to follow pointers, while simple arrays allow sequential access to densely packed data. This only really matters when you have a LOT of data, of course:
⎕←SIMPLE←↑A ⍝ A 2×3 matrix of indices
1 2 3
3 2 1
(⊂SIMPLE)⌷B
ABC
CBA
B[SIMPLE] ⍝ IMHO bracket indexing is nicer for this
ABC
CBA
↓B[SIMPLE] ⍝ Split if you must
┌───┬───┐
│ABC│CBA│
└───┴───┘
In NARS2000, this is easy:
A←(1 3 2)(3 2 1)
B←'ABC'
⎕fmt {B[⍵]}¨¨A
┌2────────────┐
│┌3───┐ ┌3───┐│
││ ACB│ │ CBA││
│└────┘ └────┘2
└∊────────────┘
C←(1 3 2 3 2 1)(3 2 1)
⎕fmt {B[⍵]}¨¨C
┌2───────────────┐
│┌6──────┐ ┌3───┐│
││ ACBCBA│ │ CBA││
│└───────┘ └────┘2
└∊───────────────┘
I'm looking for a solution in GNU sed, but POSIX sed is OK, and awk would be OK but probably more complicated than necessary. I prefer sed for this; it should be easy, but I'm stuck. It seems like a one-liner can do this; no need to create a Python/bash script or anything.
my attempted solution
sed -i '218,226140d; 218i ...REMOVED...' psql.log
This deletes the desired rows, but the insert gets lost. If I move the insert to line 217 I get:
sed -i '218,226140d; 217i ...REMOVED...' psql.log
result:
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
...REMOVED...
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
I know - this should be good enough, but I'm annoyed that I can't get this simple one-liner to work right. What am I missing?
the problem
I keep the psql.log file as a reference for work I am doing developing SQL code. It's very useful to see iterations of the query and the results.
The problem is that sometimes I forget to limit the output and the query will generate 100k+ rows of results that aren't a helpful reference, and I'd like to delete them from my file, leaving a note that reminds me the query output has been excised.
It would be nice to match on a pattern, say, squash every output of more than 50 rows down to just the first 5 rows and the last 5. However, it's easy for me to mark the line numbers where I've blown up the file, so I'd be happy with just using sed to delete lines N through M and insert the message ...REMOVED... where line N was.
Here is an example log file, added notes are in parentheses. The query text can change and the number of columns can be from 1 to 100 or more:
...
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217)
│ CC6XF2SVWT │ 13182280615 │ [null] │
(many rows)
│ CC6XF2XWDT │ 995086081 │ [null] │
│ CC6XFX3TL1 │ 25195177405 │ [null] │
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
/* another query begins */
select * from table where X = 1 limit 20;
/* well done you remembered to limit the output */
**************************
...
acceptable output
the query text should all be untouched, and the top/bottom three rows of output kept. The annotation ...REMOVED... has been added and rows 218 through 226140 have been deleted:
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
...REMOVED...
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
(etc just like example above)
update
the border comes from my .psqlrc with \pset border 2
therefore solutions depending on the ┌ character are fragile but OK
over time I've learned that manually flagging the line numbers is too time-consuming, so the best solution needs a pattern match
Here is an example for 'every output more than 50 rows I could squash down to just the first 5 rows and the last 5'.
With test input:
$ seq 160 | awk -vstart=10 -vmax=50 -vleft=5 '{if(NR < start) {print; next} {i++; if(i <= left || i > max - left){print}; if(i == left + 1){print "...REMOVED..."}if(i == max){i = 0}}}'
If you would rather put the script in a file, store this as squash.awk:
BEGIN {
start=10;
max=50;
left=5;
}
{
if(NR < start) {
print;
next
}
i++;
if(i <= left || i > max - left) {
print
}
if(i == left + 1) {
print "...REMOVED...";
}
if(i == max) {
i = 0
}
}
For testing:
$ seq 160 | awk -f squash.awk
The variable start is the line number at which squashing will begin.
The variable max is the maximum number of rows per block (in your example, 50).
The variable left is how many rows are kept at the start and at the end of each block.
if(NR < start) {print; next}: if the line number is less than start (in our case 10), we just print the line and go on to the next one.
Here you can put any condition to skip squashing.
i++ increments the row counter.
if(i <= left || i > max - left){print}: if the row counter is at most left (5) or greater than max - left, print the line.
if(i == left + 1){print "...REMOVED..."}: when we start skipping rows, emit the ...REMOVED... message.
if(i == max){i = 0}: when the row counter reaches max, reset it.
One in awk:
$ awk '
/^ └/ { # at the end marker
for(j=1;j<=6;j++) # output from the buffer b the wanted records
print b[j]
for(j=(i-2);j<=i;j++)
print b[j]
delete b # reset buffer
i=0 # and flag / counter
}
/^ ┌/ || i { # at the start marker or when flag up
b[++i]=$0 # gather records to buffer
next
} 1' file # print records which are not between the markers
This might work for you (GNU sed):
sed -r '/\o342[^\n]*$/{:a;N;//ba;s/^(([^\n]*\n){6}).*((\n[^\n]*){5})$/\1 ... REMOVED ...\3/}' file
Focus only on table data, which will always contain the octal value 342. Gather up the table lines in the pattern space, substitute the required value ... REMOVED ... and print. The number of lines kept above and below the required string can be altered: here 6 (headings + 3 rows) and 5 (required string + 3 rows + table count).
To change a range use:
sed 'm,nc ... REMOVE ...' file # where m,n are the from and to line numbers
or:
sed -e 'ma ...REMOVE ...' -e 'm,nd' file
N.B. the d command terminates any following commands.
The sed man page is more helpful than you might think at first glance. The [addr]c command is exactly what is needed (note the whitespace after c is ignored):
sed -i '218,226140c ...REMOVED...' psql.log
So there is the solution for known line numbers.
Does anyone want to provide a generic solution where the line numbers aren't known? Probably awk would be the better tool but maybe sed can remove output that is too long.
Is there something like R's table function in Julia? I've read about xtab, but do not know how to use it.
Suppose we have an R data.frame rdata whose col6 is of the Factor type.
R sample code:
rdata <- read.csv("mycsv.csv") #1
table(rdata$col6) #2
In order to read data and make factors in Julia I do it like this:
using DataFrames
jldata = readtable("mycsv.csv", makefactors=true) #1 :col6 will be now pooled.
..., but how to build R's table like in julia (how to achieve #2)?
You can use the countmap function from StatsBase.jl to count the entries of a single variable. General cross tabulation and statistical tests for contingency tables are lacking at this point. As Ismael points out, this has been discussed in the issue tracker for StatsBase.jl.
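For example, a minimal countmap sketch, assuming jldata and the :col6 column from the question:

using StatsBase

# counts of each distinct value of :col6, returned as a Dict(value => count)
countmap(jldata[:, :col6])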
I came to the conclusion that a similar effect can be achieved using by:
Suppose jldata contains a :gender column.
julia> by(jldata, :gender, nrow)
3x2 DataFrames.DataFrame
| Row | gender | x1 |
|-----|----------|-------|
| 1 | NA | 175 |
| 2 | "female" | 40254 |
| 3 | "male" | 58574 |
Of course it's not a table, but at least I get the same data type as the data source. Surprisingly, by seems to be faster than countmap.
I believe, "by" is depreciated in Julia as of 1.5.3 (It says: ERROR: ArgumentError: by function was removed from DataFrames.jl).
So here are some alternatives, we can use split apply combine to do a cross tabs as well or use FreqTables.
Using Split Combine:
Example 1 - Single column:
using RDatasets
using DataFrames
mtcars = dataset("datasets", "mtcars")
## To do a table on cyl column
gdf = groupby(mtcars, :Cyl)
combine(gdf, nrow)
Output:
# 3×2 DataFrame
# Row │ Cyl nrow
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 6 7
# 2 │ 4 11
# 3 │ 8 14
Example 2 - Cross tabs between 2 columns:
## we have to just change the groupby code a little bit and rest is same
gdf = groupby(mtcars, [:Cyl, :AM])
combine(gdf, nrow)
Output:
#6×3 DataFrame
# Row │ Cyl AM nrow
# │ Int64 Int64 Int64
#─────┼─────────────────────
# 1 │ 6 1 3
# 2 │ 4 1 8
# 3 │ 6 0 4
# 4 │ 8 0 12
# 5 │ 4 0 3
# 6 │ 8 1 2
Also, on a side note, if you don't like nrow as the column name on top, you can use:
combine(gdf, nrow => :Count)
to change the name to Count
Alternate way: Using FreqTables
You can use the FreqTables package as below to compute counts and proportions very easily; to add it you can use Pkg.add("FreqTables"):
## Cross tab between cyl and am
freqtable(mtcars.Cyl, mtcars.AM)
## Proportion between cyl and am
prop(freqtable(mtcars.Cyl, mtcars.AM))
## with margins, like in R (column-wise proportion: margins=2)
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=2)
## with margins for row-wise proportion: margins=1
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=1)
Outputs:
## count cross tabs
#3×2 Named Array{Int64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────
#4 │ 3 8
#6 │ 4 3
#8 │ 12 2
## proportion wise (overall)
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼─────────────────
#4 │ 0.09375 0.25
#6 │ 0.125 0.09375
#8 │ 0.375 0.0625
## Column wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.157895 0.615385
#6 │ 0.210526 0.230769
#8 │ 0.631579 0.153846
## Row wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.272727 0.727273
#6 │ 0.571429 0.428571
#8 │ 0.857143 0.142857