I need to convert the string 2022-10-26T00:00:00.654199+00:00
to a Unix timestamp. Is this possible with ClickHouse?
toUnixTimestamp64Milli(visitParamExtractString(msg, 'time'))
where time is 2022-10-26T00:00:00.654199+00:00
This way it doesn't work.
toUnixTimestamp64Milli expects a DateTime64 value, not a String. Parse the string first, then convert:
SELECT
json,
JSONExtractString(json, 'time') AS time,
parseDateTime64BestEffort(time, 6) AS dt,
toUnixTimestamp64Milli(dt) AS ts
FROM
(
WITH [
'{"time": "2022-10-26T00:00:00.654199+00:00"}',
'{"time": "2022-10-26T00:00:00.654199+08:00"}'] AS jsons
SELECT arrayJoin(jsons) AS json
)
/*
┌─json─────────────────────────────────────────┬─time─────────────────────────────┬─────────────────────────dt─┬────────────ts─┐
│ {"time": "2022-10-26T00:00:00.654199+00:00"} │ 2022-10-26T00:00:00.654199+00:00 │ 2022-10-26 00:00:00.654199 │ 1666742400654 │
│ {"time": "2022-10-26T00:00:00.654199+08:00"} │ 2022-10-26T00:00:00.654199+08:00 │ 2022-10-25 16:00:00.654199 │ 1666713600654 │
└──────────────────────────────────────────────┴──────────────────────────────────┴────────────────────────────┴───────────────┘
*/
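As a cross-check outside ClickHouse, the same arithmetic can be sketched in Python. This is only an illustration, not part of the query above: the helper name iso_to_unix_millis is made up for this example, and it assumes Python 3.7+ for datetime.fromisoformat. It drops the sub-millisecond digits, which agrees with the ts column above for these two inputs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def iso_to_unix_millis(text: str) -> int:
    """Parse an ISO 8601 string with a UTC offset; return Unix time in milliseconds."""
    dt = datetime.fromisoformat(text)  # keeps the +HH:MM offset as tzinfo
    # Integer division avoids float rounding and drops sub-millisecond digits.
    return (dt - EPOCH) // timedelta(milliseconds=1)

for s in ("2022-10-26T00:00:00.654199+00:00",
          "2022-10-26T00:00:00.654199+08:00"):
    print(s, iso_to_unix_millis(s))
```

The +08:00 value comes out 28 800 000 ms (8 hours) smaller, matching the shift seen in the dt column.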
I have a table containing user names (~1 000 rows) called "potential_users" and another one called "actual_users" (~ 10 million rows). All records are exclusively made up of [a-z] characters, no white space. Additionally, I know that none of the potential_users are in the actual_users tables.
I would like to be able to calculate for each row in potential_users what is the closest record in actual_users based on the Levenshtein distance. For example:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
and
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
Would return:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
If the tables were short, I could make a cross join that would calculate for each record in potential_users the Levenshtein distance in actual_users and then return the one with the lowest value. However, in my case this would create an intermediary table of 1 000 x 10 000 000 rows, which is a little impractical.
Is there a cleaner way to perform such an operation without creating a cross join?
Unfortunately, there's no way to do it without a cross join. At the end of the day, every potential user needs to be tested against every actual user.
However, Trino (formerly known as Presto SQL) will execute the join in parallel across many threads and machines, so it can execute very quickly given enough hardware. Note that in Trino, the intermediate results are streamed from operator to operator, so there's no "intermediate table" with 10M x 1k rows for this query.
For a query like
SELECT potential, min_by(actual, distance), min(distance)
FROM (
SELECT *, levenshtein_distance(potential, actual) distance
FROM actual_users, potential_users
)
GROUP BY potential
This is the query plan:
Query Plan
----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[potential, _col1, _col2]
│ Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ _col1 := min_by
│ _col2 := min
└─ RemoteSource[1]
Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
Fragment 1 [HASH]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(FINAL)[potential]
│ Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ min := min("min_1")
│ min_by := min_by("min_by_0")
└─ LocalExchange[HASH] ("potential")
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[2]
Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
Fragment 2 [SOURCE]
Output layout: [potential, min_1, min_by_0]
Output partitioning: HASH [potential]
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
├─ TableScan[memory:9, grouped = false]
│ Layout: [actual:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
│ actual := 0
└─ LocalExchange[SINGLE] ()
│ Layout: [potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[3]
Layout: [potential:varchar(5)]
Fragment 3 [SOURCE]
Output layout: [potential]
Output partitioning: BROADCAST []
Stage Execution Strategy: UNGROUPED_EXECUTION
TableScan[memory:8, grouped = false]
Layout: [potential:varchar(5)]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
potential := 0
(1 row)
In particular, in the following section of the plan, each row produced by the cross join is fed straight into the projection operator, which calculates the Levenshtein distance between the two values, and then into the aggregation, which stores only one group per "potential" user. Therefore, the amount of memory required by this query should be low.
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
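The streaming behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not how Trino is implemented: the function names levenshtein and closest are made up for this example, and the distance function is a textbook dynamic-programming version. The large table is consumed row by row, the small table is held in memory (as with the broadcast in the plan), and the only state kept is one (distance, actual) pair per potential user.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest(potential_users, actual_users):
    """Stream actual_users once; keep only the best match per potential user."""
    potential_users = list(potential_users)  # small side, held in memory
    best = {}
    for actual in actual_users:              # large side, may be any iterator
        for potential in potential_users:
            d = levenshtein(potential, actual)
            if potential not in best or d < best[potential][0]:
                best[potential] = (d, actual)
    return best

result = closest(["user1", "kajd", "bbbbb"], iter(["kaj", "bbbbbbb", "user"]))
for potential, (distance, actual) in result.items():
    print(potential, actual, distance)
```

The dictionary never holds more than one entry per potential user, however many actual users are streamed through it.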
I don't think you can do this with a simple join; there is a whole algorithm for calculating it. This article shows an implementation of the Levenshtein distance algorithm in SQL:
https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=1
I am trying to run a mixed-effects model in Julia (R is too slow for my data), but I keep getting this error.
I have installed the DataArrays, DataFrames, MixedModels, and RDatasets packages and am following the tutorial here --> http://dmbates.github.io/MixedModels.jl/latest/man/fitting/#Fitting-linear-mixed-effects-models-1
These are my steps:
using DataArrays, DataFrames , MixedModels, RDatasets
I get these warnings
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/home/.julia/v0.6/NullableArrays/src/operators.jl:128.
WARNING: Method definition model_response(DataFrames.ModelFrame) in module DataFrames at /home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352 overwritten in module MixedModels at /home/home/.julia/v0.6/MixedModels/src/pls.jl:65.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/home/.julia/v0.6/NullableArrays/src/operators.jl:128.
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/home/.julia/v0.6/NullableArrays/src/operators.jl:128.
WARNING: Method definition model_response(DataFrames.ModelFrame) in module DataFrames at /home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352 overwritten in module MixedModels at /home/home/.julia/v0.6/MixedModels/src/pls.jl:65.
WARNING: Method definition model_response(DataFrames.ModelFrame) in module DataFrames at /home/home/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:352 overwritten in module MixedModels at /home/home/.julia/v0.6/MixedModels/src/pls.jl:65.
I load an R dataset from the lme4 package (the one used in the tutorial)
inst = dataset("lme4", "InstEval")
julia> head(inst)
6×7 DataFrames.DataFrame
│ Row │ S │ D │ Studage │ Lectage │ Service │ Dept │ Y │
├─────┼─────┼────────┼─────────┼─────────┼─────────┼──────┼───┤
│ 1 │ "1" │ "1002" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 2 │ "1" │ "1050" │ "2" │ "1" │ "1" │ "6" │ 2 │
│ 3 │ "1" │ "1582" │ "2" │ "2" │ "0" │ "2" │ 5 │
│ 4 │ "1" │ "2050" │ "2" │ "2" │ "1" │ "3" │ 3 │
│ 5 │ "2" │ "115" │ "2" │ "1" │ "0" │ "5" │ 2 │
│ 6 │ "2" │ "756" │ "2" │ "1" │ "0" │ "5" │ 4 │
I run the model as shown in the tutorial
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
and get
ERROR: UndefVarError: y not defined
Stacktrace:
[1] macro expansion at ./REPL.jl:97 [inlined]
[2] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
The same thing happens when I try it with my own data, loaded using "readtable" from the DataFrames package.
I am running Julia 0.6.0 and all packages are freshly installed. My system is Arch Linux 4.11.7-1 with all the latest packages. Julia installs without a problem, but some packages give warnings (see above).
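Have a go with the @formula macro. Without it, Julia tries to evaluate y as an ordinary variable, hence the UndefVarError. Note also that the column names in the data frame are capitalized (Y, Dept), so the formula has to match:
julia> fit!(lmm(@formula(Y ~ (1 | Dept)), inst), true)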
f_1: 250160.38873 [1.0]
f_2: 250175.99074 [1.75]
f_3: 250123.06531 [0.25]
f_4: 250602.3424 [0.0]
f_5: 250137.66303 [0.4375]
f_6: 250129.76244 [0.325]
f_7: 250125.94066 [0.280268]
f_8: 250121.15016 [0.23125]
f_9: 250119.12389 [0.2125]
f_10: 250114.7257 [0.175]
f_11: 250105.61264 [0.1]
f_12: 250602.3424 [0.0]
f_13: 250107.52714 [0.118027]
f_14: 250106.36924 [0.107778]
f_15: 250105.04638 [0.0925]
f_16: 250104.72722 [0.085]
f_17: 250104.93086 [0.0749222]
f_18: 250104.70046 [0.0831588]
f_19: 250104.70849 [0.0839088]
f_20: 250104.69659 [0.0824088]
f_21: 250104.69632 [0.0822501]
f_22: 250104.69625 [0.0821409]
f_23: 250104.69625 [0.0820659]
f_24: 250104.69624 [0.0821118]
f_25: 250104.69624 [0.0821193]
f_26: 250104.69624 [0.082111]
f_27: 250104.69624 [0.0821118]
f_28: 250104.69624 [0.0821117]
f_29: 250104.69624 [0.0821118]
f_30: 250104.69624 [0.0821118]
Linear mixed model fit by maximum likelihood
Formula: Y ~ 1 | Dept
logLik -2 logLik AIC BIC
-1.25052348×10⁵ 2.50104696×10⁵ 2.50110696×10⁵ 2.50138308×10⁵
Variance components:
Column Variance Std.Dev.
Dept (Intercept) 0.011897242 0.10907448
Residual 1.764556375 1.32836605
Number of obs: 73421; levels of grouping factors: 14
Fixed-effects parameters:
Estimate Std.Error z value P(>|z|)
(Intercept) 3.21373 0.029632 108.455 <1e-99
The warnings are the usual "Julia (and the package ecosystem) is still in flux" messages. But I wonder whether the docs always keep pace with the code.
I've been reading up on the many uses of dialog to create interactive shell scripts, but I'm stumped on how to use the --buildlist option. I've read the man pages, searched Google, searched Stack Overflow, and even read through some old Linux Journal articles from 1994, to no avail.
Can someone give me a clear example of how to use it properly?
Let's imagine a directory with 5 files that you'd want to select from, to copy to another directory. Can someone give a working example?
Thank you!
Consider the following:
dialog --buildlist "Select a directory" 20 50 5 \
f1 "Directory One" off \
f2 "Directory Two" on \
f3 "Directory Three" on
This will display something like
┌────────────────────────────────────────────────┐
│ Select a directory │
│ ┌─────────────────────┐ ┌────^(-)─────────────┐│
│ │Directory One │ │Directory Two ││
│ │ │ │Directory Three ││
│ │ │ │ ││
│ │ │ │ ││
│ │ │ │ ││
│ └─────────────────────┘ └─────────────100%────┘│
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
├────────────────────────────────────────────────┤
│ <OK> <Cancel> │
└────────────────────────────────────────────────┘
The box is 50 characters wide and 20 rows tall; each column displays 5 items. off/on determines if the item starts in the left or right column, respectively.
The controls:
^ selects the left column
$ selects the right column
Move up and down the selected column with the arrow keys
Move the selected item to the other column with the space bar
Toggle between OK and Cancel with the tab key. If you use the --visit-items option, the tab key lets you cycle through the lists as well as the buttons.
Hit enter to select OK or cancel.
If you select OK, the tags (f1, f2, etc.) associated with the items in the right column are printed to standard error.
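For the "select some files, then copy them" case from the question, the call can be wrapped in a small script. The following is a rough sketch in Python rather than a tested recipe. It assumes dialog is installed and on PATH, the helper names buildlist_args and choose_files and the file names are hypothetical, and it relies on the --separate-output option (described in the dialog man page as printing one tag per line without quoting). Only the argument builder is exercised here, and choose_files needs a real terminal.

```python
import subprocess

def buildlist_args(files, preselected=()):
    """Build the argv for `dialog --buildlist`: one tag/item/status triple per file."""
    args = ["dialog", "--separate-output",
            "--buildlist", "Select files to copy", "20", "50", "5"]
    for name in files:
        status = "on" if name in preselected else "off"  # on = starts in right column
        args += [name, name, status]                     # tag, item text, status
    return args

def choose_files(files):
    """Run dialog; return the tags from the right column (empty list on Cancel)."""
    # stdout stays attached to the terminal so dialog can draw; tags arrive on stderr.
    proc = subprocess.run(buildlist_args(files), stderr=subprocess.PIPE, text=True)
    return proc.stderr.splitlines() if proc.returncode == 0 else []

# Show the command line that would be run for two hypothetical files.
print(buildlist_args(["a.txt", "b.txt"], preselected={"b.txt"}))
```

The list returned by choose_files could then be passed to shutil.copy one file at a time.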
cat raw.txt
Name country IP Cost
sam us 10.10.10.10 $250
jack India 10.10.10.12 $190
joy Australia 10.10.10.13 $230
christ canada 10.10.10.15 $190
jackson africa 10.10.10.20 $230
I need to print this as a table with four columns, i.e. Name, Country, IP, Cost, like this:
http://res.cloudinary.com/dzy8bgton/image/upload/v1413617325/Screenshot_from_2014-10-18_12_35_11_h6wjsu.png
Can anyone please help me out?
Here's an old school answer :-)
#!/bin/sh
# use tbl|nroff to make an ASCII table
# use sed to change multiple spaces into a single tab for tbl(1)
sed 's/  */\t/g' < raw.txt | awk '
BEGIN {
print ".TS" # beginning of table
print "allbox;" # allbox format
print "c s s s" # Table name format - centered and spanning 4 columns
print "lb lb lb lb" # bold column headers
print "l l l l." # table with 4 left justified columns. "." means repeat for next line
print "My Table" # Table name
}
{print} # print each line of 4 values
END {
print ".TE" # end of table
}' | tbl | nroff -Tdumb
which generates
┌─────────────────────────────────────────┐
│ My Table │
├────────┬───────────┬─────────────┬──────┤
│Name │ country │ IP │ Cost │
├────────┼───────────┼─────────────┼──────┤
│sam │ us │ 10.10.10.10 │ $250 │
├────────┼───────────┼─────────────┼──────┤
│jack │ India │ 10.10.10.12 │ $190 │
├────────┼───────────┼─────────────┼──────┤
│joy │ Australia │ 10.10.10.13 │ $230 │
├────────┼───────────┼─────────────┼──────┤
│christ │ canada │ 10.10.10.15 │ $190 │
├────────┼───────────┼─────────────┼──────┤
│jackson │ africa │ 10.10.10.20 │ $230 │
└────────┴───────────┴─────────────┴──────┘
You can try the column command:
column -t file
Name     country    IP           Cost
sam      us         10.10.10.10  $250
jack     India      10.10.10.12  $190
joy      Australia  10.10.10.13  $230
christ   canada     10.10.10.15  $190
jackson  africa     10.10.10.20  $230
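For readers who want the same alignment without the column utility, the idea can be sketched in a few lines of Python. This is an approximation written for this example, not the actual implementation of column. The function name align_columns and the two-space gap are assumptions.

```python
def align_columns(lines, gap=2):
    """Pad whitespace-separated fields so that every column lines up."""
    rows = [line.split() for line in lines if line.strip()]
    ncols = max(len(r) for r in rows)
    # Width of each column = longest field found in that column.
    widths = [max(len(r[i]) for r in rows if i < len(r)) for i in range(ncols)]
    out = []
    for r in rows:
        cells = [cell.ljust(widths[i] + gap) for i, cell in enumerate(r)]
        out.append("".join(cells).rstrip())  # no trailing padding on the last field
    return out

raw = """Name country IP Cost
sam us 10.10.10.10 $250
jack India 10.10.10.12 $190
joy Australia 10.10.10.13 $230
christ canada 10.10.10.15 $190
jackson africa 10.10.10.20 $230"""

aligned = align_columns(raw.splitlines())
for line in aligned:
    print(line)
```

Each field is padded to the width of the longest entry in its column plus the gap, so rows with short names still line up with the long ones.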