I need to generate 50 Millions Rows csv file with random data: how to optimize this program? - rebol

The program below can generate random data according to some specs (example here is for 2 columns)
It works with a few hundred of thousand lines on my PC (should depend on RAM). I need to scale to dozen of millions row.
How can I optimize the program to write directly to disk ? Subsidiarily how can I "cache" the parsing rule execution as it is always the same pattern repeated 50 Millions times ?
Note: to use the program below, just type generate-blocks and save-blocks output will be db.txt
Rebol[]
specs: [
[3 digits 4 digits 4 letters]
[2 letters 2 digits]
]
;====================================================================================================================
digits: charset "0123456789"
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
separator: charset ";"
block-letters: [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]
blocks: copy []
generate-row: func[][
Foreach spec specs [
rule: [
any [
[
set times integer! [['digits (
repeat n times [
block: rejoin [block random 9]
]
)
|
'letters (repeat n times [
block: rejoin [ block to-string pick block-letters random 24]
]
)
]
|
[
'letters (repeat n times [block: rejoin [ block to-string pick block-letters random 24]
]
)
|
'digits (repeat n times [block: rejoin [block random 9]]
)
]
]
|
{"} any separator {"}
]
]
to end
]
block: copy ""
parse spec rule
append blocks block
]
]
generate-blocks: func[m][
repeat num m [
generate-row
]
]
quote: func[string][
rejoin [{"} string {"}]
]
save-blocks: func[file][
if exists? to-rebol-file file [
answer: ask rejoin ["delete " file "? (Y/N): "]
if (answer = "Y") [
delete %db.txt
]
]
foreach [field1 field2] blocks [
write/lines/append %db.txt rejoin [quote field1 ";" quote field2]
]
]

Use open with /direct and /lines refinement to write directly to file without buffering the content:
file: open/direct/lines/write %myfile.txt
loop 1000 [
t: random "abcdefghi"
append file t
]
Close file
This will write 1000 random lines without buffering.
You can also prepare a block of lines (lets say 10000 rows) then write it directly to file, this will be faster than writing line-by-line.
file: open/direct/lines/write %myfile.txt
loop 100 [
b: copy []
loop 1000 [append b random "abcdef"]
append file b
]
close file
this will be much faster, 100000 rows less than a second.
Hope this will help.
Note that, you can change the number 100 and 1000 according to your needs an memory of your pc, and use b: make block! 1000 instead of b: copy [], it will be faster.

Related

jq reducing stream to an array of all leaf values using input

I want to receive streamed json inputs and reduce them to an array containing the leaf values.
Demo: https://jqplay.org/s/cZxLguJFxv
Please consider
filter:
try reduce range(30) as $i ( []; (.+[until(length==2;input)[1]] // error(.)) )
catch empty
input:
[
[
0,
0,
"a"
],
null
]
[
[
0,
0,
"a"
]
]
[
[
0,
1,
"b"
],
null
]
[
[
0,
1,
"b"
]
]
[
[
0,
1
]
]
[
[
1
],
0
]
...
output:
empty
I expect the output: [null, null, 0, ...] but I get empty instead.
I told reduce to iterate 30 times but the size of inputs is less than that. I'm expecting it will empty those input of length other than 2 and produce an array containing all leaf values.
I don't know how this will behave when there is no more input with length 2 left and there are iterations of reduce left.
I want to know why my filter returns empty. What am I doing wrong? Thanks!
These filters should do what you want:
jq -n 'reduce inputs as $in ([]; if $in | has(1) then . + [$in[1]] else . end)'
Demo
jq -n '[inputs | select(has(1))[1]]'
Demo

Inverse of `split` function: `join` a string using a delimeter

IN Red and Rebol(3), you can use the split function to split a string into a list of items:
>> items: split {1, 2, 3, 4} {,}
== ["1" " 2" " 3" " 4"]
What is the corresponding inverse function to join a list of items into a string? It should work similar to the following:
>> join items {, }
== "1, 2, 3, 4"
There's no inbuild function yet, you have to implement it yourself:
>> join: function [series delimiter][length: either char? delimiter [1][length? delimiter] out: collect/into [foreach value series [keep rejoin [value delimiter]]] copy {} remove/part skip tail out negate length length out]
== func [series delimiter /local length out value][length: either char? delimiter [1] [length? delimiter] out: collect/into [foreach value series [keep rejoin [value delimiter]]] copy "" remove/part skip tail out negate length length out]
>> join [1 2 3] #","
== "1,2,3"
>> join [1 2 3] {, }
== "1, 2, 3"
per request, here is the function split into more lines:
join: function [
series
delimiter
][
length: either char? delimiter [1][length? delimiter]
out: collect/into [
foreach value series [keep rejoin [value delimiter]]
] copy {}
remove/part skip tail out negate length length
out
]
There is an old modification of rejoin doing that
rejoin: func [
"Reduces and joins a block of values - allows /with refinement."
block [block!] "Values to reduce and join"
/with join-thing "Value to place in between each element"
][
block: reduce block
if with [
while [not tail? block: next block][
insert block join-thing
block: next block
]
block: head block
]
append either series? first block [
copy first block
] [
form first block
]
next block
]
call it like this rejoin/with [..] delimiter
But I am pretty sure, there are other, even older solutions.
Following function works:
myjoin: function[blk[block!] delim [string!]][
outstr: ""
repeat i ((length? blk) - 1)[
append outstr blk/1
append outstr delim
blk: next blk ]
append outstr blk ]
probe myjoin ["A" "B" "C" "D" "E"] ", "
Output:
"A, B, C, D, E"

Alternative rules in PARSE with SKIP only did not output expected results

I encounter a snippet of code:
blk: [1 #[none] 2 #[none] 3 #[none]]
probe parse blk [
any [
set s integer! (print 'integer) | (print 'none) skip
]
]
the output is:
integer
none
integer
none
integer
none
none
true
Note that in front of true there are two nones. While the next code snippet output the expected output:
blk: [1 #[none] 2 #[none] 3 #[none]]
probe parse blk [
any [
set s integer! (print 'integer) | and none! (print 'none) skip
]
]
output:
integer
none
integer
none
integer
none
true
Why the previous one could not output the same result with the last one?
Your first rule should better be
probe parse blk [
any [
set s integer! (print 'integer) | skip (print 'none)
]
]
as in your first rule you print none, if there is no integer and just skip after. This causes, that you print none even when the cursor is at the end. The skip just ends the parsing.
In your second rule the none! is not true at the end, so the parsing stops. It can be written shorter
probe parse blk [
any [
set s integer! (print 'integer) | none! (print 'none)
]
]
In your second rule the and does not move the cursor forward, so you need the additional skip. Without and the none! eats one item already.

How do I convert set-words in a block to words

I want to convert a block from block: [ a: 1 b: 2 ] to [a 1 b 2].
Is there an easier way of than this?
map-each word block [ either set-word? word [ to-word word ] [ word ] ]
Keeping it simple:
>> block: [a: 1 b: 2]
== [a: 1 b: 2]
>> forskip block 2 [block/1: to word! block/1]
== b
>> block
== [a 1 b 2]
I had same problem so I wrote this function. Maybe there's some simpler solution I do not know of.
flat-body-of: function [
"Change all set-words to words"
object [object! map!]
][
parse body: body-of object [
any [
change [set key set-word! (key: to word! key)] key
| skip
]
]
body
]
These'd create new blocks, but are fairly concise. For known set-word/value pairs:
collect [foreach [word val] block [keep to word! word keep val]]
Otherwise, you can use 'either as in your case:
collect [foreach val block [keep either set-word? val [to word! val][val]]]
I'd suggest that your map-each is itself fairly concise also.
I like DocKimbel's answer, but for the sake of another alternative...
for i 1 length? block 2 [poke block i to word! pick block i]
Answer from Graham Chiu:
In R2 you can do this:
>> to block! form [ a: 1 b: 2 c: 3]
== [a 1 b 2 c 3]
Or using PARSE:
block: [ a: 1 b: 2 ]
parse block [some [m: set-word! (change m to-word first m) any-type!]]

How do I refer to variable in func argument when same is used in foreach

How can I refer to date as argument in f within the foreach loop if date is also used as block element var ? Am I obliged to rename my date var ?
f: func[data [block!] date [date!]][
foreach [date o h l c v] data [
]
]
A: simple, compose is your best friend.
f: func[data [block!] date [date!]][
foreach [date str] data compose [
print (date)
print date
]
]
>> f [2010-09-01 "first of sept" 2010-10-01 "first of october"] now
7-Sep-2010/21:19:05-4:00
1-Sep-2010
7-Sep-2010/21:19:05-4:00
1-Oct-2010
You need to either change the parameter name from date or assign it to a local variable.
You can access the date argument inside the foreach loop by binding the 'date word from the function specification to the data argument:
>> f: func[data [block!] date [date!]][
[ foreach [date o h l c v] data [
[ print last reduce bind find first :f 'date 'data
[ print date
[ ]
[ ]
>> f [1-1-10 1 2 3 4 5 2-1-10 1 2 3 4 5] 8-9-10
8-Sep-2010
1-Jan-2010
8-Sep-2010
2-Jan-2010
It makes the code very difficult to read though. I think it would be better to assign the date argument to a local variable inside the function as Graham suggested.
>> f: func [data [block!] date [date!] /local the-date ][
[ the-date: :date
[ foreach [date o h l c v] data [
[ print the-date
[ print date
[ ]
[ ]
>> f [1-1-10 1 2 3 4 5 2-1-10 1 2 3 4 5] 8-9-10
8-Sep-2010
1-Jan-2010
8-Sep-2010
2-Jan-2010