Read parquet file compressed with zstd

I am new to Julia and I am trying to port some code from Python. In Python I wrote a DataFrame to a parquet file using the zstd compression lib (supported for parquet writing by both pandas and fastparquet).
Reading it in Julia gives an error, since ParquetFiles or FileIO (I am not sure which one is responsible for the decompression) does not support zstd.
Any ideas on how to read this file in Julia?
using DataFrames
using ParquetFiles
using FileIO
test = DataFrame(load("test.parquet"))
Unknown compression codec for column chunk: 6
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] bytes at /home/morgado/.julia/packages/Parquet/qSvbc/src/reader.jl:149 [inlined]
[3] bytes at /home/morgado/.julia/packages/Parquet/qSvbc/src/reader.jl:140 [inlined]
[4] values(::ParFile, ::Parquet.Page) at /home/morgado/.julia/packages/Parquet/qSvbc/src/reader.jl:232
[5] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at /home/morgado/.julia/packages/Parquet/qSvbc/src/reader.jl:178
[6] setrow(::ColCursor{Int64}, ::Int64) at /home/morgado/.julia/packages/Parquet/qSvbc/src/cursor.jl:144
[7] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at /home/morgado/.julia/packages/Parquet/qSvbc/src/cursor.jl:115
[8] (::getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64})(::String) at ./none:0
[9] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}) at ./generator.jl:47
[10] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType361}, ::Int64) at /home/morgado/.julia/packages/Parquet/qSvbc/src/cursor.jl:269 (repeats 2 times)
[11] getiterator(::ParquetFiles.ParquetFile) at /home/morgado/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:74
[12] nondatavaluerows(::ParquetFiles.ParquetFile) at /home/morgado/.julia/packages/Tables/IT0t3/src/tofromdatavalues.jl:16
[13] columns at /home/morgado/.julia/packages/Tables/IT0t3/src/fallbacks.jl:173 [inlined]
[14] #DataFrame#393(::Bool, ::Type, ::ParquetFiles.ParquetFile) at /home/morgado/.julia/packages/DataFrames/VrZOl/src/other/tables.jl:34
[15] DataFrame(::ParquetFiles.ParquetFile) at /home/morgado/.julia/packages/DataFrames/VrZOl/src/other/tables.jl:25
[16] top-level scope at In[25]:8
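I have not found a zstd-capable path through ParquetFiles.jl, but newer releases of Parquet.jl can decompress Zstd-compressed column chunks (via CodecZstd). A minimal sketch, assuming a Parquet.jl version that provides the read_parquet function (check your installed version):

```julia
using Parquet      # newer releases handle zstd-compressed column chunks
using DataFrames

# read_parquet returns a Tables.jl-compatible source,
# so it can be materialized directly into a DataFrame
df = DataFrame(read_parquet("test.parquet"))
```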

Related

How do I read a gzipped XLSX file in Julia?

I have a gz file which I downloaded from here, using HTTP. Now I want to read the xlsx file contained in the gz file and convert it to a DataFrame. I tried this:
julia> using HTTP, XLSX, DataFrames, GZip
julia> file = HTTP.get("http://www.tsetmc.com/tsev2/excel/IntraDayPrice.aspx?i=35425587644337450&m=30")
julia> write("c:/users/shayan/desktop/file.xlsx.gz", file.body);
julia> df = GZip.open("c:/users/shayan/desktop/file.xlsx.gz", "r") do io
           XLSX.readxlsx(io)
       end
But this throws a MethodError:
ERROR: MethodError: no method matching readxlsx(::GZipStream)
Closest candidates are:
readxlsx(::AbstractString) at C:\Users\Shayan\.julia\packages\XLSX\FFzH0\src\read.jl:37
Stacktrace:
[1] (::var"#23#24")(io::GZipStream)
# Main c:\Users\Shayan\Documents\Python Scripts\test.jl:15
[2] gzopen(::var"#23#24", ::String, ::String)
# GZip C:\Users\Shayan\.julia\packages\GZip\JNmGn\src\GZip.jl:269
[3] open(::Function, ::Vararg{Any})
# GZip C:\Users\Shayan\.julia\packages\GZip\JNmGn\src\GZip.jl:265
[4] top-level scope
# c:\Users\Shayan\Documents\Python Scripts\test.jl:14
XLSX.jl does not work on streams. So you would need to ungzip the file to some temporary location and then read it.
tname = tempname() * ".xlsx"
GZip.open("c://temp//journals.xlsx.gz", "r") do io
    open(tname, "w") do out
        write(out, read(io))
    end
end
df = XLSX.readxlsx(tname)

Julia @eval world age mismatch

I'm trying to use the Julia @eval functionality to load the PyPlot package only on demand. However, I very often run into a world age mismatch.
Here is a minimal example where I try to plot on demand:
function CreateMatrix(Ncount; Plot=true)
    TheMatrix = fill(0.0, Ncount, Ncount)
    if Plot
        @eval using PyPlot
        # Plot the matrix
        PyPlot.figure()
        PyPlot.imshow(abs.(TheMatrix))
        PyPlot.colorbar()
    end
    return TheMatrix
end
CreateMatrix(10;Plot=false)
CreateMatrix(10;Plot=true)
With the output
ERROR: LoadError: MethodError: no method matching figure()
The applicable method may be too new: running in world age 25063, while current world is 25079.
Closest candidates are:
figure(!Matched::Any...; kws...) at ~/.julia/packages/PyPlot/fZuOQ/src/PyPlot.jl:148 (method too new to be called from this world context.)
Stacktrace:
[1] #CreateMatrix#3(::Bool, ::Function, ::Int64) at myfile.jl:7
[2] (::getfield(Main, Symbol("#kw##CreateMatrix")))(::NamedTuple{(:Plot,),Tuple{Bool}}, ::typeof(CreateMatrix), ::Int64) at ./none:0
[3] top-level scope at none:0
[4] include at ./boot.jl:317 [inlined]
[5] include_relative(::Module, ::String) at ./loading.jl:1044
[6] include(::Module, ::String) at ./sysimg.jl:29
[7] exec_options(::Base.JLOptions) at ./client.jl:231
[8] _start() at ./client.jl:425
in expression starting at myfile.jl:16
Does anyone know how to use the @eval functionality properly?
EDIT
One of the comments suggested wrapping the plotting commands in a separate function annotated with @noinline, as below, but this does not work.
function CreateMatrix(Ncount; Plot=false)
    TheMatrix = fill(0.0, Ncount, Ncount)
    if Plot
        @eval using PyPlot
        # Plot the matrix
        ThePlotting(TheMatrix)
    end
    return TheMatrix
end
@noinline function ThePlotting(TheMatrix)
    PyPlot.figure()
    PyPlot.imshow(abs.(TheMatrix))
    PyPlot.colorbar()
end
CreateMatrix(10;Plot=false)
CreateMatrix(10;Plot=true)
I'm running Julia version 1.0.2.
You can implement it like this:
function CreateMatrix(Ncount; Plot=true)
    TheMatrix = fill(0.0, Ncount, Ncount)
    if Plot
        if isdefined(Main, :PyPlot)
            println("PyPlot already loaded")
            PyPlot.figure()
            PyPlot.imshow(abs.(TheMatrix))
            PyPlot.colorbar()
        else
            println("loading PyPlot")
            @eval using PyPlot
            Base.invokelatest(PyPlot.figure)
            Base.invokelatest(PyPlot.imshow, abs.(TheMatrix))
            Base.invokelatest(PyPlot.colorbar)
        end
    end
    return TheMatrix
end
I have used a conditional so that you can see which branch gets executed on repeated calls to the function.
Initially I thought that calling a non-inlined function would let Julia pick up the world-age change (but it turns out that the check is strict).
Finally, in general it is probably safer not to write code like this, but simply to load the module at top-level scope (possibly conditionally).
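The world-age rule can also be demonstrated without PyPlot at all; a minimal self-contained sketch (the name newfun is just for illustration):

```julia
function demo()
    @eval newfun() = 42       # method defined at runtime, in a newer world
    # newfun()                # calling it directly here throws a MethodError:
    #                         # the method is "too new" for this world age
    return Base.invokelatest(newfun)  # runs in the latest world instead
end

demo()   # returns 42
```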

Is TensorFlow compatible with Julia 0.6.4?

So I have a project in my machine learning class and we are using Julia as our programming language. We can use any packages we want to build neural networks, but I can't seem to get TensorFlow to test correctly. Pkg.add("TensorFlow") works seemingly fine, but here is the output of Pkg.test("TensorFlow"):
julia> Pkg.test("TensorFlow")
INFO: Testing TensorFlow
ERROR: LoadError: LoadError: could not load library "C:\Users\Ryan .LAPTOP-KJUJGIC7\.julia\v0.6\TensorFlow\src\..\deps\usr\bin\libtensorflow"
The specified module could not be found.
Stacktrace:
[1] dlopen(::String, ::UInt32) at .\libdl.jl:97
[2] TensorFlow.Graph() at C:\Users\Ryan .LAPTOP-KJUJGIC7\.julia\v0.6\TensorFlow\src\core.jl:21
[3] include_from_node1(::String) at .\loading.jl:576
[4] include(::String) at .\sysimg.jl:14
[5] include_from_node1(::String) at .\loading.jl:576
[6] include(::String) at .\sysimg.jl:14
[7] process_options(::Base.JLOptions) at .\client.jl:305
[8] _start() at .\client.jl:371
while loading C:\Users\Ryan .LAPTOP-KJUJGIC7\.julia\v0.6\TensorFlow\test\..\examples\logistic.jl, in expression starting on line 22
while loading C:\Users\Ryan .LAPTOP-KJUJGIC7\.julia\v0.6\TensorFlow\test\runtests.jl, in expression starting on line 6
=================================================[ ERROR: TensorFlow ]==================================================
failed process: Process(`'C:\Users\Ryan .LAPTOP-KJUJGIC7\AppData\Local\Julia-0.6.4\bin\julia.exe' -Cgeneric '-JC:\Users\Ryan .LAPTOP-KJUJGIC7\AppData\Local\Julia-0.6.4\lib\julia\sys.dll' --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes 'C:\Users\Ryan .LAPTOP-KJUJGIC7\.julia\v0.6\TensorFlow\test\runtests.jl'`, ProcessExited(1)) [1]
========================================================================================================================
ERROR: TensorFlow had test errors
I'm running Julia Version 0.6.4 on Windows 10; if there's a way to resolve this error or a workaround I'd love some suggestions.
TensorFlow.jl does not support Windows.
You have two options:
(1) Try using TensorFlow via PyCall.jl:
using Conda
Conda.runconda("install -c conda-forge tensorflow")
(2) Use Flux.jl instead
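Once TensorFlow is installed into the Conda environment, option (1) would look something like this (a sketch, assuming PyCall.jl is configured to use the Conda-managed Python):

```julia
using PyCall

tf = pyimport("tensorflow")   # wrap the Python tensorflow module
# Python functions and attributes are then reachable through the wrapper,
# e.g. sess = tf.Session()
```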

Trying to implement experience replay in Tensorflow

I am trying to implement experience replay in Tensorflow. The problem I am having is in storing outputs for the models trial and then updating the gradient simultaneously. A couple approaches I have tried are to store the resulting values from sess.run(model), however, these are not tensors and cannot be used for gradient descent as far as tensorflow is concerned. I am currently trying to use tf.assign(), however, The difficulty I am having is best shown through this example.
import tensorflow as tf
import numpy as np

def get_model(input):
    return input

a = tf.Variable(0)
b = get_model(a)
d = tf.Variable(0)
for i in range(10):
    assign = tf.assign(a, tf.Variable(i))
    b = tf.Print(b, [assign], "print b: ")
    c = b
    d = tf.assign_add(d, c)
e = d
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(e))
The issues I have with the above code are as follows:
- It prints different values on each run, which seems odd.
- It does not correctly update at each step in the for loop.
Part of why I am confused is that I understand you have to run the assign operation to update the prior reference; however, I just can't figure out how to do that correctly in each step of the for loop. If there is an easier way, I am open to suggestions. This example mirrors how I am currently trying to feed in an array of inputs and get a sum based on each prediction the model makes. If clarification on any of the above would help, I will be more than happy to provide it.
The following are the results from running the code above three times.
$ python test3.py
2018-07-03 13:35:08.380077: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
80
$ python test3.py
2018-07-03 13:35:14.055827: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [7]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
60
$ python test3.py
2018-07-03 13:35:20.120661: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
90
The result I am expecting is as follows:
print b: [0]
print b: [1]
print b: [2]
print b: [3]
print b: [4]
print b: [5]
print b: [6]
print b: [7]
print b: [8]
print b: [9]
45
The main reason I am confused is that sometimes it prints all nines, which makes me think it loads the last assigned value 10 times; however, sometimes it loads different values, which seems to contradict this theory.
What I would like to do is feed in an array of input examples and compute the gradient for all examples at the same time. It needs to be concurrent because the reward used depends on the outputs of the model, so if the model changes, the resulting rewards would also change.
When you call tf.assign(a, tf.Variable(i)), this does not immediately assign the value of the second variable to the first one. It just creates an operation in the graph that performs the assignment when sess.run(...) is called.
When it is run, all 10 assignments try to execute at the same time. One of them randomly wins, and its value is then passed through the 10 chained assign_add operations, which in effect adds it up 10 times.
As for your motivating problem of implementing experience replay, most approaches I have come across use tf.placeholder() to feed the experience buffer contents into the network during training.
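A hedged sketch of that placeholder pattern, using the same 1.x graph API as the question (tf.reduce_sum here is just a stand-in for the real model/loss):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph API, as in the question

# Build the graph once: a placeholder receives the whole batch of stored
# experiences, so the loss (and its gradient) covers the batch in one run.
x = tf.placeholder(tf.float32, shape=[None])  # replay-buffer batch
total = tf.reduce_sum(x)                      # stand-in for the model/loss

with tf.Session() as sess:
    buffer = np.arange(10, dtype=np.float32)  # "experiences" 0..9
    print(sess.run(total, feed_dict={x: buffer}))  # sums the batch: 45.0
```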

ERROR: LoadError: LoadError: syntax: "()" is not a valid function argument name

While trying to load TensorFlow in Julia, I get the error
ERROR: LoadError: LoadError: syntax: "()" is not a valid function argument name
I could not find a solution. I am using Ubuntu and Julia version 0.6.0.
The problem:
julia> using TensorFlow
INFO: Precompiling module TensorFlow.
WARNING: Loading a new version of TensorFlow.jl for the first time. This initial load can take around 5 minutes as code is precompiled; subsequent usage will only take a few seconds.
ERROR: LoadError: LoadError: syntax: "()" is not a valid function argument name
Stacktrace:
[1] include_from_node1(::String) at ./loading.jl:569
[2] include(::String) at ./sysimg.jl:14
[3] include_from_node1(::String) at ./loading.jl:569
[4] include(::String) at ./sysimg.jl:14
[5] anonymous at ./<missing>:2
while loading /home/spg/.julia/v0.6/TensorFlow/src/ops.jl, in expression starting on line 119
while loading /home/spg/.julia/v0.6/TensorFlow/src/TensorFlow.jl, in expression starting on line 184
ERROR: Failed to precompile TensorFlow to /home/spg/.julia/lib/v0.6/TensorFlow.ji.
Stacktrace:
[1] compilecache(::String) at ./loading.jl:703
[2] _require(::Symbol) at ./loading.jl:490
[3] require(::Symbol) at ./loading.jl:398
Following this link, run Pkg.update() to get a version of TensorFlow.jl that is compatible with your Julia install.