Camelot ignoring backend request - python-camelot

I recently got into Camelot, but for my use case I need to use my own backend to remove text during the imaging process.
Here is what I tried based on the documentation:
import subprocess

class ConversionBackend(object):
    def convert(pdf_path, png_path):
        # render the PDF to PNG with Ghostscript, filtering out images and text
        arg2 = '-sOutputFile=' + png_path
        p = subprocess.Popen(['/usr/bin/gs', '-sDEVICE=png16m', '-dNOPAUSE', '-dBATCH', '-dQUIET',
                              '-dFILTERIMAGE', '-dFILTERTEXT', '-r300', arg2, pdf_path],
                             stdout=subprocess.PIPE)
        p.communicate()  # wait for Ghostscript to finish writing the PNG
Aside from the fact that my Ghostscript usage is not optimal, all of this works when I run it step by step in the Python console. However, when I call Camelot with
tables = camelot.read_pdf(path, line_scale=100, split_text=True, flag_size=True, layout_kwargs={'detect_vertical': False, 'char_margin': 2.0}, pages='all', backend=ConversionBackend())
Camelot still executes the conversion, but with complete disregard for backend=ConversionBackend().
Any ideas on how to fix this?
Sh4yce

Related

Karate - Results from two series of tests aren't merged anymore after upgrading version from 0.9.3 to 1.2.0

We are facing an issue with the test results after upgrading our Karate project from v0.9.3 to v1.2.0. We are testing an API after the execution of two batches: a first series of tests (Runner 1) is executed on our API after the first batch run, then a second series of tests (Runner 2, on new feature files) is executed after the second batch.
On the version we were using before, the test results were merged, but on the updated version we cannot get all the results into the same report: the results of the first run are deleted, so we're left with only the results of the second run.
Previously working code :
Results results1 = Runner.parallel(
        Arrays.asList("#tag1,#tag2", "#ignore"),
        Collections.singletonList("classpath:features"),
        5,
        "target/sources-rapports");
int totalFailCount = results1.getFailCount();

Results results2 = Runner.parallel(
        Arrays.asList("#tag3,#tag4", "#ignore"),
        Collections.singletonList("classpath:features"),
        5,
        "target/sources-rapports");
totalFailCount += results2.getFailCount();
generateReport(results2.getReportDir());
The report would contain all test features of results1 and results2, whereas now each execution seems to remove the previous Karate JSON files before generating the new ones.
New, non-working code with the new syntax:
Runner.path("classpath:features")
.tags(Arrays.asList("#tag1,#tag2", "#ignore"))
.outputCucumberJson(true)
.parallel(5);
I'm looking for help to solve this problem. Do not hesitate to ask for more information if you need it.
Try this change:
Runner.path("classpath:features")
.tags(Arrays.asList("#tag1,#tag2", "#ignore"))
.outputCucumberJson(true)
.backupReportDir(false)
.parallel(5);
For further info: https://stackoverflow.com/a/66685944/143475

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.
A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed arg if it is a dataframe, and returns one dataframe if the passed arg is an iterable of dataframes. The ambiguously typed object could be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.read_sql_query, and only about 3 calls confused Pylint, so we almost used this solution.)
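A minimal sketch of what such a wrapper could look like (the name ensure_dataframe and the choice to concatenate iterator chunks are my own assumptions, not from the original post):
from typing import Iterator, Union

import pandas as pd

def ensure_dataframe(obj: Union[pd.DataFrame, Iterator[pd.DataFrame]]) -> pd.DataFrame:
    # Hypothetical helper: collapse the DataFrame-or-iterator union so Pylint
    # (and readers) always see a plain DataFrame.
    if isinstance(obj, pd.DataFrame):
        return obj
    # Assumption: concatenating the chunks is acceptable for the caller.
    return pd.concat(obj, ignore_index=True)
It could then be used as df = ensure_dataframe(pd.read_sql_query(sql, con)), after which df.query(...) no longer looks like a call on a generator.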

Adding new datasources to an existing .rrd

I have a .rrd database which is collecting data from a temperature gauge. Now I have a second gauge, so I'd like to add it to the existing .rrd database. I have tried many times with the "rrdtool tune" command, but when I afterwards run "rrdtool info" on my database, the new data source (the second gauge) I tried to insert is not there.
How can I do this?
The command you need is, as you say, rrdtool tune. The documentation is available online at https://oss.oetiker.ch/rrdtool/doc/rrdtune.en.html
The ability to extend an RRA and to add or remove a DS was only added late in RRDTool 1.4. Check that you are not using an older version of RRDTool, as if you are, you will not be able to use this feature until you upgrade.
I just checked, and I'm using RRDTool 1.4, so I should not have problems. Anyway, this is the command I used:
/usr/bin/rrdtool tune TEMPCucina.rrd DS:METEOTEMPEXT:GAUGE:1200:U:U RRA:AVERAGE:0.5:1:180000
I got this back from the computer:
DS[TEMPCucina] typ: GAUGE hbt: 1200 min: nan max: nan
But it seems that I'm not able to write into TEMPCucina.rrd
And if I try to perform the following command:
rrdtool info TEMPCucina.rrd
I just get the following, and it seems that no new gauge has been created
filename = "TEMPCucina.rrd"
rrd_version = "0003"
step = 60
last_update = 1510780261
header_size = 556
ds[TEMPCucina].index = 0
ds[TEMPCucina].type = "GAUGE"
ds[TEMPCucina].minimal_heartbeat = 1200
ds[TEMPCucina].min = NaN
ds[TEMPCucina].max = NaN
ds[TEMPCucina].last_ds = "18"
ds[TEMPCucina].value = 1,8000000000e+01
ds[TEMPCucina].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].rows = 30000
rra[0].cur_row = 1304
rra[0].pdp_per_row = 1
rra[0].xff = 0,0000000000e+00
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 0
(when I try to write I get this, but I don't know how to proceed at this point)
ERROR: TEMPCucina.rrd: illegal attempt to update using time 1510780527 when last update time is 1510780527 (minimum one second step)
I finally did it, but I wasn't able to use the rrdtool tune function.
I eventually found here how to dump the database, modify it, and restore it to its original location (so I could also correct some data).
This is not what I was searching for, but it solved my problem so I want to share it.
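For reference, here is a rough sketch of driving that dump/edit/restore round trip from Python; the file names are placeholders, and the only external commands assumed are rrdtool dump and rrdtool restore:
import subprocess

# 1. Dump the existing RRD to an XML file.
subprocess.run(['rrdtool', 'dump', 'TEMPCucina.rrd', 'TEMPCucina.xml'], check=True)

# 2. Edit TEMPCucina.xml (by hand or with a script): duplicate the existing
#    <ds> definition for the new gauge and add a value for the new data source
#    to every row of every archive.

# 3. Restore the edited XML into a new RRD file.
subprocess.run(['rrdtool', 'restore', 'TEMPCucina.xml', 'TEMPCucina_new.rrd'], check=True)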

JVM is not ready after 10 seconds

I configured SparkR normally from the tutorials, and everything was working. I was able to read the database with read.df, but suddenly nothing else works, and the following error appears:
Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds
Why does it suddenly appear now? I've read about other users with the same problem, but the solutions given did not work. Below is my code:
Sys.setenv(SPARK_HOME= "C:/Spark")
Sys.setenv(HADOOP_HOME = "C:/Hadoop")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# initialize the SparkR environment
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(SPARK_MEM="4g")
#Create a spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Try the few things below:
Check if c:/Windows/System32/ is there in the PATH.
Check if spark-submit.cmd has proper execute permissions.
If both of the above are true and it still gives the same error, delete the Spark directory and create a fresh one by unzipping the Spark gzip file again.
I'm a beginner in R, and I solved the same "JVM is not ready after 10 seconds" problem by installing the JDK (version 7+) before installing SparkR on my Mac. It works well now. Hope this helps with your problem.

Memory error when running medium sized merge function ipython notebook jupyter

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using an IPython Jupyter notebook.
Here is a sample of the data:
   timestamp   Namecoin_cap
0  2013-04-28       5969081
1  2013-04-29       7006114
2  2013-04-30       7049003
Each frame is around 1000 lines long
Here's the error in detail; I've also included my merge function.
My system is currently using 64% of its memory.
I have searched for similar issues but it seems most are for very large arrays >1GB, my data is relatively small in comparison.
EDIT: Something is suspicious. I wrote a beta program before to test with 4 dataframes; I exported it through pickle and it is 500 KB. Now, when I try to export the 100-frame version, I get a memory error. It does, however, export a file that is 2 GB, so I suspect that somewhere down the line my code has created some kind of loop, creating a very large file. NB: the 100 frames are stored in a dictionary.
EDIT2: I have exported the script to .py:
http://pastebin.com/GqaHr7xc
This is a .xlsx that contains the asset names the script needs.
The script fetches data regarding various assets, then cleans it up and saves each asset to a dataframe in a dictionary.
I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Otherwise, please advise on what tests I can run.
EDIT3: I'm finding it really hard to understand why this is happening; the code worked fine in the beta, and all I have done now is add more assets.
EDIT4: I ran a size check on the object (the dict of dataframes) and it is 1,066,793 bytes.
EDIT5: The problem is in the merge function for coin 37
for coin in coins[:37]:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin],
                               left_on='timestamp', right_on='timestamp', how='left')
This is when the error occurs: `for coin in coins[:36]:` doesn't produce an error, however `for coin in coins[:37]:` does. Any ideas?
EDIT6: the 36th element is 'Syscoin'. I did coins.remove('Syscoin'), but the memory problem still occurs, so it seems to be a problem with the 36th element of coins no matter which coin it is.
EDIT7: goCards' suggestions seemed to work; however, the next part of the code:
merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp',axis=1).sum(axis=1)
produces a memory error. I'm stumped.
In regard to storage, I would recommend using a simple CSV over pickle. CSV is a more generic format: it is human readable, and you can check your data quality more easily, especially as your data grows.
file_template_string = '%s.csv'
for eachKey in dfDict:
    filename = file_template_string % eachKey
    dfDict[eachKey].to_csv(filename)
If you need to date the files you can also put a timestamp in the filename.
import time
from datetime import datetime
cur = time.time()
cur = datetime.fromtimestamp(cur)
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))
There are some obvious errors in your code.
for coin in coins:  # line 61, 89
for coin in data:   # should be

df = data2['Namecoin']  # line 87
keys = list(data2.keys())  # list() so .remove() works on Python 3
keys.remove('Namecoin')
for coin in keys:
    df = pd.merge(left=df, right=data2[coin],
                  left_on='timestamp', right_on='timestamp', how='left')
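If the repeated pairwise merges keep exhausting memory, one alternative worth trying (my own suggestion, not part of the original answer) is to index every frame on timestamp and concatenate them once along the columns:
import pandas as pd

# Assumption: data2 is the dict of per-asset dataframes, each with a 'timestamp' column.
frames = [df.set_index('timestamp') for df in data2.values()]

# A single concat along the columns instead of ~100 successive merges.
# Note: this is an outer join on the timestamp index, whereas the original code
# used how='left'; drop or fill the extra rows afterwards if that matters.
merged = pd.concat(frames, axis=1).reset_index()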
Same issue happened to me!
"MemoryError:" by notebook on execution of pandas. I have also screen printed quite lot of observations before issued happened.
Reinstalling Anaconda didn't help. Later realized that i was working with IPython notebook instead Jupyter notebook. Switched to Jupyter notebook. Everything worked fine!