Different table_areas on multiple page pdf - python-camelot

I would like to extract tables from a multi-page PDF. Because of the table properties, I need to pass flavor='stream' and table_areas to read_pdf for my table to be detected properly. My problem is that the position of the table is different on each page (the first page has an address header and the other pages do not).
I have tried providing several areas to the read_pdf function, as follows:
camelot.read_pdf(file, pages='all', flavor='stream', table_areas=['60, 740, 580, 50','60, 470, 580, 50'])
but this results in 2 tables being detected per page. How can I specify the table_areas for each page separately?
I have also tried running read_pdf several times with different pages/table_areas, however I then cannot append the separate results together into a single object:
tables = camelot.read_pdf(file, pages='1', flavor='stream', table_areas=['60, 470, 580, 50'])
tables.append(camelot.read_pdf(file, pages='2-end', flavor='stream', table_areas=['60, 740, 580, 50']))
raises an error, since append is not a method of the resulting tables object.
Is there a way to concatenate the results of several calls of the read_pdf function?

As you noticed, you can't add items directly to the TableList object.
Instead, you can manipulate the TableList's _tables attribute (_tables is a plain list), in the following way:
my_tables = camelot.read_pdf(file, pages='1', flavor='stream', table_areas=['60, 470, 580, 50'])
more_tables = camelot.read_pdf(file, pages='2-end', flavor='stream', table_areas=['60, 740, 580, 50'])
my_tables._tables.extend(more_tables._tables)
Now my_tables should contain the tables from every page.
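If more page ranges are involved, the same idea can be wrapped in a loop. A minimal sketch (the file name is a placeholder; the areas are the ones from the question):

import camelot

file = 'report.pdf'  # placeholder path to the multi-page PDF

# per-page-range areas: the first page has an address header,
# so its table starts lower than on the remaining pages
page_settings = {
    '1': ['60, 470, 580, 50'],
    '2-end': ['60, 740, 580, 50'],
}

all_tables = []
for pages, areas in page_settings.items():
    result = camelot.read_pdf(file, pages=pages, flavor='stream', table_areas=areas)
    all_tables.extend(result._tables)  # _tables is the underlying list, as above

for table in all_tables:
    print(table.parsing_report)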


How to search for a specific word within a text?

I have a file, of type txt, with the following text:
The dataset is available at: https://archive.ics.uci.edu/ml/datasets.php
The file name is Cancer_Data.xml
This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.
I need to search this text for the word that contains "xml". I tried the following implementation:
import pandas as pd

with open(local_arquivo, "r") as file_read:
    for line in file_read:
        var_split = line.split()
        for i in range(0, len(var_split)):
            if(var_split[i].str.contains('xml')):
                archive_name = var_split.iloc[i]
The idea was to separate the text using the split function and then look for the part that contains the 'xml'. However, when I run it, the following error appears:
AttributeError: 'str' object has no attribute 'str'
I would like the output to be:
archive_name = Cancer_Data.xml
Try
if('xml' in var_split[i]):
source: https://docs.python.org/3/reference/expressions.html#in
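Putting it together, a minimal version of the loop (pandas is not needed, since line.split() returns plain Python strings; local_arquivo is the path variable from the question, given a placeholder value here):

local_arquivo = 'dataset_description.txt'  # placeholder path to the text file

archive_name = None
with open(local_arquivo, "r") as file_read:
    for line in file_read:
        for word in line.split():
            if 'xml' in word:
                archive_name = word

print(archive_name)  # -> Cancer_Data.xml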

Spark (Databricks) unmanaged table from SQL not processing headers

I am trying to create an unmanaged table in Spark (Databricks) from a CSV file using the SQL API, but the first row is not being used as headers.
Image 2 shows that the first row is handled correctly when using the DataFrame API to create an unmanaged table; the DataFrame was loaded from the same CSV file.
However, Image 1 shows that creating an unmanaged table from a CSV data source in SQL does not process the first row as headers. Am I leaving out some "headers" option?
And if so, how would that be coded?
You just need to provide OPTIONS as specified in the documentation.
In that options block you can list key/value pairs that match the options of the Spark CSV reader. For example, options ('header' = 'true', 'sep' = ',') will make Spark treat the first line as the header and set the separator to a comma. You can also add 'inferSchema' = 'true' to the options; in that case you can omit the column declarations entirely and Spark will infer the schema for you (fine for small datasets, but not for big ones):
create table test.test using csv
options ('header' = 'true', 'sep' = ',', 'inferSchema' = 'true')
location '/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv'
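For comparison, a rough DataFrame API equivalent, the approach the question says already worked (a sketch assuming a Databricks notebook where spark is predefined; the output path and table name are hypothetical):

# read the CSV with the same reader options as the SQL example
df = (spark.read
      .option('header', 'true')
      .option('sep', ',')
      .option('inferSchema', 'true')
      .csv('/databricks-datasets/Rdatasets/data-001/csv/COUNT/affairs.csv'))

# saving with an explicit path keeps the table unmanaged
(df.write
   .option('path', '/mnt/tables/affairs')   # hypothetical storage location
   .saveAsTable('test.test_df'))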

Python write function saves dataframe.__repr__ output but truncated?

I have a dataframe output as a result of running some code, like so
df = pd.DataFrame({
    "i": self.direct_hit_i,
    "domain name": self.domain_list,
    "j": self.direct_hit_j,
    "domain name 2": self.domain_list2,
    "domain name cleaned": self.clean_domain_list,
    "domain name cleaned 2": self.clean_domain_list2
})
All I was really looking for was a way to save this data to a file (e.g. txt or csv) such that the columns of data align with their headers. I was using df.to_csv() with a \t delimiter, but because the strings and numbers have different lengths, the elements within each row never quite line up under the corresponding header. So I resorted to using
with open('./filename.txt', 'w') as fo:
    fo.write(df.__repr__())
Bear in mind the data in the dataframe are very long lists. For small lists this produces output that is exactly what I want, but when the lists are very big the output is truncated. I would like it not to be truncated, since I need to manually scroll through and verify things.
Try the syntax:
with open('./filename.txt', 'w') as fo:
    fo.write(f'{df!r}')
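Note that f'{df!r}' is just df.__repr__() again, so it is still subject to pandas' display limits. If the full, untruncated frame is what's needed, one option (my addition, not part of the original answer) is df.to_string(), or temporarily lifting the display limits:

# write the full frame without pandas' row/column truncation
with open('./filename.txt', 'w') as fo:
    fo.write(df.to_string())

# or keep the repr-based approach but lift the limits just for this write
import pandas as pd

with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', None):
    with open('./filename.txt', 'w') as fo:
        fo.write(f'{df!r}')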
Another way of doing this export to CSV would be to use a tool like Mito, which, full disclosure, I'm the author of. It should allow you to export to CSV more easily than the process here!

Group Pandas: How to concat or merge/join/append two csv files with same index but different extensions in grouped data?

I'd like to concat or merge or append/join two CSV files that share the same ID index but carry different columns for the same ID. The data are also grouped by ID. The 1st file looks like this:
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
and the 2nd file looks like this:
ID,year,from1,to1
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15825,16523
810033622,2000,13211,14876
810033622,2001,14761,14987
I set ID as the index for both files after reading them into dataframes and then concatenated them, but I get the error message "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".
I've tried the following code:
sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join = 'outer', sort = False, axis = 1)
I've also tried concatenating the two files without setting an index; the result seemed to have the correct rows, but the values were not correctly related horizontally.
I've also tried merge, but it produced many unnecessary duplicated rows within each group. Since each group has different year and age values, I found it quite difficult to delete those newly generated rows with this method.
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
Maybe somebody can suggest a way to easily delete these duplicates; that would also work for me, because the merged file became huge! Thanks in advance!
The expected result would include all the different values from both CSV files. It is something like this:
ID,year,age,from1,to1
810006862,2000,49,,
810006862,2001,,,
810006862,2002,,15341,15705
810006862,2003,52,15706,16070
810006862,2004,,16071,16436
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15825,16523
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,13211,14876
810033622,2001,25,14761,14987
I've searched online for similar posts for quite some time, but have been unable to solve my problem. Can anybody offer a clue how to do this? Thanks a lot!
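One approach that reproduces the expected output (a sketch I am adding, not from the original post): merge on both ID and year, so each (ID, year) pair lines up exactly once and no duplicated rows are generated:

import pandas as pd

sp = pd.read_csv('sp1.csv')
op = pd.read_csv('op1.csv')

# merging on both keys avoids the row multiplication seen when merging on ID alone
full = pd.merge(sp, op, on=['ID', 'year'], how='outer', sort=False)
full = full.sort_values(['ID', 'year'])
full.to_csv('merged.csv', index=False)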

Append new columns to HDFStore with pandas

I'm using pandas and making an HDFStore object. I calculate 500 columns of data and write them to a table-format HDFStore. Then I close the file, delete the data from memory, compute the next 500 columns (labelled by an increasing integer), open up the store, and try to append the new columns. However, it doesn't like this and gives me an error:
invalid combinate of [non_index_axes] on appending data [[(1, [500, 501, 502, ...])]] vs current table [[(1, [0, 1, 2, ...])]]
I'm assuming it only allows appending more rows, not columns. So how do I add more columns?
HDF5 tables have a fixed structure, so you cannot easily add a column, but a workaround is to concatenate the different DataFrames and then re-write the result into the HDF5 file.
import pandas as pd

hdf5_files = ['data1.h5', 'data2.h5', 'data3.h5']
df_list = []
for file in hdf5_files:
    df = pd.read_hdf(file)
    df_list.append(df)

# Concatenate column-wise so the blocks of columns sit side by side
result = pd.concat(df_list, axis=1)
# You can now use the result DataFrame to access all of the data from the HDF5 files
Does this solve your problem?
Remember, HDF5 is not designed for efficient append operations; you should consider a database system if you need to frequently add new columns to your data, imho.
You have kept your column titles in the code as [1, 2, 3, ...] and are trying to append a DataFrame with different columns [500, 501, 502, ...].
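A runnable sketch of that rewrite workaround applied to the question's setup (the file name and toy column blocks are placeholders): read the existing table back, concatenate the new block of columns side by side, and overwrite the store:

import numpy as np
import pandas as pd

# toy stand-ins for the question's blocks of 500 columns
first_block = pd.DataFrame(np.random.rand(10, 3), columns=['c0', 'c1', 'c2'])
second_block = pd.DataFrame(np.random.rand(10, 3), columns=['c3', 'c4', 'c5'])

first_block.to_hdf('store.h5', key='data', format='table', mode='w')

# appending columns directly raises the non_index_axes error, so instead:
existing = pd.read_hdf('store.h5', key='data')
combined = pd.concat([existing, second_block], axis=1)

# mode='w' re-writes the file; format='table' keeps it appendable row-wise later
combined.to_hdf('store.h5', key='data', format='table', mode='w')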