I have a PDF with data in tabular format. It has 6 columns, but the columns are not separated by boundaries, so when I extract the data using pdfplumber all the data comes out in one cell, and I want it in separate cells.
How can I do that?
For your reference:
15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr
Thanks in advance
You can use the extract_tables() method to get the tables into a DataFrame.
Here I only show the code for the 0th page; use a for loop to extract the tables from all the pages (a sketch of that loop is included after this snippet).
import pandas as pd
import pdfplumber

path = file_path  # path to your PDF
pdf = pdfplumber.open(path, password="")
# take the first table detected on page 0 and load it into a DataFrame
table = pd.DataFrame(pdf.pages[0].extract_tables()[0])
Change the code as per your requirements.
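As a rough sketch of that loop (assuming file_path points to your PDF and that pdfplumber detects at least one table per page), it could look like this:
import pandas as pd
import pdfplumber

rows = []
with pdfplumber.open(file_path, password="") as pdf:
    for page in pdf.pages:                    # walk every page
        for table in page.extract_tables():   # each table is a list of rows
            rows.extend(table)

df = pd.DataFrame(rows)
print(df.head())
If the page has no ruling lines, extract_tables() may not split the columns the way you expect, so treat this only as a starting point.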
I have an Excel file with two columns that I want to combine, but one of them is in datetime form and the other is an object (actually a time). What I want to do is convert the object column to datetime format.
I've tried everything I can think of but I keep getting an error.
Edit:
import pandas as pd
dataFrame = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data.xlsx')
dataFrame.head()
output: (screenshot of the dataFrame.head() result)
and my error: (screenshot of the error message)
If I'm understanding correctly, you'd want to split the "Time" column on whitespace and take index 0. Then use .str.cat to concatenate the string columns, .pop the old columns, and wrap it all in pd.to_datetime.
# keep only the part of "Time" before the first whitespace
df["Time"] = df["Time"].str.split(r"\s+").str[0]
# join the date and time strings, then parse the result into a single datetime column
df["Datetime"] = pd.to_datetime(df.pop("Date").astype(str).str.cat(df.pop("Time"), sep=" "))
For a project I receive datasets in the form of text files, generated by the measuring software of a machine. The data in the files is separated by spaces and has no header; example of a row:
Mo 27.06.2022 12:01:11 MP2 mv:(mean. 5s): 4,824 mg/mü org.C
When loading this data using
my_data <- read.table("File.txt", header = FALSE, sep = "", dec = ",", fill=TRUE, na.strings=c("","NA"))
I obtain 9 columns in the following format (example), as intended.
|Mo|27.06.2022|12:01:11|MP2|mv:(mean.| 5s):| 4,824| mg/mü| org.C|
However, sometimes the data set starts with a notification from the machine (example):
Mo 27.06.2022 11:42:04 {SE14} service requestend
When this happens, the 'regular' 9-column rows are split across two rows (example):
Row 1: Mo|27.06.2022|11:58:26|MP1|mv:(mean.| 5s):|
Row 2: 7,858| mg/mü |org.C
How do I tell R not to split these rows in two? As I understand it, this happens because earlier in the text file an input of only 6 columns is recognized.
This is a script that we will use for years to come, so help is greatly appreciated!
I've tried removing the fill argument from the read.table call, I've tried removing the na.strings, and of course I've looked for answers on Stack Overflow, but I was not able to find this specific problem.
How can the data be populated into its own fields? See the column called 'vSoapText'.
SELECT [iFormID]
,[iPracID]
,[iPatID]
,[vFormName]
,[vSoapText]
,[tHTML]
,[iUserID]
,[dDATE]
FROM CustomForm
WHERE iPatID = 40 AND vFormName = 'Diabetes Screening'
I need to break the data apart wherever '/br' appears.
The only way I've found to do this is with a string-splitting function. You could use SSIS instead, but a function is the approach I've used:
https://sqlperformance.com/2012/07/t-sql-queries/split-strings
I am trying to create a new field during indexing, however the fields become columns instead of values when I try to concatenate. What am I doing wrong? I have looked in the docs and it seems correct according to ..
Would appreciate some help on this.
e.g.
.csv file
**Header1**, **Header2**
Value1 ,121244
transforms.conf
[test_transformstanza]
SOURCE_KEY = fields:Header1,Header2
REGEX = ^(\w+\s+)(\d+)
FORMAT = testresult::$1.$2
WRITE_META = true
fields.conf
[testresult]
INDEXED = True
The regex is good and creates two groups from the data, but why is it creating a new field instead of assigning the value to testresult? If I use testresult::$1 or testresult::$2 on its own it works fine, but when concatenating it creates multiple headers with the value as the header name. Is there an easier way to concatenate fields? For example, if you have a CSV file with header names, can you not just refer to the header names? (I know how to do this using calculated fields, but I want to do it during indexing.)
Thanks
I am stuck trying to run an economic model using MATLAB - at the data importing part. For most of my code I'm using a freeware toolbox called IRIS.
I have a quarterly dataset with 14 variables and 160 datapoints. Essentially the dataset is a 15x161 matrix, including the dates (column 1) and variable names (B1:O1).
The command used for uploading data on IRIS is
d = dbload('filename.csv')
but this isn't working. MATLAB does create a 1x1 array called d with fields under it (one for each variable), but all the cells display NaN - not a number.
Why is this happening?
I checked the tutorials on the IRIS toolbox website and tried running and loading a sample dataset from there using this command, but it leads to the same problem. Everywhere I checked - including MATLAB help - this seems to be the correct command to use with IRIS, but somehow it isn't working.
I also tried uploading the data directly using MATLAB functions and not IRIS. The command I'm using is:
d = dataset('XLSFile','filename.xls','ReadVarNames',true)
This works and I can see all the variable names, but MATLAB can't read the dates. I tried xlsread and importdata as well, but they don't read the variable names. Is there any way for me to load the entire Excel sheet with the variable names and dates?
It would be best if I could get the IRIS command to work, since the rest of my code would be compatible with that.
The dataset looks somewhat like this:
HO_GDP HO_CPI HO_CPI HO_RS HO_ER HO_POIL....
4/1/1970 82.33 85.01 55.00 99.87 08.77
7/1/1970 54.22 8.98 25.22 95.11 91.77
10/1/1970 85.41 85.00 85.22 95.34 55.00
1/1/1971 85.99 899 8.89 85.1
You can use the TEXTSCAN function to read the CSV file in MATLAB:
%# some options
numCols = 15; %# number of columns
opts = {'Delimiter',',', 'MultipleDelimsAsOne',true, 'CollectOutput',true};
%# open file for reading
fid = fopen('filename.csv','rt');
%# read header line
headers = textscan(fid, repmat('%s',1,numCols), 1, opts{:});
%# read rest of data rows
%# 1st column as string, the other 14 as floating point
data = textscan(fid, ['%s' repmat('%f',1,numCols-1)], opts{:});
%# close file
fclose(fid);
%# collect data
headers = headers{1};
data = [datenum(data{1},'mm/dd/yyyy') data{2}];
The result for the above sample you posted (assuming values are comma-separated):
>> headers
headers =
'HO_GDP' 'HO_CPI' 'HO_CPI' 'HO_RS' 'HO_ER' 'HO_POIL'
>> data
data =
7.1962e+05 82.33 85.01 55 99.87 8.77
7.1971e+05 54.22 8.98 25.22 95.11 91.77
7.198e+05 85.41 85 85.22 95.34 55
7.1989e+05 85.99 899 8.89 85.1 0
Note how in the last line of the code we convert the date column to serial date number, so that we can store the entire data in one numeric matrix. You can always go back to string representation of dates using DATESTR function:
>> datestr(data(:,1))
ans =
01-Apr-1970
01-Jul-1970
01-Oct-1970
01-Jan-1971