Using iTextSharp v5.5.13
I have a huge amount of PDF files I need to parse. About 5% of them have a table with data I also need.
The table looks like this:
Most of the time the line I need is parsed as
2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39
I can work with that. I split by space and it works.
But sometimes the month name has an extra space: janu ari
I know I can override the strategies to get rid of these extra spaces. I'm already using it with the rest of the pdf (ITextExtractionStrategy), but for this table, I'm using a rectangle strategy:
var rect = new System.util.RectangleJ(70, 425, 460, 200);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy =
new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(), filter);
var lines = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy).Split('\n');
My override looks like this:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
{
var dist = chunk.DistanceFromEndOf(previousChunk);
return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f;
}
}
I found this Googling. But it doesn't solve my problem.
In the case of janu ari dist is larger than -chunk.CharSpaceWidth and I'm not sure what to do next.
Please let me know if I should not use a rectangle strategy for this table and should use a different approach instead.
If your data in this type of table is always going to be in the same format, then you could take a different approach: just accept whatever data your MyLocationTextExtractionStrategy is throwing at you, and then massage that data into a format that you can use.
In this case, your data is always:
2 groups of:
1 or 2 digits (day of the month)
some characters (name of the month)
4 groups of:
Euro symbol
some digits (at least one)
comma
2 digits
In 2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39 the spaces are separation characters, but with such well structured data, you don't even need spaces. So just drop them, and then your data becomes 2januari15januari€49,49€21,57€15,09€34,39.
Now you can use a regular expression with some capture groups, to massage your data into something palatable.
2 groups of:
[0-9]{1,2}
[a-z]*
4 groups of:
€
[0-9]{1,}
,
[0-9]{2}
As you wrote yourself in the comments, one possible resulting regular expression could be:
new Regex(@"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})")
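The same idea can be sketched outside of C#. The following is a minimal Python illustration, not part of the original iTextSharp code: the names `ROW_PATTERN` and `parse_row` are made up for this example, and it assumes the extracted line contains nothing but the two dates and the four amounts.

```python
import re

# One amount: euro sign, at least one digit, comma, two digits.
AMOUNT = r"(€[0-9]+,[0-9]{2})"

# Two (day, month) pairs followed by four amounts, with all whitespace removed.
ROW_PATTERN = re.compile(r"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)" + AMOUNT * 4)


def parse_row(line):
    """Drop every whitespace character, then split the row into its 8 fields."""
    compact = re.sub(r"\s+", "", line)
    match = ROW_PATTERN.fullmatch(compact)
    return match.groups() if match else None


# The stray space in "janu ari" no longer matters.
print(parse_row("2 janu ari 15 januari € 49,49 € 21,57 € 15,09 € 34,39"))
```

Because the spaces are removed before matching, a month name that was split during extraction is rejoined for free.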
I have a column in a df with a size range in different size formats.
artikelkleurnummer size
6725 0161810ZWA B080
6726 0161810ZWA B085
6727 0161810ZWA B090
6728 0161810ZWA B095
6729 0161810ZWA B100
The size range also contains other size formats, such as XS - XXL, 36-50, 36/38 - 52/54, ONE, XS/S - XL/XXL and 363-545.
I have tried to remove the leading '0' from all sizes that start with a letter in the range A to K. For example, I want to change B080 into B80, while B100 stays B100.
steps:
1 look for items in column ['size'] with first letter of string in range (A:K),
2 if True change second position in string into ''
for range I use:
from string import ascii_letters
def range_alpha(start_letter, end_letter):
return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) + 1]
then I've tried a for loop
for items in df['size']:
if df.loc[df['size'].str[0] in range_alpha('A','K'):
df.loc[df['size'].str[1] == ''
message
SyntaxError: unexpected EOF while parsing
what's wrong?
You can do it with regex and the pd.Series.str.replace -
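import pandas as pd
df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer size".split()
# Keep group 1 (the letter) and group 3 (the digits); drop group 2 (the leading zeros)
replacement = lambda mpat: mpat.groups()[0] + mpat.groups()[2]
# regex=True is needed for a callable replacement in recent pandas versions
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d+)', replacement, regex=True)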
Output
artikelkleurnummer size size_cleaned
0 0161810ZWA B080 B80
1 0161810ZWA B085 B85
2 0161810ZWA B090 B90
3 0161810ZWA B095 B95
4 0161810ZWA B100 B100
TL;DR
Find a pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.
Slightly longer explanation
Regexes are very handy but also hard. In the solution above, we are trying to find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts -
A letter from A-K
Zero or more 0's
Some more digits
In regex terms - this can be written as r'([a-kA-K])(0*)(\d+)'. Note that the 3 brackets make up the 3 parts - they are called groups. It might make a little or no sense depending on how exposed you have been to regexes in the past - but you can get it from any introduction to regexes online.
Once we have the parts, what we want to do is retain everything else except part-2, which is the 0s.
The pd.Series.str.replace documentation has the details on the replacement portion. In essence, replacement is a function that takes the regex match object as input and returns the replacement string.
In the first part we identified three groups. They are accessed with mpat.groups(), which returns a tuple containing the matched text for each group. We want to reconstruct the string with the middle part excluded, which is what the replacement function does.
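To see the three groups in isolation, here is a small sketch that uses only the standard `re` module instead of pandas. The names `PATTERN` and `drop_leading_zeros` are illustrative and do not appear in the answer above.

```python
import re

# Group 1: a letter A-K, group 2: leading zeros, group 3: the remaining digits.
PATTERN = re.compile(r"([a-kA-K])(0*)(\d+)")


def drop_leading_zeros(match):
    """Rebuild the matched text from groups 1 and 3, leaving out the zeros."""
    letter, zeros, digits = match.groups()
    return letter + digits


for size in ["B080", "B100", "XS", "36/38"]:
    # Strings that do not contain the pattern are returned unchanged.
    print(size, "->", PATTERN.sub(drop_leading_zeros, size))
```

Sizes such as XS or 36/38 never match the pattern, so they pass through untouched.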
sizes = [{"size": "B080"},{"size": "B085"},{"size": "B090"},{"size": "B095"},{"size": "B100"}]
def range_char(start, stop):
return (chr(n) for n in range(ord(start), ord(stop) + 1))
for s in sizes:
if s['size'][0].upper() in range_char("A", "K"):
s['size'] = s['size'][0]+s['size'][1:].lstrip('0')
print(sizes)
This uses a list of dicts as example data.
I have a decimal price which can be either 0.00002001 or 0.00002.
If the number is like 0.00002, I always want to pad it with zeros on the right, so that it is displayed as 0.00002000. If the number is 0.00002001, nothing should be added.
I came across some examples, including some on MSDN, and tried
price.ToString.Format("{0:F4}", price)
but it doesn't actually change anything in the number.
And in the case number is like 123456789 I want it to display 123.456.789 which I've half solved using ToString("N2") but it's displaying also a .00 decimals which I don't want.
Some special cases here between the fractional and whole numbers, so they need to be handled differently.
Private Function formatWithTrailingZeros(number As Double) As String
If number Mod 1 > 0 Then ' has a fractional component
Return $"{number:0.00000000}"
Else
Dim formattedString = $"{number:N2}"
Return formattedString.Substring(0, formattedString.Length - 3)
End If
End Function
Dim price = 0.00002001
Console.WriteLine(formatWithTrailingZeros(price))
price = 0.00002
Console.WriteLine(formatWithTrailingZeros(price))
price = 123456789
Console.WriteLine(formatWithTrailingZeros(price))
price = 123456789.012345
Console.WriteLine(formatWithTrailingZeros(price))
0.00002001
0.00002000
123,456,789
123456789.01234500
If your second case with 123.456.789 is not based on your current culture, then you may need to replace , with . such as
Return formattedString.Substring(0, formattedString.Length - 3).Replace(",", ".")
Since you are using . both as a decimal separator and a thousands separator, I'm not sure how my example of 123456789.012345000 should look, but since you didn't ask, I'm not going to guess.
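For comparison, the same two-branch logic can be sketched in Python rather than VB.NET. This is only an illustration under a few assumptions: the function name is made up, `Decimal` is used to avoid binary floating-point noise, and the thousands separator is hard-coded to `.` instead of coming from a culture setting.

```python
from decimal import Decimal


def format_with_trailing_zeros(number: Decimal) -> str:
    """Fractional values get 8 decimals; whole values get grouped digits only."""
    if number % 1 != 0:
        # Has a fractional component: pad (or round) to 8 decimal places.
        return f"{number:.8f}"
    # Whole number: group by thousands, then swap ',' for '.' as the separator.
    return f"{number:,.0f}".replace(",", ".")


for value in ["0.00002001", "0.00002", "123456789"]:
    print(format_with_trailing_zeros(Decimal(value)))
```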
I'm trying to create a new column called 'team'. In the image below you see different types of codes. The first digit of the code is the team someone is in, if the code consists of 3 characters. E.g. 315 = team 3, 240 = team 2, and 3300 = NULL.
In the image below you can see my data flow so far and the expression I have tried, but doesn't work.
You forgot the parentheses () in your regex.
Try:
^([0-9]{3})$
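Note that the capture group in the pattern above contains all three digits. If only the first digit is wanted as the team, as the question describes, the group can be narrowed. The Python sketch below shows that variant; it is an assumption about the intended result, not the expression syntax of the tool in the question, and `TEAM_PATTERN` and `team_of` are illustrative names.

```python
import re

# Exactly three digits; only the first one is captured as the team.
TEAM_PATTERN = re.compile(r"^([0-9])[0-9]{2}$")


def team_of(code):
    """Return the team digit for a 3-digit code, or None for anything else."""
    match = TEAM_PATTERN.match(str(code))
    return int(match.group(1)) if match else None


for code in ["315", "240", "3300"]:
    print(code, "->", team_of(code))
```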
1 - 2 of 2
Above is my text. It comes from the paging of a web application. How do I extract the last number of the above text, so that I get the count of the list on that page and can run a loop based on that number?
You can use substring
Let's consider your example. You have a String 1 - 2 of 2 (pagination probably)
Each individual character sits at a specific index of the String:
1 = 0
space = 1
- = 2
space = 3
etc.
String has a set of methods to perform various tasks. One of them is length() which gives you number of characters in your String
What you can do is to pass your length of String to substring.
Example:
myString.substring(0,1) will give you results of 1
myString.substring(0,myString.length()) will give you results of 1 - 2 of 2
Additional info: myString.length() is an int type so you can perform math operations like + or -
myString.substring(0,myString.length()-1) will give you results of 1 - 2 of
I gave you the tools, now it's time for you to find the solutions.
You could just split the string using the spaces and then grab the last element of the split array. That should cover you even if the last number has more than one digit. Throw in a trim, just in case, to remove any leading/trailing white space.
String[] splitter = pageCount.trim().split(" ");
System.out.println(splitter[splitter.length - 1]);
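An alternative that does not depend on where the spaces are is to collect every run of digits and take the last one. The sketch below is written in Python rather than Java purely as an illustration, and `last_number` is a hypothetical helper name.

```python
import re


def last_number(text):
    """Return the last run of digits in the text as an int, or None if there is none."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None


print(last_number("1 - 2 of 2"))
print(last_number(" 11 - 20 of 137 "))
```

This also copes with multi-digit totals and with leading or trailing whitespace.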
How can I pad a sentence with white space so that it is printed in a 'block'?
I want to print a receipt. For a given item I want to print the quantity, item name, and price.
12 x Example short name £2.00
1 x This is an example of
a long name £10.00
The 'block' I am referring to is shown below.
12 x |Example short name | £2.00
1 x |This is an example of |
|a long name | £10.00
Using the example above, I can easily handle the formatting of the quantity and the price using string formatting. For the item name, I think the best method is to split the item name in code and use the wrap functionality of StringTemplate, but I don't know how to pad the remainder of the line with whitespace.
I am using .Net and StringTemplate 4. Here is a simplified version of the template I have. Assuming I have an item with the properties Quantity, ItemName (split into an array of strings), and Price.
<item.Quantity; format="{0,4}"> x <item.ItemName; wrap, separator=" ", anchor><item.Price; format="£{0,8}">
Now at the moment the only way I can think to get it to work is to calculate the white space in code and add it to the item name array.
So is there a neat way to do it in StringTemplate?
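For reference, the fallback described above (calculating the white space in code) could look roughly like the following. This is a Python sketch of the padding logic only, not a StringTemplate solution; the function name and the column width of 22 are assumptions made for the example.

```python
import textwrap


def receipt_lines(quantity, name, price, name_width=22):
    """Wrap the item name into a fixed-width block and pad each line with spaces."""
    wrapped = textwrap.wrap(name, width=name_width) or [""]
    lines = []
    for i, part in enumerate(wrapped):
        # Quantity only on the first line; later lines are indented to match.
        prefix = f"{quantity:>4} x " if i == 0 else " " * 7
        # Price only on the last line of the block.
        suffix = f"£{price:>8.2f}" if i == len(wrapped) - 1 else ""
        lines.append(prefix + part.ljust(name_width) + suffix)
    return lines


for line in receipt_lines(12, "Example short name", 2.0):
    print(line)
for line in receipt_lines(1, "This is an example of a long name", 10.0):
    print(line)
```

Each name line is padded to the same width with `ljust`, so the price always starts in the same column regardless of how many lines the name takes.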