I would like to know if it is possible to format the following text with links:
match 1
1: http://www.google.it
2: http://www.example.it
3: http://www.example2.it
match2 15:30
1: http://www.google.it
2: http://www.example.it
3: http://www.example2.it
into
match 1
1: LINK1
2: LINK2
3: LINK3
match2 15:30
1: LINK1
2: LINK2
3: LINK3
I would like to display it in a RichTextBox or any other control. I am just wondering whether it's possible to "hide" the real link under a link label. Thanks
I was trying to read a CSV file using:
df= pd.read_csv('Diff_Report.csv',on_bad_lines='skip',encoding='cp1252',index_col=None)
Input Example
But the code outputs as in the following screenshot. Why is it happening like this?
Output
#Ammu07
Try using pd.read_excel()
Solution 1: -
It looks like you are displaying the first 5 rows of df2
Solution 2: -
Check if it is in another encoding or just utf encoding.
Solution 3: -
CSV files are separated by commas, but maybe your data contains a comma, which should be cleared.
Solution 4: -
Check if your data is exactly like the input or if it is separated by commas.
Tip: - Try adding index_col as id.
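For Solutions 3 and 4, one way to check how the file is actually separated is to let the standard library guess the delimiter from the first few lines. This is only a sketch: the sample text below is made up and stands in for the real contents of Diff_Report.csv, and detect_delimiter is a hypothetical helper name. If the guess turns out to be something other than a comma, it could then be passed to pd.read_csv through its sep parameter.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the column delimiter of a CSV sample using csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample standing in for the first lines of Diff_Report.csv.
sample = "id;name;value\n1;alpha;10\n2;beta;20\n"

delimiter = detect_delimiter(sample)

# Parse the sample with the detected delimiter to confirm the columns split cleanly.
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
for row in rows:
    print(row)
```

If every row prints as a single long string instead of separate fields, the delimiter guess (or the assumption that the file is delimited text at all) is probably wrong.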
I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools, including Python packages, PDFBox, the Adobe API, and many others, and all of them failed to extract the text correctly: either they read the text LTR or they decode it wrongly.
Here are two samples from different tools:
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction. We can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is the language support of fonts: in Notepad I had to accept a script font to see a similarity, and that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will at best look like
Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.
Later Edit from B.A. comment below
I found a way to go around this: after extracting the text, I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
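A minimal sketch of that normalization step in Python, using the unicodedata module mentioned above. The helper name normalize_extracted is made up for illustration. The example character is the lam-alef presentation-form ligature, one of the code points that tends to show up in output like sample 2. Note that NFKC only maps presentation forms back to the base letters; it does not change the order of characters, so text extracted in visual (reversed) order would still need separate handling.

```python
import unicodedata


def normalize_extracted(text: str) -> str:
    """Map compatibility characters (e.g. Arabic presentation forms) to base letters."""
    return unicodedata.normalize("NFKC", text)


# U+FEFB: ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM (a presentation form).
ligature = "\uFEFB"
normalized = normalize_extracted(ligature)

# Show the code points before and after normalization.
print([f"U+{ord(c):04X}" for c in ligature])    # ['U+FEFB']
print([f"U+{ord(c):04X}" for c in normalized])  # ['U+0644', 'U+0627']
```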
I've OCRed the index of a book, and it's worked well, apart from not recognising some line breaks. I would like to scan the indexes of a load of books, and so need to add line breaks in, ideally using Notepad++.
I have tried this in Find and Replace:
Find what: [0-9]+
Replace with: \r\n
Which almost did what I want, but it removed the numbers.
It's more 'find numbers and insert a line break after them' that I'm trying to do.
I would be so grateful for any help! Thank you!
Here's an example of the index before:
Bengali-style baked fish 77 biscuits: fennel seed drop-biscuits 155 bread: naan 129 roti 127 simple layered flat breads 126 broad bean thoran 112 burgers: chicken burger 43
And how I'd like it to look after:
Bengali-style baked fish 77
biscuits: fennel seed drop-biscuits 155
bread: naan 129
roti 127
simple layered flat breads 126
broad bean thoran 112
burgers: chicken burger 43
Using Notepad++
Ctrl+H
Find what: \d+ \K
Replace with: \n or \r\n for Windows EOL
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
\d+ # 1 or more digits and a space
\K # forget all we have seen until this position
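If a whole batch of OCRed indexes has to be processed, the same replacement can be sketched in Python. Python's re module does not support \K, so this version captures the digits and puts them back in the replacement instead; break_after_numbers is just an illustrative name, and the input is the example from the question.

```python
import re


def break_after_numbers(text: str) -> str:
    """Replace the space after each run of digits with a line break."""
    # (\d+) captures the page number; the following space becomes "\n".
    return re.sub(r"(\d+) ", r"\1\n", text)


index_text = (
    "Bengali-style baked fish 77 biscuits: fennel seed drop-biscuits 155 "
    "bread: naan 129 roti 127 simple layered flat breads 126 "
    "broad bean thoran 112 burgers: chicken burger 43"
)

formatted = break_after_numbers(index_text)
print(formatted)
```

Like the Notepad++ version, this would also split an entry whose text contains a number followed by a space, so the result is worth a quick check.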
Screenshot:
I have the following regex which displays the following output:
Desired green text. But when added to Splunk, it shows the exact same output as the previous regex.
What I want is to make it such that the exact highlighted green text gets highlighted in Splunk for field extraction.
Previous regex:
The highlighted green text got misaligned.
What I suspect is that I have to make the '< 37 > 1' highlighted in blue too, so that Splunk will extract the green text correctly. As this regex was done by another user when I asked in Splunk, the user did not add '< 37 > 1' in his sample regex which affected the alignment when I added it to Splunk. I've tried different variations to highlight the '< 37 > 1' but to no avail.
Some examples of my variations:
(?:[^\s][^\s][^\s]+\s+){2}(?P[^\s]+(?:\s\w+)?)\s\d+\s+<
(?:[^\s][^\s]+\s+){2}(?P[^\s]+(?:\s\w+\s)?)\s\d+\s+<
(?:[^\s][^\s]+\s+){2}(?P[^\s]+(?:\s\w+\w)?)\s\d+\s+<
(?:[^\s][^\s]+\s+){2}(?P[^\s]+(?:\s\w+\w\w)?)\s\d+\s+<
\<(?:[^\s][^\s]+\s+){2}(?P[^\s]+(?:\s\w+)?)\s\d+\s+<
(?:[^\s][^\s]+\s+\s){2}(?P[^\s]+(?:\s\w+)?)\s\d+\s+<
Link of the regex:
https://regex101.com/r/biHi9a/5
You know Splunk doesn't use color when parsing text, right?
If I understand your question and examples correctly, you want to extract the text between "E876876876" and the first digit. Assuming "E876876876" is not a fixed string, this regex should extract "auth" and "SECURE AUDIT".
":\d\dZ\s\w+\s(?P<field1>[^\d]+)"
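To sanity-check that pattern outside Splunk, it can be run through Python's re module, which accepts the same (?P<name>...) syntax. The two log lines below are invented for illustration, since the real events are not shown in the question; only the "E876876876", "auth" and "SECURE AUDIT" values come from the thread. The capture includes the trailing space before the first digit, so the sketch strips it.

```python
import re

# Same pattern as above, without the surrounding quotes.
PATTERN = re.compile(r":\d\dZ\s\w+\s(?P<field1>[^\d]+)")


def extract_field1(line: str):
    """Return the text between the host-like token and the first digit, or None."""
    match = PATTERN.search(line)
    return match.group("field1").strip() if match else None


# Hypothetical events; the real log format may differ.
events = [
    "<37>1 2021-05-04T10:15:30Z E876876876 auth 1234 some message",
    "<37>1 2021-05-04T10:15:31Z E876876876 SECURE AUDIT 5678 some message",
]

extracted = [extract_field1(event) for event in events]
print(extracted)
```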
I am trying to read a text file, replace all occurrences of a "search term" with a "replace term" using a regular expression, and write the new file.
I am relatively new to Pentaho Kettle and not sure which transform or set of steps will suit this use case best. Most transforms read data by rows or columns, so I am not sure how to read a text file and get a find and replace to work. Most transforms read the file either line by line or by fields.
Thanks for your time and attention.
Step 1 - Input-> Text file input : file type: Fixed, just one String field big enough
Step 2 - Transform -> Replace in String : Use regexp: Y, Find: "search term", Replace: "replace term"
Step 3 - Output -> Text file output
Not sure it's the best, but it works. Hope it helps.
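For comparison, or to check the expected result outside Kettle, the same three steps (read, regex replace, write) can be sketched in plain Python. This is not Kettle code: replace_in_file, the file names and the sample text are all made up for illustration.

```python
import re
import tempfile
from pathlib import Path


def replace_in_file(src: Path, dst: Path, pattern: str, replacement: str) -> int:
    """Read src, replace every regex match, write dst, and return the match count."""
    text = src.read_text(encoding="utf-8")               # Step 1: read the input file
    new_text, count = re.subn(pattern, replacement, text)  # Step 2: regex replace
    dst.write_text(new_text, encoding="utf-8")            # Step 3: write the output file
    return count


# Small demonstration on temporary files with hypothetical content.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("search term here\nand search  term there\n", encoding="utf-8")
    count = replace_in_file(src, dst, r"search\s+term", "replace term")
    result = dst.read_text(encoding="utf-8")

print(count)
print(result)
```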