I know there are a few threads on this topic, but none of their solutions seems to work for me. I have a table in a PDF document from which I would like to extract information. I can copy and paste the text into TextEdit and it is legible, but not really usable: all the text is readable, but the data are separated only by spaces, with no way to tell column boundaries apart from the spaces inside a cell's text.
But whenever I try tools like Tabula or ScraperWiki, the extracted text is garbage.
Is anyone able to give me any pointers as to how I might go about this?
Here's a solution using Python and Unix.
In Python:
import urllib.request

# download the PDF (Python 3; on Python 2 use urllib.URLopener().retrieve instead)
url = 'http://www.european-athletics.org/mm/Document/EventsMeetings/General/01/27/52/10/EICH-FinalEntriesforwebsite_Neutral.pdf'
urllib.request.urlretrieve(url, 'test.pdf')
In Unix:
$ pdftotext -layout test.pdf
Snippet of output to test.txt:
Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
ebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen lham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
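Once pdftotext has produced the text, the columns can be recovered with a regular expression, because the country code, date of birth, and times all have fixed shapes. A minimal Python sketch (this assumes every data row follows the `Name(s) CCC DD/MM/YYYY [PB] [SB]` pattern seen in the snippet above; multi-word names simply stay in the first group):

```python
import re

# one athlete row: name(s), 3-letter country code, DOB, then optional PB and SB times
ROW = re.compile(
    r"^(?P<name>.+?) (?P<country>[A-Z]{3}) (?P<dob>\d{2}/\d{2}/\d{4})"
    r"(?: (?P<pb>\d:\d{2}\.\d{2}))?(?: (?P<sb>\d:\d{2}\.\d{2}))?$"
)

def parse_row(line):
    """Return the row's fields as a dict, or None for header/section lines."""
    m = ROW.match(line.strip())
    return m.groupdict() if m else None
```

Section headers such as `1500m Men` simply fail to match and come back as None, which gives a cheap way to skip them.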
You can also download a simple command line tool to deal with the PDF file you linked to. Then run this command to extract the table(s) on the first page:
pdftotext \
-enc UTF-8 \
-l 1 \
-table \
EICH-FinalEntriesforwebsite_Neutral.pdf \
EICH-FinalEntriesforwebsite_Neutral.txt
-enc UTF-8: sets the text encoding so that the Ö, Ä, Ü and İ (as well as ö, ä, ü, ß, á, š, ē, í and č) characters in the text get correctly extracted.
-l 1: tells the command to stop after page 1, i.e. to extract only the first page.
-table: this is the decisive parameter; it is similar to -layout, but optimized for tabular data.
The command produces this output:
EUROPEAN ATHLETICS INDOOR CHAMPIONSHIPS
PRAGUE / CZE, 6-8 MARCH 2015
FINAL ENTRIES - MEN
Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
Żebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen İlham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
3000m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 7:59.95 7:59.95
Note, however:
The -table parameter of the pdftotext command line tool is only available in XPDF version 3.04, which you can download here: www.foolabs.com/xpdf/download.html. It is NOT (yet) available in Poppler's fork of pdftotext (whose latest version is 0.43.0).
If you only have Poppler's pdftotext, you'd have to use the -layout parameter (instead of -table), which gives you a similarly good result for the PDF file in question:
pdftotext \
-enc UTF-8 \
-l 1 \
-layout \
EICH-FinalEntriesforwebsite_Neutral.pdf \
EICH-FinalEntriesforwebsite_Neutral.txt
However, I have seen PDFs where the result is much better with -table (and XPDF) than it is with -layout (and Poppler).
(XPDF has the -layout parameter too -- so you can see the difference if you try both.)
This is a portion of my result:
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
return:
1995
2006
2013
2009
1952
2017
1957
1453 <--
1871 <--
     <-- this part for the last title is empty, and it is supposed to be empty
As you can see from the above, the last two titles give the wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ]) assert that what precedes is ( or a space
\d{4} match a 4 digit year
(?=\S*\)|$) assert that ), possibly prefaced by non-whitespace characters, follows, or that the end of the string follows
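Before wiring a pattern like this into regexp_extract, it can be handy to check it with Python's re module. A small sketch using the same combined pattern, on titles taken from the question:

```python
import re

# same combined pattern as in the PySpark snippet:
# a bracketed/space-preceded year, or a year at the end of the string
PATTERN = re.compile(r"((?<=[( ])\d{4}(?=\S*\)|$))|(\d{4}$)")

def year_of(title):
    """Return the first matching year, or '' when there is none."""
    m = PATTERN.search(title)
    return m.group(0) if m else ""
```

For example, year_of("Grumpier Old Men (1995)") gives "1995", while an address like "1013 Briar Lane" yields an empty string, matching the expected output.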
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\() means an open bracket before. (?=–|(–)|\)) means a closing bracket after, or – or – which is the original character that was misencoded. Once you have covered the date in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract

bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
    .withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also, you do not need to prefix the string with r when you pass a regex to a Spark function.
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounded by brackets in your examples, and sometimes you have non-numeric characters inside the brackets.
I'd like to import raw data using "input" in SAS. My program below doesn't work well. How do I do that? Please give me some advice.
data dt00;
infile datalines;
input Year School & $27. Enrolled : comma.;
datalines;
1868 U OF CALIFORNIA BERKELEY 31,612
1906 U OF CALIFORNIA DAVIS 21,838
1965 U OF CALIFORNIA IRVINE 15,874
1919 U OF CALIFORNIA LOS ANGELES 35,730
;
run;
The & modifier in your input statement says to look for two or more delimiters in a row to mark the end of the next "word" in the line. Make sure the lines of data actually have the extra space. Also make sure to include the : modifier in front of any informat specification in the INPUT statement.
data dt00;
input Year School & :$27. Enrolled : comma.;
datalines;
1868 U OF CALIFORNIA BERKELEY 31,612
1906 U OF CALIFORNIA DAVIS 21,838
1965 U OF CALIFORNIA IRVINE 15,874
1919 U OF CALIFORNIA LOS ANGELES 35,730
;
datalines is space-delimited by default. You can specify fixed field lengths, as you are doing, and do additional post-processing cleanup, but the easiest option is to use a different delimiter and pass the dlm option in your infile statement.
data dt00;
infile datalines dlm='|';
length Year 8. School $27. Enrolled 8.;
input Year School$ Enrolled : comma.;
datalines;
1868|U OF CALIFORNIA BERKELEY|31,612
1906|U OF CALIFORNIA DAVIS|21,838
1965|U OF CALIFORNIA IRVINE|15,874
1919|U OF CALIFORNIA LOS ANGELES|35,730
;
run;
Output:
Year School Enrolled
1868 U OF CALIFORNIA BERKELEY 31612
1906 U OF CALIFORNIA DAVIS 21838
1965 U OF CALIFORNIA IRVINE 15874
1919 U OF CALIFORNIA LOS ANGELES 35730
SAS has a ton of options on the input statement for reading both structured and unstructured data, but at the end of the day, it's easiest to get it in a delimited format whenever possible.
df_lyrics['Lyrics']
0 \n\n--Male--\nAaaaa Aaaaa\n--Female--\nAaaaaa\...
1 \n\n--Male1--\nAnkhiyon Hi Ankhiyon Mein\nRati...
2 \n\n--Male1--\nAray Peeli Chotiyaan,\nHawaeyn ...
3 \n\nPyar Itna Na Kar\nYeh Dil Jaata Hai Bhar\n...
4 Zaraa Maara Maara Sa\nJaane Kyun Dil Ye Ban Ba...
...
1286 \n\nKaara fankaara kab aaye re\nKaara fankaar...
1287 \n\nZameen-o-aasmaan ne kya baat ki hai\nGira...
1288 \n\nMaula Wa Sallim wassalim da-iman abadan\n...
1289 \n\nBhavara\nRe ga re ga re ga re ga pa ma ga...
1290 \n\nArre udi udi udi... udi jaye..\n\nUdi udi...
Name: Lyrics, Length: 1291, dtype: object
I want to combine all of these into a single paragraph. Please help me.
You can join the strings using pd.Series.sum:
print(df_lyrics['Lyrics'].sum())
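A self-contained sketch with a small stand-in Series (str.cat does the same thing more explicitly, and replacing the newlines afterwards turns the result into one paragraph):

```python
import pandas as pd

# stand-in for df_lyrics['Lyrics']
lyrics = pd.Series(["first line\n", "second line\n", "third line"])

joined = lyrics.sum()                  # concatenate all strings into one
joined2 = lyrics.str.cat()             # equivalent, more explicit
paragraph = joined.replace("\n", " ")  # single paragraph: newlines -> spaces
```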
I have a csv file that looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
Can I use pandas to read the csv such that it gets read in the obvious way?
In other words, I want the csv file to be read as if it looked like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
Any suggestions?
pd.read_csv(data)
is the answer to your problem.
Here is the code I used for this Kaggle dataset:
training_set = pd.read_csv('train.csv')
Output (Just first row)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
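If your copy of the file really does have each whole data row wrapped in an extra layer of quotes, as shown in the question, plain read_csv will see it as a single column. One stdlib sketch (assuming only the data lines, not the header, are wrapped like the sample) is to strip the outer quoting first and then parse normally:

```python
import csv

raw_lines = [
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
    '"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"',
]

header = raw_lines[0].split(',')
rows = []
for line in raw_lines[1:]:
    inner = next(csv.reader([line]))[0]     # first pass: unwrap outer quotes, "" -> "
    rows.append(next(csv.reader([inner])))  # second pass: normal CSV parse
```

The unquoted rows can then be handed to pandas, e.g. `pd.DataFrame(rows, columns=header)`.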
I have read a CSV file into Python 2.7 (on a Windows machine). The Sales Price column seems to be a mixture of strings and floats, and some rows contain a euro symbol €, which Python displays as �.
df = pd.read_csv('sales.csv', thousands=',')
print df
Gender Size Color Category Sales Price
Female 36-38 Blue Socks 25
Female 44-46 Pink Socks 13.2
Unisex 36-38 Black Socks � 19.00
Unisex 40-42 Pink Socks � 18.50
Female 38 Yellow Pants � 89,00
Female 43 Black Pants � 89,00
I was under the assumption that a simple line with replace would solve it:
df = df.replace('\€', '', regex=True).astype(float)
But I got an encoding error:
SyntaxError: Non-ASCII character
I would appreciate hearing your thoughts on this.
I faced a similar problem where one of the columns in my dataframe had lots of currency symbols: euro, dollar, yen, pound, etc. I tried multiple solutions, but the easiest one was to use the unicodedata module.
import unicodedata

df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')
The above will replace € with Euro in Sales Price column.
I think @jezrael's comment is valid. First you need to read the file with an encoding (see the encoding section at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):
df=pd.read_csv('sales.csv', thousands=',', encoding='utf-8')
but to replace the euro sign, try this:
df=df.replace('\u20AC','',regex=True).astype(float)
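Putting both steps together on a small stand-in frame (a sketch; it assumes the comma in the prices is a decimal separator, as in the 89,00 rows of the sample, so it is swapped for a dot before the float conversion):

```python
import pandas as pd

# stand-in for the Sales Price column from the question
df = pd.DataFrame({'Sales Price': ['25', '13.2', '\u20AC 19.00', '\u20AC 89,00']})

cleaned = (df['Sales Price']
           .str.replace('\u20AC', '', regex=False)  # drop the euro sign
           .str.replace(',', '.', regex=False)      # decimal comma -> dot
           .str.strip()
           .astype(float))
```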