move string starting with specific characters into another column in pandas df - pandas

I have a dataframe containing addresses.
Addressess
10 Pentland Drive, Comiston, Edinburgh, EH10 6PX.
Moray Place, Edinburgh, EH3
Carlton Street, Edinburgh
The Bourse Apartments, 47 Timber Bush, Leith EH6 6QH
I wish to write code in python and pandas that identifies if there is an 'EH' in the row and then moves this and all subsequent characters into another column. Thus achieving this:
Addressess                                            Post Code
10 Pentland Drive, Comiston, Edinburgh, .             EH10 6PX
Moray Place, Edinburgh,                               EH3
Carlton Street, Edinburgh
The Bourse Apartments, 47 Timber Bush, Leith EH6 6QH  EH6 6QH
Can anyone help?

You can use str.extract:
df[['Addressess', 'Post Code']] = df['Addressess'].str.extract(r'(.*?)\s*(\bEH\d+[\s\w]*)?\W*$')
Or str.split, if there is at least one row with a post code:
df[['Addressess', 'Post Code']] = df['Addressess'].str.split(r'\s*(?=\bEH\d*)', n=1, expand=True)
Output:
   Addressess                                    Post Code
0  10 Pentland Drive, Comiston, Edinburgh,       EH10 6PX
1  Moray Place, Edinburgh,                       EH3
2  Carlton Street, Edinburgh                     NaN
3  The Bourse Apartments, 47 Timber Bush, Leith  EH6 6QH
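To try the str.extract approach end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same pattern. The intermediate name `parts` is only for illustration; assigning the two captured groups one at a time avoids relying on multi-column assignment behaviour, which has differed between pandas versions.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Addressess": [
        "10 Pentland Drive, Comiston, Edinburgh, EH10 6PX.",
        "Moray Place, Edinburgh, EH3",
        "Carlton Street, Edinburgh",
        "The Bourse Apartments, 47 Timber Bush, Leith EH6 6QH",
    ]
})

# Group 1: everything before the post code (lazy match).
# Group 2: optional post code starting with "EH" followed by digits.
# Trailing non-word characters (such as the final ".") are discarded.
pattern = r"(.*?)\s*(\bEH\d+[\s\w]*)?\W*$"

parts = df["Addressess"].str.extract(pattern)  # columns 0 and 1
df["Addressess"] = parts[0]
df["Post Code"] = parts[1]  # NaN where no post code was found

print(df)
```

Rows without an "EH" post code keep their full address and get NaN in the new column.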

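With simple regex matching (this copies the post code, including any trailing punctuation, into a new column and leaves Addressess unchanged):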
df['Post Code'] = df['Addressess'].str.extract('(EH.+)').fillna('')
   Addressess                                          Post Code
0  10 Pentland Drive, Comiston, Edinburgh, EH10 6PX.   EH10 6PX.
1  Moray Place, Edinburgh, EH3                         EH3
2  Carlton Street, Edinburgh
3  The Bourse Apartments, 47 Timber Bush, Leith E...   EH6 6QH

Here is a way using a regex positive lookahead, splitting on the optional comma and whitespace that precede the post code:
df['Addressess'].str.split(r',?\s+(?=EH\d)', n=1, expand=True).rename({0:'Addressess',1:'Post Code'},axis=1)

Related

Pop the first element in a pandas column

I have a pandas column like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
'address': [['William J. Clare', '290 Valley Dr.', 'Casper, WY 82604','USA, United States'],
['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'],
['William N. Barnard', '145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]
}
df = pd.DataFrame(data)
I want to pop the first element of each list in the address column if it is a name, or if it doesn't contain any number.
output:
[['290 Valley Dr.', 'Casper, WY 82604','USA, United States'], ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'], ['145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]
This is a continuation of my previous post. I am learning Python, this is my second project, and I have been struggling with this since the morning. Please help me.
Assuming you define an address as a string starting with a number (you can change the logic):
for l in df['address']:
    if not l[0][0].isdigit():
        l.pop(0)
print(df)
updated df:
id address
0 001 [290 Valley Dr., Casper, WY 82604, USA, United...
1 002 [1180 Shelard Tower, Minneapolis, MN 55426, US...
2 003 [145 S. Durbin, Casper, WY 82601, USA, United ...
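If you would rather not mutate the lists in place, or you want the rule from the question ("doesn't contain any number") instead of "starts with a digit", a variant could look like the sketch below. The helper name `drop_leading_name` is made up for this example.

```python
import pandas as pd

data = {'id': ['001', '002', '003'],
        'address': [['William J. Clare', '290 Valley Dr.', 'Casper, WY 82604', 'USA, United States'],
                    ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'],
                    ['William N. Barnard', '145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]}
df = pd.DataFrame(data)


def drop_leading_name(parts):
    """Return a new list without the first entry when that entry has no digit.

    Hypothetical helper: "no digit anywhere" is the assumed test for a name.
    """
    if parts and not any(ch.isdigit() for ch in parts[0]):
        return parts[1:]
    return list(parts)


# apply() builds a new column, so the original lists are left untouched
df['address'] = df['address'].apply(drop_leading_name)
print(df['address'].tolist())
```

The `parts and ...` check also guards against empty lists, which would raise an IndexError in the in-place loop.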

Slicing pandas dataframe using index values

I'm trying to select the rows whose index values are congruent to 1 mod 24. How can I best do this?
This is my dataframe:
ticker date open high low close volume momo nextDayLogReturn
335582 ETH/USD 2021-11-05 00:00:00+00:00 4535.3 4539.3 4495.8 4507.1 9.938260e+06 9.094134 -9.160928
186854 BTC/USD 2021-11-05 00:00:00+00:00 61437.0 61528.0 61111.0 61170.0 1.191233e+07 10.640513 -10.825763
186853 BTC/USD 2021-11-04 23:00:00+00:00 61190.0 61541.0 61130.0 61437.0 1.395133e+07 10.645757 -10.842114
335581 ETH/USD 2021-11-04 23:00:00+00:00 4518.8 4539.4 4513.6 4535.3 1.296507e+07 9.087243 -9.139240
186852 BTC/USD 2021-11-04 22:00:00+00:00 61393.0 61426.0 61044.0 61190.0 1.360557e+07 10.639201 -10.812127
This was my attempt:
newindex = []
for i in range(0,df2.shape[0]+1):
    if(i%24 ==1):
        newindex.append(i)
df2.iloc[[newindex]]
Essentially, I need to select the rows using a boolean, but I'm not sure how to do it.
Many thanks
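One possible approach is a boolean mask on the index, assuming the index holds integers. The small `df2` below is a made-up stand-in for the real frame. The sketch shows both readings of the question: filtering by index label, and filtering by row position (which is what the `iloc` attempt was going for).

```python
import pandas as pd

# Hypothetical stand-in for df2: 60 rows with integer index labels 100..159
df2 = pd.DataFrame({"close": range(60)}, index=range(100, 160))

# By label: keep rows whose index value is congruent to 1 mod 24
by_label = df2[df2.index % 24 == 1]

# By position: keep rows 1, 25, 49, ... regardless of their labels
by_position = df2.iloc[1::24]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Note that `iloc` expects a flat list of positions, so `df2.iloc[newindex]` rather than `df2.iloc[[newindex]]`, and the positions must stay below `df2.shape[0]`.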

convert a column of text into a paragraph

df_lyrics['Lyrics']
0 \n\n--Male--\nAaaaa Aaaaa\n--Female--\nAaaaaa\...
1 \n\n--Male1--\nAnkhiyon Hi Ankhiyon Mein\nRati...
2 \n\n--Male1--\nAray Peeli Chotiyaan,\nHawaeyn ...
3 \n\nPyar Itna Na Kar\nYeh Dil Jaata Hai Bhar\n...
4 Zaraa Maara Maara Sa\nJaane Kyun Dil Ye Ban Ba...
...
1286 \n\nKaara fankaara kab aaye re\nKaara fankaar...
1287 \n\nZameen-o-aasmaan ne kya baat ki hai\nGira...
1288 \n\nMaula Wa Sallim wassalim da-iman abadan\n...
1289 \n\nBhavara\nRe ga re ga re ga re ga pa ma ga...
1290 \n\nArre udi udi udi... udi jaye..\n\nUdi udi...
Name: Lyrics, Length: 1291, dtype: object
I want to convert all these into a single paragraph....
Please help me.
You can join the strings using pd.Series sum:
print(df_lyrics['Lyrics'].sum())
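Since `sum()` concatenates the strings with nothing in between and keeps every newline, an alternative sketch is to collapse the whitespace in each entry and join with spaces. The two sample strings are shortened from the question, and the names `cleaned` and `paragraph` are only for illustration.

```python
import pandas as pd

# Two shortened entries from the question's Lyrics column
df_lyrics = pd.DataFrame({
    "Lyrics": [
        "\n\nPyar Itna Na Kar\nYeh Dil Jaata Hai Bhar\n",
        "Zaraa Maara Maara Sa\nJaane Kyun",
    ]
})

# Split each entry on any whitespace (including "\n"), then rejoin with single spaces
cleaned = df_lyrics["Lyrics"].dropna().str.split().str.join(" ")

# Join all entries into one paragraph
paragraph = " ".join(cleaned)
print(paragraph)
```

`dropna()` is there because a missing value would make the final `join` fail.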

Merge multiple excel to single worksheet with options

I have 2 sheets in one excel file, the first one is :
Sheet: Person
Code date start end
2301 12/08/1993 08:02 08:17
4221 12/08/1993 09:04 09:25
2312 12/08/1993 10:02 10:28
1284 19/09/1994 11:02 11:21
2312 19/09/1994 15:57 16:20
1284 23/06/1995 17:12 17:35
2312 22/06/1996 13:14 13:32
4221 22/06/1996 15:53 16:13
4221 05/05/1999 08:06 08:22
2418 05/05/1999 08:10 08:33
2301 05/05/1999 09:12 09:37
2301 05/05/1999 09:28 10:28
2301 05/05/1999 13:28 13:38
This is a list of people in a company, each identified by a badge number [column Code]. I would like to merge the data by code into a custom sheet per person. For example, the person with badge number 2301 has their own sheet called B2301, and based on the first sheet "Person" I would like to import that person's data, grouped by their code number, like this:
sheet B2301
date Period(min)
12/08/1987 12
.... ...
So Period will be calculated from the start and end columns.
I tried by using this formula but it's not working for me :
=IFERROR(INDEX(Sheet1!A$2:A$14,SMALL(IF(Sheet1!$A$2:$A$14=INT(RIGHT(CELL("filename",A1),LEN(CELL("filename",A1))-FIND("]",CELL("filename",A1)))),ROW(Sheet1!A$2:A$14)-ROW(Sheet1!A$2)+1),ROWS(Sheet1!A$2:A2))),"")
Any Idea?
This will require a lot of research on your part. You'll need to:
create a VBA Macro
define variables and create a loop to look at your main sheet.
create a sheet name based on the code.
check if the sheet already exists, if not, create it.
copy the values from the first sheet to the "code" sheet.
once all values are processed, go through each sheet, loop through your values and calculate your periods.
This is not a trivial amount of code. Do research on these 6 items and write the code. When you have that, display it and we can give you more direction.
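If working outside Excel is an option, the same steps (group by code, compute the period, one table per badge) can be sketched in pandas. This is only an illustration under that assumption, not the VBA macro described above. The four rows are taken from the Person sheet in the question, and the `sheets` dict is a made-up name.

```python
import pandas as pd

# A few rows from the "Person" sheet in the question
person = pd.DataFrame({
    "Code": [2301, 4221, 2312, 2301],
    "date": ["12/08/1993", "12/08/1993", "12/08/1993", "05/05/1999"],
    "start": ["08:02", "09:04", "10:02", "09:12"],
    "end": ["08:17", "09:25", "10:28", "09:37"],
})

# Period in whole minutes, assuming start and end fall on the same day
start = pd.to_datetime(person["start"], format="%H:%M")
end = pd.to_datetime(person["end"], format="%H:%M")
person["Period(min)"] = (end - start).dt.total_seconds().div(60).astype(int)

# One small table per badge, keyed by the intended sheet name ("B" + code)
sheets = {
    f"B{code}": group[["date", "Period(min)"]].reset_index(drop=True)
    for code, group in person.groupby("Code")
}

print(sheets["B2301"])
```

Each frame could then be written to its own worksheet with `DataFrame.to_excel(writer, sheet_name=name)` inside a `pd.ExcelWriter` context, which needs an Excel engine such as openpyxl installed.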
To populate the dates, in A2 put:
=IFERROR(INDEX(Sheet1!$B$2:$B$14,MATCH(SMALL(IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,Sheet1!$B$2:$B$14),ROW()-1),IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,Sheet1!$B$2:$B$14),0)),"")
To populate the period put this in B2:
=IFERROR(TEXT(INDEX(Sheet1!$D$2:$D$14,MATCH(SMALL(IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,IF(Sheet1!$B$2:$B$14=A2,Sheet1!$C$2:$C$14)),COUNTIF($A$1:$A2,A2)),IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,IF(Sheet1!$B$2:$B$14=A2,Sheet1!$C$2:$C$14)),0))-INDEX(Sheet1!$C$2:$C$14,MATCH(SMALL(IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,IF(Sheet1!$B$2:$B$14=A2,Sheet1!$C$2:$C$14)),COUNTIF($A$1:$A2,A2)),IF(--MID(MID(CELL("filename",A1),FIND("]",CELL("filename",A1))+1,255),2,999) = Sheet1!$A$2:$A$14,IF(Sheet1!$B$2:$B$14=A2,Sheet1!$C$2:$C$14)),0)),"[m]"),"")
Both are array formulas and need to be confirmed with Ctrl-Shift-Enter. Then copy both down to the desired rows.

Problems with extracting table from PDF

I know there are a few threads on this topic, but none of their solutions seem to work for me. I have a table in a PDF document from which I would like to be able to extract information. I can copy and paste the text into TextEdit and it is legible but not really usable. By this I mean all the text is readable, but the data is all separated by spaces, with no way to tell column separators from spaces within the text of a cell.
But whenever I try to use tools like Tabula or ScraperWiki, the extracted text is garbage.
Is anyone able to give me any pointers as to how I might go about this?
Here's a solution using Python and Unix
In Python:
import urllib.request
# download pdf
url = 'http://www.european-athletics.org/mm/Document/EventsMeetings/General/01/27/52/10/EICH-FinalEntriesforwebsite_Neutral.pdf'
urllib.request.urlretrieve(url, 'test.pdf')
In Unix:
$ pdftotext -layout test.pdf
Snippet of output to test.txt:
Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
ebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen lham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
You can also download a simple command line tool to deal with the PDF file you linked to. Then run this command to extract the table(s) on the first page:
pdftotext \
-enc UTF-8 \
-l 1 \
-table \
EICH-FinalEntriesforwebsite_Neutral.pdf \
EICH-FinalEntriesforwebsite_Neutral.txt
-enc UTF-8: sets the text encoding so that the Ö, Ä, Ü and İ (as well as ö, ä, ü, ß, á, š, ē, í and č) characters in the text get correctly extracted.
-l 1: sets the last page to extract to page 1, so only the first page is processed.
-table: this is the decisive parameter.
The command produces this output:
EUROPEAN ATHLETICS INDOOR CHAMPIONSHIPS
PRAGUE / CZE, 6-8 MARCH 2015
FINAL ENTRIES - MEN
Lastname Firstname Country DOB PB SB
1500m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 3:38.99 3:41.09
Khadiri Amine CYP 20/11/1988 3:45.16 3:45.16
Friš Jan CZE 19/12/1995 3:43.76 3:43.76
Holuša Jakub CZE 20/02/1988 3:38.79 3:41.54
Kocourek Milan CZE 06/12/1987 3:43.97 3:43.97
Bueno Andreas DEN 07/07/1988 3:42.78 3:42.78
Alcalá Marc ESP 07/11/1994 3:41.79 3:41.79
Mechaal Adel ESP 05/12/1990 3:38.30 3:38.30
Olmedo Manuel ESP 17/05/1983 3:39.82 3:40.66
Ruíz Diego ESP 05/02/1982 3:36.42 3:40.60
Kowal Yoann FRA 28/05/1987 3:38.07 3:39.22
Grice Charlie GBR 07/11/1993 3:39.44 3:39.44
O'Hare Chris GBR 23/11/1990 3:37.25 3:40.42
Orth Florian GER 24/07/1989 3:39.97 3:40.20
Tesfaye Homiyu GER 23/06/1993 3:34.13 3:34.13
Kazi Tamás HUN 16/05/1985 3:44.28 3:44.28
Mooney Danny IRL 20/06/1988 3:42.69 3:42.69
Travers John IRL 16/03/1991 3:42.52 3:43.74
Bussotti Neves Junior Joao Capistrano M. ITA 10/05/1993 3:47.58 3:47.58
Jurkēvičs Dmitrijs LAT 07/01/1987 3:45.95 3:45.95
Ingebrigtsen Henrik NOR 24/02/1991 3:44.00
Ingebrigtsen Filip NOR 20/04/1993
Krawczyk Szymon POL 29/12/1988 3:41.64 3:41.64
Ostrowski Artur POL 10/07/1988 3:41.36 3:41.36
Żebrowski Krzysztof POL 09/07/1990 3:41.49 3:41.49
Smirnov Valentin RUS 13/02/1986 3:37.55 3:38.74
Nava Goran SRB 15/04/1981 3:40.65 3:44.49
Pelikán Jozef SVK 29/07/1984 3:43.85 3:45.51
Ek Staffan SWE 13/11/1991 3:43.54 3:43.54
Rogestedt Johan SWE 27/01/1993 3:40.03 3:40.03
Özbilen İlham Tanui TUR 05/03/1990 3:34.76 3:38.05
Özdemir Ramazan TUR 06/07/1991 3:44.35 3:44.35
3000m Men
Rowe Brenton AUT 17/08/1987
Vojta Andreas AUT 09/06/1989 7:59.95 7:59.95
Note, however:
The -table parameter to the pdftotext command line tool is only available in the XPDF-version 3.04, which you can download here: www.foolabs.com/xpdf/download.html. It is NOT (yet) available in Poppler's fork of pdftotext (latest version of which is 0.43.0).
If you only have Poppler's pdftotext, you'd have to use the -layout parameter (instead of -table), which gives you a similarly good result for the PDF file in question:
pdftotext \
-enc UTF-8 \
-l 1 \
-layout \
EICH-FinalEntriesforwebsite_Neutral.pdf \
EICH-FinalEntriesforwebsite_Neutral.txt
However, I have seen PDFs where the result is much better with -table (and XPDF) than it is with -layout (and Poppler).
(XPDF has the -layout parameter too -- so you can see the difference if you try both.)
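Once you have the -layout (or -table) text file, the columns can often be recovered by splitting each line on runs of two or more spaces. The sketch below assumes that columns are separated by at least two spaces and that cell contents only ever contain single spaces. The spacing in `sample` is made up, since the exact column widths are not preserved in the snippets above, and `split_layout_line` is a hypothetical helper name.

```python
import re


def split_layout_line(line):
    """Split one line of layout-preserving text on runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


# Assumed spacing; the real output pads columns to fixed positions
sample = "Vojta          Andreas      AUT    09/06/1989   3:38.99   3:41.09"
fields = split_layout_line(sample)
print(fields)
```

Rows with an empty cell (for example a missing PB) come out with fewer fields, so for those it is safer to slice each line at fixed character positions taken from the header row.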