Format of data in a column in a data frame - pandas

I have read a fixed width file and created a dataframe.
I have a field called claim number, which is 15 digits long. In the data frame this field appears as "1.902431e+14" rather than the full 15-digit claim number.
How can I resolve this so that I can see the entire 15-digit claim number in the data frame?

For example, use the pandas float_format option as follows:
import pandas as pd
# data for the claim_number column
dictionary = {'claim_number': [1.902431111111141]}
# display floats with 15 decimal places instead of the default truncated form
pd.options.display.float_format = '{:,.15f}'.format
# create the dataframe
df = pd.DataFrame(data=dictionary)
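With the option set, print(df) shows the value as 1.902431111111141 rather than a rounded or scientific-notation form.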
If you want to apply the format to one specific column only, you can use style.format instead of pd.options.display.float_format, as follows:
# data with several claim_number columns of different types
dictionary = {'claim_number_1': [1.902431111111141], 'claim_number_2': ['some_string'], 'claim_number_3': [0.2323]}
# create the dataframe
df = pd.DataFrame(data=dictionary)
# apply the format to a single column
df.style.format({'claim_number_1': "{:,.15f}"})
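Note that df.style.format returns a Styler object, so it only changes how the dataframe is rendered (for example in a Jupyter notebook); the underlying float values are left unchanged.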
More options on how to use style.format can be found here https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows together based on that identifier. In some columns the Arabic row doesn't contain a translation but repeats the same English value (e.g. for the language column both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to normalize them first.
Something like this converts the plain string values in subject_name_namePart into lists of strings:
from ast import literal_eval

# wrap bare strings in list syntax so that every cell parses as a list
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
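Put together, a minimal sketch of the two steps on a simplified, hypothetical two-row sample (only the subject_name_namePart column, with diacritics and inner apostrophes dropped so the wrapping trick stays valid) might look like this:
import pandas as pd
from ast import literal_eval

# one plain string and one stringified list sharing the same identifier
df = pd.DataFrame({
    'location_shelfLocator': ['id1', 'id1'],
    'subject_name_namePart': [
        'Claudius Ptolemaeus (Ptolemy)',
        "['Claudius Ptolemaeus (Ptolemy)', 'Abd al-Rahman ibn Umar Sufi']",
    ],
})

# wrap bare strings in list syntax, then parse every cell into a real list
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)

# one row per list element, then collapse back to unique values per identifier
df = df.explode('subject_name_namePart')
out = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
print(out)
# subject_name_namePart for id1 is ['Claudius Ptolemaeus (Ptolemy)', 'Abd al-Rahman ibn Umar Sufi']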

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns; however, I want the new column names to be A and B rather than 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the pandas str.split function.
import pandas as pd
# recreate the dataframe from the question
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# first: make sure the column is of string type
# second: split the column on the separator \n
# third: pass expand=True so that the split produces two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# rename the new columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side. I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index of column labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:,0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:,0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
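Putting this together with the column-detection step from the question, one possible generalized sketch (a variant that takes the first matching header with [0] so the selection is already a Series, then reuses the two halves of that header as the new column names) might look like this:
import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# find the header that contains a newline and take it as a single label
colNew = df.columns[df.columns.str.contains('\n')][0]

# split the values into two columns and name them from the two halves of the old header
df2 = df[colNew].str.split('\n', expand=True)
df2.columns = colNew.split('\n')
print(df2)
#    A  B
# 0  1  2
# 1  2  3
# 2  3  4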

How to extract float number from data frame string

In my data frame every entry is a string that contains at least one number. Sometimes there are multiple, identical numbers in one cell.
data = {'INTERVAL': ['0,60', '0,8 0,8', '0,5 0,5 0,5']}
df = pd.DataFrame(data)
print(df)
How can I extract the value as a floating-point number and replace the original column with the new, simplified representation? I've tried the extract
df['INTERVAL'].str.extract('((\d+))')
command, but without success.
Thank you in advance
This seems to work for me -
floats = df['INTERVAL'].str.extract("(^[0-9]*,[0-9]*) ?.*")
df['INTERVAL'] = floats[0].str.replace(",",".").astype(float)
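For reference, running this on the sample data from the question keeps the first comma-decimal number in each cell and turns the column into plain floats:
import pandas as pd

df = pd.DataFrame({'INTERVAL': ['0,60', '0,8 0,8', '0,5 0,5 0,5']})
floats = df['INTERVAL'].str.extract("(^[0-9]*,[0-9]*) ?.*")
df['INTERVAL'] = floats[0].str.replace(",", ".").astype(float)
print(df)
#    INTERVAL
# 0       0.6
# 1       0.8
# 2       0.5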

How to apply custom string matching function to pandas dataframe and return summary dataframe about correct/incorrect patterns?

I have written a pattern-matching function to classify whether a dataframe column value matches a given pattern or not. I created a column 'Correct_Pattern' in that dataframe to store the boolean answers. I also created a new dataframe called Incorrect_Pattern_df, which only contains the values that do not match the desired pattern. I did this because I would later like to see if I can correct those incorrect numbers. Now, every time I correct a batch of numbers, I would like to check the number format again and regenerate the Incorrect_Pattern_df. Please see my code below. What do I need to do to make it work?
#data
mylist = ['850/07-498745', '850/07-148465', '07-499015']
#create dataframe
df = pd.DataFrame(mylist)
df.rename(columns={ df.columns[0]: "mycolumn" }, inplace = True)
#function to check if my numbers follow the correct pattern
def check_number_format(dataframe, rm_pattern, column_name):
    # create a column Correct_pattern that contains a boolean True/False depending on whether the pattern was matched or not
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(pattern)
    # filter all incorrect patterns and put them in a dataframe called Incorrect_Pattern_df
    Incorrect_Pattern_df = dataframe[dataframe.Correct_pattern == False]
    # return both the original dataframe with the added Correct_pattern column and the dataframe containing the Incorrect_Pattern_df
    return Incorrect_Pattern_df

#apply check_number_format to a dataframe
Incorrect_Pattern_df = df['mycolumn'].apply(check_number_format, args=(df, r'^\d{2}-\d+$', 'mycolumn'))
The desired output should look as follows:
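A minimal sketch of one way to get that output (an assumed fix, not the original poster's own solution) is to call the helper on the whole dataframe instead of apply-ing it to a single column, and to match against the rm_pattern argument inside the function:
import pandas as pd

mylist = ['850/07-498745', '850/07-148465', '07-499015']
df = pd.DataFrame(mylist)
df.rename(columns={df.columns[0]: "mycolumn"}, inplace=True)

def check_number_format(dataframe, rm_pattern, column_name):
    # flag each value depending on whether it matches the pattern
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(rm_pattern)
    # keep only the rows that do not match
    return dataframe[dataframe.Correct_pattern == False]

# pass the whole dataframe, not a single column
Incorrect_Pattern_df = check_number_format(df, r'^\d{2}-\d+$', 'mycolumn')
print(Incorrect_Pattern_df)
#         mycolumn  Correct_pattern
# 0  850/07-498745            False
# 1  850/07-148465            False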

Python3.4 Pandas DataFrame from function

I wrote a function that outputs selected data from a parsing function. I am trying to put this information into a DataFrame using pandas.DataFrame but I am having trouble.
The headers are listed below, as well as the function's .head() data output.
QUESTION
How can I place the function output into a pandas DataFrame so that the headers are linked to the output?
HEADERS
--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------
OUTPUT
['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']]
['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']]
['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']]
- Loop through each item (I'm assuming data is a list with each element being one of the lists shown above)
- Take the first element as the ticker and convert the rest into numbers using translate to undo the string formatting
- Make a DataFrame per row and then concat all at the end, then transpose
- Set the columns by parsing the header string (I've called it headers)
import pandas as pd

dflist = list()
for x in data:
    h = x[0]
    # strip the one-element lists; translate turns '(1,493,000)' into '-1493000' before the float conversion
    rest = [float(z[0].translate(str.maketrans('(', '-', '),'))) for z in x[1:]]
    dflist.append(pd.DataFrame([h] + rest))
df = pd.concat(dflist, axis=1).T
df.columns = [x for x in headers.split('-') if len(x) > 0]
But this might be a bit slow; it would be easier if you could get your input into a more consistent format.
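If the rows really all follow the shape shown above (an assumption about the data, not something verified here), one way to avoid building a DataFrame per row is to clean every row into a flat list first and construct the frame in a single call, for example:
import pandas as pd

headers = '--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------'
data = [
    ['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']],
    ['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']],
    ['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']],
]

# ticker first, then each '(1,493,000)'-style string turned into a signed float
rows = [[x[0]] + [float(z[0].translate(str.maketrans('(', '-', '),'))) for z in x[1:]]
        for x in data]
df = pd.DataFrame(rows, columns=[c for c in headers.split('-') if c])
print(df)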