Data slicing a pandas frame - I'm having problems with unique - pandas

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape()
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape()
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO which correctly states that unique needs to be applied to columns which is what I am doing.

This was a case of a duplicate column. Given this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question and therefore masked the problem. (The real data set had 86 columns and I had duplicated one of those columns twice in my subset, and was trying to do a unique on that)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not realize I had duplicated A later.
This was causing problems when doing a unique on 'A'

From the entire dataframe to extract a subset a data based on a column ID. This works!!
df = df.drop_duplicates(subset=['Id']) #where 'id' is the column used to filter
print (df)

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot have a control over the input value, you need to fix it somehow.
Something like this. Here, I am converting string value in subject_name_namePart to array of string.
from ast import literal_eval
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then, you can do (explode) + aggregation.
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that is throwing a DtypeWarning: Columns (I understand this warning) but am struggling to prevent it (I don't want to set low_memory to False as I would like to specify the correct dtypes.
For every columns, the majority of rows are float values and the last 3 rows are string (metadata basically, information about each column). I understand that I can set the dtype per column when reading in the csv, however I do not know how to change rows 1:n to be float32 for example and the last 3 rows to be strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The dtype warning is not thrown when reading in i am guessing because of the size of the dataframe - however the result is exactly the same as the problem i am facing. i would like to make the first 3 rows float32 and the last 3 string so that they are the correct dtype. thank you!
reproducible example:
df = pd.DataFrame([[0.1, 0.2,0.3],[0.1, 0.2,0.3],[0.1, 0.2,0.3],
['info1', 'info2','info3'],['info1', 'info2','info3'],['info1', 'info2','info3']],
index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
columns=['column1', 'column2', 'column3'] )
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)

Group by multiple columns creating new column in pandas dataframe

I have a pandas dateframe of two columns ['company'] which is a string and ['publication_datetime'] which is a datetime.
I want to group by company and the publication_date , adding a new column with the maximum publication_datetime for each record.
so far i have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
my group by does not appear to work.
i get the following error
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result in the original dimension of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
The result of your groupby() before was returning a multi-indexed group object, which is why it's complaining about 3 values (first group, second group, and then values). But even if you just returned the values, it's combining like groups together, so you'll have fewer values than needed.
The transform() method returns the group results for each row of the dataframe in a way that makes it easy to create a new column. The returned values are an indexed Series with the indices being the original ones from the issuers dataframe.
Hope this helps! Documentation for transform here
The thing is by doing what you are doing you are trying to set a DataFrame to a column value.
Doing the following will get extract only the values without the two indexe columns:
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max().tolist()
Hope this help !

pandas merge produce duplicate columns

n1 = DataFrame({'zhanghui':[1,2,3,4] , 'wudi':[17,'gx',356,23] ,'sas'[234,51,354,123] })
n2 = DataFrame({'zhanghui_x':[1,2,3,5] , 'wudi':[17,23,'sd',23] ,'wudi_x':[17,23,'x356',23] ,'wudi_y':[17,23,'y356',23] ,'ddd':[234,51,354,123] })
code above defined two DataFrame objects. I wanna use 'zhanghui' field from n1 and 'zhanghui_x' field from n2 as "on" field merge n1 and n2,so my code like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and then result columns given like this :
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appeared,such as 'wudi_x' ,'wudi_y'.
So it's a pandas inner problems or I had a wrong usage about pd.merge ?
From pandas documentation, the merge() function has following properties;
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denote default suffix string to be attached to 'over-lapping' columns with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has column 'column_name_x' and the second dataFrame has column 'column_name' then there are no over-lapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name', 'column_name_x' and the second dataFrame also has column 'column_name', the default suffixes attach to over-lapping columns and therefore the first frame's 'columnn_name' becomes 'column_name_x' and result in a duplicate of already existing column.
You can however, pass a None value to one(not all) of the suffixes to ensure that column names of certain dataFrame remain as-is.
Your approach is right, pandas automatically gives postscripts after merging the columns that are "duplicated" with the original headers given a postscript _x, _y, etc.
you can first select what columns to merge and proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns,it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class pandas.core.indexes.base.Index'>
then I tried to use code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns) ]
It worked fine,result columns given like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So,#S Ringne's method really resolved my problems.
=============================================
Pandas just simply add suffix such as '_x' to resolve the duplicate-column-name problem when it comes to merging two Frame objects.
But what will it happen if the name form of 'a-column-name'+'_x' appears in either Frame object? I used to think that it will check if the name form of 'a-column-name'+'_x' appears, But actually pandas doesn't have this check?