I am using the replace function, but it does not work: the replacement is not performed and I still see the original string. In the pandas documentation, replace does not even list an inplace argument, so I wonder whether inplace actually works.
df["Name"].replace(["Bill"], "William", inplace=True)
I still see: Bill
Try the following, passing your replacement mapping as a dictionary:
import pandas as pd
df = pd.DataFrame({'Name': ['Bill','James','Joe','John','Bill'], 'Age': [34, 21, 34, 45, 23]})
df.replace({'Bill': 'William'}, inplace=True)
#OR
df['Name'].replace({'Bill': 'William'}, inplace=True)
Indeed, this produces:
      Name  Age
0  William   34
1    James   21
2      Joe   34
3     John   45
4  William   23
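As an alternative to inplace=True, the result can be assigned back to the column; this sidesteps the chained-assignment pitfall where replace called on a column selection may act on a temporary copy and leave the original dataframe untouched. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bill', 'James', 'Joe', 'John', 'Bill'],
                   'Age': [34, 21, 34, 45, 23]})

# assign the result back instead of relying on inplace=True on a
# column selection, which may operate on a temporary copy
df['Name'] = df['Name'].replace({'Bill': 'William'})
```

The assignment form also keeps working in recent pandas versions, where inplace replacement on a selection is discouraged.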
I saw that it is possible to call groupby and then agg so that pandas produces a new dataframe that groups the old dataframe by the fields you specify and then aggregates the fields you specify with some function (sum in the example below).
However, when I wrote the following:
# initialize list of lists
data = [['tom', 10, 100], ['tom', 15, 200], ['nick', 15, 150], ['juli', 14, 140]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age', 'salary'])
# trying to groupby and agg
grouping_vars = ['Name']
nlg_study_grouped = df(grouping_vars,axis = 0).agg({'Name': sum}).reset_index()
Name   Age  salary
tom     10     100
tom     15     200
nick    15     150
juli    14     140
I am expecting the output to look like this (because it is grouping by Name and then summing the field salary):
Name   salary
tom       300
nick      150
juli      140
The code works in someone else's example, but my toy example is producing this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-6fb9c0ade242> in <module>
1 grouping_vars = ['Name']
2
----> 3 nlg_study_grouped = df(grouping_vars,axis = 0).agg({'Name': sum}).reset_index()
TypeError: 'DataFrame' object is not callable
I wonder if I missed something dumb.
You can try this
print(df.groupby('Name').sum()['salary'])
To use multiple functions
print(df.groupby(['Name'])['salary']
.agg([('average','mean'),('total','sum'),('product','prod')])
.reset_index())
If you want to group by multiple columns, add the column names to the list passed to groupby.
Ex: df.groupby(['Name','AnotherColumn'])...
Further, you can refer to this question:
Aggregation in Pandas
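For completeness, the line from the question can be repaired directly: the TypeError comes from calling the DataFrame like a function (df(...)) instead of calling .groupby(...) on it, and the aggregation should target salary rather than the grouping key Name. A sketch:

```python
import pandas as pd

data = [['tom', 10, 100], ['tom', 15, 200], ['nick', 15, 150], ['juli', 14, 140]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'salary'])

grouping_vars = ['Name']
# df(grouping_vars, axis=0) tried to call the DataFrame, hence the
# TypeError; .groupby(grouping_vars) is the method that was intended
nlg_study_grouped = df.groupby(grouping_vars).agg({'salary': 'sum'}).reset_index()
```

Note that groupby sorts the group keys alphabetically by default; pass sort=False to keep the original order.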
1,100
2,200
3,300
...
many datas
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many datas
...
2021-11-30, jick doe, 921
If I have a CSV file like the one above,
how can I split it into two dataframes, without a loop or other row-by-row computation?
I would approach it like this:
import pandas as pd
data = 'file.csv'
df = pd.read_csv(data ,names=['a', 'b', 'c']) # I have to name columns
df_1 = df[~df['c'].isnull()] # rows that have the 3rd column
df_2 = df[df['c'].isnull()]  # rows that have only two columns
My second idea was to first find the index of the row where the data switches from 2 to 3 columns.
import pandas as pd
import numpy as np
data = 'stack.csv'
df = pd.read_csv(data ,names=['a', 'b', 'c'])
rows = df['c'].index[df['c'].apply(np.isnan)]
df_1 = pd.read_csv(data ,names=['a', 'b','c'],skiprows=rows[-1]+1)
df_2 = pd.read_csv(data ,names=['a', 'b'],nrows = rows[-1]+1)
I think you can easily modify the code if the files change.
Here is the reason why I named the columns: link
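The isnull split can be checked end-to-end on an inline sample; io.StringIO stands in for the real file, and the sample rows are shortened from the question:

```python
import io
import pandas as pd

# inline sample standing in for the mixed 2/3-column file from the question
t = "1,100\n2,200\n2021-01-01, jane doe, 321\n2021-01-10, john doe, 211\n"
df = pd.read_csv(io.StringIO(t), names=['a', 'b', 'c'])

df_1 = df[~df['c'].isnull()]  # rows that have the 3rd column
df_2 = df[df['c'].isnull()]   # rows that have only two columns
```

The two-column rows get NaN in column c, which is what makes the boolean split work.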
I'm quite new to pandas and need a bit of help. I have a column with ages and need to make groups out of them:
Young people: age≤30
Middle-aged people: 30<age≤60
Old people:60<age
Here is the code, but it gives me an error:
def get_num_people_by_age_category(dataframe):
    young, middle_aged, old = (0, 0, 0)
    dataframe["age"] = pd.cut(x=dataframe['age'], bins=[30,31,60,61], labels=["young","middle_aged","old"])
    return young, middle_aged, old

ages = get_num_people_by_age_category(dataframe)
print(dataframe)
Code below gets the age groups using pd.cut().
# Import libraries
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'age': [1,20,30,31,50,60,61,80,90] #np.random.randint(1,100,50)
})

# Function: Copy-pasted from question and modified
def get_num_people_by_age_category(df):
    df["age_group"] = pd.cut(x=df['age'], bins=[0,30,60,100], labels=["young","middle_aged","old"])
    return df

# Call function
df = get_num_people_by_age_category(df)
Output
print(df)
   age    age_group
0    1        young
1   20        young
2   30        young
3   31  middle_aged
4   50  middle_aged
5   60  middle_aged
6   61          old
7   80          old
8   90          old
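Since the original function's signature promised the three counts rather than the labelled dataframe, value_counts() can recover them from the age_group column; a sketch with the same bins as above:

```python
import pandas as pd

df = pd.DataFrame({'age': [1, 20, 30, 31, 50, 60, 61, 80, 90]})
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 60, 100],
                         labels=['young', 'middle_aged', 'old'])

# value_counts() turns the labelled column into the three counts the
# original function signature promised to return
counts = df['age_group'].value_counts()
young, middle_aged, old = counts['young'], counts['middle_aged'], counts['old']
```

With this sample data each bucket ends up with three people.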
I would like to select specific rows when reading a CSV with pandas, but I would also like to keep the last few columns (5 to 8 in my case) as a single column, because together they all represent "genres".
I have tried passing usecols=[0,1,2,np.arange(5,8)] to pd.read_csv, but it does not work.
If I use usecols=[0,1,2,5], I get just one genre in the last column and the others (6, 7, 8) are lost.
I have tried the following without success:
items = pd.read_csv(filename_item,
sep='|',
engine='python',
encoding='latin-1',
usecols=[0,1,2,np.arange(5,23)],
names=['movie_id', 'title', 'date','genres'])
My CSV looks like:
2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic
And I would like to get:
2 - Scream of Stone (Schrei aus Stein) - (1991) - 08-Mar-1996 - drama|comedia|fun|romantic
, where each part I separated with "-" should be a column of the dataframe.
Thank you
You may need to do this in two passes. First, read the csv in as-is:
In[56]:
import numpy as np
import pandas as pd
import io
t="""2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic"""
df = pd.read_csv(io.StringIO(t), sep='|', usecols=[0,1,2,3,*np.arange(6,10)], header=None)
df
Out[56]:
0 1 2 3 6 7 \
0 2 Scream of Stone (Schrei aus Stein) (1991) 08-Mar-1996 drama comedia
8 9
0 fun romantic
Then we can join all the genres together using apply:
In[57]:
df['genres'] = df.iloc[:,4:].apply('|'.join,axis=1)
df
Out[57]:
0 1 2 3 6 7 \
0 2 Scream of Stone (Schrei aus Stein) (1991) 08-Mar-1996 drama comedia
8 9 genres
0 fun romantic drama|comedia|fun|romantic
My solution is based on a piece of code proposed at:
How to pre-process data before pandas.read_csv()
The idea is to write a "file wrapper" class, which can be passed
to read_csv.
import re

class InFile(object):
    def __init__(self, infile):
        self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
        return self.__next__()

    def next(self):
        try:
            line = self.infile.readline()
            return re.sub(r'\|', ',', line, count=6)
        except:
            self.infile.close()
            raise StopIteration
Reformatting of each source line is performed by:
re.sub(r'\|', ',', line, count=6)
which changes the first 6 | chars into commas, so you can read the file without sep='|'.
To read your CSV file, run:
df = pd.read_csv(InFile('Films.csv'), usecols=[0, 1, 2, 3, 6],
names=['movie_id', 'title', 'prod', 'date', 'genres'])
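Under Python 3 the same pre-processing can be done in memory without the wrapper class; a sketch using the sample line from the question, with io.StringIO standing in for the real Films.csv:

```python
import io
import re
import pandas as pd

# sample line from the question; re.sub replaces the first 6 pipes with
# commas, leaving the remaining pipes to delimit the genres field
line = '2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic'
fixed = re.sub(r'\|', ',', line, count=6)

df = pd.read_csv(io.StringIO(fixed), usecols=[0, 1, 2, 3, 6],
                 names=['movie_id', 'title', 'prod', 'date', 'genres'])
```

For a whole file, the same re.sub would be applied per line before handing the buffer to read_csv.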
This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 4 years ago.
I'm new to pandas and this is my first question on Stack Overflow; I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of a file corresponds to a record whose fields start at fixed positions and have fixed lengths. There are different kinds of records in the same file; all records share the first field, two characters that indicate the record type. As an example:
Some file:
01Jhon Smith 555-1234
03Cow Bos primigenius taurus 00401
01Jannette Jhonson 00100000000
...
field    start  length
type       1      2     * common to all records, e.g. 01 = person, 03 = animal
name       3     10
surname   13     10
phone     23      8
credit    31     11
(the rest of the line is filled with spaces)
I'm writing some code to convert one record to a dictionary:
person1 = {'type': 01, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
person2 = {'type': 01, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
animal1 = {'type': 03, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
If a field is empty (filled with spaces), it will not appear in the dictionary.
From all records of one kind I want to create a pandas DataFrame with the dict keys as column names; I've tried pandas.DataFrame.from_dict() without success.
And here comes my question: is there any way to do this with pandas so that the dict keys become column names? Is there any other standard method for dealing with this kind of file?
To make a DataFrame from a dictionary, you can pass a list of dictionaries:
>>> person1 = {'type': 01, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
>>> person2 = {'type': 01, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
>>> animal1 = {'type': 03, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
>>> pd.DataFrame([person1])
name phone surname type
0 Jhon 555-1234 Smith 1
>>> pd.DataFrame([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
>>> pd.DataFrame.from_dict([person1, person2])
credit name phone surname type
0 NaN Jhon 555-1234 Smith 1
1 1000000 Jannette NaN Jhonson 1
For the more fundamental issue of two differently-formatted record types intermixed in one file, and assuming the files aren't so big that we can't read them and store them in memory, I'd use StringIO to make an object which is sort of like a file but which only has the lines we want, and then use read_fwf (fixed-width-file). For example:
from StringIO import StringIO

def get_filelike_object(filename, line_prefix):
    s = StringIO()
    with open(filename, "r") as fp:
        for line in fp:
            if line.startswith(line_prefix):
                s.write(line)
    s.seek(0)
    return s
and then
>>> type01 = get_filelike_object("animal.dat", "01")
>>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
widths=[2, 10, 10, 8, 11], header=None)
>>> df
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 100000000
should work. Of course you could also separate the files into different types before pandas ever sees them, which might be easiest of all.
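A Python 3 version of the two steps can be run end-to-end on in-memory sample lines built to the question's layout; io.StringIO replaces both the input file and the Python 2 StringIO module:

```python
import io
import pandas as pd

def get_filelike_object(lines, line_prefix):
    # keep only the lines of one record type, as in the answer above
    s = io.StringIO()
    for line in lines:
        if line.startswith(line_prefix):
            s.write(line)
    s.seek(0)
    return s

# fixed-width sample records padded to the layout from the question:
# type (2), name (10), surname (10), phone (8), credit (11)
lines = [
    '01' + 'Jhon'.ljust(10) + 'Smith'.ljust(10) + '555-1234'.ljust(8) + ' ' * 11 + '\n',
    '03' + 'Cow'.ljust(10) + 'Bos'.ljust(10) + ' ' * 8 + ' ' * 11 + '\n',
    '01' + 'Jannette'.ljust(10) + 'Jhonson'.ljust(10) + ' ' * 8 + '00100000000' + '\n',
]

type01 = get_filelike_object(lines, '01')
df = pd.read_fwf(type01, names='type name surname phone credit'.split(),
                 widths=[2, 10, 10, 8, 11], header=None)
```

The animal record is filtered out before read_fwf ever sees it, so only the person rows are parsed.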