Create a pandas DataFrame from multiple dicts [duplicate] - pandas

This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 4 years ago.
I'm new to pandas and this is my first question on Stack Overflow; I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of the file corresponds to a record whose fields start at fixed positions and have fixed lengths. There are different kinds of records in the same file; all records share the first field, which is two characters long and indicates the type of record. As an example:
Some file:
01Jhon Smith 555-1234
03Cow Bos primigenius taurus 00401
01Jannette Jhonson 00100000000
...
field    start  length
type     1      2       * common to all records, example: 01 = person, 03 = animal
name     3      10
surname  13     10
phone    23     8
credit   31     11
(the rest of the line is filled with spaces)
I'm writing some code to convert one record to a dictionary:
person1 = {'type': '01', 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
person2 = {'type': '01', 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
animal1 = {'type': '03', 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
If a field is empty (filled with spaces), it will not be in the dictionary.
From all the records of one kind I want to create a pandas DataFrame with the dict keys as column names; I've tried pandas.DataFrame.from_dict() without success.
And here comes my question: is there any way to do this with pandas so that the dict keys become column names? Is there any other standard way to deal with this kind of file?

To make a DataFrame from a dictionary, you can pass a list of dictionaries:
>>> person1 = {'type': '01', 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
>>> person2 = {'type': '01', 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
>>> animal1 = {'type': '03', 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
>>> pd.DataFrame([person1])
   name     phone surname type
0  Jhon  555-1234   Smith   01
>>> pd.DataFrame([person1, person2])
    credit      name     phone  surname type
0      NaN      Jhon  555-1234    Smith   01
1  1000000  Jannette       NaN  Jhonson   01
>>> pd.DataFrame.from_dict([person1, person2])
    credit      name     phone  surname type
0      NaN      Jhon  555-1234    Smith   01
1  1000000  Jannette       NaN  Jhonson   01
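In practice you would collect the parsed dicts in a list while reading the file and build the frame once at the end. A minimal sketch (the file name and the parse_person helper are placeholders for your own record-to-dict code):

records = []
with open("some_file.txt") as fp:        # placeholder file name
    for line in fp:
        if line.startswith("01"):
            records.append(parse_person(line))  # hypothetical: your fixed-width parsing
people = pd.DataFrame(records)  # dict keys become the column names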
For the more fundamental issue of two differently-formatted record types intermixed in the same file, and assuming the files aren't so big that we can't read them and store them in memory, I'd use StringIO to make an object which is sort of like a file but which only has the lines we want, and then use read_fwf (fixed-width file). For example:
from io import StringIO  # on Python 2: from StringIO import StringIO

def get_filelike_object(filename, line_prefix):
    s = StringIO()
    with open(filename, "r") as fp:
        for line in fp:
            if line.startswith(line_prefix):
                s.write(line)
    s.seek(0)
    return s
and then
>>> type01 = get_filelike_object("animal.dat", "01")
>>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
...                  widths=[2, 10, 10, 8, 11], header=None)
>>> df
  type      name  surname     phone     credit
0    1      Jhon    Smith  555-1234        NaN
1    1  Jannette  Jhonson       NaN  100000000
should work. Of course you could also separate the files into different types before pandas ever sees them, which might be easiest of all.
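If you prefer the pre-splitting route, a small sketch of that idea (records.dat is a hypothetical input name) writes one file per two-character type code and lets read_fwf handle each part separately:

from collections import defaultdict

parts = defaultdict(list)
with open("records.dat") as fp:          # hypothetical mixed-record input file
    for line in fp:
        parts[line[:2]].append(line)     # group lines by their 2-character type code

for type_code, lines in parts.items():
    with open("records_%s.dat" % type_code, "w") as out:
        out.writelines(lines)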

Related

How do I prevent str.contains() from searching for a sub-string?

I want Pandas to search my data frame for the complete string and not a sub-string. Here is a minimal working example to explain my problem -
data = [['tom', 'wells fargo', 'retired'], ['nick', 'bank of america', 'partner'], ['juli', 'chase', 'director - oil well']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(val, case=False)).any(axis="columns")]
The correct code would have returned only the second of these rows, not the first one:
Name Place Position
0 tom wells fargo retired
2 juli chase director - oil well
Update - My intention is to have a search that looks for the exact string requested. While looking for "well" the search shouldn't also match "wells". Based on the comments, I understand how my question might be misleading.
IIUC, you can use:
>>> df[df['Position'].str.contains(fr'\b{val}\b')]
   Name  Place             Position
2  juli  chase  director - oil well
And for all columns:
>>> df[df.apply(lambda x: x.str.contains(fr'\b{val}\b', case=False)).any(axis=1)]
   Name  Place             Position
2  juli  chase  director - oil well
The regular expression anchor \b, which is a word boundary, is what you want.
I added additional data to your code to illustrate more:
import pandas as pd

data = [
    ['tom', 'wells fargo', 'retired'],
    ['nick', 'bank of america', 'partner'],
    ['john', 'bank of welly', 'blah'],
    ['jan', 'bank of somewell knwon', "well that's it"],
    ['juli', 'chase', 'director - oil well'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(fr"\b{val}\b", case=False)).any(axis="columns")]
EDIT
In Python 3 the variable can be substituted into the string with f in front of the " or ', and the r prefix marks it as a raw string so the regex escapes survive; combining them as fr"..." lets you insert val into the pattern. Thanks @smci.
and the output is like this
   Name                   Place             Position
3   jan  bank of somewell knwon       well that's it
4  juli                   chase  director - oil well
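If val can contain characters that mean something in a regular expression, a slight variation (a sketch, reusing the same df and val) escapes it first with re.escape:

import re

pattern = fr"\b{re.escape(val)}\b"   # escape val in case it holds regex metacharacters
df.loc[df.apply(lambda col: col.str.contains(pattern, case=False)).any(axis="columns")]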

What is the smartest way to read a csv with multiple types of data mixed in?

1,100
2,200
3,300
...
many rows
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many rows
...
2021-11-30, jick doe, 921
If I have a csv file like the above,
how can I separate it into 2 dataframes, without a loop or extra computation?
I see it like this:
import pandas as pd
data = 'file.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])  # I have to name the columns
df_1 = df[~df['c'].isnull()]  # rows that have the 3rd column
df_2 = df[df['c'].isnull()]   # rows that have only two columns
A second idea is to first find the index of the row where the data switches from 2 to 3 columns.
import pandas as pd
import numpy as np
data = 'stack.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])
rows = df['c'].index[df['c'].apply(np.isnan)]
df_1 = pd.read_csv(data, names=['a', 'b', 'c'], skiprows=rows[-1]+1)
df_2 = pd.read_csv(data, names=['a', 'b'], nrows=rows[-1]+1)
I think you can easily modify the code if the file changes.
Here is the reason why I named the columns: link
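As a follow-up sketch (building on the first snippet above): the mixed read typically leaves columns 'a' and 'b' as object dtype, so you may want to restore proper dtypes after splitting.

df_2 = df_2.drop(columns='c')          # the two-column part has no third field
df_2['a'] = df_2['a'].astype(int)      # numeric ids
df_1['a'] = pd.to_datetime(df_1['a'])  # the three-column part starts with dates
df_1['c'] = df_1['c'].astype(int)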

Is there a function to find the index of a float value in a column using pandas?

Hi so I have a dataframe df with a numeric index, a datetime column, and ozone concentrations, among several other columns. But here's a list of the important columns regarding my question.
index, date, ozone
0, 4-29-2018, 55.4375
1, 4-29-2018, 52.6375
2, 5-2-2018, 50.4128
3, 5-2-2018, 50.3
4, 5-3-2018, 50.3
5, 5-4-2018, 51.8845
I need to call the index value of a row based on the column value. However, multiple rows have a column value of 50.3. First, how do I find the index value based on a specific column value? I've tried:
np.isclose(df['ozone'], 50.3).argmax() from Getting the index of a float in a column using pandas
but this only gives me the first index at which the number appears. Is there a way to call the index based on two parameters (like asking for the index value when datetime = 5-2-2018 and ozone = 50.3)?
I've also tried df.loc but it doesn't work for floating points.
here's some sample code:
df = pd.read_csv('blah.csv')
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
date = pd.to_datetime(df.index)
dv = df.groupby([date.month,date.day]).mean()
dv.drop(columns=['dir_5408'], inplace=True)
df['ozone'] = df.oz_3186.rolling('8H', min_periods=2).mean().shift(-4)
ozone = df.groupby([date.month,date.day])['ozone'].max()
df['T_mda8_3186'] = df.Temp_C_3186.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_3186 = df.groupby([date.month,date.day])['T_mda8_3186'].max()
df['T_mda8_5408'] = df.Temp_C_5408.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_5408 = df.groupby([date.month,date.day])['T_mda8_5408'].max()
df['ws_mda8_5408'] = df.ws_5408.rolling('8H', min_periods=2).mean().shift(-4)
ws_mda8_5408 = df.groupby([date.month,date.day])['ws_mda8_5408'].max()
dv_MDA8 = df.drop(columns=['Temp_C_3186', 'Temp_C_5408', 'dir_5408', 'ws_5408', 'u_5408', 'v_5408', 'rain(mm)_5724',
                           'rain(mm)_5408', 'rh_3186', 'rh_5408', 'pm10_5408', 'pm10_3186', 'pm25_5408', 'oz_3186'])
dv_MDA8.reset_index(inplace=True)
I need the date as a datetime index for the beginning of my code.
Thanks in advance for your help.
This is what you might be looking for,
import pandas as pd
import datetime

data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'date': ['4-29-2018', '4-29-2018', '5-2-2018', '5-2-2018', '5-3-2018', '5-4-2018'],
    'ozone': [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845]
})
data.set_index(['index'], inplace=True)
data['date'] = data['date'].apply(lambda x: datetime.datetime.strptime(x, '%m-%d-%Y'))
data['ozone'] = data['ozone'].astype('float')
data.loc[(data['date'] == datetime.datetime.strptime('5-3-2018', '%m-%d-%Y'))
         & (data['ozone'] == 50.3)]
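Since exact equality on floats can be fragile, a small variation of the last line (a sketch reusing the same data frame) combines np.isclose with the date condition and returns every matching index value:

import numpy as np

mask = np.isclose(data['ozone'], 50.3) & (data['date'] == datetime.datetime.strptime('5-2-2018', '%m-%d-%Y'))
data.index[mask]  # all index labels where both conditions hold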
The index identifies each row, so you can find the indexes you need and store/use them later, as long as, of course, the index of df has not changed.
Code:
import pandas as pd
import numpy as np

students = [('jack', 34, 'Sydeny', 'Engineering'),
            ('Sachin', 30, 'Delhi', 'Medical'),
            ('Aadi', 16, 'New York', 'Computer Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Mumbai', 'Information Security'),
            ('Aadi', 40, 'London', 'Arts'),
            ('Sachin', 30, 'Delhi', 'Medical')]

df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])
print(df)

ind_name_sub = df.where((df.Name == 'Riti') & (df.Subject == 'Data Science')).dropna().index
# similarly you can build your ind_date_ozone, which may have one or more values
print(ind_name_sub)
print(df.loc[ind_name_sub])
Output:
     Name  Age      City               Subject
0    jack   34    Sydeny           Engineering
1  Sachin   30     Delhi               Medical
2    Aadi   16  New York      Computer Science
3    Riti   30     Delhi          Data Science
4    Riti   30     Delhi          Data Science
5    Riti   30    Mumbai  Information Security
6    Aadi   40    London                  Arts
7  Sachin   30     Delhi               Medical
Int64Index([3, 4], dtype='int64')
   Name  Age   City       Subject
3  Riti   30  Delhi  Data Science
4  Riti   30  Delhi  Data Science
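The same indexes can also be obtained with a plain boolean mask, which avoids the NaN-producing where/dropna step; a sketch on the same df:

ind_name_sub = df.index[(df.Name == 'Riti') & (df.Subject == 'Data Science')]
print(ind_name_sub)  # Int64Index([3, 4], dtype='int64')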

How to put a range of columns into one column when reading a csv with pandas?

I would like to select specific columns when reading a csv with pandas, but I also would like to keep columns 5 to 8 as a single column, because they all represent "genres" in my case.
I have tried passing usecols=[0,1,2,np.arange(5,8)] to pd.read_csv, but it does not work.
If I use usecols=[0,1,2,5], I just get one genre in the last column and the others (6, 7, 8) are lost.
I have tried the following without success:
items = pd.read_csv(filename_item,
                    sep='|',
                    engine='python',
                    encoding='latin-1',
                    usecols=[0, 1, 2, np.arange(5, 23)],
                    names=['movie_id', 'title', 'date', 'genres'])
My CSV looks like:
2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic
And I would like to get:
2 - Scream of Stone (Schrei aus Stein) - (1991) - 08-Mar-1996 - drama|comedia|fun|romantic
where each part separated by "-" should be a column of the dataframe.
Thank you
You may need to do this in 2 passes. Firstly, read the csv in as-is:
In[56]:
import io
import numpy as np
import pandas as pd

t = """2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic"""
df = pd.read_csv(io.StringIO(t), sep='|', usecols=[0, 1, 2, 3, *np.arange(6, 10)], header=None)
df
Out[56]:
   0                                   1       2            3      6        7    8         9
0  2  Scream of Stone (Schrei aus Stein)  (1991)  08-Mar-1996  drama  comedia  fun  romantic
Then we can join all the genres together using apply:
In[57]:
df['genres'] = df.iloc[:, 4:].apply('|'.join, axis=1)
df
Out[57]:
   0                                   1       2            3      6        7    8         9                      genres
0  2  Scream of Stone (Schrei aus Stein)  (1991)  08-Mar-1996  drama  comedia  fun  romantic  drama|comedia|fun|romantic
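Optionally (a sketch continuing from the frame above), you can then drop the now-redundant single-genre columns and give the remaining fields names:

df = df.drop(columns=df.columns[4:8])  # drop the individual genre columns 6-9
df.columns = ['movie_id', 'title', 'prod', 'date', 'genres']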
My solution is based on a piece of code proposed at:
How to pre-process data before pandas.read_csv()
The idea is to write a "file wrapper" class, which can be passed to read_csv.
import re

class InFile(object):
    def __init__(self, infile):
        self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
        return self.__next__()

    def next(self):
        try:
            line = self.infile.readline()
            return re.sub(r'\|', ',', line, count=6)
        except:
            self.infile.close()
            raise StopIteration
Reformatting of each source line is performed by:
re.sub(r'\|', ',', line, count=6)
which changes the first 6 | chars into commas, so you can read it without sep='|'.
To read your CSV file, run:
df = pd.read_csv(InFile('Films.csv'), usecols=[0, 1, 2, 3, 6],
                 names=['movie_id', 'title', 'prod', 'date', 'genres'])
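The same pre-processing can also be done without a custom class (a sketch of the same idea): read the file, rewrite each line into an in-memory buffer, and hand that buffer to read_csv.

import io
import re

with open('Films.csv') as fh:
    buf = io.StringIO(''.join(re.sub(r'\|', ',', line, count=6) for line in fh))

df = pd.read_csv(buf, usecols=[0, 1, 2, 3, 6],
                 names=['movie_id', 'title', 'prod', 'date', 'genres'])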

How to implement a where clause in Python

I want to replicate what a where clause does in SQL, using Python. Often the conditions in a where clause are complex, with multiple criteria. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement is: I want to select all columns, but only the rows where the first letter of the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is how can I do it more efficiently or is there any smarter way to do that ?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N']  # filtering the dataframe
print(newdata)

  Address  Age  YrsOfEducation  name
0      NY   23              10  john
1      NJ   32              15   tom
3      NY   42              12  rock
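Two equivalent one-liners (a sketch on the same df) may read even closer to a SQL where clause: str.startswith avoids the slicing, and query() puts the condition in a single string.

newdata = df[df['Address'].str.startswith('N')]
newdata = df.query("Address.str.startswith('N')", engine='python')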