I have some large files with several category columns. Category is kind of a generous word too because these are basically descriptions/partial sentences.
Here are the unique values per category:
Category 1 = 15
Category 2 = 94
Category 3 = 294
Category 4 = 401
Location 1 = 30
Location 2 = 60
There are also user columns with recurring data (first name, last name, IDs, etc.).
I was thinking of the following solutions to make the file size smaller:
1) Create a file which matches each category with a unique integer
2) Create a map (is there a way to do this from reading another file? Like I would create a .csv and load it as another dataframe and then match it? Or do I literally have to type it out initially?)
OR
3) Basically do a join (VLOOKUP) and then del the old column with the long object names
pd.merge(df1, categories, on = 'Category1', how = 'left')
del df1['Category1']
What do people normally do in this case? These files are pretty huge. 60 columns and most of the data are long, repeating categories and timestamps. Literally no numerical data at all. It's fine for me, but sharing the files is almost impossible due to shared drive space allocations for more than a few months.
To benefit from Categorical dtype when saving to csv you might want to follow this process:
Extract your Category definitions into separate dataframes / files
Convert your Categorical data to int codes
Save converted DataFrame to csv along with definitions dataframes
When you need to use them again:
Restore dataframes from csv files
Map dataframe with int codes to category definitions
Convert mapped columns to Categorical
To illustrate the process:
Make a sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null object
locations 100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
Note the size: 2.3+ MB - this would be roughly the size of your csv file.
Now convert these data to Categorical:
df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the drop in memory usage down to 976.6 KB
But if you would save it to csv now:
df.to_csv('test1.csv')
...you would see this inside the file:
index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2
Which means 'Categorical' has been converted to strings for saving in csv.
So let's get rid of the labels in Categorical data after we save the definitions:
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)
locations
index
0 Location1
1 Location2
Now convert the Categorical columns to their int codes:
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int8
locations 100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None
Save converted data to csv and note that the file now has only numbers without labels.
The file size will also reflect this change.
df.to_csv('test2.csv')
index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
Save the definitions as well:
categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')
When you need to restore the files, load them from csv files:
df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int64
locations 100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
print(categories_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories 4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None
locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())
locations
index
0 Location1
1 Location2
print(locations_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations 2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None
Now use map to replace the int codes with the category descriptions and convert the mapped columns back to Categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())
categories locations
index
0 Category1 Location1
1 Category2 Location2
2 Category3 Location1
3 Category4 Location2
4 Category1 Location1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the memory usage back to what it was when we first converted data to Categorical.
It should not be hard to automate this process if you need to repeat it many times.
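For reference, here is one way the whole encode/decode round trip could be wrapped into two helpers. This is only a minimal sketch: the function names encode_categoricals / decode_categoricals and the '<prefix>_<column>.csv' naming scheme are my own illustrative choices, not part of the process above.

import pandas as pd

def encode_categoricals(df, cols, prefix='defs'):
    # Replace each listed column with its int codes and save the category
    # labels to '<prefix>_<col>.csv' so they can be restored later.
    # (The file-naming scheme is an illustrative assumption.)
    for col in cols:
        cat = df[col].astype('category')
        pd.Series(cat.cat.categories, name=col).to_csv(
            '{}_{}.csv'.format(prefix, col), index_label='code')
        df[col] = cat.cat.codes
    return df

def decode_categoricals(df, cols, prefix='defs'):
    # Map the int codes back to their labels and restore the Categorical dtype.
    for col in cols:
        labels = pd.read_csv('{}_{}.csv'.format(prefix, col), index_col='code')[col]
        df[col] = df[col].map(labels).astype('category')
    return df

With these, saving is encode_categoricals(df, ['categories', 'locations']) followed by df.to_csv('test2.csv'), and restoring is decode_categoricals(df2, ['categories', 'locations']) after reading the csv back in.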
Pandas has a Categorical data type that does just that. It basically maps the categories to integers behind the scenes.
Internally, the data structure consists of a categories array and an
integer array of codes which point to the real value in the categories
array.
Documentation is here.
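As a quick illustration of those two internal pieces (a throwaway Series, just to show what the categories array and the codes array look like):

import pandas as pd

s = pd.Series(['Category1', 'Category2', 'Category1'], dtype='category')
print(s.cat.categories)         # Index(['Category1', 'Category2'], dtype='object')
print(s.cat.codes.tolist())     # [0, 1, 0] -- integer pointers into the categories array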
Here's a way to save a dataframe with Categorical columns in a single .csv:
Example:
------        -------
Fatcol        Thincol: unique strings once, then numbers
------        -------
"Alberta"     "Alberta"
"BC"          "BC"
"BC"          2          -- string 2
"Alberta"     1          -- string 1
"BC"          2
...
The "Thincol" on the right can be saved as is in a .csv file,
and expanded to the "Fatcol" on the left after reading it in;
this can halve the size of big .csv s with repeated strings.
Functions
---------
fatcol( col: Thincol ) -> Fatcol, list[ unique str ]
thincol( col: Fatcol ) -> Thincol, dict( unique str -> int ), list[ unique str ]
Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists:
Fatcol: list of strings
Thincol: list of strings or ints or NaN s
If a `col` is a `pandas.Series`, its `.values` are used.
This cut a 700M .csv to 248M -- but write_csv runs at ~ 1 MB/sec on my iMac.
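The answer only gives the signatures, so here is a rough sketch of what thincol and fatcol might look like. This is my own guess at an implementation, not the author's code, and it ignores NaN handling for brevity:

import pandas as pd

def thincol(col):
    # First occurrence of each string stays a string; repeats become 1-based int codes.
    values = col.values if isinstance(col, pd.Series) else list(col)
    str_to_int, uniques, out = {}, [], []
    for v in values:
        if v in str_to_int:
            out.append(str_to_int[v])      # repeat -> int code
        else:
            uniques.append(v)
            str_to_int[v] = len(uniques)   # 1-based code
            out.append(v)                  # first occurrence -> keep the string
    return out, str_to_int, uniques

def fatcol(col):
    # Expand a thin column: ints are looked up in the running list of unique strings.
    values = col.values if isinstance(col, pd.Series) else list(col)
    uniques, out = [], []
    for v in values:
        if isinstance(v, str):
            uniques.append(v)              # strings appear once each, in code order
            out.append(v)
        else:
            out.append(uniques[int(v) - 1])  # 1-based code -> string
    return out, uniques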
Related
I have a data frame:
pd.DataFrame({'A': range(1, 10000)})
I can get a nice human-readable thing saying that it has a memory usage of 78.2 KB using df.info():
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
I can get an unhelpful statement with similar effect using df.memory_usage() (and this is how Pandas itself calculates its own memory usage) but would like to avoid having to roll my own. I've looked at the df.info source and traced the source of the string all the way to this line.
How is this specific string generated and how can I pull that out so I can print it to a log?
N.B. I can't parse the df.info() output because it prints directly to a buffer and the call itself returns None.
N.B. This line also does not help; what is initialised there is merely a boolean flag for whether memory usage should be printed at all.
You can create an instance of pandas.io.formats.info.DataFrameInfo and read the memory_usage_string property, which is exactly what df.info() does:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
>>> pd.io.formats.info.DataFrameInfo(df).memory_usage_string.strip()
'78.2 KB'
If you're passing memory_usage to df.info, you can pass it directly to DataFrameInfo:
pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()
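Since pandas.io.formats.info is an internal module whose layout can shift between versions, a fallback that avoids it is to format df.memory_usage() yourself. A minimal sketch (it mirrors the base-1024 unit scaling df.info() uses, though it omits the '+' qualifier pandas adds for shallow object-column estimates; df is the example frame above):

def memory_usage_string(df, deep=False):
    # Total bytes across all columns plus the index, scaled to a human-readable unit.
    n = float(df.memory_usage(deep=deep, index=True).sum())
    for unit in ('bytes', 'KB', 'MB', 'GB', 'TB'):
        if n < 1024:
            return '{:.1f} {}'.format(n, unit)
        n /= 1024
    return '{:.1f} PB'.format(n)

print(memory_usage_string(df))   # '78.2 KB' for the example frame above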
This question already has an answer here:
How can I get the name of grouping columns from a Pandas GroupBy object?
Given a grouped DataFrame (obtained by df.groupby([col1, col2])) I would like to obtain the grouping variables (col1 and col2 in this case).
For example, from the GroupBy user guide
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
("bird", "Falconiformes", 389.0),
("bird", "Psittaciformes", 24.0),
("mammal", "Carnivora", 80.2),
("mammal", "Primates", np.nan),
("mammal", "Carnivora", 58),
],
index=["falcon", "parrot", "lion", "monkey", "leopard"],
columns=("class", "order", "max_speed"),
)
grouped = df.groupby(["class", "order"])
Given grouped I would like to get class and order. However, grouped.indices and grouped.groups contain only the values of the keys, not the column names.
The column names must be in the object somewhere, because if I run grouped.size() for example, they are included in the indices:
class order
bird Falconiformes 1
Psittaciformes 1
mammal Carnivora 2
Primates 1
dtype: int64
And therefore I can run grouped.size().index.names which returns FrozenList(['class', 'order']). But this is doing an unnecessary calculation of .size(). Is there a nicer way of retrieving these from the object?
The ultimate reason I'd like this is so that I can do some processing for a particular group, and associate it with a key-value pair which defines the group. That way I would be able to amalgamate different grouped datasets with arbitrary levels of grouping. For example I could have
group max_speed
class=bird,order=Falconiformes 389.0
class=bird,order=Psittaciformes 24.0
class=bird 206.5
foo=bar 45.5
...
Very similar to your own suggestion, you can extract the grouped by column names using:
grouped.dtypes.index.names
It is not shorter, but you avoid calling a method.
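To tie this back to the key=value labels the question is ultimately after, here is a rough sketch that combines those names with an aggregation (continuing from the grouped object defined in the question). Note that grouped.dtypes has been deprecated in recent pandas releases, so on newer versions you may need grouped.size().index.names from the question or the grouped.keys attribute instead; the 'group' / 'max_speed' output columns are just illustrative:

import pandas as pd

names = grouped.dtypes.index.names          # FrozenList(['class', 'order'])
summary = grouped['max_speed'].mean()

rows = []
for key, value in summary.items():
    key = key if isinstance(key, tuple) else (key,)
    label = ','.join('{}={}'.format(n, v) for n, v in zip(names, key))
    rows.append((label, value))

print(pd.DataFrame(rows, columns=['group', 'max_speed']))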
A grouped DataFrame (obtained by df.groupby([col1, col2])) is a pandas.core.groupby.generic.DataFrameGroupBy object, so we have to convert it back into a DataFrame in order to get the column names.
df2 = pd.DataFrame(grouped.size().reset_index(name = "Group_Count"))
print(df2)
Output:
class order Group_Count
0 bird Falconiformes 1
1 bird Psittaciformes 1
2 mammal Carnivora 2
3 mammal Primates 1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 4 non-null object
1 order 4 non-null object
2 Group_Count 4 non-null int64
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
I think this would solve your problem of recovering the column names from the grouped data. df.groupby(cols) returns a DataFrameGroupBy object; in order to convert it back to a DataFrame you need to apply one of the aggregation methods such as mean(), count() or size().
I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes, with variables of one type (in this instance categorical) which get changed to objects when I concatenate them. The df is massive and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files and set the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = []  # Create the list
counter = 1   # Start a counter as I am importing around 4000 files
for i in modis_csv_files:  # Iterate over the files, importing and cleaning
    print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files)))  # Print progress
    output = pd.read_csv(i)  # Read the csv
    output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category'))  # Set the dtype for all the object variables that can be categories
    output['acq_date'] = output['acq_date'].astype('datetime64[ns]')  # Set the date variable
    outputs.append(output)  # Append to the list
    counter += 1  # Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything was converted as intended. Perhaps one of the 4000 dfs contained something that meant it could not be changed to categorical, which caused the whole variable to shift back to object when concatenated. Try checking each df in the list to see if either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list checking if dtype['variable'].name is categorical
    if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
        error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (it seemed excessive for now, however).
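For anyone puzzling over this, the behaviour can be reproduced with two tiny frames: pd.concat only keeps a column categorical when every input frame has exactly the same set of categories for it; otherwise it silently falls back to object. A minimal sketch (the column values are made up for illustration):

import pandas as pd

a = pd.DataFrame({'satellite': pd.Categorical(['Terra', 'Aqua'])})
b = pd.DataFrame({'satellite': pd.Categorical(['Terra'])})          # different category set
c = pd.DataFrame({'satellite': pd.Categorical(['Terra', 'Aqua'])})  # identical category set

print(pd.concat([a, b]).dtypes['satellite'])   # object -- category sets differ
print(pd.concat([a, c]).dtypes['satellite'])   # category -- category sets match

If preserving the dtype matters, the usual workarounds are pandas.api.types.union_categoricals or fixing a common category list on every frame before concatenating.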
I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe, with just a few values and plot the line graph for that dataframe and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git#github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
I have two data frames.
The first one has a header, with one row of data providing descriptions of the header columns (241 columns).
RangeIndex: 1 entries, 0 to 0
Columns: 241 entries, FILEID to B0_235
dtypes: float64(6), object(235)
memory usage: 2.0+ KB
The second data frame has no header record and 241 columns. Here are the details:
RangeIndex: 11718 entries, 0 to 11717
Columns: 241 entries, 0 to 240
dtypes: float64(187), int64(52), object(2)
memory usage: 21.5+ MB
I want to append/merge dataframe 1 with dataframe 2 so that the result has 11719 records and the header from dataframe 1.
Thanks in advance.
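Assuming the goal is one frame with dataframe 1's column names on top of all 11719 rows, a minimal sketch (the variable names df_header and df_data are mine):

import pandas as pd

# df_header: 1 row, real column names (FILEID ... B0_235)
# df_data:   11718 rows, default integer column names 0..240

df_data.columns = df_header.columns                        # give the data frame the real header
combined = pd.concat([df_header, df_data], ignore_index=True)
print(combined.shape)                                      # expected: (11719, 241)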