Storing .csv in HDF5 pandas

I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = r'C:\\Users\Documents\\'
(my dataframe with data is called 'export')
hdf = HDFStore(loc2+'consolidated.h5')
hdf.put('raw', export, format= 'table', complib= 'blosc', complevel=9, data_columns = True, append = True)
21 columns and about 12 million rows so far and I will add about 1 million rows per month.
1 Date column [I convert this to datetime64]
2 Datetime columns (one of them for each row and the other one is null about 70% of the time) [I convert this to datetime64]
9 text columns [I convert these to categorical which saves a ton of room]
1 float column
8 integer columns, 3 of these can reach a max of maybe a couple of hundred and the other 5 can only be 1 or 0 values
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data since I am receiving daily raw .csv files). I received errors which showed that the dtypes were not matching up for each column although I used the same exact ipython notebook.
Is my hdf.put code correct? If I have append = True, does that mean it will create the file if it does not exist, but append the data if it does exist? I will be appending to this file basically every day.
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space or should I keep it at int64? It looks like some of my columns are randomly float64 (although no decimals) and int64. I guess I need to specify the dtype for each column individually. Any tips?
I have no idea what blosc compression is. Is that the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to
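One way to avoid the dtype mismatch on append is to cast every daily frame to a single fixed schema before writing it. The following is a minimal sketch, not a definitive implementation: the column names (flag_a, count_a, amount, event_time), the SCHEMA mapping, and the helpers normalize and append_day are hypothetical, and filling missing integer values with 0 is an assumption that may not suit the real data. HDFStore.append writes the table on the first call and appends on later calls; it requires PyTables to be installed.

```python
import pandas as pd

# Hypothetical schema: the column names are placeholders, not the real ones.
SCHEMA = {
    "flag_a": "int8",     # 0/1 flag
    "count_a": "int16",   # small counts (a few hundred at most)
    "amount": "float64",
}
DATETIME_COLS = ["event_time"]


def normalize(df):
    """Cast one daily frame to the fixed schema so every append matches."""
    out = df.copy()
    for col, dtype in SCHEMA.items():
        if dtype.startswith("int"):
            # Assumption: a missing flag/count can be treated as 0.
            out[col] = out[col].fillna(0)
        out[col] = out[col].astype(dtype)
    for col in DATETIME_COLS:
        out[col] = pd.to_datetime(out[col])
    return out


def append_day(path, df, key="raw"):
    """Append one normalized day to the store (needs PyTables installed)."""
    with pd.HDFStore(path, complib="blosc", complevel=9) as store:
        store.append(key, normalize(df), format="table", data_columns=True)


# A day where a missing value turned an integer column into float64.
day = pd.DataFrame({
    "flag_a": [1.0, None, 0.0],
    "count_a": [120, 7, 300],
    "amount": [1.5, 2.0, 3.25],
    "event_time": ["2016-03-11 08:00", None, "2016-03-12 09:30"],
})
clean = normalize(day)
print(clean.dtypes)
```

An int8 value takes 1 byte in memory against 8 bytes for int64, so narrowing the 0/1 columns does save space in the dataframe; how much it saves on disk after compression is something to measure on the actual file.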

Related

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that is throwing a DtypeWarning for its columns (I understand this warning), but I am struggling to prevent it (I don't want to set low_memory to False, as I would like to specify the correct dtypes).
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the CSV; however, I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The DtypeWarning is not thrown when reading it in, I am guessing because of the size of the dataframe; however, the result is exactly the same as the problem I am facing. I would like to make the first 3 rows float32 and the last 3 rows strings so that they have the correct dtype. Thank you!
reproducible example:
df = pd.DataFrame([[0.1, 0.2,0.3],[0.1, 0.2,0.3],[0.1, 0.2,0.3],
['info1', 'info2','info3'],['info1', 'info2','info3'],['info1', 'info2','info3']],
index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
columns=['column1', 'column2', 'column3'] )
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
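A pandas column holds a single dtype, so the numeric rows and the metadata rows cannot keep different dtypes inside one dataframe. One possible workaround, sketched here under the assumption that the metadata always occupies the last 3 rows, is to read the file once as text and split it afterwards. The names csv_text, N_META, values, and meta are illustrative only.

```python
import io

import pandas as pd

# Same layout as the example above: numeric rows followed by 3 metadata rows.
csv_text = """,column1,column2,column3
index1,0.1,0.2,0.3
index2,0.1,0.2,0.3
index3,0.1,0.2,0.3
info1,info1,info2,info3
info2,info1,info2,info3
info3,info1,info2,info3
"""

N_META = 3  # assumption: the metadata always occupies the last 3 rows

# dtype=str skips type inference on the mixed columns, so no DtypeWarning.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)

values = raw.iloc[:-N_META].astype("float32")  # numeric block
meta = raw.iloc[-N_META:]                      # metadata stays as strings
print(values.dtypes)
print(meta)
```

The file is read only once; the split is done in memory, which keeps the numeric block compact while the metadata remains available in its own small dataframe.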

Write pandas data to a CSV file if column sums are greater than a specified value

I have a CSV file whose columns are frequency counts of words, and whose rows are time periods. I want to sum for each column the total frequencies. Then I want to write to a CSV file for sums greater than or equal to 30, the column and row values, thus dropping columns whose sums are less than 30.
Just learning python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.
I can read in the CSV file and compute the column sums.
df = pd.read_csv('data.csv')
Excerpt of data file containing 3,874 columns and 100 rows
df.sum(axis = 0, skipna = True)
Excerpt of sums for columns
I am stuck on how to create the output file so that it looks like the original file but no longer has columns whose sums were less than 30.
I am stuck on how to write to a CSV file each row for each column whose sums are greater than or equal to 30. The layout of the output file would be the same as for the input file. The sums would not be included in the output.
Thanks very much for your help.
So, here is a link showing an excerpt of a file containing 100 rows and 3,857 columns:
It's easiest to do this in two steps:
1. Filter the DataFrame to just the columns you want to save
df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]
.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and then the second part is the condition on our columns - I've taken the sum you'd given in your question and set it greater than or equal to 30.
2. Save the filtered DataFrame to CSV
df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)
Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.
See this answer: "How to delete a column in pandas dataframe based on a condition?". Note that the solution for your question doesn't need isnull() before the sum(), as that is specific to their question, which counts NaN values.
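The two steps can be tried on a small frame first. This is only a sketch: the word columns and counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical word counts: rows are time periods, columns are words.
df = pd.DataFrame({
    "apple": [10, 15, 8],
    "pear": [1, 2, 3],
    "plum": [20, 5, 5],
})

totals = df.sum(axis=0, skipna=True)  # apple 33, pear 6, plum 30
keep = totals >= 30                   # boolean mask, one entry per column
df_to_save = df.loc[:, keep]

# to_csv returns the CSV text when no path is given, handy for a quick check.
print(df_to_save.to_csv(index=False))
```

Only apple and plum survive the filter, and the rows are written unchanged, so the output has the same layout as the input minus the dropped columns.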

pandas : Indexing for thousands of rows in dataframe

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3 etc. I tried using this command -
data = data.loc['0':'50']
But the results were weird, it took all the rows from 0 to 49999, looks like it is taking rows till the index value starts with 50.
Similarly, I tried with this command - new_data = data.loc['0':'19']
and the result was all the rows, starting from 0 till 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example, to take the first 51 rows (positions 0 through 50):
data.iloc[:51, :]
Keep in mind that your index labels are numeric, not strings, so slicing with string bounds (as you have done in your question) compares the labels as strings rather than as numbers.
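The difference between positional and label-based slicing can be seen on a small frame. This sketch assumes the default integer index that read_csv creates when no index column is given; the column name value is made up.

```python
import pandas as pd

data = pd.DataFrame({"value": range(100)})  # default integer index 0..99

by_position = data.iloc[:51]  # positions 0..50, end position excluded
by_label = data.loc[0:50]     # labels 0..50, both endpoints included

print(len(by_position), len(by_label))
```

Both selections return the same 51 rows here. Note that .loc includes the end label while .iloc excludes the end position, which is why the two slices use different upper bounds.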

Working with dataframe / matrix to create an input for sklearn & Tensorflow

I am working with pandas / python /numpy / datalab/bigQuery to generate an input table for machine learning processing. The data is genomic - and right now, I am working with small subset of
174 rows
12430 columns
The column names are extracted from bigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard',use_cache=True))
In the same way, the row names are extracted: samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')
what would be the easiest way to create a dataframe / matrix with named rows and columns that were extracted.
I explored the dataframes in pandas and could not find the way to pass the names as parameter.
for empty array, I was able to find the following (numpy) with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns
I know R very well (if there is no other way - I hope that I can use it with datalab)
any idea?
Many thanks!
If you have your column names and row names stored in lists then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row name in the index.
Assuming the rows and columns are in variables row_names and col_names, do this.
df.loc[row_names, col_names]
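If the goal is to start from an empty table with named rows and columns, the DataFrame constructor accepts the names directly through its index and columns arguments. A minimal sketch, with hypothetical names standing in for the BigQuery results:

```python
import numpy as np
import pandas as pd

# Hypothetical names standing in for the BigQuery results.
row_names = ["sample_1", "sample_2", "sample_3"]
col_names = ["feature_a", "feature_b"]

# An all-NaN frame with the requested row and column labels.
df = pd.DataFrame(np.nan, index=row_names, columns=col_names)
df.index.name = "sample_id"

df.loc["sample_2", "feature_b"] = 1.5  # fill cells by label
subset = df.loc[["sample_1", "sample_2"], ["feature_b"]]
matrix = df.to_numpy()                 # plain 2-D array, e.g. for sklearn
print(df)
```

The labels live on the dataframe, while to_numpy() gives the unlabeled matrix that most machine learning libraries expect.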

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
From jreback on GitHub (https://github.com/pandas-dev/pandas/issues/11296): "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this."
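According to jreback's answer, ffill() is not optimized when you do a groupby, but cumsum() is. The idea is to forward fill the whole frame once, then blank out every cell whose group has not yet seen a non-null value in that column. Try this:
df = df.sort_values('id', kind='stable')
cols = df.columns.drop('id')
seen = df[cols].notnull().astype(int).groupby(df['id']).cumsum() > 0
df[cols] = df[cols].ffill().where(seen)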
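The two ideas in this thread (a per-group forward fill, and a global forward fill masked by a per-group running count of non-null values) can be checked against each other on a toy frame. This is a sketch under assumptions: the ids and locations are made up, and the masked variant relies on the rows of each id being contiguous.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["abc123", "abc123", "xyz987", "xyz987", "xyz987"],
    "location": ["NY", None, None, "CA", None],
})

# Per-group forward fill: values never cross from one id to the next.
per_group = df.copy()
per_group["location"] = df.groupby("id")["location"].ffill()

# Global forward fill, then blank every cell whose group has not yet
# seen a non-null value in that column (rows must be grouped by id).
value_cols = df.columns.drop("id")
seen = df[value_cols].notnull().astype(int).groupby(df["id"]).cumsum() > 0
masked = df.copy()
masked[value_cols] = df[value_cols].ffill().where(seen)

print(per_group)
print(masked)
```

In both results the first xyz987 row stays empty, because there is no earlier value for that id to carry forward, while the NY value from abc123 does not leak into it. Which variant is faster depends on the pandas version and the data, so it is worth timing both on the real 200,000-row frame.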