How to append new dataframe rows to a CSV using pandas?

I have a new dataframe; how do I append it to an existing CSV?
I tried the following code:
f = open('test.csv', 'w')
df.to_csv(f, sep='\t')
f.close()
But it doesn't append anything to test.csv. The CSV is big, so I only want to append, rather than read the whole CSV into a dataframe, concatenate the new rows, and write everything out to a new CSV. Is there a good way to solve this? Thanks.

Try this:
df.to_csv('test.csv', sep='\t', header=None, mode='a')
# NOTE: -----------------------------------> ^^^^^^^^
mode='a' opens the file for appending instead of overwriting it, and header=None keeps pandas from writing the column names a second time.
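If the file might not exist yet, a common variant (a minimal sketch, not part of the answer above) is to write the header only on the first write, so repeated appends never duplicate the column names:
import os
import pandas as pd

df = pd.DataFrame({'Name': ['tom'], 'Age': [10]})  # stand-in for your frame

path = 'test.csv'
# Append if the file exists; write the header only when creating it.
df.to_csv(path, sep='\t', mode='a', index=False,
          header=not os.path.exists(path))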

TL;DR: the answer from MaxU is correct.
df.to_csv('old_file.csv', header=None, mode='a')
I had the same problem, wanting to append to a DataFrame and save it to a CSV inside a loop. It seems to be a common pattern.
My criteria were:
Write back to the same file
Don't write more data than necessary
Keep appending new data to the dataframe during the loop
Save on each iteration (in case a long-running loop crashes)
Don't store the index in the CSV file
Note the different values of mode and header. For a complete write, mode='w' and header=True; for an append, mode='a' and header=False.
import pandas as pd

# Create a CSV test file with 3 rows
data = [['tom', 10], ['nick', 15], ['juli', 14]]
test_df = pd.DataFrame(data, columns=['Name', 'Age'])
test_df.to_csv('test.csv', mode='w', header=True, index=False)

# Read the CSV into a new frame
df = pd.read_csv('test.csv')
print(df)

# MAIN LOOP: create new data in a new DataFrame on each iteration
for i in range(0, 2):
    newdata = [['jack', i], ['jill', i]]
    new_df = pd.DataFrame(newdata, columns=['Name', 'Age'])

    # Write the new data to the CSV file in append mode
    new_df.to_csv('test.csv', mode='a', header=False, index=False)
    print('check test.csv')

    # Combine the new data into the frame ready for the next loop.
    test_df = pd.concat([test_df, new_df], ignore_index=True)

# At completion this shouldn't be necessary, but write out the complete
# data anyway; completed.csv and test.csv should be identical.
test_df.to_csv('completed.csv', mode='w', header=True, index=False)
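If the loop runs for many iterations, a variant worth considering (a sketch, not part of the original answer) is to open the file once and pass the handle to to_csv, so the CSV isn't reopened on every pass:
import pandas as pd

# Assumes test.csv already exists with a header row, as created above.
with open('test.csv', 'a', newline='') as f:
    for i in range(2):
        new_df = pd.DataFrame([['jack', i], ['jill', i]], columns=['Name', 'Age'])
        new_df.to_csv(f, header=False, index=False)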

Try the following code; it will generate an old file (10 rows) and a new file (2 rows) in your local folder. After I append, the new content is all mixed up:
import pandas as pd
import os

dir_path = os.path.dirname(os.path.realpath("__file__"))
print(dir_path)

raw_data = {'HOUR': [4, 9, 12, 7, 3, 15, 2, 16, 3, 21],
            'LOCATION': ['CA', 'HI', 'CA', 'IN', 'MA', 'OH', 'OH', 'MN', 'NV', 'NJ'],
            'TYPE': ['OLD', 'OLD', 'OLD', 'OLD', 'OLD', 'OLD', 'OLD', 'OLD', 'OLD', 'OLD'],
            'PRICE': [4, 24, 31, 2, 3, 25, 94, 57, 62, 70]}
old_file = pd.DataFrame(raw_data, columns=['HOUR', 'LOCATION', 'TYPE', 'PRICE'])
old_file.to_csv(dir_path + "/old_file.csv", index=False)

raw_data = {'HOUR': [2, 22],
            'LOCATION': ['CA', 'MN'],
            'TYPE': ['NEW', 'NEW'],
            'PRICE': [80, 90]}
new_file = pd.DataFrame(raw_data, columns=['HOUR', 'LOCATION', 'TYPE', 'PRICE'])
new_file.to_csv(dir_path + "/new_file.csv", index=False)

new_file = dir_path + "/new_file.csv"
df = pd.read_csv(new_file)
df.to_csv('old_file.csv', sep='\t', header=None, mode='a')
old_file.csv then looks like this:
HOUR LOCATION TYPE PRICE
4 CA OLD 4
9 HI OLD 24
12 CA OLD 31
7 IN OLD 2
3 MA OLD 3
15 OH OLD 25
2 OH OLD 94
16 MN OLD 57
3 NV OLD 62
21 NJ OLD 70
02CANEW80
122MNNEW90
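The rows mix up because the original file was written with the default comma separator, but the append call uses sep='\t'; it also writes the DataFrame index as an extra first column. Appending with the same separator and no index keeps the new rows aligned with the existing columns (a sketch, reusing df from the snippet above):
# Same separator (comma) as the original file, no header, no index
df.to_csv('old_file.csv', header=False, index=False, mode='a')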

To append a pandas DataFrame to a CSV file, you can also pass an open file handle:
df = pd.DataFrame({'Time': x, 'Value': y})
with open('CSVFileName.csv', 'a+', newline='', encoding='utf-8') as f:
    df.to_csv(f, header=False, index=False)
The with block closes the file automatically, so an explicit f.close() isn't needed, and header=False keeps the column names from being written again on each append.

Related

Merging many multiple dataframes within a list into one dataframe

I have several dataframes, all with the same columns, in one list that I would like to combine into one dataframe.
For instance, I have these three dataframes:
df1 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
within one list:
dfList = [df1,df2,df3]
I know I can use the following, which gives me exactly what I'm looking for:
df_merge = pd.concat([dfList[0],dfList[1],dfList[2]])
However, in my actual data I have hundreds of dataframes in the list, so I'm trying to find a way to loop through and concat:
dfList_all = pd.DataFrame()
for i in range(len(dfList)):
    dfList_all = pd.concat(dfList[i])
The above gives me the following error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any ideas would be wonderful. Thanks
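For what it's worth, the error says pd.concat expects an iterable of pandas objects, but the loop passes it a single DataFrame each time. Since dfList already is a list of DataFrames, it can go to pd.concat directly, with no loop (a minimal sketch):
import numpy as np
import pandas as pd

# Stand-in for the hundreds of frames described above
dfList = [pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
          for _ in range(3)]

# concat takes the whole list at once; ignore_index renumbers the rows
df_merge = pd.concat(dfList, ignore_index=True)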

Use pandas cut function in Dask

How can I use pd.cut() in Dask?
Because of the large dataset, I am not able to put the whole dataset into memory before finishing the pd.cut().
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd

d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)

# Group by name and add a sum (of amounts) and a count (number of grouped rows) column
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))

# Group by bins and change sum and count based on the grouped rows
df = df.groupby(pd.cut(df['name'],
                       bins=[0, 4, 8, 100],
                       labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd

d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
                                  df['name'],
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
The reason you're seeing this error is that pd.cut() is being called with the partition as its first argument, which it doesn't expect (see the docs).
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd

def custom_cut(partition, bins, labels):
    # map_partitions passes each partition as the first argument;
    # cut the 'name' column of that partition
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result

d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
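A possible refinement (my assumption, not from the original answer): Dask infers the output type of map_partitions by running the function on dummy data, which can be slow or emit a warning; passing meta declares it up front. Since pd.cut returns a categorical, something like this should work:
# Hypothetical: declare the output name and dtype explicitly
cuts = df.map_partitions(custom_cut,
                         bins=[0, 4, 8, 100],
                         labels=['namebin1', 'namebin2', 'namebin3'],
                         meta=('name', 'category'))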

Pandas - Extracting elements within a dictionary

I have a Python dictionary with the structure below. I am trying to extract certain elements from the dictionary and convert them to a DataFrame.
When I call pd.DataFrame on the dictionary, I get a summary of the two groups, data and PageCount, whereas I only want the elements within Output in the DataFrame.
{'code': 200,
 'data': {'Output': [{'id': 58,
                      'title': 'title1'},
                     {'id': 59,
                      'title': 'title2'}],
          'PageCount': {'count': 196,
                        'page': 1,
                        'perPage': 10,
                        'totalPages': 20}},
 'request_id': 'fggfgggdgd'}
Expected output:
id, title
58, title1
59, title2
You can use json_normalize:
df = pd.json_normalize(dct["data"]["Output"])
(In older pandas versions this function lives at pd.io.json.json_normalize.)
You can also use:
l = [v['Output'] for k, v in d.items() if isinstance(v, dict) and 'Output' in v]
pd.DataFrame(l[0])
id title
0 58 title1
1 59 title2
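Since the records of interest are already a list of flat dicts, the plain DataFrame constructor works here too (a minimal sketch; d stands for the dictionary from the question):
import pandas as pd

d = {'code': 200,
     'data': {'Output': [{'id': 58, 'title': 'title1'},
                         {'id': 59, 'title': 'title2'}],
              'PageCount': {'count': 196, 'page': 1,
                            'perPage': 10, 'totalPages': 20}},
     'request_id': 'fggfgggdgd'}

# A list of flat dicts maps directly onto rows
df = pd.DataFrame(d['data']['Output'])
print(df)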

Printing unique list of indices in multiindex pandas dataframe

I am just starting out with pandas and have the following code:
import pandas as pd

d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'bird-mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'cat', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'hops', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
I want to print everything that the animal cat does; here, that will be 'walks' and 'hops'.
I can filter to just the cat cross-section using
df2 = df.xs('cat', level=1)
But from here, how do I access the level 'locomotion'?
You can use get_level_values:
df.xs('cat', level=1).index.get_level_values(1)
Out[181]: Index(['walks', 'hops'], dtype='object', name='locomotion')
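Both xs and get_level_values also accept level names, which reads more clearly than positional numbers (a small variant of the answer above; tolist just materializes the result as a list):
df.xs('cat', level='animal').index.get_level_values('locomotion').tolist()
# ['walks', 'hops']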

Trying to create a Seaborn heatmap from a Pandas Dataframe

This is my first time trying this. I actually have a dict of lists that I am generating in a program, but since this is a first attempt, I am using a dummy dict just for testing.
I am following this question:
Making heatmap from DataFrame
but I am failing with the following:
Traceback (most recent call last):
File "C:/Users/Mark/PycharmProjects/main/main.py", line 20, in <module>
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 517, in heatmap
yticklabels, mask)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 168, in __init__
cmap, center, robust)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 205, in _determine_cmap_params
calc_data = plot_data.data[~np.isnan(plot_data.data)]
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
My code:
import pandas as pd
import seaborn as sns

Index = ['key1', 'key2', 'key3', 'key4', 'key5']
Cols = ['A', 'B', 'C', 'D']

testdict = {
    "key1": [1, 2, 3, 4],
    "key2": [5, 6, 7, 8],
    "key3": [9, 10, 11, 12],
    "key4": [13, 14, 15, 16],
    "key5": [17, 18, 19, 20]
}

df = pd.DataFrame(testdict, index=Index, columns=Cols)
df = df.transpose()
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)
You need to switch your column and index labels. When you build a DataFrame from a dict of lists, the dict keys become the column names; since none of 'A'-'D' are keys, columns=Cols selects nothing and you get an all-NaN object-dtype frame, which is what np.isnan chokes on inside seaborn. Swap the two lists:
Cols = ['key1', 'key2', 'key3', 'key4', 'key5']
Index = ['A', 'B', 'C', 'D']
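Putting that together, a corrected version of the script might look like this (a sketch; plt.show() is added so the figure renders when run as a plain script):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The dict keys become the DataFrame's columns, so Cols must name the
# keys and Index must label the four rows of each list.
Cols = ['key1', 'key2', 'key3', 'key4', 'key5']
Index = ['A', 'B', 'C', 'D']

testdict = {
    "key1": [1, 2, 3, 4],
    "key2": [5, 6, 7, 8],
    "key3": [9, 10, 11, 12],
    "key4": [13, 14, 15, 16],
    "key5": [17, 18, 19, 20]
}

df = pd.DataFrame(testdict, index=Index, columns=Cols)
df = df.transpose()  # keys as rows, 'A'-'D' as columns
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)
plt.show()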