How can I sort and replace the rank of each item instead of it's value before sorting in a merged final csv? - pandas

I have 30 different csv files and each row begin with date and some similar features measured for each of 30 items daily. The value of each feature is not important, but the rank they gain after sorting in each day is important. How can I have one merged csv from 30 separate csv with the rank of each feature?

If your files are all the same format, you can do a loop and consolidate in a single data frame. From there you can manipulate as needed:
import pandas as pd
import glob
path = r'C:\path_to_files\files' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.

Related

assigning csv file to a variable name

I have a .csv file, i uses pandas to read the .csv file.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that i can access the data in later time. For example by doing something like
new_data= df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read in your data as 1 column (by using a 'wrong' seperator) and renaming the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you then wish, you could add other files by doing the same thing (with a different name for df at first) and then concatenate that with df.
A more usefull approach IMHO would be to add the dataframe to a dictionary (or a list would be possible).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv']

pandas remove spaces from Series

The question is, how to gain access to the strings inside of the first column so that string manipulations can be performed with each value. For example remove spaces in front of each string.
import pandas as pd
data = pd.read_csv("adult.csv", sep='\t', index_col=0)
series = data['workclass'].value_counts()
print(series)
Here is the file:
Zipped csv file
It is index, so use str.strip with series.index:
series.index = series.index.str.strip()
But if need convert series here to 2 columns DataFrame use:
df = series.rename_axis('a').reset_index(name='b')

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.
I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.
One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import pyarrow.parquet as pq
import numpy as np
# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pq.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
Slice the dataframe and save each chunk to a folder, using just pandas api (without dask or pyarrow).
You can pass extra params to the parquet engine if you wish.
def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_wargs):
"""Writes pandas DataFrame to parquet format with pyarrow.
Args:
df: DataFrame
target_dir: local directory where parquet files are written to
chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
"""
for i in range(0, len(df), chunk_size):
slc = df.iloc[i : i + chunk_size]
chunk = int(i/chunk_size)
fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
slc.to_parquet(fname, engine="pyarrow", **parquet_wargs)
Keep each parquet size small, around 128MB. To do this:
import dask.dataframe as dd
# Get number of partitions required for nominal 128MB partition size
# "+ 1" for non full partition
size128MB = int(df.memory_usage().sum()/1e6/128) + 1
# Read
ddf = dd.from_pandas(df, npartitions=size128MB)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
cunk = 200000
i = 0
n = 0
while i<= len(all_df):
j = i + cunk
print((i, j))
tmpdf = all_df[i:j]
tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet",engine='pyarrow', compression='snappy')
i = j
n = n + 1

Concatenate multiple file rowwise in a single dataframe with the same header name

I have 400 csv files and all the files contains single column which has 4667 rows . Every row has name and corresponding value for example "A=54,B=56 and so on till 4667 rows. My problem statement is :
1. fetch the variable name and put it different columns
2. fetch the corresponding variable value and put it in the next rows above the columns.
3. Now, Do this step for all the 400 files and append all the corresponding values in the above rows which makes 400 rows.
I have done for the single file and how to do with the multiple files . I don't Know
import glob
from collections import OrderedDict
path =r'Github/dataset/Raw_Dataset/'
filenames = glob.glob(path + "/*.csv")
dict_of_df = OrderedDict((f, pd.read_csv((f),header=None,names=['Devices'])) for f in filenames)
eda=pd.concat(dict_of_df)
I
Do you mean a concatenation?
you could use pandas like this:
import pandas as pd
df1 = pd.DataFrame(columns=["A","B","C","D"],data=[["1","2","3","4"]])
df2 = pd.DataFrame(columns=["A","B","C","D"],data=[["5","6","7","8"]])
df3 = pd.DataFrame(columns=["A","B","C","D"],data=[["9","10","11","12"]])
df_concat = pd.concat([df1, df2, df3])
print(df_concat)
In combination with a loop through the files in the folder ending with ".csv" it would be:
import os
import pandas as pd
path = r"C:\YOUR\DICTIONARY"
df_concat = pd.DataFrame()
for filename in os.listdir(path):
if filename.endswith(".csv"):
print(filename)
df_temp = pd.read_csv(path + "\\" + filename)
df_concat = pd.concat([df_concat, df_temp])
continue
else:
continue

Reading variable column and row structure to Pandas by column amount

I need to create a Pandas DataFrame from a large file with space delimited values and row structure that is depended on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
On one line or several, line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows is depended on the number of columns which is stated in the meta part of the file.
Currently I split the file into one big list and split it into rows:
def grouper(n, iterable, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re
filename = 'raw_data.csv'
class FileLike(file):
""" Modeled after FileWrapper
http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
"""
def __init__(self, *args):
super(FileLike, self).__init__(*args)
self.buffer = []
def next(self):
if not self.buffer:
line = super(FileLike, self).next()
self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
if self.buffer:
line = self.buffer.pop()
return line
with FileLike(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re
filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace = True, backup='.bak'):
for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to seperate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None)
df = pd.DataFrame(s.values.reshape(len(s)/n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170