how to put first value in one column and remaining into other column? - pandas

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to build a pandas DataFrame from a CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do this?

If your file does not have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
# read all lines; the with-block closes the file afterwards
with open(csv_path) as f:
    raw_data = f.readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# split each row once, at the first comma: [filename, remaining entries]
raw_data = [[head, rest] for head, _, rest in (x.partition(",") for x in raw_data)]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
              filenames                       class
0  ROCO2_CLEF_00001.jpg           C3277934,C0002978
1  ROCO2_CLEF_00002.jpg  C3265939,C0002942,C2357569
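The same split can also be sketched with pandas string methods instead of list comprehensions. This is only an illustration: the in-memory `sample` text stands in for the real CSV file, and the variable names are made up for the example.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question.
sample = io.StringIO(
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978\n"
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569\n"
)

# Load every line as one string, then split only at the first comma.
lines = pd.Series(sample.read().splitlines())
df = lines.str.split(",", n=1, expand=True)
df.columns = ["filenames", "class"]
print(df)
```

Because `n=1` limits the split to the first comma, rows with different numbers of class codes still end up in exactly two columns.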

Related

python - if-else in a for loop processing one column

I want to loop through a column and convert it into a processed Series.
Below is an example data frame with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1. Otherwise, I want to use diagnosis_name_edited.
At the moment, I have the following code, which uses the diagnosis_name_edited column directly. How do I turn it into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str)) # characters (generator)
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series) #
Thank you.
If I get you right, try out this fast solution using numpy.where:
import numpy as np
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
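One thing to check: the sample data in the question only contains 0 and 1 in is_spell_corrected, so a `> 1` condition would never pick the corrected value there. A small sketch, assuming 1 is the flag for a corrected row (that reading is a guess, and `chosen` is a made-up name):

```python
import numpy as np
import pandas as pd

# Cut-down copy of the question's sample data (three of the four columns).
df_diagnosis = pd.DataFrame({
    "diagnosis_name_edited": [" ac. nephritis. /. nephrotic syndrome",
                              "sternocleidomastoid contracture"],
    "is_spell_corrected": [1, 0],
    "spell_corrected_value": ["ac nephritis nephrotic syndrome", "NA"],
})

# Assumption: a value of 1 marks a spell-corrected row.
chosen = pd.Series(
    np.where(df_diagnosis["is_spell_corrected"] == 1,
             df_diagnosis["spell_corrected_value"],
             df_diagnosis["diagnosis_name_edited"]),
    index=df_diagnosis.index,
)
print(chosen)
```

The resulting Series can then be passed through the existing generator with rapid_utils.default_process in place of df_diagnosis['diagnosis_name_edited'].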

Concatenate multiple file rowwise in a single dataframe with the same header name

I have 400 CSV files, and each file contains a single column with 4667 rows. Every row holds a name and its corresponding value, for example "A=54", "B=56" and so on for all 4667 rows. My problem statement is:
1. Fetch the variable names and put them into separate columns.
2. Fetch the corresponding variable values and put them in the row below those columns.
3. Repeat this for all 400 files and append each file's values as a new row, which gives 400 rows.
I have done this for a single file, but I don't know how to do it for multiple files.
import glob
import pandas as pd
from collections import OrderedDict
path =r'Github/dataset/Raw_Dataset/'
filenames = glob.glob(path + "/*.csv")
dict_of_df = OrderedDict((f, pd.read_csv((f),header=None,names=['Devices'])) for f in filenames)
eda=pd.concat(dict_of_df)
Do you mean a concatenation?
You could use pandas like this:
import pandas as pd
df1 = pd.DataFrame(columns=["A","B","C","D"],data=[["1","2","3","4"]])
df2 = pd.DataFrame(columns=["A","B","C","D"],data=[["5","6","7","8"]])
df3 = pd.DataFrame(columns=["A","B","C","D"],data=[["9","10","11","12"]])
df_concat = pd.concat([df1, df2, df3])
print(df_concat)
Combined with a loop over the files in the folder whose names end with ".csv", it would be:
import os
import pandas as pd
path = r"C:\YOUR\DIRECTORY"
frames = []
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        print(filename)
        frames.append(pd.read_csv(os.path.join(path, filename)))
# concatenate once at the end instead of growing a DataFrame inside the loop
df_concat = pd.concat(frames) if frames else pd.DataFrame()
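The plain concatenation above stacks the files as they are. To get the layout from the question (one column per variable name, one row per file), each file's "name=value" rows can be split first. A minimal sketch, assuming every row holds exactly one "name=value" pair. The two in-memory "files" and their names are invented for the example.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the 400 single-column files.
fake_files = {
    "file_001.csv": "A=54\nB=56\nC=7\n",
    "file_002.csv": "A=60\nB=58\nC=9\n",
}

rows = {}
for name, text in fake_files.items():
    col = pd.read_csv(io.StringIO(text), header=None, names=["Devices"])["Devices"]
    # Split "A=54" into the variable name and its value.
    parts = col.str.split("=", n=1, expand=True)
    rows[name] = pd.Series(parts[1].to_numpy(), index=parts[0].to_numpy())

# Each Series becomes a column; transpose so each file is one row.
wide = pd.DataFrame(rows).T.apply(pd.to_numeric)
print(wide)
```

With real files, the dictionary would be replaced by a loop over glob.glob(...) that passes each path straight to pd.read_csv.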

Exclude last two rows when import a csv file using read_csv in Pandas

Afternoon All,
I am extracting data from SQL server to a csv format then reading the file in.
df = pd.read_csv(
    'TKY_RFQs.csv',
    sep='~',
    usecols=[
        0,1,2,3,4,5,6,7,8,9,
        10,11,12,13,14,15,16,17,18,19,
        20,21,22,23,24,25,26,27,28,29,
        30,31,32,33,34,35,36,37
    ]
)
There is a blank row and then the record count at the end of the file, both of which I would like to remove.
(Screenshot: end of file)
I have been getting around the issue via this code but would like to resolve the root problem:
# Count_Row=df.shape[0] # gives number of row count
# df_Sample = df[['trading_book','state', 'rfq_num_of_dealers']].head(Count_Row-1)
Is there a way to exclude the last two rows in the file, or alternatively to remove any row which has null values for all columns?
Pete
Could you try:
df = pd.read_csv(
    'TKY_RFQs.csv',
    sep='~',
    usecols=[
        0,1,2,3,4,5,6,7,8,9,
        10,11,12,13,14,15,16,17,18,19,
        20,21,22,23,24,25,26,27,28,29,
        30,31,32,33,34,35,36,37
    ]
)[:-2]
Example:
from pandas import read_csv
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)[:-2]  # to exclude the last two rows
#data = read_csv(url, names=names)  # to include all rows
print(data)
#description = data.describe()
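You can make use of skipfooter directly in read_csv. skiprows only counts lines from the start of the file, so it cannot be used to drop trailing lines, and skipfooter requires the Python parsing engine:
df = pd.read_csv(
    'TKY_RFQs.csv',
    sep='~',
    usecols=[
        0,1,2,3,4,5,6,7,8,9,
        10,11,12,13,14,15,16,17,18,19,
        20,21,22,23,24,25,26,27,28,29,
        30,31,32,33,34,35,36,37
    ],
    skipfooter=2,     # skip the last two lines of the file (blank line and record count)
    engine='python'   # skipfooter is not supported by the default C engine
)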
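Both options from the question (dropping the trailing lines at read time, or removing rows that are null in every column) can be tried on a small in-memory file. This is a sketch only: `raw` is an invented miniature of the export, and passing skip_blank_lines=False is an added precaution so the blank line is kept as a line; it does not come from the thread.

```python
import io
import pandas as pd

# Hypothetical miniature of the export: data rows, a blank line, a record count.
raw = "a~b\n1~2\n3~4\n\nRecords: 2\n"

# Option 1: drop the last two physical lines while reading.
# skipfooter needs engine="python".
df_footer = pd.read_csv(
    io.StringIO(raw), sep="~",
    skipfooter=2, engine="python", skip_blank_lines=False,
)

# Option 2: keep the blank line as an all-NaN row, then drop such rows.
df_all = pd.read_csv(io.StringIO(raw), sep="~", skip_blank_lines=False)
df_no_blank = df_all.dropna(how="all")

print(df_footer)
print(df_no_blank)
```

Note that dropna(how="all") only removes the blank row. The record-count line still has a value in its first column, so it would need a separate filter.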

Append values to pandas dataframe incrementally inside for loop

I am trying to add rows to a pandas DataFrame incrementally inside a for loop.
My for loop looks like this:
def print_values(cc):
    data = []
    for x in values[cc]:
        data.append(labels[x])
    # cc is a constant and data is a list. I need these values to be appended as a row in a pandas DataFrame.
    # The DataFrame structure is as follows: df=pd.DataFrame(columns = ['Index','Names'])
    print(cc)
    print(data)
    # This does not work - not sure what the problem is!
    #df_clustercontents.loc['Cluster_Index'] = cc
    #df_clustercontents.loc['DatabaseNames'] = data
for x in range(0,10):
    print_values(x)
I need the values "cc" and "data" to be appended to the DataFrame incrementally.
Any help would be really appreciated!
You can use:
...
print(cc)
print(data)
df_clustercontents.loc[len(df_clustercontents)]=[cc,data]
...
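An alternative that avoids growing the DataFrame row by row is to collect the rows in a plain list and build the DataFrame once at the end. A minimal sketch: the `values` and `labels` dictionaries below are invented stand-ins for the ones in the question, and the column names are taken from the commented-out lines there.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `values` and `labels`.
labels = {0: "db_a", 1: "db_b", 2: "db_c"}
values = {0: [0, 1], 1: [2], 2: [0, 2]}

rows = []
for cc in range(3):
    data = [labels[x] for x in values[cc]]
    # One dict per row; the list stays as a single cell value.
    rows.append({"Cluster_Index": cc, "DatabaseNames": data})

df_clustercontents = pd.DataFrame(rows, columns=["Cluster_Index", "DatabaseNames"])
print(df_clustercontents)
```

Appending to a list is cheap, while every assignment to a new .loc label has to enlarge the DataFrame, so this tends to scale better for long loops.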

How to rename pandas dataframe column with another dataframe?

I really don't understand what I'm doing. I have two data frames. One has a list of column labels and another has a bunch of data. I want to just label the columns in my data with my column labels.
My Code:
airportLabels = pd.read_csv('airportsLabels.csv', header= None)
airportData = pd.read_table('airports.dat', sep=",", header = None)
df = DataFrame(airportData, columns = airportLabels)
When I do this, all the data turns into "NaN" and there is only one column left. I am really confused.
I think you need to add the parameter nrows to read_csv if you only need to read the column names. Remove header=None, because the first row of the CSV contains the column names. Then pass the columns of the airportLabels DataFrame to the names parameter of read_table:
import pandas as pd
import io
temp=u"""col1,col2,col3
1,5,4
7,8,5"""
# after testing, replace io.StringIO(temp) with the filename
airportLabels = pd.read_csv(io.StringIO(temp), nrows=0)
print(airportLabels)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
temp=u"""
a,d,f
e,r,t"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_table(io.StringIO(temp), sep=",", header = None, names=airportLabels.columns)
print(df)
col1 col2 col3
0 a d f
1 e r t
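If both files have already been read with header=None, as in the question, the labels can also be assigned afterwards. A small sketch, assuming the labels file is a single comma-separated row of names. The file contents below are invented for the example.

```python
import io
import pandas as pd

# Hypothetical contents of the labels file and the data file.
labels_text = "id,name,city\n"
data_text = "1,Goroka,Goroka\n2,Madang,Madang\n"

airportLabels = pd.read_csv(io.StringIO(labels_text), header=None)
airportData = pd.read_csv(io.StringIO(data_text), header=None)

# The labels sit in the first (and only) row, so take that row as a list.
airportData.columns = airportLabels.iloc[0].tolist()
print(airportData)
```

Passing a DataFrame as columns= to the DataFrame constructor selects columns rather than renaming them, which is why the original attempt produced NaN values.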