CSV header parsing in PySpark - dataframe

I am trying to read a csv file as a dataframe in Azure Databricks.
All the header names in the CSV file (as seen when I open it in Excel) have the following format, e.g.
"City_Name"ZYD_CABC2_EN:0TXTMD
Basically I want to keep only the string within the quotes as my header (City_Name) and ignore the second part of the string (ZYD_CABC2_EN:0TXTMD).
sales_df = spark.read.format("csv").load(input_path + '/sales_2020.csv', inferSchema=True, header=True)

You can rename the columns after reading in the csv file: use a regular expression to extract the text between the quotes, then use toDF to reassign all column names at once:
import re
# sales_df = spark.read.format("csv")...
sales_df = sales_df.toDF(*[re.search('"(.*)"', c).group(1) for c in sales_df.columns])

Alternatively, you can split the raw names on " to get the desired column names:
sales_df = sales_df.toDF(*[c.split('"')[1] for c in sales_df.columns])
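The renaming logic itself can be checked without Spark. A minimal sketch of both approaches on sample column names (the names below are made up to match the shape in the question):

```python
import re

# Hypothetical raw column names in the shape the question describes
raw_cols = ['"City_Name"ZYD_CABC2_EN:0TXTMD', '"Sales_Amt"ZYD_CABC2_EN:0AMOUNT']

# Approach 1: regex, capturing the text between the double quotes
regex_names = [re.search('"(.*)"', c).group(1) for c in raw_cols]

# Approach 2: split on the quote character and take the middle piece
split_names = [c.split('"')[1] for c in raw_cols]

print(regex_names)  # ['City_Name', 'Sales_Amt']
print(split_names)  # ['City_Name', 'Sales_Amt']
```

Either list can then be handed to toDF exactly as above.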

Related

Insert column name and move the column data into the row

I want data like this
When I use df.Column[] it replaces the value with name
First question: does the .xls file have column names, or should you define them manually?
If it has column names (it seems like it doesn't), you can use header=0. If it doesn't have column names, define a list and then pass header=None and names=column_names.
The first one is:
df = pd.read_excel('wine-1.xls', header=0)  # use read_excel instead of read_csv
and the second one:
column_names = ['wine', 'acidity', ...]  # list of column names
df = pd.read_excel('wine-1.xls', header=None, names=column_names)
Hope that works for you.
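The header and names parameters behave the same way in read_csv, which makes the second case easy to sketch without an .xls file on disk (the data and names below are made up):

```python
import io
import pandas as pd

# Data with no header row (values are made up)
data = "12.8,0.51,3.1\n13.2,0.42,2.9\n"

column_names = ['wine', 'acidity', 'ph']  # manually defined names
df = pd.read_csv(io.StringIO(data), header=None, names=column_names)

print(df.columns.tolist())  # ['wine', 'acidity', 'ph']
print(len(df))              # 2
```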

Pandas dataframe: Splitting single-column data from txt file into multiple columns

I have an obnoxious .txt file that is output from a late-1990s program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pd dataframe. After some reorganization, the txt file currently looks like the following (see link):
Organized Text File
Each row is indexed in a pd dataframe. The code used to reorganize the file and attempt to split into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply( lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows present in the txt file. It only applied split(',') to the first row rather than all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.
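No answer was recorded here, but the usual explanation is worth noting: DataFrame.apply runs the function once per column, so str(x) stringifies the whole column at once rather than splitting each row. Row-wise splitting is what Series.str.split with expand=True is for. A minimal sketch with made-up stand-in data:

```python
import pandas as pd

# Hypothetical stand-in for the single-column instrument output
quantData = pd.DataFrame({0: ['Compound,RT,Area',
                              'Benzene,1.23,4567',
                              'Toluene,2.34,8910']})

# Split every row on commas, expanding the pieces into new columns
splitFile = quantData[0].str.split(',', expand=True)

print(splitFile.shape)  # (3, 3)
```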

Pandas read_csv wrong separator recognition

I am trying to open a csv file in pandas using the read_csv function. My file has the following structure: a header row where each column name is wrapped in quotes, for example "header1";"header2";, then non-header rows containing int or string values without quotes, delimited only by ;. The dataframe has the following structure:
"header1";"header2";"header3";
value1;value2;value3;
When I apply read_csv as df = pd.read_csv("filepath", sep=";", engine="python"), I get ParseError: expected ';' after '"'. Help me solve it.
Try to specify column names as follows, and see if it resolves the issue:
col_names = ["header1", "header2", "header3"]
df = pd.read_csv(filepath, sep=";", names=col_names)
If this doesn't work, try adding quotechar='"' and see.
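For what it's worth, pandas' default quotechar is already the double quote, so reading the sample above with sep=';' should strip the quotes from the header names by itself. A minimal reproduction (note that the trailing ; produces one extra unnamed column):

```python
import io
import pandas as pd

data = '"header1";"header2";"header3";\nvalue1;value2;value3;\n'

# quotechar defaults to '"'; passing it explicitly does no harm
df = pd.read_csv(io.StringIO(data), sep=';', quotechar='"')

print(df.columns.tolist()[:3])  # ['header1', 'header2', 'header3']
```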

Pandas - Issue when concatenating multiple csv files into one

I have a list of csv files that I am trying to concatenate using Pandas.
Given below is sample view of the csv file:
Note: column 4 stores the latitude; column 5 stores the longitude.
store-001,store_name,building_no_060,23.4324,43.3532,2018-10-01 10:00:00,city_1,state_1
store-002,store_name,building_no_532,12.4345,45.6743,2018-10-01 12:00:00,city_2,state_1
store-003,store_name,building_no_536,54.3453,23.3444,2018-07-01 04:00:00,city_3,state_1
store-004,store_name,building_no_004,22.4643,56.3322,2018-04-01 07:00:00,city_2,state_3
store-005,store_name,building_no_453,76.3434,55.4345,2018-10-02 16:00:00,city_4,state_2
store-006,store_name,building_no_456,35.3455,54.3334,2018-10-05 10:00:00,city_6,state_2
When I try to concat multiple csv files in the above format, the latitude and longitude columns end up saved first, down the first column (cells A2-A30), followed by the other columns all placed in row 1.
Given below is the way I am performing the concat:
masterlist = glob.glob('path')  # path where all the csv files are stored
df_v1 = [pd.read_csv(fp, sep=',', error_bad_lines=False).assign(FileName=os.path.basename(fp)) for fp in masterlist]  # this also includes the file name in the csv file
df = pd.concat(df_v1, ignore_index=True)
df.to_csv('path', index=False)  # stores the final concatenated csv file
Could anyone guide me on why the concatenation is not working properly? Thanks.
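One likely culprit, given that the sample rows carry no header line: read_csv consumes each file's first store record as its header, so every file gets different "headers" and concat then aligns on those mismatched labels. A sketch of reading with explicit names instead (the column names are made up, and in-memory buffers stand in for the files):

```python
import io
import pandas as pd

# Hypothetical column names for the eight fields in the sample rows
cols = ['store_id', 'store_name', 'building', 'latitude',
        'longitude', 'timestamp', 'city', 'state']

# In-memory stand-ins for two header-less csv files
file_a = io.StringIO("store-001,store_name,building_no_060,23.4324,43.3532,2018-10-01 10:00:00,city_1,state_1\n")
file_b = io.StringIO("store-002,store_name,building_no_532,12.4345,45.6743,2018-10-01 12:00:00,city_2,state_1\n")

# header=None stops the first record being eaten as a header
frames = [pd.read_csv(f, header=None, names=cols) for f in (file_a, file_b)]
df = pd.concat(frames, ignore_index=True)

print(df.shape)  # (2, 8)
```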

KeyError: '3' when extracting data from Pandas DataFrame

My code plan is as follows:
1) find csv files in a folder using glob and create a list of files
2) convert each csv file into a dataframe
3) extract data from a column location and convert it into a separate dataframe
4) append the new data into a separate summary csv file
The code is as follows:
Result = []
def result(filepath):
    files = glob.glob(filepath)
    print files
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i+1)
        selected_data = df['3'].ix[0:4]
        new_dfb[colname] = selected_data
        Result.append(new_dfb)
    folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
    new_dfb.to_csv(folder)
result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The code error is shown below. The issue seems to be with line 36, which corresponds to selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the dataframe constructor?
Your csv snippet is a bit unclear, but as suggested in the comments, read_csv (from_csv in this case) automatically takes the first row as the list of headers. The behaviour you appear to want is for the columns to be labelled numerically instead. To achieve this you need to have
[pd.DataFrame.from_csv(f, index_col=None, header=None) for f in files]
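As a sanity check (sketched with read_csv, since from_csv has since been removed from pandas): with header=None the column labels are integers starting at 0, so the lookup becomes df[3] rather than df['3']:

```python
import io
import pandas as pd

# Made-up numeric data with no header row
data = "10,20,30,40\n50,60,70,80\n"

# header=None: no row is consumed as a header, columns are ints 0..3
df = pd.read_csv(io.StringIO(data), header=None)

print(df.columns.tolist())  # [0, 1, 2, 3]
print(df[3].tolist())       # [40, 80]
```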