Pandas exporting to_csv() with quotation marks around column names

For some reason I need to output to a CSV in this format, with quotation marks around each column name; my desired output looks like:
"date" "ret"
2018-09-24 0.00013123989025119056
I am trying with:
import csv
import os
import pandas as pd
Y_pred.index.name = "\"date\""
Y_pred.name = "\'ret\'"
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
    os.makedirs(path)
except:
    pass
Y_pred.to_csv(path + instrument_tmp + "_ret.txt", sep=' ')
and got outputs like:
"""date""" 'ret'
2018-09-24 0.00013123989025119056
I can't seem to find a way to wrap the column names in quotation marks. Does anyone know how to do this? Thanks.
My solution:
using quoting=csv.QUOTE_NONE together with Y_pred.index.name = "\"date\"", Y_pred.name = "\"ret\""
Y_pred.index.name = "\"date\""
Y_pred.name = "\"ret\""
Y_pred = Y_pred.to_frame()
path = "prediction/Q1/"
try:
    os.makedirs(path)
except:
    pass
Y_pred.to_csv(path + instrument_tmp + "_ret.txt", sep=' ', quoting=csv.QUOTE_NONE)
and then I get
"date" "ret"
2018-09-24 0.00013123989025119056

This is called quoted output.
Instead of manually hacking quotes into your column names (which will interfere with other DataFrame functionality), use the quoting option:
import csv
import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
df.to_csv("out_q_esc.txt", sep=' ', escapechar='\\', quoting=csv.QUOTE_ALL, index=None)
"date" "ret"
"2018-09-24" "0.00013123989025119056"
The 'correct' way is to use quoting=csv.QUOTE_ALL (and optionally escapechar='\\'). Note, however, that QUOTE_ALL forces all columns to be quoted, even obviously numeric ones like the index; if we hadn't specified index=None, we would get:
"" "date" "ret"
"0" "2018-09-24" "0.00013123989025119056"
csv.QUOTE_MINIMAL refuses to quote these fields because they don't strictly need quotes: they are neither multiline nor do they contain an embedded quote or separator character.
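For comparison, a small self-contained sketch (the file names are arbitrary) of how the three quoting constants behave on the same frame:
import csv
import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
# QUOTE_ALL: every field is quoted, including numeric ones
df.to_csv("all.txt", sep=' ', quoting=csv.QUOTE_ALL, index=False)
# QUOTE_NONNUMERIC: strings (and the header) are quoted, numbers are left bare
df.to_csv("nonnumeric.txt", sep=' ', quoting=csv.QUOTE_NONNUMERIC, index=False)
# QUOTE_MINIMAL (the default): quotes only fields that contain the separator,
# a quote character or a newline; here nothing gets quoted
df.to_csv("minimal.txt", sep=' ', quoting=csv.QUOTE_MINIMAL, index=False)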

IIUC, you can use the quoting argument with csv.QUOTE_NONE:
import csv
df.to_csv('test.csv', sep=' ', quoting=csv.QUOTE_NONE)
And your resulting csv will look like:
"date" "ret"
0 2018-09-24 0.00013123989025119056
Side note: to make it easier to add the quotes to your column names, you can use add_prefix and add_suffix. If your starting dataframe looks like:
>>> df
date ret
0 2018-09-24 0.000131
Then do:
df = df.add_suffix('"').add_prefix('"')
df.to_csv('test.csv',sep=' ',quoting=csv.QUOTE_NONE)
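Putting the side note together, a minimal end-to-end sketch (the output file name is arbitrary):
import csv
import pandas as pd

df = pd.DataFrame({"date": ["2018-09-24"], "ret": [0.00013123989025119056]})
# Wrap every column name in literal double quotes, then tell to_csv not to add its own
df = df.add_prefix('"').add_suffix('"')
df.to_csv("test.txt", sep=' ', quoting=csv.QUOTE_NONE, index=False)
# Expected file contents (assumption):
# "date" "ret"
# 2018-09-24 0.00013123989025119056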

Related

pandas can't replace commas with dots

Please help me.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the affected columns to a numeric type such as float, because right now they are of type object.
You can check the column types with the command below:
df.dtypes
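A minimal sketch of the full round trip (the column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({"price": ["0,20", "1,50"], "qty": [3, 4]})
# Replace the decimal comma in the string column, then convert it to float
df["price"] = df["price"].replace(",", ".", regex=True).astype(float)
print(df.dtypes)  # price should now be float64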

Losing rows when renaming columns in pyspark (Azure databricks)

I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a Python function that upper-cases, strips whitespace, and replaces the characters full stop (.) and backtick (`) in the column names with underscores.
Before and after this line of code, the dataframe randomly loses a bunch of rows. If I do a count before and after the line, then the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()       # Case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)  # Multiple white spaces
    column = re.sub(r"\.|`", "_", column)
    return column
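For reference, the manual workaround mentioned above (stripping the pipe character before renaming) roughly corresponds to a variant like the following; this is only a sketch of the workaround, not an explanation of the row loss:
import re

def format_column_no_pipe(column: str) -> str:
    column = column.strip().upper()           # Case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)      # Multiple white spaces
    column = re.sub(r"\.|`|\|", "_", column)  # Full stop, backtick and pipe
    return column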
I reproduced this in my environment and there was no loss of rows in my dataframe.
Using the same format_column function, the dataframe count is the same before and after renaming the columns.
Please recheck whether something other than this function is changing your dataframe.
If you are still seeing the same behaviour, check whether the following loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code; please recheck that.

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several documents which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: make sure the column is of string type
# second: split the column on the separator '\n'
# third: expand=True so the split produces two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# rename the new columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels, so selecting with it returns a DataFrame rather than a Series.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using .iloc[:, 0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
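If this needs to stay generic across documents, a hedged sketch that derives the new column names from the original header (assuming exactly one column name contains a newline) could look like this:
import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# Find the merged column by the newline in its name
merged = df.columns[df.columns.str.contains('\n')][0]
# Split the values and reuse the pieces of the header as the new column names
split = df[merged].str.split('\n', expand=True)
split.columns = merged.split('\n')
df = df.drop(columns=merged).join(split)
print(df)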

how to merge two columns into one date column with pandas?

I have a CSV where the first column is the date and the fifth is the hour.
I would like to merge them into a single column with a specific format, in order to write another CSV file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this seems to me not to be the correct way, for two reasons:
1) it adds the hour without considering the hour column;
2) it does not seem to use pandas features.
Thanks for any kind of help,
Diedro
Using the + operator
You need to convert the DataFrame elements to strings before joining. You can also use different separators when joining, e.g. a dash, underscore or space.
import pandas as pd

df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})
print('Before Join')
print(df, '\n')
print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)
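Applied to the DATE and HOUR columns from the question, a minimal sketch (the small frame below just stands in for the question's CSV):
import pandas as pd

# Stand-in for the relevant columns of the question's CSV
dataR = pd.DataFrame({'DATE': ['01/01/2015', '01/01/2015'], 'HOUR': [1, 2]})
dataR['DATETIME'] = pd.to_datetime(dataR['DATE'] + ' ' + dataR['HOUR'].astype(str),
                                   format='%d/%m/%Y %H')
print(dataR)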
You can use read_csv with the parse_dates parameter set to a list containing both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='H')

How to get the name of a dataframe column in PySpark?

In pandas, this can be done by column.name.
But how to do the same when it's a column of Spark dataframe?
E.g. the calling program has a Spark dataframe: spark_df
>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']
This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.
If it was pandas dataframe, we could use this:
>>> pandas_df['rank'].name
'rank'
You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can be useful to visualize it as well
spark_df.printSchema()
The only way is to go down a level, to the underlying JVM.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
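For the question's use case, a hedged sketch (this relies on the private _jc attribute, an internal detail that may change between Spark versions):
def my_function(col):
    # col is expected to be a pyspark.sql.Column, e.g. spark_df['rank']
    col_name = col._jc.toString()  # assumption: returns 'rank' for a plain column reference
    print(col_name)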
As #numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.
In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted, even without using regular expressions: str(column).split(" AS ")[1].split("`")[1].
I don't know the Scala syntax, but I'm sure it can be done the same way.
If you want the column names of your dataframe, you can use the pyspark.sql class. I'm not sure if the SDK supports explicitly indexing a DF by column name. I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, accessing the columns attribute on your dataframe, which you have done, will return a list of column names:
df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column, you'll need to access it by index:
df.columns[2] will return 'High'
I found the answer is very very simple...
// It is in Java, but it should be same in PySpark
Column col = ds.col("colName"); //the column object
String theNameOftheCol = col.toString();
The variable theNameOftheCol is "colName"
I hope these options can serve as more universal ones. Cases covered:
column not having an alias
column having an alias
column having several consecutive aliases
column names surrounded with backticks
No regex:
str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]
Using regex:
import re
re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
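A quick usage sketch (the exact repr of a Column differs between Spark versions, so treat the expected output as an assumption):
import re
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # needed so Column objects can be created
col = F.col("rank").alias("my_rank")
name = re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
print(name)  # expected: my_rank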
# Table names as an example, if you have multiple tables
loc = '/mnt/tablename'  # or 'whatever_location/table_name' in case of an external table or any folder
table_name = ['customer', 'department']
for i in table_name:
    print(i)  # the existing table name
    df = spark.read.format('parquet').load(f"{loc}{i.lower()}/")  # create a dataframe from the table
    for col in df.dtypes:
        print(col[0])  # column name
        print(col[1])  # datatype of the respective column
Since none of the answers has been marked as the answer:
I may be over-simplifying the OP's ask, but:
my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)