I am trying to create a fixed width file output in Pandas. When using pandas dataFrames to_string all the data has a "white space" separating the values. How do I remove the white space between the data columns?
sql = """SELECT
FIELD_1,
FIELD_2,
.........
FROM
VIEW"""
db_connection_string = "your connection string"
df = pd.read_sql_query(sql=sql, con=db_connection_string)
df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')
print(df.to_string(header=False, index=False)
I expected the following:
field1----field2----
What I got was:
field1---- field2----
Please note the spaces between the columns. This is what I am trying to remove. The fields should be flush against one another and not have a whitespace separator.
I think problem is to_string add default separator. Possible solutions is join together all columns:
print(df.astype(str).apply(''.join, 1).to_string(header=False, index=False))
field1----field_2---
Or only some columns:
print ((df['field_1'] + df['field_2']).to_string(header=False, index=False)))
Related
Help me plz.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
i cant replace commas (',') with dots ('.')
When i load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value ''0,20'
when i try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values ββββin the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the columns to integer type (the ones that have the issues) because as for now they are as of type object.
You can check for those cases with the below command:
df.dtypes
I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a python function that upper cases, strips and removes the characters full stop . and backtick ` from the column names.
Before and after this line of code, the dataframe randomly loses a bunch of rows. If I do a count before and after the line, then the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
def format_column(column: str) -> str:
column = column.strip().upper() # Case and leading / trailing white spaces
column = re.sub(r"\s+", " ", column) # Multiple white spaces
column = re.sub(r"\.|`", "_", column)
return column
I reproduced this in my environment and there is no loss of any rows in my dataframe.
format_column function and my dataframe:
When I used the format_column as same, you can see the count of dataframe before and after replacing.
Please recheck your dataframe if something other than this function is changing your dataframe.
If you still getting the same, you can try and check if the following results losing any rows or not.
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also results losing rows, then the issue is with something else in your dataframe or code. please recheck on that.
I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes.
But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
I am using this dataframe below
In this dataframe the "TITLE" and"ABSTRACT" column contains a lot of unwanted characters along with words.
I want only the letters and not any other unwanted characters in these two columns.
Please help me remove the unwanted charcters from both the columns of the dataframe.
Please use any method(functions preferable).
df['TITLE'] = df.TITLE.str.replace('[^a-zA-Z]', '')
df['ABSTRACT'] = df.ABSTRACT.str.replace('[^a-zA-Z]', '')
I am trying to open csv file in pandas using read_csv function. My file have the following structure a row with headers where each column header's name have the name underlined with quotes for example "header1";"header2"; non headers values in columns contains int or string values without quotes with only ; delimiter. Dataframe has the following structure
"header1";"header2";"header3";
value1;value2;value3;
When i apply read_csv df = pd.read_csv("filepath", sep=";", engine="python") i am getting ParseError: expected ';' after ' " ' help to solve it
Try to specify column names as follows, and see if it resolves the issue:
col_names = ["header1", "header2", "header3"]
df = pd.read_csv(filepath, sep=";", names=col_names)
If this doesn't work, try adding 'quotechar=' " ' and see