Please help.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace the commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
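Without regex=True, replace only matches cells whose entire value is ',', so a value such as '0,20' is left untouched. Keep in mind that you also need to convert the affected columns to a numeric type (float for values such as 0.20), because for now they are of type object.
You can check which columns are affected with the command below: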
df.dtypes
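A minimal sketch of both steps, using a small made-up frame rather than the linked dataset (the column names price and label are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data; the real dataset's columns are not known here.
df = pd.DataFrame({"price": ["0,20", "1,5", "3"], "label": ["a", "b", "c"]})

# Substring replacement needs regex=True; the default only matches whole cell values.
df["price"] = df["price"].replace(",", ".", regex=True)

# The column still holds strings at this point, so convert it to a numeric type.
df["price"] = pd.to_numeric(df["price"])

print(df.dtypes)
```

pd.to_numeric raises on values it cannot parse, which is a quick way to spot cells that still contain something unexpected.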
I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a Python function that upper-cases and strips the column names and replaces the full stop (.) and backtick (`) characters in them.
Across this line of code, the dataframe randomly loses a bunch of rows: if I do a count before and after the line, the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()  # Case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)  # Multiple white spaces
    column = re.sub(r"\.|`", "_", column)  # Full stops and backticks
    return column
I tried to reproduce this in my environment, and my dataframe did not lose any rows.
With the same format_column function, the count of my dataframe was the same before and after the rename.
Please recheck whether something other than this function is changing your dataframe.
If you still get the same result, you can check whether the following loses any rows:
import re
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub(r'[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("after replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code, so please recheck that.
I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# Recreate the column from the question.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# First, cast the column to str.
# Second, split the column on the separator \n.
# Third, pass expand=True so that the split parts become two new columns.
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# Rename the new columns.
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error on my side. The issue is that colNew is an Index of labels, so df[colNew] is still a DataFrame rather than a Series, even when it holds only one column.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the dataframe to a Series using iloc[:,0].
Then add another line to split the column header into the new names:
df2=df[colNew].iloc[:,0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
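Since the question asks for something that works without knowing the column label in advance, here is a rough sketch that derives the new names from each merged header. The helper name split_newline_columns is made up for this example, and it assumes every cell in a merged column contains the same number of parts as its header.

```python
import pandas as pd

def split_newline_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Split every column whose label contains a newline into separate columns."""
    out = df.copy()
    merged = [c for c in df.columns if isinstance(c, str) and "\n" in c]
    for label in merged:
        new_names = label.split("\n")
        # n caps the number of splits so the parts line up with the new names.
        parts = out[label].astype(str).str.split("\n", n=len(new_names) - 1, expand=True)
        parts.columns = new_names
        # Drop the merged column and append the split ones at the end.
        out = out.drop(columns=label).join(parts)
    return out

# Hypothetical frame: one merged column plus an ordinary one.
df = pd.DataFrame({"A\nB": ["1\n2", "2\n3", "3\n4"], "C": ["x", "y", "z"]})
result = split_newline_columns(df)
print(result)
```

The split values stay as strings, so convert them afterwards if you need numbers.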
I have a CSV file whose first column is the date and whose 5th column is the hour.
I would like to merge them into a single column with a specific format in order to write another CSV file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to be the correct way, for two reasons:
1) it sets the hour without considering the HOUR column;
2) it does not seem to use any pandas features.
Thanks for any kind of help,
Diedro
Using + operator
You need to convert the data frame elements to strings before joining them. You can also use a different separator in the join, e.g. a dash, an underscore or a space.
import pandas as pd
df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})
print('Before join')
print(df, '\n')
print('After join')
df['Name'] = df['First'].astype(str) + ' ' + df['Last'].astype(str)
print(df)
You can use read_csv with the parse_dates parameter set to a list of both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert the hours to timedeltas and add them to the datetimes afterwards:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='h')
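As far as I know, date_parser and combining columns through a nested parse_dates list are deprecated in recent pandas releases, so here is a sketch of the timedelta approach that avoids both. The inline CSV is a trimmed stand-in for the file in the question, and the DATETIME column name and output format are just examples.

```python
import io
import pandas as pd

# Inline stand-in for the CSV in the question, trimmed to the two relevant columns.
csv_text = "DATE,HOUR\n01/01/2015,1\n01/01/2015,2\n01/01/2015,3\n"
dataR = pd.read_csv(io.StringIO(csv_text))

# Parse the date with an explicit day-first format, then add the hour as a timedelta.
dataR["DATETIME"] = (
    pd.to_datetime(dataR["DATE"], format="%d/%m/%Y")
    + pd.to_timedelta(dataR["HOUR"], unit="h")
)

# Choose the output format when writing; pass a file path as the first argument to write to disk.
out = dataR[["DATETIME"]].to_csv(index=False, date_format="%Y-%m-%d %H:%M")
print(out)
```

Adding a timedelta also copes with an hour value of 24, which rolls over to the next day, whereas a %H format would reject it.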
In pandas, this can be done by column.name.
But how to do the same when it's a column of Spark dataframe?
E.g. the calling program has a Spark dataframe: spark_df
>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']
This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.
If it was pandas dataframe, we could use this:
>>> pandas_df['rank'].name
'rank'
You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can be useful to visualize it as well
spark_df.printSchema()
The only way is to go down a level, to the underlying JVM object.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
Python
As @numeral correctly said, column._jc.toString() works fine for unaliased columns.
For aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without regular expressions: str(column).split(" AS ")[1].split("`")[1] .
I don't know the Scala syntax, but I'm sure it can be done the same way.
If you want the column names of your dataframe, you can use the DataFrame API in pyspark.sql. Note that df.columns is a plain Python list, so it cannot be indexed by column name. I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, the columns attribute of your dataframe, which you have already used, returns a list of column names:
df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column name, you'll need to access it by index:
df.columns[2] will return 'High'
I found the answer is very very simple...
// It is in Java, but it should be same in PySpark
Column col = ds.col("colName"); //the column object
String theNameOftheCol = col.toString();
The variable theNameOftheCol is "colName"
I hope these options are more universal. Cases covered:
column not having an alias
column having an alias
column having several consecutive aliases
column names surrounded with backticks
No regex:
str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]
Using regex:
import re
re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
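To check both expressions without a Spark session, here is a small sketch that runs them against hand-written strings. These strings are assumptions standing in for str(col); the exact formatting depends on the Spark version, so try it against your own output as well. The function names are made up for the example.

```python
import re

def name_no_regex(col_repr: str) -> str:
    # Drop backticks, take the text between the quotes, keep the last alias.
    return col_repr.replace("`", "").split("'")[-2].split(" AS ")[-1]

def name_regex(col_repr: str) -> str:
    # Capture the last word before the closing quote, ignoring backticks.
    return re.search(r"'.*?`?(\w+)`?'", col_repr).group(1)

# Hand-written stand-ins for str(col), mapped to the name we expect back.
samples = {
    "Column<'colName'>": "colName",                  # no alias
    "Column<'colName AS `alias`'>": "alias",         # one alias
    "Column<'colName AS `a1` AS `a2`'>": "a2",       # consecutive aliases
    "Column<'`my_col`'>": "my_col",                  # backticks around the name
}

for text, expected in samples.items():
    print(text, "->", name_no_regex(text), "/", name_regex(text))
```

Note that the regex version only captures word characters, so a name containing spaces or punctuation would come back truncated.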
# Table names as an example, if you have multiple tables
loc = '/mnt/tablename'  # or 'whatever_location/table_name', in case of an external table or any folder
table_name = ['customer', 'department']
for i in table_name:
    print(i)  # the existing table name
    df = spark.read.format('parquet').load(f"{loc}{i.lower()}/")  # create a dataframe from the table name
    for col in df.dtypes:
        print(col[0])  # column name
        print(col[1])  # datatype of the respective column
Since none of the answers have been marked as the Answer -
I may be over-simplifying the OP's ask, but:
my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)