How to read a pandas dataframe in Databricks?

I don't know why, but my file located at "FileStore/tables/train.csv" is not readable using pandas on the Databricks platform. I tried:
pd.read_csv("/dbfs/FileStore/tables/train.csv")
and got
FileNotFoundError: [Errno 2] File /dbfs/FileStore/tables/train.csv does not exist: '/dbfs/FileStore/tables/train.csv'

Sorry for the late reply.
The Microsoft Databricks docs cover this: they walk you through copying a file uploaded to "FileStore" into a "/tmp" location on the driver.
With the file copied, code like the following should work:
import pandas as pd
pd.read_csv(path)  # path is the local path of the copied file
and
import pyspark.pandas as ps
ps.read_csv(path)
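For illustration, a minimal sketch of that copy-then-read flow, assuming a Databricks notebook where dbutils is predefined (the /tmp destination here is just an example location):

import pandas as pd

# Copy the file from DBFS to the driver's local filesystem
dbutils.fs.cp("dbfs:/FileStore/tables/train.csv", "file:/tmp/train.csv")

# pandas reads from the local filesystem, so the copied file is now visible
df = pd.read_csv("/tmp/train.csv")
print(df.head())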

Related

Cannot read .xlsx file with read_excel()

I want to open the .xlsx file through read_excel().
However, an error message is printed even though the openpyxl and pandas packages are installed.
The pandas version is 0.24.2 and the openpyxl version is 3.0.10.
The error message is - ValueError: Unknown engine: openpyxl
import pandas as pd
import math
retail_df = pd.read_excel('./Online_Retail.xlsx',engine='openpyxl')
print(retail_df.head())
In pandas 0.24.2, read_excel() does not recognize 'openpyxl' as an engine (that option arrived in later pandas releases), which is exactly what the "Unknown engine" error is telling you. The default engine in 0.24.2, xlrd, can read .xlsx files, so simply drop the engine argument.
Your updated working code for reading the Excel file is:
import pandas as pd
import math
retail_df = pd.read_excel('./Online_Retail.xlsx')
print(retail_df.head())
I tested this code on my side and it runs without the error.

FileNotFoundError in Python3 (Code editor: Pycharm)

I have imported
import numpy as np
and I have used
xy = np.loadtxt('./Desktop/wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
but Python 3 is not able to find the file and I really do not know why. Can anyone help me, please?
Can you try again, specifying the full file path? For example:
/home/username/Desktop/wine.csv
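A minimal sketch of that fix, assuming the file really lives in your home directory's Desktop folder (os.path.expanduser avoids hard-coding the username):

import os
import numpy as np

# './Desktop/wine.csv' is resolved relative to the working directory PyCharm
# runs in (usually the project folder), not your home directory, hence the error
path = os.path.expanduser('~/Desktop/wine.csv')
xy = np.loadtxt(path, delimiter=',', dtype=np.float32, skiprows=1)
print(xy.shape)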

pandas-read-xml has error on 'json-normalize'

I saw there is a way to read XML files directly with pandas, so I followed the instructions and used this package. However, I keep getting errors.
https://pypi.org/project/pandas-read-xml/
import pandas as pd
import pandas_read_xml as pdx
from pandas.io.json import json_normalize
The error is raised by the last line:
ImportError: cannot import name 'json_normalize'
I am using a Python 3 kernel. Can anyone tell me what is wrong with it?
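A likely cause: json_normalize was moved to the top-level pandas namespace (the pandas.io.json location was deprecated in pandas 1.0 and removed in later releases), so the old import fails on recent versions. A minimal sketch of the replacement import, with a toy nested record for illustration:

import pandas as pd

# json_normalize now lives at the top level of pandas
from pandas import json_normalize  # or call pd.json_normalize(...) directly

data = [{'id': 1, 'info': {'name': 'a'}}]  # toy record, for illustration only
print(json_normalize(data))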

How to load csv file into SparkSession

I am learning PySpark from an online source. I googled around and found that I could read a csv file into a Spark DataFrame using the following code:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.csv('my_file.csv', header=True)
pandas_df = spark_df.toPandas()
However, the online site I am learning from loads the csv file into the SparkSession somehow, without telling the audience how. That is, when I type (in the online site's browser)
print(spark.catalog.listTables())
the following output is returned:
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
When I try the same print on my own setup, I get an empty list back.
Is there any way to load the csv file into the SparkSession like that? I have googled this, but most of what I found is how to load a csv into a Spark DataFrame, as I showed above.
Thanks very much.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()  # substitute your own app name
df = spark.read.csv('invoice.csv',inferSchema=True,header=True)
It seems the online site covers how to do this much later than it should. Registering the DataFrame as a temporary view is what makes it show up in the catalog:
sdf = spark.read.csv('my_file.csv', header=True)
pdf = sdf.toPandas()
spark_temp = spark.createDataFrame(pdf)
spark_temp.createOrReplaceTempView('my_file')
print(spark.catalog.listTables())
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
One question remains, though. I could not use pd.read_csv('my_file.csv') directly for this; it resulted in some kind of merge error.
This can work (my_spark being your existing SparkSession):
df = my_spark.read.csv("my_file.csv",inferSchema=True,header=True)
df.createOrReplaceTempView('my_file')
print(my_spark.catalog.listTables())

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

I am converting a csv file to the feather format using the code below:
import pandas as pd
import feather
df = pd.read_csv('myfile.csv')
feather.write_dataframe(df, 'myfile.feather')
myfile.csv is over 2 GB, and when I run the code I get the error message below:
File "table.pxi", line 705, in pyarrow.lib.RecordBatch.from_pandas
File "table.pxi", line 739, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
I've looked at similar questions and found that feather only recently started to support files over 2 GB. But my feather version is 0.4, so I think mine should already be able to handle large files. Why do I get this error? Any ideas would be appreciated, thanks.
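One hedged suggestion, assuming the failure comes from an old pyarrow underneath (columns over 2 GB arrive as ChunkedArrays, which early Feather writers could not handle): upgrade pyarrow and write through pandas' built-in to_feather instead of the standalone feather package.

# pip install --upgrade pandas pyarrow
import pandas as pd

df = pd.read_csv('myfile.csv')

# pandas' to_feather uses pyarrow's Feather writer, which handles
# chunked (large) columns in recent releases
df.to_feather('myfile.feather')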