Spark read multiple csv into one dataframe - error with path

I'm trying to read all the csv files under an HDFS directory into a dataframe, but I got an error that says it's "not a valid DFS filename". Could someone help point out what I did wrong? I tried without the hdfs:// part as well, but it says the path could not be found. Many thanks.
val filelist = "hdfs://path/to/file/file1.csv,hdfs://path/to/file/file2.csv "
val df = spark.read.csv(filelist)

val df = spark.read.csv(filelist:_*)
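For reference, a minimal sketch of the same read in PySpark (which also appears later on this page), using the question's placeholder paths: the csv reader accepts a list of paths rather than one comma-joined string. In Scala the reader takes varargs, so a Seq[String] would be splatted with : _* instead of calling it on a plain String.

from pyspark.sql import SparkSession

# A sketch with placeholder paths: pass the files as a list, not as a single
# comma-separated string, so each entry is treated as its own path.
spark = SparkSession.builder.getOrCreate()
file_list = ["hdfs://path/to/file/file1.csv", "hdfs://path/to/file/file2.csv"]
df = spark.read.csv(file_list)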

Related

Julia load dataframe from s3 csv file

I'm having trouble finding an example to follow online for this simple use-case:
Load a CSV file from an s3 object location to julia DataFrame.
Here is what I tried that didn't work:
using AWSS3, DataFrames, CSV
filepath = S3Path("s3://muh-bucket/path/data.csv")
CSV.File(filepath) |> DataFrames # fails
# but I am able to stat the file
stat(filepath)
#=
Status( mode = -rw-rw-rw-,
...etc
size = 2141032 (2.0M),
blksize = 4096 (4.0K),
blocks = 523,
mtime = 2021-09-01T23:55:26,
...etc
=#
I can also read the file to a string object locally:
data_as_string = String(AWSS3.read(filepath));
#"column_1\tcolumn_2\tcolumn_3\t...etc..."
My AWS config is in order; I can access the object from julia locally.
How do I get this into a dataframe?
Thanks to help from the nice people on the julia slack channel (#data):
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))
typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})
df = CSV.read(bytes, DataFrame)
Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
Helpful SO post for getting AWS configs in order

How to iterate over a list of csv files and compile files with common filenames into a single csv as multiple columns

I am currently iterating through a list of csv files and want to combine csv files with common filename strings into a single csv file, merging the data from each new csv file in as a set of two new columns. I am having trouble with the final part of this: the append command adds the data as rows at the bottom of the csv. I have tried pd.concat, but must be going wrong somewhere. Any help would be much appreciated.
Note: the code uses Python 2, just for compatibility with the software I am using; a Python 3 solution is welcome if it translates.
Here is the code I'm currently working with:
rb_headers = ["OID_RB", "Id_RB", "ORIG_FID_RB", "POINT_X_RB", "POINT_Y_RB"]
for i in coords:
    if fnmatch.fnmatch(i, '*RB_bank_xycoords.csv'):
        df = pd.read_csv(i, header=0, names=rb_headers)
        df2 = df[::-1]
        # Export the inverted RB csv file as a new csv to the original folder, overwriting the original
        df2.to_csv(bankcoords + i, index=False)

# Iterate through csvs to combine those with similar key strings in their filenames and merge them into a single csv
files_of_interest = {}
forconc = []
for filename in coords:
    if filename[-4:] == '.csv':
        key = filename[:39]
        files_of_interest.setdefault(key, [])
        files_of_interest[key].append(filename)

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))
    files_of_interest[key] = buff_df

redundant_headers = ["OID", "Id", "ORIG_FID", "OID_RB", "Id_RB", "ORIG_FID_RB"]
outdf = buff_df.drop(redundant_headers, axis=1)
If you only want to merge them into one file:
paths_list = ['path1', 'path2', ...]
dfs = [pd.read_csv(f, header=None, sep=";") for f in paths_list]
dfs = pd.concat(dfs, ignore_index=True)
dfs.to_csv(...)
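Since the goal in the question is to add each matching file's data as new columns rather than new rows, here is a minimal sketch of the column-wise version, assuming files_of_interest maps each key to its list of csv paths (as built above) and using a hypothetical output filename:

# Combine the files that share a filename key side by side (as new columns)
# instead of stacking them as rows.
for key, filenames in files_of_interest.items():
    frames = [pd.read_csv(f) for f in filenames]
    # axis=1 places each file's columns next to the previous ones; reset_index
    # first so rows are aligned by position rather than by index label.
    combined = pd.concat([f.reset_index(drop=True) for f in frames], axis=1)
    combined.to_csv(key + "_combined.csv", index=False)  # hypothetical output name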

Python Pandas Series.any() ==

Am I using pandas correctly? I am trying to loop through files and find out whether any value in a series matches a given string.
import pandas as pd
path = user/Desktop/New Folder
for file in path:
    df = pd.read_excel(file)
    if df[Series].any() == "string value":
        do_something()
Please check if this addresses your problem:
if df[df['your column'] == "string value"].any():
    do_something()
I think you should also fix your file iteration; please check this: https://www.newbedev.com/python/howto/how-to-iterate-over-files-in-a-given-directory/
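A minimal sketch putting both pieces together, assuming the spreadsheets live in a local folder, the column of interest is named "my_column" (a placeholder), and do_something is the question's own placeholder:

import pandas as pd
from pathlib import Path

folder = Path("user/Desktop/New Folder")
for file in folder.glob("*.xlsx"):  # iterate over actual files, not over the characters of a string
    df = pd.read_excel(file)
    # Compare the column element-wise, then check whether any comparison is True
    if (df["my_column"] == "string value").any():
        do_something()  # placeholder from the question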

How to read csv files from s3 bucket using Pyspark (in macos)?

I am trying to read a csv into a df from an s3 bucket, but I'm facing issues. Can you let me know where I am making mistakes here?
conf=SparkConf()
conf.setMaster('local')
conf.setAppName('sparkbasic')
sc = SparkContext.getOrCreate(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "abc")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xyz")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "mybucket/path/fileeast-1.redshift.amazonaws.com")
from pyspark.sql import SparkSession
sc = SparkSession.builder.appName('sparkbasic').getOrCreate()
This is the code where I get the error
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This is the error I get. I tried the links given in stackoverflow answers, but nothing has worked for me so far:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Maybe you can have a look at S3Fs.
Given your details, maybe a configuration like this could work:
import s3fs
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': 'fileeast-1.redshift.amazonaws.com',
                                      "aws_access_key_id": "abc",
                                      "aws_secret_access_key": "xyz"})
To check whether you manage to interact with s3, you can try the following command (NB: change somefile.csv to an existing file):
fs.info('s3://bucket/path/file/somefile.csv')
Note that in fs.info the path starts with s3. If you do not encounter an error, you can hope that the following command works:
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This time the path begins with s3a.
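As a follow-up to the s3fs check above, a minimal sketch (assuming fs.info succeeded) of streaming one of the csv files straight into pandas with the same filesystem object; this reads a single file, not the whole *.csv glob:

import pandas as pd

# Reuse the s3fs filesystem from the answer above to open the object and let
# pandas parse it; the bucket, path and filename are placeholders from the question.
with fs.open('s3://bucket/path/file/somefile.csv', 'rb') as f:
    csv_pdf = pd.read_csv(f)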

Reading a partitioned dataset in AWS S3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in AWS S3; it looks like:
MyDirectory
  code=1/file.parquet
  code=2/another.parquet
  code=3/another.parquet
I created a file_list containing the paths to all the files in the directory, then executed:
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works except that the partition column code doesn't exist in the dataframe df.
I also tried it using a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty", and I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret}

df = dd.read_parquet("s3://folder",
                     storage_options=storage_options)
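Staying with pyarrow itself, a minimal sketch under the assumption of a reasonably recent pyarrow and the s3fs filesystem object fs from the question: pointing the dataset API at the directory (rather than at a file list) and declaring hive-style partitioning makes the code=... folder names come back as a regular code column.

import pyarrow.dataset as ds

# Hive-style partitioning turns the "code=1", "code=2", ... directory names
# into a "code" column in the resulting table. `fs` is the question's s3 filesystem.
dataset = ds.dataset("bucket/MyDirectory", filesystem=fs,
                     format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()  # now includes the partition column "code"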