How to read tabular data on s3 in pyspark? - amazon-s3

I have some tab separated data on s3 in a directory s3://mybucket/my/directory/.
Now, I am telling pyspark that I want to use \t as the delimiter to read in just one file like this:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
sc = SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
indata_creds = sqlContext.read.load('s3://mybucket/my/directory/onefile.txt').option("delimiter", "\t")
But it is telling me: assertion failed: No predefined schema found, and no Parquet data files or summary files found under s3://mybucket/my/directory/onefile.txt
How do I tell pyspark that this is a tab-delimited file and not a parquet file?
Or, is there an easier way to read in all the files in the directory at once?
Thanks.
EDIT: I am using pyspark version 1.6.1
The files are on s3, so I am not able to use the usual:
indata_creds = sqlContext.read.text('s3://mybucket/my/directory/')
because when I try that, I get java.io.IOException: No input paths specified in job
Anything else I can try?

Since you're using Apache Spark 1.6.1, you need spark-csv to use this code:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/onefile.txt')
That should work!
Another option is, for example, this answer: instead of splitting by the comma, you could split by tabs, and then load the RDD into a dataframe. However, the first option is easier and already loads it into a dataframe for you.
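A rough sketch of that RDD route on 1.6.1 could look like this (the column names are made up, so adjust them to your files):
from pyspark.sql import Row
lines = sc.textFile('s3://mybucket/my/directory/onefile.txt')
# split each line on tabs and map the fields to (hypothetical) column names
rows = lines.map(lambda line: line.split('\t')) \
            .map(lambda f: Row(col1=f[0], col2=f[1], col3=f[2]))
indata_creds = sqlContext.createDataFrame(rows)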
As for the alternative in your comment, I wouldn't convert to parquet files. There is no need for it unless your data is really huge and compression is necessary.
For your second question in the comment, yes it is possible to read the entire directory. Spark supports regex/glob. So you could do something like this:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/*.txt')
By the way, why are you not using Spark 2.x? It's also available on AWS.

The actual problem was that I needed to add my AWS keys to my spark-env.sh file.
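In case it helps anyone else, a sketch of the runtime alternative to spark-env.sh is setting the keys on the Hadoop configuration (these are the classic s3n property names; s3a uses fs.s3a.access.key / fs.s3a.secret.key instead):
# set the AWS credentials programmatically instead of in spark-env.sh
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<YOUR_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<YOUR_SECRET_ACCESS_KEY>")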

Related

Given a column with S3 paths, I want to read them and store the concatenated version of it. Pyspark

I have a column with s3 file paths; I want to read all those paths and concatenate them later in PySpark.
You can get the paths as a list using map and collect. Iterate over that list to read the paths and append the resulting spark dataframes into another list. Use the second list (which is a list of spark dataframes) to union all the dataframes.
# get all paths in a list
list_of_paths = data_sdf.rdd.map(lambda r: r.links).collect()
# read all paths and store the df in a list as element
list_of_sdf = []
for path in list_of_paths:
    list_of_sdf.append(spark.read.parquet(path))
# check using list_of_sdf[0].show() or list_of_sdf[1].printSchema()
# run union on all of the stored dataframes
from functools import reduce
from pyspark.sql import DataFrame
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)
Use the final_sdf dataframe to write to a new parquet file.
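For example, with a hypothetical output path:
# hypothetical output location; replace with your own bucket/prefix
final_sdf.write.mode("overwrite").parquet("s3://my-bucket/concatenated-output/")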
You can supply multiple paths to the Spark parquet read function. So, assuming these are paths to parquet files that you want to read into one DataFrame, you can do something like:
list_of_paths = [r.links for r in links_df.select("links").collect()]
aggregate_df = spark.read.parquet(*list_of_paths)

pyspark dataframe writing csv files twice in s3

I have created a pyspark dataframe and I am trying to write it to an s3 bucket in csv format. The file is written as csv, but the issue is that it writes the file twice (i.e., one with the actual data and another that is empty). I have checked the data frame by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only 1 file in the output.
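If you want to be explicit about that, a sketch (0 is already the default, so this just pins it down):
# make sure Spark does not split the single partition across multiple files
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
df.coalesce(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)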

Why is the file csv in pandas stored incorrectly?

I'm practicing with csv files in pandas, but the output does not look right when I open it. Does anyone know the problem?
import pandas as pd
df = pd.DataFrame({
    'nme': ["aparna", "pankaj", "sudhir", "Geeku", "seed", "kasoo", "jak", "por"],
    'deg': ["MBA", "BCA", "M.Tech", "MBA", "nba", "jlk", "esda", "pin"],
    'scr': [90, 80, 8, 98, 34, 5, 23, 22]})
df.to_csv('filel3.csv')
use "sep=';'" in your to_csv method.
It could be that your Excel default settings do not import csv with "," as the delimiter. First try setting the comma in Excel's csv import settings.
Otherwise you can use sep=';' when writing, which may get things working faster, but that depends on your needs.
Kind regards.

pypi sas7bdat to_data_frame taking too long for large data(5 GB)

I have a 5 GB SAS file and the requirement is to create a parquet file in Hadoop. I am using the SAS7BDAT library with the following approach, which takes more than 5 hours to create the pandas dataframe when running pyspark in client mode. Curious to know if there is any better way of doing the same.
I know there is the saurfang package available, which is more efficient in this case, but we do not want to use any third-party software.
f = sas7bdat.SAS7BDAT(str(source_file))
pandas_df = f.to_data_frame()
spark_df = spark.createDataFrame(pandas_df)
del pandas_df
spark_df.write.save(dest_file,format='parquet', mode='Overwrite')
Please use Spark to read the file, not Pandas
https://github.com/saurfang/spark-sas7bdat/blob/master/README.md#python-api
Add this to your packages
saurfang:spark-sas7bdat:2.1.0-s_2.11
Note, I've not personally used this; I only searched for "sas7bdat + Spark". If you have issues, please report them here:
https://github.com/saurfang/spark-sas7bdat/issues/
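A rough sketch of what that looks like, going by the project's README (verify the exact format name and package version there):
# launch with the package, e.g.:
#   spark-submit --packages saurfang:spark-sas7bdat:2.1.0-s_2.11 your_job.py
# then read the SAS file straight into a Spark DataFrame (no pandas step) and write parquet
spark_df = spark.read.format("com.github.saurfang.sas.spark").load(str(source_file))
spark_df.write.mode("overwrite").parquet(dest_file)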

Reading csv file from s3 using pyarrow

I want to read a csv file located in an s3 bucket using pyarrow and convert it to parquet in another bucket.
I am facing a problem reading the csv file from s3. I tried the code below but it failed. Does pyarrow support reading csv from s3?
from pyarrow import csv
s3_input_csv_path='s3://bucket1/0001.csv'
table=csv.read_csv(s3_input_csv_path)
This is throwing the error:
"errorMessage": "Failed to open local file 's3://bucket1/0001.csv', error: No such file or directory",
I know we can read the csv file using boto3, then use pandas to convert it into a data frame, and finally convert it to parquet using pyarrow. But in that approach pandas also has to be added to the package, which pushes the package size beyond the 250 MB limit for Lambda when taken along with pyarrow.
Try passing a file handle to pyarrow.csv.read_csv instead of an S3 file path.
Note that future editions of pyarrow will have built-in S3 support but I am not sure of the timeline (and any answer I provide here will grow quickly out of date with the nature of StackOverflow).
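A minimal sketch of that file-handle approach, assuming the s3fs package is available:
import pyarrow.csv as pv
import pyarrow.parquet as pq
from s3fs import S3FileSystem
s3 = S3FileSystem()
# open the object with s3fs and hand the file-like object to pyarrow.csv.read_csv
with s3.open("bucket1/0001.csv", "rb") as f:
    table = pv.read_csv(f)   # pyarrow.Table
# write the table back to the other bucket as parquet
with s3.open("bucket2/0001.parquet", "wb") as f:
    pq.write_table(table, f)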
# a related pattern: the same s3fs filesystem object can also be combined with
# pyarrow.parquet directly (note that ParquetDataset expects Parquet data, so this
# applies once the data is in Parquet form)
import pyarrow.parquet as pq
from s3fs import S3FileSystem
s3 = S3FileSystem()  # or s3fs.S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3_input_csv_path = "s3://bucket1/0001.csv"
dataset = pq.ParquetDataset(s3_input_csv_path, filesystem=s3)
table = dataset.read_pandas()   # a pyarrow.Table
print(table.to_pandas())
s3_output_csv_path = "s3://bucket2/0001.csv"
# write the table to a dataset in another bucket
pq.write_to_dataset(table=table,
                    root_path=s3_output_csv_path,
                    filesystem=s3)
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
Example of CSV read:
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
It's not possible as of now, but here is a workaround: we can load the data with pandas and cast it to a pyarrow table.
import pandas as pd
import pyarrow as pa
df = pd.read_csv("s3://your_csv_file.csv", nrows=10)  # reading only 10 rows (requires s3fs)
pa.Table.from_pandas(df)