Pandas - No Quote Character saved to file

I am having a difficult time getting any quote character to print out using the to_csv function in pandas.
import pandas as pd

final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', doublequote=True, mode='w',
             sep='\x14', quotechar='\xFE', index=False)
print(final)
I have tried various options without success and am not sure what I am missing. Wondering if anyone can point me in the right direction. Thank you in advance.

Finally! It appears the documentation has changed or is not up to date on this. Adding the option quoting=1 cures the issue; apparently quoting=csv.QUOTE_ALL no longer works.
The complete command is:
import pandas as pd

final = pd.DataFrame(dataset.loc[::])
final.to_csv(r'c:\temp\temp2.dat', index=False, doublequote=True,
             sep='\x14', quoting=1, quotechar='\xFE')
print(final)
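For what it's worth, the csv module's quoting constants are plain integers, and quoting=1 has the same value as csv.QUOTE_ALL. A minimal sketch on a throwaway dataframe (the file name and columns here are made up):

import csv
import pandas as pd

# QUOTE_MINIMAL=0, QUOTE_ALL=1, QUOTE_NONNUMERIC=2, QUOTE_NONE=3
print(csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE)

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})
# Passing the constant is equivalent to passing the literal 1
df.to_csv('quoted.csv', quoting=csv.QUOTE_ALL, quotechar='"', index=False)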

Related

Python Pandas Series.any() ==

Am I using pandas correctly? I am trying to loop through files and find whether any value in a series matches a string:
import pandas as pd

path = user/Desktop/New Folder
for file in path:
    df = pd.read_excel(file)
    if df[Series].any() == "string value"
        do_something()
Please check whether this addresses your problem:
if (df['your column'] == "string value").any():
    do_something()
I think you should also fix your file iteration (see the sketch below); please check this: https://www.newbedev.com/python/howto/how-to-iterate-over-files-in-a-given-directory/
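Putting both fixes together, a minimal sketch (the folder path, the "your column" name, and do_something() are placeholders carried over from the question; this assumes .xlsx files):

import glob
import os
import pandas as pd

path = os.path.expanduser("~/Desktop/New Folder")
for file in glob.glob(os.path.join(path, "*.xlsx")):
    df = pd.read_excel(file)
    # (column == value) yields a boolean Series; .any() collapses it to one bool
    if (df["your column"] == "string value").any():
        do_something()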

Exporting Multiple log files data to single Excel using Pandas

How do I export multiple dataframes to a single Excel file? I'm not talking about merging or combining; I just want a specific range of lines from multiple log files compiled into a single Excel sheet. I already wrote some code, but I am stuck:
import pandas as pd
import glob
import os
from openpyxl.workbook import Workbook

file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))
for files in read_files:
    logs = pd.read_csv(files, header=None).loc[540:1060, :]
    print(logs)
    logs.to_excel("LBS.xlsx")
When I do this, I only get data from the first log.
Appreciate your recommendations. Thanks!
You are saving logs, the loop variable that is overwritten on each iteration, so the output file is rewritten every time. What you want is to build a list of dataframes, concatenate them, and then save the result to Excel:
file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))
dfs = []
for file in read_files:
    log = pd.read_csv(file, header=None).loc[540:1060, :]
    dfs.append(log)
logs = pd.concat(dfs)
logs.to_excel("LBS.xlsx")
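If you would rather keep each log on its own sheet instead of stacking them, pandas' ExcelWriter can write several dataframes into one workbook. A minimal sketch reusing read_files and the 540:1060 slice from above (the sheet names are made up):

with pd.ExcelWriter("LBS.xlsx") as writer:
    for i, file in enumerate(read_files):
        log = pd.read_csv(file, header=None).loc[540:1060, :]
        # Each dataframe lands on its own sheet of the same workbook
        log.to_excel(writer, sheet_name=f"log_{i}")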

How to read csv files from s3 bucket using Pyspark (in macos)?

I am trying to read a csv dataframe from an s3 bucket, but I am facing issues. Can you let me know where I am making mistakes here?
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('local')
conf.setAppName('sparkbasic')
sc = SparkContext.getOrCreate(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "abc")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xyz")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "mybucket/path/fileeast-1.redshift.amazonaws.com")
from pyspark.sql import SparkSession
sc = SparkSession.builder.appName('sparkbasic').getOrCreate()
This is the line where I get the error:
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This is the error I get. I tried the links given in Stack Overflow answers, but nothing has worked for me so far:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Maybe you can have a look at S3Fs.
Given your details, a configuration like this could work:
import s3fs

fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': 'fileeast-1.redshift.amazonaws.com',
                                      'aws_access_key_id': 'abc',
                                      'aws_secret_access_key': 'xyz'})
To check whether you can interact with s3, you can try the following command (NB: change somefile.csv to an existing file):
fs.info('s3://bucket/path/file/somefile.csv')
Note that in fs.info the path starts with s3. If you do not encounter an error, you can hope that the following command works:
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This time the path begins with s3a.
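As for the ClassNotFoundException itself, the usual cause is that the hadoop-aws jar, which contains S3AFileSystem, is not on Spark's classpath. A minimal sketch of pulling it in via spark.jars.packages; the 3.3.1 version here is an assumption and must match your Hadoop build:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('sparkbasic')
         # Downloads hadoop-aws (and its transitive AWS SDK) at session start
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
         .getOrCreate())
csvDf = spark.read.csv("s3a://bucket/path/file/*.csv")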

Reading a partitioned dataset in aws s3 with pyarrow doesn't add partition columns

I'm trying to read a partitioned dataset in aws s3. It looks like:
MyDirectory/
    code=1/file.parquet
    code=2/another.parquet
    code=3/another.parquet
I created a file_list containing the paths to all the files in the directory, then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works, except that the partition column code doesn't exist in the dataframe df.
I also tried it using one path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/MyDirectory". I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"],
)

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret,
}
df = dd.read_parquet("s3://folder", storage_options=storage_options)
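If you want to stay with pyarrow itself, its newer dataset API discovers hive-style partitions such as code=1 and adds them as columns. A minimal sketch, assuming a reasonably recent pyarrow and that your s3 credentials are already configured:

import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/MyDirectory/", format="parquet",
                     partitioning="hive")
# The resulting dataframe now includes the "code" partition column
df = dataset.to_table().to_pandas()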

ipython - pandasql.sqldf doesn't return an error

After importing
import pandas
import pandasql
and running:
q = """
select min(cast(maxtempi as integer))
from weather_data
where min(cast(maxtempi as integer)) > 55
"""
print(pandasql.sqldf(q.lower(), locals()))
None is returned, with no result set and no error. Obviously the error is in the where clause.
How do I get pandasql.sqldf to print an error?
Normally it should be fine and no error should occur.
Please also check the following:
(1) Is weather_data an instance of DataFrame?
(2) Is pandasql installed? If you use PyCharm, restart it so that the newly installed pandasql package is picked up.
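As a side note, standard SQL does not allow an aggregate such as min() inside a where clause, which is the likely reason the query fails. A minimal sketch of a corrected query, assuming weather_data is a DataFrame in scope:

import pandas
import pandasql

# Filter rows in the where clause, then aggregate over what remains
q = """
select min(cast(maxtempi as integer)) as min_maxtempi
from weather_data
where cast(maxtempi as integer) > 55
"""
print(pandasql.sqldf(q.lower(), locals()))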