Pandas fails to read SAS as iterable

UPDATE: This is a known bug; pandas.read_sas breaks when trying to read a SAS7BDAT file as an iterable.
I receive an error while attempting pandas.read_sas on pandas 0.18.1 in Spyder 3.0.1 on Windows 10.
I generated a simple dataset in SAS and saved it in the SAS7BDAT format:
data basic;
  do i=1 to 20;
    j=i**2;
    if mod(i,2)=0 then type='Even';
    else type='Odd';
    output;
  end;
run;
I save this dataset to a directory.
The following code successfully imports the SAS dataset when run in Python:
import pandas
f=pandas.read_sas('basic.sas7bdat')
The following code fails:
import pandas
for chunk in pandas.read_sas('basic.sas7bdat', chunksize=1):
    pass
The error generated is
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\common.py", line 101, in __next__
raise AbstractMethodError(self)
AbstractMethodError: This method must be defined in the concrete class of SAS7BDATReader
The same error is produced if I use the option iterator=True, or if I use both iterator= and chunksize= together.
Relevant documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html
Sample SAS7bdat datasets: http://www.principlesofeconometrics.com/sas.htm.
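A possible workaround, sketched below, is to keep the reader object that read_sas returns when chunksize is given and pull chunks with its read() method instead of iterating; the end-of-file check is an assumption and may need adjusting for your pandas version.
import pandas

reader = pandas.read_sas('basic.sas7bdat', chunksize=5)  # with chunksize, read_sas returns the reader itself
while True:
    try:
        chunk = reader.read(5)  # pull the next 5 rows ourselves instead of iterating
    except StopIteration:
        break
    if chunk is None or len(chunk) == 0:  # assumption: end of file yields None or an empty frame
        break
    # process chunk here
    print(chunk)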

Related

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial; it simply prints the names of the files in a given GCS bucket:
from google.cloud import storage
def list_blobs(bucket_name):
"""Lists all the blobs in the bucket."""
# bucket_name = "your-bucket-name"
storage_client = storage.Client()
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
print(blob.name)
list_blobs('sn_project_data')
Now, from the command line, I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick for opening files that relies on StringIO, but that approach doesn't support Excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data passed in as bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and call download_as_bytes() to get the object's contents as bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
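As a side note, if the gcsfs package is installed, recent pandas versions can also resolve a gs:// URL directly, so the explicit blob download above becomes optional. A minimal sketch, assuming default credentials and the same placeholder bucket and file names:
import pandas as pd

# assumption: gcsfs is installed and application default credentials are configured
df = pd.read_excel("gs://your-bucket-name/SampleData.xlsx")
print(df)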

After pandas loading the CSV file, the DataFrame has some wrong columns

I have a big CSV file, more than 5 GB in size, so I tried to load it in parts with the code below.
import pandas as pd
reader = pd.read_csv('/path/to/csv', chunksize=10000, error_bad_lines=True, iterator=True)
for chunk in reader:
    with open('/path/to/save', 'a') as chunk_file:
        chunk.to_csv(chunk_file)
I saw some warnings like:
Skipping line 8245: expected 1728 fields, saw 1729
I thought the saved file would be free of the dirty data, but the output file still has some wrong columns.
I've set error_bad_lines, so I don't understand why this happens.
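Independently of the bad-line warnings, one common source of extra columns when appending chunks like this is that every chunk.to_csv call writes its own header row plus the DataFrame index. A minimal sketch that writes the header once and skips the index, using the same placeholder paths:
import pandas as pd

reader = pd.read_csv('/path/to/csv', chunksize=10000, iterator=True)
with open('/path/to/save', 'w', newline='') as out_file:
    for i, chunk in enumerate(reader):
        # write the header only for the first chunk and drop the index column
        chunk.to_csv(out_file, header=(i == 0), index=False)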

Pandas connect to Oracle error oci.dll not present

I am trying to connect pandas to Oracle as below (I have already downloaded the Oracle client):
import pandas as pd
import cx_Oracle
username='a'
password='d'
host_name = 'aa.com'
service_name= 'ss'
dsn = cx_Oracle.makedsn(host=host_name, port=1535, sid=None, service_name=service_name)
con = cx_Oracle.connect(user=username, password=password, dsn=dsn, encoding="UTF-8", nencoding="UTF-8")
my_sql_query = """ SELECT * FROM schema.tbl1 WHERE ROWNUM = 1 """
##1- Directly reading SQL to Pandas
#Read SQL via Oracle connection to Pandas DataFrame
df = pd.read_sql(my_sql_query, con=con)
I get:
Cannot locate a 64-bit Oracle Client library: "C:\oracle\product\11.2.0\client_1\bin\oci.dll is not the correct architecture". See https://oracle.github.io/odpi/doc/installation.html#windows for help
When I click the link shown in the message, it asks me to run some .exe file. What is this file going to do?
You need to make sure that Python, cx_Oracle, and the Oracle client libraries are all the same 64-bit or 32-bit architecture. It sounds like you have a mismatch.
The link given in the error is for HTML documentation; it doesn't run an exe file. The documentation mentions that a VS Redistributable is needed, which is an exe file. This is a Microsoft package needed by Oracle Instant Client.
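A quick way to check which side of the mismatch you are on, sketched below, is to print the Python architecture and ask cx_Oracle for the client version; the latter raises an error if it cannot load a matching Oracle client library:
import platform
import sys

import cx_Oracle

print(platform.architecture())  # e.g. ('64bit', 'WindowsPE')
print("64-bit Python" if sys.maxsize > 2**32 else "32-bit Python")
print(cx_Oracle.clientversion())  # fails if the Oracle client architecture does not match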

Issue automating CSV import to an RSQLite DB

I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, a character vector holding the names of data.frame objects in my environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(),"")
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = csvFiles[i], overwrite=T)
  i=i+1
}
# EXAMPLE CODE THAT SUCCESSFULLY MANUAL IMPORTS INTO mydb
dbWriteTable(mydb,"DEPARTMENT",DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument, as you are passing a string when a data.frame object is expected. Notice that your manual version does not quote DEPARTMENT for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = get(csvFiles[i]), overwrite=T)
}
Alternatively, consider building a list of named dataframes with mget and loop element-wise between list's names and df elements with Map:
dfs <- mget(csvFiles)
output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite=T), names(dfs), dfs)

Using Python UDF with Hive

I am trying to learn how to use Python UDFs with Hive.
I have a very basic python UDF here:
import sys
for line in sys.stdin:
    line = line.strip()
    print line
Then I add the file in Hive:
ADD FILE /home/hadoop/test2.py;
Now I call the Hive Query:
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py'
FROM admission_type;
This works as expected: no changes are made to the fields and the output is printed as is.
Now, when I modify the UDF by introducing the split function, I get an execution error. How do I debug here? And what am I doing wrong?
New UDF:
import sys
for line in sys.stdin:
    line = line.strip()
    fields = line.split('\t')  # when this line is introduced, I get an execution error
    print line
The UDF and query below split each line into its two fields and declare the output columns explicitly:
import sys
for line in sys.stdin:
    line = line.strip()
    field1, field2 = line.split('\t')
    print '\t'.join([str(field1), str(field2)])
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py' AS (admission_type_id_new, description_new)
FROM admission_type;
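Since TRANSFORM simply streams tab-delimited rows to the script's stdin, one way to debug is to run the script's logic locally against a few sample rows (or pipe a small sample file into the script from the shell) before involving Hive. A minimal sketch with made-up values for admission_type_id and description:
# hypothetical rows in the tab-delimited form Hive streams to the script
sample_rows = ["1\tEMERGENCY", "2\tELECTIVE"]
for line in sample_rows:
    line = line.strip()
    field1, field2 = line.split('\t')
    print('\t'.join([field1, field2]))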