I'm creating a file export and a number of these fields are set up as numeric(18,4) in both the source table and the flat file column. When the file is generated, any numbers which are < 1 are shown as .#, e.g. .52 instead of 0.52.
What needs to be done to fix this? The only ways I can think of, neither of which is ideal, are:
1. Output them as strings
2. Use a derived column on all the numeric fields (a sketch of that expression follows below).
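For reference, option 2 would come down to a Derived Column expression along these lines; a rough sketch, assuming a column named Amount holding non-negative values:

Amount < 1 && Amount >= 0 ? "0" + (DT_WSTR,24)Amount : (DT_WSTR,24)Amount

Negative fractions (between -1 and 0) would need similar handling to get the "-0" prefix.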
I have a CSV file containing rows as records. I want to make sure that each CSV column contains data of the same data type. Currently I do this using the following method:
I import the CSV using pd.read_csv('path/to/csv', dtype={"c1": int, "c2": str}), so if a column contains data of a different type, an error is thrown. However, this approach has two major problems.
(1) I have to read the entire CSV into memory.
(2) The row with the incorrect data type is not identified.
Reading the CSV into memory is not a problem for me right now, since the CSVs are not huge. Is there a better way to solve this problem?
One way to solve your issue is to open the file without type checking, then lazily iterate through it one row at a time, and check whether each row is valid or malformed as needed.
For example:
import pandas as pd

# chunksize=1 yields one-row DataFrames lazily instead of loading the whole file
for chunk in pd.read_csv('path/to/csv', chunksize=1):
    try:
        print(chunk.astype({"c1": int, "c2": str}))
        # or do whatever you want with the row
    except ValueError:
        print("Malformed row found!")
self-taught SAS user here.
I often work with datasets that I have little control over and are shared among several different users.
I have generally been reading files in as CSVs, using an infile statement plus blocks of informat, format, and input statements to define the variables. During this process, can I go ahead and rename variables (provided that everything is renamed in the correct order), or do the names have to match the original dataset and be renamed in a later data step?
For example, say the variable name in the dataset is '100% Fully Paid Out.' I know SAS variable names can't start with numbers, and I'd also like to simplify variable names in general, so could I do something like the following:
infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;
Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all the datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips are highly appreciated--thanks!
EDIT
Maybe the question I should ask is this: is there any way of keeping the dollar and percentage formats from the CSV (proc import seems to convert these to character)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer to avoid additional manual steps, and I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the CSV, so that variables are read in exactly how I want them to be read in?
Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer and/or I'm missing required drivers (or both). I was never able to do this on a 32-bit computer either.
CSV files do not contain any metadata about the variable types, as your note about trying to import them into Excel demonstrates. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer could vary from file to file based on the particular data values that happen to appear.
If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel, so there are no conflicts between 32-bit and 64-bit installations; in fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have a consistent data type within a single column of a single sheet.
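For example, a minimal sketch of the XLSX engine (the path and sheet name are assumptions):

libname xl xlsx 'path/to/file.xlsx';

data want;
  set xl.'Sheet1'n;  /* the name literal allows sheet names that are not valid SAS names */
run;

libname xl clear;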
You are better off writing your own data step to read the file. That way you can enforce consistency.
What I typically do when given a CSV file is copy the names from the first row and use them to create a LENGTH statement. This both defines the variables and sets their order. You could at this point give the variables new names.
length paid $3 date amount 8 ;
Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.
informat date mmddyy. amount dollar. ;
Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach $CHAR. format in order to preserve leading spaces.
format date yymmdd10. ;
Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.
input paid -- amount ;
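Putting those pieces together, the full data step might look like this (the file path, delimiter, and header-skipping options are assumptions):

data want;
  infile 'path/to/file.csv' dsd dlm=',' firstobs=2 truncover;
  length paid $3 date amount 8 ;
  informat date mmddyy. amount dollar. ;
  format date yymmdd10. ;
  input paid -- amount ;
run;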
I have been trying to validate the data type of data that I get from a flat file through Pig.
A simple cat can do the trick, but the flat files are huge and they sometimes contain special characters.
I need to filter out the records containing special characters, and also records where the data type is not int.
Is there any way to do this in Pig?
I am trying to find a substitute for Java's getType().getName() kind of usage here.
Enforcing a schema and using DESCRIBE while loading the data, then removing the mismatches, is what we do now, but is there any way to do it without enforcing the schema?
Any suggestions will be helpful.
Load the data as a line:chararray and use a regular expression to filter out the records that contain characters other than digits:
A = LOAD 'data.txt' AS (line:chararray);
B = FILTER A BY (line matches '\\d+$'); -- Change according to your needs.
DUMP B;
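If you specifically want to catch values that are not valid ints, one alternative sketch relies on Pig's cast behaviour: casting a chararray that cannot be parsed yields a null (the file and field names below are assumptions):

A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE line, (int)line AS as_int;
good = FILTER B BY as_int IS NOT NULL; -- parsed cleanly as int
bad  = FILTER B BY as_int IS NULL;     -- contained non-numeric characters
DUMP bad;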
I am importing data from a text file and have hit a snag. I have a numeric field which occasionally has very large values (10 billion+), and some of these values are being converted to NULLs.
Upon further testing I have isolated the problem as follows: the first 25 rows of data are used to determine the field size, and if none of the first 25 values are large then any value >= 2,147,483,648 (2^31) which comes after is thrown out.
I'm using ADO and the following connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=FILE_ADDRESS;Extended Properties=""text;HDR=YES;FMT=Delimited""
Therefore, can anyone suggest how I can get round this problem without having to get the source data sorted descending on the large value column? Is there some way I could define the data types of the recordset prior to importing rather than let it decide for itself?
Many thanks!
You can use a schema.ini file, placed in the directory you are connecting to, which describes the column types.
See here for details:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms709353(v=vs.85).aspx
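For example, assuming the file is named data.csv and has three columns (the names and types below are assumptions), a schema.ini in the same directory might look like this:

[data.csv]
Format=CSVDelimited
ColNameHeader=True
Col1=ID Long
Col2=BigAmount Double
Col3=Description Text

Declaring the large column as Double (or Currency) avoids the 32-bit integer limit that gets inferred from the first rows. Alternatively, MaxScanRows=0 tells the driver to scan the entire file instead of just the first 25 rows when guessing types.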
I would like to be able to produce a file by running a command or batch which basically exports a table or view (SELECT * FROM tbl) in text form (default conversions to text for dates, numbers, etc. are fine): tab-delimited, with NULLs converted to empty fields (i.e. a NULL column would have nothing between the tab characters), with appropriate line termination (CRLF or Windows), and preferably with column headings.
This is the same export I can get in SQL Assistant 12.0 by choosing the export option, using a tab delimiter, setting my NULL value to '' and including column headings.
I have been unable to find the right combination of options - the closest I have gotten is by building a single column with CAST and '09'XC, but the rows still have a leading 2-byte length indicator in most settings I have tried. I would prefer not to have to build large strings for the various different tables.
To eliminate the 2-byte length indicator in the FastExport output:
.EXPORT OUTFILE &dwoutfile MODE RECORD FORMAT TEXT;
And your SELECT must generate a fixed-length export field, e.g. CHAR(n). So you will inflate the size of the file and end up with a delimited but fixed-length export file.
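A rough sketch of what that SELECT might look like, building on the '09'XC tab-literal approach you mentioned (the table and column names are assumptions):

.EXPORT OUTFILE &dwoutfile MODE RECORD FORMAT TEXT;

SELECT CAST(
         TRIM(COALESCE(CAST(col1 AS VARCHAR(30)), '')) || '09'XC ||
         TRIM(COALESCE(CAST(col2 AS VARCHAR(30)), ''))
       AS CHAR(100)) AS rec
FROM mydb.mytable;

The COALESCE turns NULLs into empty fields, and the outer CAST to CHAR(n) provides the fixed length that RECORD/TEXT mode requires.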
The other option, if you are in a UNIX/Linux environment, is to post-process the file and strip the leading two bytes, or write an access module (AXSMOD) in C to do it as the records are streamed to the file.