How do I write output to a text file without overwriting the existing contents of the file? - file-io

I have developed this function, which transfers data from Erlang to a .txt file:
exporttxt() ->
    F1 = "1",
    F2 = "afif",
    F3 = "kaled",
    file:write_file("test.txt", [io_lib:format("~p\t~p\t~p~n", [F1, F2, F3])]).
After running this function test.txt contains these values:
"1" "afif" "kaled"
but when I change F1, F2 and F3 in the function exporttxt() to:
F1 = "2",
F2 = "ahmed",
F3 = "alagi",
then test.txt contains just these values:
"2" "ahmed" "alagi"
and I want test.txt to contain:
"1" "afif" "kaled"
"2" "ahmed" "alagi"
The problem is that at each execution of the function it records the new data
and the old data in test.txt is deleted.
How can I write new data to test.txt without overwriting existing data?

Use file:write_file/3 for this.
Its third argument is Modes, a list of file modes. The possible modes are read | write | append | exclusive | raw | binary | {delayed_write, Size, Delay} | delayed_write | {read_ahead, Size} | read_ahead | compressed | {encoding, Encoding}. The append mode is the one you need, so the call in exporttxt() becomes:
file:write_file("test.txt", [io_lib:format("~p\t~p\t~p~n", [F1, F2, F3])], [append]).

Related

Returning a value based on multiple conditions in Excel

Consider the following data:
Item | Overall | Individual | newColumn
A | Fail | Pass | blank
A | Fail | Fail | blank
B | Fail | Pass | issue
B | Fail | Pass | issue
C | Pass | Pass | blank
I have the logic built out for the first 3 columns already. There are two levels of fails in this data:
overall, and
individual.
If any of the individual fail, the overall fails. Sometimes the overall can fail even though all the individuals are fine. This logic is already built out.
I am trying to find a formula for the newColumn. If all the individuals are a pass for a given item (example item B), but the overall is still a fail, the cell should return the text "issue". It is ok if it returns issue twice, not sure if you can non-dupe that part. I've tried various forms of countifs/and/ors and creating columns that count distinct values but I always find a scenario where it will break the logic.
Try this:
=IF(COUNTIFS($A$2:$A$6,A2,$C$2:$C$6,"Fail"),"blank",IF(B2="Fail","Issue","blank"))
As required
If you add a new column with the formula:
=IF(B2="Fail",IF(COUNTIFS(A:A,A2,C:C,"fail")=0,"issue",""),"")
Then this should work under these assumptions:
For each item, if one of the Overall values is "Fail", they are all "Fail"
The only two possible values in columns B and C are "Pass" and "Fail"
If you require the word blank instead of a blank cell then use:
=IF(B2="Fail",IF(COUNTIFS(A:A,A2,C:C,"fail")=0,"issue","blank"),"blank")
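To check the rule outside the spreadsheet, here is a minimal Python sketch of the same logic applied to the sample data from the question. The function name flag_issues and the tuple layout are assumptions made for illustration; the per-item count of individual failures plays the role of the COUNTIFS call, and the comparison is case-insensitive as it is in Excel.

```python
from collections import defaultdict

# Sample rows from the question: (item, overall, individual)
rows = [
    ("A", "Fail", "Pass"),
    ("A", "Fail", "Fail"),
    ("B", "Fail", "Pass"),
    ("B", "Fail", "Pass"),
    ("C", "Pass", "Pass"),
]


def flag_issues(rows):
    """Return "issue" for rows whose overall result is Fail while no
    individual result for the same item is Fail; otherwise "blank"."""
    # Counterpart of COUNTIFS(A:A, A2, C:C, "fail"): individual fails per item.
    individual_fails = defaultdict(int)
    for item, _overall, individual in rows:
        if individual.lower() == "fail":
            individual_fails[item] += 1
    return [
        "issue"
        if overall.lower() == "fail" and individual_fails[item] == 0
        else "blank"
        for item, overall, _individual in rows
    ]


new_column = flag_issues(rows)
```

Item A has an individual failure, so its rows stay blank; item B fails overall with no individual failure, so both of its rows are flagged.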

Loading files with dynamically generated columns

I need to create an SSIS project that loads daily batches of 150 files into a SQL Server database. Each batch always contains the same 150 files, and each file in the batch has a unique name. Each file can be either a full or an incremental type. Incremental files have one more column than full files. Each batch contains a control file that states whether a file is full or incremental. See the example files below:
Full File
| SID | Name | DateOfBirth |
|:---: | :----: | :-----------: |
| 1 | Samuel | 20/05/1964 |
| 2 | Dave | 06/03/1986 |
| 3 | John | 15/09/2001 |
Incremental File
| SID | Name | DateOfBirth | DeleteRow |
|:---: | :----: | :-----------: | :----------: |
| 2 | | | 1 |
| 4 | Abil | 19/11/1993 | 0 |
| 5 | Zainab | 26/02/2006 | 0 |
I want to avoid creating 2 packages (full and incremental) for each file.
Is there a way to dynamically generate the column list in each source/destination component based on the file type in the control file? For example, when the file type is incremental, the column list should include the extra column (DeleteRow).
Let's assume my ControlFile.xlsx is:
Col1 Col2
File1.xlsx Full
file2.xlsx Incremental
Flow:
1. Create a Data Flow Task (DFT) that captures ControlFile.xlsx in an object variable. Source: control file connection. Destination: Recordset Destination.
2. Pass this object variable to a Foreach Loop. The loop's result set variable should capture Col2 of ControlFile.xlsx.
3. Create a Sequence Container just as a start point. Add two DFTs, one for the full load and one for the incremental load. Use precedence constraints (see Step 3) to decide which DFT runs.
4. Inside each DFT, use an Excel Source feeding an OLE DB Destination.
5. Use the FilePath variable for the connection property of the full-load and incremental Excel connections to make them dynamic.
Step 1: [image: overall package layout]
Step 2: In the "read control file" DFT, read FlowControl.xlsx into a Recordset Destination, which stores it in the RecordOutput variable.
Step 3: Your precedence constraints should test for "Full" for the full load and "Incremental" for the incremental load. [image: precedence constraints]
Use the source and destination connections as shown in the first image. It's a bit hard to explain all the steps, but the flow is simple.
One thing to note: the incremental files have an additional column, so you'll need a Derived Column in your full load for the mapping to be correct.
Also, make sure the DelayValidation property is set to true.
The Foreach Loop container uses the Foreach ADO Enumerator. [images: Foreach Loop container properties]
I can think of two solutions.
1) Have a script task at the beginning of the package that looks to see if this is an incremental load or a full load. If it is a full load, have it loop through all the files and add a "DeleteRow" column with all zeros to every file. Then you can use the same column list.
2) Use BiML to dynamically generate your package at run time based on the available metadata.
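The first option (padding full files with a "DeleteRow" column of zeros) can be sketched in Python as follows. This is only an illustration of the transformation, not an SSIS script task: it assumes the files are comma-separated text, and the function name add_delete_row_column is made up for the example.

```python
import csv
import io


def add_delete_row_column(full_file_text):
    """Append a DeleteRow column filled with 0 to the text of a full file,
    so it has the same column list as an incremental file."""
    reader = csv.reader(io.StringIO(full_file_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["DeleteRow"])
    for row in reader:
        # Rows in a full file are never deletions, so the flag is always 0.
        writer.writerow(row + ["0"])
    return out.getvalue()


# Full file from the question, assumed here to be stored as CSV
full_file = (
    "SID,Name,DateOfBirth\n"
    "1,Samuel,20/05/1964\n"
    "2,Dave,06/03/1986\n"
    "3,John,15/09/2001\n"
)
normalized = add_delete_row_column(full_file)
```

After this step every file in the batch has the four columns SID, Name, DateOfBirth and DeleteRow, so a single column mapping can serve both file types.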

Value to table header in Pentaho

Hi, I'm quite new to Pentaho Spoon and I have a problem.
I have a table like this:
model | type | color | q
1 | 1 | blue | 1
1 | 2 | blue | 2
1 | 1 | red | 1
1 | 2 | red | 3
2 | 1 | blue | 4
2 | 2 | blue | 5
I would like to create one table per model (to export to CSV or Excel), grouped by type, with the color values as column headers and the q values as cell values:
table-1.csv
type | blue | red
1 | 1 | 1
2 | 2 | 3
table-2.csv
type | blue
1 | 4
2 | 5
I tried the Row denormalizer step but got nowhere.
Any suggestions?
Typically it's helpful to see what you have done in order to offer help, but I know how counterintuitive the "help" on this step is.
Make sure you sort the rows on Model and Type before sending them to the denormalizer step. Then give this a try:
As for splitting the output into files, there are a few ways to handle that. Take a look at the Switch/Case step using the Model field.
Also, if you haven't found them already, take a look at the sample files that come with the PDI download. They should be in ...pdi-ce-6.1.0.1-196\data-integration\samples. They can be more helpful than the online documentation sometimes.
The Row denormalizer can't be used here if the number of colors is unknown. Also, you can't define text output fields dynamically.
There are a few ways I can see to do this without using Java or JavaScript steps. One of them is based on the following idea: we can prepare rows with two columns:
Row Model
type|blue|red 1
1|1|1 1
2|2|3 1
type|blue 2
1|4 2
2|5 2
Then we can prepare a filename for each row using the Model field and output all rows with a Text file output step that takes the file name from the filename field. In this case all records will be exported into two files without additional effort.
Here you can find sample transformation: copy-paste me into new transformation
Please note that it's a sample solution that works only with csv. Also it works only if you have the same number of colors for each type inside model. It's just a hint how to use spoon, it's not a complete solution.
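To make the intended result concrete, here is a small Python sketch that builds the same "header row plus data rows per model" text outside of Spoon. It is not a PDI transformation; the function name pivot_by_model and the in-memory rows are assumptions for illustration. Unlike the sample transformation, it leaves a cell empty when a color is missing for a type.

```python
from collections import defaultdict

# Rows from the question: (model, type, color, q)
rows = [
    (1, 1, "blue", 1),
    (1, 2, "blue", 2),
    (1, 1, "red", 1),
    (1, 2, "red", 3),
    (2, 1, "blue", 4),
    (2, 2, "blue", 5),
]


def pivot_by_model(rows):
    """Return {file name: file text}, one pipe-delimited table per model,
    with a 'type' column followed by one column per color."""
    colors = defaultdict(list)  # model -> colors in first-seen order
    types = defaultdict(list)   # model -> types in first-seen order
    cells = defaultdict(dict)   # (model, type) -> {color: q}
    for model, type_, color, q in rows:
        if color not in colors[model]:
            colors[model].append(color)
        if type_ not in types[model]:
            types[model].append(type_)
        cells[(model, type_)][color] = q

    tables = {}
    for model in colors:
        lines = ["|".join(["type"] + colors[model])]
        for type_ in types[model]:
            values = [str(cells[(model, type_)].get(c, "")) for c in colors[model]]
            lines.append("|".join([str(type_)] + values))
        tables["table-%d.csv" % model] = "\n".join(lines) + "\n"
    return tables


tables = pivot_by_model(rows)
```

Each key in tables corresponds to the filename field described above, and each value holds the rows that would be written to that file.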

Generate automatically all the variables and value labels in SPSS

I have the variable labels and value labels in a table in my database, like this
id_variable_label | variable_label | id_value_label | value_label | id_father_label
---------------------------------------------------------------------------------------------------------
1 | father_label | null | null | null
null | father_label | 1 | child01 | 1
null | father_label | 2 | child02 | 1
Is there a way to automatically generate all the variable and value labels when I import the data from my database through an ODBC connection?
There isn't a direct way to do this, but if you read that table as an SPSS dataset, it would be pretty simple to generate the labels with a little Python code.
Note also that if your labeling is static, you can use APPLY DICTIONARY to copy labels from one dataset to another, so saving one fully labeled file would allow you to propagate that to others that are similarly structured.
You can use SPSS syntax to create variable and value labels.
See the SPSS commands VARIABLE LABELS and VALUE LABELS.
Here's a tutorial that explains how you can use them.
You could generate the syntax from your database.
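As a rough sketch of generating that syntax, the Python below turns rows shaped like the table in the question into VARIABLE LABELS and VALUE LABELS commands. The label table does not say which SPSS variable each label belongs to, so the variable_names mapping (and the name v1) is a hypothetical stand-in that you would have to supply; the function names are also made up for the example.

```python
# Rows as they might be read from the label table:
# (id_variable_label, variable_label, id_value_label, value_label, id_father_label)
label_rows = [
    (1, "father_label", None, None, None),
    (None, "father_label", 1, "child01", 1),
    (None, "father_label", 2, "child02", 1),
]

# Hypothetical mapping from id_variable_label to the SPSS variable name
variable_names = {1: "v1"}


def quote(text):
    """Single-quote a label, doubling any embedded apostrophes."""
    return "'" + text.replace("'", "''") + "'"


def build_label_syntax(rows, variable_names):
    """Build SPSS syntax text for the variable and value labels in rows."""
    variable_labels = {}  # variable id -> label
    value_labels = {}     # variable id -> [(value, label), ...]
    for var_id, var_label, value_id, value_label, father_id in rows:
        if var_id is not None:
            variable_labels[var_id] = var_label
        elif father_id is not None:
            value_labels.setdefault(father_id, []).append((value_id, value_label))

    lines = []
    for var_id, label in variable_labels.items():
        lines.append("VARIABLE LABELS %s %s." % (variable_names[var_id], quote(label)))
    for father_id, pairs in value_labels.items():
        body = " ".join("%s %s" % (value, quote(label)) for value, label in pairs)
        lines.append("VALUE LABELS %s %s." % (variable_names[father_id], body))
    return "\n".join(lines)


syntax = build_label_syntax(label_rows, variable_names)
```

The resulting text can be saved as a syntax file and run after the data has been imported.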

SQL query engine for text files on Linux?

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.
Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?
David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.
An example on squeal's home page:
$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source |
--------+--------------------+
1633 |kernel |
1324 |NetworkManager |
98 |ntpd |
70 |avahi-daemon |
63 |dhclient |
48 |setroubleshoot |
39 |dnsmasq |
29 |nm-system-settings |
27 |bluetoothd |
14 |/usr/sbin/gpm |
13 |acpid |
10 |init |
9 |pcscd |
9 |pulseaudio |
6 |gnome-keyring-ask |
6 |gnome-keyring-daemon|
6 |gnome-session |
6 |rsyslogd |
5 |rpc.statd |
4 |vpnc |
3 |gdm-session-worker |
2 |auditd |
2 |console-kit-daemon |
2 |libvirtd |
2 |rpcbind |
1 |nm-dispatcher.action|
1 |restorecond |
q - Run SQL directly on CSV or TSV files:
https://github.com/harelba/q
Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.
I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.
Sample invocation:
$ sql.py --file foo --sql "select count(*) from data"
The code:
#!/usr/bin/env python3
"""Run a SQL statement on a text file"""

import getopt
import os
import sys
import tempfile


class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg


def runCmd(cmd):
    if os.system(cmd):
        print("Error running " + cmd)
        sys.exit(1)
        # TODO(dan): Return actual exit code


def usage():
    print("Usage: sql.py --file file --sql sql", file=sys.stderr)


def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error as msg:
            raise Usage(msg)
    except Usage as err:
        print(err.msg, file=sys.stderr)
        print("for help use --help", file=sys.stderr)
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o == "--file":
            filename = a
        elif o == "--sql":
            sql = a
        else:
            print("Found unexpected option " + o)

    if not filename:
        print("Must give --file", file=sys.stderr)
        sys.exit(1)
    if not sql:
        print("Must give --sql", file=sys.stderr)
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header. If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    datafile = tempfile.NamedTemporaryFile(mode="w")
    with open(filename) as infile:
        headers = infile.readline().rstrip().split()
        for line in infile:
            datafile.write(line)
    datafile.flush()
    #print(datafile.name)
    #runCmd("cat %s" % datafile.name)

    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        "`%s` NUMERIC" % x for x in headers) + ");"

    # Write the sqlite3 commands to a second temporary file and pipe it in.
    cmdfile = tempfile.NamedTemporaryFile(mode="w")
    #print(cmdfile.name)
    print(create_statement, file=cmdfile)
    print(".separator ' '", file=cmdfile)
    print(".import '" + datafile.name + "' data", file=cmdfile)
    print(sql + ";", file=cmdfile)
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)


if __name__ == "__main__":
    sys.exit(main())
Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?
Depending on the amount of data, performance could be acceptable.
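A minimal sketch of that idea, using Python's standard sqlite3 and csv modules, might look like the following. It assumes the first line of the input holds the column names and that the table is always called data; the function name query_delimited_text and the sample text are invented for the example.

```python
import csv
import io
import sqlite3


def query_delimited_text(text, sql, delimiter=","):
    """Load delimited text (first line = column names) into an in-memory
    SQLite table named 'data' and return the rows produced by sql."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    headers = next(reader)
    conn = sqlite3.connect(":memory:")
    # NUMERIC affinity lets values that look like numbers behave as numbers.
    columns = ", ".join('"%s" NUMERIC' % h.replace('"', '""') for h in headers)
    conn.execute("CREATE TABLE data (%s)" % columns)
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany("INSERT INTO data VALUES (%s)" % placeholders, reader)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


sample = "name,qty\napple,3\npear,5\napple,4\n"
totals = query_delimited_text(
    sample, "SELECT name, SUM(qty) FROM data GROUP BY name ORDER BY name"
)
```

To query a real file, you would read its contents (or stdin) into text first. Because the database lives only in memory, nothing needs to be declared or cleaned up between ad hoc queries.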
MySQL has a CSV storage engine that might do what you need, if your files are CSV files.
Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport, which figures out columns etc. and creates the necessary table.
You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.
That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?
I have used Microsoft LogParser to query CSV files several times, and it serves the purpose. It was surprising to see such a useful tool from Microsoft, and free at that!