How to merge multiple text files into one new CSV file with column headers using Perl - perl-data-structures

I am new to Perl and need some guidance. I have multiple text files and want to merge all of them into a new CSV file. Then, from the CSV file, I want to split the string into multiple columns as shown in the "Output" format below. Can someone please help me?
Text File#1.txt
Name:A
Test1:80
Test2:60
Test3:50
Text File#2.txt
Name:B
Test1:85
Test2:78
Test3:60
Output (format #1):
New Text File#3.csv
Name Test1 Test2 Test3
A 80 60 50
B 85 78 60
Output (format #2):
New Text File#3.csv
Name Test Data
A 1 80
A 2 60
A 3 50
B 1 85
B 2 78
B 3 60

reading files:
open my $fh, '<', 'filename.txt' or die $!;
# create hash
my %hash;
# read file - you have to do this for all files
my $name;
while (<$fh>) {
    chomp;
    my ($key, $value) = split ':';
    if ($key eq 'Name') {        # first row carries the name
        $name = $value;
    } else {
        push( @{ $hash{$name} }, $value );
    }
}
close $fh;
at this point you have a hash:
A -> [80,60,50]
B -> [85,78,60]
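If you want to double-check that structure while debugging, the core Data::Dumper module will print it for you (a quick sketch, assuming the %hash built above):
use Data::Dumper;
print Dumper( \%hash );
# prints something like:
# $VAR1 = {
#           'A' => [ '80', '60', '50' ],
#           'B' => [ '85', '78', '60' ]
#         };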
you can now use this hash to print your csv file:
for file 1:
open my $csv1, '>', '1.csv' or die $!;
foreach my $name ( keys %hash ) {
    my $points = $hash{$name};
    print {$csv1} join( ';', $name, @{$points} ), "\n";
}
close $csv1;
for file 2:
open my $csv2, '>', '2.csv' or die $!;
foreach my $name ( keys %hash ) {
    my $testnumber = 0;
    foreach my $point ( @{ $hash{$name} } ) {
        $testnumber++;
        print {$csv2} join( ';', $name, $testnumber, $point ), "\n";
    }
}
close $csv2;
hope this helps you out, if anything is not clear, you can ask.
Do not copy-paste it blindly; it may contain minor errors, but I assume the way of thinking is clear.
Update: the ";" is used because CSV splits columns on that character.
Please give feedback.
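For completeness, here is a minimal, self-contained Perl sketch of output format #1. It assumes the exact Name:/TestN: layout shown in the question and exactly three tests per file; the file names and the ';' separator are placeholders you would adjust:
#!/usr/bin/env perl
use strict;
use warnings;

my @files = ('File#1.txt', 'File#2.txt');   # adjust to your real file names
my %scores;                                 # name => [ test1, test2, ... ]
my @names;                                  # remember the order the names were seen in

for my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $name;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /:/, $line, 2;
        if ($key eq 'Name') {
            $name = $value;
            push @names, $name;
        }
        else {
            push @{ $scores{$name} }, $value;   # Test1, Test2, Test3, ...
        }
    }
    close $fh;
}

open my $out, '>', 'File#3.csv' or die "Cannot open output: $!";
print {$out} join(';', 'Name', 'Test1', 'Test2', 'Test3'), "\n";
for my $name (@names) {
    print {$out} join(';', $name, @{ $scores{$name} }), "\n";
}
close $out;
Format #2 follows the same idea: loop over @{ $scores{$name} } with a counter and print one Name;TestNumber;Data row per score.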

Related

Using PowerShell, how can a SQL file be split into several files based on its content?

I'm trying to use PowerShell to automate the division of a SQL file into separate files based on where the headings are located.
An example SQL file to be split is below:
/****************************************
Section 1
****************************************/
Select 1
/****************************************
Section 2
****************************************/
Select 2
/****************************************
Section 3
****************************************/
Select 3
I want the new files to be named as per the section headings in the file, i.e. 'Section 1', 'Section 2' and 'Section 3'. The content of the first file should be as follows:
/****************************************
Section 1
****************************************/
Select 1
The string: /**************************************** is only used in the SQL file for the section headings and therefore can be used to identify the start of a section. The file name will always be the text on the line directly below.
You can try it like this (the split here is based on the empty lines between sections):
#create an index for our output files
$fileIndex = 1
#load SQL file contents in an array
$sqlite = Get-Content "G:\input\sqlite.txt"
#for each line of the SQL file
$sqlite | % {
if($_ -eq "") {
#if the line is empty, increment output file index to create a new file
$fileIndex++
} else {
#if the line is not empty
#build output path
$outFile = "G:\output\section$fileIndex.txt"
#push line to the current output file (appending to existing contents)
$_ | Out-File $outFile -Append
}
}
#load generated files in an array
$tempfiles = Get-ChildItem "G:\output"
#for each file
$tempfiles | % {
#load file contents in an array
$data = Get-Content $_.FullName
#rename file after second line contents
Rename-Item $_.FullName "$($data[1]).txt"
}
The below code uses the heading names found within the comment blocks. It also splits the SQL file into several SQL files based on the location of the comment blocks.
#load SQL file contents in an array
$SQL = Get-Content "U:\Test\FileToSplit.sql"
$OutputPath = "U:\TestOutput"
#find first section name and count number of sections
$sectioncounter = 0
$checkcounter = 0
$filenames = @()
$SQL | % {
#Add file name to array if new section was found on the previous line
If ($checkcounter -lt $sectioncounter)
{
$filenames += $_
$checkcounter = $sectioncounter
}
Else
{
If ($_.StartsWith("/*"))
{
$sectioncounter += 1
}
}
}
#return if too many sections were found
If ($sectioncounter -gt 50) { return "Too many sections found" }
$sectioncounter = 0
$endcommentcounter = 0
#for each line of the SQL file (Ref: sodawillow)
$SQL | % {
#if a new comment block is found, point to the next section name, unless it's the start of the first section
If ($_.StartsWith("/*") -And ($endcommentcounter -gt 0))
{
$sectioncounter += 1
}
If ($_.EndsWith("*/"))
{
$endcommentcounter += 1
}
#build output path
$tempfilename = $filenames[$sectioncounter]
$outFile = "$OutputPath\$tempfilename.sql"
#push line to the current output file (appending to existing contents)
$_ | Out-File $outFile -Append
}
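Since the parent thread is Perl-tagged, here is a rough Perl sketch of the same split for comparison. It relies only on the /**** marker rule described in the question (the line below a marker is the section name); the input path is a placeholder and the sketch is untested against edge cases:
#!/usr/bin/env perl
use strict;
use warnings;

my $input = 'FileToSplit.sql';            # placeholder path
open my $in, '<', $input or die "Cannot open $input: $!";
my @lines = <$in>;
close $in;

my $out;                                  # current output file handle
for my $i (0 .. $#lines) {
    # a line starting with /**** opens a new section; the section name is on the next line
    if ($lines[$i] =~ m{^/\*{4,}}) {
        (my $name = $lines[$i + 1] // 'unnamed') =~ s/^\s+|\s+$//g;
        close $out if $out;
        open $out, '>', "$name.sql" or die "Cannot open $name.sql: $!";
    }
    print {$out} $lines[$i] if $out;
}
close $out if $out;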

Split large CSV file based on column value cardinality

I have a large CSV file with following line format:
c1,c2
I would like to split the original file in two files as follows:
One file will contain the lines where the value of c1 appears exactly once in the file.
Another file will contain the lines where the value of c1 appears twice or more in the file.
Any idea how it can be done?
For example, if the original file is:
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
I would like to produce the following files:
3,foo
4,bar
and
1,foo
2,bar
2,foo
1,bar
This one-liner generates two files, o1.csv and o2.csv:
awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' file file
test:
kent$ cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
kent$ awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' f f
kent$ head o*
==> o1.csv <==
3,foo
4,bar
==> o2.csv <==
1,foo
2,bar
2,foo
1,bar
Note
awk reads your file twice, instead of saving whole file in memory
the order of the file is retained
Depending on what you mean by big, this might work for you.
It has to hold the lines in the associative array done until it sees a
2nd use, or until the end of the file. When a 2nd use is seen, the remembered
data is changed to "!" to avoid printing it again on a 3rd or later match.
>file2
awk -F, '
{ if(done[$1]!=""){
if(done[$1]!="!"){
print done[$1]
done[$1] = "!"
}
print
}else{
done[$1] = $0
order[++n] = $1
}
}
END{
for(i=1;i<=n;i++){
out = done[order[i]]
if(out!="!")print out >>"file2"
}
}
' <csvfile >file1
I'd break out Perl for this job
#!/usr/bin/env perl
use strict;
use warnings;
my %count_of;
my @lines;
open ( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file
while ( <$input> ) {
my ( $c1, $c2 ) = split /,/;
$count_of{$c1}++;
push( @lines, [ $c1, $c2 ] );
}
close ( $input );
print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_->[0]} < 2 } @lines ) {
print join( ",", @$pair );
}
print "File 2:\n";
#filter any repeats.
foreach my $pair ( grep { $count_of{$_->[0]} > 1 } @lines ) {
print join( ",", @$pair );
}
This will hold the whole file in memory, but given your data you don't save much space by double-processing it and maintaining a count.
However, you could do:
#!/usr/bin/env perl
use strict;
use warnings;
my %count_of;
open( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file counting "c1"
while (<$input>) {
my ( $c1, $c2 ) = split /,/;
$count_of{$c1}++;
}
open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe, '>', "output_dupes.csv" ) or die $!;
seek( $input, 0, 0 );
while ( my $line = <$input> ) {
my ($c1) = split( ",", $line );
if ( $count_of{$c1} > 1 ) {
print {$output_dupe} $line;
}
else {
print {$output_single} $line;
}
}
close($input);
close($output_single);
close($output_dupe);
This minimises memory use by retaining only the counts: it reads the file once to count the c1 values, then processes it a second time and prints the lines to different outputs.

eliminating all values that occur in all files in folder with awk

I have a folder with several files, and I want to eliminate all of the terms that they have in common using awk.
Here is the script that I have been using:
awk '
FNR==1 {
if (seen[FILENAME]++) {
firstPass = 0
outfile = FILENAME "_new"
}
else {
firstPass = 1
numFiles++
ARGV[ARGC++] = FILENAME
}
}
firstPass { count[$2]++; next }
count[$2] != numFiles { print > outfile }
' *
An example of the information in the files would be:
File1
3 coffee
4 and
8 milk
File2
4 dog
2 and
9 cat
The output should be:
File1_new
3 coffee
8 milk
File2_new
4 dog
9 cat
It works when I use a small number of files (e.g. 10), but when I start to increase that number, I get the following error message:
awk: file20_new makes too many open files input record number 27, file file20_new source line number 14
Where is the error coming from when I use larger numbers of files?
My main goal is to run this script over all of the files in a folder to generate new files that do not contain any words that occur in all of the files in the folder.
When you use >, awk opens the file for writing (truncating it) and keeps it open for the rest of the run, so each _new file adds another open file descriptor. As suggested in the comments, you need to close your files as you go along. Try something like this:
awk '
FNR==1 {
if (seen[FILENAME]++) {
firstPass = 0
if (outfile) close(outfile) # <-- close the previous file
outfile = FILENAME "_new"
}
else {
firstPass = 1
numFiles++
ARGV[ARGC++] = FILENAME
}
}
firstPass { count[$2]++; next }
count[$2] != numFiles { print > outfile }
' *
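If awk's open-file limit keeps getting in the way, a Perl take on the same two-pass idea is sketched below: it first counts in how many files each word appears, then rewrites each file as FILENAME_new without the words that occur in every file. It is an untested sketch and assumes the two-column layout shown above:
#!/usr/bin/env perl
use strict;
use warnings;

my @files = grep { -f $_ } glob('*');  # the data files in the current folder
my %seen_in;                           # word => number of files containing it

# pass 1: count, per word, how many files it appears in
for my $file (@files) {
    my %in_this_file;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        my (undef, $word) = split;
        next unless defined $word;
        $seen_in{$word}++ unless $in_this_file{$word}++;
    }
    close $fh;
}

# pass 2: rewrite each file, dropping words that occur in all files
my $total = @files;
for my $file (@files) {
    open my $in,  '<', $file         or die $!;
    open my $out, '>', "${file}_new" or die $!;
    while (<$in>) {
        my (undef, $word) = split;
        next unless defined $word;
        print {$out} $_ unless $seen_in{$word} == $total;
    }
    close $in;
    close $out;
}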

Replace text from select statement from one file to another

I have a bunch of views for my database and need to update the select statements within each view.
I have all the select statements in files called viewname.txt in one dir, and in a sub dir called sql I have all the views as viewname.sql. I want to run a script to take the text from viewname.txt and replace the select statement in the correct viewname.sql in the sql sub dir.
I have tried this to append the text after the SELECT in each .sql file:
for i in */*; do
if ["../$(basename "${i}")" == "$(basename "${i}")"]
then
sed '/SELECT/a "$(basename "$i" .txt)"' "$(basename "$i" .sql)"
fi
done
Any assistance is greatly appreciated!
Dickie
This is an awk answer that's close - the output is placed in the sql directory under corresponding "viewname.sql.new" files.
#!/usr/bin/awk -f
# absorb the whole viewname.txt file into arr when the first line is read
FILENAME ~ /\.txt$/ && FILENAME != last_filename {
last_filename = FILENAME
# get the viewname part of the file name
split( FILENAME, file_arr, "." )
while( getline file_data <FILENAME > 0 ) {
old_data = arr[ file_arr[ 1 ] ]
arr[ file_arr[ 1 ] ] = \
old_data (old_data == "" ? "" : "\n") file_data
}
next
}
# process each line of the sql/viewname.sql files
FILENAME ~ /\.sql$/ {
# strip the "sql/" prefix from FILENAME for lookup in arr
split( substr( FILENAME, 5 ), file_arr, "." )
if( file_arr[ 1 ] in arr ) {
if( $0 ~ /SELECT/ )
print arr[ file_arr[ 1 ] ] > FILENAME ".new"
else
print $0 > FILENAME ".new"
}
}
I put this into a file called awko, made it executable with chmod +x, and ran it like the following:
awko *.txt sql/*
You'll have to mv the new files into place, but it's as close as I can get right now.
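For comparison with the thread's main language, here is a hedged Perl sketch of the same idea. Like the awk version it replaces the line containing SELECT in each sql/viewname.sql with the contents of the matching viewname.txt and writes sql/viewname.sql.new; the directory layout is assumed to be exactly as described in the question:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(basename);

for my $txt (glob('*.txt')) {
    my $view = basename($txt, '.txt');
    my $sql  = "sql/$view.sql";
    next unless -e $sql;

    # slurp the replacement select statement
    open my $tfh, '<', $txt or die "Cannot open $txt: $!";
    my $select = do { local $/; <$tfh> };
    close $tfh;

    # copy the view, swapping the SELECT line for the .txt contents
    open my $in,  '<', $sql       or die "Cannot open $sql: $!";
    open my $out, '>', "$sql.new" or die "Cannot open $sql.new: $!";
    while (my $line = <$in>) {
        if ($line =~ /SELECT/) {
            print {$out} $select;
        }
        else {
            print {$out} $line;
        }
    }
    close $in;
    close $out;
}
As with the awk version, the .new files still need to be moved into place afterwards.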

calculating the mean of columns in text files

I have two folders named f1 and f2. These folders contain 300 text files with 2 columns. The content of the files is shown below. I would like to calculate the mean of the second column. The file names are the same in both folders.
file1 in f1 folder
54 6
55 10
57 5
file2 in f1 folder
24 8
28 12
file1 in f2 folder
34 3
22 8
file2 in f2 folder
24 8
28 13
output
folder1 folder2
file1 21/3= 7 11/2=5.5
file2 20/2=10 21/2=10.5
-- -- --
-- -- --
file300 -- --
total mean of folder1 = sum of the means/300
total mean of folder2 = sum of the means/300
I'd do it with two awk scripts. (Originally, I had a sort phase in the middle, but that isn't actually necessary. However, I think two scripts are probably easier than trying to combine them into one. If someone else does it 'all in one' and it is comprehensible, then choose their solution instead.)
Sample run and output
This is based on the 4 files shown in the question. The names of the files are listed on the command line, but the order doesn't matter. The code assumes that there is only one slash in each file name, and no spaces or the like in the file names.
$ awk -f summary1.awk f?/* | awk -f summary2.awk
file1 21/3 = 7.000 11/2 = 5.500
file2 20/2 = 10.000 21/2 = 10.500
total mean of f1 = 17/2 = 8.500
total mean of f2 = 16/2 = 8.000
summary1.awk
function print_data(file, sum, count) {
sub("/", " ", file);
print file, sum, count;
}
oldfile != FILENAME { if (count > 0) { print_data(oldfile, sum, count); }
count = 0; sum = 0; oldfile = FILENAME
}
{ count++; sum += $2 }
END { print_data(oldfile, sum, count) }
This processes each file in turn, summing the values in column 2 and counting the number of lines. It prints out the folder name, the file name, the sum and the count.
summary2.awk
{
sum[$2,$1] = $3
cnt[$2,$1] = $4
if (file[$2]++ == 0) file_list[n1++] = $2
if (fold[$1]++ == 0) fold_list[n2++] = $1
}
END { for (i = 0; i < n1; i++)
{
printf("%-20s", file_list[i])
name = file_list[i]
for (j = 0; j < n2; j++)
{
folder = fold_list[j]
s = sum[name,folder]
n = cnt[name,folder]
a = (s + 0.0) / n
printf(" %6d/%-3d = %10.3f", s, n, a)
gsum[folder] += a
}
printf("\n")
}
for (i = 0; i < n2; i++)
{
folder = fold_list[i]
s = gsum[folder]
n = n1;
a = (s + 0.0) / n
printf("total mean of %-6s = %6d/%-3d = %10.3f\n", folder, s, n, a)
}
}
The file associative array tracks references to file names. The file_list array keeps the file names in the order that they're read. Similarly, the fold associative array tracks the folder names, and the fold_list array keeps track of the folder names in the order that they appear. If you do something weird enough with the order that you supply the names to the first command, you may need to insert a sort command between the two awk commands, such as sort -k2,2 -k1,1.
The sum associative array contains the sum for a given file name and folder name. The cnt associative array contains the count for a given file name and folder name.
The END section of the report has two main loops (though the first loop contains a nested loop). The first main loop processes the files in the order presented, generating one line containing one entry for each folder. It also accumulates the averages for the folder name.
The second main loop generates the 'total mean' data for each folder. I'm not sure whether the statistics make sense (shouldn't the overall mean for folder1 be the sum of the values in folder1 divided by the number of entries, or 41/5 = 8.2, rather than 17/2 = 8.5?), but the calculation does what I think the question asks for (sum of means / number of files, written as 300 in the question).
With some help from grep:
grep '[0-9]' folder[12]/* | awk '
{
split($0,b,":");
f=b[1]; split(f,c,"/"); d=c[1]; f=c[2];
s[f][d]+=$2; if (n[f][d]++ == 0) nn[d]++;}  # count files per folder, not lines
END{
for (f in s) {
printf("%-10s", f);
for (d in s[f]) {
a=s[f][d] / n[f][d];
printf(" %6.2f ", a);
p[d] += a;
}
printf("\n");
}
for (d in p) {
printf("total mean %-8s = %8.2f\n", d, p[d]/nn[d]);
}
}'
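And, since the parent thread is about Perl, here is a rough Perl sketch of the same report. The folder names and the two-column layout are assumed to match the question; it is a sketch, not a drop-in replacement for the awk versions above:
#!/usr/bin/env perl
use strict;
use warnings;

my @folders = ('f1', 'f2');
my (%mean, %files);                 # $mean{file}{folder}; %files is the set of file names

for my $folder (@folders) {
    for my $path (glob("$folder/*")) {
        (my $file = $path) =~ s{^\Q$folder\E/}{};
        my ($sum, $count) = (0, 0);
        open my $fh, '<', $path or die "Cannot open $path: $!";
        while (<$fh>) {
            my (undef, $value) = split;
            next unless defined $value;
            $sum += $value;
            $count++;
        }
        close $fh;
        $mean{$file}{$folder} = $count ? $sum / $count : 0;
        $files{$file} = 1;
    }
}

# per-file means: one row per file, one column per folder
for my $file (sort keys %files) {
    printf "%-10s", $file;
    for my $folder (@folders) {
        printf " %8.3f", $mean{$file}{$folder} // 0;
    }
    print "\n";
}

# total mean of each folder = sum of the per-file means / number of files
for my $folder (@folders) {
    my $total = 0;
    $total += $mean{$_}{$folder} // 0 for keys %files;
    printf "total mean of %s = %.3f\n", $folder, $total / keys %files;
}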