How to split large text files into smaller text files using VBA?

I have a database text file.
It is a large text file, about 387,480 KB (roughly 378 MB). This file contains table names, table headers, and values. I need to split this file into multiple files, each containing one table's creation and insertion statements, with the table name as the file name.
Can anyone help?

I don't see how Excel will open a ~378 MB file. You could try to load it into Access and do the split using VBA. However, importing a file that large may bloat the database enough to push Access to its 2 GB size limit, and then it's all over. SQL Server would handle this kind of job. Alternatively, you could use Python or R to do the work for you.
### Python:
import pandas as pd

# chunksize is the number of rows per chunk; pick something much
# larger than 3 for a file this size
for i, chunk in enumerate(pd.read_csv('C:/your_path/main.csv', chunksize=3)):
    chunk.to_csv('chunk{}.csv'.format(i))
### R
setwd("C:/your_path/")
mydata = read.csv("annualsinglefile.csv")
# For 5 chunks of 30 rows each, assigned at random:
# chunks <- split(mydata, sample(rep(1:5, 30)))
# For the first 100000 rows, index the rows directly. (Wrapping the
# result in sample() would shuffle the columns, not the rows.)
First_chunk <- mydata[1:100000, ]
# You can index any row range, even in reverse, e.g. rows 100 down to 70:
# Second_chunk <- mydata[100:70, ]
# To write these chunks out to CSV files (write.csv always includes
# column names, so col.names is not needed):
write.csv(First_chunk, file = "First_chunk.csv", quote = FALSE, row.names = FALSE)
# write.csv(Second_chunk, file = "Second_chunk.csv", quote = FALSE, row.names = FALSE)
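Note that neither chunking example splits on table boundaries, which is what the question actually asks for. Below is a minimal Python sketch of that idea, assuming each table section in the dump begins with a CREATE TABLE statement; the file path and the regex are assumptions, so adjust them to the real dump format:

import re

# Assumed marker for the start of each table's section
table_re = re.compile(r'CREATE TABLE\s+`?(\w+)`?', re.IGNORECASE)

out = None
with open('C:/your_path/dump.sql', encoding='utf-8') as src:
    for line in src:
        match = table_re.search(line)
        if match:
            # New table section: close the previous output file and
            # open a new one named after the table
            if out:
                out.close()
            out = open(match.group(1) + '.sql', 'w', encoding='utf-8')
        if out:
            out.write(line)
if out:
    out.close()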

Related

Create a concatenated csv file from looping through columns of data

I'm trying to iterate through a CSV starting at column 17 and have the results stack on top of each other, similar to what 'pd.concat([df1, df2], axis=0)' does. My code iterates through all the columns; I know why, but I'm not sure how to fix it, and when I export to CSV the spreadsheet is jumbled or doesn't display correctly. I'd appreciate any help.
import pandas as pd

em_list = []
for k in range(len(ad)):  # starts at column 0, not column 17
    n = ad.iloc[:, k]
    counts = n.value_counts()
    perc_str3 = n.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
    df4 = pd.DataFrame(perc_str3)
    em_list.append(df4)
df = pd.concat(em_list)
df1 = pd.DataFrame.from_records(df)
df1.to_csv('output.csv')
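No answer is attached to this question here, but a minimal sketch of the likely fix follows: start the loop at index 16 (the 17th column, zero-based) and let pd.concat stack the per-column results along axis=0. The DataFrame name ad is carried over from the question and the file paths are assumptions:

import pandas as pd

ad = pd.read_csv('input.csv')  # path assumed

em_list = []
for k in range(16, ad.shape[1]):  # index 16 == column 17, zero-based
    n = ad.iloc[:, k]
    perc = n.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
    em_list.append(perc.to_frame(name='percent'))

# Stack the per-column results vertically, like pd.concat([df1, df2], axis=0)
result = pd.concat(em_list, axis=0)
result.to_csv('stacked_output.csv')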

Read a text file using Fortran, where the first 18 lines are strings and the rest of the lines are real numbers in a 1000*8 array?

I have a text file where the first few lines are text and the rest of the lines contain data as real numbers. I only need the array of real numbers to be stored in a new file. I read the total number of lines in the file (that output is correct), and I am now trying to read the real numbers starting from a particular line number, but I cannot work out how to read this data.
Below is part of one file. I also have many files like this to read.
AptitudeQT paperI: 12233105
Latitude : 30.00 S
Longitude: 46.45 E
Attemptone Time: 2017-03-30-09-03
End Time: 2017-03-30-14-55
Height(agl): m
Pressure: hPa
Temperature: deg C
Humidity: %RH
Uvelocity: cm/s
Vvelocity: cm/s
WindSpeed: cm/s
WindDirection: deg
---------------------------------------
10 1008.383 27.655 62.200 -718.801 -45.665 720.250 266.500
20 1007.175 27.407 62.950 -792.284 -18.481 792.500 268.800
There are many examples of how to skip/read lines like this. To sum it up, option A is to skip the header and read only the data:
! Skip the first 18 header lines
do i = 1, 18
   read (unit, *, iostat=stat) ! Dummy read
   if (stat /= 0) stop "error"
end do
! Read the 1000 rows of data (8 reals per row)
do i = 1, 1000
   read (unit, *, iostat=stat) data(:, i)
   if (stat /= 0) stop "error"
end do
If you have many files like this, I suggest wrapping this in a subroutine/function.
Option B is to use the unix tail utility to discard the header (more info here):
tail -n +19 file.txt
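As a cross-check outside Fortran, the same read fits in two lines of Python with numpy; this is just a sketch, not part of the original answer, and the file name and 18-line header count are taken from the question:

import numpy as np

# skiprows drops the 18 header lines; the rest parses as a 2-D float array
data = np.loadtxt('file.txt', skiprows=18)
print(data.shape)  # should be (1000, 8) for the files described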

Python - Compare two CSV files based on a column

I am trying to compare two CSV files; most of the time they will have the same data, but the order of the rows will not be the same. E.g.
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV File2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
So I want to use the third column as a primary key to compare the other values and report the differences. Is this possible in Robot Framework or pandas?
If you are making use of Robot Framework, you need to do the following:
install robotframework-csvlib
use the built-in Collections library
Input from your question
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
csv file2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
My Solution
In the approach below, we first read each CSV file into a list of lists, then compare them with the Collections keyword List Should Contain Sub List. Note the argument values=True, which compares the values as well.
Code that compares 2 csv files
*** Settings ***
Library    CSVLib
Library    Collections

*** Test Cases ***
Test CSV
    ${list1}=    read csv as list    csv1.csv
    log to console    ${list1}
    ${list2}=    read csv as list    csv2.csv
    log to console    ${list2}
    List Should Contain Sub List    ${list1}    ${list2}    values=True
OUTPUT
(rf1) C:\Users\kgurupra>robot s1.robot
==============================================================================
S1
==============================================================================
Test CSV .[['C1,C2,C3'], ['AAA,111,A1A1'], ['BBB,222,B2B2'], ['CCC,333,C3C3']]
..[['C1,C2,C3'], ['CCC,333,C3C3'], ['BBB,212,B2B2'], ['AAA,111,A1A1']]
Test CSV | FAIL |
Following values were not found from first list: ['BBB,212,B2B2']
------------------------------------------------------------------------------
S1 | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output: C:\Users\kgurupra\output.xml
Log: C:\Users\kgurupra\log.html
Report: C:\Users\kgurupra\report.html
Assuming you've imported your CSV files as pandas DataFrames, you can do the following to merge the two while retaining the differences:
df = csv1.merge(csv2, on='<insert name primary key column here>',how='outer')
Adding the suffixes option allows you to more clearly differentiate between identically named columns from each file:
df = csv1.merge(csv2, on='<insert name>',how='outer',suffixes=['_csv1','_csv2'])
After that it depends on what kind of differences you are looking to spot but perhaps a starting point is:
df['difference_1'] = df['column1_csv1'] == df['column1_csv2']
This will create a boolean column that is True where the observations match and False otherwise.
But there are nearly endless options for comparison.
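Pulling those pieces together, a minimal end-to-end sketch might look like the following; the file names, the invented column names, and which column gets compared are all assumptions, since the question's files have no header row:

import pandas as pd

# Column names are invented; the question's files have no header row
cols = ['name', 'value', 'key']
csv1 = pd.read_csv('file1.csv', header=None, names=cols)
csv2 = pd.read_csv('file2.csv', header=None, names=cols)

# Outer merge on the third column so rows missing from either file survive
df = csv1.merge(csv2, on='key', how='outer', suffixes=['_csv1', '_csv2'])

# Flag rows whose 'value' field differs between the two files
df['value_matches'] = df['value_csv1'] == df['value_csv2']
print(df[~df['value_matches']])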

Create a 350000 column csv file by merging smaller csv files

I have about 350,000 one-column CSV files, which are essentially 200-2000 numbers printed one under another. The numbers are formatted like this: -1.32% (no quotes). I want to merge the files to create a monster of a CSV file where each file is a separate column. The merged file will have at most 2000 rows (each column may have a different length) and 350,000 columns.
I thought of doing it with MySQL, but there is a 30,000-column limit. An awk or sed script could do the job, but I don't know them all that well and I am afraid it would take a very long time. I could use a server if the solution requires it. Any suggestions?
This Python script will do what you want:
#!/usr/bin/env python2
import sys
import codecs

# Open every file named on the command line
fhs = [codecs.open(filename, 'r', 'utf-8') for filename in sys.argv[1:]]

while True:
    lines = [fh.readline() for fh in fhs]
    # Stop once every file is exhausted
    if all(line == '' for line in lines):
        break
    # Files that ended early contribute empty cells
    sys.stdout.write(','.join(line.rstrip() for line in lines))
    sys.stdout.write('\n')

for fh in fhs:
    fh.close()
Call it with all the CSV files you want to merge and it will print the merged file to stdout.
Note that you can't merge all the files at once: for one thing, you can't pass 350,000 file names as arguments to a process, and for another, a process can typically only open about 1024 files at once.
So you'll have to do it in several passes, i.e. merge files 1-1000, then 1001-2000, and so on. Then you should be able to merge the 350 resulting intermediate files in one go.
Or you could write a wrapper script which uses os.listdir() to get the names of all the files and calls this script several times, as sketched below.
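A minimal sketch of such a wrapper, assuming the merge script above is saved as merge.py and the input files live in a csv_parts directory (both names are assumptions):

import os
import subprocess

SRC_DIR = 'csv_parts'   # directory holding the one-column files (assumed)
BATCH = 1000            # files per pass, safely under the open-file limit

files = sorted(os.listdir(SRC_DIR))
for i in range(0, len(files), BATCH):
    batch = [os.path.join(SRC_DIR, f) for f in files[i:i + BATCH]]
    out_name = 'intermediate_{}.csv'.format(i // BATCH)
    # Run the merge script on this batch, capturing its stdout in a file
    with open(out_name, 'w') as out:
        subprocess.check_call(['python2', 'merge.py'] + batch, stdout=out)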

3D Graph in Octave/Matlab from a CSV file

I'm new to Octave/Matlab and I want to plot a 3D-Graph.
I was able to do so using a predefined formula, like this:
x=1:.1:5;
y=1:.1:5;
[xx,yy] = meshgrid(x,y);
z = sin(xx)+sin(yy);
mesh(x,y,z);
But now the question is how to do the same when getting the data from a CSV file (for example). I know I can use the csvread function, but the big question is how to format the CSV to contain such data.
An example of doing the same graph above but this time grabbing the data from Excel/CSV would be appreciated. Thanks!
Done! I was finally able to do it!
Here's how I did it:
1) I created a file in Excel with the X values in cells A2:A42 and the Y values in cells B1:AP1 (so they form a rectangle).
2) Then in the cells in the middle I put the formula I want (i.e. =sin(A$2)+sin($B1)).
3) I saved the file as CSV (but separated by spaces!) and manually edited it to look this way (the way QtOctave opens matrix files; in Matlab it might be different). For example (note the extra space before each column):
# Created by Octave 3.2.4, Thu Jan 12 19:32:05 2012 ART <diego@notebook2>
# name: z
# type: matrix
# rows: 3
# columns: 3
1 2 3
4 5 6
7 8 9
(If you're not sure how to do it, do what I did: create a simple matrix and export it to see what the exported file looks like!)
4) Octave has a function under Data -> Load matrix from file which loads that kind of file. Or you can just run this command (varname is the name of the resulting variable):
load("-text", "file-where-the-data-is", "varname")
5) Create the graph (ex is the name of the matrix I've just imported):
x=1:.1:5;
y=1:.1:5;
mesh(x,y,ex)