Python - Compare two CSV files based on a column - pandas

I am trying to compare two CSV files. Most of the time they will contain the same data, but the order of the rows will not be the same. For example:
CSV file 1:
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV file 2:
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
I want to use the third column as a primary key to compare the other values and report the differences. Is this possible in Robot Framework or pandas?

If you are using Robot Framework, you need to do the following:
install robotframework-csvlib
use the built-in Collections library
Input from your question:
CSV file 1:
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV file 2:
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
My Solution
In the approach below, we first read each CSV file into a list of lists and then compare them with the Collections keyword List Should Contain Sub List. Note that we pass the argument values=True so that the values are compared as well.
Code that compares the 2 CSV files:
*** Settings ***
Library    CSVLib
Library    Collections

*** Test Cases ***
Test CSV
    ${list1}=    read csv as list    csv1.csv
    log to console    ${list1}
    ${list2}=    read csv as list    csv2.csv
    log to console    ${list2}
    List Should Contain Sub List    ${list1}    ${list2}    values=True
OUTPUT
(rf1) C:\Users\kgurupra>robot s1.robot
==============================================================================
S1
==============================================================================
Test CSV .[['C1,C2,C3'], ['AAA,111,A1A1'], ['BBB,222,B2B2'], ['CCC,333,C3C3']]
..[['C1,C2,C3'], ['CCC,333,C3C3'], ['BBB,212,B2B2'], ['AAA,111,A1A1']]
Test CSV | FAIL |
Following values were not found from first list: ['BBB,212,B2B2']
------------------------------------------------------------------------------
S1 | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output: C:\Users\kgurupra\output.xml
Log: C:\Users\kgurupra\log.html
Report: C:\Users\kgurupra\report.html

Assuming you've imported your CSV files as pandas DataFrames, you can do the following to merge the two while retaining the differences between them:
df = csv1.merge(csv2, on='<primary key column name>', how='outer')
Adding the suffixes option allows you to more clearly differentiate between identically named columns from each file:
df = csv1.merge(csv2, on='<primary key column name>', how='outer', suffixes=['_csv1', '_csv2'])
After that it depends on what kind of differences you are looking to spot, but a starting point might be:
df['difference_1'] = df['column1_csv1'] == df['column1_csv2']
This creates a boolean column that is True where the observations match and False otherwise.
But there are nearly endless options for comparison.
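Putting it together for the example in the question, here is a minimal sketch (the file names and column labels are made up; the third column is used as the key):
import pandas as pd

# Read both files without headers and label the columns; 'key' is the third column.
cols = ['col1', 'col2', 'key']
csv1 = pd.read_csv('file1.csv', header=None, names=cols)
csv2 = pd.read_csv('file2.csv', header=None, names=cols)

# Outer merge on the key so rows present in only one file are kept as well.
df = csv1.merge(csv2, on='key', how='outer', suffixes=['_csv1', '_csv2'])

# Flag rows whose second column differs between the two files and report them.
df['col2_matches'] = df['col2_csv1'] == df['col2_csv2']
print(df[~df['col2_matches']])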

Related

UniVerse - SQL LIST: View List of All Database Tables

I am trying to obtain a list of all the DB Tables that will give me visibility on what tables I may need to JOIN for running SQL scripts.
For example, in TCL when I run "LIST.DICT" it returns "Name of File:" for input. I then enter "PRODUCT" and it returns a list of all available fields.
However, Where can I get a list of all my available Tables or list of my options that I can enter after "Name of File:"?
Here is what I am trying to achieve: I would like to run a SQL script that gives me the latest log file activity (date, time and description), so that it returns, for example, '8/13/14 08:40am BR: 3;BuyPkg'.
Thank you in advance for your help.
From TCL within the database account containing your database files, type: LISTF
Sample output:
FILES in your vocabulary 03:21:38pm 29 Jun 2015 Page 1
Filename........................... Pathname...................... Type Modulo
File - Contains all logical device names
DICT &DEVICE& /u1/uv/D_&DEVICE& 2 1
DATA &DEVICE& /u1/uv/&DEVICE& 2 3
File - Used by MAKE.MAP.FILE
DICT &MAP& /u1/uv/D_&MAP& 2 1
DATA &MAP& /u1/uv/&MAP& 6 7
File - Contains all parts of Distributed Files
DICT &PARTFILES& /u1/uv/D_&PARTFILES& 2 1
DATA &PARTFILES& /u1/uv/&PARTFILES& 18 7
DICT &PH& D_&PH& 3 1
DATA &PH& &PH& 1
DICT &SAVEDLISTS& D_&SAVEDLISTS& 3 1
DATA &SAVEDLISTS& &SAVEDLISTS& 1
File - Used by uniVerse to access the current directory.
DICT &UFD& /u1/uv/D_UFD 2 1
DATA &UFD& . 19 1
DICT &XML& D_&XML& 18 3
DATA &XML& &XML& 19 1
Firstly, UniVerse has no log file activity date and time. However, you can still obtain the table's modified/accessed date from the file system.
To do this:
1) Write a subroutine that accepts the path of a table and returns a date or a time, e.g. SUBROUTINE GET.FILE.MOD.DATE(DAT.MOD, S.FILE.PATH). Inside the subroutine you can use EXECUTE to run a shell command such as istat to get this information on Unix. Be aware that a dynamic file has Data and Overflow parts under a directory; you should compare the dates of both parts and return only the latest one (see the sketch after these steps).
2) Globally catalog the subroutine.
3) Create an I-Desc in VOC, e.g. I.FILE.MOD.DATE, with SUBR("*GET.FILE.MOD.DATE",F2) in the field definition and "D/MDY2" as the conversion code.
4) Create another I-Desc for the time, e.g. I.FILE.MOD.TIME.
Finally, you can
LIST VOC I.FILE.MOD.DATE I.FILE.MOD.TIME DESC WITH TYPE LIKE "F..."
Alternatively, in SQL:
SELECT I.FILE.MOD.DATE, I.FILE.MOD.TIME, VOC.DESC FROM VOC WHERE TYPE LIKE "F%";
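For clarity, here is the part-comparison logic sketched in Python rather than UniVerse BASIC (DATA.30 and OVER.30 are the usual parts of a dynamic file; treat this as an illustration of the check, not as the subroutine itself):
import os

def latest_mod_time(table_path):
    # For a static file the path is a plain file; for a dynamic (type 30) file
    # it is a directory that usually contains DATA.30 and OVER.30 parts.
    candidates = [table_path]
    for part in ('DATA.30', 'OVER.30'):
        p = os.path.join(table_path, part)
        if os.path.exists(p):
            candidates.append(p)
    # Return the most recent modification time among the parts.
    return max(os.path.getmtime(p) for p in candidates)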

Map columns in Pentaho Spoon

I have an input.csv file with 3 columns: name, age, address. And I have an output.csv file with 5 columns: Person name, Person age, Person address, Person salary, Person pass criteria.
I need to map my input.csv to output.csv. Please help me out with this. I tried the Select values step, but it does not work.
You can do this in 4 steps.
1) Using the CSV file input step, read the name, age, address fields.
2) Using the Select values step, rename the name, age, address fields to Person name, Person age, Person address.
3) Using the Add constants step, add the additional fields Person salary and Person pass criteria.
4) Using the Text file output step, write the result to a CSV file. Here, set the extension to csv and the separator to ','.
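For illustration only (outside Pentaho), the same mapping can be expressed in pandas; the file names and the blank constant values below are assumptions:
import pandas as pd

# Read the 3-column input (name, age, address).
df = pd.read_csv('input.csv')

# Rename the fields, like the Select values step.
df = df.rename(columns={'name': 'Person name',
                        'age': 'Person age',
                        'address': 'Person address'})

# Add the extra fields with constant values, like the Add constants step.
df['Person salary'] = ''
df['Person pass criteria'] = ''

# Write the 5-column output, like the Text file output step.
df.to_csv('output.csv', index=False)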

Split JSON file using apache PIG

I have a JSON input file that needs to be split into multiple files based on a keyword and the output should also retain the same JSON format.
Example:
The keyword here is the value of the field EVT.NAME. Depending on that value, each record should be routed to the corresponding output.
The input has three different values (KEYPRESS, TUNE, TRICK), so 3 different output files should be created.
Input:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
Output1:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 2:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 3:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
You can use JsonLoader and JsonStorage. See this article: http://joshualande.com/read-write-json-apache-pig. After loading with a schema that matches the nested structure, you can SPLIT the relation on EVT.NAME and STORE each branch with JsonStorage().
-- Adjust the schema below to match your actual JSON structure.
events = LOAD 'file.json'
    USING JsonLoader('PV:chararray, DEV:(DEV_ID:chararray), EVT:(NAME:chararray, ETS:long, VALUE:(KEY:chararray)), HOST:chararray');
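If Pig isn't a hard requirement, the same routing can be done with a short Python script. A minimal sketch, assuming newline-delimited JSON as in the input above (the file names are made up):
import json

# Route each JSON line to a per-keyword output file based on EVT.NAME.
outputs = {}
with open('input.json') as src:
    for line in src:
        record = json.loads(line)
        name = record['EVT']['NAME']          # e.g. KEYPRESS, TUNE, TRICK
        if name not in outputs:
            outputs[name] = open('output_%s.json' % name.lower(), 'w')
        outputs[name].write(line)
for fh in outputs.values():
    fh.close()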

How can I delete a specific line (e.g. line 102,206,973) from a 30 GB CSV file?

What method can I use to delete a specific line from a CSV/txt file that is too big to load into memory and edit manually?
Background
My question is really about an indirect solution to a problem with importing CSV files into SQL databases.
I have a series of 10-30 GB CSV files that I want to import into an SQLite table from within R (since they are too large to load into R as whole data frames). I am using the 'RSQLite' package for this.
A couple of them fail because one of the lines is badly formatted; the import is then cancelled, and R returns the line number that caused the failure.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly the line which causes the error.
I see 2 potential 'indirect' solutions which I was hoping someone could help me with:
(i) Deleting the line causing the error in 20+ GB files, e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by just skipping or deleting it. However, I have so far failed to find a way to open the CSV file and remove that line.
(ii) Using SQLite directly (or anything else?) to import the CSV in a way that allows skipping lines or errors.
Although it is probably not directly relevant to the solution, here is the R code used:
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!
To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the deletion to be done in place, do:
sed -i.bak -e '102206973d' your_file
This will create a backup named your_file.bak, and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5
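If sed isn't available (for example on Windows), the same thing can be done with a streaming copy in Python that never loads the whole file into memory; a minimal sketch with placeholder file names:
# Copy the file line by line, skipping the offending line number.
bad_line = 102206973          # 1-based line number reported in the error
with open('csvfilename', 'r') as src, open('csvfilename_fixed', 'w') as dst:
    for lineno, line in enumerate(src, start=1):
        if lineno != bad_line:
            dst.write(line)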

Create a 350,000-column CSV file by merging smaller CSV files

I have about 350,000 one-column CSV files, which are essentially 200-2000 numbers printed one under another. The numbers are formatted like this: "-1.32%" (no quotes). I want to merge the files to create a monster of a CSV file where each file is a separate column. The merged file will have 2000 rows maximum (each column may have a different length) and 350,000 columns.
I thought of doing it with MySQL, but its column limit is far below 350,000. An awk or sed script could do the job, but I don't know them all that well and I am afraid it will take a very long time. I could use a server if the solution requires it. Any suggestions?
This Python script will do what you want:
#!/usr/bin/env python2
import sys
import codecs

# Open every file given on the command line.
fhs = []
for filename in sys.argv[1:]:
    fhs.append(codecs.open(filename, 'r', 'utf-8'))

# Emit one merged row per iteration until every input file is exhausted.
while True:
    lines = [fh.readline() for fh in fhs]
    if not any(lines):
        break
    # Files that have run out contribute an empty field for this row.
    sys.stdout.write(','.join(line.rstrip() for line in lines))
    sys.stdout.write('\n')

for fh in fhs:
    fh.close()
Call it with all the CSV files you want to merge and it will print the merged file to stdout.
Note that you can't merge all the files at once: for one thing, you can't pass 350,000 file names as arguments to a single process, and for another, a process can typically only open a limited number of files at once (often 1024 by default).
So you'll have to do it in several passes, i.e. merge files 1-1000, then 1001-2000, and so on. You should then be able to merge the 350 resulting intermediate files in one final pass.
Or you could write a wrapper script which uses os.listdir() to get the names of all files and calls this script several times.
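A rough sketch of such a wrapper, assuming the column files sit in one directory and the merge script above is saved as merge_columns.py (both names are assumptions):
#!/usr/bin/env python2
import os
import subprocess

SRC_DIR = 'columns'   # hypothetical directory holding the 350,000 input files
BATCH = 1000          # files merged per pass; keep this below the open-file limit

files = sorted(os.listdir(SRC_DIR))
for i in range(0, len(files), BATCH):
    batch = [os.path.join(SRC_DIR, f) for f in files[i:i + BATCH]]
    with open('intermediate_%05d.csv' % (i // BATCH), 'w') as out:
        # merge_columns.py is the script above; it writes the merged rows to stdout.
        subprocess.check_call(['python', 'merge_columns.py'] + batch, stdout=out)

# The intermediate_*.csv files can then be merged with one final pass of the same script.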