I am trying to compare two CSV files. Most of the time they will contain the same data, but the order of the rows will not be the same. For example:
CSV file 1:
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV file 2:
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
I want to use the third column as the primary key to compare the other values and report the differences. Is it possible to do this in Robot Framework or pandas?
If you are using Robot Framework, you need to do the following:
Install robotframework-csvlib
Use the built-in Collections library
Input from your question
CSV file 1:
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV file 2:
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
My Solution
In the approach below, we first read both CSV files into lists of lists, then compare them using the Collections keyword List Should Contain Sub List. Note the argument values=True, which makes the keyword compare the values as well.
Code that compares the 2 CSV files:
*** Settings ***
Library CSVLib
Library Collections
*** Test Cases ***
Test CSV
${list1}= read csv as list csv1.csv
log to console ${list1}
${list2}= read csv as list csv2.csv
log to console ${list2}
List Should Contain Sub List ${list1} ${list2} values=True
OUTPUT
(rf1) C:\Users\kgurupra>robot s1.robot
==============================================================================
S1
==============================================================================
Test CSV .[['C1,C2,C3'], ['AAA,111,A1A1'], ['BBB,222,B2B2'], ['CCC,333,C3C3']]
..[['C1,C2,C3'], ['CCC,333,C3C3'], ['BBB,212,B2B2'], ['AAA,111,A1A1']]
Test CSV | FAIL |
Following values were not found from first list: ['BBB,212,B2B2']
------------------------------------------------------------------------------
S1 | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output: C:\Users\kgurupra\output.xml
Log: C:\Users\kgurupra\log.html
Report: C:\Users\kgurupra\report.html
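The FAIL above is expected for this sample data: the row BBB,222,B2B2 in csv1.csv and BBB,212,B2B2 in csv2.csv differ in the second column, and that is exactly the difference being reported.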
Assuming you've imported your CSV files as pandas DataFrames, you can do the following to merge the two while retaining the fundamental differences:
df = csv1.merge(csv2, on='<primary key column name>', how='outer')
Adding the suffixes option allows you to more clearly differentiate between identically named columns from each file:
df = csv1.merge(csv2, on='<insert name>',how='outer',suffixes=['_csv1','_csv2'])
After that, it depends on what kind of differences you are looking to spot, but a possible starting point is:
df['difference_1'] = df['column1_csv1'] == df['column1_csv2']
This will create a boolean column that is True where the observations are the same and False otherwise.
But there are nearly endless options for comparison.
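Putting this together for the sample data in the question, a minimal sketch could look like this; the column and file names are assumptions, since the sample files have no header row:

import pandas as pd

# Column names are assumptions; the sample files have no header row.
cols = ['name', 'number', 'key']
csv1 = pd.read_csv('csv1.csv', header=None, names=cols)
csv2 = pd.read_csv('csv2.csv', header=None, names=cols)

# Outer merge on the third column so rows missing from either file are kept.
df = csv1.merge(csv2, on='key', how='outer', suffixes=['_csv1', '_csv2'])

# Flag rows whose second column differs between the two files.
df['number_match'] = df['number_csv1'] == df['number_csv2']
print(df[~df['number_match']])

For the sample data this reports the BBB row, since 222 and 212 differ.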
I need to upload a CSV file to BigQuery via the UI. After I select the file from my local drive, I tell BigQuery to automatically detect the schema and run the load job. It fails with the following message:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
I have tried removing the comma in the last column and tried changing options in the advanced section, but it always results in the same error.
The error log is not helping me understand where the problem is. This is an example of an error log entry:
2019-04-03 23:03:50.261 CLST Bigquery jobcompleted
bquxjob_6b9eae1_169e6166db0 frank#xxxxxxxxx.nn INVALID_ARGUMENT
and:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
and:
"Error while reading data, error message: Error detected while parsing
row starting at position: 46. Error: Data between close double quote
(") and field separator."
The strange thing is that the sample CSV data contains NO double quotes at all!?
2019-01-02 00:00:00,326,1,,292,0,,294,0,,-28,0,,262,0,,109,0,,372,0,,453,0,,536,0,,136,0,,2609,0,,1450,0,,352,0,,-123,0,,17852,0,,8528,0
2019-01-02 00:02:29,289,1,,402,0,,165,0,,-218,0,,150,0,,90,0,,263,0,,327,0,,275,0,,67,0,,4863,0,,2808,0,,124,0,,454,0,,21880,0,,6410,0
2019-01-02 00:07:29,622,1,,135,0,,228,0,,-147,0,,130,0,,51,0,,381,0,,428,0,,276,0,,67,0,,2672,0,,1623,0,,346,0,,-140,0,,23962,0,,10759,0
2019-01-02 00:12:29,206,1,,118,0,,431,0,,106,0,,133,0,,50,0,,380,0,,426,0,,272,0,,63,0,,1224,0,,740,0,,371,0,,-127,0,,27758,0,,12187,0
2019-01-02 00:17:29,174,1,,119,0,,363,0,,59,0,,157,0,,67,0,,381,0,,426,0,,344,0,,161,0,,923,0,,595,0,,372,0,,-128,0,,22249,0,,9278,0
2019-01-02 00:22:29,175,1,,119,0,,301,0,,7,0,,124,0,,46,0,,382,0,,425,0,,431,0,,339,0,,1622,0,,1344,0,,379,0,,-126,0,,23888,0,,8963,0
I shared an example of a few lines of CSV data. I expect BigQuery to be able to detect the schema and load the data into a new table.
Using the new BigQuery web UI and your input data, I did the following:
Selected a dataset
Clicked on Create table
Filled in the create table form as needed
The table was created and I was able to SELECT 6 rows as expected:
SELECT * FROM `projectId.datasetId.SO` LIMIT 1000
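If you prefer to load the file programmatically instead of through the UI, a minimal sketch with the google-cloud-bigquery Python client could look like this; the file path and table ID are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Table ID and file path are assumptions for illustration.
table_id = 'projectId.datasetId.SO'
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the schema
)

with open('data.csv', 'rb') as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, 'rows loaded')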
I have a JSON input file that needs to be split into multiple files based on a keyword, and the output should retain the same JSON format.
Example:
The keyword here is the value of the EVT.NAME field. Depending on its value, each record should be routed to the corresponding output file.
The input has three different values (KEYPRESS, TUNE, TRICK), so three different output files should be created.
Input:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
Output 1:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 2:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 3:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
You can use JsonLoader and JsonStorage. See this article: http://joshualande.com/read-write-json-apache-pig.
-- The schema passed to JsonLoader mirrors the structure of each input record.
records = LOAD 'file.json'
    USING JsonLoader('PV:chararray, DEV:(DEV_ID:chararray), EVT:(NAME:chararray, ETS:long, VALUE:(KEY:chararray)), HOST:chararray');
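If Pig is not a hard requirement, the same split can also be done with a short plain-Python sketch; the input and output file names here are assumptions:

import json
from collections import defaultdict

# Group the original JSON lines by the value of EVT.NAME.
buckets = defaultdict(list)
with open('input.json') as f:
    for line in f:  # one JSON object per line
        record = json.loads(line)
        buckets[record['EVT']['NAME']].append(line.rstrip('\n'))

# One output file per distinct EVT.NAME value (KEYPRESS, TUNE, TRICK).
for name, lines in buckets.items():
    with open('output_' + name.lower() + '.json', 'w') as out:
        out.write('\n'.join(lines) + '\n')

Because the original lines are written back verbatim, the output keeps exactly the same JSON format as the input.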