how to guess file encoding

how to guess file encoding - pandas

I have a file (an author list from the Library of Congress) with lines like:
Arteaga, Ana Mar�ia
Corval�an-V�asquez, Oscar E.
(when printed to linux console)
I'd like to read those (either into a pandas dataframe or a set of lines)
df = pd.read_csv(fname, sep='\t', header='infer', lineterminator=None,encoding='latin1') #lineterminator \r\n hits error
or
with open(fname,'r',encoding='ISO-8859-1') as fp:
lines=fp.readlines()
but both are not quite right , giving me output like
Arteaga, Ana Marâia
(again when printed to console)
when I am pretty sure the actual name here should be María.
Does someone recognize this format?

Ok this seems to be the 'marc-8' format .
yaz-iconv -f marc8 -t utf8 infile.txt > outfile.txt
took care of the conversion to utf8 , with the sole hiccup being that yaz killed all the line terminators (both for \r\n and \n versions of the file).
Those can be returned with something along the lines of
sed 's/\[/\n\[/g' outfile.txt > outfile_utf.txt
(for example in my case where each line starts with a '[' character)

Related

Issues converting a small Hex value to a Binary value

I am trying to take the contents of a file that has a Hex number and convert that number to Binary and output to a file.
This is what I am trying but not getting the binary value:
xxd -r -p Hex.txt > Binary.txt
The contents of Hex.txt is: ff
I have also tried FF and 0xFF, but would like to just use ff since the device I am pulling the info from has it in that format.
Instead of 11111111 which it should be, I get a y with 2 dots above it.
If I change it to ee, I get an i with 2 dots. It seems to be reading it just fine but according to what I have read on the xxd -r -p command, it is not outputing it in the correct format.
The other ways I have found to convert Hex to Binary have either also not worked or is a pretty big Bash script that seems unnecessary to do what I thought would be a simple task.
This also gives me the y with 2 dots.
$ for i in $(cat Hex.txt) ; do printf "\x$i" ; done > Binary.txt
For some reason almost every solution I find gives me this format instead of a human readable Binary value with 1s and 0s.
Any help is appreciated. I am planning on using this in a script to pull the Relay values from Digital Loggers devices using curl and giving Home Assistant a readable file to record the Relay State. Digital Loggers curl cmd gives the state of all 8 relays at once using Hex instead of being able to pull the status of a specific relay.

If "file.txt" contains:
fe
0a
and you run this:
perl -ane 'printf("%08b\n",hex($_))' file.txt
You'll get this:
11111110
00001010
If you use it a lot, you might want to make a bash function of it in your login profile along these lines - being extremely respectful of spaces and semi-colons that might look unnecessary:
bin(){ perl -ane 'printf("%08b\n",hex($_))' $1 ; }
Then you'll be able to do:
bin file.txt
If you dislike Perl for some reason, you can achieve something similar without it as follows:
tr '[:lower:]' '[:upper:]' < file.txt |
while read h ; do
echo "obase=2; ibase=16; $h" | bc
done

How to Edit a text from the output in DCL -- OpenVMS scripting

I wrote the below code, which will extract the directory name along with the file name and I will use purge command on that extracted Text.
$ sear VAXMANAGERS_ROOT:[PROC]TEMP.LIS LOG/out=VAXMANAGERS_ROOT:[DEV]FVLIM.TXT
$ OPEN IN VAXMANAGERS_ROOT:[DEV]FVLIM.TXT
$ LOOP:
$ READ/END_OF_FILE=ENDIT IN ABCD
$ GOTO LOOP
$ ENDIT:
$ close in
$ ERROR=F$EXTRACT(0,59,ABCD)
$ sh sym ERROR
$ purge/keep=1 'ERROR'
The output is as follows:
ERROR = "$1$DKC102:[PROD_LIVE.LOG]DP2017_TMP2.LIS;27392 "
Problem here is --- Every time the directory length varies (Length may be 59 or 40 or some other value, but the directory and filename length will not exceed 59 characters in my system). So in the above output, the system is also fetching the Version number of that file number. So I am not able to purge the file along with the version number.
%PURGE-E-PURGEVER, version numbers not permitted
Any suggestion -- How to eliminate the version number from the output ?
I cannot use the exact length of the directory, as directory length varies everytime.... :(

The answer with F$ELEMENT( 0, ";", ABCD ) should work, as confirmed. I might script something like this:
$ ERROR = F$PARSE(";",ERROR) ! will return $1$DKC102:[PROD_LIVE.LOG]DP2017_TMP2.LIS;
$ ERROR = ERROR - ";"
$ PURGE/KEEP=1 'ERROR'
Not sure why you have the read loop. What you will get is the last line in the file, but assuming that's what you want.

While HABO explained it, some more explanations
Suppose I use f$search to check if a file exists
a = f$search("sys$manager:net$server.log")
then I find I it exists
wr sys$output a
shows
SYS$SYSROOT:[SYSMGR]NET$SERVER.LOG;9
From the help of f$parse I get
help lex f$parse arg
shows, among other things
`Specifies a character string containing the name of a field
in a file specification. Specifying the field argument causes
the F$PARSE function to return a specific portion of a file
specification.
Specify one of the following field names (do not abbreviate):
NODE Node name
DEVICE Device name
DIRECTORY Directory name
NAME File name
TYPE File type
VERSION File version number`
So I can do
wr sys$output f$parse(a,,,"DEVICE")
which shows
SYS$SYSROOT:
and also
wr sys$output f$parse(a,,,"DIRECTORY")
so I get
[SYSMGR]
and
wr sys$output f$parse(a,,,"NAME")
shows
NET$SERVER
and
wr sys$output f$parse(a,,,"TYPE")
shows
.LOG
the version is
wr sys$output f$parse(a,,,"VERSION")
shown as
;9
The lexicals functions can be handy, check it using
help lexical
it shows
F$CONTEXT F$CSID F$CUNITS F$CVSI F$CVTIME F$CVUI F$DELTA_TIME F$DEVICE F$DIRECTORY F$EDIT
F$ELEMENT F$ENVIRONMENT F$EXTRACT F$FAO F$FID_TO_NAME F$FILE_ATTRIBUTES F$GETDVI F$GETENV
F$GETJPI F$GETQUI F$GETSYI F$IDENTIFIER F$INTEGER F$LENGTH F$LICENSE F$LOCATE F$MATCH_WILD
F$MESSAGE F$MODE F$MULTIPATH F$PARSE F$PID F$PRIVILEGE F$PROCESS F$READLINK F$SEARCH
F$SETPRV F$STRING F$SYMLINK_ATTRIBUTES F$TIME F$TRNLNM F$TYPE F$UNIQUE F$USER

awk/sed - generate an error if 2nd address of range is missing

We are currently using sed to filter output of regression runs. Sometimes we have a filter that looks like this:
/copyright/,/end copyright/d
If that end copyright is ever missing, the rest of the file is deleted. I'm wondering if there's some way to generate an error for this? awk would also be okay to use. I don't really want to add code that reads the file line by line and issues an error if it hits EOF.
here's a string
copyright
2016 jan 15
end copyright
date 2016 jan 5 time 15:36
last one
I'd like to get an error if end copyright is missing. The real filter also would replace the date line with DATE, so it's more that just ripping out the copyright.

You can persuade sed to generate an error if you reach end of input (i.e. see address $) between your start and end, but it won't be a very helpful message:
/copyright/,/end copyright/{
$s//\1/ # here
d
}
This will error if end copyright is missing or on the last line, with an exit status of 1 and the helpful message:
sed: -e expression #1, char 0: invalid reference \1 on `s' command's RHS
If you're using this in a makefile, you might want to echo a helpful message first, or (better) to wrap this in something that catches the error and produces a more useful one.
I tested this with GNU sed; though if you are using GNU sed, you could more easily use its useful extension:
q [EXIT-CODE]
This command only accepts a single address.
Exit 'sed' without processing any more commands or input. Note
that the current pattern space is printed if auto-print is not
disabled with the -n options. The ability to return an exit code
from the 'sed' script is a GNU 'sed' extension.
Q [EXIT-CODE]
This command only accepts a single address.
This command is the same as 'q', but will not print the contents of
pattern space. Like 'q', it provides the ability to return an exit
code to the caller.
So you could simply write
/copyright/,/end copyright/{
$Q 42
d
}

Never use range expressions /start/,/end/ as they make trivial code very slightly briefer but require a complete rewrite or duplicate conditions when you have the tiniest requirements change. Always use a flag instead. Note that since sed doesn't support variables, it doesn't support flag variables, and so you shouldn't be using sed you should be using awk instead.
In this case your original code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0}' file
And your modified code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0} END{if (f) print "Missing end copyright"}' file
The above is obviously untested since you didn't provide any sample input/output we could test a potential solution against.

With sed you can build a loop:
sed -e '/copyright/{:a;/end copyright/d;N;ba;};' file
:a defines the label "a"
/copyright end/d deletes the pattern space, only when "end copyright" matches
N appends the next line to the pattern space
ba jumps to the label "a"
Note that d ends the loop.
In this way you can avoid to delete the text until the end.
If you don't want the text to be displayed at all and prefer an error message when a "copyright" block stays unclosed, you obviously need to wait the end of the file. You can do it with sed too storing all the lines in the buffer space until the end:
sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file
H appends the current line to the buffer space
g put the content of the buffer space to the pattern space
The file content is only displayed once the last line reached with ${g;p} otherwise when the closing "end copyright" is missing, the current line is changed in the error message with ${c\ERROR MESSAGE\n;} inside the loop.
This way you can test what returns sed before redirecting it to whatever you want.

extracting subtext between two characters using grep

I have text file which has information like follows
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
+
!!!#!!!!!!!!++??????????????~~~~~~~~~~~~~
BBBBBBBBBBBBMMMMMM!!!!!++LLLLLL******
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
+
**********||||||||||||#########++++?????????
MMMMMMMMMUUUU***+++~~~~~~~~~~~~~~~~~~~~~~~~~~
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
+
!!!!!!!!!!!!!!!!!!#####???????????????????
I would like extract the region only between #Mp_* and + that comes right below the text and export it to txt file like following
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
When I used the following code
grep -o -P '(?<=#MP.*).*(?=+)' query.txt > output.txt
It gave me "grep: nothing to repeat".
Could anyone guide where my mistake is and how to rectify it.
Thanks in advance.

Better use awk for this:
awk '/^#/{f=1} /^+/ {f=0} f' file > output.txt
Or, if you have leading spaces, match them with \s*:
awk '/^\s*#/{f=1} /^\s*\+/ {f=0} f' file > output.txt
This uses a flag f to decide whether the line should be printed or not.
When it sees a line starting with #, it activates it.
When it sees a line starting with +, it deactivates it.
Then, it evaluates the flag and prints if it is True.
With your given input it returns:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas

awk getline skipping to last line -- possible newline character issue

I'm using
while( (getline line < "filename") > 0 )
within my BEGIN statement, but this while loop only seems to read the last line of the file instead of each line. I think it may be a newline character problem, but really I don't know. Any ideas?
I'm trying to read the data in from a file other than the main input file.
The same syntax actually works for one file, but not another, and the only difference I see is that the one for which it DOES work has "^M" at the end of each line when I look at it in Vim, and the one for which it DOESN'T work doesn't have ^M. But this seems like an odd problem to be having on my (UNIX based) Mac.
I wish I understood what was going with getline a lot better than I do.

You would have to specify RS to something more vague.
Here is a ugly hack to get things working
RS="[\x0d\x0a\x0d]"
Now, this may require some explanation.
Diffrent systems use difrent ways to handle change of line.
Read http://en.wikipedia.org/wiki/Carriage_return and http://en.wikipedia.org/wiki/Newline if you are
interested in it.
Normally awk hadles this gracefully, but it appears that in your enviroment, some files are being naughty.
0x0d or 0x0a or 0x0d 0x0a (CR+LF) should be there, but not mixed.
So lets try a example of a mixed data stream
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
try]oe
We can see that the last lines are lost.
Now using the ugly hack to RS
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{RS="[\x0d\x0a\x0d]";while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
r=[zoe]
r=[qwe]
r=[try]
We can see every line is obtained, reguardless of the 0x0d 0x0a junk :-)

Maybe you should preprocess your input file with for example dos2unix (http://sourceforge.net/projects/dos2unix/) utility?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

how to guess file encoding - pandas

Related

Issues converting a small Hex value to a Binary value

How to Edit a text from the output in DCL -- OpenVMS scripting

awk/sed - generate an error if 2nd address of range is missing

extracting subtext between two characters using grep

awk getline skipping to last line -- possible newline character issue

Categories

Resources