How to render characters instead of hex? (\x60) - awk

I have a file (file.txt) with this content:
file.txt:
+ "Chapter 8\n“So Much More Than Just an ‘A’”: A Transformative High School and University Writing Center Partnership
+ "Chapter 9\n“Oh, I Get By with a Little Help from My Friends”: Short-Term Writing Center/Community Collaborations"
My code extracts certain data from this with gawk, and then writes the results to a new file (nav.txt) using printf.
nav.txt:
Chapter 8 \xE2\x80\x9CSo Much More Than Just an \xE2\x80\x98A\xE2\x80\x99\xE2\x80\x9D: A Transformative High School and University Writing Center Partnership
Chapter 9 \xE2\x80\x9COh, I Get By with a Little Help from My Friends\xE2\x80\x9D: Short-Term
Does anyone know why the special characters (like ‘, -, etc) render as these weird hex codes? (\xE2\x80\x98A, \xE2\x80\x9D, etc)
If so, how can I output the correct characters to nav.txt, instead of their hex equivalents?
my gawk code, running on Ubuntu:
gawk '
match($0, /^[\|\+-].*"([^"]+)".*#([[:digit:]]+)/, m) {
    # m[2] is the captured number, m[1] the quoted title
    printf "<li>\n %s %s\n</li>\n", m[2], m[1] >> "nav.txt"
}
' file.txt
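No answer is recorded here, but a first diagnostic step (a sketch, assuming the escapes come from a locale mismatch or from the viewer rather than from the data itself) is to check which locale gawk runs under and to look at the raw bytes actually written; \xE2\x80\x9C is simply the UTF-8 byte sequence for a curly opening quote:
locale                       # is LANG/LC_ALL a UTF-8 locale?
od -c nav.txt | head         # inspect the bytes gawk actually wrote
LC_ALL=en_US.UTF-8 gawk -f extract.awk file.txt   # re-run under an explicit UTF-8 locale
Here extract.awk is a hypothetical file holding the script above. If od shows real UTF-8 bytes, the characters are fine and only the viewer is rendering them as escapes.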

Related

GNU Radio text file sink

I'm trying to teach myself the basics of GNU Radio and DSP. I created a flowgraph in GNU Radio Companion that takes a vector that is the binary representation of a single character (the character "1" as "00110001"), modulates, demodulates, and writes to a file sink.
The scope sink after demodulation looks like the values come through (the trace appears to show the correct pattern of 0s and 1s), but the file sink, although its size is 19 bytes, appears empty, or at least does not contain the correct values (I've looked at it in ASCII and hex editors). I assumed the single transferred character would result in 1 byte (8 bits), not 19 bytes. Changing some of the settings in the Polyphase Sync and adding a Repack Bits block after the binary slicer results in some characters in the output file, but never the right character.
My questions are:
Can GNU Radio take a single character, modulate/demodulate it, and return the same character?
Are there errors in my flowgraph?
I'd appreciate any insights or suggestions, thank you.

How do I use awk to split a file into multiline records?

On OS X, I've converted a PowerPoint deck to ASCII text, and now want to process it with awk.
I want to split the file into multiline records corresponding to the slides in the deck.
Treating any line beginning with a capital Latin letter as the start of a record gives a good approximation, but I can't figure out how to do this in awk.
I've tried resetting the record separator, RS = "\n^[A-Z]" and RS = "\n^[[:alnum:]][[:upper:]]", and various permutations, but none of them differentiate: awk keeps treating each individual line as a record, rather than grouping lines as I want.
The cleaned text looks like this:
Welcome
++ Class will focus on:
– Basics of SQL syntax
– SQL concepts analogous to Excel concepts
Who Am I
++ Self-taught on LAMP(ython) stack
++ Plus some DNS, bash scripting, XML / XSLT
++ Prior professional experience:
– Office of Management and Budget
– Investment banking (JP Morgan, UBS, boutique)
– MBA, University of Chicago
Roadmap
+ Preliminaries
+ What is SQL
+ Excel vs SQL
+ Moving data from Excel to SQL and back
+ Query syntax basics
- Running queries
- Filtering, grouping
- Functions
- Combining tables
+ Using queries for analysis
Some 'slides' have blank lines, some don't.
Once past these hurdles I plan to wrap each record in a tag for use in deck.js. But getting the record definitions right is killing me.
How do I do those things?
EDIT: The question initially asked also about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in comments focus on that stuff.
In awk you could try to collect records using:
/^[[:upper:]]/ {                 # a capital letter starts a new record
    if (r > 0) print rec         # emit the record collected so far
    r = 1; rec = $0 RS; next     # start the new record
}
{
    rec = rec $0 RS              # accumulate continuation lines
}
END {
    if (r > 0) print rec         # emit the final record
}
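Since the stated goal is to wrap each record for deck.js, the collector above could emit the wrapper directly. A sketch, where the <section class="slide"> markup and the input name slides.txt are assumptions:
awk '
/^[[:upper:]]/ {
    if (r > 0) print "<section class=\"slide\">\n" rec "</section>"
    r = 1; rec = $0 RS; next
}
{ rec = rec $0 RS }
END { if (r > 0) print "<section class=\"slide\">\n" rec "</section>" }
' slides.txt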
To remove bullets you could use:
gsub(/•/, "++", rec)
You might try using the "textutil" utility built into OS X to convert the file within a script, to save you doing it all by hand. Try typing the following into a Terminal window, pressing the space bar to move to the next page:
man textutil
Once you have got some converted text, try posting that so people can see what the inputs look like, then maybe someone can help you split it up how you want.
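For example, converting an RTF export of the deck to plain text looks like this (a sketch; slides.rtf is a hypothetical file name, and note that textutil reads formats like rtf, doc, docx, and html rather than .ppt directly):
textutil -convert txt slides.rtf    # writes slides.txt next to the input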

Reading unformatted data: Intel ifort vs IBM xlf

I'm trying to shift from Intel ifort to IBM xlf, but I run into a problem when reading "unformatted" data (by unformatted I mean the values are not all the same width). Here is an example:
program main
    implicit none
    real(8) a, b
    open(unit=10, file='1.txt')
    read(10,*) a
    read(10,*) b
    write(*,'(E20.14E2)') a, b
    close(10)
end program
1.txt:
0.10640229631236
8.5122792850319D-02
using ifort I get output:
0.10640229631236E+00
0.85122792850319E-01
using xlf I get the same output, preceded by a runtime warning (the "invalid digit" it quotes is a carriage return, which is why the message wraps over itself on the terminal):
invalid digit '^M' in the input file. The program will recover by assuming a zero in its place.
0.10640229631236E+00
0.85122792850319E-01
Since the values in 1.txt are not all the same width, I can't use a fixed format to read them. Does anyone know how to get rid of this warning?
(Question answered in the comments. See Question with no answers, but issue solved in the comments (or extended in chat).)
@M.S.B wrote:
Is there an apostrophe in the input file? Or any character besides digits, decimal point and "D"? Your reads are "list directed".
The OP wrote:
Yes, it seems there is some character after 0.10640229631236 that causes this warning. When I write those numbers to a new file by hand (changing lines after 0.10640229631236 with the Enter key), the warning goes away. I ran cat -v on the two files: for the file with the warning I get
0.10640229631236^M
8.5122792850319D-02
while for the file without the warning I get
0.10640229631236
8.5122792850319D-02
Do you know what that ^M stands for and where it comes from?
@agentp gave the link:
'^M' character at end of lines
which explains that ^M is how the carriage-return character from Windows-style line endings is displayed.
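A common fix, as a sketch, is to strip the carriage returns before the Fortran program reads the file (1_fixed.txt is a hypothetical output name; dos2unix, where installed, does the same in place):
tr -d '\r' < 1.txt > 1_fixed.txt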

GAWK Script using special characters

I am having an issue using special characters. I am parsing a text file separated by tabs. I want to have the program add a "*" to the first word in the line if a certain parameter is true.
if ($Var < $3) $1 = \*$1
Now every time I run it I get an error complaining that the backslash is not the last character on the line.
2 things, but without more context to test with we really can't help you much.
$Var will only have meaning to awk if you have set Var earlier (e.g. Var=3), and even then $Var means "field number Var", not your shell variable. The right-hand side of the expression, < $3, WILL expand to the value of the 3rd field. If you're getting $Var from the shell environment, you need to let the gawk script 'see' that value, i.e.
awk '{ ..... if ('"$Var"' < $3) $1 = "*" $1 ..... }'
If you want the string literal '*' prepended, you're better off doing $1 = "*" $1
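A sketch of the same idea using gawk's -v option instead of splicing shell quotes, with the tab delimiter from the question and a hypothetical input file.txt:
gawk -F'\t' -v OFS='\t' -v var="$Var" '{ if (var + 0 < $3 + 0) $1 = "*" $1 } 1' file.txt
The "+ 0" forces a numeric comparison, and the trailing "1" is awk shorthand for "print every line".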
Without sample inputs, sample expected output, actual output and error messages, we'll be playing 20 questions here. If these comments don't solve your problem, please edit your question above to include these items.

Summing values in one-line comma delimited file

EDIT: Thanks all of you. Python solution worked lightning-fast :)
I have a file that looks like this:
132,658,165,3216,8,798,651
but it's MUCH larger (~ 600 kB). There are no newlines, except one at the end of file.
And now I have to sum all the values in it. I expect the final result to be quite big, but if I were to sum it in C++ I have a bignum library, so it shouldn't be a problem.
How should I do that, and in what language / program? C++, Python, Bash?
Penguin Sed, "Awk"
sed -e 's/,/\n/g' tmp.txt | awk 'BEGIN {total=0} {total += $1} END {print total}'
Assumptions:
- Your file is tmp.txt (you can edit this, obviously)
- Awk can handle numbers that large
Python
sum(map(int,open('file.dat').readline().split(',')))
The language doesn't matter, so long as you have a bignum library. A rough pseudo-code solution would be:
str = ""
sum = 0
while input has characters
    get character from input
    if character is not ',' and not a newline
        append character to back of str
    else if str is not empty
        convert str to number
        add number to sum
        str = ""
if str is not empty
    convert str to number
    add number to sum
output sum
If all of the numbers are smaller than (2**64)/600000 (which still has 14 digits), an 8-byte datatype like "long long" in C will be enough. The program is pretty straightforward; use the language of your choice.
Since it's expensive to treat that large an input as a whole, I suggest you take a look at this post. It explains how to write a generator for string splitting. It's in C# but it's well suited for crunching through that kind of input.
If you are worried that the total sum won't fit in an integer (say 32-bit), you can just as easily implement a bignum yourself, especially if you just use integers and addition. Just carry bit 31 into the next dword and keep adding.
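That manual-carry idea can even be sketched in awk, which otherwise sums in doubles: keep two decimal "limbs" small enough that every addition stays below 2^53, where doubles are still exact (this assumes each input value is itself below 10^15):
gawk 'BEGIN { RS = "," }
{
    lo += $1                 # exact: both operands stay far below 2^53
    if (lo >= 1e15) {        # carry into the high limb, like carrying
        hi += 1              # bit 31 into the next dword
        lo -= 1e15
    }
}
END {
    if (hi > 0) printf "%d%015d\n", hi, lo
    else print lo
}' tmp.txt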
If precision isn't important, just accumulate the result in a double. That should give you plenty of range.
http://www.koders.com/csharp/fid881E3E70CC37E480545A0C37C98BC8C208B06723.aspx?s=datatable#L12
A fast C# CSV parser. I've seen it crunch through a few thousand 1 MB files rather quickly; I have it running as part of a service that consumes about 6000 files a month.
No need to reinvent a fast wheel.
Python can handle big integers natively.
tr "," "\n" < file | any old script for summing
Ruby is convenient, since it automatically handles big numbers. I can't remember whether Awk does arbitrary-precision arithmetic, but if so, you could use:
awk 'BEGIN { RS = "," ; sum = 0 }
     { sum += $1 }
     END { print sum }' < file
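As it happens, gawk built with MPFR/GMP support does offer arbitrary-precision arithmetic via the -M (--bignum) switch, so the awk version can be made exact:
gawk -M 'BEGIN { RS = "," ; sum = 0 }
         { sum += $1 }
         END { print sum }' < file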