I have a CSV file made up of many couples of columns; each couple has code_### and name_###:
code_boat|name_boat|year|code_color|name_color|code_size|name_size
1|jeanneau|2000|#00f|blue|5|small
2|bavaria|2005|#00f|blue|10|big
1|jeanneau|2010|#f00|red|10|big
2|bavaria|2008|#000|white|5|small
3|fountaine-pajot|2005|#f00|red|5|small
1|jeanneau|2012|#000|white|5|small
Displayed as a table, it looks like this:
code_boat │ name_boat │ year │ code_color │ name_color │ code_size │ name_size
──────────┼─────────────────┼──────┼────────────┼────────────┼───────────┼───────────
1 │ jeanneau │ 2000 │ #00f │ blue │ 5 │ small
2 │ bavaria │ 2005 │ #00f │ blue │ 10 │ big
1 │ jeanneau │ 2010 │ #f00 │ red │ 10 │ big
2 │ bavaria │ 2008 │ #000 │ white │ 5 │ small
3 │ fountaine-pajot │ 2005 │ #f00 │ red │ 5 │ small
1 │ jeanneau │ 2012 │ #000 │ white │ 5 │ small
I need to count how many times these couples are used, and keep the couple index:
couple_index │ code │ name │ count
─────────────┼───────┼─────────────────┼───────
0 │ 1 │ jeanneau │ 3
0 │ 2 │ bavaria │ 2
0 │ 3 │ fountaine-pajot │ 1
2 │ #000 │ white │ 2
2 │ #f00 │ red │ 2
2 │ #00f │ blue │ 2
4 │ 5 │ small │ 4
4 │ 10 │ big │ 2
i.e., in the same pipe-delimited format as the input:
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#f00|red|2
2|#00f|blue|2
4|5|small|4
4|10|big|2
I know how to do it couple by couple with awk, but I'd like to do it all at once, because the CSV files are pretty big.
awk -F'|' '{c[$39" "$40]++} END{for (i in c) {if (c[i]>0) print i,c[i]}}' myfile.csv
Assumptions/Understandings:
from OP's comments the actual data file is pipe-delimited with no leading/trailing spaces in fields (see modified input file - below)
output is to be generated in the same format (ie, pipe-delimited with no leading/trailing spaces in fields)
Sample input file:
$ cat myfile.csv
boat_CODE|boat_NAME|color_CODE|color_NAME|size_CODE|size_NAME
1|jeanneau|#00f|blue|5|small
2|bavaria|#00f|blue|10|big
1|jeanneau|#f00|red|10|big
2|bavaria|#000|white|5|small
3|fountaine-pajot|#f00|red|5|small
1|jeanneau|#000|white|5|small
NOTE: will need to come back and modify the code depending on what, if any, header record(s) actually exists in the file
One GNU awk idea making use of arrays of arrays (aka multi-dimensional arrays):
awk '
BEGIN { FS=OFS="|" }
NR>1  { for (i=1; i<=NF; i+=2)
            counts[(i-1)][$i][$(i+1)]++
      }
END   { print "couple_index","CODE","NAME","count"
        for (ndx=0; ndx<NF; ndx+=2)
            for (code in counts[ndx])
                for (name in counts[ndx][code])
                    print ndx, code, name, counts[ndx][code][name]
      }
' myfile.csv
This generates:
couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2
OP has mentioned in comments they are running on macOS; assuming GNU awk is not available we can use a multi-value hash as the index for a single-dimensional array, eg:
awk '
BEGIN { FS=OFS="|" }
NR>1  { for (i=1; i<=NF; i+=2)
            counts[(i-1) FS $i FS $(i+1)]++
      }
END   { print "couple_index","CODE","NAME","count"
        for (i in counts)
            print i, counts[i]
      }
' myfile.csv
This generates:
couple_index|CODE|NAME|count
0|3|fountaine-pajot|1
2|#f00|red|2
4|5|small|4
0|1|jeanneau|3
4|10|big|2
2|#000|white|2
0|2|bavaria|2
2|#00f|blue|2
Sorting:
If the result needs to be sorted this will probably be easier in bash via the sort command:
remove the print "couple_index","CODE","NAME","count" from both awk solutions; instead move this up to the command line
pipe the awk results to sort
One idea:
echo "couple_index|CODE|NAME|count" > result.csv
awk '.....' myfile.csv | sort -t'|' -k1,1n -k2,2V -k3,3 >> result.csv
Both awk solutions generate:
$ cat result.csv
couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2
If the couples are always one next to another, you can easily do it with a loop:
awk 'BEGIN{FS=OFS="|"}
(FNR>1){for(i=1;i<=NF;i+=2) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' file
This does not take care of issues that could result from misalignment or typos.
If the table has intermediate columns that are of no interest to the problem at hand, it is paramount to process the header first:
awk 'BEGIN{FS=OFS="|"}
(FNR==1) { for(i=1;i<=NF;++i) if ($i ~ /_CODE *$/) { idx[i] } }
(FNR>1) { for(i in idx) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' file
I would implement counting of pairs as follows. Let file.txt content be:
boat_CODE | boat_NAME | color_CODE | color_NAME | size_CODE | size_NAME
1 | jeanneau | #00f | blue | 5 | small
2 | bavaria | #00f | blue | 10 | big
1 | jeanneau | #f00 | red | 10 | big
2 | bavaria | #000 | white | 5 | small
3 | fountaine-pajot | #f00 | red | 5 | small
1 | jeanneau | #000 | white | 5 | small
then
awk 'BEGIN{FPAT="[^[:space:]|]+"}NR>1{for(i=1;i<=NF;i+=2){c[$i" "$(i+1)]+=1}}END{for(i in c){printf "%-25s%s\n",i,c[i]}}' file.txt
output
10 big 2
#00f blue 2
#f00 red 2
2 bavaria 2
5 small 4
3 fountaine-pajot 1
#000 white 2
1 jeanneau 3
Explanation: I inform GNU AWK that a field consists of one or more (+) characters which are not (^) whitespace ([:space:]) or |. Then for each row after the first (NR>1) I iterate using a for loop with a step of 2, and increase the value in array c under a key that is the concatenation of this column's value, a space, and the next column's value. After all lines are processed I printf the key-value pairs from array c, with the key left-justified in a string of length 25 (feel free to change this to fit your needs). Disclaimer: this solution assumes there is never whitespace inside a value.
(tested in gawk 4.2.1)
Imagine this is the table, "News_Articles", with two columns, "ID" and "Headline":
NEWS_ARTICLES
ID   | Headline
-----+-------------------------------------
0001 | Today's News: Local Election Today!
0002 | COVID-19 Rates Drop Today
0003 | Today's the day to shop local
I'm looking for a query that will show:
One word per row (from the headline column)
A count of how many unique IDs it appears in
A count of how many total times the word appears in the whole dataset
DESIRED RESULT
Word     | Unique_Count | Total_Count
---------+--------------+------------
Today    | 3            | 4
Local    | 2            | 2
Election | 1            | 1
Ideally, we'd like to strip contractions from the words as well (note how "Today's" above is counted as "Today").
I'd also like to be able to remove filler words such as "the" or "a". Ideally this would be through some existing library, but if not, I can always manually remove the ones I see with a WHERE clause.
I would also change all characters to lowercase if needed.
Thank you!
You can use full text search and unnest to extract the lexemes, then aggregate:
SELECT parts.lexeme AS word,
count(*) AS unique_count,
sum(cardinality(parts.positions)) AS total_count
FROM news_articles
CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
GROUP BY parts.lexeme;
word │ unique_count │ total_count
═══════╪══════════════╪═════════════
-19 │ 1 │ 1
covid │ 1 │ 1
day │ 1 │ 1
drop │ 1 │ 1
elect │ 1 │ 1
local │ 2 │ 2
news │ 1 │ 1
rate │ 1 │ 1
shop │ 1 │ 1
today │ 3 │ 4
(10 rows)
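The 'english' text search configuration used above already drops filler words such as "the" and "a" and stems "Today's" down to "today" (visible in the output: "today" has a total count of 4), which covers most of the cleanup asked for. If a few unwanted lexemes still get through, such as "-19" here, they can be filtered with an ordinary WHERE clause. This is only a sketch; the exclusion list is just an illustration:

SELECT parts.lexeme AS word,
       count(*) AS unique_count,
       sum(cardinality(parts.positions)) AS total_count
FROM news_articles
CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
WHERE parts.lexeme NOT IN ('-19')   -- hypothetical list of lexemes to drop
GROUP BY parts.lexeme;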
As output from a script, I produce input for tbl. However, when a table reaches the end of a page, the borders of the table go all over the place. As an example:
│ │ │ │
│ │ │ │
│ │ │ │
│ │ ‐ 1 ‐ │ │
│ │ │ │
│ │ │ │
│ │ │ │
4. The in3 intermediate data structure │
│ │ │ │
In3 is an intermediate language. The goal of the
intermediate language is to provide all the content in the
right │order, in such a way that the output‐filters can
(this is nroff output). The stray column borders line up with the table at the bottom of the page.
This mainly seems to happen when a table is fully specified (i.e. for every row, a line is written in the header), for example:
.TS
allbox,center;
l l l
l l l
l l l
l l l
l l l
^ l l
l l l.
I must do this, because I do not know beforehand when two rows need a merged cell (^).
I tried to put in a conditional new page before every table, but that is less obvious than it looks, because a) nroff (text output) and groff (ps-output) do not seem to handle this the same way and b) it is difficult (due to possible multi-line cells) to predict how long a table will be.
I would like a solution that does not force me to begin a new page for every table.
It may be sufficient just to fully specify the table by giving it an explicit table header, which then gets repeated at the start of the next page after a page split. You may also need to use the -mm or -ms macro packages, which also do end-of-page handling and need to cooperate with tbl and the T# macro it creates for this purpose.
The format is
.TS H
options ;
format .
heading
.TH
data
data
.TE
The heading line above can be omitted, but you still need the .TH and the .TS H.
I made some tests with groff 1.22.3; the following example, with a forced page length (.pl) of 14 lines, worked well with -mm but not with -ms.
( echo .pl 14
echo .TS H
echo 'allbox,center;'
for ((i=1;i<5;i++)); do echo 'l l l'; done
echo '^ l l'
for ((i=1;i<5;i++)); do echo 'l l l'; done
echo 'l l l.'
echo .TH
for ((i=1;i<11;i++)); do echo -e 'a\tb\tc';done
echo .TE
) >t
tbl t | nroff -mm
Here's part of the output, with the blank lines removed:
- 1 -
+--+---+---+
|a | b | c |
+--+---+---+
|a | b | c |
+--+---+---+
- 2 -
+--+---+---+
|a | b | c |
+--+---+---+
- 3 -
+--+---+---+
| | b | c |
|a +---+---+
| | b | c |
+--+---+---+
I'm looking for a solution in GNU sed, but POSIX sed is OK, and awk would be OK but is probably more complicated than necessary. I prefer sed for this; it should be easy, but I'm stuck. It seems like a one-liner can do this, with no need to create a python/bash script or anything.
my attempted solution
sed -i '218,226140d; 218i ...REMOVED...' psql.log
This deletes the desired rows, but the insert gets lost. If I move the insert to line 217 I get:
sed -i '218,226140d; 217i ...REMOVED...' psql.log
result:
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
...REMOVED...
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
I know - this should be good enough, but I'm annoyed that I can't get this simple one-liner to work right. What am I missing?
the problem
I keep the psql.log file as a reference for work I am doing developing SQL code. It's very useful to see iterations of the query and the results.
The problem is that sometimes I forget to limit the output and the query will generate 100k+ rows of results that aren't a helpful reference, and I'd like to delete them from my file, leaving a note that reminds me the query output has been excised.
It would be nice to match a pattern, say, squash every output of more than 50 rows down to just the first 5 rows and the last 5. However, it's easy for me to mark the line numbers where I've blown up the file, so I'd be happy with just using sed to delete lines N through M and insert the message ...REMOVED... where line N was.
Here is an example log file, added notes are in parentheses. The query text can change and the number of columns can be from 1 to 100 or more:
...
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217)
│ CC6XF2SVWT │ 13182280615 │ [null] │
(many rows)
│ CC6XF2XWDT │ 995086081 │ [null] │
│ CC6XFX3TL1 │ 25195177405 │ [null] │
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
/* another query begins */
select * from table where X = 1 limit 20;
/* well done you remembered to limit the output */
**************************
...
acceptable output
the query text should all be untouched, and the top/bottom three rows of output kept. The annotation ...REMOVED... has been added and rows 218 through 226140 have been deleted:
********* QUERY **********
select *
from table
where rnk <= 3
**************************
┌────────────┬─────────────────────┬─────────────────┐
│ col_one │ col_two │ column_three │
├────────────┼─────────────────────┼─────────────────┤
│ CC00CBSNRY │ 553854451 │ 15003.44 │
│ CC00CBSNRY │ 1334177150 │ 5159.57 │
│ CC6XDSQGH2 │ 42385958605 │ [null] │ (line 217 in original file)
...REMOVED...
│ CC6XJ8YG5C │ 24661013005 │ [null] │ (line 226141 in original file)
│ CC6XJ9HGRG │ 44946564505 │ [null] │
│ CC6XMQW6SJ │ 34496719615 │ [null] │
└────────────┴─────────────────────┴─────────────────┘
(225926 rows)
********* QUERY **********
(etc just like example above)
update
the border comes from my .psqlrc with \pset border 2
therefore solutions depending on the ┌ character are fragile but OK
over time I've learned that manually flagging the line numbers is too time-consuming, so the best solution needs a pattern match
Here is an example for "every output more than 50 rows I could squash down to just the first 5 rows and the last 5".
With test input:
$ seq 160 | awk -vstart=10 -vmax=50 -vleft=5 '{if(NR < start) {print; next} {i++; if(i <= left || i > max - left){print}; if(i == left + 1){print "...REMOVED..."}if(i == max){i = 0}}}'
If you'd rather put the script in a file, store this as squash.awk:
BEGIN {
    start = 10;
    max = 50;
    left = 5;
}
{
    if (NR < start) {
        print;
        next
    }
    i++;
    if (i <= left || i > max - left) {
        print
    }
    if (i == left + 1) {
        print "...REMOVED...";
    }
    if (i == max) {
        i = 0
    }
}
For testing:
$ seq 160 | awk -f squash.awk
Variable start is the line number at which squashing will begin.
Variable max is the maximum number of rows (in your example 50).
Variable left is how many rows are kept at the start and at the end of each max-row block.
if(NR < start) {print; next} means that if the line number is less than start (in our case 10), we just print the line and go to the next one.
Here you can put any condition to skip squashing.
i++ increments the row counter.
if(i <= left || i > max - left){print} prints the line if the row counter is at most left or greater than max - left.
if(i == left + 1){print "...REMOVED..."} prints the "...REMOVED..." message when we start skipping rows.
if(i == max){i = 0} resets the row counter once it reaches max.
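Once the skip condition has been adapted to the real log (the defaults above are tuned for the seq test), it could be run along these lines; the output file name is just an example:

awk -f squash.awk psql.log > psql.trimmed.log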
One in awk:
$ awk '
/^ └/ {                   # at the end marker
    for(j=1;j<=6;j++)     # print the wanted records from buffer b
        print b[j]
    for(j=(i-2);j<=i;j++)
        print b[j]
    delete b              # reset the buffer
    i=0                   # and the flag / counter
}
/^ ┌/ || i {              # at the start marker or when the flag is up
    b[++i]=$0             # gather records into the buffer
    next
} 1' file                 # print records which are not between the markers
This might work for you (GNU sed):
sed -r '/\o342[^\n]*$/{:a;N;//ba;s/^(([^\n]*\n){6}).*((\n[^\n]*){5})$/\1 ... REMOVED ...\3/}' file
Focus only on the table data, which will always contain a byte with octal value 342 (the first byte of the UTF-8 box-drawing characters). Gather up the table lines in the pattern space, substitute the required value ... REMOVED ..., and print. The number of lines kept above and below the inserted string can be altered here: 6 (headings + 3 rows) and 5 (inserted string + 3 rows + table count).
To change a range use:
sed 'm,nc ... REMOVE ...' file # where m,n are the from and to line numbers
or:
sed -e 'ma ...REMOVE ...' -e 'm,nd' file
N.B. the d command terminates any following commands.
The sed man page is more helpful than you might think at first glance. The [addr]c command is exactly what is needed (note the whitespace after c is ignored):
sed -i '218,226140c ...REMOVED...' psql.log
So there is the solution for known line numbers.
Does anyone want to provide a generic solution where the line numbers aren't known? Probably awk would be the better tool but maybe sed can remove output that is too long.
Is there something like R's table function in Julia? I've read about xtabs, but I do not know how to use it.
Suppose we have an R data.frame rdata whose col6 is of the Factor type.
R sample code:
rdata <- read.csv("mycsv.csv") #1
table(rdata$col6) #2
In order to read data and make factors in Julia I do it like this:
using DataFrames
jldata = readtable("mycsv.csv", makefactors=true) #1 :col6 will now be pooled.
..., but how do I build something like R's table in Julia (i.e., how do I achieve #2)?
You can use the countmap function from StatsBase.jl to count the entries of a single variable. General cross tabulation and statistical tests for contingency tables are lacking at this point. As Ismael points out, this has been discussed in the issue tracker for StatsBase.jl.
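For a single column this is a one-liner; a minimal sketch, assuming jldata was read as in the question (older DataFrames versions index a column as jldata[:col6], newer ones as jldata.col6 or jldata[!, :col6]):

using StatsBase

# Dict mapping each distinct value of col6 to the number of times it occurs
countmap(jldata[:col6])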
I came to the conclusion that a similar effect can be achieved using by:
Suppose jldata contains a :gender column.
julia> by(jldata, :gender, nrow)
3x2 DataFrames.DataFrame
| Row | gender | x1 |
|-----|----------|-------|
| 1 | NA | 175 |
| 2 | "female" | 40254 |
| 3 | "male" | 58574 |
Of course it's not a table, but at least I get the same data type as the data source. Surprisingly, by seems to be faster than countmap.
I believe, "by" is depreciated in Julia as of 1.5.3 (It says: ERROR: ArgumentError: by function was removed from DataFrames.jl).
So here are some alternatives, we can use split apply combine to do a cross tabs as well or use FreqTables.
Using Split Combine:
Example 1 - SingleColumn:
using RDatasets
using DataFrames
mtcars = dataset("datasets", "mtcars")
## To do a table on cyl column
gdf = groupby(mtcars, :Cyl)
combine(gdf, nrow)
Output:
# 3×2 DataFrame
# Row │ Cyl nrow
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 6 7
# 2 │ 4 11
# 3 │ 8 14
Example 2 - CrossTabs Between 2 columns:
## we just have to change the groupby call a little bit and the rest is the same
gdf = groupby(mtcars, [:Cyl, :AM])
combine(gdf, nrow)
Output:
#6×3 DataFrame
# Row │ Cyl AM nrow
# │ Int64 Int64 Int64
#─────┼─────────────────────
# 1 │ 6 1 3
# 2 │ 4 1 8
# 3 │ 6 0 4
# 4 │ 8 0 12
# 5 │ 4 0 3
# 6 │ 8 1 2
Also, as a side note, if you don't like nrow as the column name on top, you can use:
combine(gdf, nrow => :Count)
to change the name to Count.
Alternate way: Using FreqTables
You can use the FreqTables package as shown below to compute counts and proportions very easily; to add it, use Pkg.add("FreqTables"):
using FreqTables

## Cross tab between cyl and am
freqtable(mtcars.Cyl, mtcars.AM)
## Proportion between cyl and am (overall)
prop(freqtable(mtcars.Cyl, mtcars.AM))
## With margins, as in R (column-wise proportion: margins=2)
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=2)
## Row-wise proportion: margins=1
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=1)
Outputs:
## count cross tabs
#3×2 Named Array{Int64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────
#4 │ 3 8
#6 │ 4 3
#8 │ 12 2
## proportion wise (overall)
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼─────────────────
#4 │ 0.09375 0.25
#6 │ 0.125 0.09375
#8 │ 0.375 0.0625
## Column wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.157895 0.615385
#6 │ 0.210526 0.230769
#8 │ 0.631579 0.153846
## Row wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.272727 0.727273
#6 │ 0.571429 0.428571
#8 │ 0.857143 0.142857
cat raw.txt
Name country IP Cost
sam us 10.10.10.10 $250
jack India 10.10.10.12 $190
joy Australia 10.10.10.13 $230
christ canada 10.10.10.15 $190
jackson africa 10.10.10.20 $230
I need the output formatted as a table with four columns, i.e. Name, Country, IP, Cost.
http://res.cloudinary.com/dzy8bgton/image/upload/v1413617325/Screenshot_from_2014-10-18_12_35_11_h6wjsu.png
Can anyone please help me out?
Here's an old school answer :-)
#!/bin/sh
# use tbl|nroff to make an ASCII table
# use sed to change multiple spaces into a single tab for tbl(1)
sed 's/  */\t/g' < raw.txt | awk '
BEGIN {
    print ".TS"          # beginning of table
    print "allbox;"      # allbox format
    print "c s s s"      # Table name format - centered and spanning 4 columns
    print "lb lb lb lb"  # bold column headers
    print "l l l l."     # table with 4 left-justified columns. "." means repeat for next line
    print "My Table"     # Table name
}
{ print }                # print each line of 4 values
END {
    print ".TE"          # end of table
}' | tbl | nroff -Tdumb
which generates
┌─────────────────────────────────────────┐
│ My Table │
├────────┬───────────┬─────────────┬──────┤
│Name │ country │ IP │ Cost │
├────────┼───────────┼─────────────┼──────┤
│sam │ us │ 10.10.10.10 │ $250 │
├────────┼───────────┼─────────────┼──────┤
│jack │ India │ 10.10.10.12 │ $190 │
├────────┼───────────┼─────────────┼──────┤
│joy │ Australia │ 10.10.10.13 │ $230 │
├────────┼───────────┼─────────────┼──────┤
│christ │ canada │ 10.10.10.15 │ $190 │
├────────┼───────────┼─────────────┼──────┤
│jackson │ africa │ 10.10.10.20 │ $230 │
└────────┴───────────┴─────────────┴──────┘
You can try the column command:
column -t file
Name country IP Cost
sam us 10.10.10.10 $250
jack India 10.10.10.12 $190
joy Australia 10.10.10.13 $230
christ canada 10.10.10.15 $190
jackson africa 10.10.10.20 $230
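If the fields were separated by a single character instead of runs of spaces (for example the pipe-delimited file from the first question), most column implementations also let you name the input separator; a sketch:

column -t -s'|' myfile.csv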