Hey guys, so I have this database:
+----+-----------+-----------------+----------------+
| id | item_name | number_of_store | store_location |
+----+-----------+-----------------+----------------+
|  3 | margarine |               2 | QLD            |
|  4 | margarine |               2 | NSW            |
|  5 | wine      |               3 | QLD            |
|  6 | wine      |               3 | NSW            |
|  7 | wine      |               3 | NSW            |
|  8 | laptop    |               1 | QLD            |
+----+-----------+-----------------+----------------+
I got the result I wanted using the sqlite3 command-line syntax, which is the following:
+----+-----------+-----------------+----------------+
| id | item_name | number_of_store | store_location |
+----+-----------+-----------------+----------------+
|  3 | margarine |               2 | QLD            |
|  4 | margarine |               2 | NSW            |
+----+-----------+-----------------+----------------+
The syntax is:
sqlite3 store.sqlite "select id,item_name,number_of_store,store_location from store where item_name = 'margarine'" > store.txt
But when I saved it to txt I got:
3|margarine|2|QLD
4|margarine|2|NSW
However, my desired output in the txt is:
3,margarine,2,QLD
4,margarine,2,NSW
I think I should use sed but I'm not quite sure how to do it. I tried:
| sed 's/|//g' | sed 's/|//g' | sed 's/^//g' | sed 's/$//g'
However, the result only erases the '|'; I'm not sure how to change it to ','.
Though you should really do this in SQL itself, as per your request you could use the following awk:
awk '{gsub(/\|/,",")} 1' Input_file
Or in sed:
sed 's#|#,#g' Input_file
In case you want to save the output into Input_file itself, use sed's -i.bak option: it will take a backup of Input_file and write the output into Input_file itself.
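For what it's worth, sqlite3 can also emit CSV directly, which avoids the post-processing entirely. A minimal sketch (same store.sqlite and query as above):

# -csv switches the output mode to comma-separated values
sqlite3 -csv store.sqlite "select id,item_name,number_of_store,store_location from store where item_name = 'margarine'" > store.txt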
I want to filter my table down to those sequences that span at least 80 base pairs (end - begin + 1 >= 80) and cover at least 80% of their total length (the base pairs left over should be <= 20% of the total length, where (end - begin + 1) + left = total length):
| query sequence | begin | end  | (left) |
| -------------- | ----- | ---- | ------ |
| D1             | 1     | 330  | (1939) |
| D2             | 2180  | 2269 | (0)    |
| D3             | 4     | 168  | (0)    |
| D4             | 1     | 1610 | (0)    |
| D5             | 1     | 402  | (84)   |
| D6             | 1     | 58   | (0)    |
| D7             | 1     | 79   | (0)    |
| D8             | 4     | 167  | (437)  |
| D9             | 310   | 478  | (214)  |
| D10            | 1     | 227  | (234)  |
| D11            | 2     | 604  | (141)  |
This is my awk code:
awk '{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}' | awk '$18 >= 0.8 {print $0}'
However, some sequences are not filtered according to the minimum 80 base pair rule or the 80% of total length rule. Where am I wrong?
The expected output:
| query sequence | begin | end  | (left) |
| -------------- | ----- | ---- | ------ |
| D2             | 2180  | 2269 | (0)    |
| D3             | 4     | 168  | (0)    |
| D4             | 1     | 1610 | (0)    |
| D5             | 1     | 402  | (84)   |
Column $8 (left) has parentheses around the numbers, therefore awk fails to interpret $8 as a number and uses 0 instead. Example: awk '{print $1+2}' <<< '(3)' prints 2 instead of 5.
You can extract the number inside the parentheses into a variable using left=$8; gsub(/[()]/,"",left).
By the way: No need for 2 awk scripts. You can do everything in one script:
awk '{left=$8; gsub(/[()]/,"",left); bp=$7-$6+1; tl=bp+left} bp>=80 && bp>0.8*tl'
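A quick sanity check of that one-liner (a sketch, assuming the real input file is plain whitespace-separated columns rather than the markdown table rendered above, so the fields here are $1..$4 instead of $6..$8):

printf 'D2 2180 2269 (0)\nD10 1 227 (234)\n' |
awk '{left=$4; gsub(/[()]/,"",left); bp=$3-$2+1; tl=bp+left} bp>=80 && bp>0.8*tl'
# prints only: D2 2180 2269 (0)   (D10 spans just 227 of 461 bp, under 80%)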
You might set a custom field separator so that $8 (and the other columns) contains just numbers, rather than digits wrapped in ( and ); i.e. replace
awk '{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}'
with
awk 'BEGIN{FS="[)[:space:](]+"}{print $0, $7-$6+1, $7+$8, ($7-$6+1)/($7+$8)}'
Explanation: treat any combination of ), whitespace, and ( as the field separator (FS). Not tested due to lack of sample input as text. If you want to know more about FS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR.
I have a huge CSV with this structure (sample):
| DATE   | WEEKDAY   | Shop Code | Shop Manager | Item Presentation Time | Item Sell |
| 02-Mar | MONDAY    | BOG       | Tom          | 1030                   | 0         |
| 02-Mar | TUESDAY   | TEF       | Lucas        | 1300                   | 1         |
| 02-Mar | WEDNESDAY | TDC       | Eriberto     | 1300                   | 1         |
| 02-Mar | THURSDAY  | TEF       | Lucas        | 1300                   | 1         |
| 02-Mar | FRIDAY    | TEF       | Lucas        | 1300                   | 1         |
| 02-Mar | SATURDAY  | GTY       | Maya         | 1600                   | 1         |
| 02-Mar | SUNDAY    | TDC       | Eriberto     | 1300                   | 1         |
I am interested in the sum of successful events ($6) per weekday, the count of presentations per weekday ($2), and the percentage of successful events (sum of $6 / count of $2 * 100).
I wrote the following script:
#!/bin/awk -f
BEGIN {FS = OFS = ","}
NR != 1 {a[$2] += $6; count[$2]++}
END {for (i in a) print i, a[i], count[i], a[i]/count[i]*100}
The script runs:
$ awk -f script.awk raw_file.csv > new_file.csv
It works out perfectly and the output is:
| MONDAY    |  2 |  10 | 0.20 |
| TUESDAY   | 18 |  30 | 0.60 |
| WEDNESDAY | 10 |  20 | 0.50 |
| THURSDAY  |  1 |  20 | 0.05 |
| FRIDAY    |  1 |  15 | 0.07 |
| SATURDAY  | 60 | 100 | 0.60 |
| SUNDAY    | 47 |  80 | 0.59 |
However, I would like to add a header row to the output (WEEKDAY, SUCCESSFUL_EVENTS, TOTAL_EVENTS and SUCCESSFUL_RATE). I have no idea how to fit the NR handling for the header into the same script.
I can show the output with:
awk 'NR==1 {print "WEEKDAY","SUCCESSFUL_EVENTS","TOTAL_EVENTS","SUCCESSFUL_RATE"} {print $0}' new_file.csv
but I see no way to integrate this into the script.
Any suggestion is really appreciated
You can do this in the BEGIN section of your script:
#!/bin/awk -f
BEGIN {
FS = OFS = ","
print "WEEKDAY", "SUCCESSFUL_EVENTS", "TOTAL_EVENTS", "SUCCESSFUL_RATE"
}
# ...
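Putting it together with the aggregation logic you already have, the whole file would look something like this (a sketch):

#!/bin/awk -f
BEGIN {
    FS = OFS = ","
    # the header is printed once, before any input is read
    print "WEEKDAY", "SUCCESSFUL_EVENTS", "TOTAL_EVENTS", "SUCCESSFUL_RATE"
}
NR != 1 {a[$2] += $6; count[$2]++}
END {for (i in a) print i, a[i], count[i], a[i]/count[i]*100}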
I have a FASTA file temp_mart.txt like this:
ENSG00000100219|ENST00000005082
MTLLTFRDVAIEFSLEEWKCLDLAQQNLYRDVMLENYRNLFSVGLTVCKPGL
And I tried to load it into a SQL table using:
load data local infile '~/Desktop/temp_mart.txt' into table mart
But instead of getting an output like this:
+-----------------+------+------+
| ENSG | ENST | 3UTR |
+-----------------+------+------+
| >ENSG0000010021 | ENST00000005082 | MTLLTFRDVAIEFSL |
I get this:
+-----------------+------+------+
| ENSG | ENST | 3UTR |
+-----------------+------+------+
| >ENSG0000010021 | NULL | NULL |
| MTLLTFRDVAIEFSL | NULL | NULL |
| EPWNVKRQEAADGHP | NULL | NULL |
| DKFTAMSSHFTQDLL | NULL | NULL |
Everything seems to go into the first column. What is the best way to load it into the table so it presents as intended? Do I need to convert it into a CSV file first?
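One plausible route (a sketch, not from the original thread): LOAD DATA reads one row per input line, so the two-line FASTA records need flattening first. Assuming each record is exactly one header line (two IDs separated by |) followed by one sequence line, awk can reshape it into CSV:

awk -F'|' '
    NR % 2 == 1 {ensg = $1; enst = $2; sub(/^>/, "", ensg); next}  # header line: two IDs, drop a leading > if present
    {print ensg "," enst "," $0}                                  # sequence line
' temp_mart.txt > temp_mart.csv

and then the load can name the field terminator:

load data local infile '~/Desktop/temp_mart.csv' into table mart
fields terminated by ',';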
I have a hive table as below:
+----+---------------+-------------+
| id | name | partnership |
+----+---------------+-------------+
| 1 | sachin sourav | first |
| 2 | sachin sehwag | first |
| 3 | sourav sehwag | first |
| 4 | sachin_sourav | first |
+----+---------------+-------------+
In this table I need to replace strings such as "sachin" with "ST" and "sourav" with "SG". I am using the following query, but it does not solve the problem.
Query:
select
*,
case
when name regexp('\\bsachin\\b')
then regexp_replace(name,'sachin','ST')
when name regexp('\\bsourav\\b')
then regexp_replace(name,'sourav','SG')
else name
end as newName
from sample1;
Result:
+----+---------------+-------------+---------------+
| id | name | partnership | newname |
+----+---------------+-------------+---------------+
| 4 | sachin_sourav | first | sachin_sourav |
| 3 | sourav sehwag | first | SG sehwag |
| 2 | sachin sehwag | first | ST sehwag |
| 1 | sachin sourav | first | ST sourav |
+----+---------------+-------------+---------------+
Problem: my intention is that when id = 1, the newName column should contain "ST SG"; that is, it should replace both strings.
You can nest the replaces:
select s.*,
replace(replace(s.name, 'sachin', 'ST'), 'sourav', 'SG') as newName
from sample1 s;
You don't need regular expressions, so just use replace().
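One caveat, in case the word-boundary behavior of your original query matters: plain replace() also rewrites substrings, so sachin_sourav (id = 4) would become ST_SG. If that row should stay untouched, chaining regexp_replace() with the same \b anchors is a possible alternative (a sketch):

select s.*,
       regexp_replace(regexp_replace(s.name, '\\bsachin\\b', 'ST'),
                      '\\bsourav\\b', 'SG') as newName
from sample1 s;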
I would like to read a CSV file from the shell as if it was an SQL Database table.
Is this possible without having to import the CSV file content into a SQL environment?
Maybe there is some kind of Linux-based tool that can work it out...
I know it sounds like a tricky question, but I'm trying to avoid installing a SQL server and such. I have some limitations.
Any clue?
There is also csvsql (part of csvkit)!
It can not only run SQL on a given CSV (converting it into SQLite behind the scenes), but also convert and insert it into one of many supported SQL databases!
Here is an example command (also in csvsql_CDs_join.sh):
csvsql --query 'SELECT CDTitle,Location,Artist FROM CDs JOIN Artists ON CDs.ArtistID=Artists.ArtistID JOIN Locations ON CDs.LocID = Locations.LocID' "$@"
showing how to join three tables (available in csv_inputs in csv_dbs_examples).
(Formatting with csvlook, also part of csvkit.)
Inputs
$ csvlook csv_inputs/CDs.csv
| CDTitle | ArtistID | LocID |
| -------- | -------- | ----- |
| CDTitle1 | A1 | L1 |
| CDTitle2 | A1 | L2 |
| CDTitle3 | A2 | L1 |
| CDTitle4 | A2 | L2 |
$ csvlook csv_inputs/Artists.csv
| ArtistID | Artist |
| -------- | ------- |
| A1 | Artist1 |
| A2 | Artist2 |
$ csvlook csv_inputs/Locations.csv
| LocID | Location |
| ----- | --------- |
| L1 | Location1 |
| L2 | Location2 |
csvsql
$ csvsql --query 'SELECT CDTitle,Location,Artist FROM CDs JOIN Artists ON CDs.ArtistID=Artists.ArtistID JOIN Locations ON CDs.LocID = Locations.LocID' "$@" | csvlook
Produces:
| CDTitle | Location | Artist |
| -------- | --------- | ------- |
| CDTitle1 | Location1 | Artist1 |
| CDTitle2 | Location2 | Artist1 |
| CDTitle3 | Location1 | Artist2 |
| CDTitle4 | Location2 | Artist2 |
Take a look at https://github.com/harelba/q, a Python tool for treating text as a database. By default it uses spaces to delimit fields, but the -d , parameter will allow it to process CSV files.
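For example (a sketch; the file name and column are hypothetical, and -H tells q that the file has a header row):

q -d , -H "SELECT weekday, COUNT(*) FROM ./sales.csv GROUP BY weekday"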
Alternatively you can import the CSV file into SQLite and then run SQL commands against it. This is scriptable, with a bit of effort.
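A minimal sketch of that route, using sqlite3's .import in CSV mode against an in-memory database (data.csv and the weekday column are placeholders):

# .import in csv mode creates table t, taking column names from the header row
sqlite3 :memory: -cmd '.mode csv' -cmd '.import data.csv t' \
  'SELECT weekday, COUNT(*) FROM t GROUP BY weekday'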