I want to split the string and keep the first word only - pandas

I have a dataframe which contains details of cars. I want to keep only the brand name and remove the model name.
I've tried using the str.split function to separate the car name. However, it gives me a list, and I'm not able to extract the first element from it.
splitted = df['CarName'].str.split(' ',1)
Expected result:
alfa-romero
Audi
VW
Actual result:
[alfa-romero, giulia]
[alfa-romero, stelvio]
[alfa-romero, Quadrifoglio]
[audi, 100 ls]
[audi, 100ls]

You can do this in two ways: as WeNYoBen explained in his comment, or by using extract against a list of brands.
df['brand'] = df['cars'].str.split(' ', n=1).str[0]
or
pattern = ['audi', 'alfa-romero']
df['brand_2'] = df['cars'].str.extract("(" + "|".join(pattern) + ")", expand=False)

Then you can do
splitted = df['CarName'].str.split(' ', n=1).str[0]

This could be achieved using pandas.Series.apply with str.split:
df['res'] = df['CarName'].apply(lambda x: str(x).split(' ')[0])

Related

Select first and last three strings from column for condition - SQL

My goal is to select all the columns that start and end with the same 3 strings as the first row.
In this case it was simple, since the CONCAT was equal to 'SCLMIA'
AND CONCAT(origin, destination) = 'SCLMIA'
AND ((flight_path LIKE '%SCL%' AND flight_path LIKE '%MIA%')
but now the difficulty is for multiple strings.
AND CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ','SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
AND (**here I need to replicate the same as above.**)
I read that it can be done with the functions SUBSTRING, LEFT and RIGHT, selecting the first and last three characters, but I don't know how to do it.
Tried with this, but failed:
AND (flight_path LIKE '%' + SUBSTR(flight_path,3, LENGTH(flight_path) - 4) + '%')
Note that this is part of a chain of conditions, which is why each line starts with AND.
Edit:
Image: Sample of data single path 'SCLMIA'
It's from BigQuery.
I think this is what you're trying to do:
SELECT *
FROM
flight_paths
WHERE
CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ', 'SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
AND LEFT(flight_path, 3) = origin
AND RIGHT(flight_path, 3) = destination
Here's a db-fiddle that demonstrates the answer:
https://www.db-fiddle.com/f/vUZ4HL4NC9xaBBZpwTYNcR/0
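For anyone who wants to try the idea locally, here is a minimal sketch in Python using the standard-library sqlite3 module. The sample rows and the 'SCL-LIM-MIA' path format are assumptions made for illustration, not the asker's data. SQLite has no LEFT or RIGHT, so substr stands in for them; in BigQuery the LEFT/RIGHT form would be used instead.

```python
import sqlite3

# Hypothetical sample rows: (origin, destination, flight_path)
rows = [
    ("SCL", "MIA", "SCL-LIM-MIA"),  # starts with origin, ends with destination
    ("SCL", "MIA", "LIM-SCL-MIA"),  # contains SCL but does not start with it
    ("LIM", "CUZ", "LIM-CUZ"),
    ("BOG", "MDE", "BOG-MDE-BOG"),  # ends with the origin, not the destination
    ("EZE", "MIA", "EZE-MIA"),      # origin/destination pair not in the wanted list
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_paths (origin TEXT, destination TEXT, flight_path TEXT)"
)
conn.executemany("INSERT INTO flight_paths VALUES (?, ?, ?)", rows)

wanted = ["SCLMIA", "LIMCUZ", "BOGMDE"]
placeholders = ", ".join("?" for _ in wanted)

# substr(x, 1, 3) plays the role of LEFT(x, 3); substr(x, -3) of RIGHT(x, 3).
query = f"""
    SELECT flight_path
    FROM flight_paths
    WHERE origin || destination IN ({placeholders})
      AND substr(flight_path, 1, 3) = origin
      AND substr(flight_path, -3) = destination
    ORDER BY flight_path
"""
matching = [row[0] for row in conn.execute(query, wanted)]
print(matching)
```

Comparing the first and last three characters against the row's own origin and destination columns avoids writing one LIKE pair per route.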

multiple columns from single column in python

I am trying to split a column into multiple columns
The column has values like this:
message
------------
time=15:45:19 devname="FG3H0E3917903319" devid="FG3H0E3917903319"
logid="1059028705" type="utm" subtype="app-ctrl" eventtype="app-ctrl-all"
level="warning" vd="root" eventtime=1564226119 appid=16009
srcip=172.24.208.2 dstip=93.184.221.240 srcport=4832 dstport=80
srcintf="LAN-RahaNet" srcintfrole="lan" dstintf="WAN-RahaNet"
dstintfrole="lan" proto=6 service="HTTP" direction="outgoing" policyid=43
sessionid=493024483 applist="LanAppControl" appcat="Update"
app="MS.Windows.Update" action="block"
hostname="www.download.windowsupdate.com" incidentserialno=1522726002
url="/msdownload/update/v3/static/trustedr/en/authrootseq.txt" msg="Update:
MS.Windows.Update," apprisk="elevated"
Basically I need to split this column into:
time devname devid ...
--------------------------------------------------------------
15:45:19 FG3H0E3917903319 FG3H0E3917903319 ...
Short answer:
Split the message on spaces to get a list of key-value pairs.
Split every key-value pair on the = sign.
Add each value to the column named by its key.
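The steps above can be sketched as follows. One caveat: some quoted values in the sample contain spaces (for example msg="Update: MS.Windows.Update,"), so a plain split on spaces would cut them in two. This sketch uses shlex.split from the standard library instead, which keeps a double-quoted value together and drops the quotes. The function name parse_message is made up for illustration, and the sketch assumes every value is either unquoted or wrapped in double quotes.

```python
import shlex


def parse_message(message: str) -> dict:
    """Turn 'key=value key2="quoted value"' text into a dict."""
    # shlex.split honours double quotes, so a quoted value that contains
    # a space stays in one token and the surrounding quotes are removed.
    tokens = shlex.split(message)
    record = {}
    for token in tokens:
        if "=" not in token:
            continue  # skip anything that is not a key=value pair
        key, value = token.split("=", 1)  # split on the first '=' only
        record[key] = value
    return record


sample = (
    'time=15:45:19 devname="FG3H0E3917903319" proto=6 '
    'msg="Update: MS.Windows.Update," apprisk="elevated"'
)
record = parse_message(sample)
print(record["time"], record["devname"], record["msg"])
```

With pandas, applying such a function to the message column and passing the resulting list of dicts to pd.DataFrame should give one column per key, with missing keys left as NaN.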

Putting output from sql query into another query using R environment

I am wondering what approach I should use to do what the title describes. I am using an ODBC connection, and the first SQL query returns about 40-50 rows in one column. I want to use this output as the values to search for in a second query.
How should I treat this? As an array or as separate variables? I still do not know R well, so I just need to know where to look.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and after these operations the characters look like this:
'c(1234567890, , 1234567890, 1234567890)'
Almost everything looks fine and fits into my query, apart from the extra c character at the beginning and the missing apostrophes. I tried to use the gsub function, but it did not work the way I wanted.
You can likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If you have a space as the first or last character of an element, you can use this code:
list<-c(" first element is a space","try","this","last element is a space ") #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think you are expecting something like the code shown below:
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"
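The answers above build the IN list by pasting values into the SQL text. An alternative is to generate one placeholder per value and let the driver bind the values, which avoids the quoting problems entirely. Below is a minimal sketch of that two-step pattern, written in Python with the standard-library sqlite3 module rather than R and ODBC; the table and column names (first_table, result_table, inv, amount) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_table (inv TEXT)")
conn.execute("CREATE TABLE result_table (inv TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO first_table VALUES (?)",
    [("1234567890",), ("2345678901",)],
)
conn.executemany(
    "INSERT INTO result_table VALUES (?, ?)",
    [("1234567890", 10), ("2345678901", 20), ("9999999999", 30)],
)

# Step 1: the first query returns one column of values.
ids = [row[0] for row in conn.execute("SELECT inv FROM first_table")]

# Step 2: build one "?" per value and let the driver bind them,
# instead of pasting quoted values into the SQL text.
placeholders = ", ".join("?" for _ in ids)
sql = (
    f"SELECT inv, amount FROM result_table "
    f"WHERE inv IN ({placeholders}) ORDER BY inv"
)
result = conn.execute(sql, ids).fetchall()
print(result)
```

Only the placeholder string is interpolated into the query; the values themselves never become part of the SQL text.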

Django select only rows with duplicate field values

Suppose we have a model in Django defined as follows:
class Literal(models.Model):
    name = models.CharField(...)
    ...
Name field is not unique, and thus can have duplicate values. I need to accomplish the following task:
Select all rows from the model that have at least one duplicate value of the name field.
I know how to do it using plain SQL (maybe not the best solution):
select * from literal where name IN (
    select name from literal group by name having count(name) > 1
);
So, is it possible to select this using the Django ORM? Or is there a better SQL solution?
Try:
from django.db.models import Count
(Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1))
This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:
dupes = (Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1))
Literal.objects.filter(name__in=[item['name'] for item in dupes])
This was rejected as an edit. So here it is as a better answer
dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)
This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:
Literal.objects.filter(name__in=dups)
The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.
try using aggregation
Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)
In case you use PostgreSQL, you can do something like this:
from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value
duplicate_ids = (Literal.objects.values('name')
    .annotate(ids=ArrayAgg('id'))
    .annotate(c=Func('ids', Value(1), function='array_length'))
    .filter(c__gt=1)
    .annotate(ids=Func('ids', function='unnest'))
    .values_list('ids', flat=True))
It results in this rather simple SQL query:
SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1
OK, so for some reason none of the above worked for me; it always returned <MultilingualQuerySet []>. I use the following solution, which is much easier to understand but not as elegant:
dupes = []
uniques = []
dupes_query = MyModel.objects.values_list('field', flat=True)
# Iterate over every value, not over set(dupes_query): a set would
# contain each value once, so nothing would ever be seen twice.
for dupe in dupes_query:
    if dupe not in uniques:
        uniques.append(dupe)
    else:
        dupes.append(dupe)
print(set(dupes))
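The same in-Python counting can be done with collections.Counter from the standard library. This is a sketch only: the plain list below stands in for the result of a values_list(..., flat=True) call, and like the loop above it pulls every value into Python, so it suits small tables.

```python
from collections import Counter

# Stand-in for list(MyModel.objects.values_list('field', flat=True))
values = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]

# Count how often each value occurs, then keep those seen more than once.
counts = Counter(values)
dupes = {value for value, n in counts.items() if n > 1}
print(sorted(dupes))
```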
If you want the result to be only a list of names rather than objects, you can use the following query:
repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat=True)

Using SQLDF to select specific values from a column

SQLDF newbie here.
I have a data frame which has about 15,000 rows and 1 column.
The data looks like:
cars
autocar
carsinfo
whatisthat
donnadrive
car
telephone
...
I wanted to use the package sqldf to loop through the column and
pick all values which contain "car" anywhere in their value.
However, the following code generates an error.
> sqldf("SELECT Keyword FROM dat WHERE Keyword="car")
Error: unexpected symbol in "sqldf("SELECT Keyword FROM dat WHERE Keyword="car"
There is no unexpected symbol, so I'm not sure what's wrong.
First, I want to know all the values which contain 'car'.
Then I want to know only those values which contain just 'car' by itself.
Can anyone help?
EDIT:
All right, there was an unexpected symbol, but now it only gives me just car and not every
row which contains 'car'.
> sqldf("SELECT Keyword FROM dat WHERE Keyword='car'")
Keyword
1 car
Using = will only return exact matches.
You should probably use the like operator combined with the wildcards % or _. The % wildcard will match multiple characters, while _ matches a single character.
Something like the following will find all instances of car, e.g. "cars", "motorcar", etc:
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
And the following will match "cars" but not "car", because _ must match exactly one character:
sqldf("SELECT Keyword FROM dat WHERE Keyword like 'car_'")
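To see the difference between =, % and _ side by side, here is a small sketch in Python using the standard-library sqlite3 module (sqldf runs on SQLite by default, so the LIKE behaviour should be the same). The helper name keywords is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dat (Keyword TEXT)")
words = ["cars", "autocar", "carsinfo", "whatisthat",
         "donnadrive", "car", "telephone"]
conn.executemany("INSERT INTO dat VALUES (?)", [(w,) for w in words])


def keywords(where: str) -> list:
    """Return the Keyword values matching a WHERE clause, in insertion order."""
    sql = f"SELECT Keyword FROM dat WHERE {where} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql)]


exact = keywords("Keyword = 'car'")          # exact match only
anywhere = keywords("Keyword LIKE '%car%'")  # 'car' anywhere in the value
one_more = keywords("Keyword LIKE 'car_'")   # 'car' plus exactly one character

print(exact)
print(anywhere)
print(one_more)
```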
This has nothing to do with sqldf; your SQL statement is the problem. You need:
dat <- data.frame(Keyword=c("cars","autocar","carsinfo",
"whatisthat","donnadrive","car","telephone"))
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
# Keyword
# 1 cars
# 2 autocar
# 3 carsinfo
# 4 car
You can also use regular expressions to do this sort of filtering. grepl returns a logical vector (TRUE / FALSE) stating whether or not there was a match or not. You can get very sophisticated to match specific items, but a basic query will work in this case:
# Using @Joshua's dat data.frame
subset(dat, grepl("car", Keyword, ignore.case = TRUE))
Keyword
1 cars
2 autocar
3 carsinfo
6 car
Very similar to the solution provided by @Chase. Because we do not use subset, we do not need a logical vector and can use either grep or grepl:
df <- data.frame(keyword = c("cars", "autocar", "carsinfo", "whatisthat", "donnadrive", "car", "telephone"))
df[grep("car", df$keyword), , drop = FALSE] # or
df[grepl("car", df$keyword), , drop = FALSE]
keyword
1 cars
2 autocar
3 carsinfo
6 car
I took the idea from Selecting rows where a column has a string like 'hsa..' (partial string match)