Splitting elements in a column by hyphen - pandas

I have a table as shown below:
cell_id  area
a-1-2-0  34
a-1-2-1  42
a-1-2-2  45
a-1-2-3  42
b-1-5-0  47
b-1-5-1  40
I want to convert it to this one to make groups while splitting it for test and train sets:
cell_id  area
a-1-2    34
a-1-2    42
a-1-2    45
a-1-2    42
b-1-5    47
b-1-5    40
for i in range(df.shape[0]):
    # split on hyphens and keep only the first three parts
    k = df['cell_id'][i].split("-")
    l = "{}-{}-{}".format(k[0], k[1], k[2])
    df.loc[i, 'cell_id'] = l
I used the code above, but it takes a long time, and I wonder if there is a faster way of doing this. Thanks in advance.
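One possible approach, sketched here in plain Python rather than taken from an answer in this thread: splitting once from the right drops the trailing field without rebuilding the string from its parts. The helper name `drop_last_field` is made up for illustration. pandas exposes the same operation on a whole column through `Series.str.rsplit`, shown in the final comment, which avoids the row-by-row loop.

```python
def drop_last_field(cell_id, sep="-"):
    """Return cell_id without its final sep-separated field."""
    # rsplit with maxsplit=1 splits only once, starting from the right
    return cell_id.rsplit(sep, 1)[0]


cell_ids = ["a-1-2-0", "a-1-2-1", "a-1-2-2", "a-1-2-3", "b-1-5-0", "b-1-5-1"]
groups = [drop_last_field(c) for c in cell_ids]
print(groups)

# Assumed pandas equivalent for a DataFrame named df (vectorised, no Python loop):
# df["cell_id"] = df["cell_id"].str.rsplit("-", n=1).str[0]
```

The resulting group labels can then be passed to whatever group-aware train/test splitter you use.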

Related

Display rows where multiple columns are different

I have data that looks like this. Thousands of rows returned, but this is just a sample.
Most days have the same numbers in them, but some do not. Note that ID 1 and 5 have identical numbers every day.
ID  Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
1   26      26      26       26         26        26      26
2   44      44      30       30         44        44      44
3   55      55      55       55         80        90      55
4   12      12      43       43         43        43      43
5   36      36      36       36         36        36      36
I'd like to only return rows where the days of the week have different numbers.
In this case, the only IDs returned should be 2, 3 & 4.
What would I want this query to look like?
Thanks!
One idea that should work in most RDBMS (with some syntax tweaks) is the following.
This is SQL Server compatible: pivot the days into rows and count the distinct values and filter accordingly:
select id
from t
cross apply (
    select Count(distinct d)
    from (
        values (sunday),(monday),(tuesday),(wednesday),(thursday),(friday),(saturday)
    ) d(d)
) d(v)
where d.v > 1
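To see the same idea outside the database, here is a small Python sketch using the sample rows from the question. It is only an illustration of the "count distinct values per row and keep rows with more than one" logic; the function name `ids_with_varying_days` is invented for this example.

```python
# Each tuple is (id, sunday, monday, tuesday, wednesday, thursday, friday, saturday)
rows = [
    (1, 26, 26, 26, 26, 26, 26, 26),
    (2, 44, 44, 30, 30, 44, 44, 44),
    (3, 55, 55, 55, 55, 80, 90, 55),
    (4, 12, 12, 43, 43, 43, 43, 43),
    (5, 36, 36, 36, 36, 36, 36, 36),
]


def ids_with_varying_days(rows):
    """Return the ids whose day columns hold more than one distinct value."""
    # set(row[1:]) plays the role of count(distinct d) over the seven day values
    return [row[0] for row in rows if len(set(row[1:])) > 1]


print(ids_with_varying_days(rows))
```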

Create range bins in hive for histograms

I have a data set which contains stud_id and the students' ages. I want the ages arranged into ranges (bins) with a bucket size of 10.
stud_id ages
101 11
102 13
103 21
104 25
I have similar data for many more records, and all of it has to be arranged with a bin size of 10.
The Expected output is:
stud_id ages_bin
101 11-20
102 11-20
103 21-30
104 21-30
I tried a simple CASE statement in Hive.
select stud_id,
       case when ages between 0 and 10 then '0-10'
            when ages between 11 and 20 then '11-20'
            when ages between 21 and 30 then '21-30'
            when ages between 31 and 40 then '31-40'
            when ages between 41 and 50 then '41-50'
            when ages between 51 and 60 then '51-60'
            when ages between 61 and 70 then '61-70'
            when ages between 71 and 80 then '71-80'
            when ages between 81 and 90 then '81-90'
            when ages between 91 and 100 then '91-100'
            when ages between 101 and 110 then '101-110'
            when ages between 111 and 120 then '111-120'
            when ages between 121 and 130 then '121-130'
            when ages between 131 and 140 then '131-140'
            when ages between 141 and 150 then '141-150'
            else NULL end as ages_bin
from students
Is there a simpler way to bin the data with a bucket size of 10?
Can someone help me write simpler code for this?
There's one simple method to arrange the range of bins for histogram. Here is the code:
select stud_id,floor((ages)/10)*10 as strt_range,
floor((ages)/10)*10+9 as end_range from students
This produces the following output:
stud_id strt_range end_range
101 10 19
102 10 19
103 20 29
104 20 29
Try this. It should give you the bins as a single 'start-end' string:
select stud_id, concat(cast(floor((ages)/10)*10 as string),'-',
cast(floor((ages)/10)*10+9 as string)) from students
To get a histogram out of this, group by the bin and order the result appropriately.
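Note that floor(ages/10)*10 yields bins such as 10-19 and 20-29, while the expected output in the question uses 11-20 and 21-30. The arithmetic for both variants can be sketched in Python as follows; the function names are made up for this example, and the second variant assumes ages of 1 or more (its first bin is 1-10, not the 0-10 used in the CASE statement).

```python
def bin_from_zero(age, size=10):
    """Bins aligned at multiples of size: 10-19, 20-29, ..."""
    start = (age // size) * size
    return f"{start}-{start + size - 1}"


def bin_from_one(age, size=10):
    """Bins shifted by one: 11-20, 21-30, ... (assumes age >= 1)."""
    # subtracting 1 before the division moves each boundary up by one
    start = ((age - 1) // size) * size + 1
    return f"{start}-{start + size - 1}"


ages = {101: 11, 102: 13, 103: 21, 104: 25}
for stud_id, age in ages.items():
    print(stud_id, bin_from_zero(age), bin_from_one(age))
```

In Hive, the shifted variant would correspond to replacing floor((ages)/10)*10 with floor((ages-1)/10)*10+1 and adding 9 to that for the end of the range.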

Alphanumeric Sorting in PostgreSQL 9.4

I have a table in a PostgreSQL 9.4 database in which one of the columns contains values that mix integers and letters, in the following format.
1
10
10A
10A1
1A
1A1
1A1A
1B
1C
1C1
2
65
89
The format is: each value starts with a number, then a letter, then a number, then a letter, and so on. I want to sort the field like below,
1
1A
1A1
1A1A
1B
1C
1C1
2
10
10A
10A1
65
89
But with a plain sort, 10 comes before 2. Please suggest a possible query to obtain the desired result.
Thanks in advance
Try this:
SELECT *
FROM table_name
ORDER BY (substring(column_name, '^[0-9]+'))::int                -- leading digits, cast to integer
        ,coalesce(substring(column_name, '[^0-9_].*$'), '')     -- rest of the value, '' if none
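The two-part ordering can be illustrated with a small Python sketch: sort first by the leading number as an integer, then by whatever text follows it. This mirrors the idea of the query rather than reproducing it exactly; the function name `sort_key` is invented here, and it assumes every value starts with at least one digit, as the question states. The remainder is still compared as plain text, so a value like 1A10 would sort before 1A2.

```python
import re


def sort_key(value):
    """Split a value into (leading integer, remaining text) for sorting."""
    match = re.match(r"\d+", value)
    if match is None:
        # assumption from the question: every value begins with a number
        raise ValueError(f"value does not start with a digit: {value!r}")
    return int(match.group()), value[match.end():]


values = ["1", "10", "10A", "10A1", "1A", "1A1", "1A1A",
          "1B", "1C", "1C1", "2", "65", "89"]
print(sorted(values, key=sort_key))
```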

transpose column to row oracle

I have a query that returns values in this form (the query returns more than 50 columns).
1-99transval 100-200transval 200-300transval ... 1-99nontransval 100...
50 90 80 67 58
That is one row of values. I want these columns converted into rows, taking the following shape:
Range Transval NonTransval
1-99 50 67
100-200 90 58
In pure SQL, it will need a lot of coding because you will have to manually put the range as there is no relation between the values and the range at all. Had there been a relationship, you could use CASE expression and build the range dynamically.
WITH DATA AS
  (SELECT 50 "1-99transval",
          90 "100-200transval",
          80 "200-300transval",
          67 "1-99nontransval",
          58 "100-200nontransval",
          88 "200-300nontransval"
   FROM dual
  )
SELECT '1-99' range,
       "1-99transval" transval,
       "1-99nontransval" nontransval
FROM DATA
UNION
SELECT '100-200' range,
       "100-200transval",
       "100-200nontransval" nontransval
FROM DATA
UNION
SELECT '200-300' range,
       "200-300transval",
       "200-300nontransval" nontransval
FROM DATA;
RANGE TRANSVAL NONTRANSVAL
------- ---------- -----------
1-99 50 67
100-200 90 58
200-300 80 88
From Oracle Database 11g Release 1 onwards, you could use UNPIVOT:
WITH DATA AS
  (SELECT 50 "1-99transval",
          90 "100-200transval",
          80 "200-300transval",
          67 "1-99nontransval",
          58 "100-200nontransval",
          88 "200-300nontransval"
   FROM dual
  )
SELECT *
FROM DATA
UNPIVOT( (transval,nontransval)
  FOR RANGE IN ( ("1-99transval","1-99nontransval") AS '1-99'
                ,("100-200transval","100-200nontransval") AS '100-200'
                ,("200-300transval","200-300nontransval") AS '200-300'));
RANGE TRANSVAL NONTRANSVAL
------- ---------- -----------
1-99 50 67
100-200 90 58
200-300 80 88
In your case, replace the WITH clause above with your existing query as a sub-query. You will also need to include your other columns in the UNION.
In PL/SQL, you could (ab)use EXECUTE IMMEDIATE and get the "range" by extracting the column names in dynamic sql.
Although, it would be much better to modify/rewrite your existing query which you have not shown yet.
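As a rough illustration of deriving the range from the column names instead of typing each one by hand, here is a Python sketch that regroups one result row. It is not Oracle code; it assumes the row has already been fetched as a mapping of column name to value and that every column name ends in either transval or nontransval, as in the sample. The function name `unpivot_row` is made up for this example.

```python
row = {
    "1-99transval": 50, "100-200transval": 90, "200-300transval": 80,
    "1-99nontransval": 67, "100-200nontransval": 58, "200-300nontransval": 88,
}


def unpivot_row(row):
    """Turn one wide row into (range, transval, nontransval) tuples."""
    grouped = {}
    for column, value in row.items():
        # check the longer suffix first, since "nontransval" also ends in "transval"
        for suffix in ("nontransval", "transval"):
            if column.endswith(suffix):
                range_label = column[: -len(suffix)]
                grouped.setdefault(range_label, {})[suffix] = value
                break
    return [(label, vals.get("transval"), vals.get("nontransval"))
            for label, vals in grouped.items()]


for line in unpivot_row(row):
    print(line)
```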
If you are using Oracle 11g, you can use the UNPIVOT feature.
CREATE TABLE DATA AS
SELECT 50 "1-99transval",
       90 "100-200transval",
       80 "200-300transval",
       67 "1-99nontransval",
       58 "100-200nontransval",
       88 "200-300nontransval"
FROM dual;
SELECT *
FROM DATA
UNPIVOT( (Transval,NonTransval) FOR Range IN ( ("1-99transval","1-99nontransval") as '1-99'
                                              ,("100-200transval","100-200nontransval") as '100-200'
                                              ,("200-300transval","200-300nontransval") as '200-300'));
http://sqlfiddle.com/#!4/c9747/3/0

select previous row value for same user (multiple records)

I have a query in Access 2010 (have also tried on 2013, same result) that is working but not perfectly for all records. I'm wondering if anyone knows what is causing the error.
Here is the query (adapted from http://allenbrowne.com/subquery-01.html#AnotherRecord):
SELECT t_test_table.individ,
       t_test_table.test_date,
       t_test_table.score1,
       (SELECT top 1 Dupe.score1
        FROM t_test_table AS Dupe
        WHERE Dupe.individ = t_test_table.individ
          AND Dupe.test_date < t_test_table.test_date
        ORDER BY Dupe.primary DESC, Dupe.individ
       ) AS PriorValue,
       [score1]-[priorvalue] AS scorechange
FROM t_test_table;
The way the data is set up, an individual has multiple records in the file (designated by individ), one for each date a test was taken. Each combination of date and individ is unique: you can only take a test once on a given date. [primary] refers to the primary key column, which I added because individ cannot serve as a primary key when it repeats (I'm not showing it here to save space).
The goal of the above code was to create the following:
individ test_date score1 PriorValue scorechange
1 3/1/2013 40
1 6/4/2013 51 40 11
1 7/25/2013 55 51 4
1 12/13/2013 59 55 4
5 8/29/2009 39
5 12/9/2009 47 39 8
5 6/1/2010 58 47 11
5 8/28/2010 42 58 -16
5 12/15/2010 51 42 9
Here is what I actually got. You can see that for individ 1, it winds up taking the first score rather than the previous score for each subsequent record. For individ 5, it kind of works, but the final priorvalue should be 42 and not 58.
individ test_date score1 PriorValue scorechange
1 3/1/2012 40
1 6/4/2012 51 40 11
1 7/25/2012 55 40 15
1 12/13/2012 59 40 19
5 8/29/2005 39
5 12/9/2005 47 39 8
5 6/1/2006 58 47 11
5 8/28/2006 42 58 -16
5 12/15/2006 51 58 -7
Does anyone have any ideas about what went wrong here? In other records it works perfectly, but I can't determine what is causing some records to fail to take the previous value. Any help is appreciated, and let me know if you require additional information.
To get the most recent test for a given individ, you'll need to include a sort by date. In your inner query, replace
ORDER BY Dupe.primary DESC, Dupe.individ
with
ORDER BY Dupe.test_date DESC
It's hard to say exactly what effect sorting by primary has, since you haven't told us how you're generating the values of primary. If the combination of individ and test_date is guaranteed to be unique, you might want to consider making the two of them into your primary key instead of creating a new thing. The Dupe.individ in the ORDER BY line has no effect, since your WHERE clause already limited the results of the inner query to one individ.
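The effect of ordering the inner query by date can be checked with a small sketch using Python's built-in sqlite3 module. This is SQLite rather than Access, so LIMIT 1 stands in for TOP 1 and the dates are stored as ISO text so that they compare correctly; the rows are the ones from the expected output in the question, inserted out of date order on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_test_table (individ INTEGER, test_date TEXT, score1 INTEGER)"
)
# Inserted out of date order so insertion order cannot be mistaken for date order
conn.executemany(
    "INSERT INTO t_test_table VALUES (?, ?, ?)",
    [
        (1, "2013-07-25", 55), (1, "2013-03-01", 40),
        (1, "2013-12-13", 59), (1, "2013-06-04", 51),
        (5, "2010-08-28", 42), (5, "2009-08-29", 39),
        (5, "2010-12-15", 51), (5, "2009-12-09", 47), (5, "2010-06-01", 58),
    ],
)

query = """
SELECT t.individ, t.test_date, t.score1,
       (SELECT d.score1
        FROM t_test_table AS d
        WHERE d.individ = t.individ
          AND d.test_date < t.test_date
        ORDER BY d.test_date DESC   -- most recent earlier test first
        LIMIT 1) AS prior_value
FROM t_test_table AS t
ORDER BY t.individ, t.test_date
"""
rows = conn.execute(query).fetchall()
for individ, test_date, score1, prior in rows:
    change = None if prior is None else score1 - prior
    print(individ, test_date, score1, prior, change)
```

Each individ's first test has no prior value, and every later row picks up the score from the immediately preceding date.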