adding element to series with simple '.' notation - pandas

x = pd.Series()
x.a = 1
x.a
>> 1
x.values
>> array([])
x.a can be retrieved by accessing it directly (x.a), but listing the elements of the series doesn't include 'a'.
Is there a way to get a list of elements that does include 'a'?

Dot notation is not a way to add a new element: x.a = 1 only sets an attribute on the Series object, not an entry in its index. Use .loc instead:
x.loc['a']=1
x
Out[53]:
a 1
dtype: int64
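The difference between the two assignments can be checked directly. The following is a minimal sketch, assuming a recent pandas version; the names s and extra are made up for illustration.

```python
import pandas as pd

# Start from an empty Series with an explicit dtype.
s = pd.Series(dtype="int64")

# .loc with a new label enlarges the Series: 'a' becomes part of the index.
s.loc["a"] = 1

# Dot assignment with a label that is not in the index only sets a plain
# Python attribute on the object; the data is left untouched.
s.extra = 1

labels = list(s.index)
print(labels)    # only the label added through .loc
print(s.extra)   # the attribute is still reachable, but it is not data
```

So elements meant to show up in s.index, s.values or s.to_dict() should be added with .loc (or s['a'] = 1), not with dot notation.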

Related

Calling for a column in a dataframe

I'm creating a dataframe (df) with two columns, PEA_0 and PEA_180. To input values in the cells of each column I could do this:
df$PEA_0 <- 1
df$PEA_180 <- 1
But since I'm expanding my df, I would like an easy way to change the number after "PEA_". I would like to assign the number to a variable so that I can change it on the fly later:
a<-0
b<-180
and run the code like this:
paste("df$PEA_", a, sep="") <- 1
I want R to understand that the code above means
df$PEA_0 <- 1
but R only sees "df$PEA_0" <- 1 and throws the error:
Error in paste("tilst$Death_", a) <- 1 :
target of assignment expands to non-language object
Any thoughts on how to get around this?

AttributeError: 'int' object has no attribute 'count' while using itertuples() method with dataframes

I am trying to iterate over rows in a pandas DataFrame using the itertuples() method, which works quite well for my case. Now I want to check whether a specific value ('x') is in a specific tuple. I used the count() method for that, as I need the number of occurrences of x later.
The weird part is that for some tuples this works just fine (e.g. in my case (namedtuple[7].count('x')) + (namedtuple[8].count('x'))), but for some (e.g. namedtuple[9].count('x')) I get an AttributeError: 'int' object has no attribute 'count'.
Would appreciate your help very much!
Apparently, some columns of your DataFrame are of object type (actually a string)
and some of them are of int type (more generally - numbers).
To count occurrences of x in each row, you should:
Apply a function to each row which:
checks whether the type of the current element is str,
if it is, return count('x'),
if not, return 0 (don't attempt to look for x in a number).
So far this function returns a Series, with a number of x in each column
(separately), so to compute the total for the whole row, this Series should
be summed.
Example of working code:
Test DataFrame:
C1 C2 C3
0 axxv bxy 10
1 vx cy 20
2 vv vx 30
Code:
for ind, row in df.iterrows():
    # count 'x' only in string cells; numeric cells contribute 0
    print(ind, row.apply(lambda it:
        it.count('x') if isinstance(it, str) else 0).sum())
(In my opinion, iterrows is more convenient here.)
The result is:
0 3
1 1
2 1
So as you can see, it is possible to count occurrences of x,
even when some columns are not strings.
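Since the question used itertuples(), the same type check can be applied there as well. This is a sketch only: the function name count_char_per_row is made up for illustration, and the DataFrame is the test data from above.

```python
import pandas as pd

# Test DataFrame from the answer: two string columns and one integer column.
df = pd.DataFrame({
    "C1": ["axxv", "vx", "vv"],
    "C2": ["bxy", "cy", "vx"],
    "C3": [10, 20, 30],
})

def count_char_per_row(frame, char="x"):
    """Return one count per row, skipping every cell that is not a string."""
    counts = []
    for row in frame.itertuples(index=False):
        counts.append(sum(v.count(char) for v in row if isinstance(v, str)))
    return counts

row_counts = count_char_per_row(df)
print(row_counts)  # [3, 1, 1]
```

The isinstance check is what avoids the AttributeError: an int cell is never asked for .count().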

modifying a dataframe by adding additional if statement column

I want to modify a data frame by adding an additional column based on an if statement.
I created several lists, namely East_Asia, Central_Asia, Central_America, South_America, Europe_East and Europe_West, and I want to add a conditional column based on an existing column, i.e. if japan is in Central_East, then the japan row in the added column should contain Central East, and so on.
df['native_region'] =df["native_country"].apply(lambda x: "Asia-East" if x in 'Asia_East'
"Central-Asia" elif x in "Central_Asia"
"South-America" elif x in "South_America"
"Europe-West" elif x in "Europe_West"
"Europe-East" elif x in "Europe_East"
"United-States" elif x in "United-States"
else "Outlying-US"
)
File "", line 2
"Central-Asia" elif x in "Central_Asia"
^
SyntaxError: invalid syntax
I might be wrong, but I think you're approaching the problem the wrong way around.
What you seem to be doing there is just replacing '_' with '-', which you can do with the following line:
df['native_region'] = df.native_country.str.replace('_', '-')
And then, in my experience, it's more understandable to work like this:
known_countries = ['Asia-East', 'Central-Asia', 'South-America', ...]
is_known = df['native_region'].isin(known_countries)
df.loc[~is_known, 'native_region'] = 'Outlying-US'
This would also work if you worked with lists of countries, like:
east_asia_countries = ['Japan', 'China', 'Korea']
isin_east_asia = df['native_country'].isin(east_asia_countries)
df.loc[isin_east_asia, 'native_region'] = 'East-Asia'
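When there are several region lists, a chain of conditions can be avoided by turning the lists into a single lookup table. The sketch below assumes the lists hold country names; the list contents and the sample rows are hypothetical, since the question does not show them.

```python
import pandas as pd

# Hypothetical region lists; the real ones come from the asker's data.
region_members = {
    "East-Asia": ["Japan", "China", "Korea"],
    "Europe-West": ["France", "Germany"],
}

# Invert the lists into a country -> region lookup table.
country_to_region = {
    country: region
    for region, countries in region_members.items()
    for country in countries
}

df = pd.DataFrame({"native_country": ["Japan", "France", "Guam", "China"]})

# map() looks each country up; countries missing from the table become NaN
# and are then given the fallback label.
df["native_region"] = (
    df["native_country"].map(country_to_region).fillna("Outlying-US")
)
print(df)
```

Adding another region then only means adding one more entry to region_members.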

detect sequence in hive column with lead function

I'm trying to detect a sequence in a column of my hive table. I have 3 columns
(id, label, index). Each id has a sequence of labels and index is the ordering of the labels, like
id label index
a x 1
a y 2
a x 3
a y 4
b x 1
b y 2
b y 3
b y 4
b x 5
b y 6
I want to identify if the label sequence of x,y,x,y occurs.
I was thinking of trying a lead function to accomplish this like:
select id, index, label,
lead( label, 1) over (partition by id order by index) as l1_fac,
lead( label, 2) over (partition by id order by index) as l2_fac,
lead( label, 3) over (partition by id order by index) as l3_fac
from mytable
yields:
id index label l1_fac l2_fac l3_fac
a 1 x y x y
a 2 y x y NULL
a 3 x y NULL NULL
a 4 y NULL NULL NULL
b 1 x y y y
b 2 y y y x
b 3 y y x y
b 4 y x y NULL
b 5 x y NULL NULL
where l1_fac, l2_fac and l3_fac are the next three label values. Then I could check for a pattern with
where label = l2_fac and l1_fac = l3_fac
This will work for id = a, but not for id = b, where the label sequence is x, y, y, y, x, y. I don't care that there were 3 y's in a row; I am just interested in the fact that it went from x to y to x to y.
I'm not sure if this is possible. I was trying a combination of group by and partition, but without success.
I answered this question where the OP wanted to collect items into a list and remove any repeating items. I think this is essentially what you want to do. It would extract actual xyxy sequences and would also account for your second example, where xyxy occurs but is clouded by two extra y's. You need to collect the label column into an array using this UDAF (this will preserve the order), then apply the UDF I referenced, then use concat_ws to turn the contents of the array into a string, and lastly check that string for the occurrence of your desired sequence. The function instr returns the location of the first occurrence, or zero if it never finds the string.
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;
create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
-- instr(str, substr): search for 'xyxy' inside the collapsed label string
select id, label_string, instr(label_string, 'xyxy') str_flg
from (
  select id, concat_ws('', no_dups) label_string
  from (
    select id, remove_seq_dups(label_array) no_dups
    from (
      select id, collect(label) label_array
      from db.table
      group by id ) x
  ) y
) z
Output:
id label_string str_flg
============================
a xyxy 1
b xyxy 1
A better alternative might be to simply collect label with the UDF, make it a string, and then regex out the sequence xyxy but I'm pretty terrible at regex so possibly someone else can comment intelligently on this.
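The logic of the query (collect in order, drop sequential duplicates, join, search) can be checked outside Hive. The following Python sketch only illustrates that logic on the sample rows from the question; it is not a replacement for the UDFs, and the function name find_pattern is made up.

```python
from itertools import groupby

# (id, label, index) rows copied from the question's sample table.
rows = [
    ("a", "x", 1), ("a", "y", 2), ("a", "x", 3), ("a", "y", 4),
    ("b", "x", 1), ("b", "y", 2), ("b", "y", 3), ("b", "y", 4),
    ("b", "x", 5), ("b", "y", 6),
]

def find_pattern(rows, pattern="xyxy"):
    """Map each id to the 1-based position of pattern in its collapsed
    label string, or 0 if the pattern never occurs (like instr)."""
    result = {}
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # by id, then index
    for key, group in groupby(ordered, key=lambda r: r[0]):
        labels = [label for _, label, _ in group]
        # groupby on the labels collapses runs of equal neighbours: xyyyxy -> xyxy
        collapsed = "".join(k for k, _ in groupby(labels))
        result[key] = collapsed.find(pattern) + 1
    return result

flags = find_pattern(rows)
print(flags)  # {'a': 1, 'b': 1}
```

Both ids are flagged, because the run of three y's for id b collapses to a single y before the search.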

dimple.js aggregating non-numerical values

Say I have a CSV containing exactly one column, Name. Here are the first few values:
Name
A
B
B
C
C
C
D
Now I wish to create a bar plot using dimple.js that displays the distribution of Name (i.e. the count of each distinct Name), but I can't find a way to do that in dimple without having the actual counts of each name. I looked up the documentation of chart.addMeasureAxis (https://github.com/PMSI-AlignAlytics/dimple/wiki/dimple.chart#addMeasureAxis) and found that for the measure argument:
measure (Required): This should be a field name from the data. If the field contains numerical values they will be aggregated, if it contains any non-numeric values the distinct count of values will be used.
So basically what that says is that it doesn't matter how many times each category occurred; dimple is just going to treat each of them as if it occurred once?
I tried chart.addMeasureAxis('y', 'Name') and the result was a bar plot that had each bar at a value of 1 (i.e. each name occurred only once).
Is there a way to achieve this without changing the data set?
You do need something else in your data which distinguishes the rows:
e.g.
Type Name
X A
Y A
Z A
X B
Y B
A C
You could do it with:
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Type");
or add a count in:
Name Count
A 1
A 1
A 1
B 1
B 1
C 1
Then it's just
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Count");
But there isn't a way to do it with just the name.
Hope this helps. First use d3.nest() to group the dataset by its unique values, then add the count of each distinct value to the newly created objects:
var countData = d3.nest()
    .key(function (d) {
        return d["Name"];
    })
    .entries(yourData);
countData.forEach(function (c) {
    c.count = c.values.length;
});
After this, create the dimple chart from the new countData object:
var myChart = new dimple.chart(svg, countData);
Then for 'x' and 'y' axis,
myChart.addCategoryAxis("x", "key");
myChart.addMeasureAxis("y", "count");
Please look into d3.nest() to understand the countData object.
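The pre-aggregation step in both answers does not depend on the charting library. As a rough sketch in Python (not dimple.js or d3 code; the variable names are made up for illustration), counting each distinct Name from the sample column gives the kind of rows a category/measure chart needs:

```python
from collections import Counter

# The Name column from the question's sample CSV.
names = ["A", "B", "B", "C", "C", "C", "D"]

# One record per distinct name, carrying its number of occurrences,
# mirroring the "key" and "count" fields used in the d3.nest() answer.
count_data = [
    {"key": name, "count": n} for name, n in Counter(names).items()
]
print(count_data)
```

Each record then maps directly onto a category axis ("key") and a measure axis ("count").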