How to add headers to data selected from a bigger DataFrame? - pandas

I'm learning pandas and I have a DataFrame (from CSV) that I need to filter. The original DataFrame looks like this:
+----------+---------+---------+
| Header1  | Header2 | Header3 |
+----------+---------+---------+
| Value 1  | A       | B       |
| Value 1  | A       | B       |
| Value 2  | C       | D       |
| Value 1  | A       | B       |
| Value 3  | B       | E       |
| Value 3  | B       | E       |
| Value 2  | C       | D       |
+----------+---------+---------+
Then, I select the new data with this code:
dataframe.Header1.value_counts()
output:
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2
dtype: int64
So, I need to add headers to this selection and output something like this:
Values Count
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2

pd.Series.value_counts returns a Series whose index holds the unique values of the Series it was called on. reset_index is what you want to turn it into a DataFrame, and the rename methods get the column labels right.
(df.Header1.value_counts()
   .rename('Count')        # Series name becomes the column label for the counts
   .rename_axis('Values')  # index name becomes the column label for the unique values
   .reset_index()          # Series -> DataFrame
)
#     Values  Count
# 0  Value 1      3
# 1  Value 2      2
# 2  Value 3      2
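The same DataFrame can also be produced by naming the counts column through reset_index itself, skipping the rename call; a minimal sketch, assuming the DataFrame is called df as above:

(df.Header1.value_counts()
   .rename_axis('Values')       # index name -> 'Values' column after reset_index
   .reset_index(name='Count'))  # Series -> DataFrame with the counts column named 'Count'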

Related

PySpark - drop rows with duplicate values regardless of column order

I have a PySpark DataFrame with two columns like so:
+-------+------+
| right | left |
+-------+------+
|     1 |    2 |
|     2 |    3 |
|     2 |    1 |
|     3 |    2 |
|     1 |    1 |
+-------+------+
I want to drop duplicates but with no respect to the order of the columns.
For example, a row that contains (1,2) and a row that contains (2,1) are duplicates.
The resulting DataFrame would look like this:
+-------+------+
| right | left |
+-------+------+
|     1 |    2 |
|     2 |    3 |
|     1 |    1 |
+-------+------+
The regular drop_duplicates method doesn't work in this case. Does anyone have any ideas how to do this cleanly and efficiently?
from pyspark.sql.functions import array, array_sort, col

(df1.withColumn('x', array_sort(array(col('left'), col('right'))))  # sorted array built from left and right
    .dropDuplicates(['x'])  # use the sorted array to drop duplicates
    .drop('x')              # drop the helper column
).show()
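For exactly two columns, a hedged alternative is to normalise the pair with least/greatest instead of building an array column; a sketch, where the helper column names lo and hi are mine:

from pyspark.sql.functions import least, greatest, col

(df1.withColumn('lo', least(col('left'), col('right')))     # smaller value of the pair
    .withColumn('hi', greatest(col('left'), col('right')))  # larger value of the pair
    .dropDuplicates(['lo', 'hi'])                           # (1,2) and (2,1) now collide
    .drop('lo', 'hi')
).show()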

Pandas apply to a range of columns

Given the following DataFrame, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (i.e. a subset of the DataFrame).
| North | South | East | West |
|-------|-------|------|------|
| 8     | 1     | 8    | 6    |
| 4     | 4     | 8    | 4    |
| 1     | 1     | 1    | 2    |
| 7     | 3     | 7    | 8    |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
| Headers       |
|---------------|
| [South]       |
|               |
| [South, East] |
|               |
The following one-liner manages to return column headers for the entire DataFrame.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
import pandas as pd

df = pd.DataFrame({'North': [8, 4, 1, 7], 'South': [1, 4, 1, 3],
                   'East': [8, 8, 1, 7], 'West': [6, 4, 2, 8]})

df1 = df.melt(ignore_index=False)   # long format, keeping the original row index
condition1 = df1['variable'] == 'South'
condition2 = df1['variable'] == 'East'
condition3 = df1['value'] == 1
df1 = df1.loc[(condition1 | condition2) & condition3]
df1 = df1.groupby(df1.index)['variable'].apply(list).rename('Headers')  # list of matching headers per row
df = df.join(df1)
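Alternatively, the one-liner from the question can be pointed at a positional slice of columns directly; a minimal sketch, assuming N1 and N2 are the positional bounds mentioned in the question (1 and 3 would select 'South' and 'East'):

cols = df.columns[N1:N2]  # e.g. df.columns[1:3] -> Index(['South', 'East'])
df['Headers'] = df[cols].apply(lambda row: cols[row == 1].tolist(), axis=1)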

pandas cumcount in pyspark

Currently attempting to convert a script I made from pandas to PySpark, I have a DataFrame that contains data in the form of:
index | letter
------|-------
0     | a
1     | a
2     | b
3     | c
4     | a
5     | a
6     | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0     | a      | 0
1     | a      | 1
2     | b      | 0
3     | c      | 0
4     | a      | 2
5     | a      | 3
6     | b      | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in PySpark? I cannot find an existing method that is similar.
The feature you're looking for is called window functions.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# row_number is 1-based, so subtract 1 to match pandas' 0-based cumcount
w = Window.partitionBy("letter").orderBy("index")
df = df.withColumn("occurrence", row_number().over(w) - 1)

How to assign indexes to your datatable rows in sqlite in large databases?

I'm using SQLite with Python. Suppose that I have a data table that looks like this:
Table 1
1 | 2 | 3 | 4 | 5
__|___|___|___|__
A | B | B | C | D
B | D | B | D | C
A | D | C | C | A
B | D | B | D | C
D | B | B | C | D
D | B | B | C | D
Question: How can I create (quickly and efficiently, in a way that stays viable for very large databases) an index column where, if row x and row y are identical, they get assigned the same index? For the example database I would want something like this:
Table 1
Index | 1 | 2 | 3 | 4 | 5
------|---|---|---|---|---
23    | A | B | B | C | D
32    | B | D | B | D | C
106   | A | D | C | C | A
72    | B | D | B | D | C
80    | D | B | B | C | D
80    | D | B | B | C | D
I don't care what the actual indexes are, as long as duplicate rows (like the last two in the example) get the same index.
You COULD create an index made up of every field in the table.
create index idx_table1_all on table1 (field1, field2, field3, field4, field5)
But that's probably not a good idea. It makes a huge index that will be slow to build and slow to process. Some database engines won't let you create an index where the combination of fields is over a certain length. I'm not sure if there's such a limit in SQLite or what it might be.
The normal thing to do is to pick some field or combination of a small number of fields that is likely to be short and well distributed.
By "short" I mean literally and simply, the data in the field only takes a few bytes. It's an int or a varchar with a small length, varchar(4) or some such. There's no absolute rule about how short "short" is, but you should pick the shortest otherwise suitable field. A varchar(4000) would be a bad choice.
By "well distributed" I mean that there are many different values. Ideally, each row has a unique value, that is, there is no value that is the same for any two rows. If there is no such field, then pick one that comes as close to this as possible. A field where sometimes 2 or 3 rows share a value but rarely more than that is good. A field where half the records all have the same value is not.
If there is no one field that is well distributed, you can create an index on a combination of two or three fields. But if you use too many fields, you start breaking the "short" condition.
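From Python, any of the indexes discussed above can be created through the standard sqlite3 module; a minimal sketch, where the database path, index name, and column choice are placeholders:

import sqlite3

con = sqlite3.connect('my_database.db')  # placeholder path
# Index two short, well-distributed columns rather than every field in the table.
con.execute('CREATE INDEX IF NOT EXISTS idx_table1_f1_f2 ON table1 (field1, field2)')
con.commit()
con.close()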
If you can parse your file row by row, why not use a dict with the row as a string or a tuple?
my_dico = {}
index_counter = 1
with open(my_db) as my_database, open(out_file, 'w') as out:  # open the output file for writing
    for row in my_database:
        my_row_as_a_tuple = tuple(row.strip().split())
        if my_row_as_a_tuple in my_dico:
            # already seen: reuse the index assigned the first time
            out.write(my_dico[my_row_as_a_tuple] + '<your separator>' + row)
        else:
            # new row: assign the next index and remember it
            index_counter += 1
            out.write(str(index_counter) + '<your separator>' + row)
            my_dico[my_row_as_a_tuple] = str(index_counter)
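If the assigned index also needs to be stable across runs (identical row content always maps to the same identifier), a hedged variation is to hash the row instead of counting; a sketch using only the standard library, with placeholder file names:

import hashlib

with open('my_db.txt') as my_database, open('out_file.txt', 'w') as out:
    for row in my_database:
        key = '|'.join(row.strip().split())
        # identical rows -> identical digest -> identical index, on every run
        row_id = str(int(hashlib.md5(key.encode('utf-8')).hexdigest()[:8], 16))
        out.write(row_id + '<your separator>' + row)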

Combine columns x to n in OpenRefine

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A  | B  | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 |   |   |
| 12 | 3  | 4 |   |   |
| 5  | 5  | 5 | 5 |   |
Columns A-C are known, and the information in them is correct. But columns D to N (an unknown number of columns starting with D) need to be combined, as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths), I also need to disregard empty cells.
There is a two-step approach to this that should work for you.
From the first column you want to merge (column D in this case), choose Transpose -> Transpose cells across columns into rows.
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, and make sure the option to 'Ignore Blank Cells' is checked (it should be checked by default). Then click Transpose.
You'll get the values that were previously in columns D-N appearing in rows, e.g.
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|---|---|---|-----|
| 1 | 2 | 3 | 4   |
|   |   |   | 5   |
|   |   |   | 6   |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the values in the joined cell. In your use case you can probably delete the joining character and combine the cells without any separator. This will give you:
| A | B | C | new |
|---|---|---|-----|
| 1 | 2 | 3 | 456 |