pandas read IRS space-delimited txt data - pandas

I recently was working with IRS tax file data. It is space-delimited txt data like the following (full data are here):
There are some patterns in the way the data was stored. But to me, the data is not formatted in a standard way and it is not easy to read into Pandas. I was wondering how to get a dataframe like the following from the above txt data:
+------------+-------------+--------------------------+-----+-----+-----+------+
| fips_state | fips_county | name | c1 | c2 | c3 | c4 |
+------------+-------------+--------------------------+-----+-----+-----+------+
| 02 | 013 | Aleutians East Borough T | 145 | 280 | 416 | 1002 |
| 02 | 016 | Aleutians West Total Mig | 304 | 535 | 991 | 2185 |
| ... | ... | ... | ... | ... | ... | ... |
+------------+-------------+--------------------------+-----+-----+-----+------+

This reads the file line by line and sorts the lines into two lists: lines that start at the left margin and lines that are indented. Each list is then loaded into its own dataframe. After parsing each one into columns, merge the two dataframes.
import urllib.request  # the lib that handles the url stuff
import pandas as pd

target_url = 'https://raw.githubusercontent.com/shuai-zhou/DataRepo/master/data/C9091aki.txt'
list_a = []  # lines that do not start with two spaces
list_b = []  # lines that start with five spaces
for raw in urllib.request.urlopen(target_url):
    line = raw.decode('utf-8')  # decode once instead of on every test
    if line[0:2] != ' ' * 2:
        print(line.strip())
        list_a.append(line.strip())
    if line[0:5] == ' ' * 5:
        print(line.strip())
        list_b.append(line.strip())
dfa = pd.DataFrame(list_a)
dfb = pd.DataFrame(list_b)
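Each collected line is still a single string, so it has to be split into columns before merging. A minimal sketch of that step, assuming a record consists of two FIPS codes, a free-text name, and four trailing numbers (the layout of the desired table in the question). The helper parse_line and the sample lines are hypothetical; the samples are rebuilt from the desired output, not copied from the file, so the real layout may need a different split.

```python
import pandas as pd

def parse_line(line):
    """Split one record into the seven target fields (assumed layout)."""
    # The last four whitespace-separated tokens are taken to be c1..c4.
    head, c1, c2, c3, c4 = line.strip().rsplit(None, 4)
    # The first two tokens are the FIPS codes; whatever remains is the name.
    fips_state, fips_county, name = head.split(None, 2)
    return {
        "fips_state": fips_state,    # kept as text so leading zeros survive
        "fips_county": fips_county,
        "name": name,
        "c1": int(c1),
        "c2": int(c2),
        "c3": int(c3),
        "c4": int(c4),
    }

# Hypothetical sample lines, rebuilt from the desired output table.
sample_lines = [
    "02 013 Aleutians East Borough T 145 280 416 1002",
    "02 016 Aleutians West Total Mig 304 535 991 2185",
]
df = pd.DataFrame([parse_line(line) for line in sample_lines])
```

Splitting from the right first means spaces inside the name do not matter, as long as the four numeric fields are always present.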

Related

Split String into multiple columns in Informatica powercenter

My goal is to take query paths from an Excel sheet, split them, and insert the data into a Teradata table. How can I do it?
Here is an example of the scenario:
QUERYPATH:
/content/folder[#name='AAAAA AAAA']/folder[#name='A.B.C.']/folder[#name='AreaA']/folder[#name='Sub Area ABC']/folder[#name='GroupBB']/analysis[#name='Final elementJK']
/content/folder[#name='AAAAA AAAA']/folder[#name='A.B.C.']/folder[#name='AreaB']/folder[#name='Sub Area A.B.C.']/report[#name='Final elementHJ']
/content/folder[#name='AAAAA AAAA']/folder[#name='A.B.C.']/folder[#name='AreaC']/folder[#name='Sub BCD']/analysis[#name='Final elementFG']
id A| AAAAAAAAAA |idArea|Area |idSubArea| SubArea |idGroup | Group | Final Element |
112| AAAAAAAAAA | 22 |AreaC | 221 | Sub BCD | 2216 | GroupA | Final elementFG |
112| BDHDSKDDDD | 39 |AreaA | 393 | Sub ABC | 3931 | GroupBB | Final elementJK |
112| AAAAAAAAAA | 22 |AreaC | 222 | Sub BCD | 2217 |Final ElementLL| Final elementLL |
112| EEEEEEEEEE | 11 |AreaB | 114 |Sub A.B.C.| 1142 |Final elementHJ| Final elementHJ |
There is always an Area and a SubArea value. Group and Final Element are usually new values, and often there is no Group value; in that case I copy the Final Element value into Group (examples: Final elementLL, Final elementHJ).
There is no split function in PowerCenter. You'll need to use a combination of INSTR and SUBSTR functions to extract the appropriate values.
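To make the find-then-slice logic concrete, here is a rough sketch in Python rather than in PowerCenter expression syntax. str.find plays the role of INSTR and slicing plays the role of SUBSTR; the function name extract_names is made up for illustration, and the markers are taken from the sample paths in the question.

```python
def extract_names(query_path):
    """Return every value found between [#name=' and '] in query_path."""
    start_marker = "[#name='"
    end_marker = "']"
    names = []
    pos = 0
    while True:
        # Locate the next opening marker (the INSTR step).
        start = query_path.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        # Locate the closing marker that follows it.
        end = query_path.find(end_marker, start)
        if end == -1:
            break
        # Cut out the text in between (the SUBSTR step).
        names.append(query_path[start:end])
        pos = end + len(end_marker)
    return names

path = ("/content/folder[#name='AAAAA AAAA']/folder[#name='A.B.C.']"
        "/folder[#name='AreaA']/folder[#name='Sub Area ABC']"
        "/folder[#name='GroupBB']/analysis[#name='Final elementJK']")
names = extract_names(path)
```

In PowerCenter the same idea would need one INSTR/SUBSTR pair per output port, since the expression language has no loop.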
Please also share (apart from the problem definition) your attempts on solving it and results you have. It will make it easier to help you achieve the results.

How to use dtype inside a for loop while traversing a dataframe column-wise ? So that I can traverse each row of a particular dtype column

For example, I have a dataframe df like this:
| Name | color | id | weight |
|------- |-------- |---- |-------- |
| john | blue | 67 | 70 |
| clara | yellow | - | 67 |
| diana | red | 89 | 56 |
Here the numeric columns like "id" and "weight" should have all numeric values, unlike the second value of "id" which is a '-'.
If I do df.dtypes, it returns:
| name | object
| color | object
| id | object
| weight | float
**How can I traverse the dataframe column-wise, check whether a column's dtype is object, and, if it is, check whether it became an object column only because of a stray '-' (as in id), and raise a flag in that case?**
Zip up the column name and the dtypes to make a tuple:
for col_name, col_type in zip(df.columns, df.dtypes):
    if col_type == "object":
        # do whatever here
        pass
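One possible way to fill in the "do whatever here" part is to try converting each non-numeric column with pd.to_numeric and look at what fails. This is only a sketch: the sample frame is rebuilt from the question, the name flagged is made up, and "mostly numeric" is taken to mean that at least one value converts cleanly.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["john", "clara", "diana"],
    "color": ["blue", "yellow", "red"],
    "id": ["67", "-", "89"],
    "weight": [70.0, 67.0, 56.0],
})

flagged = {}  # column name -> list of values that block numeric conversion
for col_name, col_type in zip(df.columns, df.dtypes):
    if pd.api.types.is_numeric_dtype(col_type):
        continue  # already numeric, nothing to check
    # Values that cannot be parsed as numbers become NaN.
    as_numbers = pd.to_numeric(df[col_name], errors="coerce")
    bad = as_numbers.isna() & df[col_name].notna()
    # Flag only columns where some values parse and some do not.
    if as_numbers.notna().any() and bad.any():
        flagged[col_name] = df.loc[bad, col_name].tolist()
```

Columns such as Name and color are skipped because none of their values parse as numbers, so they are treated as genuine text.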

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note that this is just a made-up df. The real one has many other columns, some with data types other than text, and bar and foo may also contain many empty cells (NaN).
The actual df looks like this:
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
bar foo
0 4 0
1 1 3
2 3 2
However, if you want to be on the safe side (in case a value is not a key in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
so the output for this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To avoid repeating the line for every column, loop over the column names:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Alternatively, stack the frame, map the dict, and then unstack:
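Putting these pieces together, a hedged sketch of a replacement for no_strings that only touches the columns it is told about and leaves any unmapped value (including NaN) as it was. The function name map_result_columns and the choice to return a copy are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}

def map_result_columns(df, cols, mapping):
    """Return a copy of df with the given columns translated through mapping."""
    out = df.copy()
    for c in cols:
        # dict.get(x, x) falls back to the original value when x is not a key,
        # so NaN and unexpected strings pass through unchanged.
        out[c] = out[c].map(lambda x: mapping.get(x, x))
    return out

sample_df = pd.DataFrame({
    "bar": ["rejected", "clear", "caution"],
    "foo": ["unidentified", "caution", float("nan")],
})
mapped = map_result_columns(sample_df, ["bar", "foo"], result_map)
```

Restricting the mapping to named columns keeps columns such as user_id or created_at from being turned into NaN, which is what happens when every column is passed through map(result_map).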
df.stack().to_frame()[0].map(result_map).unstack()
bar foo
0 4 0
1 1 3
2 3 2

How to extract genotypes from public 1000G bigquery table

I would like to extract GENOTYPE information from bigquery-public-data:human_genome_variants.1000_genomes_phase_3_optimized_schema_variants_20150220 by SQL.
Considering how the data is organized in the table I find it to be a difficult task. I've used SQL occasionally but not for complex queries. I would be grateful for some guidance.
For this task the relevant columns are:
names (SNPid)
reference_bases
hom_ref_call (names of the samples that are homozygous for the reference bases; each such sample therefore has genotype 'reference_bases'+'reference_bases').
By running the query:
SELECT ARRAY_TO_STRING(names, '') as SNP,
       samples,
       CONCAT(reference_bases, reference_bases) as GT
FROM `bigquery-public-data.human_genome_variants.1000_genomes_phase_3_optimized_schema_variants_20150220`
CROSS JOIN UNNEST(hom_ref_call) as samples
I get:
+-----+------------+---------+----+
| Row | SNP | samples | GT |
+-----+------------+---------+----+
| 1 | rs10158087 | HG00096 | GG |
| 2 | rs10158087 | HG00097 | GG |
| 3 | rs10465663 | HG00096 | CC |
| 4 | rs10465663 | HG00097 | CC |
+-----+------------+---------+----+
The result I am looking for should look like this:
+-----+------------+---------+---------+
| Row | SNP | HG00096 | HG00097 |
+-----+------------+---------+---------+
| 1 | rs10158087 | GG | GG |
| 2 | rs10465663 | CC | CC |
+-----+------------+---------+---------+
How should I structure the query to get the desired table? Thanks.
Note: There are 3500 samples so the column names (HG00096,...) should be generated automatically.
For your sample, you can use aggregation:
SELECT ARRAY_TO_STRING(names, '') as SNP,
       MAX(CASE WHEN samples = 'HG00096' THEN CONCAT(reference_bases, reference_bases) END) as HG00096,
       MAX(CASE WHEN samples = 'HG00097' THEN CONCAT(reference_bases, reference_bases) END) as HG00097
FROM `bigquery-public-data.human_genome_variants.1000_genomes_phase_3_optimized_schema_variants_20150220` CROSS JOIN
     UNNEST(hom_ref_call) as samples
GROUP BY SNP;
Do note that you need to know the columns that you want in advance.
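Because the column list has to be spelled out, one workaround for thousands of samples is to generate the query text from a list of sample names and then submit it. The sketch below only builds the string; build_pivot_query is a hypothetical helper, and how the list of sample names is obtained (for example from a separate query over hom_ref_call) is left open.

```python
import re

TABLE = ("bigquery-public-data.human_genome_variants."
         "1000_genomes_phase_3_optimized_schema_variants_20150220")

def build_pivot_query(sample_names):
    """Build one MAX(CASE ...) column per sample name."""
    columns = []
    for name in sample_names:
        # Only accept names that are safe to use as a column alias.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsupported sample name: {name!r}")
        columns.append(
            f"  MAX(CASE WHEN samples = '{name}' "
            f"THEN CONCAT(reference_bases, reference_bases) END) AS {name}"
        )
    return (
        "SELECT ARRAY_TO_STRING(names, '') AS SNP,\n"
        + ",\n".join(columns) + "\n"
        + f"FROM `{TABLE}`\n"
        + "CROSS JOIN UNNEST(hom_ref_call) AS samples\n"
        + "GROUP BY SNP"
    )

query = build_pivot_query(["HG00096", "HG00097"])
```

With 3500 samples the generated statement becomes very long, so it may be worth checking it against BigQuery's query length and column limits before relying on this approach.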

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and I also know t0, the start of the experiment (say 2016-01-01 for x123), and tN, the end of the experiment (2017-05-25), from another dataframe df_s, then how can I create the dataframe df_new, which should look like this:
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: take t0 and generate one row per week until tN. For each row R, search within that group for an event date that falls inside R's week. If one is found, count how many weeks it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for R to 0 (dead).
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd
# 1. generate dataframes per group to get the boundary within `t0` and `tN` from df_s dataframe, where each dataframe has "group, obs, lifetime, status" columns X (tN - t0 / week) rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])
def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then found how long it's there
df_new = df_all.apply(do_that)
I'm not sure I follow, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd
df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index)  # convert the index to datetime so you can 'resample'
# yourfunction is a placeholder for whatever computes 'lifetime, week' from one week of rows
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
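For the first part of the question (one row per week between t0 and tN, per group), a minimal sketch with pd.date_range is shown below. It only fills the status column, because the rule for 'lifetime, week' is not fully specified. The helper weekly_status, the shortened tN, and the choice that a week covers its start date plus the following six days are all assumptions.

```python
import pandas as pd

def weekly_status(group_events, t0, tN):
    """One row per 7-day period from t0 to tN; status is 1 if any event falls in it."""
    weeks = pd.date_range(t0, tN, freq="7D")
    # Week number of each event, counted from t0 in whole 7-day steps.
    week_idx = ((group_events["event date"] - t0).dt.days // 7).tolist()
    hit = set(week_idx)
    status = [1 if i in hit else 0 for i in range(len(weeks))]
    return pd.DataFrame({"obs. weekly": weeks, "status": status})

# Sample events taken from the first rows of df_e in the question.
df_e = pd.DataFrame({
    "group": ["x123", "x123", "x123"],
    "event date": pd.to_datetime(["2016-01-06", "2016-01-08", "2016-02-15"]),
    "count": [1, 10, 9],
})
t0 = pd.Timestamp("2016-01-01")
tN = pd.Timestamp("2016-02-26")  # shortened end date, for illustration only

# Build the weekly frame separately for every group, then stack them.
df_new = pd.concat(
    {g: weekly_status(ev, t0, tN) for g, ev in df_e.groupby("group")},
    names=["group", "row"],
)
```

If t0 and tN differ per group, they could be looked up from df_s inside the dictionary comprehension instead of being passed as constants.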