Data Science/ dummy variables

EmoJoyFul

Hi guys! New to Pythonista, I am loving it so far!
I just started wrapping my head around objc; I’m a machine learning data scientist and Python user.

I need to make a quick and dirty app for labeling a large amount of ML training data (it’s actually just for my mom, she’s a retired nurse dipping her toes into ML)

The first issue I’m having in Xcode, which is why I moved over to Pythonista, is my dataset. It has 938 variables (most of which were created by the original authors with one-hot encoding) and 550,000 rows, each a patient who may have any number of 1s in each category. Think type of meds, med history, etc.

I plopped it into Xcode as a CSV and it created just a dictionary, with row 1 being all headers. I think I can fix that, but I’m wondering if it’s possible with UIKit to display only the column names where a 1 is present and not a zero.

The ultimate goal is to present info about a single ‘patient’ and then let my mom hit a button that labels them as needing surgery or not, with that label recorded to a CSV. Then I can help her with the ML stuff outside of iOS.

I tried to get her to just label a csv but she’s all iPad and her eyes aren’t great.
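
A minimal sketch of the label-recording step described above, using only the standard `csv` module. The file name `labels.csv`, the column names, and the `record_label` helper are all hypothetical; a button handler in the app would call something like it once per patient.

```python
import csv
import os
import tempfile

def record_label(path, patient_id, needs_surgery):
    """Append one (patient_id, label) row, writing a header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['patient_id', 'needs_surgery'])
        # Store the label as 1/0 so it is easy to load for training later.
        writer.writerow([patient_id, int(needs_surgery)])

# Demo in a temporary directory so nothing real is overwritten.
demo_path = os.path.join(tempfile.mkdtemp(), 'labels.csv')
record_label(demo_path, 1, True)
record_label(demo_path, 2, False)

with open(demo_path, newline='') as f:
    saved_rows = list(csv.reader(f))
print(saved_rows)
```

Appending one row per button press means a crash or an accidental app close loses at most the label in progress.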

EmoJoyFul

I re-read this and realize it's a big question, sorry!

Really, my first question is: is it possible to create a view that will display boolean data conditionally?
i.e.
Patient 1: Has a 1 in columns 1, 19, 23, 28, and 923, so only display that information.

That's really step 1. I want to make sure the data I have is organized in a way that I can use it. Otherwise, it's really a bit of a pain in the rear to undummy.

cvp

@EmoJoyFul So, if I understand correctly, you have:
a first row like info1,info2,...,info938
a row per person like 0,0,1,0,1,1,...,0

and, for a person with a 1 in columns 1, 19, 23, 28, and 923, you want to
display a text like info1, info19, info23, info28, info923.
Assume rows are arrays of 938 elements:
info names for row 1,
1 or 0 for rows > 1.
Try this:

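row0 = ['info1', 'info2', 'info3', 'info4']  # header row: column names
row1 = [0, 1, 1, 0]                          # one person: 1 = present, 0 = absent
# keep only the names whose value is 1
t = ', '.join([name for name, value in zip(row0, row1) if value == 1])
print(t)  # info2, info3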
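
The same idea can be applied to a whole CSV file. This is only a sketch: the sample data and the `active_columns` helper are made up for illustration, and an in-memory `io.StringIO` stands in for the real file. Note that the `csv` module returns every cell as a string, so the comparison is against `'1'` rather than the integer `1`.

```python
import csv
import io

# Tiny stand-in for the real file (hypothetical column names).
sample_csv = io.StringIO(
    "info1,info2,info3,info4\n"
    "0,1,1,0\n"
    "1,0,0,1\n"
)

def active_columns(csv_file):
    """Yield, for each data row, the names of the columns that contain a 1."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first row holds the column names
    for row in reader:
        yield [name for name, value in zip(header, row) if value.strip() == '1']

patients = list(active_columns(sample_csv))
for number, names in enumerate(patients, start=1):
    print('Patient {}: {}'.format(number, ', '.join(names)))
```

With the real data, `open(path, newline='')` would replace the `StringIO` object. Because `active_columns` is a generator, the 550,000 rows would not all need to be in memory at once if they are consumed one at a time instead of being collected with `list()`.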

mikael

@EmoJoyFul, really like your project.

See this gist for how you might present the data for labeling. TableView has the advantage of scrolling nicely if there’s a lot of 1s.

(I would have copied the code here, but there seems to be a new spam filter in town...)

EmoJoyFul

Thank you!! I’ll take a look. :)
Is there a way to use a search function or a lambda with regex and apply? Or do you see no way around calling each line?

My hope is to cluster without concatenating each group of variables.

So info19-info87 could all be columns labeled after different diseases. If the patient has that disease, they would have a 1, else a 0.

Then 87-190 could be medication types.

I was thinking it could be beneficial to break the data set into clusters, and replace 1s with the column names and 0s with None. My hope is that the user, my mom, won’t have to learn binary lol.

I will run yours now, mikael. I’m excited to try it. I love machine learning. I would love a way to incorporate small batch or stream training concurrent to labeling tasks. I have no idea if that’s computationally silly or helpful, but it could potentially better inform how much manual labeling has to be done.

Thanks!
Anna

You guys have been very helpful!
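
One possible way to group columns by range, as described above, is to map each category to a slice of column positions and list only the names that are set. This is a sketch under assumptions: the column names, the category names, and the slice boundaries are all invented, and the real ranges (for example diseases or medication types) would have to be read off the actual header.

```python
# Hypothetical layout: which slice of columns belongs to which category.
header = ['age_over_65', 'diabetes', 'asthma', 'copd', 'insulin', 'statin']
groups = {
    'Medical history': slice(1, 4),   # columns 1..3
    'Medications': slice(4, 6),       # columns 4..5
}

def describe(row):
    """Return {category: [names of columns set to 1]} for one patient row."""
    summary = {}
    for category, columns in groups.items():
        summary[category] = [
            name
            for name, value in zip(header[columns], row[columns])
            if value == 1
        ]
    return summary

patient = [1, 1, 0, 1, 0, 1]
summary = describe(patient)
for category, names in summary.items():
    # Show 'none' for a category with no 1s, so the reader never sees raw 0s.
    print('{}: {}'.format(category, ', '.join(names) or 'none'))
```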

EmoJoyFul

Ok that is a way to display info I never would have thought of! I love it!!!!!!
Mikael. Thank you thank you thank you!

Anna

mikael

@EmoJoyFul, happy if you can use it.

For the clusters, you could do some preprocessing, for example in the set_row_data method, where you replace selected symptoms(?) with the name of a condition(?) already before presenting the data for labeling.

E.g. at the beginning of the script, add rules like this:

clusters = {
    'condition1': ('info2', 'info3'),
    'condition2': ('info1', 'info4'),
}

Then, in set_row_data, add preprocessing:

for condition in clusters:
    cluster = set(clusters[condition])
    if all([
        symptom in self.row_data
        for symptom in cluster]
    ):
        self.row_data = [
            symptom 
            for symptom in self.row_data
            if symptom not in cluster
        ]
        self.row_data.append(condition)
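
For trying the rule out away from the table view, here is a standalone variant of the same preprocessing. It is a sketch, not the code from the gist: `apply_clusters` is a hypothetical helper that takes a plain list in place of `self.row_data`, and it uses `set.issubset` for the "all symptoms present" check.

```python
clusters = {
    'condition1': ('info2', 'info3'),
    'condition2': ('info1', 'info4'),
}

def apply_clusters(row_data, clusters):
    """Replace every fully present cluster of symptoms with its condition name."""
    result = list(row_data)  # work on a copy, leave the caller's list alone
    for condition, symptoms in clusters.items():
        cluster = set(symptoms)
        if cluster.issubset(result):
            # Drop the individual symptoms and show the condition instead.
            result = [s for s in result if s not in cluster]
            result.append(condition)
    return result

collapsed = apply_clusters(['info2', 'info3', 'info4'], clusters)
print(collapsed)
```

Here `info2` and `info3` collapse into `condition1`, while `info4` stays as it is because `info1` is missing and `condition2` is therefore incomplete.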

ccc

@mikael said:

if all([
    symptom in self.row_data
    for symptom in cluster]
):

I would recommend that you lose the [] in this code. If cluster is massive then there will be a memory and performance penalty to pay for creating a list. Without the [], no list will be created and the very first comparison that returns False will stop the rest of the calculation.
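
A small illustration of the difference, with a made-up `noisy_check` helper that records every symptom it is asked about. The names and data are hypothetical; the point is only how many checks each form performs.

```python
def noisy_check(symptom, row_data, log):
    """Membership test that records every symptom it is asked about."""
    log.append(symptom)
    return symptom in row_data

row_data = ['info3']
cluster = ['info1', 'info2', 'info3']  # a list, so the iteration order is fixed

# List comprehension: every element is checked before all() even starts.
list_log = []
all([noisy_check(s, row_data, list_log) for s in cluster])

# Generator expression: all() pulls one result at a time and stops at the first False.
gen_log = []
all(noisy_check(s, row_data, gen_log) for s in cluster)

print(len(list_log), len(gen_log))
```

The list version runs all three checks, while the generator version stops after the first one because `'info1'` is not in `row_data`.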

mikael

@ccc, thanks! I had managed to forget or maybe not even realize in the first place that x for x in y is a valid generator expression outside a list/tuple/set comprehension.

There are certainly other optimization opportunities as well - for example, if conditions tend to show as a large number of consecutive 1s in the source array, slicing with indexes could be much better than the more readable but verbose approach above.
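
A rough sketch of what the slicing idea could look like, assuming each condition corresponds to one run of consecutive columns in the raw 0/1 row. The `condition_slices` mapping and the `conditions_from_row` helper are hypothetical, and whether this is actually faster would need measuring on the real data.

```python
# Hypothetical rule: a condition corresponds to a run of consecutive columns.
condition_slices = {
    'condition1': slice(1, 3),   # columns 1 and 2
    'condition2': slice(3, 6),   # columns 3, 4 and 5
}

def conditions_from_row(row):
    """Return the conditions whose whole column range is set to 1 in this raw 0/1 row."""
    return [
        condition
        for condition, columns in condition_slices.items()
        if all(row[columns])  # an empty slice would count as a match, so keep ranges non-empty
    ]

found = conditions_from_row([0, 1, 1, 1, 0, 1])
print(found)
```

This works on the source row directly, before any column names are looked up, so no per-symptom membership tests are needed.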

EmoJoyFul

You guys are incredible! I wish I had asked sooner. I spent all Sunday away from my toddler trying to figure it out.

I really appreciate the help!
Best, Anna