Wrangling your data means getting your hands dirty

This is a quick post about why you need to adopt a mindset of iterative and interactive data wrangling.

Iterative data wrangling is the idea that you use your analysis output as a source of information to ‘redo’ your wrangling. Interactive data wrangling is the idea that you manually intervene in the process.

Some schools of thought hold that data mining and cleaning should be as automated as possible. A lot of software has been purpose-built for this process in the last couple of years. Because the DS community is quite friendly, there are a few free tools among the proprietary software like the exxy Alteryx and Trifacta.

These include OpenRefine (formerly Google Refine) and Quadient DataCleaner.

The problem with data wrangling software is that it misses things. Things you don’t even know were there until they mess up your output. If you aren’t aware of what your data looks like then you will have a hard time fixing issues or even diagnosing them. Often the conclusion is ‘oh, I better add some gradient boosting, my F1-score is a little low’. Alternatively, just clean your data better.

The other problem with automatic wrangling software is that the schema is defined for you. This means that you may end up over cleaning your data and therefore overfitting your models.

Iterative data wrangling is a skill though. The more iterations you do, the more likely it is that you are cleaning up your own mess after you spot what you have done to your data on output. Kind of like how you only see the stains on your favourite top after you ironed it and put it on. Shoulda spot cleaned.

For those new to the field, it is a process of failing. The more times you do dumb things, the more you learn and don’t do them again.

I’ll give you an example.

I have a set of tweets that use ‘coz’, ’cause’ and ‘cuz’ instead of ‘because’.

I thought that these words needed to be normalised. I could have probably just thrown these away, but I was trying to be overly thorough. So I wrote a script to do this.

I didn’t check my output too often and only really looked at the top 50 words for each of the topics in the output. Later when I was visualising things I saw this:

Jabecausezi bebecause becausey

I realised that because I had altered these terms in the raw string before I had tokenized (when I should have altered the list of tokens that tokenizing creates), I had inserted ‘because’ into the middle of other words, ‘because’ itself included, rather than replacing only those standalone words with ‘because’. I never picked it up because the mangled words were made rare-ish. Not infrequent enough to be removed, but not frequent enough to turn up in the top 50 words for each topic.
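
Here’s a minimal sketch of how that kind of mistake can happen. This isn’t my original script; the SLANG map and both function names are made up for illustration. The first function runs a plain substring replace on the raw string, the second tokenizes first and only swaps whole tokens.

```python
import re

# Hypothetical slang map, for illustration only (not the original script).
SLANG = {"cause": "because", "coz": "because", "cuz": "because"}


def naive_normalise(text):
    """Substring replace on the raw string: also hits the insides of words."""
    for slang, full in SLANG.items():
        text = text.replace(slang, full)
    return text


def token_normalise(text):
    """Tokenize first, then replace only tokens that match exactly."""
    tokens = re.findall(r"\w+", text.lower())
    return [SLANG.get(tok, tok) for tok in tokens]


print(naive_normalise("Jacuzzi because cozy"))
print(token_normalise("Jacuzzi because cozy coz"))
```

Running the naive version on ‘Jacuzzi because cozy’ gives back exactly the sort of mess I saw in my visualisation, while the token version leaves ‘jacuzzi’ and ‘cozy’ alone and only swaps the standalone ‘coz’.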

Rare-ish is not a real word btw.

A silly example sure, but just one of many dumb mistakes I have made and never repeated. It actually took me a few weeks to pick up on this and it is the perfect example of why you need to iteratively and interactively clean your data. My coherence scores (I was topic modelling at the time) jumped right up and even though I don’t really pay much attention to them now (because I review my topics qualitatively), my metric-happy collaborators got a lot more relaxed about the results.
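
One cheap habit that might have caught this sooner is to scan the whole vocabulary, not just the top 50 words, for tokens that contain a normalised term without being equal to it. A rough sketch, assuming your corpus is already a flat list of tokens (the function name and the sample tokens are hypothetical):

```python
from collections import Counter


def suspicious_tokens(tokens, target="because"):
    """Return (token, count) pairs for tokens that contain `target`
    but are not exactly `target`, most frequent first."""
    counts = Counter(tokens)
    hits = [(tok, n) for tok, n in counts.items() if target in tok and tok != target]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample tokens, for illustration only.
sample = ["because", "bebecause", "becausey", "bebecause", "cat"]
print(suspicious_tokens(sample))
```

Anything this turns up is worth eyeballing by hand, which is the interactive part of the job.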

Even Mark Wahlberg recommends it.

Marky is a neat freak

As always retweet this post @data_little for happy cleaning vibes.

Photo by averie woodard. Thanks Averie for the use of your work at Unsplash.

Every Masters of Data Science Degree in Australia – James Cook University

Over the next couple of months, we’re going to be rolling out the most comprehensive summary of every Masters of Data Science (MDS) offered in Australia.

We are so over reading posts from people who have no idea about the courses. You need to be informed about what career path is best for you because data science is a very broad field. When these sorts of posts flood forums like Whirlpool and Quora, the level of confusion and misinformation makes it difficult for prospective students to figure out what’s real and what isn’t.

Just because your mate’s sister’s boyfriend went there doesn’t mean you know if it’s good or not

– Caitie, every January as she filters through DMs from prospective students.

The other reason is that some MDS (not all) are merely money makers for the universities, and that’s not a good thing for unsuspecting students. With thousands of international students coming to Australia per year to study an MDS, then finding it won’t help them get PR or even the job they were after, we thought it was time for some transparency. We know that the university’s reputation is a critical factor in picking where students do their MDS, but the fact is that rankings don’t necessarily translate into the progressive, rewarding career in Data Science that you are after.

Finally, you need to be comfortable where you are studying. A standard MDS is two years. The University you choose needs to be the right fit for you.

LDS cannot guarantee all details here are up to date and complete. Universities often make changes to their courses and so we strongly encourage seeking clarification from the universities themselves. This article should be taken as an opinion piece.

This first summary focuses on James Cook University (JCU) in Queensland, the Sunshine State. Each article will focus on a different MDS ending in a final summary of all MDS in Australia.

James Cook University

James Cook University (JCU) offers an MDS that seems to have been created for and with industry in mind. For a quick overview check out the promo video from 2017, when the degree commenced.

JCU MDS promo

Aside from Ron’s top acting skills, he makes it very clear that JCU is heavily aligned with industry. So for those of you who want to go into research, you should probably give this one a miss.

Delivery mode

The JCU MDS is geared toward students who aim to go straight into the industry as well as those already working in the industry. The course is offered 100% online, meaning you don’t have to move and you can access the course from anywhere. A great option if you have a family, you are already working or you are looking to start working while studying. We don’t know if that means you can get out of doing assignments though.

It’s not unheard of for students to start getting job offers and internships after their 1st year of study in an MDS. Being able to do your units online provides the flexibility to do this which isn’t possible for those doing a traditional contact-based degree.

JCU online students conduct their work via live chat sessions, recorded lectures and the materials provided on the aptly named JCU learning platform, ‘LearnJCU’. Online students are also offered the support of a curiously titled ‘success advisor’.

Along with your tutors, your Success Advisor is there for you from your first day as a student with JCU Online. They are readily available to help you navigate your online degree and keep you motivated throughout the duration of your study.

– JCU MDS brochure

Given it is a data science degree we wonder if the success advisor is a chatbot. You may in fact prefer face-to-face delivery and the company of other students for friendship and camaraderie throughout your course. If this is the case JCU isn’t for you.

Unit set up

Ok, stay with us here, this bit is a little confusing. The JCU MDS is not a normal arrangement, which isn’t necessarily a bad thing.

JCU runs on a ‘carousel’ timetable which means that the units, which JCU formally call subjects, are shorter than the traditional 13-week semester.

Each subject will go for seven weeks and you will complete six subjects a year on a full-time load. These subjects commence in January, March, May, July, September and October.

We are massive fans of this set up as you would only do one subject at a time. This mode of study is far less stressful and avoids the inevitable ‘assignment crush’ at the end of the semester followed by the hell that is exams, of which there are none.

Yes, it’s true. The JCU MDS has no exams. Most of their subjects have a very similar assessment format:

  • 20% tests or quizzes online
  • 60% assignments
  • 20% computational laboratories/log books

Entry requirements

JCU has no WAM or GPA requirements. So if you slacked off in your undergrad, this might be the course for you. You will require a Bachelor degree that is equivalent to AQF level 7. Preferably, JCU wants to see evidence of “high numeracy skills equivalent to senior level mathematics that includes algebra and elementary differential calculus”. We assume the senior level mathematics is a reference to your final year of high school.

If you do not have this (math) you can enter the course as long as you can demonstrate “at least five years of relevant work experience in an IT or Data Science related industry. Industry experience will need to include some background in computing, data analysis or programming.” This is probably <1% of applicants, and it’s a weird entry requirement alternative.

Now if you don’t come from a STEM background, which many MDS students do not, then you can still pursue the MDS at JCU.

If you have absolutely no mathematics and are petrified of the prospect of learning to code then first you need to ask yourself why you think data science is actually compatible with your life choices, but also be aware that JCU has you somewhat covered.

Students are offered a subject called MA5801:03 Essential Mathematics for Data Scientists – which they take before they even start the course! While it seems obvious, this is revolutionary. Time and time again MDS students come into these degrees and fall over straight away in their first mathematics, statistics or programming unit. Failing a unit can be devastating. Aside from the blow to your self-esteem, a fail can be a hard hit to your bank account, job prospects and for international students, your visa status.

Offering an elementary mathematics course is really quite unique and should, in our opinion, be adopted by all universities offering an MDS to non-cognate (not from computer science or mathematics backgrounds) students.

JCU allows you to build your qualifications as you go. So if you don’t meet the entry requirements or only some of them then you start off with the subjects built into the Graduate Certificate or Graduate Diploma then continue to the MDS subjects. No, you can only exit with one qualification, not all three.

JCU qualification progression

If you had to start at the beginning, your MDS would take you 32 months instead of the advertised 24. Sneaky JCU. Always speak to your enrolment advisor before you start your classes about what’s most appropriate for you.

Be aware though, just because you can get away without doing a unit doesn’t mean you should. Often those approving credit for prior learning aren’t data scientists (what? no way). Frankly, you could lie through your teeth and say you have three years of programming experience. We know a lot of people who get fake reference letters from their former ‘managers’ and buff up their CVs to get out of one unit. I mean, who’s going to check? The problem though is that these students fail. Oh, how they fail, it is spectacular. If they finally graduate, their rock-bottom WAM is going to hinder their chances of getting a decent job.

What you will study

So the first downside to the JCU degree is that it’s quite rigid. There are no electives and you would be required to pass every subject to graduate on time. But unlike traditional semester-based units, if every unit is offered in every teaching block, then you should only be delayed by a few months due to the short teaching periods. Have a look at the handbook here.

JCU MDS students will gain membership of the SAS Academy of Data Science if they complete additional subjects. There appear to be two streams of the JCU-SAS Joint Certificate.

JCU-SAS Joint Certificate in Introductory Data Science

  • MA5800 – Foundations of Data Science
  • MA5820 – Statistical Methods for Data Science
  • CP5804 – Database Systems
  • MA5830 – Data Visualisation

JCU-SAS Joint Certificate in Advanced Data Science

  • MA5821 – Advanced Statistical Methods for Data Scientists
  • CP5806 – Data and Information: Management, Security, Privacy and Ethics
  • MA5831 – Advanced Data Processing and Analysis using SAS
  • MA5851 – Big Data: Processing and Analysis

It’s a bit confusing, but it looks like the certification is built into the degree. Be aware that the Data Science master classes and projects are not offered in 2019. This makes sense since the JCU-SAS joint certificate has now been implemented.

SAS is a dominant industry analytics platform and employers highly prize proficiency with it. Many graduates have reported that they are required to use it despite knowing other programming languages. In fact, a couple of us have been asked in interviews if we are ‘whizz bang with SAS’, yes that actually happened. So experience with SAS is great, but since the material can be accessed directly through SAS at a cost of $420 AUD per month (more than enough to get familiar with it), it hardly seems worth giving up valuable units that could have been spent on interesting electives.

As fancy as this all sounds, we are still a little sceptical.

Another reason is the level to which teaching staff are proficient with SAS. Most tutors and lecturers you will have are researchers. We know very few Data Scientists who lecture or tutor who are experts with SAS. This could prove problematic for the new JCU program.

Looking at the units themselves, we see that they are all run out of the college of engineering. The foundation units are fairly standard and nothing here looks very different from other MDS courses. There is a mixture of Python, R and SQL at this level. It’s difficult to tell due to the carousel timetable, but it looks like students will be exposed to R before they take their foundation programming unit CP5805 – Programming and Data Analytics using Python. Although we hope JCU hasn’t made this common mistake, those who have had no exposure to programming should consider doing a MOOC or two so they don’t find the experience too jarring.

Looking into MA5810 – Introduction to Data Mining we read “Software packages will be adopted for hands-on data mining in real data sets.” Since there is no specific programming language mentioned, we wonder if this means that you would be using SAS rather than coding your own algorithms. After reading through all of the subject outlines we couldn’t see any more mention of programming languages other than SAS. This is not to say that there are no further programming requirements but it is safe to say that SAS is the core ‘language’ of this degree. After getting our hands on some of the lectures and tutorial exercises for JCU MDS units we were interested to see just how much this degree caters to industry. 110%. The new commitment to SAS, it seems, is integrated seamlessly into these materials.

Costs

Good news: if you are a domestic student then you are going to love this. Full fees for domestic students are $52,800 AU, which is on the lower end for domestic MDS fees, but the best news is that JCU has CSP places. If you are awarded a CSP then you will only contribute $18,704 AU, making it amongst the cheapest MDS in Australia.

International student fees are moderate at $63,000 AU. By comparison, Monash University domestic students pay $64,000 AU and international students at The University of Queensland pay $88,102 AU for a two year MDS degree.

English Language Requirements

English language requirements are standard at all universities and vary within a few points of each other dependent on the exam.

Academic IELTS – 6.5 (no component lower than 6.0)
TOEFL (paper-based) – 570 (with a minimum Test of Written English score of 4.5)
TOEFL (internet-based) – 90 (minimum writing score of 21)
Pearson (PTE Academic) – 64.

University standing

Full disclosure, we don’t really buy into rankings. While it’s true that Australian universities do vary in quality, financial ‘freedom’, facilities and research output, these characteristics are not correlated. Our editor has attended and worked for six universities and says

The worst teaching I ever received was at a Go8 university, the cruellest of research environments was too. The most supportive was at a private university and the most disorganised was a rural one.

LDS editor

Key statistics

JCU has 15.2k students of which 20% are international and 38% are postgraduate. JCU does have a gender imbalance, with 63 women studying for every 37 men. The staff to student ratio is 22.3 students to every staff member. We take these stats from the QS rankings.

The World University Rankings

JCU scores in the 201-250 band for the top 1000 universities in the world as ranked by The World University Rankings. Let’s look at that in detail. JCU is in the top 20.1% – 25% of universities ranked. JCU is considered a ‘young university’ (it was established in 1961) and ranks 28th in the world when compared to other universities of the same level of maturity. Why is this important? Well, older universities have more money.

More money, better everything.

It’s all relative. But when we look at what is relevant to MDS we can see a different story and not a good one.

The rankings are roughly worked out based on teaching, research, citations, industry income and international outlook. See our post on University Rankings for an explainer.

Based on Teaching, which we think is the most important score, JCU receives 23.5, which is not great. Their Industry income is modest at 41.9 and their international outlook is very good at 75.4.

Looking at the rankings for Engineering and technology specifically, teaching received an abysmal score of 19.9, industry income an okay 32.8 and international outlook a reasonable 72.7.

However, their ranking here has slipped into the 301-400 band, down from the 251-300 band in 2018. Not a good sign.

QS Top Universities

Again JCU has slipped from equal 367th to equal 369th place. However, they rank 43rd in the top 50 universities under 50 years old. JCU is ranked 18th in Australia (out of 43). Unfortunately, we can’t give more granular statistics here as they aren’t available.

Affiliations and accreditations

JCU is a member of the Innovative Research Universities Network which is a group of seven universities that undertake advocacy on issues related to higher education, research and university students.

Because we always get asked, no, JCU is not ACS accredited.

Summary

We feel that JCU is not actually offering an MDS. A better title would be business analytics. This is frustrating because it will produce students who are not equipped for data science roles and may either miss out on graduate opportunities or fail when they enter industry.

With so much focus on SAS, students will miss out on other languages. Learning different languages is hard but over two years you become adaptable and with practice become proficient. Practice means every day for several hours. But it seems that JCU will not offer this.

We are impressed by the pre-degree mathematics unit but did not see enough probability and statistics throughout the course. The structure, however, is fantastic. Having one unit over seven weeks will allow students to fully concentrate on that unit. The traditional load of four units per 13-week semester is out of date and makes it hard for working students to study, particularly when unit assessments overlap to an unreasonable extent.

Although the lack of exams seems appealing, we question their absence for units that are theory based such as those concerning ethics and policy.

However, hands-on units work better without exams, so scrapping them altogether is the lesser of two evils. Practice is more important than an examination where you have to smash out as much as you can remember. Exams don’t have StackOverflow and so are not a reflection of working life. JCU has replaced exams with log books for programming, lab time and exercises.

The organisation of the website is a bit confusing. Changing unit names and not updating them uniformly is the number one way to irritate and confuse new students. Get it together JCU marketing team.

Oh and just a note on the JCU MDS webpage: “One of the fastest Masters degree in this field in Australia” isn’t accurate. Two years is standard and other MDS offer early exit based on credit.

Pros

  • Accessible to non-cognate students
  • 100% Online delivery
  • CSP places for domestic students. Low fees in general
  • No exams
  • 7-week teaching blocks of one unit only

Cons

  • Minimal programming and statistics
  • Overreliance on SAS
  • Possibly not reflective of required industry skills
  • No research stream
  • Poor teaching reputation

And there you have it. JCU MDS may or may not be the right fit for you but that is now for you to make up your mind about.

Retweet this article for happy enrolment vibes @data_little

Thanks to Nicole Honeywill for providing the feature image for LDS to use. Find Nicole on Unsplash.

Are you in or out? Deadline draws closer for your My Health Record creation

The Australian Government’s new $2 billion AU health record information system, My Health Record, is set to improve how medical practitioners access, share and create information about your medical history. The initiative aims to improve the flow of accurate and timely information to GPs and other specialists in order to improve health outcomes for all Australians.

Here’s Dr Caroline Yates providing a great summary of why medical practitioners will be able to improve their quality of care if their patients are using My Health Record.

My Health Record is a fantastic initiative and could change the meaning of continuity of care in Australia, at least in theory.
But I, like many people, have some significant concerns over privacy, security and data access. And, as a data scientist, I’m concerned about data use, accuracy, integrity and completeness.

If you haven’t decided whether to opt out, you only have a few days from today. So let’s unpack the ins and outs of My Health Record before you make your mind up.

What are the benefits of having a My Health Record?

The official My Health Record website states that the benefits of creating a record are as follows.

Better Connected Care

Do you see one doctor all the time?

I don’t.

When I’m sick, with say a chest infection I can’t shake, I don’t care who I see at my local practice as long as it is ASAP. I know that doctors who work within that practice can access my previous doctor’s notes as long as my regular doctor entered them on the practice system.

My Health Record is said to be helpful to patients as it reduces their life admin including transferring medical records when they move. I’ve moved six times in 6 years and had to transfer practices a few times. It’s a simple process. I call my old GP and ask for my medical records to be released to my new GP, that’s it! My medical record follows me. But some people don’t do this, and that’s where the My Health Record would benefit them, as doctors at all practices will be able to see their medical history.

Where it may be of benefit is if I’m travelling interstate and I can’t see my regular GP. I mean, who transfers their records for a random one-off visit?

But the scenario where this would have the most impact is in the event of a life-threatening emergency. If you are unconscious, you can’t tell the doctors trying to help you what you think is wrong. Another way My Health Record could have a critical impact in this scenario is by preventing catastrophic drug reactions.

I went into anaphylactic shock and then my blood pressure bottomed out because they didn’t know I was on beta blockers. If they had known they would have given me other drugs as well to stop that.

– Patient who had an adverse reaction when given treatment

This is where My Health Record is helpful for adults in general. But there are specific groups of people where My Health Record will be even more beneficial.

Parents

minutes is difficult for parents; I can’t even do it for my health history in that time, let alone for a sick child. My Health Record is aimed at providing a holistic picture of a child’s health for this reason. Many a diagnosis has been delayed or missed due to incomplete childhood records, and My Health Record could prevent this. The autonomy of the child has also been taken into consideration, which I’m quite impressed by: children will be able to take control of their record after the age of 14.

Older Australians

The older you get, the denser your medical record becomes. As we age, our bodies are more susceptible to acute infections and chronic disease. For older Australians, My Health Record affords connected care while easing the burden of remembering all of these details. In cases where the individual is suffering cognitive decline, most likely in older age, this system will be beneficial to them and their health care practitioners.

New Australians

Non-English-speaking or non-fluent individuals may use My Health Record to overcome the issues stemming from the language barrier that could significantly impair their quality of care. This is all well and good, but I question the integrity of the data in instances where there may be information that is lost in translation but still entered. Additionally, I want to know how this information is made accessible to these patients online. Is it translated into their language?

People with Chronic disease

Up to 21% of drug-related hospital admissions are due to drug interactions in Australia. That’s pretty high. I have a family member with a fairly serious set of chronic diseases. They are meticulous with the information they give to their doctors and keep great track of the large volume of medications they have to take. But for them, their history and medications being accessible to all specialists would have saved them the pain of having to deal with the fallout from drug interactions, as they have had various specialists prescribe contraindicated drugs at the same time.

Can I control my own My Health Record?

Yep, you control your record. This means that you can choose what to delete and who can see your information. If you don’t want a doctor seeing something, you can change your access settings, but that still makes for an awkward conversation in an appointment. You can check who has accessed your record and even get automated notifications via email or SMS. Be aware that in an emergency your access controls may be overridden to, you know, save your life. This is the interface you would see if you were tailoring your My Health Record to your preferred level of access control.

Setting up a My Health Record through MyGov

Is it secure?

Depends on who you ask. The system is said to have:

A multi-layered and strong safeguards in place to protect your information including encryption, firewalls, secure login, authentication mechanisms and audit logging.

– My Health Record website

I’m beyond sceptical of this statement.

In the 2016–17 financial year there were 35 breaches of the My Health Record system, which rose to 42 in 2017–18. The Australian Digital Health Agency (ADHA) said that none of these was malicious….ah?

The ADHA insists that these breaches were the result of situations such as the wrong parent having access to a child’s record when they do not have custody of the child and fraud against the Medicare system through individuals accessing records which are not theirs. I’m pretty sure these could be seen as malicious. Most concerningly the majority of breaches were where the Department of Human Services used the same record as Medicare, which is a data integrity issue. In summary, there are system integrity and access control problems to overcome.

ADHA may say that My Health Record is secure, but this doesn’t mean all integrated systems are. The healthcare sector in Australia has a pretty shocking history of poor record management. For example, Family Planning NSW, a service that offers reproductive and sexual health services, was hacked in April 2018. Up to 8,000 records were held captive under ransomware, and the agency only became aware of this after clients contacted them and journalist Lauren Ingram notified people via Twitter.

Would you want your last STD check made publicly available?

Lauren Ingram’s Twitter post

Given the sensitivity of this data, I’m not sure I would be forgiving of such a scenario.

Who can access my My Health Record?

People who can access your record include

  • GPs
  • pharmacies
  • pathology laboratories
  • hospitals
  • specialists
  • allied professionals
  • secondary providers

That’s not vague at all.

At this point, your brain should be blaring, Danger, Danger! Abort, abort!

Not all employees are as ethical as they should be. Even though there are safeguards which flag when a hospital employee accesses a family record or a specific record too frequently, people can easily override this, and they do. One way this happens is because departments don’t adopt their EMR software or lack training in how to use it, making circumventing the system much more attractive.

I have seen thousands of records be transferred into Excel spreadsheets and become accessible by all in a department. This meant that to get to one record, the practitioners would scroll through the thousands of others and be able to read that sensitive information with no safeguards in place. The one password to the file, located on a shared drive, wasn’t changed in 2 years. In small country towns, this becomes problematic as you learn information about your neighbours or even colleagues that they would have never divulged to you. Friday night at the pub just became 100% more interesting (or awkward depending on how you see it).

Insurance agencies and employers are legally not allowed to access information on your My Health Record. One reason that the last deadline was extended was to allow for this legislation to pass through the Australian Parliament under the My Health Records Amendment (Strengthening Privacy) Bill 2018. Unauthorised civic access may result in a fine of $315,000, criminal conviction and up to 5 years jail time. Big woop, people do it all the time. Again, a systems issue.

This legislation also prevents privatisation of the My Health Record system. In saying that there seem to be some fair dodgy dealings going on that indicate some commercial conflicts of interest.

Law enforcement and government agency access is a little less clear, with the official My Health Record website stating:

To date, the Agency’s official operating policy has been that no information within My Health Record can be released without an order from a judicial officer. The Agency has never received such a request and has never released information. Under new laws, no information can be released to law enforcement or a government agency without your consent or an order from a judicial officer

– My Health Record

So I guess law enforcement can access your record.

The My Health Record website states that researchers are currently unable to access this data, but policies and frameworks are being drawn up to oversee this. It is likely that data will be used for public research after de-identification as early as 2020. However, media reports state that secondary use for research “and other purposes” is going ahead as well.

My Health Record information can be used for research and public health purposes in either a de-identified form, or in an identified form if the use is expressly consented to by the consumer

Department of Health spokesperson

The most concerning aspect of this is the “other purposes” part. This is actually a code for organisations such as pharmaceutical agencies. You need to make your mind up about how you feel about that. And no, this is not a conspiracy theory. Think about the 2017 Melbourne Datathon where 3 million rows of pharmaceutical sales via banner pharmacies were given to participants. There is nothing wrong with this but in reality, you could do a lot of damage with this if you really wanted to.

Why did they delay the deadline?

The opt-out period, where you can choose not to have a record generated, was initially set to end in November 2018, but Federal Health Minister Greg Hunt postponed this deadline until the 31st of January 2019 amid concerns about data privacy and security. Minister Hunt sought to reassure the public that all was well with the system and that even if they missed the deadline, they could still opt out at any point. What Minister Hunt neglects to mention here is that a record would be created for you after the 31st of January, although you can permanently delete this at any time.

The decision came after legislation strengthening privacy protections for the electronic health record system was amended in the Senate to include the extension.

SBS News

There have also been rumblings about a ‘small’ technical glitch from 2016 that has not yet been fixed. This bug could leave patients’ information inaccessible or out of date. If it’s still around, and I’m not saying it is, then it needs fixing.

Then there’s that ‘small’ matter of two total system crashes in November 2018. The media attention prompted people to try and opt out on the deadline in November 2018, and the unprecedented surge in website traffic crashed the system. Given the Federal Government’s failure to load test this and other previous initiatives, shout out to #CensusFail, it is likely we’ll see this issue again on the 31st of January. I don’t think #MyRecordFail has the same ring to it though.

Get in early to avoid this

Is my information accurate?

This is probably my biggest issue with the My Health Record system. It’s not the system itself but the users. The sad reality is that some doctors are just bad at their job. Because of this, the accuracy of the information could be quite low. If another doctor is using it, then there could be severe implications for a patient’s health. Fortunately, the Australian Medical Association is treating My Health Record with a healthy dose of “professional scepticism”, stating that doctors would likely be handling the records the same way as hospital discharge notes, by planning for a 10% margin of error. Because that’s not concerning at all.

How do I opt out?

If you want to opt out, you need to visit the My Health Record page. It’s actually a pretty easy process. You’ll need your Medicare card and driver’s licence (or passport). Here is a helpful video to guide you through.

Trials of nationally accessible EMRs began in 2015 when the system was called the Personally Controlled Electronic Health Record (PCEHR), and the eHealth record. Unfortunately, Australians didn’t respond well to this and location-specific trials recommenced in October 2016 with the system being renamed My Health Record. A little under 1 million Aussies registered in the opt-out participation trials. So if you lived in the Nepean region of the NSW Blue Mountains or Northern Queensland be aware that you may already have a My Health Record.

Hopefully, you have a little more understanding of some of the ins and outs of the My Health Record. I have opted out, but that may not be the best choice for you. Seek out further reading and make your mind up for yourself. You have two days.

Subscribe to LDS for more data related joy. Retweet us on Twitter and tag @data_little for happy recordkeeping times. Yes, recordkeeping is one word.

As always LDS cannot absolutely guarantee the correctness of this information and you should seek out further details through the many links in this article to the official My Health Record website.

Why your Twitter analysis sucks – and what you can do about it

I’ll start with my favourite word in the English language. CONTEXT CONTEXT CONTEXT!!!

My colleagues, friends, and students are at the point of rolling their eyes when I say it. But it is so important and the main reason your Twitter analysis sucks.

What do I mean by context? If you are going to datafy something (turn a tweet, a representation of a thought, emotion or idea, into data), then you need to think about the context of a) the user and why they tweeted, b) the dataset you are looking at, c) the problem you are trying to solve by datafying that tweet in the first place, and d) the tools you are using. Why?

BECAUSE NATURAL LANGUAGE PROCESSING NEEDS TO BE BESPOKE AND YOUR PRECONCEIVED ASSUMPTIONS WILL TRIP YOU UP.

NLP is like a game of chess, you need a strategy. This means knowing your next 15+ moves (this is my average number of function executions before I have to rework my pipeline).

Time and time again I see this sort of standardised pipeline for preprocessing as a method to get to the juicy bits (modelling), but if you are using these sorts of pipelines then you are going to produce a subpar analysis. Let me give you an example.

Let’s say you are an analyst working on a political campaign and your boss, the candidate, wants to know how people are feeling about a certain issue they are speaking about. You decide to do some sentiment analysis of a collection of tweets on #auspol and find that the sentiment towards the politician on that issue is moderately positive. Great! Now you can tell them they are doing the right thing and to ramp it up.

So they do, but they lose the election…and you get fired. WTF?

Did you think about the context of your sample? Let’s dissect what went wrong with this example, and explore why some critical and lateral thinking is a necessary ingredient in your analysis.

The English language is a nightmare

Tweets amplify the problematic issues around English, arguably the stupidest language on earth. Tweets are generally utter filth. They are noisy, short and use non-standard conventions. I have a love-hate relationship with them.

To demonstrate where this went wrong, I’ll use the Stanford Sentiment Analysis Treebank. Grey circles represent neutral sentiment, orange is negative and blue is positive. This is our original tweet (let’s say it was posted shortly after your candidate made a speech about the issue).

Original tweet

Now for the analysis. First, we make the string lowercase, then we have a few options that are ‘standard’ in the NLP preprocessing pipeline.

We can:

  1. Remove the punctuation and special characters, leaving the word that the hashtag is made of; or
  2. Remove the punctuation and special characters, getting rid of the hashtags completely.
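A minimal Python sketch of the two options, using only the standard library. The tweet below is made up for illustration (it is not the one analysed in this post) and the function names are my own, so treat it as a rough outline rather than the exact pipeline used here.

```python
import re


def keep_hashtag_words(tweet):
    """Option 1: lowercase, then strip punctuation and special characters.

    The '#' is removed but the word it was attached to stays in the text.
    """
    return re.sub(r"[^\w\s]", "", tweet.lower())


def drop_hashtags(tweet):
    """Option 2: lowercase, remove whole hashtags, then strip punctuation."""
    without_tags = re.sub(r"#\w+", "", tweet.lower())
    cleaned = re.sub(r"[^\w\s]", "", without_tags)
    # Collapse the whitespace left behind by the removed hashtag.
    return " ".join(cleaned.split())


# Hypothetical tweet, invented for this sketch.
tweet = "Great speech today, really inspiring stuff! #idiots"

option_1 = keep_hashtag_words(tweet)
option_2 = drop_hashtags(tweet)

print(option_1)  # great speech today really inspiring stuff idiots
print(option_2)  # great speech today really inspiring stuff
```

Notice that the only negative word in the made-up tweet lives in the hashtag, so option 2 throws it away before the sentiment model ever sees it.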

Let’s see what happens when we do this.

Option 1. Remove the punctuation and special characters, leaving the word that the hashtag is made of.

Result = Negative sentiment

Great, the sentence is picked up as negative. No issues. What about option 2?

Option 2. Remove the hashtag completely.


Result = Positive sentiment

Ok, that’s not great. Clearly, this sentence carries negative sentiment, but it’s been picked up as positive. That’s because human emotion is incredibly difficult to datafy. People often use hashtags to convey sarcasm, which is inherently negative in sentiment.

You might be thinking, ‘just leave the hashtags in there and that way I can capture the sarcasm’. Well yes maybe. But hashtags aren’t always singular words. Let’s expand that hashtag into #idiotpollies as in ‘idiot politicians’.

Typical hashtag

Result = Positive sentiment

And we are back to positive.

This is a hashtag issue. It’s almost impossible to assign sentiment to hashtags. You can split them but there are two main problems with this. Functions to split a hashtag into its root terms usually split the string at each capitalised letter they come across, e.g. “#AustraliaDayWeekend” becomes “Australia day weekend”. You need to adjust your pre-processing pipeline to make sure you don’t decapitalise before this is done.
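Here is a rough sketch of that kind of splitter in Python. The regular expression and the name `split_hashtag` are my own illustration, not a function from any particular library, so adapt it to whatever your pipeline already uses.

```python
import re


def split_hashtag(tag):
    """Split a hashtag into words wherever a capital letter starts a new one."""
    body = tag.lstrip("#")
    # Capitalised word | run of capitals | run of lowercase letters | digits
    parts = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", body)
    return " ".join(parts)


tag = "#AustraliaDayWeekend"

# Split first, lowercase second: the capitals are still there to split on.
split_then_lower = split_hashtag(tag).lower()

# Lowercase first, split second: nothing left to split on.
lower_then_split = split_hashtag(tag.lower())

print(split_then_lower)         # australia day weekend
print(lower_then_split)         # australiadayweekend
print(split_hashtag("#ScoMo"))  # Sco Mo
print(split_hashtag("#auspol")) # auspol
```

The order of the two steps is the whole point: once the string is lowercase, the splitter has no capitals to work with.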

Splitting would be fine except:

  • It won’t work when the hashtag has no capitals, e.g. #auspol #melbourneweather;
  • Some hashtags, once split, lose the original meaning that they had in their concatenated form;
  • If you are doing another analysis, like topic modelling, you would use the opportunity to treat them as a ‘word’ – hence different analysis needs different preprocessing (stop using the same data for a different model);
  • Non-standard or trending words like #ScoMo get caught in the crossfire. #ScoMo for Australian PM Scott Morrison gets turned to “sco mo” and all meaning is lost. Same for the long-running hashtag #auspol. Let’s look at this example.

#CaptainCook is a trending hashtag in Australia and relates to the debate about Australia Day. Not to get into it, but most tweets like this are sarcastic and negative, and also hilarious.

Now when we normalise the tweet by removing punctuation, splitting the hashtags on their capitals and then putting it into lowercase, we get:

“scott morrison has finally confirmed the dress code for australia day citizenship captain cook sco mo auspol”

And when we look for sentiment:

Result = Positive sentiment

Positive again. #ShockHorror.

The #CaptainCook hashtag is negative and sarcastic as a trending hashtag. But ‘Captain’ on its own is positive. If you were working for the Australian PM, you may be in a bit of trouble.

Now what?

We need to think about what words may be tricking the model into thinking this is a positive sentence. Can you guess?

It’s the word “can’t”. Since we removed the punctuation, getting ready to tokenise, the root word “can” is kept. “Can” is labelled as a positive term, hence the positive sentiment.

At this point, if you are still trying to catch me out you might say we can concatenate “can” with the “t” and use “cant”. Ok, let’s try that.

Result = Neutral sentiment

Nope, that returns a neutral sentiment and doesn’t capture the sentiment of the user at all.

You might also say we can manually label hashtags and split according to a dictionary. Sure, you could. And you should, but remember you will never capture all trends – the Twitterverse is too dynamic, and they change too quickly. I’ll go into this in another post.

The answer

Contraction expansion.

Dismiss your flashbacks to year 6 English and stay with me here. By expanding the contraction to “cannot”, which will be separated to “can not”, we save the sentiment…and your job.

Result = Negative sentiment

Finally! “Not” supersedes the positive “can”.
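A minimal sketch of contraction expansion in Python, run before the punctuation is stripped. The tiny lookup table, the example sentence and the function names are all assumptions for illustration; a real pipeline would need a much longer list.

```python
import re

# Small, hand-made lookup table (illustrative only, far from complete).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
}

_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded form."""
    # Tweets often use curly apostrophes, so normalise them first.
    text = text.replace("’", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def strip_punctuation(text):
    """Swap punctuation for spaces, as a simple tokeniser prep step."""
    return re.sub(r"[^\w\s]", " ", text)


sentence = "i can't wait"  # made-up example

tokens_without = strip_punctuation(sentence).split()
tokens_with = strip_punctuation(expand_contractions(sentence)).split()

print(tokens_without)  # ['i', 'can', 't', 'wait']
print(tokens_with)     # ['i', 'cannot', 'wait']
```

Without expansion the negation is gone and only “can” survives. With expansion, “cannot” reaches the sentiment step intact; splitting it further into “can not” is left to whatever tokeniser you use.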

The moral of the story is you need to know your problem, the context of your data, what it means for the pre-processing of your data, and what you may be tripped up on.

I hope this helps you think about new ways to improve your work, and hopefully not get fired!

As always, please share on Twitter and tag us as @data_little for happy coding binges.

Feature photo by JESHOOTS.COM on Unsplash

Thanks to Ed Farrel for his wonderful editing of yet another helpful post from The Little Data Scientist.

What happens when you can’t pay for your degree

Have you thought about how you are going to pay for your degree? You can just put it on HECS-HELP, right?

You would be forgiven for thinking this. I did. That is until I got an email saying I had a $96,000 HELP debt. I had reached the maximum that the government would loan me and I had seven units left to go, nearly a full year.

You might be reading this and think ‘oh that’s not me, I don’t have that much’ but you would be surprised. Be honest, do you know what you owe? Right now? To the dollar? Yeah, didn’t think so.

FYI, the maximum limit in 2019 is $104,440. It seems like a lot, but let’s look at an average student from Australia, interested in pursuing a post-graduate degree in Data Science or IT. This table is fairly representative of the current fees. International students (except for Kiwis #fam) can’t access HECS-HELP, all the more reason to read this article.

Degree                 Domestic Full   Domestic CSP   International
Ba. Comp. Sci          $97,500         $28,077        $124,200
Ba. Comp. Sci (hons)   $130,000        $37,436        $165,600
Ma. DataSci            $65,000         N/A            $82,800
Ma. IT                 $65,000         N/A            $82,800

A domestic student who has a Commonwealth Supported Place (CSP) in their undergraduate degree and then goes on to do either a Masters of IT or Data Science will be indebted to the government for $93,077.

Have you thought about that? That’s a house deposit in Melbourne (sorry Sydney folks, keep saving). And if you do honours then that jumps up to $102,436. International fees are even scarier: $207,000, and $248,400 if you do honours.
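If you want to run these sums against your own fees, here is a minimal Python sketch. The figures are copied from the table above; the dictionary layout and the `total_debt` function are just my own illustration, and it assumes a CSP bachelor followed by a full-fee Masters for domestic students.

```python
# Fees copied from the table above (AUD). None marks "N/A".
FEES = {
    "Ba. Comp. Sci": {"full": 97_500, "csp": 28_077, "international": 124_200},
    "Ba. Comp. Sci (hons)": {"full": 130_000, "csp": 37_436, "international": 165_600},
    "Ma. DataSci": {"full": 65_000, "csp": None, "international": 82_800},
    "Ma. IT": {"full": 65_000, "csp": None, "international": 82_800},
}

HELP_LIMIT_2019 = 104_440  # the 2019 maximum quoted in this post


def total_debt(undergrad, undergrad_rate, postgrad, postgrad_rate):
    """Add the undergraduate and postgraduate fees for the chosen rates."""
    return FEES[undergrad][undergrad_rate] + FEES[postgrad][postgrad_rate]


domestic = total_debt("Ba. Comp. Sci", "csp", "Ma. DataSci", "full")
domestic_hons = total_debt("Ba. Comp. Sci (hons)", "csp", "Ma. DataSci", "full")
international = total_debt(
    "Ba. Comp. Sci", "international", "Ma. DataSci", "international"
)
international_hons = total_debt(
    "Ba. Comp. Sci (hons)", "international", "Ma. DataSci", "international"
)

for label, amount in [
    ("Domestic", domestic),
    ("Domestic with honours", domestic_hons),
    ("International", international),
    ("International with honours", international_hons),
]:
    print(f"{label}: ${amount:,}")

# How much room is left under the HELP limit for a domestic student.
print(f"Left under the limit: ${HELP_LIMIT_2019 - domestic:,}")
print(f"Left under the limit (honours): ${HELP_LIMIT_2019 - domestic_hons:,}")
```

The last two lines are the ones to watch: with honours there is very little headroom left if you fail or repeat a unit.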

So how did I hit that threshold? Two bachelors with honours and 1.5 years in law school. There was a small twist in my story, more on that later.

It’s frustrating because students are told they can ‘figure their pathway out at uni’, swapping between courses, but they can’t afford to do this.

– Student fees officer at a Go8 University

Now don’t stress too much just yet. I’m going to take you through things you MUST be aware of if you are going after a post-graduate degree.

Find out how much you owe so far

You can check where you stand here at the UniAssist website. You will need your CHESSN. Your CHESSN is your ‘Commonwealth Higher Education Student Support Number’, which is found at the top of your CAN (Commonwealth Assistance Notice). CANs are distributed after the census date each teaching period. I’ll cover census dates shortly.

Check that no mistakes have been made

If I had done this, it would have saved so many moments of panic, frustration and one broken coffee mug I may have thrown in anger; at the floor, not anyone in particular. Side note, if smashing crockery seems like a reasonable anger management strategy, I recommend the Smash Room in Melbourne, where you can literally smash plates.

In early 2016 I got sick and was in the hospital for a month so couldn’t complete my semester. Because this happened after the census date, I had already been charged. Now when you are sick in the hospital the last thing on your mind is your accumulated HELP debt, so I didn’t pick up on it until the following semester.

Accumulated HELP debt — The total of any HECS-HELP, OS-HELP, FEE-HELP, VET FEE-HELP/VET Student Loans or SA-HELP debts you have incurred (including any Australian Government study loans incurred before 2005).

Fortunately, I could apply for a retrospective withdrawal and after much running around, I got an email saying it had been approved. Case closed I thought.

Rookie error. The University may have approved it, but that didn’t mean the Government did.

Six months later, when I got that lovely letter saying I had run out of money, I realised that I was still being charged $14,800 for that semester I couldn’t complete.

I had to contact the Department of Education AND the Australian Taxation Office and after several genuinely pleasant discussions with various officers, I finally got a retraction. Shout out to the ATO because they were super across this.

While I was still pushing for the retraction the census date was looming for my next semester and I didn’t have the funds. This is what happened:

  • I nearly pulled out of my Masters and took the alternative exit option to get a Graduate Diploma in Data Science.
  • I hustled for a scholarship that apparently didn’t exist.
  • I begged, yeah, I begged hard.
  • My collection of plates faced some collateral damage.
  • I got a road-worthy ($120) to sell my car and pay upfront.

Fortunately, I was granted credit for my former research roles. It turns out my honours thesis was good for something after all.

How to prevent this happening to you

The following is a set of ahh….provocations to help you, hopefully, avoid this scenario and achieve your qualification without having to pay more than you expected.

Are you eligible for any credits (not exclusions)?

Credits mean you don’t have to do or be charged for a unit. These are usually given if you have done an equivalent unit at the same level e.g. post-graduate, or for appropriate work experience which you need to prove.

Credits saved me, but you must speak with your university about whether they are the most appropriate option.

Can you take the path of least resistance?

Think about your strengths. Is writing your forte? Take more writing-intensive units. Do you do poorly in exams relative to in-semester assessments? Take units with no exams. Are you kind of poor in the math department? Don’t take as many mathy units. Ok, that kind of hurts me to write, but it’s true.

In the degree I teach we have two streams, one is quite rigid and has some heavy technical units. I see so many students pursuing this stream and failing unit after unit, racking up more and more debt. All because they envision some sort of prestige attached to that stream.

It’s not a minor. It doesn’t even go on your degree, no one will ever know you did that.

– Me, literally every semester when I have a failing student crying in my office

If you want to have a career in academia, you need to get a thesis under your belt, if not usually a capstone. A thesis usually has a hurdle which means that you need to get a certain WAM or do certain units. You really can’t afford to fail a unit. Take the easy route.

My course director’s advice to me during this time still sticks with me, a sign of my slowly developing Machiavellian tendencies, a necessary requirement to succeed in academia (allegedly).

Play the game.

– Jaded academic

Understand what Census dates mean

How many times have you seen that email or notification about census dates…..and promptly archived or deleted it? Stop doing that. Read them, for the love of god read them.

Census date — This date is set by providers and it is the legal deadline for various requirements, like making an upfront payment of your student contributions, applying for a HECS-HELP loan or formally withdrawing your enrolment so you do not incur a HELP debt.

Sometimes students enrol in units incorrectly or spend two weeks in them before deciding to bail. You have a period of time each semester before you get charged for a unit, so you must do this before the census date. This also goes for credit applications. See your faculty administrators to do this.

For CSP students (mainly undergraduates), you must submit enrolment and re-enrolment forms declaring yourself as a CSP student prior to the census date. If you don’t, you are going to go from a $4,679 semester to a $16,250 one. That’s the price of a brand new small car.
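Those two per-semester figures line up with the bachelor fees in the table earlier in this post split across six semesters. The three-year, two-semesters-a-year split is my assumption here, so swap in your own course length and fees.

```python
SEMESTERS = 3 * 2  # assumed: a three-year degree, two semesters per year

csp_total = 28_077       # Ba. Comp. Sci, Domestic CSP (from the fee table)
full_fee_total = 97_500  # Ba. Comp. Sci, Domestic Full (from the fee table)

# Whole dollars per semester.
csp_per_semester = csp_total // SEMESTERS
full_per_semester = full_fee_total // SEMESTERS

# What missing the CSP paperwork would cost you each semester.
extra_per_semester = full_per_semester - csp_per_semester

print(f"CSP semester: ${csp_per_semester:,}")
print(f"Full-fee semester: ${full_per_semester:,}")
print(f"Difference: ${extra_per_semester:,}")
```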

For all Australian students, make sure you have a Tax File Number (TFN) well before your first semester. You will need this to apply for HECS-HELP. Apply for one here.

Remember each university, and sometimes even faculties within the university, have different census dates. You need to get across them early.

Can you afford to pay some upfront costs?

If you have run out of available funding can you afford to pay some of your units up front? Looking at all of your options early in your degree or even before, may allow you to save for a year or so until you exceed any remaining cap. If you are close to the cap but look like you can finish the degree, save for a unit anyway just in case you fail.

Try to get a Scholarship

If you have a good WAM (80%+) you may be offered a scholarship or partial scholarship for your course. For most universities this is the WAM of your last degree, not your current WAM. You will need to check with your university about these. There are often equity scholarships available for students from disadvantaged backgrounds as well. But for the rest of us, there is sweet FA.

Consider a loan

I don’t recommend this option, but if needs be. Personal loans for tuition fees and living costs are taken up by many students, but remember, you need to pay them back and that starts from the minute that money is deposited in your account. The interest rates can vary but they can be as high as 15% for unsecured loans. You should also think about how this will affect your future credit rating and quality of living (Mi goreng packet noodles are not a food group). Have a look at loan rates for tuition here. Applying for multiple loans is a one-way ticket to a crappy credit score. Check your score here.

On the topic of loans, avoid payday loans. Actually avoid isn’t strong enough. Do NOT get a payday loan. Why? Because a 47% interest rate isn’t going to work out for you in the long run. Read the story of how a $600 loan turned into a financial nightmare.

Hopefully you have learned something helpful here. As always, please retweet our articles to share financial wisdom. 

Photo by Kevin McCutcheon on Unsplash Give some love to Kevin and check his work out.

All information given in this post should be considered as opinion only and not a reflection of any specific university. You should conduct further research and not regard this article as comprehensive and fully accurate. Pretty much, I take no responsibility for what you do with the information gained from this article.

Jupyter Convert: How to get a Table of Contents

If you are like me you prefer a readable, aesthetically pleasing, structured bit of code to rip off and make your own. Ahhh, I mean use as a template. But sometimes, especially when I first started out, I found I was discouraged from trying to understand what the author had done by the hostile-looking script they had produced. If I asked for help when I was really struggling I would get shown this sort of thing and die on the inside.

Nope!

I tried a few different IDEs and was flailing. About 6 months into learning how to program I came across IPython, known today as my beloved Jupyter Notebook. #Jupyter4Lyfe

What was this clear, intuitive and aesthetically pleasing sorcery? I played about and was an immediate convert. I now write my students’ assignment instructions in Jupyter and really wish other lecturers would adopt this practice rather than handing out instructions (you know who you are). But I digress.

I have had ‘real’ programmers scoff at me for not using a ‘real IDE’ a few times. I laugh at them because screw Linux and Sublime. By using Jupyter I know I will be reaching a large group of people who aren’t programmers by trade and will now be able to get a transparent overview of what’s been done. Also, unless I’m doing a GitHub commit, I can’t be bothered with readme files until I am finished. I prefer to keep notes in the notebook.

Anyway, I love love love Jupyter Notebooks. But as I have started doing experiments requiring non-linear runs (changing what pre-processing strategy I am using for example) I find I have to write really long scripts. So I wanted to find a way to find the sections I am after without having to do a million scrolls. And so entered the humble dynamic table of contents (TOC). This is the magic of a TOC. Side note, if you are allowed to submit assignments in Jupyter, do it and install a TOC. Your tutors/ lecturers will thank you. Make their job easy and they will be inclined to give you a good mark.

Behold!

The Magic

Before we start, I am a Mac user and so use bash (Bourne Again SHell). Windows users will use cmd.exe and PowerShell. If you are a Windows user I am so so sorry for your loss, your loss of access to 95% of your OS capabilities. So for Windows users, step one is get a Mac.

Install pip

Mac users head to this pretty walkthrough to install pip.
Windows users head over here to install pip.

Installation of nbextensions

Alright, the jupyter_nbextensions_configurator is a type of extension that beefs up the capabilities of Jupyter and adds new buttons for formatting, enhancing functions etc. to the interface you see in Jupyter. After you add this extension it will load automatically and you won’t need to reset it every time. There are so many options in the nbextensions configurator that we will go through them in another article, but if you want all of the info, as always, read the docs.

If you are a conda user

In your terminal or cmd, type the following:

conda install -c conda-forge jupyter_nbextensions_configurator

If you prefer pip

I prefer pip, but it takes two steps to set up nbextensions rather than the one you need with conda.


Step 1. Install the pip package by copying this into your terminal or cmd.

pip install jupyter_nbextensions_configurator

Step 2. Configure the notebook server to load the extension.
This is done with something called a Jupyter sub-command, which just tells the actual notebook to install the server extension, kind of like when you use ‘import some-package’ in your notebook after you have done ‘pip install some-package’ in your terminal or cmd. Now copy and paste this in:

jupyter nbextensions_configurator enable --user

You will get something that looks like this:

Yes I have a pink command line.

Choosing your options in your brand new nbextensions panel

Now open up a new jupyter notebook and you should see a new option called Nbextensions.

Shiny new extension.

Click on the panel and check the Table of Contents (2) and toc/loc configurations. Like this:

Required options for TOC.

Collapsing Sections

Before you start, I recommend ticking the collapsible headings option as well. It’s the fourth from the top of the first column of options. You will see why shortly.

Ok, we are ready to go. Open up a new notebook and enable table of contents. Like this.

Then you will need to do a little configuring of the ToC2 settings. When you click on the TOC button you will see a sidebar on the left-hand side called ‘Contents’. It will have a refresh icon and a settings icon. Click the settings icon. Then enable all of the settings you want.

I recommend all of these, particularly the ‘Leave h1 items out of ToC’ checkbox. Otherwise, you will get a number 1 next to your TOC, which looks bad and throws off the numbering.

Getting a TOC in your notebook

To put in your first heading, set the cell to markdown and then use two ‘##’ to call a main header.

To put in a subheading use three ‘###’.

You can keep putting in sub-headers with up to five ‘#’ so that you have four levels of sub-headings like this.

And there you are, automatically generated TOC in your notebook.
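If you are curious how a pile of ‘#’ headings turns into a numbered TOC, here is a small Python sketch of the idea. To be clear, this is not the extension’s actual code; `build_toc` and its `skip_h1` flag are names I made up, and it assumes you hand it the text of your markdown cells only.

```python
import re


def build_toc(markdown_cells, skip_h1=True):
    """Return numbered, indented TOC lines from markdown cell text."""
    top_level = 2 if skip_h1 else 1  # mimic leaving h1 items out of the TOC
    counters = []
    toc = []
    for cell in markdown_cells:
        for line in cell.splitlines():
            match = re.match(r"^(#{1,6})\s+(.*\S)", line)
            if not match:
                continue  # not a heading line
            level = len(match.group(1))
            if level < top_level:
                continue  # h1 is left out of the numbering
            depth = level - top_level + 1
            # Forget counters for deeper levels, add new ones if we go deeper.
            counters = counters[:depth]
            while len(counters) < depth:
                counters.append(0)
            counters[-1] += 1
            number = ".".join(str(count) for count in counters)
            toc.append("  " * (depth - 1) + f"{number} {match.group(2)}")
    return toc


# Made-up notebook headings for illustration.
cells = [
    "# My notebook",
    "## Load data",
    "### Read CSV\nsome notes under the heading",
    "### Clean",
    "## Model",
]

toc = build_toc(cells)
print("\n".join(toc))
```

With h1 left out, the ‘##’ headings become sections 1 and 2 and the ‘###’ headings become 1.1 and 1.2, which is the numbering behaviour described above.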

Related features

Colour coding

You will notice that the TOC is highlighted here in yellow. This indicates where you are in your notebook. When you run a cell, the section that cell belongs to will turn red. Here is an example:

This is pretty handy when you are running code that takes a while and you are messing about with something else in the notebook.

Navigator and Sidebar

Go up to your main set of tabs (File, Edit, View etc.) and click on the ‘Navigate’ tab. You will have to expand the box to see the contents, which you can click on. This option isn’t one I use too much because I like the sidebar. Speaking of the sidebar, you can move it around to wherever you want it to be.

And that’s it. You now have a TOC which will make your work more easily navigated, and anyone you distribute the notebook to will be happy that the TOC is in place regardless of whether they use these extensions.

If you got something out of this walkthrough we would love if you could spread the word. Twitter is a great way to do this.

Credit to Bundo Kim for the amazing feature photo. For more of Bundo’s work please head to unsplash @bundo

Sorry, but you need a Masters degree to be a Data Scientist

You want to be a Data Scientist, and you’ve read that you might not actually need a Masters degree or even a PhD to get the job of your dreams. Let me guess? You’ve been told this by various ‘experts’, and now you have some doubts.

First of all, there are some major sceptics out there who probably aren’t actually Data Scientists. Yep, we know about them, and we know better. You want the truth? You need a Masters degree and/or PhD to get a fulfilling, progressive and well-paid career in Data Science.

The idea of completing a bunch of online courses and MOOCs then landing a six-figure position without a Masters is possible but improbable. Particularly in Australia, where a Masters is essential to getting a legitimate role with career progression and skill development. And unlike the US, PhDs are very very valuable.

“You need a Masters degree or PhD to get a fulfilling, progressive and well-paid career in Data Science.”

– Legitimately every Data Scientist

Something to remember is that the ‘field’ is relatively new. Stories of people who have managed to build up their skills and now have a fantastic career in Data Science are often people who finished their Computer Science, Mathematics or Statistics related bachelors 8+ years ago. They have been able to transition into these roles as the position became more defined. These people are driven and are continually honing their skills through online courses, Kaggle competitions and seeking out skill development opportunities.

As the field developed, more post-graduate degrees were marketed for Data Science. The first batches of graduates entered the Australian job market in 2017. Since then employers have come to expect a Masters degree in Data Science because there are so many people with that qualification, and the number is rising.

It is possible to get a job in Data Science without a post-graduate degree, but you have to be lucky, connected or a genius math nerd with pro-hacking skills. Now be honest with yourself, is that you?

Before we look into the job market, let’s address a significant motivator in why people do a Masters, money.

Data Science degrees are expensive. Really expensive. The average cost of a 2 year Master of Data Science in Australia is $65,000 for a domestic student and $80,000 for an international student. It’s a considerable investment, especially if you are already $45,000 in debt from your undergraduate degree. But the good news is that salaries for Post-graduates are pretty competitive. The Graduate Outcomes Survey statistics back this up. 


With an undergraduate degree in Computer Science or Information Systems, you can expect a median graduate salary of $60,000 pa. For Post-graduate that jumps to $96,000 pa. That’s a 60% increase! Graduates with a Bachelor in Science, by comparison, can expect a median graduate salary of $63,000, rising to $78,300 for those holding a post-graduate degree. The average full-time Australian wage is $82,500 in June 2018.

Comp.Sci and Info. System 2018 undergrad salary $60,000

Comp.Sci and Info. System 2018 postgrad salary $96,000

– 2018 Australian Graduate Outcome Survey
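A quick way to check the size of those jumps yourself. The salary figures are the ones quoted above; the helper function is just my own illustration.

```python
def percent_increase(before, after):
    """Percentage change going from the 'before' salary to the 'after' salary."""
    return (after - before) * 100 / before


# Median graduate salaries quoted above (AUD per year).
comp_sci_uplift = percent_increase(60_000, 96_000)
science_uplift = percent_increase(63_000, 78_300)

print(f"Comp. Sci and Info. Systems: {comp_sci_uplift:.1f}% increase")
print(f"Science: {science_uplift:.1f}% increase")
```

On these numbers the post-graduate uplift for Computer Science and Information Systems is well over double the uplift for Science.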

These numbers are compelling. Clearly, a post-graduate degree is valuable to employers.

To get a good overview of the Australian Data Science job market, we hopped on seek.com to search for jobs with the title ‘Data Scientist’ which returned 262 jobs. Breaking them up into pay bands, we start to get an understanding of the Data Science job market and why people say ‘you don’t need a Masters’.

The $50,000 – $60,000 pay band

Setting the pay band filter to reflect the low end of the median undergraduate salary, we immediately see a few scary trends.

Some recruiters are opportunistic

Reading these ads, the wise words of Darryl Kerrigan come to mind

“Experienced Data Scientist for 50-60k? Tell’em they’re dreaming”

– Darryl Kerrigan if he was a Data Scientist

For example, this is a job advertised for the Australian Federal Government, but it is more likely for a contractor. This job is a classic example of ridiculous opportunistic recruiters trying to rip off legitimate postings for senior roles. Ignore and don’t look at these recruiters again.

A job that is advertised for a ‘Data Scientist’ isn’t always for a Data Scientist

Unfortunately, with the hype around Big Data, companies and recruiters have taken to using the title a little too liberally. It might be disheartening to see jobs advertised as low as $50-60k, but in reality, these jobs will be meant for graduates of a business or commerce bachelors degree. If you took these jobs, you will likely be using proprietary tools and do little development work.

The $60,000 – $90,000 pay band

When we increase the pay band, we get more legitimate roles. But there aren’t many. At first glance, people will look at this and see 12 Data Science jobs. But are there?

The liberal use of ‘Data Scientist’ is evident again. Out of the 12 that were returned we got:

2 x Junior Data Scientist/Data Analysts

1 x Digital Analyst

2 x Junior Software Engineers

1 x Junior Campaign Analyst

1 x Procurement Specialist

3 x Data Scientists

1 x Junior Data Scientist

1 x Consultant/BI and Analytics

We excluded the Junior Software Engineer and Procurement Specialist roles because they are not relevant. The two Junior Data Scientist/Data Analyst jobs, along with the Junior Campaign Analyst, Digital Analyst and Consultant/BI and Analytics jobs, are examples of roles meant for business or commerce undergrads. In fact, one ad even says:

“Degree with a strong quantitive focus such as; statistics, physics, psychology, economics or commerce”

– Badly targeted job ad

Little room to learn skills in Machine Learning, Programming and Workflow Development

These types of job posts list SAS and other proprietary software experience as a requirement. They don’t tend to require you to program in a broad range of languages or actually plan your own analyses. It is unlikely these jobs will set you up for hardcore Machine Learning and AI development roles in the future.

Some companies go fishing for a Masters without mentioning a Masters

Some companies will specify a particular type of undergraduate degree and also ask for a minimum of 2 to 3 years’ experience with Python, Scala, Spark or similar. These roles are almost impossible to get unless you have been working in the industry. Since it is hard to get a role where you are actually developing these skills as an undergraduate (e.g. this role), your best bet is a Masters or higher to compensate. These types of recruiters are casting the net wide but know what they want, and they want it cheap.

From here, the remaining three positions for Data Scientists are offered at $80,000 pa. All want a PhD or Master’s degree.

The $90,000 – $130,000 pay band

The Masters-Experience tradeoff

Moving up, we see 85% of the ads specify a required PhD or Masters, often with 2-8 years of experience tacked on for good measure. Think of it like a trade-off. A PhD graduate with no industry experience already has 3 years working on a project with real deadlines, often with industry partners. A Masters student will be doing projects throughout the 2 years, and a lot of them have scored part-time data analyst roles during their degree based on their undergrad skills. Undergrads are unlikely to have that opportunity.


Don’t be fooled by the words ‘preferred’ and ‘or’. ‘Preferred’ means ‘must’. The market is flooded with Masters graduates who have an advantage over undergrads. Unless you have 3 years of project management experience and some serious skills, you can’t compete. Remember, HR and recruiters run the show, not the team you will work with. Here are some examples.

PhDs are valued in Australia

Actually, some recruiters are getting desperate for them. Contrary to the American rhetoric that came out of the post-GFC job crash, a PhD will put you in good stead in Australia for a career in Data Science. It’s just a hard route to go. But clearly they are getting desperate and it’s kind of awkward.

The $130,000 + Pay Band

These roles are difficult to get, but there is a big demand for them, mainly because they require business acumen and consulting experience. However, we still see some odd marketing tactics with interesting use of emojis. This one is on for $150,000 pa.

“If you are a Data Scientist who could also walk into a high performing Senior Insights role with a Consulting firm (eg, presentation skills, customer engagement, strategy and planning, insights to find opportunities,  drive sales etc.) then stop what you are doing and send me your CV !- I need to speak with you immediately :-)”

– Desperado

In fact, some are even asking for journal publications.

Have you published?

I love reading this. As someone with a Masters in Data Science, now doing a PhD in Text Analytics and Organisational Performance Analytics, I know I have a good pay packet waiting for me if I choose to step away from academia.

So with that, I hope you now have a better idea of the Australian job market for Data Scientists and why the following articles (from the USA) are bull.

Cleverism – 4 Reasons Not To Get That Masters in Data Science

Forbes – You Can Get A Data Analytics Job Without A Masters In Data Science

Topbots – You Don’t Need a PhD to Master Machine Learning & Data Science

How to pull an ‘all-nighter’ The dry eyes edition

Ok, first of all, don’t.

Second, if you are, then you need to get your shit together and learn some time management skills. But that being said, life does get in the way and the 9 to 5 workday no longer exists. All-nighters are sadly a fact of life, and at some point, you will need to pull one. So if you’re gonna do something, you may as well do it right.

Today we are reviewing one of the most visible parts of your body that you will punish when you pull an all-nighter. Your eyes, your poor, sore, dry and blood-shot little eyeballs.

Our eyes, like the rest of our body, need sleep (between 7 and 9 hours a night). When you pull an all-nighter, you deprive them of the opportunity to rest and replenish.

Often you’ll find that the scratchy sore feeling develops 2-3 hours after what would have been your usual bedtime. This is especially true in a coding binge where you are focusing for hours on a screen right in front of you. Optometrists call this set of symptoms Asthenopia, also known as tired-eyes or eye-strain.

So why do they get this way?

Well, there are a couple of interconnected reasons. The first is that the tiny extraocular muscles that control our eyeball movement and eyelid elevation (for blinking) get tired. This means they can’t maintain the rate at which we usually blink. Less blinking means fewer tears spread out over your eye, so it will dry out quicker than normal. You’ll notice your vision will blur and it’s hard to focus. This is a result of the smaller muscles that control pupillary reaction becoming fatigued. Your eyes will also react more slowly to low or super bright levels of light. All of this strain generates fluid buildup through inflammation, making tear generation harder and harder.

The workaround

Eye drops

These will place artificial tears on the eye and provide the much-needed moisture they are craving. Some work better than others for different people, so you might want to try a few. Make sure to change bottles every month or so to avoid giving yourself infections. Conjunctivitis is gross.

Step away from the computer


Every 20 – 30 minutes give those ballers a break. When you use a computer, your blink rate drops on average from 5-6 blinks per minute to just 2 – 3. Fewer blinks, more dryness. You also want to give the other muscles that are straining hard to focus a break.

Use a cool eye compress


Take 10-15 minutes to use a cool (not frozen) eye compress to reduce the inflammation in your eyes. This will aid in relieving any headaches you are getting because of the eye strain.

Put your specs on!


Ahh, the catch-cry of your mum, for those of you who wore them as a kid. Your glasses will help reduce the strain your eyes are under. That’s why you got them. If you wear contacts, switch to your glasses if you can. Contacts worsen dry eyes when worn for long periods, like the 36 hours of hell you are getting through.

Adjust your lighting


A dim room will force your eyes to work harder than they need to. Natural light is preferred but, you know, it is probably night time. So use a bright bulb, 100 Watts (6500K), to light up your workspace. Some people prefer halogen bulbs over fluorescent, citing improved focus. Agreed. There has been an excessive amount of organisational research into the effects of light on productivity. Light up your workspace!

Use a humidifier


Humidifiers put some moisture back in the air to help with the dryness in your eyes. Your nose and throat will also thank you, as your mucous membranes will suffer from the lack of sleep and dry up as well. Internal heating and cooling, particularly refrigerated air conditioning, will suck the moisture out of the air, so if you live in a hot or cold climate, think about buying a humidifier.

Right, I hope these tips aid in a successful and less painful all-nighter. Look out for more guides in the ‘How to pull an all-nighter’ series. In the meantime, I have three words for you, TIME MANAGEMENT SKILLS.

Data Science Podcasts we love

Ahhh Podcasts, the ear joy we love to listen to.

If you are anything like us, your AirPods will be permanently implanted in your ear, and your significant other will regularly get told ‘hold on, I just need to pause this podcast’ when they try and speak to you.

Whether you enjoy fiction, current affairs, comedy or learning about new skills, there is a podcast for almost everything. So we at the LDS thought we would share our most beloved Data Science podcasts with you.

Partially Derivative – USA

Sadly ending in 2017, Partially Derivative was a standout favourite of ours. Each episode the three hosts, Jonathon, Vidya and Chris, get together and talk about everything from the latest trends and innovations, failures and launches, and social impact issues to hilarious examples of applications such as the ANN-generated 8th Harry Potter book. Check out all 103 episodes here or on whatever platform you use.

Data Science Ethics – USA

Fricken love this podcast. Mainly because there is such a big call for people in the field to start addressing these issues. These ladies explain ethical concepts we encounter every day with examples like Blockchain, Morality in Machines, NLP and Google’s AI-generated assistant. Head over to their site for all the episodes here.

Data Futurology – Australia

A relative newcomer hosted by Felipe Flores, a Data Scientist and self-described Data Futurologist. Felipe’s podcast is a winner when it comes to industry-based Data Science. He interviews a variety of industry leaders in Australia and internationally. If you want to understand how Data Science is being employed in industry, head here to listen.

Linear Digressions – USA

Ben and Katie talk about various Machine Learning and Data Science tools and get you up to speed on the State of the Art. They spend 15 minutes or so explaining these tools and how they work in a way that isn’t basic, but isn’t so difficult that you feel stupid (there are a lot of podcasts that over-explain this way). Examples of topics include Word2Vec, Capsule Networks, Git for Data Science, Agile Data Science and Fractal Dimensions. Hear them gently explain everything here.

Data Skeptic – USA

A long-running podcast by Kyle Polich, who covers everything from domain expert interviews to tutorials and advice. Kyle is a magnificent presenter and calls a spade a spade. He calls bullshit on a lot of the detritus floating about around Data Science. Episodes run between 15 and 60 minutes, cover technical concepts like NLP, time-series and gamification, and are pretty much ‘Data Science easy listening’.

The O’Reilly Data Show – USA

A fairly technical podcast often reserved for long car trips or plane rides. Ben Lorica hosts the podcast, interviewing industry experts and getting into the nitty gritty of the latest trends in both academic and industry-based Data Science. Ben isn’t exactly the smoothest host, but you will definitely be well informed. Listen here.

Concerning AI – USA

Amazing discussions on the socio-technical implications (good and bad) that are emerging from the wonderful world of AI. The show uses a debate/discussion style format between the two hosts, Ted Sarvata and Brandon Sanders. Concerning AI is a bit of a trailblazer as it introduces the discussion of the ‘existential risk’ of our craft in an accessible format. Have a listen here.

Data Stories – USA

This is a podcast with a focus on visualisation – super important. The hosts Enrico Bertini and Moritz Stefaner are experts in the field and interview guests from both industry and academia. They do an awesome job of allowing the people they interview to guide the story. This isn’t always the case in some podcasts. The narrative style makes for easy listening, particularly for those in academia. Head over here to listen.

Of course, there are many more, but these are the standouts that are pinned to our favourites list. If you aren’t a podcast listener, we highly recommend them for passive learning. Our preferred platforms are Stitcher, Castbox and Podcast Player.

Now go get some ear joy!

Best MOOCs for Data Science students in 2019

Starting a Data Science or Comp. Sci degree in 2019? Here are all the MOOCs that we recommend for you to get a head start on your study.

The NUMBER ONE question I get asked from incoming Data Science students is

“What materials or courses do you recommend to prepare for classes?”

I got this question so frequently last year that I wrote up all the MOOCs (Massive Open Online Courses) I had personally completed in the last 3 years and gave it to the course director to send out. Considering that the number of Data Science students in Australia has doubled since I did mine, I figure there is a fair appetite for lists like these and so have updated mine just for you.

Now before you jump in I want you to keep the following in mind:

  1. There is no substitute for practice. You cannot read your way to becoming a decent programmer. You have to code.
  2. Both Python and R are great languages. If you are reading this, I’m guessing you haven’t started to specialise. Try and get across both languages. I used to be 100% R and now I hardly ever use it, preferring Python instead. You will need to be adaptable. Not everyone uses just 1, 2 or even 3 languages. I get caught out all the time with janky code that requires Java, Python, Matlab and C++ all for one installation.
  3. Cost doesn’t mean quality. Some of what I have put up is free but some have a price tag (no we do not get sponsorship). This also goes for the certificates. You don’t need them, especially if you are doing a university degree. Look out for price drops as well. Udemy has heaps of discounts throughout the year (hello black Friday) but it can be a bit hit and miss.

Now before I get to the best MOOCs for Data Science students in 2019, I need to talk about Kaggle. Kaggle is a platform where you can go and retrieve datasets to play with, but also see what other people have done with them. I am not ashamed to say that Kaggle is responsible for getting me through my first semester of programming.

Some lecturers I work with now incorporate ‘kaggle comps’ into their teaching syllabus, and they are often used by Data Science clubs for mini Datathons (think Hackathon for Data Scientists). Kaggle was initially a platform to make some good money from businesses who needed insights and were willing to pay for them. When you’re ready you can join competitions and compete for real-life dollarydoos (the official currency of Australia, in which all money on this site is expressed).

Now for the best MOOCs for Data Science students in 2019

Top course when you don’t know how to code (yet!)

DataCamp – Introduction to Python and Introduction to R ($40 dollarydoos per month)

DataCamp is Data Science orientated and runs on a subscription model that can be a bit exxy. They do have free materials so you can try before you buy, and they use a browser environment so you don’t need to download R or RStudio before you start. What I love about these courses is that they cover the very very basics then build you up through practice. DataCamp has gamified their courses: you earn experience points for correct lines of code and can take hints. This gives you a sense of achievement, something you definitely need in your early programming education.

Udemy – Python for Data Science and Machine Learning Bootcamp ($14.99 dollarydoos)

This is a fairly cheap course run by Jose Portilla. Jose is a well-regarded instructor with consistently high reviews. This course covers the very very basics, including how to install Python and my preferred environment, Jupyter. I would recommend this course for anyone who is feeling a bit overwhelmed about learning to program, so pretty much everyone.

Coursera – Programming for Everybody (Getting Started with Python) ($0 dollarydoos if you audit)

Coursera is free if you don’t get the certificate. This is called auditing. I took this course a few years ago and enjoyed it. There is a good online community and they send you reminders that your ‘homework’ is due. The instructor is Associate Professor Charles Severance from the University of Michigan who is adorable. He even does ‘class outside’. Recommended for budding Python programmers.

Top courses when ‘I can code(ish), just give me the Data Science ‘

Coursera – Johns Hopkins University Data Science Specialisation course ($0 dollarydoos if you audit)

This is a well-known program in Python that gets consistently high reviews. Made up of 5 separate courses, the specialisation takes around 5-6 months to complete at a normal speed of 6-8 hours per week. What I appreciate the most about this course is the breadth. By the end, you will be proficient with machine learning libraries like scikit-learn, NLP libraries (NLTK), visualisation libraries (matplotlib) and network construction (networkx) for social network analysis. The course is fairly lecture-heavy, so if you aren’t great at watching lectures, hop on a treadmill and kill two birds with one stone.

Udacity – Intro to Data Analysis ($0 dollarydoos)

A brief but clear dive into the cognitive mechanics behind Data Science through the actual doing of Data Science. The instructor emphasises communication and careful decision making. The course is in Python and part of the Data Analyst Nanodegree which I haven’t done any more of.

Udemy – Data Science and Machine Learning Bootcamp with R ($14.99 Dollarydoos)

This course was awesome because at the time I did it, I had never used Python and was seriously doubting my ability to understand algorithms that I had only just begun to hear of. Jose Portilla takes this course and does a great job guiding you through different classification algorithms, clustering algorithms, basic NLP and Neural Nets. Definitely a confidence booster.

Top courses for when you want to step it up and commit

These courses can be done in combination with your degree or work. But they are a bit heavier.

Coursera – Machine Learning ($0 dollarydoos if you audit)

I cannot recommend this course enough. You will get an amazing experience with 60 hours of jam-packed ML goodness. Offered by Stanford (so you know it’s legit), the course boasts 4.9/5 stars from 90k ratings. I did find I needed to brush up on a few things I was rusty on, like linear algebra. But it was so worth it. I finished feeling like I was finally becoming a real Data Scientist. Oh and the instructor Andrew Ng is literally the best, he doesn’t dress things up unnecessarily which I have found some (younger) instructors like to do. While you’re at it, do yourself a favour and listen to Andrew’s advice for building a career in ML.

Cognitive Class – Learning Paths for Data science ($0 dollarydoos, thanks to IBM)

Historically IBM has a lot to answer for but I’ll give them this, they have certainly figured out how to tailor education for industry. Here you can choose what you are interested in according to the ‘learning path’ (Scala for Data Science, Hadoop programming, deep learning, blockchain etc) or your experience level. The platform is very on trend, with ‘Containers, microservices, Kubernetes and Istio on the cloud’ being one such course (lol, wot?). They work on a badge system with optional competitions. Equal parts Pokémon and Kaggle. Definitely one for those who are keen to go corporate.

Dataquest – Interactive coding challenges (~$35 dollarydoos per month) 

Quite similar to Datacamp but I feel it’s a bit more …ahh….polished. Dataquest uses a hands-on learning approach where you are essentially given a project with problems to solve. Hands-on learning is the quickest way to get better and Dataquest steps it up with actual projects, not just isolated problems. It is subscription based and billed yearly.

Top courses for when you just want me to shut up and take your money

Literally, any reputable University degree where you get a Masters or Bachelors in Data Science.

We aren’t bashing tertiary qualification in Data Science. Many are amazing and produce extremely talented students. Realistically you do need some form of tertiary qualification to break into Data Science, particularly in Australia. Here is a list of tertiary institutions offering Data Science qualifications in Australia, USA and the UK.

But if a $64k – $75k price tag isn’t in your budget, there are some alternative qualifications increasingly being recognised in the industry. I haven’t taken the next two courses, but I have heard that they cover a similar syllabus that is said to get you up to speed.

edX – MicroMasters Program in Data Science ($1,746 dollarydoos)

Not having taken this course I can’t comment too much about it. However, the instructors have emphasised the statistical and mathematical aspects of machine learning, which is always a winner with me. The breadth looks reasonable and UC San Diego has a great reputation. If you are planning on doing further study you could be eligible for course credit. Ironically you couldn’t gain credit at UC San Diego, but you could at Curtin University in Perth, Western Australia (#hometown). I’ll let you make up your own mind about this course.

edX – Microsoft Professional Program in Data Science ($1,635 dollarydoos)

This one is aimed at people who are maybe managers or business analysts using primarily Excel or similar. The course includes SQL, R, Python and foundational mathematics before moving into machine learning and predictive analytics with Spark. The final quarter of the course is a capstone project resulting in a certificate, which I’ve seen pop up a few times on LinkedIn.

For when you don’t want to be a ‘fake Data Scientist’

Ok, brace yourselves. To be a real Data Scientist, you will want to get some theoretical understanding of the mathematics and statistics behind you. Horrifying I know.

I have so many problems with people who have never taken a stats course calling themselves Data Scientists. Don’t leave yourself open to some bitch like me calling you out on your lack of understanding of the fundamentals. I’m not the world’s best and never will be, but my understanding of the fundamentals informs the choices I make when creating workflows, choosing models and interpreting my results.

Unfortunately, I can’t recommend any courses online. I learnt my stats the old-fashioned way: Introduction to Statistical Learning, Introduction to Probability and Statistics for Engineers and Scientists, and countless bottles of wine. If you are feeling particularly masochistic, try The Elements of Statistical Learning.

Accumulating this knowledge is an ongoing battle. These days I continue to destroy my own sense of well being by reading multiple research papers concerning Bayesian inference, crying over what I call ‘scary math’ and spending the hour after every PhD supervision deeply questioning my life choices.

But since you are just starting out on your Data Science journey, this is very unlikely to be your story. Take the statistics and probability portion of whatever education you end up doing very seriously.

If you have any recommendations or proclamations of love for our site, leave a comment.

Good luck champs!