Why your Twitter analysis sucks – and what you can do about it

I’ll start with my favourite word in the English language. CONTEXT CONTEXT CONTEXT!!!

My colleagues, friends, and students are at the point of rolling their eyes when I say it. But it is so important and the main reason your Twitter analysis sucks.

What do I mean by context? If you are going to datafy something – turn a tweet, a representation of a thought, emotion or idea, into data – then you need to think about the context of a) the user and why they tweeted, b) the dataset you are looking at, c) the problem you are trying to solve by datafying that tweet in the first place, and d) the tools you are using. Why?

BECAUSE NATURAL LANGUAGE PROCESSING NEEDS TO BE BESPOKE AND YOUR PRECONCEIVED ASSUMPTIONS WILL TRIP YOU UP.

NLP is like a game of chess: you need a strategy. This means knowing your next 15+ moves (that is my average number of function executions before I have to rework my pipeline).

Time and time again I see this sort of standardised preprocessing pipeline used as a shortcut to get to the juicy bits (modelling), but if you are using these one-size-fits-all pipelines then you are going to produce a subpar analysis. Let me give you an example.

Let’s say you are an analyst working on a political campaign and your boss, the candidate, wants to know how people are feeling about a certain issue they have been speaking about. You decide to do some sentiment analysis on a collection of tweets from #auspol and find that sentiment towards the candidate on that issue is moderately positive. Great! Now you can tell them they are doing the right thing and to ramp it up.

So they do, but they lose the election…and you get fired. WTF?

Did you think about the context of your sample? Let’s dissect what went wrong with this example, and explore why some critical and lateral thinking is a necessary ingredient in your analysis.

The English language is a nightmare

Tweets amplify the problematic quirks of English, arguably the stupidest language on earth. Tweets are generally utter filth: they are noisy, short and full of non-standard conventions. I have a love-hate relationship with them.

To demonstrate where this went wrong, I’ll use the Stanford Sentiment Treebank. Grey circles represent neutral sentiment, orange is negative and blue is positive. This is our original tweet (let’s say it was posted shortly after your candidate made a speech about the issue).

Original tweet

Now for the analysis. First, we make the string lowercase, then we have a few options that are ‘standard’ in the NLP preprocessing pipeline.

We can:

  1. Remove the punctuation and special characters, leaving the word that the hashtag is made of; or
  2. Remove the punctuation and special characters, getting rid of the hashtags completely.

Let’s see what happens when we do this.
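In code, the two options look something like this (a rough sketch using Python’s re module; the tweet string here is a hypothetical stand-in, not the actual tweet from the screenshot):

import re

tweet = "I can't believe they think this is a good idea #idiots"  # hypothetical stand-in
lowered = tweet.lower()

# Option 1: strip punctuation and special characters, keeping the hashtag's word
option_1 = re.sub(r"[^a-z0-9\s]", " ", lowered)

# Option 2: drop the hashtags entirely, then strip the remaining punctuation
option_2 = re.sub(r"#\w+", " ", lowered)
option_2 = re.sub(r"[^a-z0-9\s]", " ", option_2)

# collapse the extra whitespace either way
option_1 = re.sub(r"\s+", " ", option_1).strip()   # 'i can t believe they think this is a good idea idiots'
option_2 = re.sub(r"\s+", " ", option_2).strip()   # 'i can t believe they think this is a good idea'

Whichever sentiment model you feed these into, the two strings are no longer the same tweet.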

Option 1. Remove the punctuation and special characters, leaving the word that the hashtag is made of.

Result = Negative sentiment

Great, the sentence is picked up as negative. No issues. What about option 2?

Option 2. Remove the hashtag completely.


Result = Positive sentiment

Ok, that’s not great. Clearly, this sentence carries negative sentiment, but it’s been picked up as positive. That’s because human emotion is incredibly difficult to datafy. People often use hashtags to convey sarcasm, which is inherently negative in sentiment.

You might be thinking, ‘just leave the hashtags in there and that way I can capture the sarcasm’. Well, yes, maybe. But hashtags aren’t always singular words. Let’s expand that hashtag into #idiotpollies, as in ‘idiot politicians’.

Typical hashtag

Result = Positive sentiment

And we are back to positive.

This is a hashtag issue. It’s almost impossible to assign sentiment to hashtags. You can split them, but that brings its own problems. Functions that split a hashtag into its root terms usually split the string at each capitalised letter they come across, e.g. “#AustraliaDayWeekend” becomes “Australia day weekend”. You need to adjust your pre-processing pipeline to make sure you don’t decapitalise before this is done.
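A typical splitter is just a regular expression over capital letters, which is why the order of operations matters. A minimal sketch, not a production-ready function:

import re

def split_hashtag(tag):
    # break the hashtag into words at each capitalised letter
    words = re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", tag.lstrip("#"))
    return " ".join(words)

print(split_hashtag("#AustraliaDayWeekend"))   # Australia Day Weekend
print(split_hashtag("#ScoMo"))                 # Sco Mo  -- meaning lost
print(split_hashtag("#auspol"))                # auspol  -- no capitals, nothing to split

If you lowercase first, every hashtag looks like #auspol and the splitter has nothing to work with.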

Splitting would be fine except:

  • It won’t work when the hashtag has no capitals, e.g. #auspol or #melbourneweather;
  • Some hashtags, once split, lose the original meaning that they had in their concatenated form;
  • If you are doing another analysis, like topic modelling, you would use the opportunity to treat the hashtag as a single ‘word’ – hence different analyses need different preprocessing (stop reusing the same preprocessed data for a different model);
  • Non-standard or trending words like #ScoMo get caught in the crossfire. #ScoMo, for Australian PM Scott Morrison, gets turned into “sco mo” and all meaning is lost. Same for the long-running hashtag #auspol. Let’s look at this example.

#CaptainCook is a trending hashtag in Australia and relates to the debate about Australia Day. Not to get into it, but most tweets like this are sarcastic and negative, and also hilarious.

Now when we normalise the tweet by splitting the hashtags on their capitals, removing punctuation and then putting it all into lowercase, we get:

“scott morrison has finally confirmed the dress code for australia day citizenship captain cook sco mo auspol”
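Chaining those steps together, the whole normalisation looks roughly like this (the tweet string is reconstructed from the screenshot for illustration, and split_hashtag is the sketch from earlier):

import re

def split_hashtag(tag):
    return " ".join(re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", tag.lstrip("#")))

def normalise(tweet):
    tweet = re.sub(r"#\w+", lambda m: split_hashtag(m.group()), tweet)  # split hashtags first
    tweet = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)                       # then strip punctuation
    tweet = tweet.lower()                                               # then lowercase
    return re.sub(r"\s+", " ", tweet).strip()

tweet = ("Scott Morrison has finally confirmed the dress code for Australia Day "
         "citizenship #CaptainCook #ScoMo #auspol")   # reconstructed for illustration
print(normalise(tweet))
# scott morrison has finally confirmed the dress code for australia day citizenship captain cook sco mo auspol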

And when we look for sentiment:

Result = Positive sentiment

Positive again. #ShockHorror.

The #CaptainCook hashtag is negative and sarcastic as a trending hashtag. But ‘Captain’ on its own is positive. If you were working for the Australian PM, you may be in a bit of trouble.

Now what?

We need to think about what words may be tricking the model into thinking this is a positive sentence. Can you guess?

It’s the word “can’t”. Since we removed the punctuation while getting ready to tokenise, only the root word “can” is kept. “Can” is labelled as a positive term, hence the positive sentiment.

At this point, if you are still trying to catch me out you might say we can concatenate “can” with the “t” and use “cant”. Ok, let’s try that.

Result = Neutral sentiment

Nope, that returns a neutral sentiment and doesn’t capture the sentiment of the user at all.

You might also say we can manually label hashtags and split according to a dictionary. Sure, you could. And you should, but remember you will never capture all trends – the Twitterverse is too dynamic, and they change too quickly. I’ll go into this in another post.

The answer

Contraction expansion.

Dismiss your flashbacks to year 6 English and stay with me here. By expanding the contraction to “cannot”, which will be separated into “can not”, we save the sentiment…and your job.

Result = Negative sentiment

Finally! “Not” supersedes the positive “can”.
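If you are wondering what that looks like in a pipeline, here is a minimal sketch. It uses a tiny hand-rolled mapping; a real pipeline needs a much fuller contraction dictionary, and the expansion must run before you strip punctuation or the apostrophes are already gone:

import re

# tiny illustrative mapping; extend this for real data
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",      # crude fallback for other negations
}

def expand_contractions(text):
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I can't believe they think this is a good idea"))
# I cannot believe they think this is a good idea

Only after this step do you lowercase, strip punctuation and tokenise, so the negation survives into the model.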

The moral of the story is that you need to know your problem, the context of your data, what that means for your pre-processing, and where you might get tripped up.

I hope this helps you think about new ways to improve your work, and hopefully not get fired!

As always, please share on Twitter and tag us as @data_little for happy coding binges.

Feature photo by JESHOOTS.COM on Unsplash

Thanks to Ed Farrel for his wonderful editing of yet another helpful post from The Little Data Scientist.

Jupyter Convert: How to get a Table of Contents

If you are like me, you prefer a readable, aesthetically pleasing, structured bit of code to rip off and make your own. Ahhh, I mean use as a template. But sometimes, especially when I first started out, I found I was discouraged from trying to understand what the author had done by the hostile-looking script they had produced. If I asked for help when I was really struggling I would get shown this sort of thing and die on the inside.

Nope!

I tried a few different IDEs and was flailing. About 6 months into learning how to program I came across IPython, known today as my beloved Jupyter Notebook. #Jupyter4Lyfe

What was this clear, intuitive and aesthetically pleasing sorcery? I played about and was an immediate convert. I now write my students’ assignment instructions in Jupyter and really wish other lecturers would adopt this practice rather than handing out instructions (you know who you are). But I digress.

I have had ‘real’ programmers scoff at me a few times for not using a ‘real IDE’. I laugh at them because screw Linux and Sublime. By using Jupyter I know I will be reaching a large group of people who aren’t programmers by trade and who will now be able to get a transparent overview of what’s been done. Also, unless I’m doing a GitHub commit, I can’t be bothered with readme files until I am finished. I prefer to keep notes in the notebook.

Anyway, I love love love Jupyter Notebooks. But as I have started doing experiments requiring non-linear runs (changing what pre-processing strategy I am using, for example) I find I have to write really long scripts. So I wanted a way to find the sections I am after without having to do a million scrolls. Enter the humble dynamic table of contents (TOC) and its magic. Side note: if you are allowed to submit assignments in Jupyter, do it and install a TOC. Your tutors/lecturers will thank you. Make their job easy and they will be inclined to give you a good mark.

Behold!

The Magic

Before we start: I am a Mac user and so use bash (Bourne Again SHell). Windows users will use cmd.exe and PowerShell. If you are a Windows user, I am so so sorry for your loss, your loss of access to 95% of your OS capabilities. So for Windows users, step one is get a Mac.

Install pip

Mac users head to this pretty walkthrough to install pip.
Windows users head over here to install pip.

Installation of nbextensions

Alright, the jupyter_nbextensions_configurator is an extension that beefs up the capabilities of Jupyter, adding new buttons for formatting, enhancing functions and so on to the interface you see in Jupyter. After you add this extension it will load automatically, and you won’t need to re-enable it every time. There are so many options in the nbextensions configurator that we will go through in another article, but if you want all of the info, as always, read the docs.

If you are a conda user

In your terminal or cmd, type the following:

conda install -c conda-forge jupyter_nbextensions_configurator

If you prefer pip

I prefer pip, but there are two steps to installing nbextensions rather than one with conda.


Step 1. Install the pip package by copying this into your terminal or cmd.

pip install jupyter_nbextensions_configurator

Step 2. Configure the notebook server to load the extension.
This is done with something called a Jupyter sub-command, which just tells the actual notebook to enable the server extension, kind of like when you use ‘import some-package’ in your notebook after you have done ‘pip install some-package’ in your terminal or cmd. Now copy and paste this in:

jupyter nbextensions_configurator enable --user

You will get something that looks like this:

Yes I have a pink command line.

Choosing your options in your brand new nbextensions panel

Now open up a new jupyter notebook and you should see a new option called Nbextensions.

Shiny new extension.

Click on the panel and check the Table of Contents (2) and toc/loc configurations. Like this:

Required options for TOC.

Collapsing Sections

Before you start, I recommend ticking the collapsible headings option as well. It’s the fourth from the top of the first column of options. You will see why shortly.

Ok, we are ready to go. Open up a new notebook and enable table of contents. Like this.

Then you will need to do some small configuring of the ToC2 settings. When you click on the TOC button you will see a sidebar on the left-hand side called ‘Contents’; it will have a refresh icon and a settings icon. Click the settings icon, then enable all of the settings you want.

I recommend all of these, particularly the ‘Leave h1 items out of ToC’ checkbox. Otherwise, you will get a number 1 next to your TOC, which looks bad and throws off the numbering.

Getting a TOC in your notebook

To put in your first heading, set the cell to markdown and then use two hashes (‘##’) for a main header.

To put in a subheading, use three (‘###’).

You can keep adding sub-headers with up to five ‘#’, giving you four levels of sub-headings, like this.
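A markdown cell laid out like the following (the heading text is just an illustration) gives you a numbered, nested TOC in the sidebar:

## Load the data
### Clean the tweets
#### Expand contractions
##### Check edge cases

Each extra ‘#’ pushes the heading one level deeper in the sidebar.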

And there you are, automatically generated TOC in your notebook.

Related features

Colour coding

You will notice that the TOC is highlighted here in yellow. This indicates where you are in your notebook. When you run a cell, the section that the cell belongs to will turn red. Here is an example:

This is pretty handy when you are running code that takes a while and you are messing about with something else in the notebook.

Navigator and Sidebar

Go up to your main set of tabs, where File, Edit, View etc. live, and click on the ‘Navigate’ tab. You will have to expand the box to see the contents, which you can click on. This option isn’t one I use much because I like the sidebar. Speaking of the sidebar, you can move it around to wherever you want it to be.

And that’s it. You now have a TOC which will make your work more easily navigated, and anyone you distribute the notebook to will be happy that the TOC is in place regardless of whether they use these extensions.

If you got something out of this walkthrough we would love it if you could spread the word. Twitter is a great way to do this.

Credit to Bundo Kim for the amazing feature photo. For more of Bundo’s work please head to unsplash @bundo

Sorry, but you need a Masters degree to be a Data Scientist

You want to be a Data Scientist, and you’ve read that you might not actually need a Masters degree or even a PhD to get the job of your dreams. Let me guess? You’ve been told this by various ‘experts’, and now you have some doubts.

First of all, there are some major sceptics out there who probably aren’t actually Data Scientists. Yep, we know about them, and we know better. You want the truth? You need a Masters degree and/or PhD to get a fulfilling, progressive and well-paid career in Data Science.

Completing a bunch of online courses and MOOCs and then landing a six-figure position without a Masters is possible but improbable. Particularly in Australia, where a Masters is essential to getting a legitimate role with career progression and skill development. And unlike in the US, PhDs are very very valuable.

“You need a Masters degree or PhD to get a fulfilling, progressive and well-paid career in Data Science.”

– Legitimately every Data Scientist

Something to remember is that the ‘field’ is relatively new. Stories of people who have managed to build up their skills and now have a fantastic career in Data Science are often about people who finished their Computer Science, Mathematics or Statistics related bachelors 8+ years ago. They have been able to transition into these roles as the position became more defined. These people are driven and are continually honing their skills through online courses, Kaggle competitions and seeking out skill development opportunities.

As the field developed, more post-graduate degrees were marketed for Data Science. The first batches of graduates entered the Australian job market in 2017. Since then employers have come to expect a Masters degree in Data Science because there are so many people with that qualification, and the number is rising.

It is possible to get a job in Data Science without a post-graduate degree, but you have to be lucky, connected or a genius math nerd with pro-hacking skills. Now be honest with yourself, is that you?

Before we look into the job market, let’s address a significant motivator for doing a Masters: money.

Data Science degrees are expensive. Really expensive. The average cost of a 2-year Master of Data Science in Australia is $65,000 for a domestic student and $80,000 for an international student. It’s a considerable investment, especially if you are already $45,000 in debt from your undergraduate degree. But the good news is that salaries for post-graduates are pretty competitive. The Graduate Outcomes Survey statistics back this up.


With an undergraduate degree in Computer Science or Information Systems, you can expect a median graduate salary of $60,000 pa. For post-graduates that jumps to $96,000 pa. That’s a 60% increase! Graduates with a Bachelor of Science, by comparison, can expect a median graduate salary of $63,000, rising to $78,300 for those holding a post-graduate degree. The average full-time Australian wage was $82,500 as of June 2018.

Comp.Sci and Info. System 2018 undergrad salary $60,000

Comp.Sci and Info. System 2018 postgrad salary $96,000

– 2018 Australian Graduate Outcome Survey

These numbers are compelling. Clearly, a post-graduate degree is valuable to employers.

To get a good overview of the Australian Data Science job market, we hopped on seek.com to search for jobs with the title ‘Data Scientist’ which returned 262 jobs. Breaking them up into pay bands, we start to get an understanding of the Data Science job market and why people say ‘you don’t need a Masters’.

The $50,000 – $60,000 pay band

Setting the pay band filter to reflect the low end of the median undergraduate salary, we immediately see a few scary trends.

Some recruiters are opportunistic

Reading these ads, the wise words of Darryl Kerrigan come to mind

“Experienced Data Scientist for 50-60k? Tell’em they’re dreaming”

– Darryl Kerrigan if he was a Data Scientist

For example, this is a job advertised for the Australian Federal Government, but it is more likely for a contractor. This job is a classic example of ridiculous, opportunistic recruiters trying to rip off legitimate postings for senior roles. Ignore these recruiters and don’t look at them again.

A job that is advertised for a ‘Data Scientist’ isn’t always for a Data Scientist

Unfortunately, with the hype around Big Data, companies and recruiters have taken to using the title a little too liberally. It might be disheartening to see jobs advertised as low as $50-60k, but in reality, these jobs are meant for graduates of a business or commerce bachelors degree. If you took one of these jobs, you would likely be using proprietary tools and doing little development work.

The $60,000 – $90,000 pay band

When we increase the pay band, we get more legitimate roles. But there aren’t many. At first glance, people will look at this and see 12 Data Science jobs. But are there really?

The liberal use of ‘Data Scientist’ is evident again. Out of the 12 that were returned we got:

2 x Junior Data Scientist/Data Analysts

1 x Digital Analyst

2 x Junior Software Engineers

1 x Junior Campaign Analyst

1 x Procurement Specialist

3 x Data Scientists

1 x Junior Data Scientist

1 x Consultant/BI and Analytics

We excluded the Junior Software Engineer and Procurement Specialist roles; they are not relevant. The two Junior Data Scientist/Data Analyst roles, the Junior Campaign Analyst, the Digital Analyst and the Consultant/BI and Analytics jobs fall into the category of those meant for business or commerce undergrads. In fact, one of the ads even says:

“Degree with a strong quantitive focus such as; statistics, physics, psychology, economics or commerce”

– Badly targeted job ad

Little room to learn skills in Machine Learning, Programming and Workflow Development

These types of job posts list SAS and other proprietary software experience as a requirement. They don’t tend to require you to program in a broad range of languages or actually plan your own analyses. It is unlikely these jobs will set you up for hardcore Machine Learning and AI development roles in the future. Click on the image to read a few.

Some companies go fishing for a Masters without mentioning a Masters

Some companies will specify a specific type of undergraduate degree and also ask for a minimum of 2 to 3 years of experience with Python, Scala, Spark or similar. These roles are almost impossible to get unless you have been working in the industry, and since it is hard to get a role where you actually develop these skills as an undergraduate (e.g. this role), your best bet is a Masters or higher to compensate. These types of recruiters are casting the net wide but know what they want, and they want it cheap.

From here, the remaining three positions for Data Scientists are offered at $80,000 pa. All want a PhD or Masters degree.

The $90,000 – $130,000 pay band

The Masters-Experience tradeoff

Moving up, we see 85% of the ads specify a required PhD or Masters, often with 2-8 years of experience tacked on for good measure. Think of it like a trade-off: a PhD graduate with no industry experience already has 3 years of working on a project with real deadlines, often with industry partners. A Masters student will have been doing projects throughout their 2 years, and a lot of them have scored part-time data analyst roles during their degree based on their undergrad skills. Undergrads are unlikely to have that opportunity.


Don’t be fooled by the words ‘preferred’ and ‘or’. ‘Preferred’ means ‘must’. The market is flooded with Masters graduates who have an advantage over undergrads. Unless you have 3 years of project management experience and some serious skills, you can’t compete. Remember, HR and recruiters run the show, not the team you will work with. Here are some examples.

PhDs are valued in Australia

Actually, some recruiters are getting desperate for them. Unlike the American rhetoric generated by the post-GFC job crash, a PhD will put you in good stead in Australia for a career in Data Science. It’s just a hard route to go. But clearly recruiters are getting desperate, and it’s kind of awkward.

The $130,000 + Pay Band

These roles are difficult to get, mainly because they require business acumen and consulting experience, but there is big demand for them. However, we still see some odd marketing tactics with interesting use of emojis. This one is advertised at $150,000 pa.

“If you are a Data Scientist who could also walk into a high performing Senior Insights role with a Consulting firm (eg, presentation skills, customer engagement, strategy and planning, insights to find opportunities,  drive sales etc.) then stop what you are doing and send me your CV !- I need to speak with you immediately :-)”

– Desperado

In fact, some are even asking for journal publications.

Have you published?

I love reading this. As someone with a Masters in Data Science, now doing a PhD in Text Analytics and Organisational Performance Analytics, I know I have a good pay packet waiting for me if I choose to step away from academia.

So with that, I hope you now have a better idea of the Australian job market for Data Scientists and why the following articles (from the USA) are bull.

Cleverism – 4 Reasons Not To Get That Masters in Data Science

Forbes – You Can Get A Data Analytics Job Without A Masters In Data Science

Topbots – You Don’t Need a PhD to Master Machine Learning & Data Science

How to pull an ‘all-nighter’: The dry eyes edition

Ok, first of all, don’t.

Second, if you are, then you need to get your shit together and learn some time management skills. But that being said, life does get in the way and the 9-to-5 workday no longer exists. All-nighters are sadly a fact of life, and at some point, you will need to pull one. So if you’re gonna do something, you may as well do it right.

Today we are reviewing one of the most visible parts of your body that you will punish when you pull an all-nighter: your eyes, your poor, sore, dry and bloodshot little eyeballs.

Our eyes, like the rest of our body, need sleep (between 7-9 hours a night). When you pull an all-nighter, you deprive them of the opportunity to rest and replenish.

Often you’ll find that the scratchy, sore feeling develops 2-3 hours after what would have been your usual bedtime. This is especially true in a coding binge where you are focusing for hours on a screen right in front of you. Optometrists call this set of symptoms asthenopia, also known as tired eyes or eye strain.

So why do they get this way?

Well, there’s a couple of interconnected reasons. The first is that the tiny extraocular muscles that control our eyeball movement and eyelid elevation (for blinking) get tired. This means they can’t maintain the rate at which we usually blink. Less blinking means fewer tears spread out over your eye, so it will dry out quicker than normal. You’ll notice your vision will blur and it’s hard to focus. This is a result of the smaller muscles that control pupillary reaction becoming fatigued. You will also have a slowdown in reaction time to low or super-bright levels of light. All of this strain generates fluid buildup through inflammation, making tear generation harder and harder.

The workaround

Eye drops

These will place artificial tears on the eye and provide the much-needed moisture they are craving. Some work better than others for different people, so you might want to try a few. Make sure to change bottles every month or so to avoid giving yourself infections. Conjunctivitis is gross.

Step away from the computer


Every 20-30 minutes give those ballers a break. When you use a computer, your blink rate drops on average from 5-6 blinks per minute to just 2-3. Fewer blinks, more dryness. You also want to give the other muscles that are straining hard to focus a break.

Use a cool eye compress


Take 10-15 minutes to use a cool (not frozen) eye compress to reduce the inflammation in your eyes. This will aid in relieving any headaches you are getting because of the eye strain.

Put your specs on!


Ahh, the catch-cry of your mum, for those of you who wore them as a kid. Your glasses will help reduce the strain your eyes are under. That’s why you got them. If you wear contacts, switch to your glasses if you can. Contacts worsen dry eyes when worn for long periods, like the 36 hours of hell you are getting through.

Adjust your lighting


A dim room will force your eyes to work harder than they need to. Natural light is preferred but, you know, it is probably night time. So use a bright bulb, 100 watts (6500K), to light up your workspace. Some people prefer halogen bulbs over fluorescent, citing improved focus. Agreed. There has been an excessive amount of organisational research into the effects of light on productivity. Light up your workspace!

Use a humidifier


Humidifiers put some moisture back in the air to help with the dryness in your eyes. Your nose and throat will also thank you, as your mucous membranes will suffer from the lack of sleep and dry up as well. Internal heating and cooling, particularly evaporative air conditioning, will suck the moisture out of the air, so if you live in a hot or cold climate, think about buying a humidifier.

Right, I hope these tips aid in a successful and less painful all-nighter. Look out for more guides in the ‘How to pull an all-nighter’ series. In the meantime, I have three words for you, TIME MANAGEMENT SKILLS.

Data Science Podcasts we love

Ahhh Podcasts, the ear joy we love to listen to.

If you are anything like us, your AirPods will be permanently implanted in your ear, and your significant other will regularly get told ‘hold on, I just need to pause this podcast’ when they try and speak to you.

Whether you enjoy fiction, current affairs, comedy or learning about new skills, there is a podcast for almost everything. So we at the LDS thought we would share our most beloved Data Science podcasts with you.

Partially Derivative – USA

Sadly ending in 2017, Partially Derivative was a standout favourite of ours. Each episode the three hosts, Jonathon, Vidya and Chris, get together and talk about everything from the latest trends and innovations, failures and launches, to social impact issues and hilarious examples of applications such as the ANN-generated 8th Harry Potter book. Check out all 103 episodes here or on whatever platform you use.

Data Science Ethics – USA

Fricken love this podcast. Mainly because there is such a big call for people in the field to start addressing these issues. These ladies explain ethical concepts we encounter every day with examples like blockchain, morality in machines, NLP and Google’s AI-generated assistant. Head over to their site for all the episodes here.

Data Futurology – Australia

A relative newcomer hosted by Felipe Flores, a Data Scientist and self-described Data Futurologist. Felipe’s podcast is a winner when it comes to industry-based Data Science; he interviews a variety of industry leaders in Australia and internationally. If you want to understand how Data Science is being employed in industry, head here to listen.

Linear Digressions – USA

Ben and Katie talk about various Machine Learning and Data Science tools and get you up to speed on the state of the art. They spend 15 minutes or so explaining these tools and how they work in a way that isn’t basic, but isn’t so difficult that you feel stupid (there are a lot of podcasts that over-explain this way). Examples of topics include Word2Vec, Capsule Networks, Git for Data Science, Agile Data Science and Fractal Dimensions. Hear them gently explain everything here.

Data Skeptic – USA

A long-running podcast by Kyle Polich, who covers everything from domain expert interviews to tutorials and advice. Kyle is a magnificent presenter and calls a spade a spade; he calls bullshit on a lot of the detritus floating about around Data Science. Episodes run between 15 and 60 minutes, cover technical concepts like NLP, time series and gamification, and are pretty much ‘Data Science easy listening’.

The O’Reilly Data Show – USA

A fairly technical podcast often reserved for long car trips or plane rides. Ben Lorica hosts the podcast, interviewing industry experts and getting into the nitty gritty of the latest trends in both academic and industry-based Data Science. Ben isn’t exactly the smoothest host, but you will definitely be well informed. Listen here.

Concerning AI – USA

Amazing discussions on the socio-technical implications (good and bad) that are emerging from the wonderful world of AI. The two hosts, Ted Sarvata and Brandon Sanders, use a debate/discussion style format. Concerning AI is a bit of a trailblazer as it introduces discussions of the ‘existential risk’ of our craft in an accessible format. Have a listen here.

Data Stories – USA

This is a podcast with a focus on visualisation – super important. The hosts, Enrico Bertini and Moritz Stefaner, are experts in the field and interview guests from both industry and academia. They do an awesome job of letting the people they interview guide the story, which isn’t always the case with other podcasts. The narrative style makes for easy listening, particularly for those in academia. Head over here to listen.

Of course, there are many more, but these are the standouts that are pinned to our favourites list. If you aren’t a podcast listener, we highly recommend them for passive learning. Our preferred platforms are Stitcher, Castbox and Podcast Player.

Now go get some ear joy!

Best MOOCs for Data Science students in 2019

Starting a Data Science or Comp. Sci degree in 2019? Here are all the MOOCs that we recommend for you to get a head start on your study.

The NUMBER ONE question I get asked by incoming Data Science students is:

“What materials or courses do you recommend to prepare for classes?”

I got this question so frequently last year that I wrote up all the MOOCs (Massive Open Online Courses) I had personally completed in the last 3 years and gave the list to the course director to send out. Considering that the number of Data Science students in Australia has doubled since I did my degree, I figure there is a fair appetite for lists like these, so I have updated mine just for you.

Now before you jump in I want you to keep the following in mind:

  1. There is no substitute for practice. You cannot read your way to becoming a decent programmer. You have to code.
  2. Both Python and R are great languages. If you are reading this, I’m guessing you haven’t started to specialise. Try and get across both languages. I used to be 100% R and now I hardly ever use it, preferring Python instead. You will need to be adaptive. Not everyone uses just 1, 2 or even 3 languages. I get caught out all the time with janky code that requires Java, Python, Matlab and C++ all for one installation.
  3. Cost doesn’t mean quality. Some of what I have put up is free but some have a price tag (no, we do not get sponsorship). This also goes for the certificates: you don’t need them, especially if you are doing a university degree. Look out for price drops as well. Udemy has heaps of discounts throughout the year (hello Black Friday) but it can be a bit hit and miss.

Now before I get to the best MOOCs for Data Science students in 2019, I need to talk about Kaggle. Kaggle is a platform where you can go and retrieve datasets to play with, but also see what other people have done with them. I am not ashamed to say that Kaggle is responsible for getting me through my first semester of programming.

Some lecturers I work with now incorporate ‘Kaggle comps’ into their teaching syllabus, and they are often used by Data Science clubs to run mini Datathons (think Hackathon for Data Scientists). Kaggle was initially a platform for making some good money from businesses who needed insights and were willing to pay for them. When you’re ready you can join competitions and compete for real-life dollarydoos (the official currency of Australia, in which all money on this site is expressed).

Now for the best MOOCs for Data Science students in 2019

Top course when you don’t know how to code (yet!)

DataCamp – Introduction to Python and Introduction to R ($40 dollarydoos per month)

Datacamp is Data Science orientated and runs on a subscription model that can be a bit exxy. They do have free materials so you can try before you buy, and they use a browser environment so you don’t need to download R or RStudio before you start. What I love about these courses is that they cover the very, very basics and then build you up through practice. Datacamp has gamified their courses: you earn experience points for correct lines of code and can take hints. This gives you a sense of achievement, something you definitely need in your early programming education.

Udemy – Python for Data Science and Machine Learning Bootcamp ($14.99 dollarydoos)

This is a fairly cheap course run by Jose Portilla. Jose is a well-regarded instructor with consistently high reviews. This course covers the very, very basics, including how to install Python and my preferred environment, Jupyter. I would recommend this course for anyone who is feeling a bit overwhelmed about learning to program, so pretty much everyone.

Coursera – Programming for Everybody (Getting Started with Python) ($0 dollarydoos if you audit)

Coursera is free if you don’t get the certificate. This is called auditing. I took this course a few years ago and enjoyed it. There is a good online community and they send you reminders that your ‘homework’ is due. The instructor is Associate Professor Charles Severance from the University of Michigan who is adorable. He even does ‘class outside’. Recommended for budding Python programmers.

Top courses for when ‘I can code(ish), just give me the Data Science’

Coursera – Johns Hopkins University Data Science Specialisation course ($0 dollarydoos if you audit)

This is a well-known program in Python that gets consistently high reviews. Made up of 5 separate courses, the specialisation takes around 5-6 months to complete at a normal speed of 6-8 hours per week. What I appreciate the most about this course is the breadth. By the end, you will be proficient with machine learning libraries like scikit-learn, NLP libraries (NLTK), visualisation libraries (matplotlib) and network construction (networkx) for social network analysis. The course is fairly lecture-heavy, so if you aren’t great at watching lectures, hop on a treadmill and kill two birds with one stone.

Udacity – Intro to Data Analysis ($0 dollarydoos)

A brief but clear dive into the cognitive mechanics behind Data Science through the actual doing of Data Science. The instructor emphasises communication and careful decision making. The course is in Python and part of the Data Analyst Nanodegree which I haven’t done any more of.

Udemy – Data Science and Machine Learning Bootcamp with R ($14.99 Dollarydoos)

This course was awesome because, at the time I did it, I had never used Python and was seriously doubting my ability to understand algorithms that I had only just begun to hear of. Jose Portilla teaches this course and does a great job guiding you through different classification algorithms, clustering algorithms, basic NLP and neural nets. Definitely a confidence booster.

Top courses for when you want to step it up and commit

These courses can be done in combination with your degree or work. But they are a bit heavier.

Coursera – Machine Learning ($0 dollarydoos if you audit)

I cannot recommend this course enough. You will get an amazing experience with 60 hours of jam-packed ML goodness. Offered by Stanford (so you know it’s legit), the course boasts 4.9/5 stars from 90k ratings. I did find I needed to brush up on a few things I was rusty on, like linear algebra, but it was so worth it. I finished feeling like I was finally becoming a real Data Scientist. Oh, and the instructor, Andrew Ng, is literally the best; he doesn’t dress things up unnecessarily, which I have found some (younger) instructors like to do. While you’re at it, do yourself a favour and listen to Andrew’s advice for building a career in ML.

Cognitive Class – Learning Paths for Data science ($0 dollarydoos, thanks to IBM)

Historically IBM has a lot to answer for, but I’ll give them this: they have certainly figured out how to tailor education for industry. Here you can choose what you are interested in according to the ‘learning path’ (Scala for Data Science, Hadoop programming, deep learning, blockchain etc.) or your experience level. The platform is very on trend, with ‘Containers, microservices, Kubernetes and Istio on the cloud’ being one such course (lol, wot?). They work on a badge system with optional competitions. Equal parts Pokémon and Kaggle. Definitely one for those who are keen to go corporate.

Dataquest – Interactive coding challenges (~$35 dollarydoos per month) 

Quite similar to Datacamp, but I feel it’s a bit more …ahh… polished. Dataquest uses a hands-on learning approach where you are essentially given a project with problems to solve. Hands-on learning is the quickest way to get better, and Dataquest steps it up with actual projects, not just isolated problems. It is subscription-based and billed yearly.

Top courses for when you just want me to shut up and take your money

Literally, any reputable University degree where you get a Masters or Bachelors in Data Science.

We aren’t bashing tertiary qualifications in Data Science. Many are amazing and produce extremely talented students. Realistically, you do need some form of tertiary qualification to break into Data Science, particularly in Australia. Here is a list of tertiary institutions offering Data Science qualifications in Australia, the USA and the UK.

But if a $64–75k price tag isn’t in your budget, there are some alternative qualifications that are increasingly being recognised in the industry. I haven’t taken the next two courses, but I have heard that they both offer a similar syllabus that is said to get you up to speed.

edX- MicroMasters Program in Data Science ( $1,746 dollarydoos)

Not having taken this course, I can’t comment too much about it. However, the instructors have emphasised the statistical and mathematical aspects of machine learning, which is always a winner with me. The breadth looks reasonable and UC San Diego has a great reputation. If you are planning on doing further study, you could be eligible for course credit. Ironically, you couldn’t gain credit at UC San Diego, but you could at Curtin University in Perth, Western Australia (#hometown). I’ll let you make up your own mind about this course.

edX- Microsoft Professional Program in Data Science ($1,635 dollarydoos)

This one is aimed at people who are perhaps managers or business analysts using primarily Excel or similar. The course includes SQL, R, Python and foundational mathematics before moving into machine learning and predictive analytics with Spark. The final quarter of the course is a capstone project resulting in a certificate, which I’ve seen pop up a few times on LinkedIn.

For when you don’t want to be a ‘fake Data Scientist’

Ok, brace yourselves. To be a real Data Scientist, you will want to get some theoretical understanding of the mathematics and statistics under your belt. Horrifying, I know.

I have so many problems with people who have never taken a stats course calling themselves Data Scientists. Don’t leave yourself open to some bitch like me calling you out on your lack of understanding of the fundamentals. I’m not the world’s best and never will be, but my understanding of the fundamentals informs the choices I make when creating workflows, choosing models and interpreting my results.

Unfortunately, I can’t recommend any courses online. I learnt my stats the old-fashioned way: Introduction to Statistical Learning, Introduction to Probability and Statistics for Engineers and Scientists, and countless bottles of wine. If you are feeling particularly masochistic, try The Elements of Statistical Learning.

Accumulating this knowledge is an ongoing battle. These days I continue to destroy my own sense of well being by reading multiple research papers concerning Bayesian inference, crying over what I call ‘scary math’ and spending the hour after every PhD supervision deeply questioning my life choices.

But since you are just starting out on your Data Science journey, this is very unlikely to be your story. Take the statistics and probability portion of whatever education you end up doing very seriously.

If you have any recommendations or proclamations of love for our site, leave a comment.

Good luck champs!