Why you will fail to be a “GREAT” Data Scientist?

Rishabh Jain
8 min read · Jun 10, 2021

It’s pretty easy to be just a “good” data scientist. But being “great” is not a piece of cake. So, let me break the ice on the most lucrative job of the 21st century for you, being a data scientist.

The job that’s going to automate everything, potentially destroying millions of jobs every day in the process. The job that’s going to produce a super-powerful God-like ULTRON that will kill all the humans one day. Sounds cool, right?

Leave the fantasy world of Hollywood fiction and come back down to earth. If you stick with me for a few minutes, I am sure I can at least tell you how to be a “good” data scientist in the future, if not a “great” one. These are only suggestions; I am not sure I am even worthy of being called a “good” one myself. But as an alumnus, it is expected of me to get the shit sorted, right? Anyway, it is easier to hand out advice than to follow it. So, let me start from the very beginning.

What is a data scientist? Why a scientist? I am doing engineering, not some B.Sc from Delhi University (my 12th class result just sobs in the corner). Anyway, to define data science in the easiest terms, it is the intersection of three important skills: Computer Science, Maths+Statistics, and Domain Knowledge. It is the ability to make decisions based on data, or to develop algorithms that make them for you. In all, it is a lazy guy’s asset: instead of looking at every image in a bag of millions to tell whether it shows a cat or a dog, you automate the task by developing a model to do the same. Just to add some random coolness, this model is loosely modelled on how the human brain works. How close the resemblance really is remains a debatable topic, and we perhaps won’t ever settle it, as we are not “human-brain-biology” majors. STOP ALL YOUR COOLNESS AND DEBATES HERE, PERIOD.

Whatever websites you use predominantly these days, be it Instagram, YouTube, Google Search, Snapchat, etc., are dominated by data science. From building your feed to predicting suggestions, it’s all the magic wand of data science. As future engineers and managers (yes, the M in ABV-IIITM stands for management, and one should be proud of this), you are not expected to manually develop recommendations for users in a company that has billions of them, some from countries and languages you have never even heard of. It’s crazy that you can develop a Japanese to Korean translator with mind-boggling correctness and efficiency, without knowing a word of either Japanese or Korean. Damn, you are performing better than a graduate who majored in Japanese to Korean translation. I feel sad for him/her. This is the power of data science. It’s not that you understand the data; you teach a machine to understand it for you. MACHINES ARE OUR SLAVES, or is it the other way round?

Data Science is a doomsday machine, even in its infancy. For hundreds of years, people slowly mastered statistics and mathematics in just a single domain to predict the future. And now, what took them months hardly a decade ago is done by a data science student in a few hours. So, should I just jump onto the data science bandwagon, since it is all rosy here? Remember your 3rd standard Moral Science lecture: ALL THAT GLITTERS IS NOT GOLD.

So, cutting it short, Data Science, or Artificial Intelligence in general, can only be a golden opportunity for a few people. Among those, even the majority are just good data scientists. Only a handful of them are actually great at what they do. Sure, you must be thinking that it is all about experience, and that every good one will turn into a great one as the years pass. It’s not that simple, buddy.

There are two common and contrasting schools of thought on learning data science, which most of you have already heard:

  1. The correct way of learning data science is to perfect the mathematics behind it. Only when you can understand and do lots of linear algebra and calculus can you call yourself an artificial intelligence enthusiast.
  2. Who cares about mathematics? It’s just Python anyway, with libraries and Github repos available for most of the tasks. It’s more of a “tweaking-the-code” job instead of being a mathematical one.

I disagree with both of them. Mathematics helps you understand the interactions between the input and the parameters that the model is learning. It also makes going through those daunting state-of-the-art research papers a little easier. It helps you vary the architecture deliberately, instead of making random changes and facing size-mismatch errors a gazillion times. But the world of mathematics is too deep. If you insist on a mathematical understanding of everything you apply, you will probably lag behind your batchmates and won’t have many arrows in your quiver at the time of job-seeking. Mathematics makes you go deep, while libraries and quick projects make you go wide. In the end, it’s just a trade-off between breadth and depth (BFS vs. DFS). The hyperparameter tuning for this one is yours to do, based on the time you have for your goal and your inclination towards mathematics in general. Sorry, NO GRID SEARCH AVAILABLE FOR LIFE. MAYBE IN SOME PARALLEL UNIVERSE.

For new data scientists, the culture is intimidating. With all these buzzwords, Computer Vision, Natural Language Processing, Reinforcement Learning, Markov Chains, Boltzmann Machines, etc., and rapid advancements in each of them, one simply does not have the time to know everything (also, where can we get so much computational power from :P). The AI industry is evolving every second, and computation power is increasing even faster. The question is, how do you balance everything? In the end, you have to digest the fact that you cannot be good at everything simultaneously. Also, there is no point in binge-watching Siraj Raval’s videos or reading Medium articles unless you develop real projects around them.

Speaking of application, there is no point in reading and learning about new research unless you can at least code it up. Forget the enormous computation power the original needed: shrink the parameters and at least get something working, even with a batch size of 1.
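To make the "batch size of 1" idea concrete, here is a minimal sketch in plain Python. It is not taken from any paper: it assumes a toy one-variable linear model and noise-free, made-up data, and the name `train_linear_sgd` and its defaults are hypothetical. The point is only that a complete training loop, one sample per update, fits on a laptop.

```python
import random

def train_linear_sgd(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b with plain SGD, one sample per update (batch size 1)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                 # visit samples in a new order each epoch
        for x, y in data:
            error = (w * x + b) - y       # prediction error on this single sample
            w -= lr * error * x           # gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error               # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical and noise-free).
toy = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_linear_sgd(toy)
print(f"w={w:.3f}, b={b:.3f}")
```

Once a tiny version like this runs end to end, swapping in a bigger model or a real dataset is a matter of scale, not of understanding.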

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, the recipients of the 2018 Turing Award (the Nobel Prize of computer science, announced in 2019), worked for decades before becoming somewhat great at a sub-field of their field of interest. This is not just what I am saying; this is what they say too. And yet, you believe you know everything about LSTMs after reading a 10-minute blog by a guy working at a tech company. There is a massive difference between just solving an AI problem and solving an AI problem to perfection. One makes you a good one, the other makes you great. And yeah, perfection doesn’t mean just changing the model’s architecture, but also extracting as much raw information from the inputs as possible. No model or hyperparameter tuning can overcome data imperfections as much as correct data processing can. To build an intuitive understanding of the data, visualization is a must. Most breakthroughs are rather intuitive in nature, and can only be thought of if you can visualize the problem in the real world. TRY TO EXPLAIN YOUR SOLUTION TO A NON DATA SCIENCE BATCHMATE, AND SEE HOW IT FARES.

To solve the real social problems that you are interested in, the skill and the patience for data scraping are a must. This is quite underappreciated by data science folks who hate HTML syntax and are too lazy to pull some data from the web. You want to analyse Malaria outbreaks in various states of India, and how they can be tackled? Try to scrape data from some old-fashioned government websites; it will be quite a fun task and of real value to the community.
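As a rough illustration of what such scraping involves, the sketch below pulls the rows out of an HTML table using only Python's standard-library `html.parser`. The HTML snippet, the state names, and the numbers are all made up for the example; a real government page would first have to be downloaded and would almost certainly need extra clean-up.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self._row = None     # cells of the row being read, or None
        self._cell = None    # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:   # ignore text outside table cells
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical snippet standing in for a downloaded page; the figures are invented.
SAMPLE_HTML = """
<table>
  <tr><th>State</th><th>Cases</th></tr>
  <tr><td>State A</td><td> 120 </td></tr>
  <tr><td>State B</td><td>45</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
cases_by_state = {state: int(cases) for state, cases in records}
print(header, cases_by_state)
```

The parsing is the easy part; most of the patience goes into inconsistent markup, merged cells, and numbers typed as text.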

To get a data scientist job at most tech firms, you just need to be good. Not great at all. Interviews are not that tough; they mostly focus on a mid-level understanding of the concepts and projects in your CV, plus a set of a dozen or so standard questions, just as in software roles.

What makes you a great one is a combination of being passionate about your cause, solving real-world problems, improving people’s lives, properly understanding and having intuition for what you are applying, efficiently training and testing your algorithms, and being able to explain your model. All of this comes on top of the necessary requirements: data in some format, some domain knowledge, and, not least, quite a lot of computation power.

Artificial Intelligence is a big field; it has sub-fields depending on what and how much you want to pursue. Speaking of practicality, here are some tips from my side based on my and my friends’ journeys in the data science field. You can see them as tips to land a “good” data scientist job at most companies. If you follow them steadily and in full earnest, and listen to other advice from people far more intelligent than me, I am sure you will one day gradually move towards “greatness” as well.

  1. Don’t ignore competitive programming at all. Unless you can turn your ideas into code, you can’t be termed someone who knows computer science, let alone a data scientist. Knowledge of data structures, algorithms, and time complexities is as relevant to a data science role as to a developer’s role.
  2. There are specific roles for each part of an artificial intelligence pipeline. Some common names include data analysts, data engineers, and data scientists. If you are thinking of going into any of the first two, knowledge of databases and SQL is a must. Even with data scientists, it’s always good to have some SQL skills so that you don’t have to depend on someone else for your data needs.
  3. Computation power can be a big headache while doing data science projects. To solve this, either use free resources like Kaggle and Colab notebooks or intern somewhere that can provide cloud processing power to you. Internships are very important to learn the gap between WHAT YOU SUPPLY AND WHAT CONSUMERS WANT TO CONSUME.
  4. Do some courses on linear algebra and calculus. Also, implement whatever you are learning while doing lectures on computer vision, natural language processing, etc. Good implementation courses are often available on Udacity and from independent creators. Mathematical courses are available on YouTube (Stanford, UC Berkeley, MIT, anything you like) and Coursera. SOLVING PROBLEM SETS AND HOMEWORK IS A BIG MUST.
  5. It’s equally important to read research papers as well as the blogs of the AI teams at Facebook, Google, Uber, etc. They give you approaches to solving real-world problems and expand your mind. If you do neither, start doing both and STRIKE A BALANCE.
  6. Diversify your projects in terms of the areas they cover. It gives you a more comprehensive understanding of things, and a varied CV also fares well with most companies.
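On the SQL point in tip 2, the sketch below shows the kind of small aggregation you should be able to write yourself instead of waiting on someone else. It uses Python's built-in `sqlite3` module with an in-memory database; the `events` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a made-up table of user actions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "view"), (1, "like"), (2, "view"), (2, "view"), (3, "like")],
)

# Per action: total number of events and number of distinct users.
rows = conn.execute(
    """
    SELECT action, COUNT(*) AS n, COUNT(DISTINCT user_id) AS users
    FROM events
    GROUP BY action
    ORDER BY action
    """
).fetchall()
conn.close()
print(rows)
```

If `GROUP BY`, joins, and filtering feel comfortable at this scale, the same queries carry over to a company's warehouse with little more than a change of connection.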

Remember, the best time to plant a tree was a few years ago. The second best time is now. And yet, like me, you know everything, but you will not persevere to be a “great” data scientist. Because it’s pretty easy to be just a “good” one.

I have bragged a lot, I guess. One last piece of advice, which I got from a highly competent senior of mine and which applies whether you pursue data science or not: “Don’t waste your college life. It’s a golden opportunity to learn. Work together and sow the seeds of changing the world to a better place”.

STAY HUMBLE, HUSTLE HARD.

Rishabh Jain

Data Science Engineer at ShareChat, IIITM Gwalior, India. Curious about economics, politics, history, mythology and information; rishabhrjjain1997@gmail.com