Will Large Language Models replace Data Scientists?

Data Scientist David Andrés breaks this down.

Michael Spencer and David Andrés

Aug 29, 2023

Friends,

I find how LLMs are impacting the future of coding fascinating, but what about data science itself? I’ve rarely come across this topic, so I asked Data Scientist David Andrés to give us his take.


🧾Bio

David Andrés is a Data Scientist with 4 years of hands-on experience in the domain. Although his academic background is rooted in Aeronautical Engineering, he quickly transitioned to the world of Artificial Intelligence and Machine Learning.
David excels in Python, AWS, and other cutting-edge technologies. In the last three years, he has led transformative projects, ranging from revenue forecasting and anomaly detection to the development of sophisticated recommendation systems.
With a particular passion for time series forecasting and financial data, David consistently positions himself at the forefront of data science, always on the lookout for groundbreaking challenges that propel business growth and technological advancement.

His passion for time series forecasting and data science drives him to post regularly on Twitter, curate a newsletter, and maintain a blog. These platforms not only enhance his learning but also hold him accountable, as he now has an audience to engage and inspire.

He is based in London, England. The articles in his newsletter and on his blog cover Python, Machine Learning and Time Series forecasting.

You can follow his Newsletter 🤖💊

Machine Learning Pills

Who will be most impacted by LLMs in your opinion?

Let’s finally dive into the guest post.

David Andrés

August, 2023. Machine Learning Pills.

What are Large Language Models?

Lately, there's been a lot of buzz about big computer programs that can understand and generate human language. These programs are called Large Language Models, or LLMs for short. Some popular examples are GPT-4, Llama 2, and PaLM 2. They have sparked curiosity about their potential impact on employment and society at large.

The original purpose of these LLMs was to understand the text and be able to extract some information such as sentiment, semantic similarities, summarisation… However, researchers discovered that when these models were trained with huge amounts of data, they could also learn about the information contained within it, or even learn how to program or translate to another language! This discovery has been a turning point. Companies have started training these models with domain texts like health or medical text to develop a doctor assistant; law-related text to create a law advisor; code in multiple repositories to create a programmer assistant… So now these LLMs can do a lot of things apart from only understanding and processing text. They can chat with you, help lawyers, doctors, and even assist programmers in writing code.

From a general LLM to a specific application. Source: Renaissance Rachel

LLMs able to code

Some more advanced tools, like Copilot and AlphaCode, help coders by suggesting pieces of code. While these tools have showcased their prowess in automating aspects of code development, they have yet to master the creation of comprehensive software or programs from scratch.

However, tools such as AutoGPT and LangChain are bridging these gaps. They address some of LLMs' primary limitations, including mathematical calculations, internet connectivity, and the execution of intricate, prolonged tasks. These tools break down challenges into a sequence of subtasks, paving the way for the achievement of the set objective. They are not yet perfect and often fail to achieve satisfactory results. One reason is hallucination, which refers to making up facts or details that are not true. However, this revolution has just started, and in only a couple of years, big progress has been achieved.

Example of a ChatGPT hallucination. The response is not true: there were 705 survivors.

Development of LLMs

This leads to the question of what will happen in a few years. The top tech companies like Google, Meta, Microsoft, NVIDIA, and OpenAI are putting enormous effort into developing more and more powerful LLMs. We will reach a moment in which these LLMs are not only more powerful but also better optimised, making them less computationally expensive. Many more tools will also be available to interact with them and extend their capabilities.

New LLMs in the last 3 years developed by multiple top tech companies. Source: Fiddler

Many think that they pose a great danger to many professionals, especially those whose work can be easily automated, like clerks, analysts, counsellors, teachers, or even judges.

But there is one special type of professional that may be at risk: those who have developed these Language Models and made them available. I am referring to Data Scientists, Machine Learning Engineers, Data Engineers and Computer Scientists in general. Let’s focus on Data Scientists.

The Future of Data Scientists

At the moment the LLMs are just tools that serve as great aids for Data Scientists. They help them in multiple ways:

They allow them to be more efficient, especially in repetitive tasks or easy but tedious ones.
Find pieces of code that easily adapt to their needs in a couple of minutes
Resolve errors in their code
Document the code
Improve variable names and code structure
Optimise the efficiency of their code

What will happen when LLMs further develop?

So far they are just great tools that help Data Scientists in their day-to-day jobs, but we could conceivably reach a moment in which they become more than that. As noted above, there are already powerful tools that can automate coding. Imagine that they become really proficient at it. Imagine also that more powerful versions of tools like AutoGPT or LangChain are released, which can easily be combined with those coding tools to develop models or programs from start to finish. LLMs would also be capable of optimising algorithms or even developing new ones! This could lead to even faster development, accelerating the whole industry and spreading to others.

One-year improvement in accuracy from GPT-3.5 to GPT-4. Source: Eloundou et al.

Let’s see an example. The development of OpenAI's GPT has seen remarkable advancements within just a year, achieving heightened accuracy on a range of assessments. GPT-4 has delivered outstanding results across numerous tests compared to its previous version, GPT-3.5. Such progress suggests that subsequent iterations will likely witness even more substantial leaps in performance.

The question is, will these LLM-trained models generate something new, innovative and creative? Or will they only be able to reproduce things similar to what they learnt during their own training? For a more visual example, we can think of image generation models like DALL-E 2, Midjourney, or Stable Diffusion. They are very good at generating images with the concepts they learnt during training or even reproducing styles (like creating images in the style of Picasso). But they are still not capable of producing something truly innovative or groundbreaking.

Example of a DALL-E 2 AI-generated image: a person making coffee in the style of Picasso. Source: Tongesy

However, we are still in the early stages of this revolution, so we may discover in a few years that they are capable of this too. If so, they will also be able to come up with innovative solutions for training new and more advanced LLMs. We still don’t know whether this will be possible; for now we can only imagine it. If it ever happens, they will certainly pose a threat to the jobs of many professionals, including Data Scientists.

I believe that by then Data Scientists will already have transformed into another type of professional. Possibly they will be in charge of judging the capabilities of the models trained by LLMs, selecting and limiting them to ensure the best model for each application and to prevent any possible issues or dangers.

How can the role be transformed?

However, there is something really important to note here. Thanks to LLMs like ChatGPT, Data Scientists will be able to generate pieces of code that train a model from beginning to end using English instead of a programming language like Python. This is good and bad at the same time. Why am I saying this? English, like any other natural language, can be ambiguous, and it is sometimes hard to describe something exactly as you want. This is the reason why programming languages are so powerful: through them, you can give exact and precise instructions about what needs to be done. We can foresee that LLMs will get better and better, but they will always have this limitation as long as the instructions are given to them in natural language. This means that someone will always need to write or review the code before actually using the model or program the LLM creates, because we will want a model that does exactly what we intend, not something that more or less works.
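To make the ambiguity point concrete, here is a minimal sketch of how a single English instruction, "remove the outliers", can be turned into two different pieces of code. The function names, thresholds and sample prices are assumptions chosen for illustration, not something taken from any particular LLM or library beyond Python's standard `statistics` module.

```python
from statistics import mean, quantiles, stdev


def drop_outliers_zscore(values, threshold=3.0):
    """Reading A: drop points more than `threshold` sample standard
    deviations away from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) <= threshold * sigma]


def drop_outliers_iqr(values, k=1.5):
    """Reading B: drop points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sample: six ordinary prices and one suspicious value.
prices = [10, 12, 11, 13, 12, 11, 40]

kept_zscore = drop_outliers_zscore(prices)
kept_iqr = drop_outliers_iqr(prices)

print("z-score rule keeps:", kept_zscore)
print("IQR rule keeps:    ", kept_iqr)
```

On this sample the two readings disagree about the value 40: the z-score rule with a threshold of 3 keeps it, while the IQR rule drops it. Both are reasonable answers to the same English request, which is why someone still has to read the generated code and decide which behaviour was actually wanted.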

So possibly, in this case, the job of a Data Scientist will evolve into something more like a code assistant or reviewer. In some cases they will also add domain knowledge, creating new variables or additional information (for example, calculating the surface area of a house from its dimensions) so that the model can improve its accuracy.
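The house example can be sketched in a few lines. This is only an illustration: the field names (`length_m`, `width_m`, `price`) and the assumption of a rectangular footprint are hypothetical choices, and a real project would likely use a data frame library instead of plain dictionaries.

```python
def add_surface_area(houses):
    """Return copies of the records with a derived `surface_m2` feature.

    Assumes a rectangular footprint, so surface = length * width.
    """
    return [
        {**house, "surface_m2": house["length_m"] * house["width_m"]}
        for house in houses
    ]


# Hypothetical raw records: only the dimensions and the target are known.
houses = [
    {"length_m": 10.0, "width_m": 8.0, "price": 250_000},
    {"length_m": 12.5, "width_m": 6.0, "price": 240_000},
]

enriched = add_surface_area(houses)
for house in enriched:
    print(house)
```

The new column carries knowledge the raw dimensions do not express on their own: price tends to relate to the area, not to the length or the width separately. This is the kind of input a human with domain knowledge can contribute.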

There is also another possibility: a different kind of programming language, oriented to giving instructions to LLMs, will be developed. This would be a hybrid between natural language and programming language, adding precision while keeping the naturalness of English. This is already starting to happen in the form of “prompts”. So far they are not very specific and depend a lot on the model you are using. Prompts are structured text that an LLM can interpret and understand thanks to specific words that guide the model in the direction we want. What I foresee is something more specific, which will give the Data Scientist further control when communicating with the LLM.
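As a rough illustration of what "structured text with guiding words" might look like, the sketch below assembles a prompt from labelled sections. The section markers (`### TASK` and so on), the function name and the example task are all assumptions made up for this example; they are not a standard, and different models may respond better to different conventions. No LLM is called here; the code only builds the text.

```python
def build_prompt(task, constraints, output_format):
    """Assemble a sectioned prompt from a task, a list of constraints
    and a description of the expected output."""
    lines = ["### TASK", task, "", "### CONSTRAINTS"]
    lines += [f"- {constraint}" for constraint in constraints]
    lines += ["", "### OUTPUT FORMAT", output_format]
    return "\n".join(lines)


prompt = build_prompt(
    task="Train a model that predicts house prices.",
    constraints=[
        "Use only the columns provided",
        "Hold out 20% of the rows for evaluation",
    ],
    output_format="A single Python script with comments.",
)

print(prompt)
```

Keeping the task, the constraints and the output format in separate, named sections is one way to reduce the room for interpretation compared with a single free-form sentence.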

Having said that, let’s think about the best and worst cases from a Data Scientist's point of view.


Possible scenarios

The best scenario

In this scenario, we can think of LLMs as assistants or tools for a Data Scientist. They are optimised to review the code, document it or even test it for errors or optimisation. There is some limitation (either in the models or by law) that prevents them from training themselves to become more and more advanced and completely take over the tasks of a Data Scientist.

They work as if they were another program on the computer, integrated into programming environments. This is a situation similar to what happened with the industrial revolution. People were then scared that machines were going to take their jobs and completely replace them, but the reality was that they made their jobs more efficient, allowing them to do more in less time, and therefore allowing companies to progress faster. This led to the creation of multiple jobs and the adaptation of some of the existing ones.

So this scenario could improve the efficiency of Data Scientists, who will now be able to develop more models in less time, optimise them and push them to their limits. This would help companies optimise their processes, marketing campaigns, budget allocation and more.

Also, this scenario could lead to the transformation of the Data Science role as mentioned before. Data Scientists would become those who steer the model in the expected direction by using a special LLM-programming language, preventing it from pursuing goals other than the intended ones. This could involve modifying the code the LLMs create, improving certain aspects to make the models more tolerant or less biased towards certain communities, researching and adding additional details or information that may improve the accuracy of the models, etc.

The worst scenario

In this scenario, the LLMs would learn how to code exactly what is required and develop more powerful models that could further develop themselves even to achieve AGI (Artificial General Intelligence). This could lead to the feared “Singularity”, which is a hypothetical future point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable changes to human civilization. But let’s not think about this extreme and unlikely outcome. We can focus on the idea of LLMs being able to create Machine Learning models by themselves, without the assistance of a Data Scientist, just by mentioning the objective or goal to achieve.

This scenario would completely transform the role of Data Scientists. They would become more like prompt engineers, simply stating in natural language what their bosses told them they wanted. The bosses might even chat directly with an LLM and get the model they require without the intervention of any other person. This scenario would affect not only Data Scientists but many others, and would lead to a completely different society. But this is far more complex than the question we are trying to answer here, which is how LLMs could affect or transform Data Scientists' jobs.

Three possible scenarios

Which is the most likely case?

My opinion

In my opinion, the most likely scenario is something in between. LLMs will possibly be capable of training models. However, Data Scientists will still be needed to steer the LLMs in the right direction, provide the right sources of data, modify the generated code, create additional features or provide them with more information.

But this is just my opinion. Let’s see how other sources imagine the future of Data Science.

Other sources’ opinion

Various articles and experts have weighed in on this topic, with most suggesting that while LLMs will reshape the role of Data Scientists, they won't replace them entirely.

In the article “Is Data Science a Dying Field?” from Analytics India Magazine, Lokesh Choudhary believes that Data science will remain crucial as long as we rely on data, and despite automation concerns, the demand for Data Scientists is expected to grow, especially in sectors like finance, healthcare, and governance.

In the article “Predictions On The Future Of Data Science” in Forbes, Fabio Moioli suggests that Data Scientists' roles will evolve as AI and automation augment their tasks, especially with the rise of low-code and no-code platforms. He thinks that the focus of this role is shifting from merely building models to effectively utilizing them post-creation.

Natassha Selvaraj, the author of the KDNuggets article titled “Will ChatGPT Replace Data Scientists?”, believes that LLMs can't fully replace human roles in data product creation, but they reduce the number of people needed for certain tasks. Despite their advancements, LLMs cannot match human creativity and decision-making.

In the article “What’s the Future of Data Science and AI in a LLM world?”, Brian Tarran reports statements by Osama Rahman, director of the Data Science Campus at the UK Office for National Statistics. Looking at a shorter timeframe, Rahman thinks Data Science and AI may not drastically change the world in the next 5-10 years. However, AI will enhance our ability to solve analytical problems more efficiently.

The authors of the scientific paper titled “What Should Data Science Education Do with Large Language Models?” think that LLMs like ChatGPT are currently reshaping Data Scientists' roles from hands-on tasks to overseeing AI-driven analyses, mirroring the shift from software engineers to product managers. This shift demands a reimagining of Data Science education, blending the advantages of LLMs with human expertise and creativity.

My audience’s opinion

I also asked my followers on Twitter regarding their beliefs about the impact of LLMs on the role of Data Scientists in the next 20 years. I offered them three choices:

LLMs as a Tool: While LLMs will serve as a valuable tool, Data Scientists will utilize them occasionally to enhance their efficiency, allowing them to tackle more tasks effectively.
Data Scientists Guiding LLMs: Although LLMs can handle the bulk of the work, including training models, the expertise of Data Scientists remains crucial. They will direct LLMs by suggesting relevant features, supplying the necessary data, and verifying outcomes.
The Future of Data Scientists in an LLM-Dominated World: LLMs may eventually take over many tasks traditionally handled by Data Scientists, from training models to selecting and deploying data. As a result, Data Scientists might need to pivot, potentially transitioning more towards data engineering roles.

This is what they answered:

My audience on Twitter is quite optimistic about the role of Data Scientists. Over 50% think that LLMs will remain a tool that helps Data Scientists without taking over any of their main responsibilities. Around a third of them thought that Data Scientists would need to give up some of their responsibilities while keeping the ones more oriented to steering the LLM to achieve the expected outcome. Finally, only 10% thought the worst case could happen, in which the role of Data Scientists would be completely replaced by LLMs.

Can the development of LLMs be stopped?

I believe governments and companies can slow down LLMs, and AI in general, by imposing regulations and limitations. However, I don’t think their development can be stopped. It is inevitable.

Personally, I think halting the advancement of generative AI systems like LLMs is both implausible and inadvisable. While there are worries about AI's societal dangers, job losses, and the difficulty in differentiating between AI-produced and human-made content, the potential upsides, like progress in science and enhanced services, are also significant.

Conclusion

Large Language Models (LLMs) have revolutionized the tech landscape with their advanced capabilities, particularly in the realm of Data Science. While there's a spectrum of opinions on their impact, the prevailing sentiment suggests that LLMs will reshape, but not replace, the role of Data Scientists. As these models continue to evolve, it's crucial to harness their potential responsibly, ensuring they complement human expertise rather than overshadow it.


More from this Author - Best pieces

What to do with missing data?: Newsletter issue in which a theoretical introduction is shared followed by a practical exercise.


Forecast if the price of Bitcoin will increase or decrease tomorrow with ChatGPT: https://medium.com/datadriveninvestor/forecast-if-the-price-of-bitcoin-will-increase-or-decrease-tomorrow-with-chatgpt-79ead80c2cd2
Advanced Time Series Forecasting Methods: a theoretical introduction to the diverse techniques to forecast time series data. https://mlpills.dev/time-series/advanced-time-series-forecasting-methods/
Data Job Search Tips!: DSBoost issue focused on giving data science job search tips.


A basic introduction to LangChain: MLPills article in which LangChain is introduced in combination with the OpenAI’s API of GPT. https://mlpills.dev/nlp/a-basic-introduction-to-langchain/

Author Socials

X / Twitter: https://twitter.com/daansan_ml

LinkedIn: https://www.linkedin.com/in/davidandressanchez/

MLPills newsletter:

DSBoost newsletter:

Medium: https://medium.com/@andressanchezdavid

Notes: https://substack.com/@mlpills

Thanks for reading!
