Data Science, Data Analysis, and Business Intelligence are some of the sexiest jobs right now. They bring value to a company by turning the data it gathers into insights about the business. A lot of people want to enter the field but don’t know where to start in order to get “job-ready” and be competitive in the market. Here, I will list the most important tools in Data Science and Data Analysis so you can start your journey to becoming a data scientist or data analyst. You can find tutorials and resources online to help you master everything mentioned in this article, and I will cover each topic in depth in future posts.
Statistics can be described as the science of collecting, organizing, analyzing, and summarizing data (Bluman, 2007). Statistics is at the heart of any real analysis of data and of getting information from it. Therefore, I would suggest taking at least one college course in Statistics before you jump into Data Analysis. This will help you understand probability and terms like the mean, median, and quartiles that are used to summarize data. In short, if you don’t have a sense of statistics, you won’t be able to understand and communicate what the data can tell you about the business.
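To make those summary terms concrete, here is a minimal sketch using Python's built-in statistics module. The order counts are made up for illustration, and the quartiles use the module's default method, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical daily order counts, invented for illustration.
# Note the single unusually large value (90).
daily_orders = [12, 15, 11, 18, 14, 90, 13, 16]

mean_orders = statistics.mean(daily_orders)
median_orders = statistics.median(daily_orders)

# quantiles(n=4) returns the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(daily_orders, n=4)

print("mean:     ", mean_orders)
print("median:   ", median_orders)
print("quartiles:", q1, q2, q3)
```

In this sample, the one large value pulls the mean well above the median, which is the kind of thing you learn to spot once you are comfortable with these terms.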
Believe it or not, a lot of data analysis can be done in Excel without having to jump into Python or R. Most of the time, a pivot table in Excel can answer the business questions that management needs answered to make timely decisions. Many people in Data Analysis overlook this tool because it has some limitations, but you would need to be working with very large datasets to hit Excel’s limits (a worksheet holds up to 1,048,576 rows). In addition, it can serve as a starting point in your career while you learn how to code, and programs like Tableau are easier to understand if you have experience with Excel. Excel might also be the only tool available to you at your job; it all depends on the size of the firm and how much data the company collects.
Tableau or Power BI
Tableau is like Excel’s big brother in Data Analysis: it is more focused on analysis and visualization of data, while Excel is broader in terms of features. I would suggest learning Excel first and then jumping into Tableau, because most of your Excel knowledge will transfer to the new tool. Power BI was developed by Microsoft and is Tableau’s direct competitor. If your organization is Microsoft-dependent, Power BI might be a better option than Tableau.
Every company stores its data in some form. Some companies might use CSV files or text files, but most data is stored in relational databases. If you need to extract data from a relational database, you need to learn SQL (Structured Query Language). This is a must-learn language because you can explore data directly in the database using SELECT statements, or run queries that extract some or all of the data.
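As a rough illustration of both uses, the sketch below runs two SELECT statements against a throwaway in-memory SQLite database through Python's built-in sqlite3 module. The sales table, its columns, and its rows are all hypothetical; a real company database will look different, but the SELECT syntax carries over.

```python
import sqlite3

# In-memory database with a hypothetical `sales` table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Widget", 120.0),
        ("North", "Gadget", 80.0),
        ("South", "Widget", 200.0),
        ("South", "Gadget", 50.0),
    ],
)

# Extract a portion of the data: only the rows for one region.
north_rows = conn.execute(
    "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
    ("North",),
).fetchall()

# Explore the data inside the database: total sales per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

conn.close()

print(north_rows)
print(totals)
```

The first query pulls out a slice of the table, and the second summarizes it without ever exporting the raw rows, which is often all the exploration you need.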
Python or R
Python and R are two programming languages that are popular for Data Analysis and Data Science. If I had to pick one language, I would choose Python because it gives us more tools. Python is a general-purpose programming language, so you will even find frameworks for creating web applications, and you don’t have to leave Python when you need to create something from scratch. Also, the number of open source libraries and resources is still growing. R is a more specialized programming language created for statistical computing and graphics, so it has all the tools you need to get insights from data. The downside is that it is not as popular as Python, and you won’t find as many resources.
Python Libraries: NumPy, Pandas, Matplotlib, Seaborn
As I mentioned in the previous section, Python has a lot of resources available for free. Over the last six years, Python has become the de facto standard for Data Analysis, with libraries like NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn. If you can master these tools, you will have all you need to tackle most problems in your career.
If you know enough statistics, you will realize that it is always a good idea to visualize the data, whether to find anomalies like outliers or to communicate results to management. It is very difficult to communicate with numbers alone, and most people won’t understand when you just tell them the numbers. You have to tell the story using graphics. Data Visualization comes into play when we use those graphics to deliver the message and show trends or differences between numbers in a way that most people can easily absorb.
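Here is a tiny sketch of that idea. In practice you would reach for a plotting library like Matplotlib or Seaborn; this version only prints a text bar chart with plain Python so it runs anywhere. The monthly figures and the text_bar_chart helper are hypothetical, made up for this example.

```python
# Hypothetical monthly sales figures, invented for illustration.
monthly_sales = {"Jan": 120, "Feb": 135, "Mar": 40, "Apr": 150}


def text_bar_chart(values, width=30):
    """Return one line per label, with a bar scaled to the largest value."""
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        # Scale each bar so the largest value fills the full width.
        bar = "#" * round(value * width / largest)
        lines.append(f"{label:<4}{bar} {value}")
    return lines


chart = text_bar_chart(monthly_sales)
print("\n".join(chart))
```

Even in this crude form, the short bar for March stands out much faster than it does in the raw list of numbers.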
The Next Step
For the next step, I would suggest machine learning because it helps us find patterns and solve problems based on the data. With machine learning, we can predict, with a certain accuracy, the price of a house based on its features, for example. There are different algorithms used for machine learning, but this should be the last step in the journey to becoming a data analyst or data scientist. In most cases, if you are a data analyst, your work won’t include machine learning because you will be exploring, describing, or explaining the data. Machine learning goes beyond data analysis and enters the territory of data science. But this is not set in stone, because titles vary from company to company for the same job. For instance, you could be called a data scientist while somebody doing the same job at another company is called a data analyst.
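To give a feel for the house price example, here is a minimal sketch of one of the simplest algorithms, a straight line fitted by ordinary least squares. It uses a single feature (size) and a handful of made-up, deliberately tidy data points. The fit_line helper is written for this example only; in real work you would more likely use a library such as Scikit-learn with many more features.

```python
# Hypothetical training data, invented for illustration:
# house size in square meters and sale price in thousands of dollars.
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 250, 300, 350]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Predict the price of a house the model has not seen (100 square meters).
predicted_price = slope * 100 + intercept
print(f"predicted price: {predicted_price:.1f} thousand dollars")
```

Real data is never this clean, which is why the prediction only holds "with a certain accuracy" and why measuring that accuracy is a big part of machine learning.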
Bluman, A. G. (2007). Elementary Statistics: A Step by Step Approach. McGraw-Hill.