Analysis of Texas Public School Spending and STAAR Performance

Project Overview

A1. Research Question

The research question this project sought to answer is whether we can predict if a Texas public school district will perform well or poorly on the STAAR test based on how it allocates or spends its budget, taking into account how racially diverse the district is. The project also sought to identify which financial expenditure items are important predictors of whether a district will perform well or poorly on the STAAR test. The purpose of this analysis is to provide counsel to school administrators and policymakers so that they can make data-driven decisions instead of relying on anecdotal evidence alone.

A2. Project Scope

This project’s scope was to use a Jupyter notebook to build a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model takes several financial expenditure inputs and outputs a prediction of that district’s performance on the STAAR test. This project examined the 2018-2019 academic year.

A3. Solution Overview – Tools

The Jupyter notebook that contains the model was coded in the Python programming language and relies on several key inputs from the end user. The first input is a csv file called Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019 (Texas Education Agency, 2019), which contains demographic information on the white and non-white populations of each district. The second key input is tidy_campstaar1_2012to2019 and tidy_campstaar2_2012to2019 (Texas Education Agency, 2022), which contain information on how many students in each district performed well (meets or masters a subject) or did poorly (approaches comprehension of a subject). The third key input is 2007-2021-summaried-peims-financial-data (Texas Education Agency, 2022), which contains information on how each district spent its budget. The notebook automatically performed cleansing and transformation of the data to make it suitable for analysis and modeling, and then constructed a model that predicted whether a district would perform well or poorly on the STAAR test.

A4. Solution Overview – Methodologies

There were four different types of methodologies used in this project: Project, Data Collection, Analytical, and Statistical. Each methodology played an important role in the planning, execution, and verification of the project.

Project Plan

In this project, the author executed the plans without change. All goals, objectives, and deliverables listed below were completed as described in Task 2, apart from the dates or time frame for each task. The timeline projected in Task 2 was very conservative, and the actual project completed earlier than planned.

B1. Project Execution

The goal for this project was to create a model that would predict how a district would perform on the STAAR test. The objectives of this goal were:

  • Determine which districts are performing or underperforming in relation to how much funding they receive, how racially diverse a district is, and how well the students in the district perform on the STAAR test overall.
  • Determine which type of expenditures contribute to the weighted performance of the district.
  • Predict how a district will perform based on how they allocate their budget.

B2. Project Planning Methodology

The project planning methodology used was PAPEM-DM (De Dios, 2020). The seven steps involved in this method include: Planning, acquisition, preparation, exploration, modeling, delivery, and maintenance.

  1. Project Planning – answered the why of the project, addressed the how, delivered a list of research questions to answer and criteria of success.
  2. Acquisition of Data – acquired the data from the TEA and Kaggle websites.
  3. Preparation of Data – prepared the data, which included cleaning it and transforming it into a format suitable for analysis and modeling.
  4. Exploratory Data Analysis – explored the data using descriptive statistics and visualized the relationships between variables using matplotlib and seaborn. The deliverables of this phase were a few graphs ready for presentation to stakeholders.
  5. Modeling – created, fitted, and used the model for training and inference. The deliverable was a pickle file for the trained model.
  6. Delivery of Results – delivered the results by completing Task 3 of this capstone project.
  7. Maintenance – while important, this project did not include this phase. However, the Jupyter notebook can be used as a basis for deployment into production (for inferencing) using any of the cloud technology platforms like AWS, Azure, or GCP.

B3. Project Timeline and Milestones

Phase       | Description                  | Deliverable    | Start | End  | Duration
Planning    | Project planning             | Project plan   | 8/1   | 8/3  | 7 days
Acquisition | Getting the data             | Raw datasets   | 8/4   | 8/4  | 1 day
Preparation | Cleansing and transformation | Clean datasets | 8/5   | 8/21 | 13 days
Exploration | EDA                          | Visualizations | 8/5   | 8/21 | 13 days
Modeling    | Train and infer              | Pickle file    | 8/5   | 8/21 | 13 days
Delivery    | Task 3 Report                | PDF Document   | 8/22  | 8/28 | 7 days
Maintenance | Deployment to production     | Not Applicable | N/A   | N/A  | N/A

Methodology

C. Data Selection and Collection Process

The author downloaded two of the datasets from the TEA website. While the third dataset could also have been downloaded from the same place, the author chose to use the Kaggle version of the data because it had already been collected in a format that did not necessitate web scraping to mine the information from the website.

Data selection and collection did not vary much from the proposed plan. The only addition the author made was the inclusion of ethnicity data downloaded from the same TEA website. This data was necessary to examine and analyze the effect of racial diversity within the districts.

One obstacle the author encountered involved that same dataset added during the actual execution of the project. The author encountered several errors when reading the csv file and realized that the formatting of the csv itself was off. The author adjusted the read_csv parameter “skiprows” and was able to bypass the extra header rows and reach the data needed in the analysis.
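A minimal sketch of this workaround is shown below. The file name comes from the report; the skiprows value is illustrative, since the exact number of report-header rows depends on the file’s layout.

```python
import pandas as pd

# The enrollment CSV begins with several report-header rows before the real
# column names, so skiprows is used to jump straight to the data.
# The value 4 is illustrative; adjust it to match the actual file.
enrollment = pd.read_csv(
    "Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019.csv",
    skiprows=4,
)
print(enrollment.head())
```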

There really were no data governance issues because the datasets are already in the public domain. The one concern would have been the anonymity of students in schools or school districts so small that student anonymity would have been impossible to preserve. This issue was already taken care of by the Texas Education Agency before it published the dataset.

For example, for districts with small subpopulations, the numerator values have been coded as -999. The author simply had to replace those values with 0 for the dataframe to be processed cleanly. Another example of the datasets’ cleanliness is the fact that there were no missing values in any of the columns.
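As an illustration of this masking-code cleanup (the column names below are hypothetical stand-ins, not the actual STAAR columns):

```python
import pandas as pd

# Toy example: -999 is the masking code used for small subpopulations.
staar = pd.DataFrame({"district": ["A", "B"], "meets_count": [120, -999]})

# Replace the sentinel with 0 so downstream aggregations are not skewed.
staar = staar.replace(-999, 0)
print(staar)
```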

Cleanliness and sanitation were therefore the obvious advantages of the datasets used. Their limitation was that, when merged together, the combined dataset only encompasses the 2018-2019 school year. This is by design and hence not a big enough limitation to warrant much concern.

D. Data Extraction and Preparation Process

The author utilized the power of pandas to read the csv files into a dataframe. The method read_csv() was appropriate because the data is in a tabular format delimited by a comma.

For data preparation or data wrangling, the author used many techniques to handle missing and duplicate data, as well as to remove extraneous columns and rows. Some of these techniques included replacing null placeholder values with true numeric NaN values, using drop_duplicates() and drop() for columns, and using string methods to clean up or standardize the column names. The author also used loc, iloc, and np.where to filter columns and rows. Finally, for merging the datasets, the author used a combination of the concat() and merge() methods to execute inner and outer joins of the data. Utilizing pandas methods made sense because of their efficiency, versatility, and simplicity.
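The following condensed sketch illustrates the kinds of pandas operations described above; the column names and values are illustrative placeholders, not the actual fields in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative frames standing in for the cleaned enrollment and finance data.
enrollment = pd.DataFrame({
    "DISTRICT ": ["A", "B", "B"],
    "Non-White Pct": [0.62, 0.45, 0.45],
})
finance = pd.DataFrame({"district": ["A", "B"], "payroll": [1_200_000, 800_000]})

# Standardize column names with string methods, then drop exact duplicates.
enrollment.columns = (
    enrollment.columns.str.strip().str.lower().str.replace(" ", "_")
)
enrollment = enrollment.drop_duplicates()

# Flag majority non-white districts with np.where, then inner-join on district.
enrollment["majority_non_white"] = np.where(enrollment["non-white_pct"] > 0.5, 1, 0)
merged = enrollment.merge(finance, on="district", how="inner")
print(merged)
```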

E. Data Analysis Process

For exploratory data analysis, the author used several visualizations to examine the relationships between variables. For example, the author used a heatmap to visualize the correlation coefficients via a correlation matrix using the Pearson method. A pairplot was also used to examine whether any of the relationships between variables had a positive or negative trend. Both were essential in determining and validating whether any of the variables suffered from multicollinearity. Knowing this is important because it determines which types of machine learning algorithms can or cannot be used if the author chose not to handle the related variables individually. Finally, several violin plots were used to examine the distribution and density of each variable for both the passing and failing classes.
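A sketch of these three visualization types is shown below, using a small synthetic dataframe in place of the actual merged district data; the feature names are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the merged district-level data: two expenditure
# features plus a binary performance label ("passing").
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "payroll_expenditure": rng.normal(1_000_000, 200_000, 200),
    "capital_outlay": rng.normal(300_000, 80_000, 200),
    "passing": rng.integers(0, 2, 200),
})

# Heatmap of Pearson correlation coefficients to check for multicollinearity.
sns.heatmap(df.drop(columns="passing").corr(method="pearson"),
            annot=True, cmap="coolwarm", center=0)
plt.show()

# Pairplot to eyeball positive or negative trends between variable pairs.
sns.pairplot(df, hue="passing")
plt.show()

# Violin plot: distribution and density of one feature per class.
sns.violinplot(data=df, x="passing", y="payroll_expenditure")
plt.show()
```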

To complement the visualizations, the author implemented a few statistical tests to analyze the dataset: Levene’s test, the Shapiro-Wilk test, and the Mann-Whitney U test. The author used Levene’s and Shapiro-Wilk to test for equality of variance and normality in order to decide between parametric and non-parametric methods of statistical testing. As a result, the author used the Mann-Whitney U test to test for significance.
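A minimal sketch of this test sequence with scipy.stats, using synthetic group samples in place of the actual district subsets:

```python
import numpy as np
from scipy import stats

# Synthetic placeholders for one expenditure variable in the high-performing
# and low-performing district groups.
rng = np.random.default_rng(0)
high = rng.normal(1.1, 0.2, 100)
low = rng.normal(1.0, 0.3, 100)

# Levene's test for equality of variances and Shapiro-Wilk for normality
# guide the choice between parametric and non-parametric tests.
print(stats.levene(high, low))
print(stats.shapiro(high), stats.shapiro(low))

# Non-parametric comparison of the two groups.
print(stats.mannwhitneyu(high, low))
```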

However, statistical testing does have its own limitations. For one, qualitative factors of the study are not considered because “statistical method cannot study phenomena that cannot be expressed, captured, or measured in quantitative terms.” For example, we considered the efficiency of districts while accounting for racial diversity, but one must be careful in assigning weights to the imbalance because it is simply not a black-and-white matter. One cannot simply assign a numerical value to racial disparity, and there is no clear-cut formula that defines racial inequity either.

In the author’s opinion, visualizations and statistical testing techniques were both needed. Visualization, although limited to eyeballing techniques, can reveal hidden patterns in the data, while statistical testing can validate the assumptions made by eyeballing said data.

Results

F. Project Success

F1. Statistical Significance

Calculating the statistical significance of this project was a reasonably straightforward task. First, the author defined a formula for the measure of efficiency with respect to racial diversity, called E, where E = (B/(P/S))/X: B is the total operating program budget of the district, P is the total number of passing (meets or masters a subject) students, S is the total number of students taking the test, and X is the ratio of non-white to white student enrollment. Based on this formula, the author then calculated the first quartile of E and created two subsets of data based on that quartile. Districts at or below the quartile were those that performed well on the STAAR test, and districts above the quartile were those that did poorly. The author then tested for the statistical difference between the two group samples. Since the p-value is below 0.05, the difference between the two groups was found to be statistically significant. With this, the author can reject the null hypothesis and confirm the original hypothesis that the “Payroll” expenditure (and other variables, for that matter) of districts that performed well on the STAAR test is statistically different from that of districts that performed poorly.
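The sketch below illustrates the calculation and the quartile split; the district values are illustrative placeholders used only to show the mechanics.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# E = (B/(P/S))/X, where B = operating budget, P = passing students,
# S = students tested, X = non-white to white enrollment ratio.
df = pd.DataFrame({
    "budget":            [5_000_000, 3_200_000, 7_500_000, 4_100_000],
    "passing":           [800, 450, 900, 620],
    "tested":            [1_000, 700, 1_500, 800],
    "nonwhite_to_white": [1.5, 0.8, 2.3, 1.1],
    "payroll":           [3_000_000, 2_000_000, 4_800_000, 2_500_000],
})
df["E"] = (df["budget"] / (df["passing"] / df["tested"])) / df["nonwhite_to_white"]

# Districts at or below the first quartile of E form the well-performing group.
q1 = df["E"].quantile(0.25)
well = df[df["E"] <= q1]
poor = df[df["E"] > q1]

# Test whether an expenditure item (e.g., payroll) differs between the groups.
stat, p = mannwhitneyu(well["payroll"], poor["payroll"])
print(stat, p)
```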

F2. Practical Significance

The practical significance is that differences in certain expenditures can translate into a significant amount of money saved by each district. For example, if a school district decides to spend more on operations rather than on capital investments, the district would realize more of a return on this investment in terms of higher student performance scores on the STAAR test.

F3. Overall Success

The author believes this project was a success. All three criteria laid out in Task 2 were met. First, the formula for calculating the efficiency of a district is appropriate and applicable. Second, the model has accuracy and AUC scores of more than 70%. And third, the project lists the most important features for high-performing districts.

G. Key Takeaways

This project set out to create a Jupyter notebook containing a model that would predict whether a Texas public school district would perform well or do poorly on the STAAR test. The model needed to be based on the weighted definition of efficiency and also needed accuracy and AUC scores of more than 70% to be considered a success. The following table summarizes whether the Jupyter notebook met each criterion.

Criterion/Metric                                                                  | Required Data             | Success
Is the formula for calculating the efficiency appropriate and applicable?        | E = (B/(P/S))/X           | YES
Does the model have accuracy and AUC scores of more than 70%?                    | Accuracy 90.93% / AUC 96% | YES
Does the project list the most important features for high-performing districts? | Bar graph                 | YES

The table above is a straightforward way to visualize and summarize the accomplishments of each objective.

The graph above shows the Receiver Operating Characteristic (ROC) curve, which visualizes how skilled the model is (blue line) compared to an unskilled model (chance, red line).
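A minimal sketch of how such a curve can be produced with scikit-learn is shown below; the classifier, the synthetic data, and the split are stand-ins, not the notebook’s actual model.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the district feature matrix and pass/fail labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Skilled model (blue line) versus the chance diagonal (red line).
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, color="blue",
         label=f"model (AUC = {roc_auc_score(y_test, scores):.2f})")
plt.plot([0, 1], [0, 1], color="red", linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```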

The graph below shows the most important features of the model and is a graphical way to convey how each variable affects the model based on the F score.
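The F score wording suggests a tree-based booster’s split-count importance; the sketch below assumes an XGBoost classifier and synthetic features, which may differ from the notebook’s actual model.

```python
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for the district expenditure features and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=7)

model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)

# Horizontal bar chart of feature importance; the x-axis is the F score
# (how often each feature is used to split across all trees).
xgb.plot_importance(model)
plt.show()
```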

Based on the findings above, the author suggests that a study be conducted on how racial diversity in particular affects the efficiency of Texas public school districts, as it has been shown to influence the model. In addition, further study needs to be conducted to determine whether having a bilingual program correlates positively or negatively with performance.

Sources

Baron, E. J. (2021). School Spending and Student Outcomes: Evidence from Revenue Limit Elections in Wisconsin. American Economic Journal: Economic Policy, 14(1), 1-39. Retrieved from https://www.aeaweb.org/articles?id=10.1257/pol.20200226

Carhart, A. E. (2016). School finance decisions and academic performance: An analysis of the impacts of school expenditures on student performance. Master’s thesis. Retrieved from https://www.csus.edu/college/social-sciences-interdisciplinary-studies/public-policy-administration/_internal/_documents/thesis-bank/thesis-bank-2016-carhart.pdf

Texas Education Agency. (2022, July 31). PEIMS Financial Data Downloads. Retrieved from Texas Education Agency: https://tea.texas.gov/finance-and-grants/state-funding/state-funding-reports-and-data/peims-financial-data-downloads

Texas Education Agency. (2022, July 31). Texas Education Agency Data 2012-2019. Retrieved from Kaggle: https://www.kaggle.com/datasets/9e3ce42f60ded3ba2a6dd890993493f2c4b284c5cfa035d711bd98fa3359924c?resource=download

Thompson, J., Young, J. K., & Shelton, K. (2021). Evaluating Educational Efficiency in Texas Public Schools Utilizing Data Envelopment Analysis. School Leadership Review, 16(1), Article 9. Retrieved from https://scholarworks.sfasu.edu/slr/vol16/iss1/9

Farewell

I have left TaskUs.

I would like to express gratitude for all the experience and support that TaskUs has given me. Being part of TaskUs’ success and seeing it grow from a startup to a publicly traded company is something I feel very proud of.

It’s been a pleasure working with a great team in the Business Insights and Data Science department. I am blessed to have worked with talented people, in particular, Scott Gamester, Rachel Perez, and Darcy Delamore. I have built a lasting friendship with colleagues that I will continue to cherish. I am grateful for my teammates, William Li, Dahlia Curtin, Sabrina Castillo, Antonio Morena, Tim Reyna, Sanjana Putchala, Priyanka Manchanda, and countless others.

I am sad to leave. At the same time, I am excited that everything I have learned during my time with TaskUs will help shape the rest of my career. Being with TaskUs afforded me the opportunity to apply data science to bring about actionable insights.

Lastly, I would like to offer my sincerest gratitude to Shauna Zamarippa and Tom Flynn for taking a chance on me.

Democratize Data Science

Every once in a while, I would come across an article that decries online data science courses and boot camps as pathways toward getting a data science job. Most of the articles aim not to discourage but to serve as a reminder to take a hard look in the mirror first and realize what we’re up against. However, a few detractors have proclaimed that the proliferation of these online courses and boot camps has caused the degradation of the profession.

To the latter, I vehemently disagree.

Bridging the Skill Gap

Data science has captured the popular imagination ever since Harvard Business Review dubbed data scientist the sexiest job of the 21st century. More than seven years later, data science remains one of the most highly sought-after jobs today. In fact, due to the dynamics of supply and demand, “the United States alone is projected to face a shortfall of some 250,000 data scientists by 2024¹.”

As a result, capitalism and entrepreneurship answered the call and companies like Codeup have vowed to “help bridge the gap between companies and people wanting to enter the field.”²

In addition, AutoML libraries like PyCaret are “democratizing machine learning and the use of advanced analytics by providing free, open-source, and low-code machine learning solution for business analysts, domain experts, citizen data scientists, and experienced data scientists”³.
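For a sense of what that low-code workflow looks like, here is a minimal sketch of PyCaret’s functional classification API (details vary by version; the bundled “juice” dataset and its “Purchase” target are used only as an example):

```python
from pycaret.classification import compare_models, predict_model, setup
from pycaret.datasets import get_data

# One call to configure the experiment, one call to train and rank a library
# of models, one call to score the hold-out split.
data = get_data("juice")
setup(data=data, target="Purchase", session_id=123)
best = compare_models()
predictions = predict_model(best)
print(predictions.head())
```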

The availability of online courses, boot camps, and AutoML libraries has led a lot of data scientists to raise their eyebrows. They fear that boot camp alumni and self-taught candidates will somehow lower the overall caliber of data scientists and disgrace the field. Furthermore, they are afraid that the availability of tools like AutoML will allow anyone to be a data scientist.

I mean, God forbid if anyone thinks that they too can be data scientists! Right?

Wrong.

The Street Smart Data Scientist

Alumni of boot camps and self-taught learners, like myself, have one thing going for us: our rookie smarts. To quote Liz Wiseman, author of the book Rookie Smarts:

In a rapidly changing world, experience can be a curse. Being new, naïve, and even clueless can be an asset. — Liz Wiseman

Rookies are unencumbered. We are alert and constantly seeking like hunter-gatherers, cautious but quick like firewalkers, and hungry and relentless like frontiersmen⁴. In other words, we’re street smart.

Many are so bogged down by “you’ve got to learn this” and “you’ve got to learn that” that they forget to stress the fact that data science is so vast that you can’t possibly know everything about anything. And that’s okay.

We learn fast and adapt quickly.

At the end of the day, it’s all about the value that we bring to our organizations. They are, after all, the ones paying our bills. We don’t get paid to memorize formulas or to code an algorithm from scratch.

We get paid to solve problems.

And this is where the street smart data scientist excels. We don’t suffer from analysis paralysis or get bogged down in theories, at least not while on the clock. Our focus is on pragmatic solutions to problems, not on academic debate.

This is not to say we’re not interested in the latest research. In fact, it’s quite the contrary. We are voracious consumers of the latest developments in machine learning and AI. We drool over the latest developments in natural language processing. And we’re always on the lookout for the latest tool that will make our jobs easier and less boring.

And AutoML

So what if we have to use AutoML? If it gets us to an automatic pipeline where analysts can get the results of machine learning without manual intervention by a data scientist, all the better. We’re not threatened by automation; we’re exhilarated by it!

Do not let perfection be the enemy of progress. — Winston Churchill

By building an automatic pipeline, there are bound to be some tradeoffs. But building it this way frees up our brain cells and gives us more time to focus on solving other higher-level problems and producing more impactful solutions.

We’re not concerned about job security, because we know that it doesn’t exist. What we do know is that the more value we bring to a business, the better we will be in the long run.

Maybe They’re Right?

After all this, I will concede a bit. For the sake of argument, maybe they’re right. Maybe online courses, boot camps, and low-code machine learning libraries really do produce low-caliber data scientists.

Big maybe.

But still, I argue, this doesn’t mean we don’t have value. Data science skills lie on a spectrum, and so does a company’s maturity when it comes to data. Why hire a six-figure employee when your organization barely has a recognizable machine learning infrastructure?

Again, maybe.

The Unicorn

Maybe, to be labeled as a data scientist, one must be a unicorn first. A unicorn data scientist is a data scientist who excels at all facets of data science.

[Image: Hckum / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)]

Data science has long been described as the intersection of computer science, applied statistics, and business or domain knowledge. To this, they ask: how can one person possibly accumulate all that knowledge in just a few months? To this, we ask the same question: how can a college grad?

Unicorns do exist, I believe, but they too had to start from somewhere.

So why can’t we?

Conclusion

A whole slew of online courses and tools promise to democratize data science, and this is a good thing.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out the article From Slacker to Data Scientist: My journey into data science without a degree.

And if you’re thinking about switching gears and venturing into data science, start thinking about rebranding now: The Slacker’s Guide to Rebranding Yourself as a Data Scientist (opinionated advice for the rest of us; love of math, optional).

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Harvard Business Review. (June 3, 2020). Democratizing Data Science in Your Organization. https://hbr.org/sponsored/2019/04/democratizing-data-science-in-your-organization

[2] San Antonio Express-News. (June 3, 2020). Software development bootcamp Codeup launching new data science program. https://www.mysanantonio.com/business/technology/article/Software-development-bootcamp-Codeup-launching-13271597.php

[3] Towards Data Science. (June 4, 2020). Machine Learning in Power BI Using PyCaret. https://towardsdatascience.com/machine-learning-in-power-bi-using-pycaret-34307f09394a

[4] The Wiseman Group. (June 4, 2020). Rookie Smarts: Why Learning Beats Knowing in the New Game of Work. https://thewisemangroup.com/books/rookie-smarts/

This article was first published in the Towards Data Science publication on Medium.

The Slacker’s Guide to Rebranding Yourself as a Data Scientist

Opinionated advice for the rest of us. Love of math, optional.


Since my article about my journey to data science, I’ve had a lot of people ask me for advice regarding their own journey toward becoming a data scientist. A common theme started to emerge: aspiring data scientists are confused about how to start, and some are drowning in the overwhelming amount of information available in the wild. So, what’s one more opinion, right?

Well, let’s see.

I urge aspiring data scientists to slow it down a bit and take a step back. Before we get to learning, let’s take care of some business first: the fine art of reinventing yourself. Reinventing yourself takes time, so we better get started early on in the game.

In this post, I will share a very opinionated approach to do-it-yourself rebranding as a data scientist. I will assume three things about you:

  • You’re broke, but you’ve got grit.
  • You’re willing to sacrifice and learn.
  • You’ve made a conscious decision to become a data scientist.

Let’s get started!


First Things First

I’m a strong believer in Yoda’s wisdom: “Do or do not, there is no try.” For me, either you do something or you don’t. Failure was not an option for me, and I took comfort in knowing that I wouldn’t really fail unless I quit entirely. So, first bit of advice: don’t quit. Ever.

Do or do not, there is no try.

Yoda

Begin with the End in Mind

Let’s get our online affairs in order and start thinking about SEO. SEO stands for search engine optimization. The simplest way to think about it is as the fine art of putting as much “stuff” as you can on the internet under your real professional name, so that when somebody searches for you, all they find is the stuff you want them to find.

In our case, we want the words “data science” or “data scientist” to appear whenever your name appears in the search results.

So let’s start littering the interweb!

  1. Create a professional Gmail account if you don’t already have one. Don’t make your username be sexxydatascientist007@gmail.com. Play it safe, the more boring, the better. Start with first.last@gmail.com, or if your name is a common one, append it with “data” like first.name.data@gmail.com. Avoid numbers at all costs. If you have one already, but it doesn’t follow the aforementioned guidelines, create another one!
  2. Create a LinkedIn account and use your professional email address. Put “Data Scientist in Training” in the headline. “Data Science Enthusiast” is too weak. We’ve made a conscious decision and committed to the mission, remember? While we’re at it, let’s put the app on our phone too.
  3. If you don’t have a Facebook account yet, create one just so you could claim your name. If you already have one, put that thing on private pronto! Go the extra mile and also delete the app on your phone so you won’t get distracted. Do the same for other social networks like Twitter, Instagram, and Pinterest. Set them to private for now, we’ll worry about cleaning them up later.
  4. Create a Twitter account if you don’t already have one. We can take a little bit of leeway in the username. Make it short and memorable but still professional, so you don’t offend anybody’s sensibilities. If you already have one, decide if you want to keep it or start all over. The main thing to ask yourself: is there any content in your history that can be construed as unprofessional or mildly controversial? Err on the side of caution.
  5. Start following the top voices in data science on LinkedIn and Twitter. Here are a few suggestions: Cassie Kozyrkov, Angela Baltes, Sarah N., Kate Strachnyi, Kristen Kehrer, Favio Vazquez, and of course, my all-time favorite: Eric Weber.
  6. Create a Hootsuite account and connect your LinkedIn and Twitter accounts. Start scheduling data science-related posts. You can share interesting articles from other people about data science or post about your own data science adventures! If you do share other people’s posts, please make sure you give the appropriate credit. Simply adding a URL is lazy and no bueno. Thanks to Eric Weber for this pro-tip!
  7. Take a professional picture and put it as your profile picture in all of your social media accounts. Aim for a neutral background, if possible. Make sure it’s only you in the picture unless you’re Eric (he’s earned his chops so don’t question him! LOL.)
  8. Create a Github account if you don’t have one already. You’re going to need this as you start doing data science projects.
  9. BONUS: if you can spare a few dollars, go to wordpress.org and get yourself a domain that has your professional name on it. I was fortunate enough to have an uncommon name, so I have ednalyn.com, but if your name is common, be creative and make one up that’s recognizably yours. Maybe something like janesmithdoesdatascience.com. Then you can start planning on having your resumé online or maybe even have a blog post or two about data science. As for me, I started with writing my experience when I first started to learn data science.
  10. Clean-up: when time permits, start auditing your social media posts for offensive, scandalous, or unflattering content. If you’re looking to save time, try a service like brandyourself.com. Warning! It can get expensive, so watch where you click.

Do Your Chores

No kidding! When you’re doing household chores, taking a walk, or maybe even while driving, listen to podcasts that cover data science topics, like Linear Digression and TwiML. Don’t get too bogged down about committing what they say to memory. Just go with the flow, and sooner or later, the terminology and concepts they discuss will start to sound familiar. Just remember not to get so caught up in the discussions that you start burning whatever you’re cooking or miss your exit, like I have many times in the past.

Meat and Potatoes

Now that we’ve taken care of the preliminaries of living and breathing data science, it’s time to take care of the meat and potatoes: actually learning about data science.

There’s no shortage of opinions about how to learn data science. There are so many of them that it can overwhelm you, especially when they start talking about learning the foundational math and statistics first.

Blah!

Tell me and I forget,
teach me and I remember,
involve me and I learn.

Old Chinese Adage

While important, I don’t see the point of studying theory first when I may soon fall asleep or, worse, get so intimidated by the onslaught of mathematical formulas that I become exasperated and end up quitting!

What I humbly propose, rather, is to employ the idea of “minimum viable knowledge,” or MVK, as described by Ken Jee in his article How I Would Learn Data Science (If I Had to Start Over). Ken Jee describes minimum viable knowledge as learning “just enough to be able to learn through doing.”² I suggest checking it out.

My approach to MVK is pretty straightforward: learn just enough SQL to be able to get the data from a database, learn enough Python so that you have program control and can use the pandas library, and then do end-to-end projects, from simple ones to increasingly more challenging ones. Along the way, you’ll learn about data wrangling, exploratory data analysis, and modeling. Other techniques like cross-validation and grid search will surely be part of your journey as well. The trick is never to get too comfortable and to always push yourself, slowly.
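As a rough illustration of that “just enough SQL and Python,” here is a minimal sketch; the school.db database, its table, and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical local database: one SELECT plus read_sql is often all the SQL
# you need to kick off an end-to-end project.
conn = sqlite3.connect("school.db")
query = """
    SELECT district, total_budget, students_tested, students_passing
    FROM district_results
    WHERE school_year = '2018-2019';
"""
df = pd.read_sql(query, conn)

# From here it is plain pandas: wrangle, explore, model.
print(df.describe())
```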

To the list-oriented, here is my process:

  1. Learn enough SQL and Python to be able to do end-to-end projects with increasing complexity.
  2. For each project, go through the steps of the data science pipeline: planning, acquisition, preparation, exploration, modeling, delivery (story-telling/presentation). Be sure to document your efforts on your Github account.
  3. Rinse and repeat (iterate).

For a more in-depth discussion of the data science pipeline, I recommend the following article: PAPEM-DM: 7 Steps Towards a Data Science Win.

For each iteration, I suggest doing an end-to-end project that practices each of these following data science methodologies:

  • regression
  • classification
  • clustering
  • time-series analysis
  • anomaly detection
  • natural language processing
  • distributed ML
  • deep learning

And for each methodology, practice its different algorithms, models, or techniques. For example, for natural language processing, you might want to practice the following techniques (a minimal n-gram ranking sketch follows the list):

  • n-gram ranking
  • named-entity recognition
  • sentiment analysis
  • topic modeling
  • text classification
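Here is that n-gram ranking sketch, using only the standard library and pandas; the toy corpus is made up for illustration.

```python
from collections import Counter

import pandas as pd

# Toy corpus standing in for whatever text you are practicing on.
docs = [
    "data science is fun",
    "data science is hard work",
    "science is fun and hard",
]

# Build bigrams per document, then rank them by frequency across the corpus.
bigrams = Counter()
for doc in docs:
    tokens = doc.lower().split()
    bigrams.update(zip(tokens, tokens[1:]))

ranking = pd.DataFrame(bigrams.most_common(), columns=["bigram", "count"])
print(ranking.head(10))
```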

Just Push It

As you do end-to-end projects, it’s a good practice to push your work publicly to Github. Not only will it track your progress, but it also backs up your work in case your local machine breaks down. Not to mention, it’s a great way to showcase your progress. Note that I said progress, not perfection. Generally, people understand if our Github repositories are a little bit messy. In fact, most expect it. At a minimum, just make sure that you have a great README.md file for each repo.

What to put on a Github Repo README.md:

  • Project name
  • The goal or purpose of the project
  • Background on the project
  • How to use the project (if somebody wants to try it for themselves)
  • Mention your keywords: “data science,” “data scientist,” “machine learning,” et cetera.

Don’t ignore this note: don’t make the big mistake of hard-coding your credentials or any passwords in your public code. Put them in a .env file and .gitignore them. For reference, check out this documentation from Github.
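One common pattern (assuming the python-dotenv package; the variable names are examples) looks like this:

```python
import os

from dotenv import load_dotenv

# .env (listed in .gitignore) holds lines like:
#   DB_USER=me
#   DB_PASSWORD=super-secret
load_dotenv()  # reads key=value pairs from .env into the environment

db_user = os.getenv("DB_USER")
db_password = os.getenv("DB_PASSWORD")
```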

For a great in-depth tutorial on how to use Git and Github, check out Anne Bonner’s guide: Getting Started with Git and Github: the complete beginner’s guide.

For the Love of Math

And finally, as you get better with employing different techniques and you begin to do hyper-parameter tuning, I believe at this point that you’re ready to face the necessary evil that is math. And more than likely, the more you understand and develop intuition, the less you’ll hate it. And maybe, just maybe, you’ll even grow to love it.

I have one general recommendation when it comes to learning the math behind data science: take it slow. Be gentle on yourself and don’t set deadlines. Again, there’s no sense in being ambitious and tackling something monumental if it ends up driving you insane. There’s just no fun in it.

There are generally two approaches to learning math.

One is to take the structured approach, which starts with learning the basics and then incrementally takes on the more challenging parts. For this, I recommend Khan Academy. Personalize your learning toward calculus, linear algebra, and statistics. Take small steps and celebrate small wins.

The other approach is geared slightly more toward hands-on involvement and takes a little bit of reverse engineering. I call it learning backward. You start by finding out what math concept is involved in a project, break that concept down into more basic ideas, and go from there. This approach is better suited for those who prefer to learn by doing.

A good example of learning by doing is illustrated by a post on Analytics Vidhya, supplemented by this article.

Take a Break

Well, learning math sure is hard! It’s so powerful and intense that you’d better take breaks often or risk overheating your brain. On the other hand, taking a break does not necessarily mean taking a day off. After all, there is no rest for the weary!

Every once in a while, I strongly recommend supplementing your technical studies with a little bit of understanding of the business side of things. For this, I suggest the classic book Thinking with Data by Max Shron. You can also find a lot of articles here on Medium.

For example, check out Eric Kleppen’s article.

Talk to People

Taking a break can be lonely sometimes, and being alone with only your thoughts can be exhausting. So you may decide to finally talk with your family. The problem is, you’re so motivated and gung-ho about data science that it’s all you can talk about. Sooner or later, you’re going to annoy your loved ones.

It happened to me.

This is why I decided to talk to other people with similar interests. I went to Meetups and started networking with people who are either already practicing data science or, like you, aspiring to be data scientists as well. In this post-COVID (hopefully) age that we’re in, group video calls are more prevalent. This is actually more beneficial because geography is no longer an issue.

A good resource to start with is LinkedIn. You can use the social network to find others with similar interests or even find local data scientists who can spare an hour or two every month to mentor motivated learners. Start with companies in your local municipality. Find out if a data scientist works there, and if you do find one, kindly send them a personalized message with a request to connect. Give them the option to refuse gracefully, and just ask them to point you to or recommend another person who does have the time to mentor.

The worst that can happen is that they say no. No hard feelings, eh?

Conclusion

Thanks for reading! This concludes my very opinionated advice on rebranding yourself as a data scientist. I hope you got something out of it. I welcome any feedback. If you have something you’d like to add, please post it in the comments or responses.

Let’s continue this discussion!


If you’d like to connect with me, you can reach me on Twitter or LinkedIn. I love to connect, and I do my best to respond to inquiries as they come.

Stay tuned, and see you in the next post!

If you want to learn more about my journey from slacker to data scientist, check out this article.


[1] Quote Investigator. (June 10, 2020). Tell Me and I Forget; Teach Me and I May Remember; Involve Me and I Learn. https://quoteinvestigator.com/2019/02/27/tell/

[2] Towards Data Science. (June 11, 2020). How I Would Learn Data Science (If I Had to Start Over). https://towardsdatascience.com/how-i-would-learn-data-science-if-i-had-to-start-over-f3bf0d27ca87

This article was first published in the Towards Data Science publication on Medium.

From Slacker to Data Scientist

My journey into data science without a degree.


Butterflies in my belly; my stomach is tied up in knots. I know I’m taking a risk by sharing my story, but I wanted to reach out to others aspiring to be a data scientist. I am writing this with hopes that my story will encourage and motivate you. At the very least, hopefully, your journey won’t be as long as mine.

So, full speed ahead.


I don’t have a PhD. Heck, I don’t even have any degree to speak of. Still, I am fortunate enough to work as a data scientist in a ridiculously good company.

How did I do it? Hint: I had a lot of help.

Never Let Schooling Interfere With Your Education — Grant Allen

Formative Years

It was 1995 and I had just gotten my very first computer. It was a 1982 Apple IIe. It didn’t come with any software but it came with a manual. That’s how I learned my very first computer language: Apple BASIC.

My love for programming was born.

In Algebra class, I remember learning about the quadratic equation. I had a cheap graphing calculator then, a Casio, that was about half the price of a TI-82. It came with a manual too, so I decided to write a program that would solve the quadratic equation for me without much hassle.

My love for solving problems was born.

In my senior year, my parents didn’t know anything about financial aid, but I was determined to go to college, so I decided to join the Navy so that I could use the MGIB to pay for college. After all, four years of service didn’t seem that long.

My love for adventure was born.

Later in my career in the Navy, I was promoted to the ship’s financial manager. I was in charge of managing multiple budgets. The experience taught me bookkeeping.

My love for numbers was born.

After the Navy, I ended up volunteering for a non-profit. They eventually recruited me to start a domestic violence crisis program from scratch. I had no social work experience, but I agreed anyway.

My love for saying “Why not?” was born.

Rock Bottom

After a few successful years, my boss retired and the new boss fired me. I was devastated. I fell into a deep state of clinical depression and I felt worthless.

I recall crying very loudly at the kitchen table. It had been more than a year since my non-profit job, and I was nowhere near having a prospect for the next one. I was in a very dark space.

Thankfully, the crying fit was a cathartic experience. It gave me a jolt to do some introspection, stop whining, and come up with a plan.

“Choose a Job You Love, and You Will Never Have To Work a Day in Your Life. “ — Anonymous

Falling in Love, All Over Again

To pay the bills, I had been working as a freelance web designer/developer, but I wasn’t happy. Frankly, the business of doing web design bored me. It was frustrating working with clients who think and act like they’re the experts on design.

So I started thinking, “what’s next?”.

Searching the web, I stumbled upon the latest news in artificial intelligence. It led me to machine learning, which in turn led me to the subject of data science.

I was infatuated.

I signed up for Andrew Ng’s machine learning course on Coursera. I listened to TwitML, Linear Digression, and a few other podcasts. I revisited Python and got reacquainted with git on Github.

I was in love.

It was at this time that I made the conscious decision to be a data scientist.

Leap of Faith

Learning something new was fun for me. But still, I had that voice in my head telling me that no matter how much I study and learn, I will never get a job because I don’t have a degree.

So, I took a hard look in the mirror and acknowledged that I needed help. The question then was where to start looking.

Then one day, out of the blue, my girlfriend asked me what data science is. I jumped to my feet and started explaining right away. Once I stopped explaining to catch a breath, I managed to ask her why she asked. That’s when she told me that she’d seen a sign on a billboard. We went for a drive so I could see the sign for myself. It was a curious billboard with two big words, “data science,” and a smaller one that said “Codeup.” I went to their website and researched their employment outcomes.

I was sold.

Preparation

Before the start of the class, we were given a list of materials to go over.

Given that I had only about two months to prepare, I was not expected to finish the courses. I was basically told to just skim over the content. Well, I did them anyway. I spent day and night going over the courses and materials. Did the tests, got the certificates!

Bootcamp

Boot camp was a blur. We had a saying in the Navy about the boot camp experience: “the days drag on but the weeks fly by.” This was definitely true for the Codeup boot camp as well.

Codeup is described as a “fully-immersive, project-based 18-week Data Science career accelerator that provides students with 600+hours of expert instruction in applied data science. Students develop expertise across the full data science pipeline (planning, acquisition, preparation, exploration, modeling, delivery), and become comfortable working with real, messy data to deliver actionable insights to diverse stakeholders.”¹

We were coding in Python, querying the SQL database, and making dashboards in Tableau. We did projects after projects. We learned about different methodologies like regression, classification, clustering, time-series, anomaly detection, natural language processing, and distributed machine learning.

More importantly, the experience taught us the following:

  1. Real data is messy; deal with it.
  2. If you can’t communicate with your stakeholders, you’re useless.
  3. Document your code.
  4. Read the documentation.
  5. Always be learning.

Job Hunting

Our job hunting process started from day one of boot camp. We updated our LinkedIn profiles and made sure that we were pushing to Github almost every day. I even spruced up my personal website to include the projects we had done during class. And of course, we made sure that our resumés were in good shape.

Codeup helped me with all of these.

In addition, Codeup also helped prepare us for both technical and behavioral interviews. We practiced answering questions following the S.T.A.R. format (Situation, Task, Action, Result). We optimized our answers to highlight our strengths as high-potential candidates.

Post-Graduation

My education continued even after graduation. In between filling out applications, I would code every day and try out different Python libraries. I regularly read the news for the latest developments in machine learning. While doing chores, I listened to a podcast, a TED Talk, or a LinkedIn Learning video. When bored, I listened to or read books.

There are a lot of good technical books out there to read. But for the non-technical ones, I recommend the following:

  • Thinking with Data by Max Shron
  • Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
  • Invisible Women: Data Bias in a World Designed for Men by Caroline Criado Perez
  • Rookie Smarts: Why Learning Beats Knowing in the New Game of Work by Liz Wiseman
  • Grit: The Power of Passion and Perseverance by Angela Duckworth
  • The First 90 Days: Proven Strategies for Getting Up to Speed Faster and Smarter by Michael Watkins

Dealing with Rejection

I’ve had a lot of rejections. The first one was the hardest but after that, it kept getting easier. I developed a thick skin and just moved on.

Rejection sucks. Try not to take it personally. Nobody likes to fail, but it will happen. When it does, fail up.

Conclusion

It took me 3 months after graduating from boot camp to get a job. It took a lot of sacrifices. When I finally got the job offer, I felt very grateful, relieved, and excited.

I could not have done it without Codeup and my family’s support.


Thanks for reading! I hope you got something out of this post.

To all aspiring data scientists out there, just don’t give up. Try not to listen to all the haters out there. If you must, hear what they have to say, take stock of your weaknesses, and aspire to learn better than yesterday. But never ever let them discourage you. Remember, data science skills lie on a spectrum. If you’ve got the passion and perseverance, I’m pretty sure that there’s a company or organization out there that’s just the right fit for you.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Codeup Alumni Portal. (May 31, 2020). Resumé — Ednalyn C. De Dios. https://alumni.codeup.com/uploads/699-1562875657.pdf

This article was first published in the Towards Data Science publication on Medium.